| Field | Value |
|---|---|
| Summary | GF108 (NVC1) GPU lockup |
| Product | xorg |
| Component | Driver/nouveau |
| Status | RESOLVED FIXED |
| Severity | normal |
| Priority | medium |
| Version | unspecified |
| Hardware | Other |
| OS | All |
| Reporter | Kevin N. <vekinn> |
| Assignee | Nouveau Project <nouveau> |
| QA Contact | Xorg Project Team <xorg-team> |
| CC | jean-louis, pasik |
Pretty sure this is the same issue, but it would be nice if you could verify that the same commit causes the problem for you.
*** This bug has been marked as a duplicate of bug 69203 ***
I can tell you that it doesn't occur on the following kernels:
vmlinuz-3.9.5-301.fc19.x86_64
vmlinuz-3.11.1-200.fc19.x86_64
vmlinuz-3.10.11-200.fc19.x86_64
Hm, that's unfortunate. Must be a different issue then. Can you do a bisect between v3.11 and v3.12-rc1 to see which commit broke it? (I guess it could also be a Fedora-specific change that made your 3.11.1 kernel work... it would be good to test a fresh one.)
Looks like Pavel may have already narrowed it down. Feel free to try his patch: http://lists.freedesktop.org/archives/nouveau/2013-September/014521.html
(In reply to comment #4)
> Looks like Pavel may have already narrowed it down. Feel free to try his
> patch
> http://lists.freedesktop.org/archives/nouveau/2013-September/014521.html
I used this patch on 3.12-rc2 and the issues are gone.
Created attachment 86729 [details] [review]
better patch for sysfb_simplefb that adjusts the BOOTFB resource
sysfb_simplefb.c checks all PCI BARs on all VGA devices and adjusts the resource to match the BAR in which the original area from screen_info is located. Please test. I'm not sure why nouveau hangs in your case. It doesn't hang for me even if it shows the stack trace.
(In reply to comment #6)
> Created attachment 86729 [details] [review] [review]
> better patch for sysfb_simplefb that adjusts the BOOTFB resource
>
> sysfb_simplefb.c checks all PCI BARs on all VGA devices and adjust the
> resource to match the BAR in which the original area from screen_info is
> located. Please test. I'm not sure why nouveau hangs in your case. It
> doesn't hang for me even if it shows the stack trace.
I tried with your better patch and it also works fine. Works in discrete mode.
Probably unrelated, but I can make it lock up in the Optimus configuration if I plug in DisplayPort while Xorg is running. I see this in journalctl:
Sep 27 08:57:46 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
Sep 27 08:57:47 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
Sep 27 08:57:47 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
If I start the laptop with DisplayPort already plugged in, it works and doesn't lock up. But I noticed that if I turn off DP-1 and unplug it, the card never powers down:
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
Created attachment 86881 [details]
dmesg from 3.12-rc2
Sorry for the confusion. I went back to a Gentoo install and am better able to test now. For some reason, on Fedora I was able to get into X with 3.12-rc2 with that patch.
dmesg of a boot with 3.12-rc2
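The two-entry status line quoted in the DisplayPort comment above (`0:IGD:+:Pwr:...` and `1:DIS: :DynPwr:...`) has the shape of the vga_switcheroo status file. The sketch below parses that format so the "card never powers down" observation can be checked by a script. It assumes the fields are index, role, active marker, power state and PCI address, and that the text comes from `/sys/kernel/debug/vgaswitcheroo/switch`; neither the path nor the helper names (`parse_switcheroo`, `discrete_is_powered`) appear in this report.

```python
from typing import List, NamedTuple


class SwitcherooEntry(NamedTuple):
    index: int
    role: str         # "IGD" (integrated) or "DIS" (discrete)
    active: bool      # "+" marks the GPU currently selected
    power: str        # e.g. "Pwr", "Off", "DynPwr", "DynOff" (assumed value set)
    pci_address: str


def parse_switcheroo(text: str) -> List[SwitcherooEntry]:
    """Parse one entry per line; blank lines are ignored."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first four colons only: the PCI address contains colons.
        index, role, active, power, pci_address = line.split(":", 4)
        entries.append(
            SwitcherooEntry(int(index), role, active.strip() == "+",
                            power, pci_address.strip())
        )
    return entries


def discrete_is_powered(entries: List[SwitcherooEntry]) -> bool:
    """True if any discrete GPU reports a powered state ("Pwr" or "DynPwr")."""
    return any(e.role == "DIS" and e.power.endswith("Pwr") for e in entries)


# The two entries from the comment above, one per line.
status = "0:IGD:+:Pwr:0000:00:02.0\n1:DIS: :DynPwr:0000:01:00.0\n"
entries = parse_switcheroo(status)
print(entries[1].pci_address, "powered:", discrete_is_powered(entries))
```

With the status from the comment, the discrete card at 0000:01:00.0 is reported as powered even though it is not the active GPU, which matches the reporter's description.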
Created attachment 86882 [details]
dmesg from 3.11.2
The IORESOURCE_BUSY patch is already on its way to Linus. However, all it does is suppress this warning.
Could you describe exactly what bug you were seeing? You said it falls back to "software frame buffer". The log doesn't say anything like this; what do you mean by that? Also, how did you detect a GPU lock-up?
If the warning is all you see and everything else was working fine, this bug can be closed. The warning itself says "Your kernel is fine.", so this isn't serious at all and shouldn't cause any issues. If you saw any other weird behavior, please describe exactly what happened (apart from the oops message in the log).
(In reply to comment #11)
> The IORESOURCE_BUSY patch is already on its way to Linus. However, all it
> does is suppressing this warning.
>
> Could you describe what exactly the bug is you were seeing? You said it
> falls back to "software frame buffer". The log doesn't say anything like
> this, what do you mean by that?
> Also, how did you detect a GPU lock-up?
>
> If the warning is all you see but everything was working just fine, this bug
> can be closed. As the warning says "Your kernel is fine." this isn't serious
> at all and shouldn't cause any issues. If you saw any other weird behavior,
> please describe what exactly happened (apart from the oops message in the
> log).
My NVIDIA card doesn't work right with Optimus and nouveau on 3.12-rc2. Starting X sometimes works, but the display stops responding soon after, or when I run xrandr. Having my DisplayPort plugged in seems to prevent that. "GPU lockup" was probably a poor choice of words; "display stops responding" is more accurate. I end up having to power the machine off. In the first log I posted there is:
nouveau E[ DRM] GPU lockup - switching to software fbcon
That's where I got the software frame buffer thing from.
I tried to get more information on 3.12-rc3 with the "better patch for sysfb_simplefb that adjusts the BOOTFB resource" patch applied, but when I run startx it immediately locks up.
[ 21.230774] nouveau [ DRM] ACPI backlight interface available, not registering our own
[ 21.230788] nouveau W[ DRM] voltage table 0x50 unknown
[ 21.230946] nouveau [ DRM] 2 available performance level(s)
[ 21.230948] nouveau [ DRM] 1: core 270MHz shader 540MHz memory 405MHz
[ 21.230950] nouveau [ DRM] 3: core 475MHz shader 950MHz memory 900MHz voltage 10mV
[ 21.230952] nouveau [ DRM] c: core 270MHz shader 540MHz memory 405MHz
[ 21.236603] nouveau [ DRM] MM: using COPY0 for buffer copies
[ 21.502956] nouveau E[ PBUS][0000:01:00.0] MMIO write of 0x00000401 FAULT at 0x002010 [ IBUS TIMEOUT ]
[ 21.527378] nouveau [ DRM] allocated 1920x1080 fb: 0x60000, bo ffff88042ada2000
[ 21.527442] nouveau 0000:01:00.0: fb1: nouveaufb frame buffer device
[ 21.527449] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 1
[ 22.891605] btrfs: disk space caching is enabled
[ 26.533235] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 26.534638] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 71.584645] nouveau E[ DRM] failed to idle channel 0xcccc0001 [DRM]
[ 71.708588] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0x90 [nouveau] returns -16
Sounds like runtime PM is killing it; maybe boot with nouveau.runpm=0. Though it does seem like the GPU is in trouble before that point.
Can you try with nouveau.config=NvMSI=0?
Created attachment 87081 [details]
dmesg 3.12-rc3 with nouveau.runpm=0
nouveau.runpm=0 gets me into Xorg.
I tried to use the nouveau card
#xrandr --setprovideroffloadsink nouveau Intel
#time DRI_PRIME=1 glxinfo
It shows it's using nouveau, but it takes a long time to exit.
DRI_PRIME=1 glxinfo 14.29s user 122.64s system 100% cpu 2:16.82 total
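The timing line above is worth a quick calculation, since it shows where the two minutes went. The sketch below assumes the line is in the zsh `time` format (user seconds, system seconds, CPU percentage, elapsed wall-clock time); the helper name `parse_wall_clock` is made up for illustration.

```python
def parse_wall_clock(text: str) -> float:
    """Convert an elapsed time such as "2:16.82" or "1:02:03.5" to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the timing line above.
user_s = 14.29
system_s = 122.64
wall_s = parse_wall_clock("2:16.82")

cpu_s = user_s + system_s          # total CPU time consumed
kernel_share = system_s / cpu_s    # fraction of CPU time spent in the kernel
cpu_utilisation = cpu_s / wall_s   # about 1.0 means one core busy the whole time

print(f"wall {wall_s:.2f}s, cpu {cpu_s:.2f}s, "
      f"kernel share {kernel_share:.1%}, utilisation {cpu_utilisation:.1%}")
```

Roughly nine tenths of the CPU time is system time, and the process kept one core busy for the whole run. That is consistent with the kernel driver spinning while it waits on the GPU, and not with glxinfo itself doing the work.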
(In reply to comment #17)
> Created attachment 87081 [details]
> dmesg 3.12-rc3 with nouveau.runpm=0
>
> nouveau.runpm=0 gets me into Xorg.
> I tried to use the nouveau card
> #xrandr --setprovideroffloadsink nouveau Intel
> #time DRI_PRIME=1 glxinfo
> It shows its using nouveau but it takes a while to exit.
> DRI_PRIME=1 glxinfo 14.29s user 122.64s system 100% cpu 2:16.82 total
[ 316.460763] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 386.536115] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[ 401.553323] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[ 416.570533] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[ 418.572965] nouveau E[ PFIFO][0000:01:00.0] channel 3 [glxinfo[4853]] kick timeout
[ 420.575362] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 436.801104] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 506.994155] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[ 522.011364] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[ 537.028572] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[ 539.030910] nouveau E[ PFIFO][0000:01:00.0] channel 3 [glxinfo[5118]] kick timeout
[ 554.048076] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[ 569.065285] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[ 571.067792] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
[ 620.625781] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
Created attachment 87082 [details]
dmesg 3.12-rc3 with nouveau.config=NvMSI=0
nouveau.config=NvMSI=0
Running startx causes it to stop responding, have to power off.
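When comparing boots with nouveau.runpm=0 and nouveau.config=NvMSI=0, it helps to confirm which options the running kernel actually received. A minimal sketch, assuming the options were passed on the kernel command line and are therefore visible in `/proc/cmdline`; the function names and the sample command line are illustrative only, and quoted values containing spaces are not handled.

```python
def module_params(cmdline: str, module: str) -> dict:
    """Collect "module.key=value" tokens from a kernel command line."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        # Split at the first "=" only, so "config=NvMSI=0" keeps "NvMSI=0" as its value.
        key, _, value = token[len(prefix):].partition("=")
        params[key] = value
    return params


def running_kernel_params(module: str) -> dict:
    """Same as module_params, but reads the command line of the booted kernel."""
    with open("/proc/cmdline") as handle:
        return module_params(handle.read(), module)


# Hypothetical command line combining both options discussed in this thread.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro nouveau.runpm=0 nouveau.config=NvMSI=0"
print(module_params(sample, "nouveau"))
```

On the affected machine, `running_kernel_params("nouveau")` would show whether a given dmesg was really captured with the option under test.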
I'm not experiencing the hangs with 3.12-rc6 any longer. DisplayPort is working with the NVIDIA card as dedicated, but not in the Optimus configuration: with DP-1-1 the TV shows no input.
(In reply to comment #20)
> I'm not experiencing the hangs with 3.12-rc6 any longer. Displayport is
> working with the Nvidia card as dedicated, but not with Optimus
> configuration DP-1-1 the TV shows no input.
Scratch that. It had been fine for a few days, but it hung again just this morning when I ran xrandr. There was nothing relevant in dmesg, and I had to power off the machine. After a few power cycles it seems to be working again.
Created attachment 88623 [details]
3.12 log with stalls
I ran xrandr in X with the 3.12 final and was able to get a better log before the system became unresponsive.
Nov 04 08:29:34 [logger] ACPI event unhandled: ibm/hotkey LEN0068:00 00000080 00006030
Nov 04 08:29:34 [kernel] thinkpad_acpi: EC reports that Thermal Table has changed
Nov 04 08:29:35 [logger] ACPI event unhandled: video/switchmode VMOD 00000080 00000000
Nov 04 08:29:49 [kernel] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Nov 04 08:29:49 [logger] ACPI event unhandled: video/switchmode VMOD 00000080 00000000
Nov 04 08:29:49 [kernel] nouveau E[ PIBUS][0000:01:00.0] ROP0: 0x10fc7c 0x00030302 (0x38008208)
Nov 04 08:29:49 [kernel] nouveau E[ PFIFO][0000:01:00.0] write fault at 0x0000000000 [NO_CHANNEL] from BAR3/BAR_WRITE on channel 0x0000000000 [unknown]
Nov 04 08:29:49 [kernel] nouveau ![ PFIFO][0000:01:00.0] unhandled status 0x00800000
Nov 04 08:29:52 [kernel] nouveau E[ PFIFO][0000:01:00.0] playlist update failed
- Last output repeated 2 times -
Nov 04 08:30:01 [cron] (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 04 08:30:01 [cron] (root) CMD (root^Itest -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 04 08:30:01 [cron] (wut) CMD (flexget --cron)
Nov 04 08:30:01 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x00200000 1
Nov 04 08:30:04 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x001f0000 1
Nov 04 08:30:07 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x001e0000 1
Nov 04 08:30:10 [kernel] INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies g=1526 c=1525 q=1897)
Nov 04 08:30:10 [kernel] sending NMI to all CPUs:
Nov 04 08:30:10 [kernel] NMI backtrace for cpu 0
Nov 04 08:30:10 [kernel] CPU: 0 PID: 5929 Comm: X Not tainted 3.12.0-gentoo #2
Nov 04 08:30:10 [kernel] Hardware name: LENOVO 2359CTO/2359CTO, BIOS G4ET62WW (2.04 ) 09/13/2012
Nov 04 08:30:10 [kernel] task: ffff88043c2f6000 ti: ffff8804299c6000 task.ti: ffff8804299c6000
Nov 04 08:30:10 [kernel] RIP: 0010:[<ffffffff813a9d78>] [<ffffffff813a9d78>] __const_udelay+0x12/0x26
Nov 04 08:30:10 [kernel] RSP: 0018:ffff88043e203e10 EFLAGS: 00000006
Nov 04 08:30:10 [kernel] RAX: 0000000001062560 RBX: 0000000000002710 RCX: 0000000000000007
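To see at a glance which GPU units complain in a log like the one above, the tagged nouveau messages can be counted per unit. This is only a sketch: it assumes the tag has the form `nouveau X[ UNIT]` with an upper-case unit name and a single severity character (`E`, `W` or `!`), so it deliberately skips untagged lines such as "Refused to change power state" and per-process tags such as `E[glxinfo[4853]]`.

```python
import re
from collections import Counter

# Severity character, then the bracketed unit name (padding spaces allowed).
TAG = re.compile(r"nouveau (?P<level>[EW!])\[\s*(?P<unit>[A-Z0-9]+)\]")


def count_by_unit(lines):
    """Count tagged nouveau messages per unit (PFIFO, VM, PIBUS, ...)."""
    counts = Counter()
    for line in lines:
        match = TAG.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts


# Lines copied from the log above.
SAMPLE = [
    "Nov 04 08:29:49 [kernel] nouveau 0000:01:00.0: Refused to change power state, currently in D3",
    "Nov 04 08:29:49 [kernel] nouveau E[ PIBUS][0000:01:00.0] ROP0: 0x10fc7c 0x00030302 (0x38008208)",
    "Nov 04 08:29:49 [kernel] nouveau E[ PFIFO][0000:01:00.0] write fault at 0x0000000000 [NO_CHANNEL] from BAR3/BAR_WRITE on channel 0x0000000000 [unknown]",
    "Nov 04 08:29:49 [kernel] nouveau ![ PFIFO][0000:01:00.0] unhandled status 0x00800000",
    "Nov 04 08:29:52 [kernel] nouveau E[ PFIFO][0000:01:00.0] playlist update failed",
    "Nov 04 08:30:01 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x00200000 1",
    "Nov 04 08:30:04 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x001f0000 1",
    "Nov 04 08:30:07 [kernel] nouveau E[ VM][0000:01:00.0] vm timeout 1: 0x001e0000 1",
]

counts = count_by_unit(SAMPLE)
for unit, number in counts.most_common():
    print(unit, number)
```

In this excerpt every tagged message appears after the "Refused to change power state" line, which fits the suggestion elsewhere in this report that runtime power management is involved.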
This issue looks suspiciously like bug 69203: same chipset, same PBUS MMIO write error(s). Give this commit a try: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a
I won't have the machine for at least a week or a bit more; I will test that commit as soon as I can.
I think I have the exact same issue:
[ 71.384811] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0xa0 [nouveau] returns -16
I tried the proposed patch: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a
This seems to have fixed the issue for me. Xorg no longer crashes when booting my system.
Thank you for testing, Jean-Louis. I am marking this bug as fixed, as the patch will land in Linux 3.13.
Created attachment 88952 [details]
Dmesg 3.12 drm-next
I had no problems with booting anymore after using new kernel 3.12 with drm-next patches (until commit 91915260ea5ed9d9b19bfb75d53c989c8ada2ab0).
I still had one issue: my card didn't power down automatically while lightdm/gnome-shell was running.
Anyway I decided to upgrade to Ubuntu 14.04 (Trusty).
It didn't work as expected, and the bug seems to have come back :(
Attached is a dmesg taken with netconsole. As you can see it prints
[ 49.498006] nouveau 0000:01:00.0: Refused to change power state, currently in D3
When lightdm is starting up.
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a
With this patch the system no longer hangs. However, runtime power management seems broken, as the card never goes to D3, so it's effectively the same as using nouveau.runpm=0.
I understand that this bug may be fixed in 3.13.x, so I will wait for that to appear in FC19 or FC20. For information: I have this bug on a Lenovo T530 running Fedora 19, fully up to date as of today:
kernel: kernel-3.12.7-200.fc19.x86_64
nouveau: xorg-x11-drv-nouveau-1.0.9-1.fc19.x86_64
(In reply to comment #29)
> I have this bug on a Lenovo T530, running Fedora 19 fully up-to-date as of
> today:
> kernel: kernel-3.12.7-200.fc19.x86_64
> nouveau:xorg-x11-drv-nouveau-1.0.9-1.fc19.x86_64
Everything works up through kernel 3.11.10. Since F19 updated to 3.12.6, the following happens: when I log in, the window system starts, my desktop (Cinnamon or GNOME) starts, and 23 seconds after I started the session by entering my password, the screen freezes and I cannot switch to another console; I must turn off the power to restart. If, while the session is still alive, I open a terminal and run tail -f /var/log/messages, the last message before the freeze is (likely unrelated and coincidental):
Jan 15 11:53:09 systemname systemd-logind[629]: Removed session c6.
and three seconds later, the screen freezes. After rebooting with 3.11.10, /var/log/messages shows that the following messages surround the one above:
Jan 15 11:52:57 microwatt kernel: [ 137.601242] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
Jan 15 11:52:57 microwatt kernel: [ 137.602299] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
Jan 15 11:52:58 microwatt kernel: [ 138.395359] thinkpad_acpi: EC reports that Thermal Table has changed
Jan 15 11:53:09 microwatt systemd-logind[629]: Removed session c6.
Jan 15 11:53:11 microwatt kernel: [ 151.742703] thinkpad_acpi: EC reports that Thermal Table has changed
Jan 15 11:53:11 microwatt kernel: [ 151.742708] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:11 microwatt kernel: [ 151.813438] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:11 microwatt kernel: [ 151.824445] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:21 microwatt systemd-logind[629]: Power key pressed.
Jan 15 11:53:21 microwatt systemd-logind[629]: Powering Off...
Jan 15 11:53:21 microwatt systemd-logind[629]: System is powering down.
Jan 15 11:53:21 microwatt systemd[1]: Starting Show Plymouth Power Off Screen...
At 11:53:21 I turned off the power. I can also report that booting with the kernel parameter nouveau.runpm=0 does allow my system to run with kernel 3.12.7.
The runtime PM thing is a separate issue. Talking about multiple unrelated issues in the same bug is very confusing. Integers are cheap -- just open a new bug if you have a new issue; no need to save up the bug IDs.
(In reply to comment #32)
> The runtime pm thing is a separate issue. Talking about multiple unrelated
> issues in the same bug is very confusing. Integers are cheap -- just open a
> new bug if you have a new issue, no need to save up the bug id's.
I don't mean to offend by breaking any rules, but comment 15 of this bug report suggested trying nouveau.runpm=0 as a test. I am merely reporting that I got the same result as others. Perhaps you are concerned that I am reporting on a bug already marked as fixed. It is clear that I am experiencing the same bug reported above, and I thought it might help others looking for a solution to see additional systems and software versions that experience the bug, and additional confirmation of one workaround.
Created attachment 86010 [details]
journalctl log
GPU locks up during boot and falls back to software frame buffer. Booting with UEFI and the Linux stub loader on a T530.