69488 – GF108 (NVC1) GPU lockup


Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-09-17 19:28 UTC by Kevin N.
Modified:	2014-01-18 03:19 UTC
CC List:	2 users


Attachments
journalctl log (128.75 KB, text/plain) 2013-09-17 19:28 UTC, Kevin N.
better patch for sysfb_simplefb that adjusts the BOOTFB resource (2.10 KB, patch) 2013-09-27 15:28 UTC, Pavel Roskin
dmesg from 3.12-rc2 (63.51 KB, text/plain) 2013-09-30 22:33 UTC, Kevin N.
dmesg from 3.11.2 (67.00 KB, text/plain) 2013-09-30 22:34 UTC, Kevin N.
dmesg 3.12-rc3 with nouveau.runpm=0 (67.64 KB, text/plain) 2013-10-03 18:10 UTC, Kevin N.
dmesg 3.12-rc3 with nouveau.config=NvMSI=0 (67.63 KB, text/plain) 2013-10-03 18:15 UTC, Kevin N.
3.12 log with stalls (66.53 KB, text/plain) 2013-11-04 16:50 UTC, Kevin N.
Dmesg 3.12 drm-next (33.96 KB, text/plain) 2013-11-09 23:30 UTC, Jean-Louis Dupond

Description Kevin N. 2013-09-17 19:28:04 UTC

Created attachment 86010 [details]
journalctl log

The GPU locks up during boot and falls back to the software frame buffer.

Booting with UEFI and the Linux stub loader on a T530.

Comment 1 Ilia Mirkin 2013-09-26 22:26:49 UTC

Pretty sure this is the same issue, but it would be nice if you could verify that the same commit introduces the problem for you.

*** This bug has been marked as a duplicate of bug 69203 ***

Comment 2 Kevin N. 2013-09-27 00:18:42 UTC

I can tell you that it doesn't occur on the following kernels.
vmlinuz-3.9.5-301.fc19.x86_64
vmlinuz-3.11.1-200.fc19.x86_64
vmlinuz-3.10.11-200.fc19.x86_64

Comment 3 Ilia Mirkin 2013-09-27 00:24:10 UTC

Hm, that's unfortunate. Must be a different issue then. Can you do a bisect between v3.11 and v3.12-rc1 to see what commit killed it? (I guess it could also be a fedora-specific change that made your 3.11.1 kernel work... would be good to test a fresh one.)

Comment 4 Emil Velikov 2013-09-27 00:29:08 UTC

Looks like Pavel may have already narrowed it down. Feel free to try his patch
http://lists.freedesktop.org/archives/nouveau/2013-September/014521.html

Comment 5 Kevin N. 2013-09-27 03:07:42 UTC

(In reply to comment #4)
> Looks like Pavel may have already narrowed it down. Feel free to try his
> patch
> http://lists.freedesktop.org/archives/nouveau/2013-September/014521.html

I used this patch on 3.12-rc2 and the issues are gone.

Comment 6 Pavel Roskin 2013-09-27 15:28:19 UTC

Created attachment 86729 [details] [review]
better patch for sysfb_simplefb that adjusts the BOOTFB resource

sysfb_simplefb.c checks all PCI BARs on all VGA devices and adjusts the resource to match the BAR in which the original area from screen_info is located.  Please test.  I'm not sure why nouveau hangs in your case.  It doesn't hang for me even when it shows the stack trace.

Comment 7 Kevin N. 2013-09-27 15:51:49 UTC

(In reply to comment #6)
> Created attachment 86729 [details] [review]
> better patch for sysfb_simplefb that adjusts the BOOTFB resource
> 
> sysfb_simplefb.c checks all PCI BARs on all VGA devices and adjusts the
> resource to match the BAR in which the original area from screen_info is
> located.  Please test.  I'm not sure why nouveau hangs in your case.  It
> doesn't hang for me even when it shows the stack trace.

I tried with your better patch and it also works fine.

Comment 8 Kevin N. 2013-09-27 16:12:05 UTC

Works in discrete mode.

Probably unrelated, but I can make it lock up in the Optimus configuration by plugging in DisplayPort while Xorg is running.
I see this in journalctl:
Sep 27 08:57:46 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
Sep 27 08:57:47 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
Sep 27 08:57:47 sawako kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3

If I start the laptop with DisplayPort already plugged in, it works and doesn't lock up.  But I noticed that if I turn off DP-1 and unplug it, the card never powers down:
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
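
For anyone comparing their own output with the two lines above, here is a minimal Python sketch that splits them into fields. It assumes the lines come from /sys/kernel/debug/vgaswitcheroo/switch (the comment does not name the file) and that the fields are index, GPU type, active marker, power state, and PCI address. The class and function names are made up for illustration.

```python
from typing import NamedTuple


class SwitcherooEntry(NamedTuple):
    index: int
    kind: str         # "IGD" (integrated) or "DIS" (discrete)
    active: bool      # "+" marks the GPU that is currently active
    power: str        # power-state field, e.g. "Pwr" or "DynPwr"
    pci_address: str


def parse_switcheroo_line(line: str) -> SwitcherooEntry:
    # The PCI address itself contains colons, so split off only the
    # first four fields and keep the remainder intact.
    index, kind, active, power, pci_address = line.rstrip("\n").split(":", 4)
    return SwitcherooEntry(int(index), kind, active == "+", power, pci_address)


# The two lines quoted in the comment above.
sample = ["0:IGD:+:Pwr:0000:00:02.0", "1:DIS: :DynPwr:0000:01:00.0"]
entries = [parse_switcheroo_line(line) for line in sample]
for entry in entries:
    print(entry)
```

Read this way, the discrete card (0000:01:00.0) is not the active GPU but its power field still reads "DynPwr", which matches the observation that the card never powers down.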

Comment 9 Kevin N. 2013-09-30 22:33:26 UTC

Created attachment 86881 [details]
dmesg from 3.12-rc2

Sorry for the confusion. I went back to a Gentoo install and am better able to test now. For some reason, on Fedora I was able to get into X on 3.12-rc2 with that patch.

dmesg of a boot with 3.12-rc2

Comment 10 Kevin N. 2013-09-30 22:34:11 UTC

Created attachment 86882 [details]
dmesg from 3.11.2

Comment 11 David Herrmann 2013-10-02 16:07:25 UTC

The IORESOURCE_BUSY patch is already on its way to Linus. However, all it does is suppress this warning.

Could you describe what exactly the bug is you were seeing? You said it falls back to "software frame buffer". The log doesn't say anything like this, what do you mean by that?
Also, how did you detect a GPU lock-up?

If the warning is all you see but everything was working just fine, this bug can be closed. As the warning says "Your kernel is fine." this isn't serious at all and shouldn't cause any issues. If you saw any other weird behavior, please describe what exactly happened (apart from the oops message in the log).

Comment 12 Kevin N. 2013-10-02 20:06:45 UTC

(In reply to comment #11)
> The IORESOURCE_BUSY patch is already on its way to Linus. However, all it
> does is suppress this warning.
> 
> Could you describe what exactly the bug is you were seeing? You said it
> falls back to "software frame buffer". The log doesn't say anything like
> this, what do you mean by that?
> Also, how did you detect a GPU lock-up?
> 
> If the warning is all you see but everything was working just fine, this bug
> can be closed. As the warning says "Your kernel is fine." this isn't serious
> at all and shouldn't cause any issues. If you saw any other weird behavior,
> please describe what exactly happened (apart from the oops message in the
> log).

My Nvidia card doesn't work properly with Optimus and nouveau on 3.12-rc2.  Starting X sometimes works, but the display stops responding soon afterwards or when I run xrandr.  Having DisplayPort plugged in seems to prevent that.

"GPU lockup" was probably a poor choice of words; "the display stops responding" is more accurate.  I end up having to power the machine off.

Comment 13 Kevin N. 2013-10-02 20:23:02 UTC

The first log I posted contains this line:
nouveau E[     DRM] GPU lockup - switching to software fbcon
That's where I got the software frame buffer wording from.

Comment 14 Kevin N. 2013-10-02 23:21:45 UTC

I tried to get more information on 3.12-rc3 with the "better patch for sysfb_simplefb that adjusts the BOOTFB resource" patch applied, but it locks up immediately when I run startx.

[   21.230774] nouveau  [     DRM] ACPI backlight interface available, not registering our own
[   21.230788] nouveau W[     DRM] voltage table 0x50 unknown
[   21.230946] nouveau  [     DRM] 2 available performance level(s)
[   21.230948] nouveau  [     DRM] 1: core 270MHz shader 540MHz memory 405MHz
[   21.230950] nouveau  [     DRM] 3: core 475MHz shader 950MHz memory 900MHz voltage 10mV
[   21.230952] nouveau  [     DRM] c: core 270MHz shader 540MHz memory 405MHz
[   21.236603] nouveau  [     DRM] MM: using COPY0 for buffer copies
[   21.502956] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000401 FAULT at 0x002010 [ IBUS TIMEOUT ]
[   21.527378] nouveau  [     DRM] allocated 1920x1080 fb: 0x60000, bo ffff88042ada2000
[   21.527442] nouveau 0000:01:00.0: fb1: nouveaufb frame buffer device
[   21.527449] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 1
[   22.891605] btrfs: disk space caching is enabled
[   26.533235] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[   26.534638] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[   71.584645] nouveau E[     DRM] failed to idle channel 0xcccc0001 [DRM]
[   71.708588] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0x90 [nouveau] returns -16

Comment 15 Dave Airlie 2013-10-03 07:33:14 UTC

Sounds like runtime PM is killing it.

Maybe boot with nouveau.runpm=0, though it does seem like the GPU is in trouble before that point.

Comment 16 Maarten Lankhorst 2013-10-03 07:39:04 UTC

Can you try with nouveau.config=NvMSI=0 ?

Comment 17 Kevin N. 2013-10-03 18:10:46 UTC

Created attachment 87081 [details]
dmesg 3.12-rc3 with nouveau.runpm=0

nouveau.runpm=0 gets me into Xorg.
I tried to use the nouveau card:
#xrandr --setprovideroffloadsink nouveau Intel
#time DRI_PRIME=1 glxinfo
It shows it's using nouveau, but it takes a while to exit:
DRI_PRIME=1 glxinfo  14.29s user 122.64s system 100% cpu 2:16.82 total

Comment 18 Kevin N. 2013-10-03 18:12:19 UTC

(In reply to comment #17)
> Created attachment 87081 [details]
> dmesg 3.12-rc3 with nouveau.runpm=0
> 
> nouveau.runpm=0 gets me into Xorg.
> I tried to use the nouveau card
> #xrandr --setprovideroffloadsink nouveau Intel
> #time DRI_PRIME=1 glxinfo
> It shows its using nouveau but it takes a while to exit.
> DRI_PRIME=1 glxinfo  14.29s user 122.64s system 100% cpu 2:16.82 total

[  316.460763] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  386.536115] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[  401.553323] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[  416.570533] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
[  418.572965] nouveau E[   PFIFO][0000:01:00.0] channel 3 [glxinfo[4853]] kick timeout
[  420.575362] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  436.801104] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  506.994155] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[  522.011364] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[  537.028572] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[  539.030910] nouveau E[   PFIFO][0000:01:00.0] channel 3 [glxinfo[5118]] kick timeout
[  554.048076] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[  569.065285] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]
[  571.067792] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
[  620.625781] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
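
When a log is this repetitive, it can help to count the distinct nouveau errors rather than read every line. Below is a minimal Python sketch of that idea. It is not part of the original report: the regular expression is an assumption based only on the line shapes pasted in this bug, and the names NOUVEAU_ERROR, PID_SUFFIX, and summarize are illustrative.

```python
import re
from collections import Counter

# Matches nouveau error lines of the two shapes seen in this bug:
#   nouveau E[   PFIFO][0000:01:00.0] playlist update failed
#   nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]
NOUVEAU_ERROR = re.compile(
    r"nouveau E\[\s*(?P<source>(?:[^\[\]]|\[[^\]]*\])+?)\s*\]"
    r"(?:\[(?P<device>[0-9a-f:.]+)\])?\s+(?P<message>.*)"
)
# Process IDs such as "[4853]" differ between runs, so drop them before counting.
PID_SUFFIX = re.compile(r"\[\d+\]")


def summarize(lines):
    """Count nouveau error lines by (source, message), ignoring PIDs."""
    counts = Counter()
    for line in lines:
        match = NOUVEAU_ERROR.search(line)
        if match is None:
            continue
        source = PID_SUFFIX.sub("", match.group("source"))
        message = PID_SUFFIX.sub("", match.group("message"))
        counts[(source, message)] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "[  316.460763] nouveau E[   PFIFO][0000:01:00.0] playlist update failed",
    "[  386.536115] nouveau E[glxinfo[4853]] failed to idle channel 0xcccc0000 [glxinfo[4853]]",
    "[  418.572965] nouveau E[   PFIFO][0000:01:00.0] channel 3 [glxinfo[4853]] kick timeout",
    "[  420.575362] nouveau E[   PFIFO][0000:01:00.0] playlist update failed",
    "[  506.994155] nouveau E[glxinfo[5118]] failed to idle channel 0xcccc0000 [glxinfo[5118]]",
]
counts = summarize(sample)
for (source, message), count in counts.most_common():
    print(f"{count:3d}  {source}: {message}")
```

Stripping the PIDs groups the two glxinfo runs together, so the repeating pattern (playlist update failure, failed idle, kick timeout) is easier to see.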

Comment 19 Kevin N. 2013-10-03 18:15:42 UTC

Created attachment 87082 [details]
dmesg 3.12-rc3 with nouveau.config=NvMSI=0

With nouveau.config=NvMSI=0, running startx causes the system to stop responding, and I have to power off.

Comment 20 Kevin N. 2013-10-25 03:03:51 UTC

I'm not experiencing the hangs with 3.12-rc6 any longer.  DisplayPort works when the Nvidia card is used as the dedicated GPU, but not in the Optimus configuration: with DP-1-1 the TV shows no input.

Comment 21 Kevin N. 2013-10-26 13:38:25 UTC

(In reply to comment #20)
> I'm not experiencing the hangs with 3.12-rc6 any longer.  Displayport is
> working with the Nvidia card as dedicated, but not with Optimus
> configuration DP-1-1 the TV shows no input.

Scratch that. It had been fine for a few days, but it hung again this morning when I ran xrandr. There was nothing relevant in dmesg, and I had to power off the machine.  After a few power cycles it seems to be working again.

Comment 22 Kevin N. 2013-11-04 16:50:35 UTC

Created attachment 88623 [details]
3.12 log with stalls

I ran xrandr in X on 3.12 final and was able to get a better log before the system became unresponsive.

Nov 04 08:29:34 [logger] ACPI event unhandled: ibm/hotkey LEN0068:00 00000080 00006030
Nov 04 08:29:34 [kernel] thinkpad_acpi: EC reports that Thermal Table has changed
Nov 04 08:29:35 [logger] ACPI event unhandled: video/switchmode VMOD 00000080 00000000
Nov 04 08:29:49 [kernel] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Nov 04 08:29:49 [logger] ACPI event unhandled: video/switchmode VMOD 00000080 00000000
Nov 04 08:29:49 [kernel] nouveau E[   PIBUS][0000:01:00.0] ROP0: 0x10fc7c 0x00030302 (0x38008208)
Nov 04 08:29:49 [kernel] nouveau E[   PFIFO][0000:01:00.0] write fault at 0x0000000000 [NO_CHANNEL] from BAR3/BAR_WRITE on channel 0x0000000000 [unknown]
Nov 04 08:29:49 [kernel] nouveau ![   PFIFO][0000:01:00.0] unhandled status 0x00800000
Nov 04 08:29:52 [kernel] nouveau E[   PFIFO][0000:01:00.0] playlist update failed
                - Last output repeated 2 times -
Nov 04 08:30:01 [cron] (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 04 08:30:01 [cron] (root) CMD (root^Itest -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 04 08:30:01 [cron] (wut) CMD (flexget --cron)
Nov 04 08:30:01 [kernel] nouveau E[      VM][0000:01:00.0] vm timeout 1: 0x00200000 1
Nov 04 08:30:04 [kernel] nouveau E[      VM][0000:01:00.0] vm timeout 1: 0x001f0000 1
Nov 04 08:30:07 [kernel] nouveau E[      VM][0000:01:00.0] vm timeout 1: 0x001e0000 1
Nov 04 08:30:10 [kernel] INFO: rcu_sched self-detected stall on CPU { 0}  (t=2101 jiffies g=1526 c=1525 q=1897)
Nov 04 08:30:10 [kernel] sending NMI to all CPUs:
Nov 04 08:30:10 [kernel] NMI backtrace for cpu 0
Nov 04 08:30:10 [kernel] CPU: 0 PID: 5929 Comm: X Not tainted 3.12.0-gentoo #2
Nov 04 08:30:10 [kernel] Hardware name: LENOVO 2359CTO/2359CTO, BIOS G4ET62WW (2.04 ) 09/13/2012
Nov 04 08:30:10 [kernel] task: ffff88043c2f6000 ti: ffff8804299c6000 task.ti: ffff8804299c6000
Nov 04 08:30:10 [kernel] RIP: 0010:[<ffffffff813a9d78>]  [<ffffffff813a9d78>] __const_udelay+0x12/0x26
Nov 04 08:30:10 [kernel] RSP: 0018:ffff88043e203e10  EFLAGS: 00000006
Nov 04 08:30:10 [kernel] RAX: 0000000001062560 RBX: 0000000000002710 RCX: 0000000000000007

Comment 23 Emil Velikov 2013-11-06 21:42:23 UTC

This issue looks suspiciously like bug 69203. Same chipset, same PBUS MMIO write error(s).

Give this commit a try
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a

Comment 24 Kevin N. 2013-11-07 07:39:25 UTC

I won't have the machine for at least a week or a bit more, I will test that commit as soon as I can.

Comment 25 Jean-Louis Dupond 2013-11-08 21:21:09 UTC

Got the exact same issue I guess:
[   71.384811] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0xa0 [nouveau] returns -16

I tried the proposed patch:
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a

And this seems to have fixed the issue for me. Xorg isn't crashing anymore when booting my system now.

Comment 26 Martin Peres 2013-11-09 02:49:08 UTC

Thank you for testing, Jean-Louis. I'm marking this bug as fixed, since the patch will land in Linux 3.13.

Comment 27 Jean-Louis Dupond 2013-11-09 23:30:34 UTC

Created attachment 88952 [details]
Dmesg 3.12 drm-next

I no longer had any problems booting after switching to a new 3.12 kernel with drm-next patches (up to commit 91915260ea5ed9d9b19bfb75d53c989c8ada2ab0).

I still had an issue where my card didn't shut down automatically while lightdm/gnome-shell was running.

Anyway, I decided to upgrade to Ubuntu 14.04 (Trusty).
It didn't work as expected, and it seems the bug came back :(

Attached is a dmesg taken with netconsole. As you can see, it prints
[   49.498006] nouveau 0000:01:00.0: Refused to change power state, currently in D3
when lightdm is starting up.

Comment 28 Kevin N. 2013-11-12 18:44:47 UTC

http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=968a8d1b6c32c9f466f236032770b9165ece045a

With this patch the system no longer hangs.

However, runtime power management seems broken: the card never goes to D3, so it's effectively the same as using nouveau.runpm=0.

Comment 29 Ned F. 2014-01-15 18:15:40 UTC

I understand that this bug may be fixed in 3.13.x, so I will wait for that to appear in FC19 or FC20.

For information:

I have this bug on a Lenovo T530, running Fedora 19 fully up-to-date as of today:
kernel:   kernel-3.12.7-200.fc19.x86_64
nouveau:  xorg-x11-drv-nouveau-1.0.9-1.fc19.x86_64

Comment 30 Ned F. 2014-01-15 18:32:16 UTC

(In reply to comment #29)
> I have this bug on a Lenovo T530, running Fedora 19 fully up-to-date as of
> today:
> kernel:   kernel-3.12.7-200.fc19.x86_64
> nouveau:xorg-x11-drv-nouveau-1.0.9-1.fc19.x86_64

Everything works up through kernel 3.11.10.  After F19 updated to 3.12.6, when I log in, the window system starts and my desktop (Cinnamon or Gnome) starts; then, 23 seconds after I start the session by entering my user password, the screen freezes and I cannot switch to another console.  I must turn off the power to restart.

If, while the session is still alive, I open a terminal and run

tail -f /var/log/messages

the last message before the freeze is (likely unrelated and coincidental):

Jan 15 11:53:09 systemname systemd-logind[629]: Removed session c6.

and three seconds later, the screen freezes.

After reboot with 3.11.10, examining the /var/log/messages file shows that
the following messages surround the one above:

Jan 15 11:52:57 microwatt kernel: [  137.601242] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
Jan 15 11:52:57 microwatt kernel: [  137.602299] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
Jan 15 11:52:58 microwatt kernel: [  138.395359] thinkpad_acpi: EC reports that Thermal Table has changed
Jan 15 11:53:09 microwatt systemd-logind[629]: Removed session c6.
Jan 15 11:53:11 microwatt kernel: [  151.742703] thinkpad_acpi: EC reports that Thermal Table has changed
Jan 15 11:53:11 microwatt kernel: [  151.742708] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:11 microwatt kernel: [  151.813438] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:11 microwatt kernel: [  151.824445] nouveau 0000:01:00.0: Refused to change power state, currently in D3
Jan 15 11:53:21 microwatt systemd-logind[629]: Power key pressed.
Jan 15 11:53:21 microwatt systemd-logind[629]: Powering Off...
Jan 15 11:53:21 microwatt systemd-logind[629]: System is powering down.
Jan 15 11:53:21 microwatt systemd[1]: Starting Show Plymouth Power Off Screen...

At 11:53:21 I turned off the power.

Comment 31 Ned F. 2014-01-15 20:43:13 UTC

I can also report that booting with kernel parameter:

nouveau.runpm=0

does allow my system to run with kernel 3.12.7
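
To confirm that a workaround parameter actually reached the kernel, the boot command line can be checked for nouveau options. The following is a minimal Python sketch, not something from this thread: the function name nouveau_options and the sample command line (including its BOOT_IMAGE and root values) are hypothetical, and only the two nouveau parameters are taken from the comments above.

```python
from typing import Dict


def nouveau_options(cmdline: str) -> Dict[str, str]:
    """Return the nouveau.* module parameters found on a kernel command line."""
    options = {}
    for token in cmdline.split():
        if token.startswith("nouveau."):
            # Split on the first "=" only, so "config=NvMSI=0" keeps its value whole.
            name, _, value = token[len("nouveau."):].partition("=")
            options[name] = value
    return options


# Hypothetical command line; only the nouveau.* parameters come from this bug.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro nouveau.runpm=0 nouveau.config=NvMSI=0"
print(nouveau_options(sample))
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the sample string.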

Comment 32 Ilia Mirkin 2014-01-15 20:49:25 UTC

The runtime pm thing is a separate issue. Talking about multiple unrelated issues in the same bug is very confusing. Integers are cheap -- just open a new bug if you have a new issue, no need to save up the bug id's.

Comment 33 Ned F. 2014-01-18 03:19:36 UTC

(In reply to comment #32)
> The runtime pm thing is a separate issue. Talking about multiple unrelated
> issues in the same bug is very confusing. Integers are cheap -- just open a
> new bug if you have a new issue, no need to save up the bug id's.

I don't mean to offend by breaking any rules, but Comment 15 of this bug report suggested trying nouveau.runpm=0 as a test.  I am merely reporting that I got the same result as others.  

Perhaps you are concerned that I am reporting on a bug already marked as fixed. It is clear that I am experiencing the same bug reported above, and I thought it might be helpful to others looking for a solution to see additional systems and software versions that experience the bug, and additional confirmation of one workaround.
