Bug 81136 - [NV92] Regression in Linux 3.15: GPU lockup after suspend
Summary: [NV92] Regression in Linux 3.15: GPU lockup after suspend
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: git
Hardware: All Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
 
Reported: 2014-07-10 00:35 UTC by Agustín Dall'Alba
Modified: 2015-04-30 04:37 UTC

Attachments
Logs for git kernel (73.23 KB, text/plain)
2014-07-10 00:35 UTC, Agustín Dall'Alba
Logs for git kernel with noaccel=1 nofbaccel=1 (67.15 KB, text/plain)
2014-07-10 00:35 UTC, Agustín Dall'Alba
Logs for Linux 3.15.4 with commit ecf24de reverted (56.85 KB, text/plain)
2014-07-10 00:36 UTC, Agustín Dall'Alba
kernel messages during wake-up from resume (62.22 KB, text/plain)
2014-08-10 07:27 UTC, Vasilis Lourdas

Description Agustín Dall'Alba 2014-07-10 00:35:07 UTC
Created attachment 102508
Logs for git kernel

After a suspend I get messages such as these: 

<3>[   39.550435] nouveau E[  PGRAPH][0000:01:00.0] PGRAPH TLB flush idle timeout fail
<3>[   39.550435] nouveau E[  PGRAPH][0000:01:00.0] PGRAPH_STATUS  : 0x01000001 BUSY ROP
<3>[   39.550435] nouveau E[  PGRAPH][0000:01:00.0] PGRAPH_VSTATUS0: 0x00000000
<3>[   39.550435] nouveau E[  PGRAPH][0000:01:00.0] PGRAPH_VSTATUS1: 0x00000000
<3>[   39.550435] nouveau E[  PGRAPH][0000:01:00.0] PGRAPH_VSTATUS2: 0x00200000 ROP
<3>[   41.685486] nouveau E[     DRM] GPU lockup - switching to software fbcon

If X is running, it crashes and the screen ends up looking like this: http://imgur.com/a/D3VKw

This is always reproducible, but only since Linux 3.15, so I ran a git bisect. The first bad commit is [ecf24de071f4f6cea79ecef5d990794df5875ee1] drm/nouveau: fix fbcon not being accelerated after suspend. After reverting the commit, the machine resumes properly.

The issue persists in drm-nouveau-next (last commit 0b4e8e7... from Jul 8), even if I boot with noaccel=1 nofbaccel=1.

Relevant IRC logs: http://people.freedesktop.org/~cbrill/dri-log/index.php?channel=nouveau&highlight_names=Nitsuga&date=2014-07-09
Comment 1 Agustín Dall'Alba 2014-07-10 00:35:42 UTC
Created attachment 102509
Logs for git kernel with noaccel=1 nofbaccel=1
Comment 2 Agustín Dall'Alba 2014-07-10 00:36:16 UTC
Created attachment 102510
Logs for Linux 3.15.4 with commit ecf24de reverted
Comment 3 Vasilis Lourdas 2014-08-10 07:24:58 UTC
I've experienced a similar issue when resuming from suspend-to-RAM. The screen was blank, and dmesg shows several kernel messages from the nouveau module. I'm running Linux 3.16.0 (gentoo-sources package from Gentoo) with xorg-server 1.16.0, x11-drivers/xf86-video-nouveau-1.10.0-r1 and libdrm-2.4.54.

I will attach part of /var/log/messages with the nouveau errors.
Comment 4 Vasilis Lourdas 2014-08-10 07:27:58 UTC
Created attachment 104373
kernel messages during wake-up from resume
Comment 5 Vasilis Lourdas 2014-08-10 09:19:26 UTC
Forgot to mention my card model:

# lspci -v|fgrep -i vga
01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1) (prog-if 00 [VGA controller])
Comment 6 Sven Joachim 2014-08-11 04:15:14 UTC
Same problem here on NV86 [GeForce 8500 GT], reverting commit ecf24de071f4f6cea79ecef5d990794df5875ee1 in 3.16.0 helps.
Comment 7 Agustín Dall'Alba 2014-08-16 05:02:00 UTC
Update: I got tired of reverting the ecf24de commit on every Linux update, so I tried booting with nouveau.nofbaccel=1 (instead of nofbaccel=1). It works fine. The system still does not resume properly without it on Linux v3.16.1, but that boot option is a better workaround than reverting.
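
The two spellings behave differently because, as far as I understand, an option on the kernel command line only reaches a module when it carries the module name as a prefix (nouveau.nofbaccel=1); a bare nofbaccel=1 is not picked up by nouveau. Below is a minimal Python sketch that splits a command line into the nouveau options that are passed through and the bare ones that are probably ignored. The helper names nouveau_options and unprefixed_options, and the KNOWN_PARAMS list, are made up for this illustration and only cover the two parameters mentioned in this report.

```python
# Parameter names mentioned in this bug report; extend the tuple as needed.
KNOWN_PARAMS = ("noaccel", "nofbaccel")


def nouveau_options(cmdline: str) -> dict:
    """Return {param: value} for tokens of the form nouveau.<param>=<value>."""
    opts = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.startswith("nouveau."):
            opts[key[len("nouveau."):]] = value
    return opts


def unprefixed_options(cmdline: str, names=KNOWN_PARAMS) -> list:
    """Return tokens that name a known nouveau parameter without the prefix."""
    return [t for t in cmdline.split() if t.partition("=")[0] in names]


# Hypothetical command lines, modelled on the two attempts in this report.
print(nouveau_options("ro quiet nouveau.nofbaccel=1"))
print(unprefixed_options("ro quiet noaccel=1 nofbaccel=1"))
```

On a running system the same functions can be fed the contents of /proc/cmdline instead of the literal strings used here.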
Comment 8 Ed Santiago 2014-10-25 21:30:59 UTC
Same issue. Dell M4800 with QHD+ display -- NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1), 3.16.6-gentoo (I tried 3.17, that didn't even give me a usable display).

None of the workarounds were effective for me: nouveau.nofbaccel=1 causes suspend to fail, and so did reverting ecf24de071f4f6cea79ecef5d990794df5875ee1:

   A dependency job for suspend.target failed. See 'journalctl -xn' for details.
   ...
   Oct 25 15:21:16 hostname kernel: WARNING: CPU: 0 PID: 2852 at lib/iomap.c:43 bad_io_access+0x36/0x38()
   Oct 25 15:21:16 hostname kernel: Bad IO access at port 0x24 (outl(val,port))
Comment 9 Agustín Dall'Alba 2014-10-27 22:41:17 UTC
Another update:

I tried running with a secondary monitor. Unfortunately under that setup the nouveau.nofbaccel=1 workaround doesn't cut it anymore, and only one monitor works after resume. Trying to unplug and replug or use xrandr after this has happened doesn't make the other monitor work and once even left me with no screen. I found some new kernel messages, in particular:

<6>[    0.336621] nouveau  [     PFB][0000:01:00.0] RAM type: GDDR3
<6>[    0.336623] nouveau  [     PFB][0000:01:00.0] RAM size: 512 MiB
<3>[    0.336620] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000007 FAULT at 0x00e180
--snip--
<6>[    0.365519] nouveau  [     DRM] VRAM: 512 MiB
<6>[    0.365521] nouveau  [     DRM] GART: 1048576 MiB
--snip--
<3>[    0.366886] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070
<3>[    0.368257] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070

right after nouveau loads, another:

<6>[   75.935933] nouveau  [     DRM] suspending console...
<6>[   75.935944] nouveau  [     DRM] suspending display...
<6>[   75.936012] nouveau  [     DRM] evicting buffers...
<6>[   76.206568] nouveau  [     DRM] waiting for kernel channels to go idle...
<6>[   76.206573] nouveau  [     DRM] suspending client object trees...
<6>[   76.207261] nouveau  [     DRM] suspending kernel object tree...
<3>[   76.267516] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070

immediately before suspend and:

<6>[   78.110864] nouveau  [     DRM] re-enabling device...
<6>[   78.110870] nouveau  [     DRM] resuming kernel object tree...
<6>[   78.110882] nouveau  [   VBIOS][0000:01:00.0] running init tables
<3>[   78.200040] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e074
<6>[   78.274292] nouveau  [    VOLT][0000:01:00.0] GPU voltage: 1000000uv
<6>[   78.274303] nouveau  [  PTHERM][0000:01:00.0] fan management: automatic
<6>[   78.274378] nouveau  [     CLK][0000:01:00.0] --: core 399 MHz shader 810 MHz memory 499 MHz
<3>[   78.275977] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070
<3>[   78.277301] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070
<6>[   78.277474] nouveau  [     DRM] resuming client object trees...
<6>[   78.277902] nouveau  [     DRM] resuming display...

on resume. Maybe this is another bug?

So now I'm using linux-lts 3.14.22. No problems there, suspend and multi monitor setups work great.
Comment 10 Emil Velikov 2014-10-28 01:59:56 UTC
Ed,
Considering that the workarounds mentioned do not work in your case and that you have a different card (the reporter has NV92, while yours is GK106), we can safely conclude that you're having a different issue.
Please open another bug report and let us know if it is a regression, and if so which commit broke it.


Agustín,
These two should be non-fatal and the fix for them is in 3.18. It should end up in 3.16 and 3.17 as well.
> FAULT at 0x00e070
> FAULT at 0x00e074

Now this one, I have no idea. Do you get this error with 3.14 and dual monitors?
> FAULT at 0x00e180

Linux 3.17 includes quite a few fixes in the area of s/r, can you give it a try?
Comment 11 Agustín Dall'Alba 2014-10-28 02:41:06 UTC
On 3.14.22 I get no FAULTs and suspend works fine.

On Linux 3.17.1 I get FAULTs at 0x00e070 and 0x00e074 on boot, suspend, resume, and when plugging in the second monitor for the first time. But I can't reproduce a FAULT at 0x00e180 in any way. Checking the logs, it looks like it's quite rare (it happens once every twenty or so FAULTs) and unrelated to the second monitor.
If I use nouveau.nofbaccel=1, only one of the monitors comes back after resume. If I don't, I get the garbled display, 'GPU lockup' and PGRAPH errors as in the original post.

I'm downloading linux mainline now to test.
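
For anyone wanting to check how often each FAULT address shows up in their own logs, a tally like the one above can be produced with a short Python sketch. It assumes the message format seen in this report ("MMIO write of 0x... FAULT at 0x..."); the function name count_faults and the pattern are illustrative, and the sample text is copied from the log lines quoted in comment 9.

```python
import re
from collections import Counter

# Matches the PBUS message format quoted in this report. The access type is
# matched loosely (\w+); only "write" actually appears in this thread.
FAULT_RE = re.compile(r"MMIO \w+ of 0x[0-9a-fA-F]+ FAULT at 0x([0-9a-fA-F]+)")


def count_faults(log_text: str) -> Counter:
    """Count PBUS MMIO FAULT messages per faulting address."""
    return Counter(m.group(1).lower() for m in FAULT_RE.finditer(log_text))


# Sample lines copied from comment 9 of this report.
SAMPLE = """\
<3>[    0.336620] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000007 FAULT at 0x00e180
<3>[    0.366886] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070
<3>[    0.368257] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e070
<3>[   78.200040] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00e074
<6>[   78.277474] nouveau  [     DRM] resuming client object trees...
"""

for address, hits in sorted(count_faults(SAMPLE).items()):
    print(f"0x{address}: {hits}")
```

Passing the text of a saved dmesg or /var/log/messages excerpt instead of SAMPLE gives the per-address counts for a whole session.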
Comment 12 Emil Velikov 2014-10-28 03:07:07 UTC
The upstream commit addressing the e07{0,4} messages (ignore the typo in the commit message) is 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/nouveau?id=b485a7005faba38286bc02ab1d80e2cbf61c1002

^^ is just in case 3.18 causes some other unwanted behaviour.
Comment 13 Agustín Dall'Alba 2014-10-28 03:22:10 UTC
Brilliant, Linux 3.18-rc2 resumes both monitors with nouveau.nofbaccel=1. :D

So it indeed was a different issue. The original GPU lockup bug is still there, though.
Comment 14 Agustín Dall'Alba 2015-04-30 04:37:07 UTC
Fixed in Linux 3.19 :)

