Bug 67597 - nouveau E Xorg failed to idle channel on NVA0
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-08-01 00:01 UTC by Marc Meledandri
Modified: 2014-08-10 14:56 UTC (History)

Attachments
dmesg-nouveau-crasher-idle-fail (134.65 KB, text/plain)
2013-08-01 00:01 UTC, Marc Meledandri
git-bisect-log (3.08 KB, text/plain)
2013-08-11 20:30 UTC, Marc Meledandri

Description Marc Meledandri 2013-08-01 00:01:10 UTC
Created attachment 83404
dmesg-nouveau-crasher-idle-fail

Shortly after resuming from suspend on kernel 3.10.4, nouveau crashes (hard lockup). This is the first nouveau crash I've seen since switching to xfwm4 with no compositing. I had been staying on the 3.4.x LTS series until now, though, so it's been safe.

dmesg attached.
Comment 1 Emil Velikov 2013-08-01 02:19:03 UTC
Hi Marc

Would you be able to bisect this issue? Hopefully the fix will be more trivial than the i2c one you've reported :]

Cheers
Emil
Comment 2 Ilia Mirkin 2013-08-01 02:41:42 UTC
Actually all the fence stuff was redone between... 3.4 and 3.5 or 3.5 and 3.6 (sorry, I forgot), and that created regressions for at least one other user (on a nvc0 card though). The unfortunate thing is that the fence redo actually had some suspend bugs in it that were fixed over time (but I guess not all of them!) so this may end up being tricky to bisect, if you indeed zero in on those fence commits.
Comment 3 Marc Meledandri 2013-08-02 16:28:51 UTC
Hi, I may try to bisect this but won't be able to get it done soon. There are a couple of hurdles to testing...

1) The bug itself causes a hard lockup requiring a poweroff. I use a Crucial M4 SSD, which has a known issue where it can become unusable after a power loss. This is risky, so I'll need to image my install over to an external drive to test the bisect kernel builds.

2) There are no reliable steps to reproduce the issue. I was running 3.10.3 for several days with multiple suspend cycles prior to triggering this crasher. I'm not sure I can dedicate the time to this. As much as I'd like to help the nouveau driver have one less bug, the GPU in question is aged and I may have to just let this one slide, as I'd assume it's pretty low priority for newer kernels :]

Please leave this report open. I'll try to update it at some point with results of a bisect. Can I bisect the mainline tree, or must I use the nouveau git tree?
Comment 4 Ilia Mirkin 2013-08-02 16:37:55 UTC
Shouldn't matter -- the trees are identical for the purpose of bisection. I bet running glxgears while suspending will help trigger it more often. Also, it's interesting that it's a hard hang. I would have expected X to die and come back... can you not, e.g., ssh in when this happens?
Comment 5 Marc Meledandri 2013-08-02 16:47:26 UTC
Thanks for the info. Unfortunately, it's a CPU hard lockup - no ssh connectivity and cannot sysrq out safely. Additionally, I've got a raid that doesn't get synced on this crash, on top of the SSD issue, so this becomes quite troublesome.

I had reported a similar nouveau crasher via downstream Debian BTS last year.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=696554

That's why I switched over to XFCE and started building my own vanilla 3.4.x kernels + i2c/pwm patch.
Comment 6 Marc Meledandri 2013-08-03 16:09:58 UTC
Thanks Ilia...glxgears while resuming from suspend triggered the crash on the first attempt. I'll get a test install up and running and start bisecting this soon.
Comment 7 Marc Meledandri 2013-08-11 20:29:26 UTC
Okay, I created a test environment and bisected this problem:

4f6029da58ba9204c98e33f4f3737fe085c87a6f is the first bad commit
commit 4f6029da58ba9204c98e33f4f3737fe085c87a6f
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Fri Nov 16 11:54:31 2012 +1000

    drm/nv50-nvc0: switch to common disp impl, removing previous version
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

:040000 040000 9daeb0bd5ed3e9b22b53c21fab853bd2e392f6ed 4bdbb1d96e57d3f254affb8812788f04b7474bf7 M	drivers
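
For readers unfamiliar with how a "first bad commit" like the one above is found: on a linear history, bisecting amounts to a binary search between a known-good and a known-bad revision. The following is only a minimal Python sketch of that idea, not what git does internally on a branching history. The commit names (`c0` to `c7`), the `first_bad` helper, and the `is_bad` test are all hypothetical; in the real run, "is_bad" meant building the kernel at that commit and checking whether resuming from suspend with glxgears running locks up the machine.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good and commits[-1] is known bad (hypothetical helper, for
    illustration only).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical eight-commit history in which the regression lands in c5.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
bad_from = history.index("c5")
result = first_bad(history, lambda c: history.index(c) >= bad_from)
print(result)
```

Each step halves the remaining range, which is why even a long range of kernel commits needs only a modest number of test builds.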
Comment 8 Marc Meledandri 2013-08-11 20:30:17 UTC
Created attachment 83948
git-bisect-log
Comment 9 Ilia Mirkin 2013-08-11 23:16:00 UTC
Same bisect result as https://bugs.freedesktop.org/show_bug.cgi?id=67878 . NV98 vs NVA0 -- fairly similar cards, too.

