Bug 90682 - [NVE6] failed to idle channel 0xcccc0001 + crash
Summary: [NVE6] failed to idle channel 0xcccc0001 + crash
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2015-05-27 14:30 UTC by Thomas Stewart
Modified: 2015-05-27 16:53 UTC

Attachments
dmesg captured with kdump (84.76 KB, text/plain)
2015-05-27 14:30 UTC, Thomas Stewart
Xorg.0.log with nouveau.noaccel=1 (34.39 KB, text/plain)
2015-05-27 15:48 UTC, Thomas Stewart
dmesg captured with kdump with video=eDP-1:d (84.99 KB, text/plain)
2015-05-27 16:50 UTC, Thomas Stewart
dmesg with nouveau.noaccel=1 video=eDP-1:d (82.10 KB, text/plain)
2015-05-27 16:52 UTC, Thomas Stewart
Xorg.0.log with nouveau.noaccel=1 video=eDP-1:d (32.01 KB, text/plain)
2015-05-27 16:53 UTC, Thomas Stewart

Description Thomas Stewart 2015-05-27 14:30:03 UTC
Created attachment 116084
dmesg captured with kdump

Hi,

My Lenovo W540 started crashing when I upgraded from Linux 3.14 to 3.16 in Sept 2014. I continued to use 3.14 and did nothing until last week, when I tried Linux 4.0 (from sid), which kept crashing soon after I logged in to GNOME 3. I EFI boot to GRUB with GRUB_GFXPAYLOAD_LINUX=keep, then use KMS and Plymouth until GDM, X.Org and GNOME 3. When it crashed, the mouse and keyboard stopped responding (i.e. Ctrl-Alt-F1 did not work) and the laptop appeared to drop off the network.

When I boot Linux 4.0 and log in to GNOME 3, it usually crashes within minutes, though it sometimes takes an hour. I manually compiled 3.15, 3.16, 3.17, 3.18, 3.19 and 4.0, with mixed results: the external monitor resolution was broken on 3.15, 3.16 seemed OK, and 3.17, 3.18, 3.19 and 4.0 all seemed to crash soon after boot.

I used kdump to capture a dump and dmesg and there was a message about nouveau:
[   76.792370] nouveau E[     DRM] failed to idle channel 0xcccc0001 [DRM]

Then 60 microseconds later a BUG:
[   76.792430] BUG: unable to handle kernel paging request at ffff8805660b7ffc
[   76.792455] IP: [<ffffffffa0406bf3>] evo_wait+0x53/0x120 [nouveau]
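
The 60 microsecond figure follows directly from the two dmesg timestamps. A minimal Python sketch of that arithmetic, using only the two lines quoted above (the helper names are made up for illustration):

```python
import re
from decimal import Decimal

# Matches the leading "[   76.792370]" timestamp, which dmesg prints in seconds.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")

def dmesg_seconds(line):
    """Return the timestamp of one dmesg line as a Decimal number of seconds."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError("no dmesg timestamp in: %r" % line)
    return Decimal(match.group(1))

def gap_microseconds(first, second):
    """Return the gap between two dmesg lines in whole microseconds."""
    return int((dmesg_seconds(second) - dmesg_seconds(first)) * 1000000)

idle_failure = "[   76.792370] nouveau E[     DRM] failed to idle channel 0xcccc0001 [DRM]"
paging_bug = "[   76.792430] BUG: unable to handle kernel paging request at ffff8805660b7ffc"
print(gap_microseconds(idle_failure, paging_bug))
```

Decimal is used instead of float so the subtraction is exact at microsecond precision.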

After a little googling I found the "nouveau.runpm=0" parameter. Since I added it and rebooted, my laptop has worked fine with Linux 4.0. However, I have not tried that parameter on any earlier kernel, so I am unsure in which release this workaround started working.

I'm now happy that I have a working system with working LCD screen brightness controls and multi-stream transport monitors that work. However, I don't want to kill the laptop's battery by permanently disabling the power management. I could try bisecting, but with so many revisions I'm not sure what to mark good and bad, or if I should use runpm at all.

Anyway here are some Debian package versions and info:
linux-image-4.0.0-1-amd64       4.0.2-1
libdrm-nouveau2                 2.4.60-3
libgl1-mesa-glx                 10.5.5-1
xserver-xorg-video-nouveau      1:1.0.11-1+b1

$ lspci | grep -i VGA
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1)
$

Kind Regards
--
Tom
Comment 1 Ilia Mirkin 2015-05-27 14:45:00 UTC
I think that some things are being confused here... you mention multi-stream, presumably in reference to DP-MST, but that doesn't work on nouveau, only intel and radeon. Are you actually using nouveau for anything? Can you attach an Xorg log and xrandr output? That should make things a bit clearer.

runpm is a very crude form of power management -- it'll turn the GPU off when the driver decides it hasn't been used for ~5s or so (maybe 10? I forget). It can do this using ACPI calls. It will then wake the GPU back up when it wants to use something.

Now you have a Lenovo + GK106 combo, which is well known to have lots of trouble bringing up acceleration. In fact, your logs show

[    4.472623] nouveau E[     PGR][0000:01:00.0] grctx template channel unload timeout

which means that PGRAPH doesn't come up. You could save nouveau some trouble and just use nouveau.noaccel=1. You should still be able to use reverse-prime in this case.

Now, none of that has to do with you seeing the crash, just trying to help you achieve a working system. We shouldn't be crashing even when PGRAPH fails to come up.

[   52.772499] nouveau E[     DRM] failed to idle channel 0xcccc0000 [DRM]
[   52.788584] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0xf0 [nouveau] returns -16

OK, so it tries to suspend but it can't shut down the main DRM channel, because PGRAPH is dead to begin with. It does raise the question of why it even tries to idle the channel; that means there's an active fence on it. I'm guessing that the remaining messages are due to improper cleanup of a failed runtime suspend.

If the above is in the vicinity of correct, nouveau.noaccel=1 should "fix" it, I think.
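
To pull the nouveau error lines discussed above out of a full dmesg capture, something like the following Python sketch may help. It assumes the "nouveau E[ UNIT]" prefix seen in the quoted lines marks an error, which is inferred from these logs rather than from any documented format; the function name is hypothetical.

```python
import re

# Assumed line shape, taken from the lines quoted in this report:
#   [    4.472623] nouveau E[     PGR][0000:01:00.0] message
# The "[0000:01:00.0]" PCI address is optional (the DRM lines do not have it).
NOUVEAU_ERROR = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\] nouveau E\[\s*(?P<unit>\w+)\]"
    r"(?:\[[0-9a-f:.]+\])? (?P<text>.*)$"
)

def nouveau_errors(lines):
    """Return (seconds, unit, message) for every nouveau error line."""
    found = []
    for line in lines:
        match = NOUVEAU_ERROR.match(line.strip())
        if match:
            found.append(
                (float(match.group("time")), match.group("unit"), match.group("text"))
            )
    return found

sample = [
    "[    4.472623] nouveau E[     PGR][0000:01:00.0] grctx template channel unload timeout",
    "[   52.772499] nouveau E[     DRM] failed to idle channel 0xcccc0000 [DRM]",
    "[   52.788584] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0xf0 [nouveau] returns -16",
]
for seconds, unit, text in nouveau_errors(sample):
    print(seconds, unit, text)
```

The pci_pm_runtime_suspend() line is skipped on purpose, since it comes from the PCI core rather than from nouveau's own logging.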
Comment 2 Thomas Stewart 2015-05-27 15:47:23 UTC
I do mention DP-MST! I have a ThinkPad Ultra Dock 40A2:
http://shop.lenovo.com/us/en/itemdetails/40A20090US/460/6D501EE899104FF9A362D739642CFC27
http://support.lenovo.com/us/en/documents/pd029622

It's got a variety of connectors: VGA, DVI, 2xDP, HDMI. It has a standard and advanced mode that is selectable from the BIOS. I have read that in standard mode only one port is usable but in advanced any ports are. I have mine set to standard.

I drive a pair of 2880x1620 displays with a DVI and a DP from the dock. I think the laptop and the dock use MST in order to use more than one display. Before using Linux 4.0 I had to drive one monitor with the dock and one with the VGA connector on the laptop itself.

I don't know if nouveau is used for anything; 2880x1620 is mentioned next to both intel and nouveau lines in the Xorg.0.log. In the past I blacklisted nouveau.ko once to diagnose the crashing and found that the crashing stopped, but I could only get output on the laptop display. So it seems to be involved somewhere. How does one find out which screen is driven by which card, etc?

Thanks for the info regarding runpm. I have removed that option and am running with "nouveau.noaccel=1" now. It seems to be stable, as it has been up for about 40 minutes so far. I have not used any reverse-prime stuff.

$ xrandr 
Screen 0: minimum 8 x 8, current 3840 x 1200, maximum 32767 x 32767
eDP2 connected (normal left inverted right x axis y axis)
   2880x1620     59.96 +  50.00  
   2048x1536     60.00  
   1920x1440     60.00  
   1856x1392     60.01  
   1792x1344     60.01  
   1920x1200     59.95  
   1920x1080     59.93  
   1600x1200     60.00  
   1680x1050     59.95    59.88  
   1600x1024     60.17  
   1400x1050     59.98  
   1280x1024     60.02  
   1440x900      59.89  
   1280x960      60.00  
   1360x768      59.80    59.96  
   1152x864      60.00  
   1024x768      60.00  
   800x600       60.32    56.25  
   640x480       59.94  
DP1 disconnected (normal left inverted right x axis y axis)
DP2 disconnected (normal left inverted right x axis y axis)
DP2-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200     59.95*+
   1920x1080     59.99    60.00    50.00    50.00    50.00    59.94  
   1600x1200     60.00  
   1680x1050     59.95  
   1280x1024     75.02    72.05    60.02  
   1440x900      74.98    59.89  
   1280x720      60.00    50.00    59.94  
   1024x768      75.08    70.07    60.00  
   800x600       72.19    75.00    60.32  
   720x576       50.00  
   720x480       60.00    59.94  
   640x480       75.00    72.81    66.03    60.00    59.94  
   720x400       70.08  
DP2-2 connected 1920x1200+1920+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200     59.95*+
   1920x1080     59.99  
   1600x1200     60.00  
   1680x1050     59.88  
   1280x1024     75.02    72.05    60.02  
   1440x900      74.98    59.90  
   1024x768      75.08    70.07    60.00  
   800x600       72.19    75.00    60.32  
   640x480       75.00    72.81    66.03    60.00  
   720x400       70.08  
DP2-3 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
VGA1 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)
$

$ xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x73 cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 10 associated providers: 0 name:Intel
Provider 1: id: 0x45 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 0 name:nouveau
$
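
The cap values in the --listproviders output are bitmasks. The names xrandr prints after them correspond to the RandR 1.4 provider capability bits (1 Source Output, 2 Sink Output, 4 Source Offload, 8 Sink Offload). A rough Python sketch that parses the two provider lines above and decodes the masks; parse_providers and the regular expression are illustrative and only cover the output format shown here:

```python
import re

# RandR 1.4 provider capability bits.
CAPABILITY_BITS = [
    (0x1, "Source Output"),
    (0x2, "Sink Output"),
    (0x4, "Source Offload"),
    (0x8, "Sink Offload"),
]

PROVIDER_LINE = re.compile(
    r"Provider (?P<index>\d+): id: (?P<id>0x[0-9a-f]+) cap: (?P<cap>0x[0-9a-f]+),"
    r".*? crtcs: (?P<crtcs>\d+) outputs: (?P<outputs>\d+)"
    r" associated providers: \d+ name:(?P<name>\S+)"
)

def decode_capabilities(mask):
    """Turn a capability bitmask into the list of capability names."""
    return [name for bit, name in CAPABILITY_BITS if mask & bit]

def parse_providers(text):
    """Parse `xrandr --listproviders` output into a list of dicts."""
    providers = []
    for line in text.splitlines():
        match = PROVIDER_LINE.search(line)
        if match is None:
            continue  # header line or unrecognised format
        providers.append({
            "name": match.group("name"),
            "crtcs": int(match.group("crtcs")),
            "outputs": int(match.group("outputs")),
            "capabilities": decode_capabilities(int(match.group("cap"), 16)),
        })
    return providers

SAMPLE = """\
Providers: number : 2
Provider 0: id: 0x73 cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 10 associated providers: 0 name:Intel
Provider 1: id: 0x45 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 0 name:nouveau
"""

for provider in parse_providers(SAMPLE):
    print(provider["name"], provider["outputs"], ", ".join(provider["capabilities"]))
```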
Comment 3 Thomas Stewart 2015-05-27 15:48:27 UTC
Created attachment 116086
Xorg.0.log with nouveau.noaccel=1
Comment 4 Thomas Stewart 2015-05-27 15:51:35 UTC
Apologies, I copy and pasted the wrong resolutions. The external monitors are 1920x1200.
Comment 5 Ilia Mirkin 2015-05-27 16:06:10 UTC
All of those screens are attached to intel. I can't think of a reason why you would have lost your DP screens.

However it seems like your eDP panel is actively attached to *both* the intel *and* the nvidia adapters. Yeah, *that*'ll work great. Ugh. There also appears to be a DVI-D output, which may or may not be real.

I wonder if that's the cause of the trouble. It seems like intel is allocating its outputs second, so the first eDP screen goes to nouveau. But presumably nouveau is card1 while intel is card0.

Double-check with

grep . /sys/class/drm/card*-*/status

I bet that there's an eDP-1 and an eDP-2, and that eDP-2 is on the card that all the other useful ones are on. In that case, try booting with

video=eDP-1:d
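
After rebooting, it can be worth confirming that the parameter actually reached the kernel. A small Python sketch that splits a kernel command line into its options; the sample string is an illustrative command line rather than one taken from this machine, and on a real system the contents of /proc/cmdline would be passed in instead. Quoted values containing spaces are not handled.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: [values]}.

    Options can repeat (several video= entries are allowed), so every
    name maps to a list. Bare flags such as "quiet" get the value None.
    """
    params = {}
    for token in cmdline.split():
        name, separator, value = token.partition("=")
        params.setdefault(name, []).append(value if separator else None)
    return params

# Illustrative command line; on a real system use open("/proc/cmdline").read().
sample_cmdline = "ro quiet nouveau.noaccel=1 video=eDP-1:d"

params = kernel_params(sample_cmdline)
print("video options:", params.get("video", []))
print("nouveau.noaccel:", params.get("nouveau.noaccel", []))
print("nouveau.runpm:", params.get("nouveau.runpm", []))
```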
Comment 6 Thomas Stewart 2015-05-27 16:46:00 UTC
Here is the status:
$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-DVI-D-1/status:disconnected
/sys/class/drm/card0-eDP-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:disconnected
/sys/class/drm/card1-DP-3/status:connected
/sys/class/drm/card1-DP-4/status:connected
/sys/class/drm/card1-DP-5/status:disconnected
/sys/class/drm/card1-eDP-2/status:connected
/sys/class/drm/card1-HDMI-A-1/status:disconnected
/sys/class/drm/card1-HDMI-A-2/status:disconnected
/sys/class/drm/card1-VGA-1/status:disconnected
$

I'm not sure about the DVI-D; I do have a Thunderbolt socket on the laptop that I have never used. I'm not sure if the card order has always been the same, but it looks like card0 is nouveau and card1 is intel:

$ ls -l /sys/class/drm/card?/device/driver
lrwxrwxrwx 1 root root 0 May 27 17:36 /sys/class/drm/card0/device/driver -> ../../../../bus/pci/drivers/nouveau
lrwxrwxrwx 1 root root 0 May 27 17:35 /sys/class/drm/card1/device/driver -> ../../../bus/pci/drivers/i915
$
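
The grep and ls output above can be combined to answer the earlier question of which connector belongs to which driver. A minimal Python sketch, assuming the /sys/class/drm layout shown in this comment (cardN-CONNECTOR/status files and a cardN/device/driver symlink); connectors_by_driver is a hypothetical helper name:

```python
import os

def connectors_by_driver(drm_root="/sys/class/drm"):
    """Return {driver: {connector: status}} from a DRM sysfs directory."""
    result = {}
    if not os.path.isdir(drm_root):
        return result  # no DRM sysfs here (e.g. not running on Linux)
    for entry in sorted(os.listdir(drm_root)):
        # Connector entries look like "card1-eDP-2"; plain "card1" has no dash.
        card, dash, connector = entry.partition("-")
        if not dash or not card.startswith("card"):
            continue
        status_file = os.path.join(drm_root, entry, "status")
        if not os.path.exists(status_file):
            continue
        driver_link = os.path.join(drm_root, card, "device", "driver")
        if os.path.exists(driver_link):
            # The symlink target ends in the driver name, e.g. ".../drivers/i915".
            driver = os.path.basename(os.path.realpath(driver_link))
        else:
            driver = "unknown"
        with open(status_file) as handle:
            status = handle.read().strip()
        result.setdefault(driver, {})[connector] = status
    return result

if __name__ == "__main__":
    for driver, connectors in sorted(connectors_by_driver().items()):
        for connector, status in sorted(connectors.items()):
            print(driver, connector, status)
```

With the values shown above, this would be expected to group DVI-D-1 and eDP-1 under nouveau and everything else under i915.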

I rebooted without "nouveau.noaccel=1" and with "video=eDP-1:d" but it crashed again in nv50_display_init(). I then rebooted with "nouveau.noaccel=1 video=eDP-1:d" and it seems stable but I'm not sure that has actually done anything as dmesg and Xorg.0.log seem very similar.
Comment 7 Thomas Stewart 2015-05-27 16:50:53 UTC
Created attachment 116087
dmesg captured with kdump with video=eDP-1:d
Comment 8 Thomas Stewart 2015-05-27 16:52:32 UTC
Created attachment 116088
dmesg with nouveau.noaccel=1 video=eDP-1:d
Comment 9 Thomas Stewart 2015-05-27 16:53:15 UTC
Created attachment 116089
Xorg.0.log with nouveau.noaccel=1 video=eDP-1:d