|Summary:||[NVE6] failed to idle channel 0xcccc0001 + crash|
|Product:||xorg||Reporter:||Thomas Stewart <thomas>|
|Component:||Driver/nouveau||Assignee:||Nouveau Project <nouveau>|
|Status:||NEW ---||QA Contact:||Xorg Project Team <xorg-team>|
Description Thomas Stewart 2015-05-27 14:30:03 UTC
Created attachment 116084 [details]
dmesg captured with kdump

Hi,

My Lenovo W540 started crashing when I upgraded from Linux 3.14 to 3.16 in Sept 2014. I continued to use 3.14 and did nothing until last week, when I tried Linux 4.0 (from sid), which kept crashing soon after logging in to GNOME 3.

I EFI boot to GRUB and use GRUB_GFXPAYLOAD_LINUX=keep, then use KMS and Plymouth until GDM, X.Org and GNOME 3. When it crashed, the mouse and keyboard no longer did anything (i.e. Ctrl-Alt-F1 did not work) and the laptop appeared to drop off the network. When I boot with Linux 4.0 and log in to GNOME 3, it usually crashes within minutes, but sometimes only after an hour.

I manually compiled 3.15, 3.16, 3.17, 3.18, 3.19 and 4.0, which all had various issues: the external monitor resolution was broken on 3.15; 3.16 seemed OK; but 3.17, 3.18, 3.19 and 4.0 all seemed to crash soon after boot.

I used kdump to capture a dump and dmesg, and there was a message about nouveau:

[ 76.792370] nouveau E[ DRM] failed to idle channel 0xcccc0001 [DRM]

Then 60 microseconds later a BUG:

[ 76.792430] BUG: unable to handle kernel paging request at ffff8805660b7ffc
[ 76.792455] IP: [<ffffffffa0406bf3>] evo_wait+0x53/0x120 [nouveau]

After a little googling I found out about the "nouveau.runpm=0" parameter. Once I added this parameter and rebooted, my laptop has worked fine with Linux 4.0. However, I have not tried that parameter with any previous kernel, so I am unsure in which release this workaround started working.

I'm now happy that I have a working system with working LCD screen brightness controls and multi-stream transport monitors that work. However, I don't want to kill the laptop's battery by permanently disabling the power management.

I could try bisecting, but with so many revisions I'm not sure what to mark good and bad, or whether I should use runpm at all.
Anyway, here are some Debian package versions and info:

linux-image-4.0.0-1-amd64 4.0.2-1
libdrm-nouveau2 2.4.60-3
libgl1-mesa-glx 10.5.5-1
xserver-xorg-video-nouveau 1:1.0.11-1+b1

$ lspci | grep -i VGA
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1)
$

Kind Regards
--
Tom
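For anyone comparing dumps from different kernels, the gap between the "failed to idle channel" message and the BUG can be read off the dmesg timestamps mechanically. The following is only a minimal sketch: the function names `parse_line` and `gap_before_bug` are made up for illustration, and it assumes the default dmesg timestamp format with six fractional digits, as in the lines quoted above.

```python
import re

# Matches the "[ 76.792370]" timestamp prefix of a dmesg line
# (seconds, then exactly six digits of microseconds).
_TS = re.compile(r"^\[\s*(\d+)\.(\d{6})\]\s*(.*)$")


def parse_line(line):
    """Return (timestamp in microseconds, message) or None if there is no timestamp."""
    m = _TS.match(line.strip())
    if m is None:
        return None
    secs, usecs, msg = m.groups()
    return int(secs) * 1_000_000 + int(usecs), msg


def gap_before_bug(lines, marker="failed to idle channel"):
    """Microseconds between the most recent marker line and the first
    following line whose message starts with 'BUG:'; None if not found."""
    last_marker = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if marker in msg:
            last_marker = ts
        elif msg.startswith("BUG:") and last_marker is not None:
            return ts - last_marker
    return None


# The three lines quoted in the description.
SAMPLE = [
    "[ 76.792370] nouveau E[ DRM] failed to idle channel 0xcccc0001 [DRM]",
    "[ 76.792430] BUG: unable to handle kernel paging request at ffff8805660b7ffc",
    "[ 76.792455] IP: [<ffffffffa0406bf3>] evo_wait+0x53/0x120 [nouveau]",
]
```

Calling `gap_before_bug(SAMPLE)` on the quoted lines gives 60, matching the 60 microseconds mentioned above. Timestamps are kept as integer microseconds to avoid floating-point rounding.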
Comment 1 Ilia Mirkin 2015-05-27 14:45:00 UTC
I think that some things are being confused here... you mention multi-stream, presumably in reference to DP-MST, but that doesn't work on nouveau, only intel and radeon. Are you actually using nouveau for anything? Can you attach an xorg log and xrandr output? That should make things a bit clearer.

runpm is a very crude form of power management -- it'll turn the GPU off when the driver decides it hasn't been used for ~5s or so (maybe 10? I forget). It can do this using ACPI calls. It will then wake the GPU back up when it wants to use something.

Now you have a Lenovo + GK106 combo, which is well known to have lots of trouble bringing up acceleration. In fact, your logs show

[ 4.472623] nouveau E[ PGR][0000:01:00.0] grctx template channel unload timeout

which means that PGRAPH doesn't come up. You could save nouveau some trouble and just use nouveau.noaccel=1. You should still be able to use reverse-prime in this case.

Now, none of that has to do with you seeing the crash, just trying to help you achieve a working system. We shouldn't be crashing even when PGRAPH fails to come up.

[ 52.772499] nouveau E[ DRM] failed to idle channel 0xcccc0000 [DRM]
[ 52.788584] pci_pm_runtime_suspend(): nouveau_pmops_runtime_suspend+0x0/0xf0 [nouveau] returns -16

OK, so it tries to suspend but it can't shut down the main DRM channel, because PGRAPH is dead to begin with. Although it begs the question of why it even tries to idle the channel; that means there's an active fence on it.

I'm guessing that the remaining messages are due to improper cleanup of a failed runtime suspend. If the above is in the vicinity of correct, nouveau.noaccel=1 should "fix" it, I think.
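To see whether runtime PM has actually powered the GPU down at a given moment, the kernel's generic runtime-PM attributes in sysfs can be read. A rough sketch follows, assuming the GPU sits at PCI address 0000:01:00.0 (taken from the log line above) and that the standard `power/control` and `power/runtime_status` files are present. The helper name `read_runtime_pm` and the `sysfs_root` parameter are illustrative only.

```python
from pathlib import Path


def read_runtime_pm(pci_addr, sysfs_root="/sys"):
    """Return the runtime-PM 'control' and 'runtime_status' values for a
    PCI device, or None for any attribute that cannot be read."""
    power = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "power"
    result = {}
    for name in ("control", "runtime_status"):
        try:
            result[name] = (power / name).read_text().strip()
        except OSError:
            # Attribute missing or unreadable (e.g. device not present).
            result[name] = None
    return result
```

On the affected laptop this would be called as `read_runtime_pm("0000:01:00.0")`. A `runtime_status` of "suspended" would suggest the GPU has been powered off, and "active" that it is awake. The `sysfs_root` argument exists only so the function can be pointed at a test directory.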
Comment 2 Thomas Stewart 2015-05-27 15:47:23 UTC
I do mention DP-MST! I have a ThinkPad Ultra Dock 40A2:

http://shop.lenovo.com/us/en/itemdetails/40A20090US/460/6D501EE899104FF9A362D739642CFC27
http://support.lenovo.com/us/en/documents/pd029622

It's got a variety of connectors: VGA, DVI, 2xDP, HDMI. It has a standard and an advanced mode, selectable from the BIOS. I have read that in standard mode only one port is usable but in advanced mode any ports are. I have mine set to standard.

I drive a pair of 2880x1620 displays with a DVI and a DP from the dock. I think the laptop and the dock use MST in order to use more than one display. Before using Linux 4.0 I had to drive one monitor with the dock and one with the VGA connector on the laptop itself.

I don't know if nouveau is used for anything; 2880x1620 is mentioned next to both the intel and nouveau lines in the Xorg.0.log. In the past I blacklisted nouveau.ko once to diagnose the crashing and found that the crashing stopped, but I could only get output on the laptop display. So it seems to be involved somewhere. How does one find out which screen is driven by which card, etc?

Thanks for the info regarding runpm. I have removed that option and am running with "nouveau.noaccel=1" now. It seems to be stable, as it's been up about 40 min so far. I have not used any reverse-prime stuff.
$ xrandr
Screen 0: minimum 8 x 8, current 3840 x 1200, maximum 32767 x 32767
eDP2 connected (normal left inverted right x axis y axis)
   2880x1620 59.96 + 50.00
   2048x1536 60.00
   1920x1440 60.00
   1856x1392 60.01
   1792x1344 60.01
   1920x1200 59.95
   1920x1080 59.93
   1600x1200 60.00
   1680x1050 59.95 59.88
   1600x1024 60.17
   1400x1050 59.98
   1280x1024 60.02
   1440x900 59.89
   1280x960 60.00
   1360x768 59.80 59.96
   1152x864 60.00
   1024x768 60.00
   800x600 60.32 56.25
   640x480 59.94
DP1 disconnected (normal left inverted right x axis y axis)
DP2 disconnected (normal left inverted right x axis y axis)
DP2-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200 59.95*+
   1920x1080 59.99 60.00 50.00 50.00 50.00 59.94
   1600x1200 60.00
   1680x1050 59.95
   1280x1024 75.02 72.05 60.02
   1440x900 74.98 59.89
   1280x720 60.00 50.00 59.94
   1024x768 75.08 70.07 60.00
   800x600 72.19 75.00 60.32
   720x576 50.00
   720x480 60.00 59.94
   640x480 75.00 72.81 66.03 60.00 59.94
   720x400 70.08
DP2-2 connected 1920x1200+1920+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200 59.95*+
   1920x1080 59.99
   1600x1200 60.00
   1680x1050 59.88
   1280x1024 75.02 72.05 60.02
   1440x900 74.98 59.90
   1024x768 75.08 70.07 60.00
   800x600 72.19 75.00 60.32
   640x480 75.00 72.81 66.03 60.00
   720x400 70.08
DP2-3 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
VGA1 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)
$

$ xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x73 cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 10 associated providers: 0 name:Intel
Provider 1: id: 0x45 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 0 name:nouveau
$
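The provider listing above already hints at which driver owns how many outputs. As an illustrative sketch only, the lines can be pulled apart like this. The regular expression is written against the exact layout shown in this comment and may need adjusting for other xrandr versions. `parse_providers` is a made-up helper name.

```python
import re

# Layout assumed from the output pasted above:
# "Provider N: id: 0x.. cap: 0x.., <names> crtcs: N outputs: N associated providers: N name:<name>"
_PROVIDER = re.compile(
    r"Provider (\d+): id: (0x[0-9a-f]+) cap: (0x[0-9a-f]+)(.*?) "
    r"crtcs: (\d+) outputs: (\d+) associated providers: (\d+) name:(.*)$",
    re.IGNORECASE,
)


def parse_providers(text):
    """Parse `xrandr --listproviders` output into a list of dicts."""
    providers = []
    for line in text.splitlines():
        m = _PROVIDER.search(line.strip())
        if m is None:
            continue  # header line, prompt, or blank line
        index, pid, cap, cap_names, crtcs, outputs, assoc, name = m.groups()
        providers.append({
            "index": int(index),
            "id": int(pid, 16),
            "cap": int(cap, 16),
            # Capability names follow the hex mask, comma-separated.
            "cap_names": [c.strip() for c in cap_names.split(",") if c.strip()],
            "crtcs": int(crtcs),
            "outputs": int(outputs),
            "associated": int(assoc),
            "name": name.strip(),
        })
    return providers


# The two provider lines from the comment above.
SAMPLE = """Providers: number : 2
Provider 0: id: 0x73 cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 10 associated providers: 0 name:Intel
Provider 1: id: 0x45 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 0 name:nouveau
"""
```

For the sample, `parse_providers(SAMPLE)` yields two entries: Intel with 10 outputs and nouveau with 2.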
Comment 3 Thomas Stewart 2015-05-27 15:48:27 UTC
Created attachment 116086 [details] Xorg.0.log with nouveau.noaccel=1
Comment 4 Thomas Stewart 2015-05-27 15:51:35 UTC
Apologies, I copy and pasted the wrong resolutions. The external monitors are 1920x1200.
Comment 5 Ilia Mirkin 2015-05-27 16:06:10 UTC
All of those screens are attached to intel. I can't think of a reason why you would have lost your DP screens.

However, it seems like your eDP panel is actively attached to *both* the intel *and* the nvidia adapters. Yeah, *that*'ll work great. Ugh. There also appears to be a DVI-D output, which may or may not be real.

I wonder if that's the cause of the trouble. It seems like intel is allocating its outputs second, so the first eDP screen goes to nouveau. But presumably nouveau is card1 while intel is card0. Double-check with

grep . /sys/class/drm/card*-*/status

I bet that there's an eDP-1 and an eDP-2, and that eDP-2 is on the card that all the other useful ones are on. In that case, try booting with

video=eDP-1:d
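When juggling several boot parameters like these, it can help to confirm which ones the running kernel was actually booted with. A small sketch, assuming the usual /proc/cmdline file and naive whitespace splitting (quoted values containing spaces are not handled). `boot_options` and `read_cmdline` are illustrative names, and the sample command line is hypothetical rather than taken from the reporter's machine.

```python
def boot_options(cmdline, prefixes=("nouveau.", "video=")):
    """Return the whitespace-separated boot options that start with one
    of the given prefixes, in the order they appear."""
    return [tok for tok in cmdline.split() if tok.startswith(prefixes)]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read()


# Hypothetical command line, for illustration only.
EXAMPLE = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet nouveau.noaccel=1 video=eDP-1:d"
```

On a live system, `boot_options(read_cmdline())` would list whichever nouveau.* and video= options are in effect for that boot.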
Comment 6 Thomas Stewart 2015-05-27 16:46:00 UTC
Here is the status:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-DVI-D-1/status:disconnected
/sys/class/drm/card0-eDP-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:disconnected
/sys/class/drm/card1-DP-3/status:connected
/sys/class/drm/card1-DP-4/status:connected
/sys/class/drm/card1-DP-5/status:disconnected
/sys/class/drm/card1-eDP-2/status:connected
/sys/class/drm/card1-HDMI-A-1/status:disconnected
/sys/class/drm/card1-HDMI-A-2/status:disconnected
/sys/class/drm/card1-VGA-1/status:disconnected
$

I'm not sure about the DVI-D; I do have a Thunderbolt socket on the laptop that I have never used.

I'm not sure if the card order has always been the same, but it looks like card0 is nouveau and card1 is intel:

$ ls -l /sys/class/drm/card?/device/driver
lrwxrwxrwx 1 root root 0 May 27 17:36 /sys/class/drm/card0/device/driver -> ../../../../bus/pci/drivers/nouveau
lrwxrwxrwx 1 root root 0 May 27 17:35 /sys/class/drm/card1/device/driver -> ../../../bus/pci/drivers/i915
$

I rebooted without "nouveau.noaccel=1" and with "video=eDP-1:d", but it crashed again in nv50_display_init().

I then rebooted with "nouveau.noaccel=1 video=eDP-1:d" and it seems stable, but I'm not sure that it has actually done anything, as dmesg and Xorg.0.log seem very similar.
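The two commands above can be combined into one listing that shows, for each connector, which card and kernel driver it belongs to and whether something is plugged in. This is only a sketch built on the sysfs layout visible in the output above. `connectors_by_driver` is a made-up name, and the `sysfs_root` parameter is there purely so it can be pointed at a test tree.

```python
import os
from pathlib import Path


def connectors_by_driver(sysfs_root="/sys"):
    """Return (connector, card, driver, status) tuples for every DRM
    connector found under <sysfs_root>/class/drm."""
    drm = Path(sysfs_root) / "class" / "drm"
    rows = []
    for status_file in sorted(drm.glob("card*-*/status")):
        # Directory names look like "card0-eDP-1": card, then connector.
        card, _, connector = status_file.parent.name.partition("-")
        driver_link = drm / card / "device" / "driver"
        if os.path.islink(driver_link):
            # The symlink target ends in the driver name, e.g. ".../drivers/nouveau".
            driver = os.path.basename(os.readlink(driver_link))
        else:
            driver = "unknown"
        rows.append((connector, card, driver, status_file.read_text().strip()))
    return rows


if __name__ == "__main__":
    for connector, card, driver, status in connectors_by_driver():
        print(f"{connector:12} {card:6} {driver:8} {status}")
```

With the layout shown in this comment, eDP-1 and DVI-D-1 would be listed under nouveau (card0) and everything else, including eDP-2, under i915 (card1).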
Comment 7 Thomas Stewart 2015-05-27 16:50:53 UTC
Created attachment 116087 [details] dmesg captured with kdump with video=eDP-1:d
Comment 8 Thomas Stewart 2015-05-27 16:52:32 UTC
Created attachment 116088 [details] dmesg with nouveau.noaccel=1 video=eDP-1:d