Bug 107829

Summary: nouveau crash/freeze [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
Product: xorg Reporter: sassmann
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: medium CC: hgcoin, john, karolherbst
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=108080
Whiteboard:
Attachments:
dmesg.txt (flags: none)
Syslog with nouveau events leading to hard lock (flags: none)

Description sassmann 2018-09-05 07:56:43 UTC
Created attachment 141458 [details]
dmesg.txt

After an arbitrary amount of time working in the terminal, the screen hard-freezes.
External monitor is connected via DisplayPort on Lenovo Dock.

dmesg then shows:
[172362.507754] nouveau 0000:01:00.0: gr: TRAP ch 6 [00ff396000 Xorg[4075]]
[172362.507767] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
[172362.507775] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
[172362.507782] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3c0009 [ILLEGAL_INSTR_ENCODING]
[172362.507805] nouveau 0000:01:00.0: gr: TRAP ch 6 [00ff396000 Xorg[4075]]
[172362.507815] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000000 [] warp 3e0009 [ILLEGAL_INSTR_ENCODING]
[172362.507823] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000000 [] warp 3e0009 [ILLEGAL_INSTR_ENCODING]
[172362.507830] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000000 [] warp 3d0009 [ILLEGAL_INSTR_ENCODING]
[172362.517638] nouveau 0000:01:00.0: gr: TRAP ch 6 [00ff396000 Xorg[4075]]
[172362.517651] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3d0009 [ILLEGAL_INSTR_ENCODING]
[172362.517658] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3d0009 [ILLEGAL_INSTR_ENCODING]
[172362.517665] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
[172362.517685] nouveau 0000:01:00.0: gr: TRAP ch 6 [00ff396000 Xorg[4075]]
[172362.517695] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000000 [] warp 3c0009 [ILLEGAL_INSTR_ENCODING]
[172362.517702] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000000 [] warp 3c0009 [ILLEGAL_INSTR_ENCODING]
[172362.517711] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000000 [] warp 3e0009 [ILLEGAL_INSTR_ENCODING]
[172362.534375] nouveau 0000:01:00.0: gr: TRAP ch 6 [00ff396000 Xorg[4075]]
[172362.534387] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
[172362.534394] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
[172362.534399] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3d0009 [ILLEGAL_INSTR_ENCODING]
A reboot is required at this point. This happens regularly, sometimes after a few hours, sometimes after 1-2 days.

Hardware: Lenovo P50 running Fedora 28 4.17.19-200.fc28.x86_64
xorg-x11-drv-nouveau-1.0.15-4.fc28.x86_64
mesa-dri-drivers-18.0.5-3.fc28.x86_64

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GLM [Quadro M1000M] [10de:13b1] (rev a2) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device [17aa:2230]
        Flags: bus master, fast devsel, latency 0, IRQ 131
        Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
        Memory at a0000000 (64-bit, prefetchable) [size=256M]
        Memory at b0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 4000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: nouveau
        Kernel modules: nouveau
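
For triage, the gr trap lines in the dmesg excerpt above can be tallied per unit and warp error. The following is a minimal Python sketch, assuming the line layout shown in this report. The function name `summarize_traps` and the regular expression are illustrative and are not part of nouveau or any existing tool.

```python
import re
from collections import Counter

# Assumption: trap lines look like the ones quoted in this report, e.g.
# "... gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f0009 [ILLEGAL_INSTR_ENCODING]"
TRAP_RE = re.compile(
    r"gr: (GPC\d+/TPC\d+)/MP trap: global (\w+) \[(.*?)\] warp (\w+) \[(.*?)\]"
)


def summarize_traps(dmesg_text):
    """Count trap lines per (unit, warp error) pair.

    Lines that do not match the trap layout (for example the
    "gr: TRAP ch ..." header lines) are ignored.
    """
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TRAP_RE.search(line)
        if match:
            unit, _global_reg, _global_flags, _warp_reg, warp_error = match.groups()
            counts[(unit, warp_error)] += 1
    return counts
```

Passing the contents of the attached dmesg.txt to `summarize_traps` would show whether the errors are spread evenly across TPCs or concentrated in one unit.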
Comment 1 sassmann 2018-09-05 12:13:14 UTC
The error on kernel 4.18.5 looked a little different:
[11735.012648] nouveau 0000:01:00.0: gr: TRAP ch 5 [00ff35f000 Xorg[2601]]
[11735.012660] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 000e [OOR_ADDR]
[11735.012666] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3000e [OOR_ADDR]
[11735.012672] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000000 [] warp 000e [OOR_ADDR]
[11735.012678] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000000 [] warp 000e [OOR_ADDR]
[11735.012701] nouveau 0000:01:00.0: gr: TRAP ch 5 [00ff35f000 Xorg[2601]]
[11735.012709] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 9000e [OOR_ADDR]
[11735.012715] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 2000e [OOR_ADDR]
[11735.012720] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 6000e [OOR_ADDR]
[11735.012726] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 4000e [OOR_ADDR]
[11735.013192] nouveau 0000:01:00.0: gr: TRAP ch 4 [00ff8a7000 systemd-logind[1203]]
[11735.013201] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 30009 [ILLEGAL_INSTR_ENCODING]
[11735.013208] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000000 [] warp 10009 [ILLEGAL_INSTR_ENCODING]
[11735.013214] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 0009 [ILLEGAL_INSTR_ENCODING]
[11735.013219] nouveau 0000:01:00.0: gr: GPC0/TPC3/MP trap: global 00000000 [] warp 10009 [ILLEGAL_INSTR_ENCODING]
Comment 2 sassmann 2018-12-22 11:35:33 UTC
Tested again with 4.20-rc7. System ran for 1-2 days and then froze again.
The error looked different though.
[162840.653595] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[162840.653610] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
[162840.653621] nouveau 0000:01:00.0: fifo: channel 4: killed
[162840.653631] nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
[162840.654013] nouveau 0000:01:00.0: systemd-logind[1383]: channel 4 killed!
Comment 3 kenorb 2019-01-06 00:08:40 UTC
Related post: https://askubuntu.com/q/1046945/78223
Comment 4 Karol Herbst 2019-01-09 00:47:13 UTC
Problem is, the log isn't able to tell us which application is causing that.

Do you think you could try to SSH into the machine while the screen is frozen and check with top/htop whether anything is consuming a lot of CPU? And see if killing that application unfreezes the screen?

If the only application consuming significantly more CPU is Xorg, maybe killing applications inside the Xorg session until it unfreezes could help us track down which application is causing that freeze.
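
The check suggested above can also be scripted so that the result can be pasted into the bug. This is a rough Python sketch, assuming a procps-style `ps` is available in the SSH session. The helper names `top_cpu` and `snapshot` are made up for illustration. Note that `ps` reports CPU usage averaged over each process's lifetime, so top/htop remain the better view of what is busy right now.

```python
import subprocess


def top_cpu(ps_output, limit=5):
    """Return (cpu_percent, pid, command) tuples, highest CPU first.

    Expects text in the format of `ps -eo pcpu,pid,comm --no-headers`.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pcpu, pid, comm = parts
        rows.append((float(pcpu), int(pid), comm.strip()))
    rows.sort(reverse=True)
    return rows[:limit]


def snapshot(limit=5):
    """Run ps on the local machine (for example inside the SSH session)."""
    out = subprocess.run(
        ["ps", "-eo", "pcpu,pid,comm", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    return top_cpu(out, limit)
```

Calling `snapshot()` during a freeze and again after killing a suspect application gives two lists that can be compared.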
Comment 5 sassmann 2019-01-09 12:52:49 UTC
Unfortunately there's no process hogging the CPU. Would it help to add some debug kernel cmd line options?
Comment 6 sassmann 2019-01-09 17:52:27 UTC
So I went ahead and tried 5.0-rc1, playing video in Chromium to put some load on the GPU. After a few hours it froze again.

[ 6603.232849] nouveau 0000:01:00.0: gr: TRAP ch 4 [00ff8a9000 systemd-logind[1385]]
[ 6603.232865] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 3e0001 [STACK_ERROR]
[ 6603.246125] nouveau 0000:01:00.0: gr: TRAP ch 4 [00ff8a9000 systemd-logind[1385]]
[ 6603.246137] nouveau 0000:01:00.0: gr: GPC0/TPC2/MP trap: global 00000000 [] warp 3d0001 [STACK_ERROR]

I then logged in via SSH and Chromium was running at 100% CPU. After killing Chromium the log went a bit further, but the screen did not recover and stayed frozen.

[ 6758.631306] nouveau 0000:01:00.0: Xorg[6494]: failed to idle channel 8 [Xorg[6494]]
[ 6773.631329] nouveau 0000:01:00.0: Xorg[6494]: failed to idle channel 8 [Xorg[6494]]
[ 6773.632334] nouveau 0000:01:00.0: fifo: fault 00 [READ] at 00000000000c2000 engine 07 [HOST0] client 06 [HUB/HOST] reason 42 [] on channel 8 [00fec69000 Xorg[6494]]
[ 6773.632347] nouveau 0000:01:00.0: fifo: channel 8: killed
[ 6773.632352] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
[ 6773.632366] nouveau 0000:01:00.0: fifo: engine 5: scheduled for recovery
[ 6773.632378] nouveau 0000:01:00.0: Xorg[6494]: channel 8 killed!
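
The "channel N killed!" lines in the logs in this report name the process that owns each killed channel. A small Python sketch for pulling those out of a longer log, assuming the message layout quoted here; `killed_channels` is a hypothetical helper name. As noted in comment 4, the owner of the channel (Xorg or systemd-logind in these logs) is not necessarily the application that triggered the fault.

```python
import re

# Assumption: lines look like
# "... nouveau 0000:01:00.0: Xorg[6494]: channel 8 killed!"
KILLED_RE = re.compile(r"nouveau \S+: (\S+)\[(\d+)\]: channel (\d+) killed!")


def killed_channels(log_text):
    """Return (process_name, pid, channel) for every killed-channel line."""
    return [
        (name, int(pid), int(channel))
        for name, pid, channel in KILLED_RE.findall(log_text)
    ]
```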
Comment 7 sassmann 2019-01-10 16:45:35 UTC
Happened again, this time while starting firefox. No tabs opened yet.
nouveau 0000:01:00.0: disp: 0x000064a8[0]: INIT_GENERIC_CONDITON: unknown 0x07
nouveau 0000:01:00.0: DRM: GPU lockup - switching to software fbcon
nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
nouveau 0000:01:00.0: fifo: channel 4: killed
nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
nouveau 0000:01:00.0: systemd-logind[1385]: channel 4 killed!
Comment 8 Karol Herbst 2019-01-29 17:09:28 UTC
Are you still on Fedora 28? I could create a Mesa package for Fedora 29 for you to try, which should fix those issues, but it would need to be tested for a while to be sure.

I am sure what the issue is; it is just that fixing it is quite challenging.
Comment 9 Karol Herbst 2019-01-29 17:26:55 UTC
Anyway, here is the copr for fc29. I just triggered a new build with an updated version: https://copr.fedorainfracloud.org/coprs/karolherbst/mesa/

The version should be 18.2.8-9001.fc29.
Comment 10 sassmann 2019-01-30 09:01:48 UTC
I'm on f29 by now.

Looking at
https://copr.fedorainfracloud.org/coprs/karolherbst/mesa/
the build failed. 

Let me know when a new build is available.
Thanks!
Comment 11 sassmann 2019-04-15 10:34:21 UTC
Any news on this? Still having this issue.
Comment 12 Harry Coin 2019-07-20 14:15:22 UTC
Created attachment 144827 [details]
Syslog with nouveau events leading to hard lock

Attached is a /var/log/syslog snippet showing many events leading to the hard lock, followed by a trimmed reboot trace showing the nouveau configuration.
Kernel is generic Linux ceo1homenx 5.2.0-8-generic #9-Ubuntu SMP Mon Jul 8 13:07:27 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Context: a development system left running overnight crashed after perhaps 10 hours of idleness. The system is not a server for anything and is not running any VMs. The running apps of note were an email client and a web browser, otherwise just basic KDE. The compositor was off.
Comment 13 Martin Peres 2019-12-04 09:44:46 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/454.