Bug 100567 - Nouveau system freeze fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: high critical
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Duplicates: 96562 98138
Depends on: 103721
Blocks:
Reported: 2017-04-04 20:09 UTC by Jeremy Booker
Modified: 2019-01-08 11:52 UTC

Attachments
journalctl output for last boot to crash (238.19 KB, text/x-log)
2017-04-04 20:09 UTC, Jeremy Booker
dmesg with SCHED_ERROR (starting at 73576 sec) (1.68 MB, text/plain)
2017-04-13 18:56 UTC, Kevin Liu
journalctl -kb of CTXSW_TIMEOUT on 4.10.10 (1.15 MB, text/plain)
2017-04-20 14:44 UTC, Kevin Liu
syslog (142.73 KB, text/plain)
2017-06-26 09:44 UTC, jadziadax30
dmesg (100.05 KB, text/plain)
2017-06-26 09:45 UTC, jadziadax30

Description Jeremy Booker 2017-04-04 20:09:12 UTC
Created attachment 130676
journalctl output for last boot to crash

I'm experiencing a random but recurring hard freeze (I cannot even switch virtual terminals) on a machine running an nVidia card with three monitors under Fedora 25. Recent kernel and driver updates seem to have made this freeze much more frequent (5-6 times in two work days).

The output of journalctl -b -1 is attached. The most relevant lines are:

kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
kernel: nouveau 0000:01:00.0: fifo: gr engine fault on channel 12, recovering...
kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:37:head-0] hw_done timed out
kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:37:head-0] hw_done timed out
kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:37:head-0] flip_done timed out

A possibly related bug is #96562, but it is old and hasn't received any attention, so I'm opening a new one.
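
The log lines quoted above follow a regular pattern, so a saved journal can be scanned for them automatically. Below is a minimal sketch in Python, assuming the message layout seen in this report (PCI address, `fifo:` prefix, hex code, bracketed name). The function name `find_sched_errors` and the sample list are illustrative only and are not part of nouveau or journalctl; other kernel versions may format the message differently.

```python
import re

# Pattern modeled on the lines quoted in this report (assumption:
# other kernel versions may print the message differently).
SCHED_ERROR_RE = re.compile(
    r"nouveau (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"fifo: SCHED_ERROR (?P<code>[0-9a-f]{2}) \[(?P<name>[A-Z_]+)\]"
)

def find_sched_errors(lines):
    """Return (pci_address, code, name) for each SCHED_ERROR line."""
    hits = []
    for line in lines:
        match = SCHED_ERROR_RE.search(line)
        if match:
            hits.append((match["pci"], match["code"], match["name"]))
    return hits

# Two lines copied from the description; only the first is a SCHED_ERROR.
sample = [
    "kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]",
    "kernel: nouveau 0000:01:00.0: fifo: gr engine fault on channel 12, recovering...",
]
print(find_sched_errors(sample))
```

Feeding it the lines of a file saved from `journalctl -b -1` would list every occurrence from the previous boot.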
Comment 1 Jeremy Booker 2017-04-05 16:18:18 UTC
I've tried downgrading to kernel 4.9.14, as the 4.10 series seems to have other changes relating to video drivers and nouveau (which are also causing crashes).

I made it from ~8am to ~noon without a crash. The same basic log entries appear in the journalctl output:


Apr 05 12:10:09 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Apr 05 12:10:09 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: gr engine fault on channel 10, recovering...
Comment 2 Jeremy Booker 2017-04-06 15:18:12 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=90453 Sounds related.
Comment 3 Kevin Liu 2017-04-10 23:53:06 UTC
I have the same issue, on Linux 4.11-rc6. journalctl -b:

Apr 10 18:22:55 jenny kernel: nouveau 0000:02:00.0: gr: TRAP ch 2 [007f901000 X[6443]]
Apr 10 18:22:55 jenny kernel: nouveau 0000:02:00.0: gr: GPC2/TPC1/MP trap: global 00000000 [] warp 3f0009 [ILLEGAL_INSTR_ENCODING]
...
Apr 10 18:22:59 jenny kernel: nouveau 0000:02:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Apr 10 18:22:59 jenny kernel: nouveau 0000:02:00.0: fifo: runlist 0: scheduled for recovery
Apr 10 18:22:59 jenny kernel: nouveau 0000:02:00.0: fifo: channel 2: killed
Apr 10 18:22:59 jenny kernel: nouveau 0000:02:00.0: fifo: engine 0: scheduled for recovery
Apr 10 18:22:59 jenny kernel: nouveau 0000:02:00.0: X[6443]: channel 2 killed!

lspci -vv (GTX 770; NVE4):

02:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 770] (rev a1) (prog-if 00 [VGA controller])         
        Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 [GeForce GTX 770]                                          
        Flags: bus master, fast devsel, latency 0, IRQ 33                                                                    
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]                                                             
        Memory at f0000000 (64-bit, prefetchable) [size=128M]                                                                
        Memory at f8000000 (64-bit, prefetchable) [size=32M]                                                                 
        I/O ports at e000 [size=128]                                                                                         
        Expansion ROM at fb000000 [disabled] [size=512K]                                                                     
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: nouveau

02:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 HDMI Audio Controller
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Kernel driver in use: snd_hda_intel

I'll post a proper dmesg the next time the bug appears.
Comment 4 Kevin Liu 2017-04-13 18:56:34 UTC
Created attachment 130834
dmesg with SCHED_ERROR (starting at 73576 sec)

It happened again! I believe the monitors were turned off when the problem occurred (I wasn't there at the time), but when I returned the monitors were on and frozen. I could move the mouse around, but nothing else responded.

The errors start at [73576].
Comment 5 Kevin Liu 2017-04-20 14:44:24 UTC
Created attachment 130945
journalctl -kb of CTXSW_TIMEOUT on 4.10.10

Can reproduce on Linux 4.10.10 as well. Along with the initial CTXSW_TIMEOUT, it seems to hang a few kernel tasks relating to atomic commits.

Relevant(?) snippet:

Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: TRAP ch 3 [007f7c2000 X[3611]]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC1/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3e000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC2/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3e000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: TRAP ch 3 [007f7c2000 X[3611]]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC1/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC1/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC2/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC3/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3f000d [OOR_REG]
Apr 20 10:06:03 jenny kernel: nouveau 0000:02:00.0: gr: GPC3/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3d000d [OOR_REG]
Apr 20 10:06:07 jenny kernel: nouveau 0000:02:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Apr 20 10:06:07 jenny kernel: nouveau 0000:02:00.0: fifo: gr engine fault on channel 6, recovering...
Apr 20 10:06:18 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:18 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:19 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:19 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:19 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:20 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:20 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:20 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
Apr 20 10:06:21 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16
...
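
The value -16 in the repeated nv50cal_space lines looks like a negated Linux errno, which would make it EBUSY. That reading is an assumption based on the usual kernel return-code convention and has not been confirmed by a nouveau developer in this thread. A minimal Python sketch that collapses such repeated lines and names the code (the helper `summarize_returns` is hypothetical):

```python
import errno
import re
from collections import Counter

# Matches a trailing "<function>: -<number>" as in "nv50cal_space: -16".
RET_RE = re.compile(r"(?P<func>\w+): (?P<ret>-\d+)$")

def summarize_returns(lines):
    """Count (function, errno name) pairs among lines ending in a negative code."""
    counts = Counter()
    for line in lines:
        match = RET_RE.search(line.strip())
        if match:
            code = -int(match["ret"])
            # errno.errorcode reflects the platform running this script;
            # on Linux, 16 maps to EBUSY.
            name = errno.errorcode.get(code, "unknown")
            counts[(match["func"], name)] += 1
    return counts

sample = [
    "Apr 20 10:06:18 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16",
    "Apr 20 10:06:18 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16",
    "Apr 20 10:06:19 jenny kernel: nouveau 0000:02:00.0: X[3611]: nv50cal_space: -16",
    "Apr 20 10:06:07 jenny kernel: nouveau 0000:02:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]",
]
for (func, name), count in summarize_returns(sample).items():
    print(f"{func} returned {name} {count} times")
```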
Comment 6 jadziadax30 2017-06-26 09:42:46 UTC
Same problem on 4.9.16-gentoo.

I've attached syslog and dmesg. 

On average, the problem occurs every one or two days on my machine, which has one monitor at a resolution of 3440x1440.


The x11-drivers/xf86-video-nouveau driver version is 1.0.15
Comment 7 jadziadax30 2017-06-26 09:44:38 UTC
Created attachment 132245
syslog
Comment 8 jadziadax30 2017-06-26 09:45:37 UTC
Created attachment 132246
dmesg
Comment 9 rahulmehra 2017-06-30 07:39:47 UTC
I think I am experiencing the same problem on 4.10.0-24-generic on Ubuntu 17.04 with GNOME 3.24. I have one monitor connected to an Nvidia card.

The desktop freezes anywhere from 20 minutes to 5 hours after boot.

This is the system log at the time of the crash:

Jun 30 16:49:47 r-ubuntu kernel: [ 2282.503403] nouveau 0000:01:00.0: gr: TRAP ch 2 [003fa29000 Xorg[1154]]
Jun 30 16:49:47 r-ubuntu kernel: [ 2282.503411] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 3c000d [OOR_REG]


Let me know if I need to provide more info.
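
The gr trap lines posted in this thread share one layout (GPC and TPC index, a global status word with flags, then a warp status word with a reason). A minimal Python sketch that tallies the warp reasons per GPC/TPC, assuming exactly that layout; the name `count_warp_traps` is hypothetical and the pattern is derived only from the lines quoted here:

```python
import re
from collections import Counter

# Layout inferred from the trap lines quoted in this bug (assumption).
TRAP_RE = re.compile(
    r"gr: GPC(?P<gpc>\d+)/TPC(?P<tpc>\d+)/MP trap: "
    r"global (?P<global>[0-9a-f]{8}) \[(?P<gflags>[A-Z_ ]*)\] "
    r"warp (?P<warp>[0-9a-f]+) \[(?P<reason>[A-Z_ ]*)\]"
)

def count_warp_traps(lines):
    """Count trap lines per (gpc, tpc, warp reason)."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match:
            key = (int(match["gpc"]), int(match["tpc"]), match["reason"])
            counts[key] += 1
    return counts

sample = [
    "nouveau 0000:01:00.0: gr: TRAP ch 2 [003fa29000 Xorg[1154]]",
    "nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000004 "
    "[MULTIPLE_WARP_ERRORS] warp 3c000d [OOR_REG]",
    "nouveau 0000:02:00.0: gr: GPC2/TPC1/MP trap: global 00000000 [] "
    "warp 3f0009 [ILLEGAL_INSTR_ENCODING]",
]
for (gpc, tpc, reason), count in sorted(count_warp_traps(sample).items()):
    print(f"GPC{gpc}/TPC{tpc}: {reason} x{count}")
```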
Comment 10 Géza Búza 2017-11-14 11:20:12 UTC
It must be a driver bug, as I see the same error on a different kernel and different graphics hardware.


dmesg output:

nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
nouveau 0000:01:00.0: fifo: channel 12: killed
nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery


lspci -vv -s 01:00.0

VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Dell GK106GLM [Quadro K2100M]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 25
Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at e000 [size=128]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nouveau
Kernel modules: nouveau


uname -a

Linux Geza-DellM4700 4.13.11-1-ARCH #1 SMP PREEMPT Thu Nov 2 10:25:56 CET 2017 x86_64 GNU/Linux


lsmod

Module                  Size  Used by
fuse                   94208  3
ccm                    20480  9
cmac                   16384  1
rfcomm                 69632  32
ipt_MASQUERADE         16384  1
nf_nat_masquerade_ipv4    16384  1 ipt_MASQUERADE
nf_conntrack_netlink    36864  0
nfnetlink              16384  2 nf_conntrack_netlink
xfrm_user              32768  1
xfrm_algo              16384  1 xfrm_user
iptable_nat            16384  1
nf_conntrack_ipv4      16384  3
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
nf_nat_ipv4            16384  1 iptable_nat
xt_addrtype            16384  2
iptable_filter         16384  1
xt_conntrack           16384  1
nf_nat                 24576  2 nf_nat_masquerade_ipv4,nf_nat_ipv4
nf_conntrack          110592  7 nf_conntrack_ipv4,ipt_MASQUERADE,nf_conntrack_netlink,nf_nat_masquerade_ipv4,xt_conntrack,nf_nat_ipv4,nf_nat
libcrc32c              16384  2 nf_conntrack,nf_nat
crc32c_generic         16384  0
br_netfilter           24576  0
bridge                139264  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
joydev                 20480  0
mousedev               20480  0
snd_hda_codec_hdmi     49152  1
snd_hda_codec_idt      49152  1
snd_hda_codec_generic    69632  1 snd_hda_codec_idt
bnep                   20480  2
arc4                   16384  2
hid_logitech_hidpp     32768  0
mei_wdt                16384  0
iTCO_wdt               16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
ppdev                  20480  0
intel_rapl             20480  0
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
dell_laptop            20480  0
coretemp               16384  0
dell_smm_hwmon         16384  0
kvm_intel             192512  0
iwlmvm                299008  0
kvm                   516096  1 kvm_intel
mac80211              688128  1 iwlmvm
irqbypass              16384  1 kvm
crct10dif_pclmul       16384  0
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
pcbc                   16384  0
snd_hda_intel          36864  8
snd_hda_codec         106496  4 snd_hda_intel,snd_hda_codec_idt,snd_hda_codec_hdmi,snd_hda_codec_generic
aesni_intel           184320  8
uvcvideo               86016  0
snd_hda_core           65536  5 snd_hda_intel,snd_hda_codec,snd_hda_codec_idt,snd_hda_codec_hdmi,snd_hda_codec_generic
aes_x86_64             20480  1 aesni_intel
videobuf2_vmalloc      16384  1 uvcvideo
crypto_simd            16384  1 aesni_intel
snd_hwdep              20480  1 snd_hda_codec
btusb                  40960  0
glue_helper            16384  1 aesni_intel
videobuf2_memops       16384  1 videobuf2_vmalloc
e1000e                225280  0
nls_iso8859_1          16384  1
btrtl                  16384  1 btusb
cryptd                 20480  3 crypto_simd,ghash_clmulni_intel,aesni_intel
videobuf2_v4l2         20480  1 uvcvideo
iwlwifi               217088  1 iwlmvm
intel_cstate           16384  0
snd_pcm                86016  4 snd_hda_intel,snd_hda_codec,snd_hda_core,snd_hda_codec_hdmi
videobuf2_core         36864  2 uvcvideo,videobuf2_v4l2
btbcm                  16384  1 btusb
nls_cp437              20480  1
btintel                16384  1 btusb
dell_wmi               16384  0
dell_smbios            16384  2 dell_wmi,dell_laptop
bluetooth             479232  68 btrtl,btintel,bnep,btbcm,rfcomm,btusb
vfat                   20480  1
snd_timer              28672  1 snd_pcm
fat                    65536  1 vfat
videodev              155648  3 uvcvideo,videobuf2_core,videobuf2_v4l2
intel_rapl_perf        16384  0
snd                    73728  24 snd_hda_intel,snd_hwdep,snd_hda_codec,snd_hda_codec_idt,snd_timer,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_pcm
dcdbas                 16384  1 dell_smbios
ecdh_generic           24576  1 bluetooth
mei_me                 36864  1
media                  32768  2 uvcvideo,videodev
psmouse               135168  0
input_leds             16384  0
sparse_keymap          16384  1 dell_wmi
wmi_bmof               16384  0
cfg80211              532480  3 iwlmvm,iwlwifi,mac80211
crc16                  16384  1 bluetooth
hid_logitech_dj        20480  0
ptp                    20480  1 e1000e
lpc_ich                24576  0
i2c_i801               24576  0
soundcore              16384  1 snd
mei                    81920  3 mei_me,mei_wdt
pps_core               20480  1 ptp
shpchp                 32768  0
parport_pc             28672  0
parport                40960  2 parport_pc,ppdev
dell_smo8800           16384  0
thermal                20480  0
dell_rbtn              16384  0
rfkill                 20480  10 bluetooth,dell_laptop,dell_rbtn,cfg80211
battery                20480  0
ac                     16384  0
evdev                  24576  39
mac_hid                16384  0
squashfs               49152  1
loop                   28672  2
sch_fq_codel           20480  6
vboxnetflt             28672  0
vboxnetadp             28672  0
pci_stub               16384  1
vboxpci                24576  0
vboxdrv               393216  3 vboxnetadp,vboxnetflt,vboxpci
nfsd                  315392  13
auth_rpcgss            57344  1 nfsd
overlay                65536  0
oid_registry           16384  1 auth_rpcgss
nfs_acl                16384  1 nfsd
lockd                  86016  1 nfsd
sg                     36864  0
grace                  16384  2 nfsd,lockd
crypto_user            16384  0
sunrpc                282624  19 auth_rpcgss,nfsd,nfs_acl,lockd
ip_tables              24576  2 iptable_filter,iptable_nat
x_tables               32768  5 ip_tables,iptable_filter,ipt_MASQUERADE,xt_addrtype,xt_conntrack
btrfs                1036288  1
xor                    24576  1 btrfs
raid6_pq              114688  1 btrfs
sr_mod                 24576  0
cdrom                  53248  1 sr_mod
sd_mod                 49152  3
usbhid                 45056  0
hid                   114688  4 usbhid,hid_logitech_dj,hid_logitech_hidpp
serio_raw              16384  0
atkbd                  24576  0
libps2                 16384  2 atkbd,psmouse
ahci                   36864  2
libahci                28672  1 ahci
firewire_ohci          40960  0
crc32c_intel           24576  2
libata                208896  2 ahci,libahci
xhci_pci               16384  0
sdhci_pci              28672  0
sdhci                  40960  1 sdhci_pci
ehci_pci               16384  0
xhci_hcd              188416  1 xhci_pci
scsi_mod              155648  4 sd_mod,libata,sr_mod,sg
ehci_hcd               73728  1 ehci_pci
firewire_core          57344  1 firewire_ohci
mmc_core              122880  2 sdhci,sdhci_pci
crc_itu_t              16384  1 firewire_core
usbcore               208896  7 uvcvideo,usbhid,ehci_hcd,xhci_pci,btusb,xhci_hcd,ehci_pci
usb_common             16384  1 usbcore
i8042                  24576  1 dell_laptop
serio                  20480  6 serio_raw,atkbd,psmouse,i8042
nouveau              1564672  44
button                 16384  1 nouveau
video                  36864  3 dell_wmi,dell_laptop,nouveau
led_class              16384  5 iwlmvm,sdhci,input_leds,dell_laptop,nouveau
mxm_wmi                16384  1 nouveau
wmi                    20480  4 dell_wmi,wmi_bmof,mxm_wmi,nouveau
i2c_algo_bit           16384  1 nouveau
drm_kms_helper        131072  1 nouveau
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
ttm                    81920  1 nouveau
drm                   303104  29 nouveau,ttm,drm_kms_helper
agpgart                36864  3 nouveau,ttm,drm
Comment 11 988alex 2017-11-30 20:08:07 UTC
Same problem on Linux Neon 4.10.0-40-generic
xserver-xorg-video-nouveau-hwe-16.04 - 1:1.0.14-0ubuntu1~16.04.1
Comment 12 Marc Burkhardt 2017-12-29 17:52:46 UTC
On my Lenovo P50 this problem appeared only after I started using a 4.14 kernel. Earlier kernels are stable with regard to this problem. This is what I can extract from the logs:

Dec 28 17:23:13 marc kernel: [ 5657.352825] nouveau 0000:01:00.0: fifo: channel 2: killed
Dec 28 17:23:13 marc kernel: [ 5657.352827] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Dec 28 17:23:13 marc kernel: [ 5657.352830] nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
Dec 28 17:23:13 marc kernel: [ 5657.352833] nouveau 0000:01:00.0: X[5708]: channel 2 killed!
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: gr: TRAP ch 2 [00ffbd0000 X[5708]]
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: gr: GPC0/TPC0/TEX: 80000000
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: gr: GPC0/TPC1/TEX: 80000041
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: fifo: read fault at 000ac40000 engine 00 [GR] client 07 [GPC0/T1_2] reason 02 [PTE] on channel 2 [00ffbd0000 X[5708]]
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: fifo: channel 2: killed
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
Dec 28 17:23:13 local kernel: nouveau 0000:01:00.0: X[5708]: channel 2 killed!
Dec 28 17:24:03 marc kernel: [ 5706.891489] sysrq: SysRq : Keyboard mode set to system default
Dec 28 17:24:03 local kernel: sysrq: SysRq : Keyboard mode set to system default
Dec 28 17:24:04 marc exiting on signal 15

I found this in my Xorg.0.log but I don't know if it's somehow related to the above:

[   155.535] (EE) libinput bug: timer event13 debounce short: offset negative (-2229)
Comment 13 Jim Scarborough 2018-01-30 13:01:57 UTC
This looks very similar to Bug 103721.
Comment 14 Marc Burkhardt 2018-03-01 17:26:28 UTC
Hello there?!?!

Out of pure curiosity: is any developer interested in this, or have users been left alone watching their machines freeze for almost 10 months now?

Is there anything I (or any other user) could do to provide more info? Maybe apply a patch, send debug log, try this or that?

It's really sad to see absolutely no progress, only more complaints, for that long...
Comment 15 Zéfling 2018-07-30 21:20:19 UTC
I also have had this problem with a GTX 770 since I switched to KDE 5 about 5 months ago. Everything freezes for no apparent reason, except the mouse cursor.
Kubuntu 17.10, Kubuntu 18.04, Debian 9 KDE and KDE Neon all show the same problem. Moreover, it is worse with the nVidia driver.

---

01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 770] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 [GeForce GTX 770]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 27
        Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at e8000000 (64-bit, prefetchable) [size=128M]
        Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at e000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau

01:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 HDMI Audio Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 17
        Region 0: Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
Comment 16 Daniel Gustaw 2018-11-05 10:27:12 UTC
I had the same problem:

Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.056748] nouveau 0000:05:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.056765] nouveau 0000:05:00.0: fifo: runlist 0: scheduled for recovery
Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.056781] nouveau 0000:05:00.0: fifo: channel 15: killed
Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.056791] nouveau 0000:05:00.0: fifo: engine 7: scheduled for recovery
Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.056798] nouveau 0000:05:00.0: fifo: engine 0: scheduled for recovery
Nov  5 10:56:01 daniel-Inspiron-3543 kernel: [ 8431.057508] nouveau 0000:05:00.0: gnome-shell[4200]: channel 15 killed!

lsb_release -d
Description:	Ubuntu 18.04.1 LTS
gnome-shell --version
GNOME Shell 3.28.3

sudo lshw -c video
  *-display                 
       description: VGA compatible controller
       product: GK107 [NVS 510]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:05:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nouveau latency=0
       resources: irq:47 memory:de000000-deffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:c000(size=128) memory:c0000-dffff

hwinfo --gfxcard
56: PCI 500.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.378]
  Unique ID: Ddhb.WEoD030yuUF
  Parent ID: _Znp.jlvukYYcj4B
  SysFS ID: /devices/pci0000:00/0000:00:02.0/0000:05:00.0
  SysFS BusID: 0000:05:00.0
  Hardware Class: graphics card
  Model: "nVidia GK107 [NVS 510]"
  Vendor: pci 0x10de "nVidia Corporation"
  Device: pci 0x0ffd "GK107 [NVS 510]"
  SubVendor: pci 0x10de "nVidia Corporation"
  SubDevice: pci 0x0967 
  Revision: 0xa1
  Driver: "nouveau"
  Driver Modules: "nouveau"
  Memory Range: 0xde000000-0xdeffffff (rw,non-prefetchable)
  Memory Range: 0xc0000000-0xcfffffff (ro,non-prefetchable)
  Memory Range: 0xd0000000-0xd1ffffff (ro,non-prefetchable)
  I/O Ports: 0xc000-0xc07f (rw)
  Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
  IRQ: 47 (296925 events)
  I/O Port: 0x00 (rw)
  Module Alias: "pci:v000010DEd00000FFDsv000010DEsd00000967bc03sc00i00"
  Driver Info #0:
    Driver Status: nvidiafb is not active
    Driver Activation Cmd: "modprobe nvidiafb"
  Driver Info #1:
    Driver Status: nouveau is active
    Driver Activation Cmd: "modprobe nouveau"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #73 (PCI bridge)

Primary display adapter: #56
Comment 17 sassmann 2018-12-22 11:41:16 UTC
Same issue with Lenovo P50 on 4.20-rc7.

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GLM [Quadro M1000M] [10de:13b1] (rev a2) (prog-if 00 [VGA controller])

[162840.653595] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[162840.653610] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
[162840.653621] nouveau 0000:01:00.0: fifo: channel 4: killed
[162840.653631] nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
[162840.654013] nouveau 0000:01:00.0: systemd-logind[1383]: channel 4 killed!
Comment 18 kenorb 2019-01-05 21:29:37 UTC
The same problem on Ubuntu 18.10, kernel 4.18.0-13.

I have four GPUs, all NVIDIA GeForce GTX 1080 Ti graphics cards (3584 cores each), with a 3-way SLI connector.

$ uname -a
Linux Ubuntu-PC 4.18.0-13-generic #14-Ubuntu SMP Wed Dec 5 09:04:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Errors in kern.log file:

nouveau 0000:65:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
nouveau 0000:65:00.0: fifo: runlist 0: scheduled for recovery
nouveau 0000:65:00.0: fifo: channel 2: killed
nouveau 0000:65:00.0: fifo: engine 0: scheduled for recovery
nouveau 0000:65:00.0: Xorg[5447]: channel 2 killed!
nouveau 0000:65:00.0: systemd-logind[3394]: nv50cal_space: -16
nouveau 0000:65:00.0: systemd-logind[3394]: nv50cal_space: -16
(the same message repeated 800 times)

The system froze (no response to mouse or keyboard); however, the kernel still reacted to a few Magic SysRq keys, so here are some stack traces:

INFO: task kworker/u72:8:492 blocked for more than 120 seconds.
      Tainted: G           O      4.18.0-13-generic #14-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u72:8   D    0   492      2 0x80000000
Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]

Call Trace at 20:25:50:
 __schedule+0x29e/0x840
 schedule+0x2c/0x80
 schedule_timeout+0x258/0x360
 ? nv50_wndw_atomic_destroy_state+0x1d/0x20 [nouveau]
 dma_fence_default_wait+0x1fc/0x260
 ? dma_fence_release+0xa0/0xa0
 dma_fence_wait_timeout+0x3e/0xf0
 drm_atomic_helper_wait_for_fences+0x3f/0xc0 [drm_kms_helper]
 nv50_disp_atomic_commit_tail+0x78/0x860 [nouveau]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
 process_one_work+0x20f/0x3c0
 worker_thread+0x34/0x400
 kthread+0x120/0x140
 ? pwq_unbound_release_workfn+0xd0/0xd0
 ? kthread_bind+0x40/0x40
 ret_from_fork+0x35/0x40

Same call trace at 20:29:51 (a few minutes later, while Xorg was still frozen):
Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
Call Trace:
 __schedule+0x29e/0x840
 ? apic_timer_interrupt+0xa/0x20
 ? __drm_crtc_commit_free+0x12/0x20 [drm]
 schedule+0x2c/0x80
 schedule_timeout+0x258/0x360
 ? nv50_wndw_atomic_destroy_state+0x1d/0x20 [nouveau]
 dma_fence_default_wait+0x1fc/0x260
 ? dma_fence_release+0xa0/0xa0
 dma_fence_wait_timeout+0x3e/0xf0
 drm_atomic_helper_wait_for_fences+0x3f/0xc0 [drm_kms_helper]
 nv50_disp_atomic_commit_tail+0x78/0x860 [nouveau]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
 process_one_work+0x20f/0x3c0
 worker_thread+0x34/0x400
 kthread+0x120/0x140
 ? pwq_unbound_release_workfn+0xd0/0xd0
 ? kthread_bind+0x40/0x40
 ret_from_fork+0x35/0x40

Another one:
INFO: task Xorg:5447 blocked for more than 120 seconds.
      Tainted: G           O      4.18.0-13-generic #14-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg            D    0  5447   5445 0x00000004
Call Trace:
 __schedule+0x29e/0x840
 schedule+0x2c/0x80
 schedule_preempt_disabled+0xe/0x10
 __ww_mutex_lock.isra.6+0x3c1/0x660
 __ww_mutex_lock_slowpath+0x16/0x20
 ww_mutex_lock+0x34/0x50
 drm_modeset_lock+0x6e/0xb0 [drm]
 drm_crtc_get_sequence_ioctl+0xbc/0x190 [drm]
 ? drm_wait_vblank_ioctl+0x610/0x610 [drm]
 drm_ioctl_kernel+0xa4/0xf0 [drm]
 drm_ioctl+0x227/0x400 [drm]
 ? drm_wait_vblank_ioctl+0x610/0x610 [drm]
 ? do_iter_write+0xe1/0x1a0
 ? do_iter_write+0xe1/0x1a0
 nouveau_drm_ioctl+0x73/0xc0 [nouveau]
 do_vfs_ioctl+0xa8/0x620
 ? __sys_recvmsg+0x88/0xa0
 ksys_ioctl+0x67/0x90
 __x64_sys_ioctl+0x1a/0x20
 do_syscall_64+0x5a/0x110
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3f654b93c7
Code: Bad RIP value.
RSP: 002b:00007ffd57bbf168 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffd57bbf200 RCX: 00007f3f654b93c7
RDX: 00007ffd57bbf1a0 RSI: 00000000c018643b RDI: 000000000000000e
RBP: 00007ffd57bbf1a0 R08: 0000000000000000 R09: 00005646eb8ff7c0
R10: 00005646eb54ad30 R11: 0000000000000246 R12: 00000000c018643b
R13: 000000000000000e R14: 00005646eb54b800 R15: 00005646eb466880

Full log: https://gist.github.com/kenorb/5b95caa1694dbf7f030ccc808a110856
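
The hung-task reports above name the blocked task and its PID in a fixed format, so they can be pulled out of a long kern.log to see which tasks are stuck. A minimal Python sketch, assuming the message wording shown in this comment; `find_hung_tasks` is a hypothetical helper, not a kernel or distribution tool:

```python
import re

# Wording taken from the hung-task messages quoted above.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)

def find_hung_tasks(lines):
    """Return (task_name, pid, seconds) for each hung-task report."""
    found = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            found.append((match["comm"], int(match["pid"]), int(match["secs"])))
    return found

sample = [
    "INFO: task kworker/u72:8:492 blocked for more than 120 seconds.",
    "Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]",
    "INFO: task Xorg:5447 blocked for more than 120 seconds.",
]
for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

The greedy task-name group is deliberate: worker names such as kworker/u72:8 contain a colon themselves, and the PID is whatever follows the last colon before "blocked".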
Comment 19 kenorb 2019-01-05 21:44:24 UTC
*** Bug 98138 has been marked as a duplicate of this bug. ***
Comment 20 kenorb 2019-01-05 21:51:50 UTC
Related, possible dup: #99900
Comment 21 kenorb 2019-01-05 21:53:06 UTC
Based on the provided call stacks, related commit for nv50_wndw_atomic_destroy_state: https://lore.kernel.org/patchwork/patch/781346/
Comment 22 kenorb 2019-01-06 14:37:57 UTC
*** Bug 96562 has been marked as a duplicate of this bug. ***
Comment 23 A. Wilcox 2019-01-06 18:55:30 UTC
Possibly related: my Bug 104448.

