Summary: | [GF110] KDE plasma locks randomly due to crash of nouveau driver | ||
---|---|---|---|
Product: | Mesa | Reporter: | Philippe Condé <conde.philippe> |
Component: | Drivers/DRI/nouveau | Assignee: | Nouveau Project <nouveau> |
Status: | RESOLVED FIXED | QA Contact: | Nouveau Project <nouveau> |
Severity: | normal | ||
Priority: | medium | CC: | lakostis |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
extract of journalctl with nouveau crash
output of dmesg |
Are you using mesa 11.0.3 or later? Also, what GPU is this? (a full dmesg log or lspci -nn -d 10de::300 should provide the info) Hello here the output: Mesa version # rpm -aq Mesa* Mesa-libGLESv2-2-11.0.4-125.1.x86_64 Mesa-libEGL1-11.0.4-125.1.x86_64 Mesa-libglapi0-32bit-11.0.4-125.1.x86_64 Mesa-32bit-11.0.4-125.1.x86_64 Mesa-libGL1-11.0.4-125.1.x86_64 Mesa-libGL-devel-11.0.4-125.1.x86_64 Mesa-libglapi0-11.0.4-125.1.x86_64 Mesa-libEGL-devel-11.0.4-125.1.x86_64 Mesa-libEGL1-32bit-11.0.4-125.1.x86_64 Mesa-11.0.4-125.1.x86_64 Mesa-libGL1-32bit-11.0.4-125.1.x86_64 VIDEO Card lspci -nn -d 10de::300 0a:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF100GL [Quadro 4000] [10de:06dd] (rev a1) Regards Philippe Created attachment 119718 [details]
output of dmesg
Nov 14 19:06:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[4257]] subc 6 mthd 2878 data a0c02020 A few things wrong with this picture... first off, nouveau never uses subchannel 6. Method 2878 is: lookup -a c8 -d SUBCHAN -- -v obj-class GF100_3D 2878 TFB_VARYING_LOCS[0][0x1e] => 0 But those are all emitted in one fell swoop, starting from 2800. No way something could get into the middle of that and screw up the channel on just that one. (Since they don't have independent begin clauses). So this has to have come from elsewhere.... but where? Is this the first error? The first error is always the most important... the rest can often happen as a result of the earlier ones. Are you sure the kernel update is what brought this about and not a KDE update? Is this KDE4 or KDE5? hello, it is kde5 (KF5). With opensuse Tumbleweed we do a global update (snapshot). Difficult to see thereafter if this is due to KDE or the kernel. Three examples from journalctl: the very first error after the boot message: the lock occurred during reactivation of the session: Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: timeout at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gpfifogf100.c:66/gf100_fifo_gpfifo_engine_fini()! Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: channel 14 [xembedsniproxy[2209]] kick timeout Nov 14 18:02:58 hpprol2 kernel: nouveau: xembedsniproxy[2209]:cccc0000:0000906f: detach gr failed, -16 Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: timeout at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gpfifogf100.c:66/gf100_fifo_gpfifo_engine_fini()! Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: channel 20 [xembedsniproxy[2229]] kick timeout Nov 14 18:02:58 hpprol2 kernel: nouveau: xembedsniproxy[2229]:cccc0000:0000906f: detach gr failed, -16 Nov 14 18:02:59 hpprol2 audit[2124]: ANOM_ABEND auid=1000 uid=1000 gid=100 ses=1 pid=2124 comm="kactivitymanage" exe="/usr/bin/kactivitymanagerd" sig=11 2) the first errors with a corrupt screen followed by a lock when loading Yast Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 y2controlcenter[25892]] Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering... Nov 15 10:36:37 hpprol2 kernel: nouveau 0000:0a:00.0: DRM: GPU lockup - switching to software fbcon 3) the system locked in the night 15-16 and the first error message are Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000 there are a lot of this messages and then errors about kscreenlocker ov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 2000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 kscreenlocker_g[14277]] Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering... the only common process in all case is nouveau Regards Philippe Hello, I tested with the non Xen kernel vmlinux-4.3.0-1-default. The crash exists also with this kernel here the journalctl output with a crash when loading a page in the browser: The screen was totally corrupted but I succeeded to go the the console CTRL Alt F1 and could load top ==> plasmashell (PID 2358) was running with 100%. I tried to shutdown but the system locked (I did not see any shutdown messages and did a hard reset) Nov 17 13:21:23 hpprol2 login[1943]: ROOT LOGIN ON tty1 Nov 17 13:22:02 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: failed to idle channel 0xcccc0000 [plasmashell[2358]] Nov 17 13:22:15 hpprol2 systemd-logind[1141]: Power key pressed. Nov 17 13:22:17 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: failed to idle channel 0xcccc0000 [plasmashell[2358]] Nov 17 13:22:23 hpprol2 kernel: nouveau 0000:0a:00.0: Xorg[2019]: nv50cal_space: -16 The last line was repeated many times thereafter are some lines related to the shutdown which end with Nov 17 13:22:56 hpprol2 display-manager[4671]: Shutting down service kdm..done Nov 17 13:23:06 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: failed to idle channel 0xcccc0000 [xembedsniproxy[2424]] Nov 17 13:23:21 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: failed to idle channel 0xcccc0000 [xembedsniproxy[2424]] Nov 17 13:23:30 hpprol2 systemd-logind[1141]: Power key pressed. can it be the plasmashell which start a crash in nouveau driver? Regards Philippe (In reply to Philippe Condé from comment #6) > Hello, > > I tested with the non Xen kernel vmlinux-4.3.0-1-default. > The crash exists also with this kernel > > here the journalctl output with a crash when loading a page in the browser: > The screen was totally corrupted but I succeeded to go the the console CTRL > Alt F1 and could load top ==> plasmashell (PID 2358) was running with 100%. Probably busy-waiting on a fence... > > I tried to shutdown but the system locked (I did not see any shutdown > messages and did a hard reset) > > Nov 17 13:21:23 hpprol2 login[1943]: ROOT LOGIN ON tty1 > Nov 17 13:22:02 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: > failed to idle channel 0xcccc0000 [plasmashell[2358]] This generally indicates a GPU hang. This was the first complaint of any sort from nouveau in your logs? Nothing before that other than the init messages at boot time? > Nov 17 13:22:15 hpprol2 systemd-logind[1141]: Power key pressed. > Nov 17 13:22:17 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: > failed to idle channel 0xcccc0000 [plasmashell[2358]] > Nov 17 13:22:23 hpprol2 kernel: nouveau 0000:0a:00.0: Xorg[2019]: > nv50cal_space: -16 This definitely indicates a GPU hang. > > The last line was repeated many times > thereafter are some lines related to the shutdown which end with > Nov 17 13:22:56 hpprol2 display-manager[4671]: Shutting down service > kdm..done > Nov 17 13:23:06 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: > failed to idle channel 0xcccc0000 [xembedsniproxy[2424]] > Nov 17 13:23:21 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: > failed to idle channel 0xcccc0000 [xembedsniproxy[2424]] > Nov 17 13:23:30 hpprol2 systemd-logind[1141]: Power key pressed. > > > can it be the plasmashell which start a crash in nouveau driver? Various people have poorly understood issues with KDE5 and nouveau. A class of those are fixed in kernel 4.3, but from the looks of it, hardly all. I don't think this is your issue, but there have been some locking fixes: http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=2f3a56ad019e378a352e9cb7a559f478826f1a87 http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=4179b15c6e9fcfb253e811e5477debe46c84c395 are the ones that'd apply for your GPU. Don't think they'll have an effect though, but it's something to try (you appear to be comfortable building kernels). It is also worth noting that at some point in the past, people with GF110's (NVC8) have had various issues with nouveau's context switching firmware. We believe we fixed those, but... who knows. You could try using blob pgraph firmware. See http://nouveau.freedesktop.org/wiki/VideoAcceleration/#firmware for how to extract it, and boot with nouveau.config=NvGrUseFW=1 . [Yeah, that page is about video decoding firmware, but that script also extracts the PGRAPH ctxsw fw.] Hello, I desactivated kscreenlocker which resolve the reactivate problem. But the problem of corrupt screen when starting different programs remains and is worst since the last zypper dup. I found this error still related to nouveau when the two screen got corrupted ov 20 22:37:07 hpprol2 kernel: audit: type=1105 audit(1448055427.213:154): pid=2624 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_limits,pam_unix,pam_umask,pam_systemd,pam_gnome_keyring,pam_ Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000 Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000 Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000 Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 kwin_x11[2097]] Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering... during each boot i see two error messages related to edac (at the end) Nov 20 22:58:07 hpprol2 kernel: EDAC MC: Ver: 3.0.0 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca0 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca0 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca8 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca8 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3c71 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3c71 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3caa Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3caa Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cab Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cab Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cac Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cac Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cad Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cad Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cb8 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cb8 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf4 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf4 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf6 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf6 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf5 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf5 Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Some needed devices are missing Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Couldn't find mci handler when the screens are corrupted I can still go to a console (VIA CTRL-ALT-F1) and if I start top I see that the plasmeshell is running at 100% and also Xorg. trying to kill plasmashell d lock the graphic screens in the kdm.log I have a lot of these errors nouveau: kernel rejected pushbuf: Device or resource busy nouveau: ch0: krec 0 pushes 1 bufs 2 relocs 0 nouveau: ch0: buf 00000000 00000004 00000004 00000004 00000000 nouveau: ch0: buf 00000001 0000001c 00000002 00000000 00000002 nouveau: ch0: psh 00000000 0000006a04 0000006a84 nouveau: 0x20056080 nouveau: 0x000000cf nouveau: 0x00000000 nouveau: 0x00000040 nouveau: 0x00000001 nouveau: 0x00000000 nouveau: 0x20046086 nouveau: 0x00000774 nouveau: 0x0000040d nouveau: 0x00000000 nouveau: 0x06a20000 nouveau: 0x200460a0 nouveau: 0x00000000 nouveau: 0x00000000 nouveau: 0x00000774 nouveau: 0x0000040d nouveau: 0x200160ab nouveau: 0x00000003 nouveau: 0x20036160 nouveau: 0x00000004 nouveau: 0x000000cf nouveau: 0x00d6d2d0 nouveau: 0x20046180 nouveau: 0x00000657 nouveau: 0x00000026 nouveau: 0x00000658 nouveau: 0x0000003d nouveau: 0x20046180 nouveau: 0x0000013f nouveau: 0x0000003d nouveau: 0x00000658 nouveau: 0x0000003e nouveau: kernel rejected pushbuf: Device or resource busy nouveau: ch0: krec 0 pushes 1 bufs 2 relocs 0 nouveau: ch0: buf 00000000 00000004 00000004 00000004 00000000 nouveau: ch0: buf 00000001 0000001c 00000002 00000000 00000002 nouveau: ch0: psh 00000000 0000006a84 0000006af0 nouveau: 0x20056080 nouveau: 0x000000cf nouveau: 0x00000000 nouveau: 0x00000040 nouveau: 0x00000001 etc... Regards Philippe Hello, I have reinstalled the full system with the last snapchot of Opensuse tumbleweed: 20151118 "Linux hpprol2 4.3.0-2-xen #1 SMP Sat Nov 14 16:19:19 UTC 2015 (734b32c) x86_64 x86_64 x86_64 GNU/Linux" and the problem seems to have been solved. No more problem with corrupted screen or locking. So this bug can be closed Thanks for your help Philippe |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 119714 [details] extract of journalctl with nouveau crash Last opensuse tumbleweed version: I'm using the nouveau driver with the kernel 4.3.0-1-xen (geeko@buildhost) (gcc version 5.2.1 20151008 [gcc-5-branch revision 228597. My hardware is a HP proliant ML350p Gen8 with an Nvidia quadro 4000 video card with 2 screens (DVI + DP1) using a extended layout; I have some locks occurring randomly when - loading some programs (Yast, Xine; Shishen etc...) - or starting a video in the browser (seamonkey on www. youtube or news site (www.lesoir.be, www.lemonde.fr) - reactivating a session (lock occurs when moving mouse or pressing keyboard) the desktop freezes or is bad redraw and locks thereafter.Keyboard is inactive and I must do a hard reset. Some time the hardware indicate a non maskable interrupt This lock occurs randomly since kernel 4.2 In journalctl I see a crash of the nouveau driver. Here the output when locking by loading Yast: Nov 15 10:36:29 hpprol2 audit[25865]: USER_START pid=25865 uid=1000 auid=1000 ses=3 msg='op=PAM:session_open grantors=pam_limits,pam_unix,pam_umask,pam_systemd,pam_gnome_keyring,pam_env,pam_xauth acct="root" Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000 Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 y2controlcenter[25892]] Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering... Nov 15 10:36:37 hpprol2 kernel: nouveau 0000:0a:00.0: DRM: GPU lockup - switching to software fbcon I'll attach a long extract of journalctl after a lock occuring when reactivating a session