Bug 92971

Summary: [GF110] KDE plasma locks randomly due to crash of nouveau driver
Product: Mesa Reporter: Philippe Condé <conde.philippe>
Component: Drivers/DRI/nouveauAssignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Nouveau Project <nouveau>
Severity: normal    
Priority: medium CC: lakostis
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: extract of journalctl with nouveau crash
output of dmesg

Description Philippe Condé 2015-11-16 20:20:13 UTC
Created attachment 119714 [details]
extract of journalctl with nouveau crash

Last opensuse tumbleweed version:
I'm using the nouveau driver with the kernel 4.3.0-1-xen (geeko@buildhost) (gcc version 5.2.1 20151008 [gcc-5-branch revision 228597.

My hardware is a HP proliant ML350p Gen8 with an Nvidia quadro 4000 video card with  2 screens (DVI + DP1) using a extended layout; 

I have some locks occurring randomly when 
- loading some programs (Yast, Xine; Shishen etc...)
- or starting a video in the browser (seamonkey  on www. youtube or news site (www.lesoir.be, www.lemonde.fr)
- reactivating a session (lock occurs when moving mouse or pressing keyboard)

the desktop freezes or is bad redraw and locks thereafter.Keyboard is inactive and I must do a hard reset.
Some time the hardware indicate a non maskable interrupt
This lock occurs randomly since kernel 4.2


In journalctl I see a crash of the nouveau driver.

Here the output when locking by loading Yast:
Nov 15 10:36:29 hpprol2 audit[25865]: USER_START pid=25865 uid=1000 auid=1000 ses=3 msg='op=PAM:session_open grantors=pam_limits,pam_unix,pam_umask,pam_systemd,pam_gnome_keyring,pam_env,pam_xauth acct="root" 
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 y2controlcenter[25892]]
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering...
Nov 15 10:36:37 hpprol2 kernel: nouveau 0000:0a:00.0: DRM: GPU lockup - switching to software fbcon

I'll attach a long extract of journalctl after a lock occuring when reactivating a session
Comment 1 Ilia Mirkin 2015-11-16 20:24:30 UTC
Are you using mesa 11.0.3 or later?

Also, what GPU is this? (a full dmesg log or lspci -nn -d 10de::300 should provide the info)
Comment 2 Philippe Condé 2015-11-16 21:14:16 UTC
Hello here the output:

Mesa version
# rpm -aq Mesa*
Mesa-libGLESv2-2-11.0.4-125.1.x86_64
Mesa-libEGL1-11.0.4-125.1.x86_64
Mesa-libglapi0-32bit-11.0.4-125.1.x86_64
Mesa-32bit-11.0.4-125.1.x86_64
Mesa-libGL1-11.0.4-125.1.x86_64
Mesa-libGL-devel-11.0.4-125.1.x86_64
Mesa-libglapi0-11.0.4-125.1.x86_64
Mesa-libEGL-devel-11.0.4-125.1.x86_64
Mesa-libEGL1-32bit-11.0.4-125.1.x86_64
Mesa-11.0.4-125.1.x86_64
Mesa-libGL1-32bit-11.0.4-125.1.x86_64

VIDEO Card

lspci -nn -d 10de::300 
0a:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF100GL [Quadro 4000] [10de:06dd] (rev a1)

Regards
Philippe
Comment 3 Philippe Condé 2015-11-16 21:15:14 UTC
Created attachment 119718 [details]
output of dmesg
Comment 4 Ilia Mirkin 2015-11-16 22:18:16 UTC
Nov 14 19:06:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[4257]] subc 6 mthd 2878 data a0c02020

A few things wrong with this picture... first off, nouveau never uses subchannel 6.

Method 2878 is: lookup -a c8 -d SUBCHAN -- -v obj-class GF100_3D 2878
TFB_VARYING_LOCS[0][0x1e] => 0

But those are all emitted in one fell swoop, starting from 2800. No way something could get into the middle of that and screw up the channel on just that one. (Since they don't have independent begin clauses). So this has to have come from elsewhere.... but where?

Is this the first error? The first error is always the most important... the rest can often happen as a result of the earlier ones.

Are you sure the kernel update is what brought this about and not a KDE update? Is this KDE4 or KDE5?
Comment 5 Philippe Condé 2015-11-16 23:37:25 UTC
hello,

it is kde5  (KF5). 
With opensuse Tumbleweed we do a global update (snapshot). Difficult to see thereafter if this is due to KDE or the kernel. 

Three examples from journalctl:

the very first error after the boot message: the lock occurred during reactivation of the session:

Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: timeout at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gpfifogf100.c:66/gf100_fifo_gpfifo_engine_fini()!
Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: channel 14 [xembedsniproxy[2209]] kick timeout
Nov 14 18:02:58 hpprol2 kernel: nouveau: xembedsniproxy[2209]:cccc0000:0000906f: detach gr failed, -16
Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: timeout at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gpfifogf100.c:66/gf100_fifo_gpfifo_engine_fini()!
Nov 14 18:02:58 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: channel 20 [xembedsniproxy[2229]] kick timeout
Nov 14 18:02:58 hpprol2 kernel: nouveau: xembedsniproxy[2229]:cccc0000:0000906f: detach gr failed, -16
Nov 14 18:02:59 hpprol2 audit[2124]: ANOM_ABEND auid=1000 uid=1000 gid=100 ses=1 pid=2124 comm="kactivitymanage" exe="/usr/bin/kactivitymanagerd" sig=11


2) the first errors with a corrupt screen followed by a lock when loading Yast

Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 y2controlcenter[25892]] subc 0 mthd 0000 data 00000000
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 y2controlcenter[25892]]
Nov 15 10:36:29 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering...
Nov 15 10:36:37 hpprol2 kernel: nouveau 0000:0a:00.0: DRM: GPU lockup - switching to software fbcon

3) the system locked in the night 15-16 and the first error message are
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kscreenlocker_g[14277]] subc 0 mthd 0000 data 00000000


there are a lot of this messages  and then errors about kscreenlocker 
ov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 2000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 kscreenlocker_g[14277]]
Nov 16 00:10:32 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering...

the only common process in all case is nouveau 
Regards
Philippe
Comment 6 Philippe Condé 2015-11-17 12:50:01 UTC
Hello,

I tested with the non Xen kernel vmlinux-4.3.0-1-default.
The crash exists also with this kernel

here the journalctl output with a crash when loading a page in the browser: The screen was totally corrupted but I succeeded to go the the console CTRL Alt F1 and could load top ==> plasmashell (PID 2358) was running with 100%. 

I tried to shutdown but the system locked (I did not see any shutdown messages and did a hard reset)

Nov 17 13:21:23 hpprol2 login[1943]: ROOT LOGIN ON tty1
Nov 17 13:22:02 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: failed to idle channel 0xcccc0000 [plasmashell[2358]]
Nov 17 13:22:15 hpprol2 systemd-logind[1141]: Power key pressed.
Nov 17 13:22:17 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]: failed to idle channel 0xcccc0000 [plasmashell[2358]]
Nov 17 13:22:23 hpprol2 kernel: nouveau 0000:0a:00.0: Xorg[2019]: nv50cal_space: -16

The last line was repeated many times 
thereafter are some lines related to the shutdown which end with
Nov 17 13:22:56 hpprol2 display-manager[4671]: Shutting down service kdm..done
Nov 17 13:23:06 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: failed to idle channel 0xcccc0000 [xembedsniproxy[2424]]
Nov 17 13:23:21 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]: failed to idle channel 0xcccc0000 [xembedsniproxy[2424]]
Nov 17 13:23:30 hpprol2 systemd-logind[1141]: Power key pressed.


can it be the plasmashell which start a crash in nouveau driver?

Regards
Philippe
Comment 7 Ilia Mirkin 2015-11-17 13:05:52 UTC
(In reply to Philippe Condé from comment #6)
> Hello,
> 
> I tested with the non Xen kernel vmlinux-4.3.0-1-default.
> The crash exists also with this kernel
> 
> here the journalctl output with a crash when loading a page in the browser:
> The screen was totally corrupted but I succeeded to go the the console CTRL
> Alt F1 and could load top ==> plasmashell (PID 2358) was running with 100%.

Probably busy-waiting on a fence...

> 
> I tried to shutdown but the system locked (I did not see any shutdown
> messages and did a hard reset)
> 
> Nov 17 13:21:23 hpprol2 login[1943]: ROOT LOGIN ON tty1
> Nov 17 13:22:02 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]:
> failed to idle channel 0xcccc0000 [plasmashell[2358]]

This generally indicates a GPU hang. This was the first complaint of any sort from nouveau in your logs? Nothing before that other than the init messages at boot time?

> Nov 17 13:22:15 hpprol2 systemd-logind[1141]: Power key pressed.
> Nov 17 13:22:17 hpprol2 kernel: nouveau 0000:0a:00.0: plasmashell[2358]:
> failed to idle channel 0xcccc0000 [plasmashell[2358]]
> Nov 17 13:22:23 hpprol2 kernel: nouveau 0000:0a:00.0: Xorg[2019]:
> nv50cal_space: -16

This definitely indicates a GPU hang.

> 
> The last line was repeated many times 
> thereafter are some lines related to the shutdown which end with
> Nov 17 13:22:56 hpprol2 display-manager[4671]: Shutting down service
> kdm..done
> Nov 17 13:23:06 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]:
> failed to idle channel 0xcccc0000 [xembedsniproxy[2424]]
> Nov 17 13:23:21 hpprol2 kernel: nouveau 0000:0a:00.0: xembedsniproxy[2424]:
> failed to idle channel 0xcccc0000 [xembedsniproxy[2424]]
> Nov 17 13:23:30 hpprol2 systemd-logind[1141]: Power key pressed.
> 
> 
> can it be the plasmashell which start a crash in nouveau driver?

Various people have poorly understood issues with KDE5 and nouveau. A class of those are fixed in kernel 4.3, but from the looks of it, hardly all.

I don't think this is your issue, but there have been some locking fixes:

http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=2f3a56ad019e378a352e9cb7a559f478826f1a87
http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=4179b15c6e9fcfb253e811e5477debe46c84c395

are the ones that'd apply for your GPU. Don't think they'll have an effect though, but it's something to try (you appear to be comfortable building kernels).

It is also worth noting that at some point in the past, people with GF110's (NVC8) have had various issues with nouveau's context switching firmware. We believe we fixed those, but... who knows. You could try using blob pgraph firmware. See http://nouveau.freedesktop.org/wiki/VideoAcceleration/#firmware for how to extract it, and boot with nouveau.config=NvGrUseFW=1 . [Yeah, that page is about video decoding firmware, but that script also extracts the PGRAPH ctxsw fw.]
Comment 8 Philippe Condé 2015-11-21 07:50:35 UTC
Hello,
I desactivated kscreenlocker which resolve the reactivate problem.

But the problem of corrupt screen when starting different programs remains and is worst since the last zypper dup.

I found this error still related to nouveau when the two screen got corrupted

ov 20 22:37:07 hpprol2 kernel: audit: type=1105 audit(1448055427.213:154): pid=2624 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_limits,pam_unix,pam_umask,pam_systemd,pam_gnome_keyring,pam_
Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000
Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80040000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000
Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: PBDMA0: 80000000 [] ch 30 [007e6ab000 kwin_x11[2097]] subc 0 mthd 0000 data 00000000
Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 06 [PFIFO] reason 00 [PT_NOT_PRESENT] on channel 30 [007e6ab000 kwin_x11[2097]]
Nov 20 22:39:12 hpprol2 kernel: nouveau 0000:0a:00.0: fifo: fifo engine fault on channel 30, recovering...


during each boot i see two error messages related to edac (at the end)
Nov 20 22:58:07 hpprol2 kernel: EDAC MC: Ver: 3.0.0
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca0
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca0
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca8
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3ca8
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3c71
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3c71
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3caa
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3caa
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cab
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cab
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cac
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cac
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cad
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cad
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cb8
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cb8
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf4
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf4
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf6
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf6
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf5
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Seeking for: PCI ID 8086:3cf5
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Some needed devices are missing
Nov 20 22:58:07 hpprol2 kernel: EDAC sbridge: Couldn't find mci handler


when the screens are corrupted I can still  go to a console (VIA CTRL-ALT-F1) and if I start top I see that the plasmeshell is running at 100%  and also Xorg.

trying to kill plasmashell d lock the graphic screens

in the kdm.log I have a lot of these errors

nouveau: kernel rejected pushbuf: Device or resource busy
nouveau: ch0: krec 0 pushes 1 bufs 2 relocs 0
nouveau: ch0: buf 00000000 00000004 00000004 00000004 00000000
nouveau: ch0: buf 00000001 0000001c 00000002 00000000 00000002
nouveau: ch0: psh 00000000 0000006a04 0000006a84
nouveau:        0x20056080
nouveau:        0x000000cf
nouveau:        0x00000000
nouveau:        0x00000040
nouveau:        0x00000001
nouveau:        0x00000000
nouveau:        0x20046086
nouveau:        0x00000774
nouveau:        0x0000040d
nouveau:        0x00000000
nouveau:        0x06a20000
nouveau:        0x200460a0
nouveau:        0x00000000
nouveau:        0x00000000
nouveau:        0x00000774
nouveau:        0x0000040d
nouveau:        0x200160ab
nouveau:        0x00000003
nouveau:        0x20036160
nouveau:        0x00000004
nouveau:        0x000000cf
nouveau:        0x00d6d2d0
nouveau:        0x20046180
nouveau:        0x00000657
nouveau:        0x00000026
nouveau:        0x00000658
nouveau:        0x0000003d
nouveau:        0x20046180
nouveau:        0x0000013f
nouveau:        0x0000003d
nouveau:        0x00000658
nouveau:        0x0000003e
nouveau: kernel rejected pushbuf: Device or resource busy
nouveau: ch0: krec 0 pushes 1 bufs 2 relocs 0
nouveau: ch0: buf 00000000 00000004 00000004 00000004 00000000
nouveau: ch0: buf 00000001 0000001c 00000002 00000000 00000002
nouveau: ch0: psh 00000000 0000006a84 0000006af0
nouveau:        0x20056080
nouveau:        0x000000cf
nouveau:        0x00000000
nouveau:        0x00000040
nouveau:        0x00000001
etc...

Regards
Philippe
Comment 9 Philippe Condé 2015-11-22 07:56:46 UTC
Hello,

I have reinstalled the full system with the last snapchot of Opensuse tumbleweed:
20151118 "Linux hpprol2 4.3.0-2-xen #1 SMP Sat Nov 14 16:19:19 UTC 2015 (734b32c) x86_64 x86_64 x86_64 GNU/Linux" 
and the problem seems to have been solved. No more problem with corrupted screen or locking.

So this  bug can be closed
Thanks for your help
Philippe

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.