Bug 92077 - nouveau graphics freeze when using KDE Plasma 5; PGR engine fault
Summary: nouveau graphics freeze when using KDE Plasma 5; PGR engine fault
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/nouveau
Version: 13.0
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Nouveau Project
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 73373 92438 92515 97537
Reported: 2015-09-22 17:54 UTC by zoominee
Modified: 2017-12-31 12:48 UTC

See Also:
i915 platform:
i915 features:


Attachments
VBIOS dump (57.50 KB, application/octet-stream)
2015-09-22 18:01 UTC, zoominee

Description zoominee 2015-09-22 17:54:10 UTC
I upgraded KDE to Plasma-workspaces version 5. (Gentoo system)
Now, sometimes when my system has been idle for a while, it appears "unresponsive" when I come back (screen doesn't wake up on mouse, mouse pointer doesn't work if I switch off screen sleep, no reaction to any keys, etc.). The system doesn't crash, but the nouveau graphics output does, and I cannot reach a console using the keyboard, either.

I found the below output in /var/log/messages (a number of instances of observing the bug; ... indicates that I had to restart the system).

Versions used:
x11-drivers/xf86-video-nouveau-1.0.11
x11-base/xorg-server-1.16.4
media-libs/mesa-10.3.7-r1
x11-libs/libdrm-2.4.59
x11-base/xorg-drivers-1.16
Kernel: 4.0.5 (self-compiled)

I filed a bug against Plasma 5 but I'm not sure if it's their bug. See https://bugs.kde.org/show_bug.cgi?id=352605

I also have an apitrace recording of plasmashell; replaying it (like a mini film) also freezes nouveau. However, the file is rather large (1.4 GB) and contains private photos.

From /var/log/messages:

Sep  6 16:31:09 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0004042000 [PTE] from GR/GPC0/T1_1 on channel 0x003f7af000 [plasmashell[6790]]
Sep  6 16:31:09 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 10, recovering...
Sep  6 16:33:20 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0000023000 [PTE] from PBDMA0/HOST_CPU on channel 0x003fbe0000 [unknown]
[...]
Sep  7 23:30:25 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] write fault at 0x0003d49000 [PTE] from GR/GPC0/PROP_0 on channel 0x003f8ef000 [plasmashell[19341]]
Sep  7 23:30:25 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 8, recovering...
[...]
Sep  8 23:21:29 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] write fault at 0x0003e02000 [PTE] from GR/GPC0/PROP_0 on channel 0x003f8ef000 [plasmashell[4952]]
Sep  8 23:21:29 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 8, recovering...
[...]
Sep 10 02:00:03 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0001740000 [PTE] from GR/GPC0/T1_0 on channel 0x003f8ef000 [plasmashell[4709]]
Sep 10 02:00:03 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 8, recovering...
[...]
Sep 11 03:07:57 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0003508000 [PTE] from GR/GPC0/T1_0 on channel 0x003f84f000 [plasmashell[6720]]
Sep 11 03:07:57 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 9, recovering...
[...]
Sep 12 00:19:50 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] write fault at 0x0003ef5000 [PTE] from GR/GPC0/PROP_0 on channel 0x003facd000 [plasmashell[20804]]
Sep 12 00:19:50 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 5, recovering...
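
For anyone comparing several such log excerpts, the fault lines above can be tallied by process, access type and GPU unit. The following is only a minimal sketch, not part of nouveau or any official tool: the names FAULT_RE, parse_faults and summarize are made up for illustration, and it assumes the older log format quoted above with each entry on a single line.

```python
import re
from collections import Counter

# Assumed pattern for the older nouveau PFIFO fault format quoted above.
FAULT_RE = re.compile(
    r"(?P<kind>read|write) fault at (?P<addr>0x[0-9a-f]+) "
    r"\[(?P<reason>\w+)\] from (?P<unit>\S+) on channel\s+"
    r"(?P<chan>0x[0-9a-f]+) \[(?P<proc>[^\[\]]+)"
)


def parse_faults(lines):
    """Yield one dict per PFIFO fault line; all other lines are skipped."""
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            yield match.groupdict()


def summarize(lines):
    """Count faults per (process name, access kind, GPU unit)."""
    return Counter((f["proc"], f["kind"], f["unit"]) for f in parse_faults(lines))


# Three lines taken from the excerpt above (wrapped entries joined).
SAMPLE = [
    "Sep  6 16:31:09 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0004042000 [PTE] from GR/GPC0/T1_1 on channel 0x003f7af000 [plasmashell[6790]]",
    "Sep  6 16:31:09 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 10, recovering...",
    "Sep  7 23:30:25 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] write fault at 0x0003d49000 [PTE] from GR/GPC0/PROP_0 on channel 0x003f8ef000 [plasmashell[19341]]",
]

for key, count in summarize(SAMPLE).items():
    print(key, count)
```

In practice the list would be replaced by the lines of /var/log/messages. Newer kernels print these faults in a different format, so the pattern would need adjusting there.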
Comment 1 Ilia Mirkin 2015-09-22 17:58:31 UTC
An apitrace is really the only way this sort of issue will be debugged. If you can't share the trace due to privacy, please try to recreate the issue without private content. You can upload a large file to e.g. google drive.

A number of people are having issues with plasmashell and nouveau.
Comment 2 zoominee 2015-09-22 18:01:42 UTC
Created attachment 118402
VBIOS dump

VBIOS dump acquired using the /sys method.
Comment 3 zoominee 2015-09-22 18:02:52 UTC
(In reply to Ilia Mirkin from comment #1)
> An apitrace is really the only way this sort of issue will be debugged. If
> you can't share the trace due to privacy, please try to recreate the issue
> without private content. You can upload a large file to e.g. google drive.
> 
> A number of people are having issues with plasmashell and nouveau.

OK, I'll try to recreate a trace with the background graphics off and doing nothing on the screen. I think it's the applet that shows current system activity which causes this behaviour (or maybe it's the clock display).
Comment 4 zoominee 2015-09-23 04:37:55 UTC
I generated a new apitrace without too much private content. It was about 155 MB, so I compressed it down to 77 MB. Because I couldn't attach it to this report, I uploaded it to a free service. Here's the link; hopefully it works.
http://www.uploadmb.com/dw.php?id=1442982871

The apitrace was generated with the versions mentioned in the bug's Description.

I noticed that there is a newer mesa, so I will test (later) whether the apitrace still results in the graphics freeze.

The dmesg output related to this graphics freeze captured in the new trace is the following:

Sep 22 23:49:39 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x000310e000 [PTE] from GR/GPC0/T1_1 on channel 0x003facd000 [plasmashell[2933]]
Sep 22 23:49:39 localhost kernel: nouveau E[   PFIFO][0000:02:00.0] PGR engine fault on channel 5, recovering...
Sep 22 23:49:39 localhost kernel: nouveau E[     PGR][0000:02:00.0] TRAP ch 5 [0x003facd000 plasmashell[2933]]
Sep 22 23:49:39 localhost kernel: nouveau E[     PGR][0000:02:00.0] GPC0/TPC0/TEX: 0x80000049
Sep 22 23:49:39 localhost kernel: nouveau E[     PGR][0000:02:00.0] GPC0/TPC1/TEX: 0x80000049
Comment 5 zoominee 2015-09-23 15:59:36 UTC
I upgraded the system to the following versions, and the bug is still present.

x11-drivers/xf86-video-nouveau-1.0.11
x11-base/xorg-server-1.17.2-r1
media-libs/mesa-11.0.0
x11-libs/libdrm-2.4.65
x11-base/xorg-drivers-1.17
Kernel: 4.0.5 (self-compiled)
Comment 6 zoominee 2015-09-23 16:07:10 UTC
Information on graphics card from lspci -v:

02:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 640 Rev. 2] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208 [GeForce GT 640 Rev. 2]
        Flags: bus master, fast devsel, latency 0, IRQ 27
        Memory at df000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=128M]
        Memory at dc000000 (64-bit, prefetchable) [size=32M]
        I/O ports at ec00 [size=128]
        Expansion ROM at def80000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: nouveau
Comment 7 dmidge 2015-10-04 21:09:10 UTC
Hi everyone,

I think I am affected by this bug as well. As zoominee did, I reported a similar bug on the KDE bug tracker: https://bugs.kde.org/show_bug.cgi?id=353292. There, you can find the stack trace (on an Arch Linux system) as well as the dmesg messages. Here is a sample of what is displayed:
[251170.682470] nouveau E[  PGRAPH][0000:01:00.0] magic set 1:
[251170.682485] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e04: 0x2008bf05
[251170.682494] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e08: 0x00205640
[251170.682501] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e0c: 0x40000432
[251170.682509] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e10: 0x56400000
[251170.682515] nouveau E[  PGRAPH][0000:01:00.0] TRAP_TEXTURE - TP1:  FAULT
[251170.682531] nouveau E[  PGRAPH][0000:01:00.0] ch 9 [0x003f518000 plasmashell[503]] subc 3 class 0x8597 mthd 0x1b0c data 0x1000f010
[251170.682555] nouveau E[     PFB][0000:01:00.0] trapped read at 0x0020564000 on channel 0x0003f518 [plasmashell[503]] PGRAPH/TEXTURE/00 reason: PAGE_NOT_PRESENT
[251170.683469] nouveau E[  PGRAPH][0000:01:00.0] magic set 1:
[251170.683476] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e04: 0x2009bc05
[251170.683481] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e08: 0x00205641
[251170.683485] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e0c: 0x40000432
[251170.683490] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e10: 0x56400000
[251170.683493] nouveau E[  PGRAPH][0000:01:00.0] TRAP_TEXTURE - TP1:  FAULT
[251170.683502] nouveau E[  PGRAPH][0000:01:00.0] ch 9 [0x003f518000 plasmashell[503]] subc 3 class 0x8597 mthd 0x1b0c data 0x1000f010
[251170.683513] nouveau E[     PFB][0000:01:00.0] trapped read at 0x0020564100 on channel 0x0003f518 [plasmashell[503]] PGRAPH/TEXTURE/00 reason: PAGE_NOT_PRESENT
[251170.683543] nouveau E[  PGRAPH][0000:01:00.0] magic set 1:
[251170.683555] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e04: 0x2009610f
[251170.683560] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e08: 0x00205734
[251170.683565] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e0c: 0x40000432
[251170.683570] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e10: 0x57200000
[251170.683574] nouveau E[  PGRAPH][0000:01:00.0] TRAP_TEXTURE - TP1:  FAULT
[251170.683586] nouveau E[  PGRAPH][0000:01:00.0] ch 9 [0x003f518000 plasmashell[503]] subc 3 class 0x8597 mthd 0x15f0 data 0x02000201
[251170.683599] nouveau E[     PFB][0000:01:00.0] trapped read at 0x0020573400 on channel 0x0003f518 [plasmashell[503]] PGRAPH/TEXTURE/00 reason: PAGE_NOT_PRESENT
[251170.683632] nouveau E[  PGRAPH][0000:01:00.0] magic set 1:
[251170.683644] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e04: 0x20086805
[251170.683650] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e08: 0x00205890
[251170.683656] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e0c: 0x40000432
[251170.683662] nouveau E[  PGRAPH][0000:01:00.0] 	0x00408e10: 0x58900000
[251170.683667] nouveau E[  PGRAPH][0000:01:00.0] TRAP_TEXTURE - TP1:  FAULT
[251170.683680] nouveau E[  PGRAPH][0000:01:00.0] ch 9 [0x003f518000 plasmashell[503]] subc 3 class 0x8597 mthd 0x0900 data 0x20000010
[251170.683698] nouveau E[     PFB][0000:01:00.0] trapped read at 0x0020589000 on channel 0x0003f518 [plasmashell[503]] PGRAPH/TEXTURE/00 reason: PAGE_NOT_PRESENT

... and there is more there. Please see the attachment.

I can also add that it makes the plasmashell application hang for a while, with 100% CPU usage. After some time, the system is responsive again, with some glitches in the interface (for instance, some icons disappear or are displayed wrongly; once, it made the Dolphin file manager crash).

It has been identified as a problem with nouveau (see comment 4 of that ticket, which gives a better understanding of the problem):
https://bugs.kde.org/show_bug.cgi?id=353292#c4

Thank you for your time! And keep the good work going!
Cheers.
Comment 8 zoominee 2015-10-05 15:24:07 UTC
Dear dmidge!  I think the error messages that you get (something about PGRAPH and textures) are quite different from mine (something about PFIFO and the PGR engine, whatever that may be...) - it probably makes more sense to file it as a different bug instead of putting it into the same thread. Thanks!
Comment 9 dmidge 2015-10-05 20:09:31 UTC
Hi zoominee,
Oops, sorry. I thought it was the same one, because the same application crashes for both of us. Alright, I will file a new bug! Thanks! And sorry for the inconvenience. Is there a way I can delete my previous post, so that people don't read it anymore, since it seems to be unrelated?
Comment 10 Ilia Mirkin 2015-10-15 17:37:56 UTC
I pushed out some fixes in Mesa 11.0.3 which affect resource lifetime tracking. Can you try it to see if things got better?
Comment 11 Ilia Mirkin 2015-10-22 00:46:41 UTC
Also you can try the patch at http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=2a6c521bb41ce862e43db46f52e7681d33e8d771 which helped a bunch of people running plasmashell.
Comment 12 zoominee 2015-11-29 05:45:55 UTC
Sorry for the delay, I was offline.

I've updated mesa to version 11.0.6 and xorg-server to 1.17.4. I've also updated the linux kernel to 4.1.12.

When I replay the previous apitrace, it still freezes the graphics.  The mouse pointer continues to work now (i.e., something has been changed), but nothing else is updated, and I cannot switch from X to console either.
Comment 13 Ilia Mirkin 2015-11-29 05:53:19 UTC
(In reply to zoominee from comment #12)
> Sorry for the delay, I was offline.
> 
> I've updated mesa to version 11.0.6 and xorg-server to 1.17.4. I've also
> updated the linux kernel to 4.1.12.
> 
> When I replay the previous apitrace, it still freezes the graphics.  The
> mouse pointer continues to work now (i.e., something has been changed), but
> nothing else is updated, and I cannot switch from X to console either.

The fix in question should be in 4.1.13 (as well as a bunch of other stable trees)... or did you apply the patch on top of 4.1.12 yourself?
Comment 14 zoominee 2015-11-29 07:00:48 UTC
Regarding the kernel patch - I first wanted to see if the mesa update from Comment 10 fixed this issue. (I suppose it did not.)

I then applied the patch from Comment 11 (manually to 4.1.12). The problem still persists.  The apitrace replays, then freezes somewhere (not at the same place as before).  The mouse pointer still works, but I can't do anything and nothing else on the screen is updated; I can't even switch to a console.

Do I maybe have to generate a new apitrace (i.e., the problem is stored in the apitrace file created using the old versions)?  Or maybe this is a different problem from the one addressed in the patch?
Comment 15 zoominee 2015-11-29 19:16:12 UTC
For information, my package versions are now:
mesa 11.0.6
xorg-server 1.17.4
xf86-video-nouveau 1.0.11
linux kernel 4.1.12 self-compiled with the patch from Comment 11

The bug persists; it still occurs even when I do not replay the old apitrace.

Now it manifests as the whole screen freezing (display is static), except for the mouse cursor, which I can still move around with the mouse. I cannot perform any actions with the keyboard, and clicking the mouse doesn't do anything either.

The error messages are the same as in Description, except for the times and some of the memory addresses. The error messages now attribute the error to the X process, not the plasmashell process as before.
Comment 16 zoominee 2016-04-13 15:17:23 UTC
For information, the bug is still present. My package versions are now:
mesa 11.1.2
xorg-server 1.18.1
xf86-video-nouveau 1.0.12
linux kernel 4.4.2

I've noticed that, with the chrome-based browsers (Opera etc.), scrolling up and down the page sometimes results in tearing: the screen contains wide "columns" of 100 or so pixels that do not scroll and other columns that scroll up and down normally.  Other windows and the desktop itself are unaffected.  When this tearing happens, if I don't close the windows, the system soon freezes.  I suspect that this is related to the bug I reported here.
Comment 17 Andrey Mazo 2016-04-14 23:09:48 UTC
I think I'm also experiencing the same bug as described in comment 0.
(screen doesn't wake from a sleep after 2-3 days of uptime)

I'm also running Gentoo and have the following packages installed.
Linux version 4.4.6-gentoo-20160401 (root@localhost) (gcc version 4.8.5 (Gentoo 4.8.5 p1.3, pie-0.6.2) ) #1 SMP Fri Apr 1 15:03:40 EDT 2016
x11-base/xorg-server-1.17.4
x11-drivers/xf86-video-nouveau-1.0.12
media-libs/mesa-11.0.6
x11-libs/libdrm-2.4.65
kde-plasma/plasma-workspace-5.6.1
www-client/vivaldi-1.0.344.37_p1 (a chromium-based browser)

eselect qtgraphicssystem is set to native

(as far as I remember, kernel 4.5.0 wasn't able to show me anything on the screen besides total garbage)

I'm ready to recompile the kernel with whatever debugging options you might need to investigate the problem (increased CONFIG_NOUVEAU_DEBUG or any generic kernel hacking options).

I'm also glad to capture an apitrace, but not sure of what.
plasmashell, X, a browser?


zoominee, have you tried disabling the system activity applet that you suspected could trigger the issue?


Here is a part of my dmesg:
[190683.157049] nouveau 0000:03:00.0: gr: TRAP ch 8 [001faea000 plasmashell[26949]]
[190683.157063] nouveau 0000:03:00.0: gr: GPC0/TPC0/TEX: 80000049
[190683.157082] nouveau 0000:03:00.0: fifo: read fault at 0001115000 engine 00 [PGRAPH] client 01 [GPC0/TEX] reason 02 [PAGE_NOT_PRESENT] on channel 8 [001faea000 plasmashell[26949]]
[190683.157084] nouveau 0000:03:00.0: fifo: gr engine fault on channel 8, recovering...
[192882.157535] nouveau 0000:03:00.0: fifo: PBDMA0: 04000000 [] ch 2 [001fe71000 X[24266]] subc 0 mthd 001c data 00001004
[192882.157807] nouveau 0000:03:00.0: fifo: PBDMA0: 04000000 [] ch 2 [001fe71000 X[24266]] subc 0 mthd 001c data 00001004
[195081.297439] nouveau 0000:03:00.0: fifo: PBDMA0: 04000000 [] ch 2 [001fe71000 X[24266]] subc 0 mthd 001c data 00001004
[195214.396093] [TTM] Failed to expire sync object before buffer eviction
[195229.455046] [TTM] Failed to expire sync object before buffer eviction


And after initiating reboot:
[197665.853048] nouveau 0000:03:00.0: X[24266]: failed to idle channel 2 [X[24266]]
[197680.853033] nouveau 0000:03:00.0: X[24266]: failed to idle channel 2 [X[24266]]
[197680.853119] nouveau 0000:03:00.0: fifo: read fault at 000001b000 engine 07 [PFIFO] client 07 [BAR_READ] reason 02 [PAGE_NOT_PRESENT] on channel 2 [001fe71000 X[24266]]
...
[197680.853150] =============================================
[197680.853151] [ INFO: possible recursive locking detected ]
[197680.853152] 4.4.6-gentoo-20160401 #1 Not tainted
[197680.853153] ---------------------------------------------
[197680.853154] kworker/0:1/27582 is trying to acquire lock:
[197680.853155]  ((&fifo->fault)){+.+...}, at: [<ffffffffa42cebd0>] flush_work+0x0/0x280
[197680.853162]
                but task is already holding lock:
[197680.853163]  ((&fifo->fault)){+.+...}, at: [<ffffffffa42cf644>] process_one_work+0x144/0x430
[197680.853167]
                other info that might help us debug this:
[197680.853168]  Possible unsafe locking scenario:

[197680.853169]        CPU0
[197680.853169]        ----
[197680.853170]   lock((&fifo->fault));
[197680.853171]   lock((&fifo->fault));
[197680.853172]
                 *** DEADLOCK ***

[197680.853173]  May be due to missing lock nesting notation

[197680.853174] 2 locks held by kworker/0:1/27582:
[197680.853175]  #0:  ("events"){.+.+.+}, at: [<ffffffffa42cf644>] process_one_work+0x144/0x430
[197680.853178]  #1:  ((&fifo->fault)){+.+...}, at: [<ffffffffa42cf644>] process_one_work+0x144/0x430
[197680.853181]
                stack backtrace:
[197680.853183] CPU: 0 PID: 27582 Comm: kworker/0:1 Not tainted 4.4.6-gentoo-hippo-20160401 #1
[197680.853184] Hardware name: Dell Inc. Precision Tower 7910/0215PR, BIOS A06 01/19/2015
[197680.853187] Workqueue: events gf100_fifo_recover_work
[197680.853189]  0000000000000000 ffff8803875bfaf0 ffffffffa453193e ffffffffa572ac50
[197680.853191]  ffffffffa572ac50 ffff8803875bfbb0 ffffffffa42fcbda 000000010bc3ca5c
[197680.853193]  ffff88083bcf6140 0000000000000000 0000000000000000 00000002442f617b
[197680.853195] Call Trace:
[197680.853198]  [<ffffffffa453193e>] dump_stack+0x67/0x99
[197680.853201]  [<ffffffffa42fcbda>] __lock_acquire+0x16fa/0x1b90
[197680.853202]  [<ffffffffa42fd7e0>] lock_acquire+0x60/0x80
[197680.853204]  [<ffffffffa42cebd0>] ? mod_delayed_work_on+0x80/0x80
[197680.853206]  [<ffffffffa42cec17>] flush_work+0x47/0x280
[197680.853207]  [<ffffffffa42cebd0>] ? mod_delayed_work_on+0x80/0x80
[197680.853210]  [<ffffffffa463cc16>] ? nvkm_subdev_fini+0x46/0x1f0
[197680.853212]  [<ffffffffa42fb026>] ? mark_held_locks+0x66/0x90
[197680.853214]  [<ffffffffa4319fba>] ? ktime_get+0x6a/0x110
[197680.853216]  [<ffffffffa469b390>] gf100_fifo_fini+0x10/0x20
[197680.853217]  [<ffffffffa469975a>] nvkm_fifo_fini+0x1a/0x30
[197680.853219]  [<ffffffffa4639040>] nvkm_engine_fini+0x20/0x30
[197680.853220]  [<ffffffffa463cc2f>] nvkm_subdev_fini+0x5f/0x1f0
[197680.853222]  [<ffffffffa469bbae>] gf100_fifo_recover_work+0xee/0x200
[197680.853224]  [<ffffffffa42cf6a0>] process_one_work+0x1a0/0x430
[197680.853225]  [<ffffffffa42cf644>] ? process_one_work+0x144/0x430
[197680.853227]  [<ffffffffa42cfa45>] worker_thread+0x115/0x460
[197680.853230]  [<ffffffffa4916a66>] ? __schedule+0x2f6/0x920
[197680.853232]  [<ffffffffa42cf930>] ? process_one_work+0x430/0x430
[197680.853234]  [<ffffffffa42d5959>] kthread+0xf9/0x110
[197680.853236]  [<ffffffffa42d5860>] ? kthread_create_on_node+0x230/0x230
[197680.853238]  [<ffffffffa491cabf>] ret_from_fork+0x3f/0x70
[197680.853239]  [<ffffffffa42d5860>] ? kthread_create_on_node+0x230/0x230
Comment 18 zoominee 2016-04-15 05:13:46 UTC
(In reply to Andrey Mazo from comment #17)
> zoominee, have you tried disabling the system activity applet that you
> suspected could trigger the issue?

If it's caused by an applet, it would be the clock in the top right corner, that's the only thing that updates the screen regularly...  But it seems that other larger-scale updates to the screen (e.g., processing pictures in the image processing program, or the weird scrolling thing I mentioned with the Opera and Vivaldi browsers) lead to this bug/freeze more quickly.
Comment 19 Andrey Mazo 2016-04-15 16:03:55 UTC
(In reply to zoominee from comment #18)
> (In reply to Andrey Mazo from comment #17)
> > zoominee, have you tried disabling the system activity applet that you
> > suspected could trigger the issue?
> 
> If it's caused by an applet, it would be the clock in the top right corner,
> that's the only thing that updates the screen regularly...  But it seems
> that other larger-scale updates to the screen (e.g., processing pictures in
> the image processing program, or the weird scrolling thing I mentioned with
> the Opera and Vivaldi browsers) lead to this bug/freeze more quickly.

Looks like, I don't have the clock applet as you do, so for me, Network Monitor and System Load Viewer applets are the only ones to update the screen every second.

Will try to scroll web-pages in Opera like crazy while capturing an apitrace in order to reproduce the problem.
Comment 20 Andrey Mazo 2016-04-25 16:14:07 UTC
(In reply to Andrey Mazo from comment #19)
> Looks like, I don't have the clock applet as you do, so for me, Network
> Monitor and System Load Viewer applets are the only ones to update the
> screen every second.

I tried to disable Network Monitor applet and to close the browser when unused, and the system seems to survive longer (it's been almost 6 days of uptime by now).
Not a solution, of course, but, at least, makes the system usable.
Comment 21 Karol Herbst 2016-04-25 19:53:48 UTC
(In reply to Andrey Mazo from comment #20)
> (In reply to Andrey Mazo from comment #19)
> > Looks like, I don't have the clock applet as you do, so for me, Network
> > Monitor and System Load Viewer applets are the only ones to update the
> > screen every second.
> 
> I tried to disable Network Monitor applet and to close the browser when
> unused, and the system seems to survive longer (it's been almost 6 days of
> uptime by now).
> Not a solution, of course, but, at least, makes the system usable.

and maybe also tells us a fast way to reproduce.
Comment 22 Andrey Mazo 2016-05-03 17:01:44 UTC
(In reply to Karol Herbst from comment #21)
> > I tried to disable Network Monitor applet and to close the browser when
> > unused, and the system seems to survive longer (it's been almost 6 days of
> > uptime by now).
> > Not a solution, of course, but, at least, makes the system usable.
> 
> and maybe also tells us a fast way to reproduce.

Yeah, good point.
I'll try to run the applet under apitrace and reproduce the problem.

The system feels much better with the Network Monitor applet disabled -- it's been 14 days of uptime now.
Comment 23 Andrey Mazo 2016-05-06 22:34:41 UTC
(In reply to Andrey Mazo from comment #22)
> I'll try to run the applet under apitrace and reproduce the problem.

I finally got an apitrace of Network Monitor applet causing the crash.
I tried to replay it and it's causing the crash within 30 minutes.
I also tried to run 4 replays simultaneously, but it doesn't speed things up -- still about 30 minutes to get to the crash.

The command used to generate the trace:
apitrace trace --api gl plasmawindowed org.kde.plasma.systemmonitor.net

The tarball with the trace, part of dmesg, and other possibly useful info:
https://drive.google.com/open?id=0B19GGUrpQNv3TTNOLXpNUWdnYWM

Please, let me know, if you need any additional information.
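
For repeated reproduction attempts, the replay can be wrapped in a small loop that reports when a run stops making progress. This is only a sketch under assumptions: `apitrace replay <file>` is the stock replay subcommand, but the trace file name, the timeout and the helper names replay_command and replay_until_hang are hypothetical, and a full display freeze may also stop the script itself from being usable.

```python
import subprocess
import time


def replay_command(trace_path):
    """Build the argv for replaying a trace with the stock `apitrace replay`."""
    return ["apitrace", "replay", trace_path]


def replay_until_hang(trace_path, max_runs=10, timeout_s=3600, run=subprocess.run):
    """Replay the trace up to max_runs times.

    Returns (run_index, elapsed_seconds) for the first run that exceeds
    timeout_s, or None if every run finished within the timeout.
    The `run` parameter exists only so the loop can be exercised without
    a GPU; by default it is subprocess.run.
    """
    for index in range(max_runs):
        start = time.monotonic()
        try:
            run(replay_command(trace_path), timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return index, time.monotonic() - start
    return None


# Example call (file name is an assumption, not taken from the tarball):
# print(replay_until_hang("plasmawindowed.trace", max_runs=5, timeout_s=2400))
```

A timeout somewhat above the observed 30 minutes seems a reasonable starting point, so that a normal run is not mistaken for a hang.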
Comment 24 Andrey Mazo 2016-09-16 17:51:10 UTC
A similar fault and a deadlock happened on a newer kernel (4.7.2).
This time it looks to have been triggered by launching qupzilla (www-client/qupzilla-2.0.1, compiled with KDE/Plasma 5 support).
Could be a QtWebEngine problem though.

[838735.611662] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 000c data 0000003c
[838735.611680] nouveau 0000:03:00.0: fifo: PBDMA0: 02000000 [] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0020 data 00000000
[838735.611694] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0030 data 0000003c
[838735.611707] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0034 data 20040360
[838735.611720] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0038 data 00000000
[838735.611733] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 003c data 00000000
[838735.611746] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0040 data 00000000
[838735.611758] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0044 data 00000000
[838735.611771] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0048 data 20010674
[838735.611784] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 004c data 0000003c
[838735.611798] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0054 data 00000000
[838735.611811] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0058 data 00252000
[838735.611824] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 005c data 200240c7
[838735.611836] nouveau 0000:03:00.0: fifo: PBDMA0: 00200000 [ILLEGAL_MTHD] ch 15 [001f68d000 qupzilla[12159]] subc 0 mthd 0060 data 00000048
[838735.611855] nouveau 0000:03:00.0: fifo: read fault at 0000000000 engine 07 [PFIFO] client 07 [BAR_READ] reason 02 [PAGE_NOT_PRESENT] on channel 15 [001f68d000 qupzilla[12159]]
[838735.611858] nouveau 0000:03:00.0: fifo: fifo engine fault on channel 15, recovering...

[838735.612309] =============================================
[838735.612311] [ INFO: possible recursive locking detected ]
[838735.612314] 4.7.2-gentoo-20160906 #2 Not tainted
[838735.612315] ---------------------------------------------
[838735.612317] kworker/0:1/12138 is trying to acquire lock:
[838735.612319]  ((&fifo->recover.work)){+.+...}, at: [<ffffffffb72d5340>] flush_work+0x0/0x290
[838735.612329]
                but task is already holding lock:
[838735.612331]  ((&fifo->recover.work)){+.+...}, at: [<ffffffffb72d5e3b>] process_one_work+0x14b/0x450
[838735.612336]
                other info that might help us debug this:
[838735.612337]  Possible unsafe locking scenario:

[838735.612339]        CPU0
[838735.612340]        ----
[838735.612341]   lock((&fifo->recover.work));
[838735.612343]   lock((&fifo->recover.work));
[838735.612345]
                 *** DEADLOCK ***

[838735.612347]  May be due to missing lock nesting notation

[838735.612349] 2 locks held by kworker/0:1/12138:
[838735.612351]  #0:  ("events"){.+.+.+}, at: [<ffffffffb72d5e3b>] process_one_work+0x14b/0x450
[838735.612356]  #1:  ((&fifo->recover.work)){+.+...}, at: [<ffffffffb72d5e3b>] process_one_work+0x14b/0x450
[838735.612360]
                stack backtrace:
[838735.612363] CPU: 0 PID: 12138 Comm: kworker/0:1 Not tainted 4.7.2-gentoo-20160906 #2
[838735.612365] Hardware name: Dell Inc. Precision Tower 7910/0215PR, BIOS A06 01/19/2015
[838735.612371] Workqueue: events gf100_fifo_recover_work
[838735.612373]  0000000000000000 ffff8800750a7af0 ffffffffb7567885 ffff88082f220000
[838735.612378]  ffffffffb86fd780 ffff8800750a7bb8 ffffffffb73064c8 00000000b86fd780
[838735.612381]  ffff88082f220050 ffff8800750a7b00 0000000268312189 0000000000000002
[838735.612385] Call Trace:
[838735.612391]  [<ffffffffb7567885>] dump_stack+0x67/0x92
[838735.612396]  [<ffffffffb73064c8>] __lock_acquire+0x1508/0x1620
[838735.612399]  [<ffffffffb73069d0>] lock_acquire+0x60/0x80
[838735.612401]  [<ffffffffb72d5340>] ? mod_delayed_work_on+0x80/0x80
[838735.612403]  [<ffffffffb72d5387>] flush_work+0x47/0x290
[838735.612405]  [<ffffffffb72d5340>] ? mod_delayed_work_on+0x80/0x80
[838735.612408]  [<ffffffffb7304b21>] ? mark_held_locks+0x71/0x90
[838735.612414]  [<ffffffffb732412a>] ? ktime_get+0x6a/0x130
[838735.612417]  [<ffffffffb76dbbc0>] gf100_fifo_fini+0x10/0x20
[838735.612420]  [<ffffffffb76d9d8a>] nvkm_fifo_fini+0x1a/0x30
[838735.612424]  [<ffffffffb76745e0>] nvkm_engine_fini+0x20/0x30
[838735.612429]  [<ffffffffb767839a>] nvkm_subdev_fini+0x5a/0x160
[838735.612432]  [<ffffffffb76dc20a>] gf100_fifo_recover_work+0xea/0x1f0
[838735.612434]  [<ffffffffb72d5e9a>] process_one_work+0x1aa/0x450
[838735.612436]  [<ffffffffb72d5e3b>] ? process_one_work+0x14b/0x450
[838735.612438]  [<ffffffffb72d6639>] worker_thread+0x49/0x490
[838735.612441]  [<ffffffffb72d65f0>] ? workqueue_congested+0x160/0x160
[838735.612443]  [<ffffffffb72d65f0>] ? workqueue_congested+0x160/0x160
[838735.612447]  [<ffffffffb72dc449>] kthread+0xf9/0x110
[838735.612452]  [<ffffffffb7978c3f>] ret_from_fork+0x1f/0x40
[838735.612455]  [<ffffffffb72dc350>] ? kthread_create_on_node+0x230/0x230
Comment 25 Tomasz Paweł Gajc 2016-12-10 12:33:49 UTC
This issue still exists with latest x11-server 1.19.0 and Mesa 13.0.2 and Plasma 5.8.4

I got this in logs:

gru 10 13:31:37 lazur kernel: nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 17 [plasmashell[8353]] get 0020117b08 put 0020129744 ib_get 00000059 ib_put 0000005a state 80007088 (err: INVALID_CMD) push 00406040
gru 10 13:31:40 lazur PackageKit[4860]: get-updates transaction /1645_dbadbbda from uid 1001 finished with success after 3316ms
gru 10 13:31:41 lazur systemd-coredump[8355]: Failed to generate stack trace: (null)
gru 10 13:31:41 lazur systemd-coredump[8355]: Process 8306 (plasmashell) of user 1001 dumped core.
Comment 26 Tomasz Paweł Gajc 2016-12-10 12:40:54 UTC
More crashes
gru 10 13:39:25 lazur kernel: nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 17 [plasmashell[8353]] get 0020208540 put 002020fc7c ib_get 0000006f ib_put 00000070 state 80007698 (err: INVALID_CMD) push 00406040
gru 10 13:39:25 lazur kernel: nouveau 0000:01:00.0: gr: DATA_ERROR 00000005 [INVALID_ENUM]
gru 10 13:39:25 lazur kernel: nouveau 0000:01:00.0: gr: 00100000 [] ch 17 [001e50c000 plasmashell[8353]] subc 3 class 8297 mthd 15dc data 00000100
Comment 27 Tomasz Paweł Gajc 2016-12-10 12:46:02 UTC
This seems closely related to bug #98039.
Comment 28 sacarde 2017-01-06 10:49:32 UTC
Hi,
this is my experience on Arch Linux (64-bit) + KDE + nouveau (1.0.13).

With:
 libdrm 2.4.73
 mesa 13.0.1
 qt5.* 5.7.0
 libxpm 3.5.11
 kdelibs 4.14.26
the nouveau driver works OK.

If I upgrade to:
 libdrm 2.4.74
 mesa 13.0.2
 qt5.* 5.7.1
 libxpm 3.5.12
 kdelibs 4.14.27
the nouveau driver freezes.


thank you
Comment 29 Ilia Mirkin 2017-01-11 04:22:23 UTC
Can someone see whether sticking

QSG_RENDER_LOOP=basic

into /etc/environment helps anything?
Comment 30 Andrey Mazo 2017-01-13 22:17:55 UTC
(In reply to Ilia Mirkin from comment #29)
> Can someone see whether sticking
> 
> QSG_RENDER_LOOP=basic
> 
> into /etc/environment helps anything?

I could try, but it looks like it's already used by default:

$ QT_LOGGING_RULES="qt.scenegraph.general=true" plasmawindowed org.kde.plasma.systemmonitor.net
qt.scenegraph.general: QSG: basic render loop
qt.scenegraph.general: texture atlas dimensions: 1024x512
qt.scenegraph.general: R/G/B/A Buffers:    8 8 8 8
qt.scenegraph.general: Depth Buffer:       24
qt.scenegraph.general: Stencil Buffer:     8
qt.scenegraph.general: Samples:            -1
qt.scenegraph.general: GL_VENDOR:          nouveau
qt.scenegraph.general: GL_RENDERER:        Gallium 0.4 on NVD9
qt.scenegraph.general: GL_VERSION:         3.0 Mesa 12.0.1
...


I don't know for sure which one was used when I originally reported the problem (comment 17) or when I captured an apitrace (comment 23).
But I would guess that it was the same, based on the following [1]:
"""
The non-threaded render loop is currently used by default ... Linux with Mesa drivers.
"""

[1] http://doc.qt.io/qt-5/qtquick-visualcanvas-scenegraph.html#non-threaded-render-loops-basic-and-windows
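
To try the suggestion from comment 29 for a single program without editing /etc/environment, the variable can be set for just that one command. This is a minimal sketch: `run_with_basic_loop` is a made-up helper name, while `QSG_RENDER_LOOP` and the `qt.scenegraph.general` logging category are the ones already quoted above.

```shell
# Run a command with the Qt Quick render loop forced to "basic" and
# scene graph logging enabled, so the selected loop is printed at startup.
# The settings apply only to this one command, not to the whole session.
run_with_basic_loop() {
    QSG_RENDER_LOOP=basic QT_LOGGING_RULES="qt.scenegraph.general=true" "$@"
}
```

For example, `run_with_basic_loop plasmawindowed org.kde.plasma.systemmonitor.net` should log which render loop is in use, as in the output quoted above. Replacing `basic` with `threaded` would be the way to compare against the other loop, assuming the Qt build supports it.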
Comment 31 Andrey Mazo 2017-01-26 00:26:52 UTC
I've started to get a slightly different crash after updating to kernel 4.9.5 and/or a newer Plasma.

It usually happens while unlocking the screen (enter password, hit enter, and boom -- left monitor is black, right monitor is frozen with a mouse cursor (well, a glitch instead of a normal cursor) moving).


x11-base/xorg-server-1.18.4
x11-drivers/xf86-video-nouveau-1.0.12
media-libs/mesa-12.0.1
x11-libs/libdrm-2.4.70
kde-plasma/plasma-workspace-5.8.3-r4
kde-frameworks/plasma-5.29.0


[94521.089913] DMAR: DRHD: handling fault status reg 2
[94521.089940] DMAR: [DMA Read] Request device [03:00.0] fault addr edb60000 [fault reason 06] PTE Read access is not set
[94521.089947] DMAR: [DMA Read] Request device [03:00.0] fault addr edb60000 [fault reason 06] PTE Read access is not set
<snip>
[94521.090876] DMAR: [DMA Read] Request device [03:00.0] fault addr edba3000 [fault reason 06] PTE Read access is not set
[94521.090876] DMAR: [DMA Read] Request device [03:00.0] fault addr edba3000 [fault reason 06] PTE Read access is not set
[94521.096700] DMAR: DRHD: handling fault status reg 500
[94521.096729] nouveau 0000:03:00.0: fifo: write fault at 00029a7000 engine 15 [PCE0] client 01 [PCOPY0] reason 02 [PAGE_NOT_PRESENT] on channel 0 [001fe72000 DRM]
[94521.096732] nouveau 0000:03:00.0: fifo: ce0 engine fault on channel 0, recovering...


Xorg.log has the usual (in such cases) EQ overflow errors:
(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x56) [0x589af6]
(EE) 1: /usr/bin/X (mieqEnqueue+0x24b) [0x56bb6b]
(EE) 2: /usr/bin/X (QueuePointerEvents+0x52) [0x44dd22]
(EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fb450ad0000+0x623f) [0x7fb450ad623f]
(EE) 4: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fb450ad0000+0x6acd) [0x7fb450ad6acd]
(EE) 5: /usr/bin/X (0x400000+0x750c8) [0x4750c8]
(EE) 6: /usr/bin/X (0x400000+0x9a056) [0x49a056]
(EE) 7: /lib64/libc.so.6 (0x7fb45b990000+0x331f0) [0x7fb45b9c31f0]
(EE) 8: /lib64/libc.so.6 (ioctl+0x5) [0x7fb45ba6fc75]
(EE) 9: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fb45cb1dc68]
(EE) 10: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x7fb45cb2098b]
(EE) 11: /usr/lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0xbc) [0x7fb4578fd44c]
(EE) 12: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7fb457b02000+0xc5a7) [0x7fb457b0e5a7]
(EE) 13: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7fb457b02000+0xcfed) [0x7fb457b0efed]
(EE) 14: /usr/bin/X (DRI2SwapBuffers+0x1c8) [0x55cd68]
(EE) 15: /usr/bin/X (0x400000+0x15e5ec) [0x55e5ec]
(EE) 16: /usr/bin/X (0x400000+0x355bf) [0x4355bf]
(EE) 17: /usr/bin/X (0x400000+0x39643) [0x439643]
(EE) 18: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7fb45b9b0790]
(EE) 19: /usr/bin/X (_start+0x29) [0x423939]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
(EE) [mi] EQ overflow continuing.  100 events have been dropped.

