Bug 103721 - [GM107] Frequent freezes with nouveau on Thinkpad P50
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 100567
Reported: 2017-11-13 15:42 UTC by Will Newton
Modified: 2019-01-05 23:45 UTC

Attachments
Output of journalctl -k -b-1 (104.62 KB, text/x-log)
2017-11-13 15:42 UTC, Will Newton
journalctl -k -b -1 from my last freeze (244.59 KB, text/x-log)
2017-12-04 10:57 UTC, James Hewitt
journalctl -k from a recent freeze, newer kernel, newer bios, still occuring (90.00 KB, text/x-log)
2018-01-16 19:56 UTC, James Hewitt
Freeze log with drm.debug=14 (2.63 MB, application/gzip)
2018-01-16 22:52 UTC, James Hewitt
dmesg output from after plugging in DVI monitor first, then display port monitor (82.55 KB, text/plain)
2018-01-21 18:47 UTC, Jeff Peeler
dmesg output after display becomes completely unresponsive (195.52 KB, text/plain)
2018-01-21 18:52 UTC, Jeff Peeler

Description Will Newton 2017-11-13 15:42:22 UTC
Created attachment 135437 [details]
Output of journalctl -k -b-1

I have been experiencing frequent freezes of xorg/wayland with the nouveau driver on a Thinkpad P50 20EN.

The freezes seem to be related to system load and occur several times per day.

I'm running Fedora 26 with kernel 4.13.11-200.fc26.x86_64, and the crashes seem to happen under both Xorg and Wayland. I'm running with discrete graphics enabled in the BIOS, as enabling Hybrid prevents the system from booting (there is no option to run purely integrated graphics).

I have attached the kernel logs to the ticket.

I've also reported the issue in the Fedora bug tracker, where there are some more logs, but have had no response there yet: https://bugzilla.redhat.com/show_bug.cgi?id=1509294

lspci:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:16.3 Serial controller: Intel Corporation Sunrise Point-H KT Redirection (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #13 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M2000M] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
04:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
3e:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
Comment 1 James Hewitt 2017-12-04 10:56:37 UTC
I regularly get the same freeze. Lenovo P50 20EQ

Fedora 27, Linux ibm007470 4.13.15-300.fc27.x86_64 #1 SMP Tue Nov 21 21:10:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We both have NVIDIA GM107 and 
nouveau 0000:01:00.0: bios: version 82.07.9d.00.14
Comment 2 James Hewitt 2017-12-04 10:57:28 UTC
Created attachment 135918 [details]
journalctl -k -b -1 from my last freeze
Comment 3 B Kaye 2017-12-04 21:33:24 UTC
I am having the same issue on a regular basis.

Kernel: 4.13.16-200.fc26.x86_64
Nouveau xorg-x11-drv-nouveau-1.0.15-1.fc26.x86_64
Comment 4 Will Newton 2018-01-10 17:17:11 UTC
I haven't been seeing crashes as frequently any more - I switched back to wayland, not sure if there is anything else that changed besides installing Fedora updates - but I just had another crash:

Jan 10 16:34:16 localhost.localdomain kernel: nouveau 0000:01:00.0: disp: 0x00006671[0]: INIT_GENERIC_CONDITON: unknown 0x07
Jan 10 17:09:15 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: write fault at 001bc5b000 engine 00 [GR] client 0f [GPC0/PROP_0] reason 00 [PDE] on channel 14 [00fdf1d000 Xwayland[1935]]
Jan 10 17:09:15 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: channel 14: killed
Jan 10 17:09:15 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jan 10 17:09:15 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
Jan 10 17:09:15 localhost.localdomain kernel: nouveau 0000:01:00.0: Xwayland[1935]: channel 14 killed!
Comment 5 James Hewitt 2018-01-16 19:56:00 UTC
Created attachment 136787 [details]
journalctl -k from a recent freeze, newer kernel, newer bios, still occuring

Have had three recent freezes with a newer bios and newer kernel. Full log of the last freeze attached. I was editing a video in kdenlive and could still hear the audio of the video and notifications coming in, so it's not a 100% freeze. Keyboard and mouse input were unresponsive, and video wasn't updating on screen.

End of logs for last three freezes:

Jan 16 13:08:24 ibm007470 kernel: nouveau 0000:01:00.0: disp: 0x00006671[0]: INIT_GENERIC_CONDITON: unknown 0x07
Jan 16 13:21:57 ibm007470 kernel: nouveau 0000:01:00.0: fifo: write fault at 000cc40000 engine 00 [GR] client 0f [GPC0/PROP_0] reason 00 [PDE] on channel 7 [007f087000 Xorg[2752]]
Jan 16 13:21:57 ibm007470 kernel: nouveau 0000:01:00.0: fifo: channel 7: killed
Jan 16 13:21:57 ibm007470 kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jan 16 13:21:57 ibm007470 kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
Jan 16 13:21:57 ibm007470 kernel: nouveau 0000:01:00.0: Xorg[2752]: channel 7 killed!


Jan 16 12:12:02 ibm007470 kernel: nouveau 0000:01:00.0: fifo: read fault at 8e8e32b000 engine 00 [GR] client 0d [GPC0/GCC] reason 00 [PDE] on channel 7 [007f10f000 Xorg[3050]]
Jan 16 12:12:02 ibm007470 kernel: nouveau 0000:01:00.0: fifo: channel 7: killed
Jan 16 12:12:02 ibm007470 kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jan 16 12:12:02 ibm007470 kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery


Jan 14 18:30:20 ibm007470 kernel: nouveau 0000:01:00.0: fifo: read fault at 002fda0000 engine 00 [GR] client 0f [GPC0/PROP_0] reason 00 [PDE] on channel 4 [007f9e9000 systemd-logind[1052]]
Jan 14 18:30:20 ibm007470 kernel: nouveau 0000:01:00.0: fifo: channel 4: killed
Jan 14 18:30:20 ibm007470 kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jan 14 18:30:20 ibm007470 kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
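
To compare the fault lines across the reports above, the fields can be pulled out with a small parser. The following is a minimal sketch in Python, not part of nouveau or of any existing tool. The regular expression is modelled only on the "fifo: ... fault" lines quoted in this report, and the names FAULT_RE, parse_fault and summarize are hypothetical.

```python
import re
from collections import Counter

# Assumption: the line layout matches the "fifo: read/write fault" messages
# quoted in this bug; other kernel versions may format them differently.
FAULT_RE = re.compile(
    r"nouveau (?P<pci>[0-9a-f:.]+): fifo: (?P<kind>read|write) fault "
    r"at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]+)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]+)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]+)\] "
    r"on channel (?P<channel>\d+) "
    r"\[(?P<inst>[0-9a-f]+) (?P<process>[^\[\]]+)\[(?P<pid>\d+)\]\]"
)


def parse_fault(line):
    """Return a dict of fields for a fifo fault line, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["addr"] = int(fields["addr"], 16)   # faulting GPU virtual address
    fields["channel"] = int(fields["channel"])
    fields["pid"] = int(fields["pid"])
    return fields


def summarize(lines):
    """Count faults by (kind, client, reason) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        fault = parse_fault(line)
        if fault is not None:
            counts[(fault["kind"], fault["client"], fault["reason"])] += 1
    return counts
```

Feeding it the output of journalctl -k, for example with summarize(open("freeze.log")) where freeze.log is a hypothetical file name, groups the faults so that reports from different machines can be compared by client unit and reason.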
Comment 6 James Hewitt 2018-01-16 22:52:34 UTC
Created attachment 136797 [details]
Freeze log with drm.debug=14

Turned on drm.debug=14 to capture more driver output, had another freeze after running for a couple of hours, attached new log.
Comment 7 Jeff Peeler 2018-01-21 18:47:08 UTC
Created attachment 136881 [details]
dmesg output from after plugging in DVI monitor first, then display port monitor

I'm unsure how helpful this info is, but I can't even boot using a 4.13 kernel with any monitors plugged in. So after starting up, I plug in my monitors (and often xorg crashes), but eventually I'm able to use both monitors... until some random time in the future the display becomes completely unresponsive and I'm forced to reboot.
Comment 8 Jeff Peeler 2018-01-21 18:52:39 UTC
Created attachment 136882 [details]
dmesg output after display becomes completely unresponsive

The "write fault" seems to be when the display becomes unusable. If it's helpful, I can boot with: "nouveau.debug=debug drm.debug=0x1e" and attach that as well.
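
The drm.debug values mentioned in this thread (14 and 0x1e) are bitmasks, so it can help to see which message categories each one enables. The sketch below is a hedged illustration in Python. The bit assignments in DRM_DEBUG_BITS are an assumption based on the drm module's parameter description and should be checked against modinfo drm on the kernel in use; the function name decode_drm_debug is hypothetical.

```python
# Assumption: category bits as described by the drm "debug" module parameter.
# Newer kernels define further bits that are not listed here.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(value):
    """Return the names of the known debug categories set in value."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


if __name__ == "__main__":
    for value in (14, 0x1E):
        print(f"drm.debug={value:#x}: {', '.join(decode_drm_debug(value))}")
```

Under that assumption, 0x1e differs from 14 only in the 0x10 bit, so the second setting would add atomic modesetting messages to the driver, KMS and PRIME output.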
Comment 9 Jim Scarborough 2018-01-29 21:04:41 UTC
This looks very similar to bug 100567.  I don't know enough to flag a duplicate, but there appears to be more log data here.
Comment 10 James Hewitt 2018-03-12 13:45:58 UTC
I haven't seen a freeze for a while, I'm guessing that a recent kernel update has resolved the freeze.

Instead, I occasionally get a chrome tab going screwy with a segfault in nouveau:
Mar 12 13:40:23 ibm007470 kernel: nouveau 0000:01:00.0: Xorg[2838]: Unknown handle 0x01c69fda
Mar 12 13:40:23 ibm007470 kernel: nouveau 0000:01:00.0: Xorg[2838]: validate_init
Mar 12 13:40:23 ibm007470 kernel: nouveau 0000:01:00.0: Xorg[2838]: validate: -2
Mar 12 13:40:23 ibm007470 kernel: show_signal_msg: 6 callbacks suppressed
Mar 12 13:40:23 ibm007470 kernel: chrome[6053]: segfault at fffff5c81b30075a ip 00007f10c41be05c sp 00007ffd499625b0 error 5 in libdrm_nouveau.so.2.0.0[7f10c41bb000+7000]

Going to raise that with the chromium team to see if they can get more debug info that could address a potential underlying cause.

Current kernel: Linux ibm007470 4.15.6-300.fc27.x86_64 #1 SMP Mon Feb 26 18:43:03 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
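
The segfault line above gives both the instruction pointer and the address at which libdrm_nouveau.so.2.0.0 is mapped, so the offset of the faulting instruction within that mapping can be computed. This is a minimal sketch in Python assuming the kernel's "segfault at ... ip ... in object[base+size]" layout as quoted above; SEGFAULT_RE and segfault_offset are hypothetical names.

```python
import re

# Assumption: layout of the kernel segfault message as quoted in this comment.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def segfault_offset(line):
    """Return (object name, offset of ip within the mapping) for a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("instruction pointer lies outside the reported mapping")
    return m.group("object"), offset
```

For the line quoted above this gives offset 0x305c (0x7f10c41be05c minus 0x7f10c41bb000). With debug symbols for libdrm installed, a tool such as addr2line can often map that offset to a function, although the mapping base reported by the kernel is not guaranteed to equal the library's load base.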
Comment 11 Will Newton 2018-03-12 15:37:07 UTC
I saw a couple of freezes today with kernel 4.15.6-300.fc27.x86_64:

Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: TRAP ch 14 [00fdf0d000 Xorg[1841]]
Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: GPC0/TPC2/TEX: 80000009
Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: TRAP ch 14 [00fdf0d000 Xorg[1841]]
Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: GPC0/TPC1/TEX: 80000000
Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: TRAP ch 14 [00fdf0d000 Xorg[1841]]
Mar 12 11:59:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: GPC0/TPC2/TEX: 80000000
Comment 12 Goncalo Gomes 2018-03-31 00:44:18 UTC
I've also been experiencing this for at least 2 to 3 years and would love to help get to the bottom of it. For a good while it would only reproduce every 15 to 40 days, often while playing videos, but recently it has been happening 3 to 10 times per day. It's nigh on impossible to do anything productive anymore.

I run Ubuntu 16.04.2 LTS with the gnome-panel emulation (+ compiz). These hangs reproduce far too easily when using anything that is hard on graphics or makes heavy use of fullscreen. For example:

* google-chrome
* atom
* mplayer / vlc


I am more than happy to help identify the root cause of the issue, as I seem to be able to replicate it on demand. I'm happy to run developer debug builds and provide as much input as possible.

From today:

[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: fifo: channel 5: killed
[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: fifo: engine 7: scheduled for recovery
[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
[Sat Mar 31 01:22:32 2018] nouveau 0000:01:00.0: compiz[3120]: channel 5 killed!
[Sat Mar 31 01:22:39 2018] [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:38:head-0] hw_done timed out
[Sat Mar 31 01:22:50 2018] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:38:head-0] hw_done timed out
[Sat Mar 31 01:23:00 2018] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:38:head-0] flip_done timed out
[Sat Mar 31 01:25:16 2018] INFO: task kworker/u16:0:21874 blocked for more than 120 seconds.
[Sat Mar 31 01:25:16 2018]       Not tainted 4.13.0-37-generic #42~16.04.1-Ubuntu
[Sat Mar 31 01:25:16 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Mar 31 01:25:16 2018] kworker/u16:0   D    0 21874      2 0x80000000
[Sat Mar 31 01:25:16 2018] Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
[Sat Mar 31 01:25:16 2018] Call Trace:
[Sat Mar 31 01:25:16 2018]  __schedule+0x3d6/0x8b0
[Sat Mar 31 01:25:16 2018]  ? nvkm_ioctl_ntfy_get+0x69/0xb0 [nouveau]
[Sat Mar 31 01:25:16 2018]  schedule+0x36/0x80
[Sat Mar 31 01:25:16 2018]  schedule_timeout+0x1f3/0x360
[Sat Mar 31 01:25:16 2018]  ? nvkm_client_ioctl+0x12/0x20 [nouveau]
[Sat Mar 31 01:25:16 2018]  ? nvif_object_ioctl+0x47/0x50 [nouveau]
[Sat Mar 31 01:25:16 2018]  ? nouveau_bo_rd32+0x2a/0x30 [nouveau]
[Sat Mar 31 01:25:16 2018]  ? nv84_fence_read+0x2e/0x30 [nouveau]
[Sat Mar 31 01:25:16 2018]  dma_fence_default_wait+0x1c5/0x260
[Sat Mar 31 01:25:16 2018]  ? dma_fence_default_wait+0x1c5/0x260
[Sat Mar 31 01:25:16 2018]  ? dma_fence_free+0x20/0x20
[Sat Mar 31 01:25:16 2018]  dma_fence_wait_timeout+0x3f/0x100
[Sat Mar 31 01:25:16 2018]  drm_atomic_helper_wait_for_fences+0x40/0xc0 [drm_kms_helper]
[Sat Mar 31 01:25:16 2018]  nv50_disp_atomic_commit_tail+0x55/0x3b70 [nouveau]
[Sat Mar 31 01:25:16 2018]  nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
[Sat Mar 31 01:25:16 2018]  process_one_work+0x15b/0x410
[Sat Mar 31 01:25:16 2018]  worker_thread+0x4b/0x460
[Sat Mar 31 01:25:16 2018]  kthread+0x10c/0x140
[Sat Mar 31 01:25:16 2018]  ? process_one_work+0x410/0x410
[Sat Mar 31 01:25:16 2018]  ? kthread_create_on_node+0x70/0x70
[Sat Mar 31 01:25:16 2018]  ret_from_fork+0x35/0x40


root@darkside:/proc/21874# cat stack
[<ffffffff9104d575>] dma_fence_default_wait+0x1c5/0x260
[<ffffffff9104d06f>] dma_fence_wait_timeout+0x3f/0x100
[<ffffffffc0506ba0>] drm_atomic_helper_wait_for_fences+0x40/0xc0 [drm_kms_helper]
[<ffffffffc081f7a5>] nv50_disp_atomic_commit_tail+0x55/0x3b70 [nouveau]
[<ffffffffc08232d2>] nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
[<ffffffff90aa23ab>] process_one_work+0x15b/0x410
[<ffffffff90aa26ab>] worker_thread+0x4b/0x460
[<ffffffff90aa8b5c>] kthread+0x10c/0x140
[<ffffffff91400485>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

More than happy to help fix this issue.
Comment 13 Goncalo Gomes 2018-03-31 23:56:55 UTC
Apologies, this was posted incorrectly against this hardware. Please disregard the above.
Comment 14 kenorb 2019-01-05 23:41:00 UTC
Pasting relevant info from freeze-20180116-drm.debug.log, so it's easier to find.

[drm:drm_mode_addfb2 [drm]] [FB:90]
00a0 2 nv50_base_ntfy_set
        00000120
        f0000000
0084 1 nv50_base_image_set
        00000010
00c0 1 nv50_base_image_set
        fb0000fe
0400 5 nv50_base_image_set
        000f6c00
        00000000
        04380780
        00007804
        0000cf00
0080 1 nv50_base_update
        00000000
nouveau 0000:01:00.0: fifo: read fault at aa8e32b000 engine 00 [GR] client 0d [GPC0/GCC] reason 00 [PDE] on channel 6 [007f287000 Xorg[2795]]
nouveau 0000:01:00.0: fifo: channel 6: killed
nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
nouveau 0000:01:00.0: Xorg[2795]: channel 6 killed!
[drm:drm_mode_addfb2 [drm]] [FB:93]
Comment 15 kenorb 2019-01-05 23:45:07 UTC
[drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 0

