Bug 104345

Summary:	X-Server hangs with showing scrambled picture, sound still playing.
Product:	Mesa	Reporter:	bernhardu <bernhardu>
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:	Default DRI bug account <dri-devel>
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	kernel output captured from the serial line. glxinfo vdpauinfo kernel output captured from the serial line. (4.14.10+lockdep) kernel output captured from the serial line. (4.14.14 vanilla with debian config) video output with the chess board like pattern 2018-12-20-crash.txt

Description bernhardu 2017-12-20 10:05:22 UTC

Created attachment 136312 [details]
kernel output captured from the serial line.

Playing video hangs X-Server with showing scrambled picture, sound still playing.

System hangs after some time playing mostly when playing a recorded DVB-S-SD
stream via VLC. Happened also with other videos and players.

This left the system showing a distorted version of the last shown frame from
the video. Most of the time it is a chess board like pattern.
Sound is still playing regularly.
A ssh connection is still possible.

Playing the exact same video again later works without problem.


Kernel output shows at this time following:

[ 4353.316286] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0570480c
[ 4353.328836] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001004AE
[ 4353.343791] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
[ 4353.358755] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 5) at page 1049774, read from 'TC0' (0x54433000) (72)
[ 4353.378234] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0570440c
[ 4353.390777] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010049C
[ 4353.405735] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04400C
[ 4353.420692] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 5) at page 1049756, read from 'TC1' (0x54433100) (68)

Followed by this:

[ 4592.681352] INFO: task amdgpu_cs:0:1007 blocked for more than 120 seconds.
[ 4592.695116]       Tainted: G           O    4.13.0-1-amd64 #1 Debian 4.13.13-1
[ 4592.709587] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4592.725255] amdgpu_cs:0     D    0  1007   1000 0x00400000
[ 4592.736245] Call Trace:
[ 4592.741145]  ? __schedule+0x3c8/0x860
[ 4592.748486]  ? schedule+0x32/0x80
[ 4592.755120]  ? schedule_timeout+0x1da/0x350
[ 4592.763537]  ? amdgpu_vm_validate_level.isra.9+0x80/0x80 [amdgpu]
[ 4592.775725]  ? dma_fence_default_wait+0x239/0x260
[ 4592.785138]  ? dma_fence_default_wait+0x239/0x260
[ 4592.794548]  ? dma_fence_free+0x20/0x20
[ 4592.802220]  ? dma_fence_wait_timeout+0x33/0xe0
[ 4592.811317]  ? amdgpu_ctx_add_fence+0x61/0xf0 [amdgpu]
[ 4592.821632]  ? amdgpu_cs_ioctl+0x156b/0x1920 [amdgpu]
[ 4592.831784]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[ 4592.842432]  ? drm_ioctl_kernel+0x65/0xb0 [drm]
[ 4592.851510]  ? drm_ioctl+0x2e3/0x3a0 [drm]
[ 4592.859737]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[ 4592.870373]  ? do_futex+0x2df/0xa90
[ 4592.877376]  ? amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 4592.886971]  ? do_vfs_ioctl+0x9f/0x600
[ 4592.894471]  ? SyS_futex+0x7a/0x170
[ 4592.901453]  ? SyS_ioctl+0x74/0x80
[ 4592.908264]  ? system_call_fast_compare_end+0xc/0x97

Attached is a log captured on the serial port from boot to the
crash follwed by a Magic-Sys-Req-T.

This are the system specs:

Graphics:  Card: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 460] (XFX)
           Display Server: x11 (X.Org 1.19.5 ) drivers: amdgpu (unloaded: modesetting,fbdev,vesa)
           Resolution: 1920x1080@60.00hz
           OpenGL: renderer: AMD Radeon RX 460 Graphics (AMD POLARIS11 / DRM 3.18.0 / 4.13.0-1-amd64, LLVM 5.0.0)
           version: 4.5 Mesa 17.2.5
System:    Kernel: 4.13.0-1-amd64 x86_64 bits: 64
Desktop:   KDE Plasma 5.10.5
Machine:   Device: desktop Mobo: ASUSTeK model: PRIME B350M-A v: Rev X.0x serial: N/A
           UEFI: American Megatrends v: 3203 date: 11/09/2017
CPU:       Octa core AMD Ryzen 7 1700 Eight-Core (-HT-MCP-) speed/max: 1550/3000 MHz

Error happened with Debian Stretch with linux 4.9 and later.
Currently it is running a Debian testing.

Is this the right product/component for this report?

Comment 1 Michel Dänzer 2017-12-20 10:10:10 UTC

It's more likely a Mesa or maybe LLVM issue. Did this start happening after you upgraded Mesa and/or LLVM, or maybe another component such as the kernel?

Comment 2 bernhardu 2017-12-20 10:35:53 UTC

(In reply to Michel Dänzer from comment #1)
> It's more likely a Mesa or maybe LLVM issue. Did this start happening after
> you upgraded Mesa and/or LLVM, or maybe another component such as the kernel?

I get such hangs since I am using this system since april. I waited to report
it because at that time I wondered if this might be a firmware issue, as the
hardware was very new.
But after some bios updates these hangs did not resolve.
(Just mainboard firmware update, found not yet an update for the graphics card.)
Also thought that maybe the versions in Debian Stretch would be too old.

The earliest hanging logs I have are from mid June with kernel 4.9.0-3-amd64.
There the output was a little different as there was Xorg shown as blocking:
    INFO: task Xorg:1024 blocked for more than 120 seconds.
I tried every following kernel up to 4.12 via backports and upgraded at begin
of december to testing with 4.13.

Comment 3 Michel Dänzer 2017-12-20 10:38:46 UTC

Please attach the output of glxinfo and vdpauinfo.

Comment 4 bernhardu 2017-12-20 21:25:20 UTC

Created attachment 136335 [details]
glxinfo

Sorry for the delay.

Comment 5 bernhardu 2017-12-20 21:25:45 UTC

Created attachment 136336 [details]
vdpauinfo

Comment 6 bernhardu 2018-01-13 14:33:49 UTC

Created attachment 136709 [details]
kernel output captured from the serial line. (4.14.10+lockdep)

Attaching another example with a vanilla kernel 4.14.10 with lockdep enabled.

Shows again such "GPU fault detected:"
followed by a "task amdgpu_cs:0:1014 blocked for more than 120 seconds.

I might add that it may be related to one or more suspend and resume cycles.

Is there anything I can attach or test to make
identification of the issue easier?

Comment 7 bernhardu 2018-02-04 10:07:17 UTC

Created attachment 137161 [details]
kernel output captured from the serial line. (4.14.14 vanilla with debian config)

Comment 8 bernhardu 2018-02-04 10:09:57 UTC

Created attachment 137162 [details]
video output with the chess board like pattern

Comment 9 bernhardu 2018-02-04 10:26:29 UTC

This time with latest firmware, with "rcu_nocbs=0-16" for #196683
Happened without any hibernation cycle.
Just the "GPU fault detected: 146" appeared, no "blocked for more than" message.

Somehow it looks like when this "GPU fault" happens some locks are
never unlocked and therefore blocking the X server.
At least the possibility for a clean shutdown would be good.

Is there any other information I should collect next time, any other configuration or version I can test?

Comment 10 bernhardu 2018-02-19 12:47:37 UTC

Same happens with vanilla kernel 4.15.2.

Comment 11 bernhardu 2018-05-22 20:11:44 UTC

Happened again today. Got no such hang since 04.04.2018
From Debian testing:
Linux kernel: 4.16.5-1
Mesa: 18.0.3-1
xserver-xorg-video-amdgpu: 18.0.1-1
libdrm2: 2.4.91-2
(Nearly) current firmware: PRIME-B350M-A-ASUS-4008.CAP


[ 1085.521781] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0bf8440c
[ 1085.521789] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010037F
[ 1085.521792] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C
[ 1085.521798] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 1) at page 1049471, read from 'TC1' (0x54433100) (68)
[ 1095.659133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=39661, last emitted seq=39663
[ 1095.659146] [drm] IP block:gfx_v8_0 is hung!
[ 1095.659194] [drm] GPU recovery disabled.


Is there anything else I could collect next time or try?

Comment 12 bernhardu 2018-12-20 14:09:23 UTC

Created attachment 142867 [details]
2018-12-20-crash.txt

The last months I observed this less.
But in December it started to get more visible again.
(I am following debian testing closely.)

It shows now nearly always when e.g. browsing with firefox -
even in simple sites e.g. bugtrackers.

The chess board like pattern is still the same.

The blocked kernel threads stack changed with some kernel upgrade to:

Call Trace:
 ? __schedule+0x2b7/0x880
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 schedule+0x28/0x80
 schedule_timeout+0x1ee/0x380
 ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
 ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
 dma_fence_default_wait+0x1fd/0x280
 ? dma_fence_release+0x90/0x90
 dma_fence_wait_timeout+0x39/0xf0
 reservation_object_wait_timeout_rcu+0x10a/0x290
 amdgpu_dm_do_flip+0x112/0x340 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0xbac/0xdb0 [amdgpu]
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to+0x16f/0x450
 commit_tail+0x3d/0x70 [drm_kms_helper]
 process_one_work+0x195/0x380
 worker_thread+0x30/0x390
 ? process_one_work+0x380/0x380
 kthread+0x113/0x130
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x22/0x40

What can I do to get this bug forward?

Comment 13 bernhardu 2019-01-15 13:13:52 UTC

Happened again.
A way to avoid that crash may be to not stay int first cold boot.
I am not sure but I think I never saw this when system was
running from a rebooted state (warm boot).


[    0.000000] Linux version 4.19.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 8.2.0 (Debian 8.2.0-13)) #1 SMP Debian 4.19.12-1 (2018-12-22)
...
[32345.334084] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06407704 for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720
[32345.334092] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001002C8
[32345.334096] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02077004
[32345.334101] amdgpu 0000:08:00.0: VM fault (0x04, vmid 1, pasid 32775) at page 1049288, read from 'SDM0' (0x53444d30) (119)
[32345.334114] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06407704 for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720
[32345.334118] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001002C8
[32345.334121] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02077004
[32345.334125] amdgpu 0000:08:00.0: VM fault (0x04, vmid 1, pasid 32775) at page 1049288, read from 'SDM0' (0x53444d30) (119)
[32345.334523] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06d8770c for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720
[32345.334528] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001002DB
[32345.334532] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0207700C
[32345.334538] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 1, pasid 32775) at page 1049307, read from 'SDM0' (0x53444d30) (119)
[32355.357958] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=947938, emitted seq=947940
[32355.357965] [drm] GPU recovery disabled.
[32505.119425] INFO: task kworker/u32:2:12454 blocked for more than 120 seconds.
[32505.119433]       Tainted: G           OE     4.19.0-1-amd64 #1 Debian 4.19.12-1
[32505.119435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32505.119438] kworker/u32:2   D    0 12454      2 0x80000000
[32505.119460] Workqueue: events_unbound commit_work [drm_kms_helper]
[32505.119464] Call Trace:
[32505.119474]  ? __schedule+0x2a2/0x870
[32505.119480]  ? __switch_to_asm+0x40/0x70
[32505.119484]  schedule+0x28/0x80
[32505.119488]  schedule_timeout+0x26d/0x390
[32505.119586]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
[32505.119682]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[32505.119687]  dma_fence_default_wait+0x238/0x2a0
[32505.119691]  ? dma_fence_release+0x90/0x90
[32505.119695]  dma_fence_wait_timeout+0xdd/0x100
[32505.119699]  reservation_object_wait_timeout_rcu+0x173/0x280
[32505.119800]  amdgpu_dm_do_flip+0x112/0x340 [amdgpu]
[32505.119903]  amdgpu_dm_atomic_commit_tail+0x750/0xdb0 [amdgpu]
[32505.119909]  ? wait_for_completion_timeout+0x3b/0x1a0
[32505.119913]  ? __switch_to_asm+0x34/0x70
[32505.119917]  ? __switch_to_asm+0x40/0x70
[32505.119921]  ? __switch_to_asm+0x34/0x70
[32505.119924]  ? __switch_to_asm+0x40/0x70
[32505.119938]  commit_tail+0x3d/0x70 [drm_kms_helper]
[32505.119946]  process_one_work+0x1a7/0x3a0
[32505.119951]  worker_thread+0x30/0x390
[32505.119955]  ? pwq_unbound_release_workfn+0xd0/0xd0
[32505.119959]  kthread+0x112/0x130
[32505.119963]  ? kthread_bind+0x30/0x30
[32505.119967]  ret_from_fork+0x22/0x40
[32528.537326] sysrq: SysRq : Keyboard mode set to system default

Comment 14 bernhardu 2019-04-29 14:54:40 UTC

This sample happend while just browsing in firefox inside a plasma desktop,
running in amd64 debian buster.


[138335.303379] amdgpu 0000:08:00.0: GPU fault detected: 147 0x0a704402 for process Xorg pid 1129 thread Xorg:cs0 pid 1161
[138335.303387] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0012014E
[138335.303391] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E044002
[138335.303396] amdgpu 0000:08:00.0: VM fault (0x02, vmid 7, pasid 32768) at page 1179982, read from 'TC1' (0x54433100) (68)
[138335.313378] amdgpu 0000:08:00.0: GPU fault detected: 147 0x08904802 for process kwin_x11 pid 1350 thread kwin_x11:cs0 pid 1568
[138335.313384] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00120712
[138335.313388] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048002
[138335.313393] amdgpu 0000:08:00.0: VM fault (0x02, vmid 1, pasid 32773) at page 1181458, read from 'TC0' (0x54433000) (72)
[139983.820484] amdgpu 0000:08:00.0: GPU fault detected: 147 0x00004802 for process Xorg pid 1129 thread Xorg:cs0 pid 1161
[139983.820491] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000800
[139983.820495] amdgpu 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048002
[139983.820500] amdgpu 0000:08:00.0: VM fault (0x02, vmid 6, pasid 32768) at page 2048, read from 'TC0' (0x54433000) (72)
[139994.156833] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=14610386, emitted seq=14610389
[139994.156840] [drm] GPU recovery disabled.
[140167.215046] INFO: task kworker/u32:3:15564 blocked for more than 120 seconds.
[140167.215053]       Tainted: G           OE     4.19.0-4-amd64 #1 Debian 4.19.28-2
[140167.215055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140167.215058] kworker/u32:3   D    0 15564      2 0x80000000
[140167.215080] Workqueue: events_unbound commit_work [drm_kms_helper]
[140167.215083] Call Trace:
[140167.215093]  ? __schedule+0x2a2/0x870
[140167.215098]  ? __switch_to_asm+0x40/0x70
[140167.215102]  schedule+0x28/0x80
[140167.215106]  schedule_timeout+0x26d/0x390
[140167.215204]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
[140167.215300]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[140167.215305]  dma_fence_default_wait+0x238/0x2a0
[140167.215310]  ? dma_fence_release+0x90/0x90
[140167.215314]  dma_fence_wait_timeout+0xdd/0x100
[140167.215319]  reservation_object_wait_timeout_rcu+0x173/0x280
[140167.215420]  amdgpu_dm_do_flip+0x112/0x340 [amdgpu]
[140167.215523]  amdgpu_dm_atomic_commit_tail+0x750/0xdb0 [amdgpu]
[140167.215528]  ? wait_for_completion_timeout+0x3b/0x1a0
[140167.215531]  ? __switch_to_asm+0x34/0x70
[140167.215535]  ? __switch_to_asm+0x40/0x70
[140167.215538]  ? __switch_to_asm+0x34/0x70
[140167.215541]  ? __switch_to_asm+0x40/0x70
[140167.215555]  commit_tail+0x3d/0x70 [drm_kms_helper]
[140167.215562]  process_one_work+0x1a7/0x3a0
[140167.215566]  worker_thread+0x30/0x390
[140167.215571]  ? create_worker+0x1a0/0x1a0
[140167.215573]  kthread+0x112/0x130
[140167.215577]  ? kthread_bind+0x30/0x30
[140167.215581]  ret_from_fork+0x22/0x40
[140241.805748] sysrq: SysRq : Keyboard mode set to system default


What can I do to get this bug forward?

Comment 15 GitLab Migration User 2019-09-25 18:01:29 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1293.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.