Bug 105317

Summary:	Vega 56 GPU hangs while running the #GraphicsFuzz shader15 test
Product:	Mesa	Reporter:	mikhail.v.gavrilov
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:	Default DRI bug account <dri-devel>
Severity:	normal
Priority:	medium	CC:	david.cap, devurandom, freedesktop, ilvipero
Version:	git
Hardware:	Other
OS:	All
Attachments:	Shader runner link test

Description mikhail.v.gavrilov 2018-03-01 18:41:26 UTC

Preparation:

1. Install Fedora 27.
https://download.fedoraproject.org/pub/fedora/linux/releases/27/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso
2. Install the latest Mesa and LLVM.
https://copr.fedorainfracloud.org/coprs/che/mesa/
3. Build and install the staging kernel with the latest amdgpu driver.
$ git clone git://people.freedesktop.org/~agd5f/linux --branch amd-staging-drm-next
$ cd linux
$ make clean && make bzImage && make modules
# make modules_install && make install

Reproducing the issue:
1. Launch any browser (I checked Firefox and Opera).
2. Open http://www.graphicsfuzz.com/benchmark/android-v1.html
3. Press Go.
4. Wait until the shader15 test is reached.


Symptoms:
1. The system stops responding.
2. All the load-indicator LEDs on the video card light up.
3. The video card's blower fan becomes very loud.

The following lines appear in dmesg:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=71473, last emitted seq=71475
[drm] No hardware hang detected. Did some blocks stall?


If the Opera browser was used, the following lines are logged as well:
[drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[drm] No hardware hang detected. Did some blocks stall?
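
For triage, the timeout lines above can be summarized with a small script. The following is a minimal Python sketch, not part of the original report: the function name parse_ring_timeouts and the sample string are illustrative assumptions, and the regular expression only covers the exact message format quoted in this bug. The difference between the two sequence numbers is read here as a rough count of submissions whose fences had not signaled yet, which is an interpretation rather than something stated in the thread.

```python
import re

# Matches the amdgpu timeout message as quoted in this report.
# "\s*" after "seq=" tolerates logs where the number was wrapped onto the next line.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=\s*(?P<signaled>\d+), "
    r"last emitted seq=\s*(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return a list of (ring, signaled, emitted, pending) tuples."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results


# Sample copied from the description above (hypothetical variable name).
sample = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=71473, last emitted seq=71475\n"
    "[drm] No hardware hang detected. Did some blocks stall?\n"
)

for ring, signaled, emitted, pending in parse_ring_timeouts(sample):
    print(f"ring={ring} signaled={signaled} emitted={emitted} pending={pending}")
```

Feeding the script the output of dmesg (for example through a file) would list every ring timeout in one pass, which makes it easier to compare the hangs reported in different comments.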

Comment 1 Emil Velikov 2018-03-01 19:11:28 UTC

Mikhail, one suggestion to consider for the future:

Please mention the version numbers (or the SHA, if using a git checkout) of the different components: Mesa, LLVM, kernel.

Comment 2 mikhail.v.gavrilov 2018-03-03 08:52:02 UTC

(In reply to Emil Velikov from comment #1)
> Mikhail, one suggestion to consider for the future:
> 
> Please mention the version numbers (or the SHA, if using a git checkout) of
> the different components: Mesa, LLVM, kernel.

kernel: 4.16.0-rc1-git63e5921e856b
mesa: 18.1.0-0.4.git56dc9f9
llvm: 7.0.0-0.1.r326462

Comment 3 mikhail.v.gavrilov 2018-03-24 14:26:37 UTC

[  463.172901] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=26958, last emitted seq=26960
[  463.172985] [drm] No hardware hang detected. Did some blocks stall?
[  473.357738] sysrq: SysRq : Show Blocked State
[  473.357758]   task                        PC stack   pid father
[  473.357955] amdgpu_cs:0     D13176  2340   2283 0x00000000
[  473.357969] Call Trace:
[  473.357988]  ? __schedule+0x2ed/0xba0
[  473.358005]  ? dma_fence_default_wait+0x14f/0x370
[  473.358013]  schedule+0x2f/0x90
[  473.358021]  schedule_timeout+0x23d/0x540
[  473.358030]  ? find_held_lock+0x34/0xa0
[  473.358044]  ? mark_held_locks+0x56/0x80
[  473.358053]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  473.358065]  ? dma_fence_default_wait+0x14f/0x370
[  473.358072]  dma_fence_default_wait+0x23b/0x370
[  473.358081]  ? dma_fence_release+0x170/0x170
[  473.358094]  dma_fence_wait_timeout+0x4f/0x270
[  473.358176]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[  473.358237]  amdgpu_cs_ioctl+0x99/0x1d60 [amdgpu]
[  473.358357]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  473.358383]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[  473.358409]  drm_ioctl+0x2d5/0x370 [drm]
[  473.358466]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  473.358479]  ? __pm_runtime_resume+0x54/0x90
[  473.358493]  ? trace_hardirqs_on_caller+0xed/0x180
[  473.358551]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  473.358566]  do_vfs_ioctl+0xa5/0x6e0
[  473.358589]  SyS_ioctl+0x74/0x80
[  473.358603]  do_syscall_64+0x79/0x220
[  473.358612]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  473.358678] RIP: 0033:0x7fa95fa5c0f7
[  473.358683] RSP: 002b:00007fa957459998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  473.358692] RAX: ffffffffffffffda RBX: 00007fa957459a80 RCX: 00007fa95fa5c0f7
[  473.358697] RDX: 00007fa957459a00 RSI: 00000000c0186444 RDI: 000000000000000b
[  473.358701] RBP: 00007fa957459a00 R08: 00007fa957459ab0 R09: 00007fa9574599e0
[  473.358706] R10: 00007fa957459ab0 R11: 0000000000000246 R12: 00000000c0186444
[  473.358710] R13: 000000000000000b R14: 0000000002876fe8 R15: 0000000000000002
[  473.358836] tracker-store   D12456  2792   2166 0x00000000
[  473.358848] Call Trace:
[  473.358862]  ? __schedule+0x2ed/0xba0
[  473.358882]  schedule+0x2f/0x90
[  473.358889]  io_schedule+0x12/0x40
[  473.358898]  generic_file_read_iter+0x39e/0xdb0
[  473.358922]  ? page_cache_tree_insert+0x130/0x130
[  473.359001]  xfs_file_buffered_aio_read+0x65/0x1a0 [xfs]
[  473.359066]  xfs_file_read_iter+0x64/0xc0 [xfs]
[  473.359077]  __vfs_read+0x102/0x170
[  473.359100]  vfs_read+0x9e/0x150
[  473.359111]  SyS_pread64+0x93/0xb0
[  473.359119]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  473.359132]  do_syscall_64+0x79/0x220
[  473.359142]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  473.359148] RIP: 0033:0x7f7bb7448873
[  473.359152] RSP: 002b:00007ffc37fd1220 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[  473.359161] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f7bb7448873
[  473.359166] RDX: 0000000000001000 RSI: 0000556e21670258 RDI: 0000000000000008
[  473.359170] RBP: 0000000000001000 R08: 0000556e21670258 R09: 000000000fef0fff
[  473.359175] R10: 0000000002761000 R11: 0000000000000293 R12: 0000000000000000
[  473.359179] R13: 0000556e21670258 R14: 0000000002761000 R15: 0000556e214e9d80
[  473.359343] kworker/u16:0   D12152  4711      2 0x80000000
[  473.359370] Workqueue: events_unbound commit_work [drm_kms_helper]
[  473.359379] Call Trace:
[  473.359394]  ? __schedule+0x2ed/0xba0
[  473.359410]  ? dma_fence_default_wait+0x14f/0x370
[  473.359418]  schedule+0x2f/0x90
[  473.359425]  schedule_timeout+0x23d/0x540
[  473.359433]  ? find_held_lock+0x34/0xa0
[  473.359448]  ? mark_held_locks+0x56/0x80
[  473.359456]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  473.359469]  ? dma_fence_default_wait+0x14f/0x370
[  473.359476]  dma_fence_default_wait+0x23b/0x370
[  473.359484]  ? dma_fence_release+0x170/0x170
[  473.359498]  dma_fence_wait_timeout+0x4f/0x270
[  473.359509]  reservation_object_wait_timeout_rcu+0x193/0x4d0
[  473.359607]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[  473.359761]  amdgpu_dm_atomic_commit_tail+0xb66/0xdc0 [amdgpu]
[  473.359777]  ? wait_for_completion_timeout+0x76/0x1b0
[  473.359826]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  473.359841]  process_one_work+0x266/0x6b0
[  473.359876]  worker_thread+0x3a/0x390
[  473.359883]  ? process_one_work+0x6b0/0x6b0
[  473.359886]  kthread+0x121/0x140
[  473.359890]  ? kthread_create_worker_on_cpu+0x70/0x70
[  473.359896]  ret_from_fork+0x3a/0x50

Comment 4 Timothy Arceri 2018-04-01 03:47:00 UTC

Created attachment 138471: Shader runner link test

I've distilled one problem into the attached shader runner test. It seems we have another bug somewhere in the GLSL IR loop unrolling pass.

We end up with the following:

FRAG
DCL OUT[0], COLOR
DCL TEMP[0..3], LOCAL
IMM[0] UINT32 {0, 4294967295, 0, 0}
IMM[1] INT32 {0, 1, 0, 0}
IMM[2] FLT32 {    1.0000,     0.0000,     0.0000,     0.0000}
  0: MOV TEMP[0].x, IMM[0].xxxx
  1: MOV TEMP[1].x, IMM[1].xxxx
  2: BGNLOOP
  3:   USEQ TEMP[2].x, TEMP[1].xxxx, IMM[1].yyyy
  4:   UIF TEMP[2].xxxx
  5:     BRK
  6:   ENDIF
  7:   MOV TEMP[3], IMM[2].xxxx
  8:   MOV TEMP[0].x, IMM[0].yyyy
  9:   BRK
 10:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[1].yyyy
 11: ENDLOOP
 12: MOV OUT[0], IMM[2].xxxx
 13: END

Terminator found in the middle of a basic block!
label %endif6
LLVM ERROR: Broken function found, compilation aborted!
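
One thing that stands out in the TGSI listing is that the BRK at instruction 9 is followed directly by a UADD at instruction 10, that is, by an instruction placed after a jump inside the same block. That looks consistent with the "Terminator found in the middle of a basic block" message, although the thread does not spell out the exact mechanism. The following minimal Python sketch flags that pattern in a TGSI text dump. The function name unreachable_after_jump, the regular expression, and the opcode sets are illustrative assumptions and are not part of Mesa.

```python
import re

# Numbered TGSI instruction lines look like "  9:   BRK".
INSTR_RE = re.compile(r"^\s*(\d+):\s+(\w+)")
# Opcodes that close the current block; a jump right before one of these is fine.
BLOCK_END = {"ENDIF", "ELSE", "ENDLOOP"}
# Opcodes that leave the current block unconditionally.
JUMPS = {"BRK", "CONT"}


def unreachable_after_jump(listing):
    """Return (jump_index, jump_op, next_index, next_op) for code placed after a jump."""
    instrs = []
    for line in listing.splitlines():
        match = INSTR_RE.match(line)
        if match:
            instrs.append((int(match.group(1)), match.group(2)))
    hits = []
    for (idx, op), (next_idx, next_op) in zip(instrs, instrs[1:]):
        if op in JUMPS and next_op not in BLOCK_END:
            hits.append((idx, op, next_idx, next_op))
    return hits


# Instruction part of the dump above (hypothetical variable name).
listing = """\
  0: MOV TEMP[0].x, IMM[0].xxxx
  1: MOV TEMP[1].x, IMM[1].xxxx
  2: BGNLOOP
  3:   USEQ TEMP[2].x, TEMP[1].xxxx, IMM[1].yyyy
  4:   UIF TEMP[2].xxxx
  5:     BRK
  6:   ENDIF
  7:   MOV TEMP[3], IMM[2].xxxx
  8:   MOV TEMP[0].x, IMM[0].yyyy
  9:   BRK
 10:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[1].yyyy
 11: ENDLOOP
 12: MOV OUT[0], IMM[2].xxxx
 13: END
"""

print(unreachable_after_jump(listing))
```

The BRK at instruction 5 is not flagged because it is immediately followed by ENDIF, while the pair at instructions 9 and 10 is reported.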

Comment 5 Timothy Arceri 2018-04-02 10:20:57 UTC

*** Bug 104683 has been marked as a duplicate of this bug. ***

Comment 6 Timothy Arceri 2018-04-03 01:44:49 UTC

Piglit test:

https://patchwork.freedesktop.org/patch/214341/

Mesa fix:

https://patchwork.freedesktop.org/patch/214346/

Note that the WebGL test still froze in my testing, but I think Firefox was still using my system Mesa libs for some reason. The Mesa patch fixes the hang in the piglit test.

Comment 7 Bráulio Barros de Oliveira 2018-04-13 22:29:03 UTC

Likely a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=104817

Comment 8 Juan A. Suarez 2018-04-18 08:42:31 UTC

This already landed in Mesa. Can we close this as fixed?

Comment 9 mikhail.v.gavrilov 2018-04-18 08:48:53 UTC

I don't think so, because if this happens again for another reason, the GPU will hang again.
I would be happy if the driver had GPU reset code for this case.

Comment 10 Mauro Gaspari 2018-08-28 09:41:45 UTC

I am also affected by this bug. I filed a bug with openSUSE Tumbleweed, and it was closed earlier this year.
However, the issue resurfaced with the latest Mesa updates, so I reopened that bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1090456

System Info:
OS: openSUSE Tumbleweed x86_64 updated (2018 08 27)
Kernel: 4.18.0-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.1 Mesa 18.1.6
GPU: AMD Radeon RX Vega 64 8GB 

Relevant log lines I found during freeze:

2018-08-09T23:16:53.103775+08:00 MGDT-Tumbleweed kernel: [ 6305.852703] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1745163, last emitted seq=1745165
2018-08-09T23:16:53.103795+08:00 MGDT-Tumbleweed kernel: [ 6305.852704] [drm] No hardware hang detected. Did some blocks stall?


dmesg lines related to amdgpu:

[    3.130759] [drm] amdgpu kernel modesetting enabled.
[    3.135770] fb: switching to amdgpudrmfb from EFI VGA
[    3.136106] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.136171] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.136173] amdgpu 0000:03:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[    3.136494] [drm] amdgpu: 8176M of VRAM memory ready
[    3.136495] [drm] amdgpu: 8176M of GTT memory ready.
[    4.114469] fbcon: amdgpudrmfb (fb0) is primary device
[    4.141179] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    4.164072] amdgpu 0000:03:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.164074] amdgpu 0000:03:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.164076] amdgpu 0000:03:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.164077] amdgpu 0000:03:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.164078] amdgpu 0000:03:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.164080] amdgpu 0000:03:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.164081] amdgpu 0000:03:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.164082] amdgpu 0000:03:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.164083] amdgpu 0000:03:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.164084] amdgpu 0000:03:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.164086] amdgpu 0000:03:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.164087] amdgpu 0000:03:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.164553] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:03:00.0 on minor 0


As a side note, the freeze does not happen on my Kubuntu system, with the same hardware and the same games.

OS: Kubuntu 18.04 x86_64 updated (2018 08 27)
Kernel: 4.15.0-33-generic
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.0 Mesa 18.0.5
GPU: AMD Radeon RX Vega 64 8GB

Comment 11 GitLab Migration User 2019-09-25 18:03:26 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1307.
