Created attachment 136347 [details]
When I start the game Path of Exile on wine-nine, the computer freezes.
I can't switch to the kernel console.
When I connect over ssh, I can still get dmesg output.
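Since the machine stays reachable over ssh while the display is frozen, the kernel log can be copied to another computer before rebooting. A minimal sketch, assuming key or password ssh access to the frozen machine; the helper name `save_remote_dmesg`, the default file name, and the example destination are made up for illustration:

```shell
# Hypothetical helper: save the kernel log of the frozen machine over ssh.
# $1 is the ssh destination (for example user@frozen-box, a made-up name),
# $2 is the local file to write (defaults to dmesg_remote.txt).
save_remote_dmesg() {
    local host=$1
    local out=${2:-dmesg_remote.txt}
    # dmesg runs on the remote side; only its output crosses the connection.
    ssh "$host" dmesg > "$out" || return 1
    echo "saved kernel log to $out"
}
```

It would be called as `save_remote_dmesg user@frozen-box hang.txt`. On systems that restrict dmesg to root, the remote command may need to be run as root instead.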
Radeon HD 7950 (TAHITI)
kernel 4.13.4 - 4.14.6
kernel module AMDGPU
Mesa 17.2.1 - 17.3.1
wine-nine 2.20 - 2.21
What I mean by freezing.
The computer does not respond to the keyboard and mouse.
When I press 'Num Lock' or 'Caps Lock' the LED does not light up.
Pressing Ctrl+Alt+F# does not switch to TTY#.
Today, when I came back to the hung computer, I saw a new error in dmesg.
[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292] Tainted: G O 4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0 D 0 660 629 0x00000000
[26173.119303] Call Trace:
[26173.119316] ? __schedule+0x290/0x890
[26173.119407] amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415] ? wait_woken+0x80/0x80
[26173.119488] amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550] amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612] ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673] amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734] amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826] drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851] drm_ioctl+0x2d5/0x370 [drm]
[26173.119910] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119976] ? SyS_futex+0x12d/0x180
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
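The register dump at the end of the trace can be read by hand: ORIG_RAX 0x10 is syscall 16, which is ioctl on x86-64, and RSI holds the ioctl request number 0xc0186444. A minimal sketch of decoding that number, assuming the generic Linux _IOC bit layout (8 bits command number, 8 bits type, 14 bits size, 2 bits direction); `decode_ioctl` is a name made up for this example:

```shell
# Hypothetical helper: split a Linux ioctl request number into its _IOC fields.
# Assumes the generic layout: nr in bits 0-7, type in bits 8-15,
# size in bits 16-29, direction in bits 30-31.
decode_ioctl() {
    local req=$(( $1 ))
    local nr=$(( req & 0xff ))
    local type=$(( (req >> 8) & 0xff ))
    local size=$(( (req >> 16) & 0x3fff ))
    local dir=$(( (req >> 30) & 0x3 ))
    printf 'dir=%d size=%d type=0x%02x nr=0x%02x\n' "$dir" "$size" "$type" "$nr"
}

# RSI value taken from the trace above.
decode_ioctl 0xc0186444
```

This prints `dir=3 size=24 type=0x64 nr=0x44`. Type 0x64 is ASCII 'd', the DRM ioctl type, and 0x44 is DRM_COMMAND_BASE (0x40) plus 0x04, which is DRM_AMDGPU_CS in the amdgpu UAPI header. That matches amdgpu_cs_ioctl in the call trace: the blocked thread was submitting a command stream.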
Created attachment 141280 [details]
The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.
After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
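Reading that debugfs file asks amdgpu to attempt a GPU reset. A small sketch that does the same with a basic check first, assuming debugfs is mounted at /sys/kernel/debug and the card is DRM device 0 as in the command above; `try_gpu_recover` is a hypothetical helper name, and reading the file normally requires root:

```shell
# Hypothetical helper: trigger an amdgpu GPU recovery attempt through debugfs.
# $1 optionally overrides the debugfs node (the default assumes DRM device 0).
try_gpu_recover() {
    local node=${1:-/sys/kernel/debug/dri/0/amdgpu_gpu_recover}
    if [ ! -r "$node" ]; then
        echo "cannot read $node (debugfs not mounted, or not root?)" >&2
        return 1
    fi
    # The read itself is what starts the recovery attempt.
    cat "$node"
}
```

Running `dmesg` afterwards shows whether the reset attempt succeeded.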
(In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details]
> Freeze still going on Linux 4.18.4 and mesa 18.1.7. Dmesg different.
> After freeze i try cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
Please provide a clean new dmesg log and also glxinfo output.
Created attachment 141329 [details]
Created attachment 141330 [details]
Did you run pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.
In the latest log I see no prints of a hang.
Created attachment 141405 [details]
>Did you do pm-hibernate before the hang happened ? I see a hibernate print just before the hang in the previous log ?
Yes, several times.
>In latest log i see no prints of hang.
Yes, you requested a clean dmesg.
I am now attaching dmesg output captured without pm-hibernate.
We can try to check the gfx command buffer for the latest commands and the status of the CUs.
Clone and build our open source register analyzer from here - https://cgit.freedesktop.org/amd/umr/
After the hang happens, please collect the following outputs:
sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
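The four commands above can be wrapped in one function so they are quick to run over ssh once the hang has happened. This is only a sketch: `collect_hang_dumps` and its optional output directory argument are made up for convenience, while the umr options and output file names are exactly the ones requested above.

```shell
# Hypothetical wrapper around the commands requested above; run it after the hang.
# $1 is an optional output directory (defaults to the current directory).
collect_hang_dumps() {
    local dir=${1:-.}
    mkdir -p "$dir" || return 1
    sudo umr -lb > "$dir/umr_dump"
    # 'gfx[.]' is quoted so the shell cannot expand it as a glob pattern.
    sudo umr -O verbose,use_colour -R 'gfx[.]' >> "$dir/umr_dump"
    sudo umr -O halt_waves,use_colour -wa >> "$dir/umr_dump"
    dmesg > "$dir/dmesg_dump"
}
```

Calling `collect_hang_dumps ~/hang1` would leave umr_dump and dmesg_dump in that directory, ready to attach to the bug.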
My Radeon 7950 is dead, so I can't test any more.
Created attachment 143036 [details]
UMR dump for PoE/gallium-nine induced AMDGPU hang
I am experiencing the same bug, here is the UMR dump.
Created attachment 143037 [details]
dmesg dump for hang
Andrey, is there anything I can do to help resolve this? I'd be happy to help. I haven't looked at the ring buffers; is there some kind of deadlock in there?
(In reply to nmr from comment #10)
> Created attachment 143036 [details]
> UMR dump for PoE/gallium-nine induced AMDGPU hang
> I am experiencing the same bug, here is the UMR dump.
Marek, I am seeing wave dumps in here from during the hang. Could you please take a look and advise?
There is some branching and SGPR spilling, so I guess that's the problem.
Marek, forgive my ignorance, but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow, with the timeout resulting in a kernel module abort?
The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.
(In reply to nmr from comment #15)
> Marek forgive my ignorance but why would SGPR spilling or branching cause
> the hang? Is the shader just timing out somehow and the timeout resulting in
> a kernel module abort?
Pretty much. The shader is stuck and doesn't continue. Also the shader is insanely huge with lots of SGPR spilling and branching.
Created attachment 143124 [details]
dmesg during reboot/recovery
On the basis that it may be shader induced I repro'd a similar hang with dxvk (v0.95-5-gcc38412).
FWIW here's dmesg during subsequent GPU recovery (which fails :( ) and reboot (which hangs.) It appears hung on a DMA, and/or hung doing a modeset, acquiring the modeset lock.
Created attachment 143125 [details]
UMR dump for similar dxvk hang
I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.
> pgm[7@0x800100025000 + 0x94 ] = 0x3727c5ac ;;
Are these timed NOPs, or something similar, meant to achieve the correct cycle delay and avoid load hazards?