Created attachment 136347 [details]
When I start the game Path of Exile on wine-nine, the computer freezes.
I can't switch to a kernel console.
When I connect over ssh, I can get dmesg output.
Radeon HD 7950 (TAHITI)
kernel 4.13.4 - 4.14.6
kernel module AMDGPU
Mesa 17.2.1 - 17.3.1
wine-nine 2.20 - 2.21
What I mean by freezing.
The computer does not respond to the keyboard and mouse.
When I press 'Num Lock' or 'Caps Lock' the LED does not light up.
Pressing Ctrl+Alt+F# does not switch to TTY#.
Today, when I went back to the hung computer, I saw a new error in dmesg.
[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292] Tainted: G O 4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0 D 0 660 629 0x00000000
[26173.119303] Call Trace:
[26173.119316] ? __schedule+0x290/0x890
[26173.119407] amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415] ? wait_woken+0x80/0x80
[26173.119488] amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550] amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612] ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673] amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734] amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826] drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851] drm_ioctl+0x2d5/0x370 [drm]
[26173.119910] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119976] ? SyS_futex+0x12d/0x180
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
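For anyone reproducing this over ssh, here is a minimal sketch for pulling the hung-task reports out of dmesg. The function name hung_tasks is made up for illustration, and the pattern assumes the message format shown in the trace above.

```shell
#!/bin/sh
# hung_tasks: hypothetical helper, not part of any existing tool.
# Reads dmesg text on stdin and prints the name of each task that the
# kernel's hung-task detector reported as blocked.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than [0-9][0-9]* seconds\..*/\1/p'
}

# Usage (from an ssh session while the machine is hung):
#   dmesg | hung_tasks
```

The trailing number after the last colon is the PID, so it is dropped and only the task name is printed.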
Created attachment 141280 [details]
The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.
After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
(In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details]
> Freeze still going on Linux 4.18.4 and mesa 18.1.7. Dmesg different.
> After freeze i try cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
Please provide a clean new dmesg log and also glxinfo output.
Created attachment 141329 [details]
Created attachment 141330 [details]
Did you do pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.
In the latest log I see no prints of a hang.
Created attachment 141405 [details]
>Did you do pm-hibernate before the hang happened ? I see a hibernate print just before the hang in the previous log ?
Yes, several times.
>In latest log i see no prints of hang.
Yes, you requested a clean dmesg.
Now I am attaching dmesg output without pm-hibernate.
We can try to check the gfx command buffer for the latest commands and the CU status.
Clone and build our open source register analyzer from here: https://cgit.freedesktop.org/amd/umr/
After the hang happens, please collect the following outputs:
sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
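The same steps can be wrapped in a small script so they can be run in one go from an ssh session. This is only a sketch: the function name collect_dumps, the output directory argument, and the DRY_RUN switch are assumptions for illustration, while the umr invocations are the ones listed above. The gfx[.] argument is quoted so the shell does not try to expand it as a glob.

```shell
#!/bin/sh
# collect_dumps: hypothetical wrapper around the commands listed above.
# $1 is the directory to write the dumps into (default: current directory).
# With DRY_RUN=1 it only prints the commands instead of running them.
collect_dumps() {
    out=${1:-.}
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show what would be run.
        cat <<EOF
sudo umr -lb > $out/umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> $out/umr_dump
sudo umr -O halt_waves,use_colour -wa >> $out/umr_dump
dmesg > $out/dmesg_dump
EOF
        return 0
    fi
    sudo umr -lb > "$out/umr_dump"
    sudo umr -O verbose,use_colour -R 'gfx[.]' >> "$out/umr_dump"
    sudo umr -O halt_waves,use_colour -wa >> "$out/umr_dump"
    dmesg > "$out/dmesg_dump"
}

# Usage:
#   DRY_RUN=1 collect_dumps /tmp   # print the commands only
#   collect_dumps /tmp             # actually collect the dumps
```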
My Radeon 7950 is dead, so I can't test any more.
Created attachment 143036 [details]
UMR dump for PoE/gallium-nine induced AMDGPU hang
I am experiencing the same bug, here is the UMR dump.
Created attachment 143037 [details]
dmesg dump for hang
Andrey, is there anything I can do to help resolve this? I'd be happy to help. I haven't looked at the ring buffers; is there some kind of deadlock in there?
(In reply to nmr from comment #10)
> Created attachment 143036 [details]
> UMR dump for PoE/gallium-nine induced AMDGPU hang
> I am experiencing the same bug, here is the UMR dump.
Marek, I am seeing wave dumps in here during the hang. Could you please take a look and advise?
There is some branching and SGPR spilling, so I guess that's the problem.
Marek forgive my ignorance but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow and the timeout resulting in a kernel module abort?
The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.
(In reply to nmr from comment #15)
> Marek forgive my ignorance but why would SGPR spilling or branching cause
> the hang? Is the shader just timing out somehow and the timeout resulting in
> a kernel module abort?
Pretty much. The shader is stuck and doesn't continue. Also the shader is insanely huge with lots of SGPR spilling and branching.
Created attachment 143124 [details]
dmesg during reboot/recovery
On the basis that it may be shader induced I repro'd a similar hang with dxvk (v0.95-5-gcc38412).
FWIW here's dmesg during subsequent GPU recovery (which fails :( ) and reboot (which hangs.) It appears hung on a DMA, and/or hung doing a modeset, acquiring the modeset lock.
Created attachment 143125 [details]
UMR dump for similar dxvk hang
I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.
> pgm[7@0x800100025000 + 0x94 ] = 0x3727c5ac ;;
Are these timed NOPs, or something to achieve the correct cycle delay to avoid load hazards?
From the previous comments, it looks like the problem is in radeonsi.
As a 'temporary fix', you could try this patch:
This patch recompiles the shaders with the boolean and integer constant values given by the app, thus the branches controlled by them are simplified.
There may also be a bug in radeonsi, and thanks for the heads up, but every circumstance where user code causes a kernel hang is a bug.
amdgpu still hangs the kernel on Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
Marek, is this even the right bug tracker for the kernel module or is this just for user space?
(In reply to nmr from comment #23)
> Marek, is this even the right bug tracker for the kernel module or is this
> just for user space?
Same bug tracker for all components.
Is it likely that this hang will get any traction with the AMDGPU team? Or should I just close it and reset my expectations?
I'm getting the impression that AMD does not regard the kernel hang as the underlying issue. Is that correct?
(In reply to nmr from comment #26)
> I'm getting the impression that AMD does not regard the kernel hang as the
> underlying issue. Is that correct?
The GPU hang is most likely caused by a bug in Mesa. What kernel are you using? GPU reset was only recently enabled by default on certain ASICs. Even if a GPU reset is successful, user mode programs (like X or the Wayland desktop compositor) need to properly catch and handle GPU resets, which they currently don't. Can you try 4.20 or newer?
I get that it's triggered by Mesa, but don't you think it's a bug itself that user-space can hang the kernel? I can't even switch virtual consoles when it hangs.
I'm currently running Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
I'll report back when I upgrade to 4.20
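For reference, a quick way to check whether a given machine already meets the "4.20 or newer" suggestion. This is a sketch only: kernel_at_least is a made-up helper name, and it assumes a sort that supports -V (GNU coreutils).

```shell
#!/bin/sh
# kernel_at_least: hypothetical helper, not part of any existing tool.
# Succeeds when version $2 (default: the running kernel from uname -r)
# is at least version $1. Assumes sort supports -V (GNU coreutils).
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    # Version-sort both strings; if the wanted version sorts first
    # (or they are equal), the kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Usage:
#   kernel_at_least 4.20 && echo "kernel is 4.20 or newer"
```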