Bug 104362 - GPU fault detected on wine-nine Path of Exile
Summary: GPU fault detected on wine-nine Path of Exile
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-21 18:56 UTC by Vladimir Usikov
Modified: 2019-02-18 22:08 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg log (194.02 KB, text/plain)
2017-12-21 18:56 UTC, Vladimir Usikov
dmesg (105.73 KB, text/plain)
2018-08-26 05:42 UTC, Vladimir Usikov
clean dmesg (71.20 KB, text/plain)
2018-08-29 03:08 UTC, Vladimir Usikov
glxinfo (143.02 KB, text/plain)
2018-08-29 03:08 UTC, Vladimir Usikov
dmesg_2 (71.16 KB, text/plain)
2018-09-01 15:01 UTC, Vladimir Usikov
UMR dump for PoE/gallium-nine induced AMDGPU hang (7.64 MB, text/plain)
2019-01-09 14:54 UTC, nmr
dmesg dump for hang (93.29 KB, text/plain)
2019-01-09 14:55 UTC, nmr
dmesg during reboot/recovery (108.03 KB, text/plain)
2019-01-15 12:19 UTC, nmr
UMR dump for similar dxvk hang (43.30 MB, text/plain)
2019-01-15 12:22 UTC, nmr

Description Vladimir Usikov 2017-12-21 18:56:21 UTC
Created attachment 136347 [details]
dmesg log

When I start the game Path of Exile on wine-nine, the computer freezes.
I can't switch to the kernel console.
When I connect over ssh, I can get the dmesg output.

Radeon HD 7950 (TAHITI)

kernel 4.13.4 - 4.14.6
kernel module AMDGPU

Mesa 17.2.1 - 17.3.1

wine-nine 2.20 - 2.21
Comment 1 Vladimir Usikov 2017-12-31 16:29:00 UTC
What I mean by freezing:
The computer does not respond to the keyboard or mouse.
When I press 'Num Lock' or 'Caps Lock', the LED does not light up.
Pressing Ctrl+Alt+F# does not switch to TTY#.
Today, when I came back to the hung computer, I saw a new error in dmesg.

[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292]       Tainted: G           O    4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0     D    0   660    629 0x00000000
[26173.119303] Call Trace:
[26173.119316]  ? __schedule+0x290/0x890
[26173.119320]  schedule+0x2f/0x90
[26173.119407]  amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415]  ? wait_woken+0x80/0x80
[26173.119488]  amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550]  amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612]  ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673]  amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734]  amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826]  drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851]  drm_ioctl+0x2d5/0x370 [drm]
[26173.119910]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119971]  do_vfs_ioctl+0xa1/0x610
[26173.119976]  ? SyS_futex+0x12d/0x180
[26173.119980]  SyS_ioctl+0x74/0x80
[26173.119984]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
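
For triage, hung-task reports like the one above can be pulled out of a saved dmesg log with a short script. The following is a minimal sketch and not part of the original report; the function name `find_hung_tasks` and the regular expression are assumptions based only on the line format shown in this trace.

```python
import re

# Matches the hung-task detector line as it appears in the trace above, e.g.
# "[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds."
# The task name may itself contain ':' so the pid is taken from the last
# ':<digits>' group before " blocked".
HUNG_TASK_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in the given dmesg text."""
    reports = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK_RE.match(line)
        if match:
            reports.append({
                "timestamp": float(match.group("ts")),
                "task": match.group("comm"),
                "pid": int(match.group("pid")),
                "blocked_secs": int(match.group("secs")),
            })
    return reports
```

Passing the contents of an attached dmesg file to `find_hung_tasks` gives a quick list of which tasks were blocked and when, which makes it easier to compare logs from different kernel versions.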
Comment 2 Vladimir Usikov 2018-08-26 05:42:43 UTC
Created attachment 141280 [details]
dmesg

The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.

After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
Comment 3 Andrey Grodzovsky 2018-08-27 21:15:21 UTC
(In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details]
> dmesg
> 
> The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.
> 
> After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

Please provide a clean new dmesg log and also glxinfo.
Comment 4 Vladimir Usikov 2018-08-29 03:08:07 UTC
Created attachment 141329 [details]
clean dmesg
Comment 5 Vladimir Usikov 2018-08-29 03:08:41 UTC
Created attachment 141330 [details]
glxinfo
Comment 6 Andrey Grodzovsky 2018-08-29 13:45:17 UTC
Did you do pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.
In the latest log I see no prints of a hang.
Comment 7 Vladimir Usikov 2018-09-01 15:01:36 UTC
Created attachment 141405 [details]
dmesg_2

>Did you do pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.

Yes, several times.

>In the latest log I see no prints of a hang.

Yes, you requested a clean dmesg.

Now I am attaching dmesg output without pm-hibernate.
Comment 8 Andrey Grodzovsky 2018-09-04 19:00:03 UTC
We can try to check the gfx command buffer for the latest commands and the CU status.

Clone and build our open source register analyzer from here: https://cgit.freedesktop.org/amd/umr/

After the hang happens, please collect the following outputs:

sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
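
The collection steps above could be wrapped in a small script so that nothing is forgotten while the machine is hung and only reachable over ssh. This is a sketch only: the helper name `collect_dumps` and its `out_dir` and `run` parameters are hypothetical, while the umr and dmesg command lines are copied from the list above.

```python
import os
import subprocess

# (argv, output file) pairs. The command lines are taken from the comment
# above; the Python wrapper around them is an assumption for illustration.
STEPS = [
    (["sudo", "umr", "-lb"], "umr_dump"),
    (["sudo", "umr", "-O", "verbose,use_colour", "-R", "gfx[.]"], "umr_dump"),
    (["sudo", "umr", "-O", "halt_waves,use_colour", "-wa"], "umr_dump"),
    (["dmesg"], "dmesg_dump"),
]


def collect_dumps(out_dir=".", run=subprocess.run):
    """Run each step in order and store its stdout in the matching dump file."""
    written = set()
    for argv, name in STEPS:
        path = os.path.join(out_dir, name)
        # Truncate on first use of a file and append afterwards, mirroring
        # the ">" and ">>" redirects in the shell commands.
        mode = "a" if path in written else "w"
        written.add(path)
        result = run(argv, capture_output=True, text=True)
        with open(path, mode) as out:
            out.write(result.stdout)
        if result.returncode != 0:
            print("warning: %s exited with status %d"
                  % (" ".join(argv), result.returncode))
```

Calling `collect_dumps()` from an ssh session after the hang would leave `umr_dump` and `dmesg_dump` in the current directory, ready to attach. The `run` parameter exists only so the function can be exercised without a GPU.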
Comment 9 Vladimir Usikov 2018-10-16 18:22:54 UTC
My Radeon 7950 died, so I can't test any more.
Comment 10 nmr 2019-01-09 14:54:51 UTC
Created attachment 143036 [details]
UMR dump for PoE/gallium-nine induced AMDGPU hang

I am experiencing the same bug, here is the UMR dump.
Comment 11 nmr 2019-01-09 14:55:19 UTC
Created attachment 143037 [details]
dmesg dump for hang
Comment 12 nmr 2019-01-11 13:09:06 UTC
Andrey is there anything I can do to help resolve this? Be happy to help. Haven't looked at the ring buffers, is there some kind of deadlock in there?
Comment 13 Andrey Grodzovsky 2019-01-11 17:38:07 UTC
(In reply to nmr from comment #10)
> Created attachment 143036 [details]
> UMR dump for PoE/gallium-nine induced AMDGPU hang
> 
> I am experiencing the same bug, here is the UMR dump.

Marek, I am seeing waves dumps in here during the hang, could you please take a look and advise ?
Comment 14 Marek Olšák 2019-01-15 00:51:11 UTC
There is some branching and SGPR spilling, so I guess that's the problem.
Comment 15 nmr 2019-01-15 00:58:52 UTC
Marek forgive my ignorance but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow and the timeout resulting in a kernel module abort?
Comment 16 Marek Olšák 2019-01-15 01:34:01 UTC
The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.
Comment 17 Marek Olšák 2019-01-15 01:39:15 UTC
(In reply to nmr from comment #15)
> Marek forgive my ignorance but why would SGPR spilling or branching cause
> the hang? Is the shader just timing out somehow and the timeout resulting in
> a kernel module abort?

Pretty much. The shader is stuck and doesn't continue. Also the shader is insanely huge with lots of SGPR spilling and branching.
Comment 18 nmr 2019-01-15 12:19:04 UTC
Created attachment 143124 [details]
dmesg during reboot/recovery

On the basis that it may be shader-induced, I reproduced a similar hang with dxvk (v0.95-5-gcc38412).

FWIW here's dmesg during the subsequent GPU recovery (which fails :( ) and reboot (which hangs). It appears hung on a DMA, and/or hung doing a modeset, acquiring the modeset lock.
Comment 19 nmr 2019-01-15 12:22:12 UTC
Created attachment 143125 [details]
UMR dump for similar dxvk hang

I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.

>   pgm[7@0x800100025000 + 0x94  ] = 0x3727c5ac      ;;                                                          

Are these timed NOPs, or something to achieve the correct cycle delay to avoid load hazards?
Comment 20 Axel Davy 2019-01-22 19:43:05 UTC
From the previous comments, it looks like the problem is in radeonsi.

As a 'temporary fix', you could try this patch:
https://github.com/iXit/Mesa-3D/commit/976f3fe791b0aa34cc04eaac53147eb60089e0f7

This patch recompiles the shaders with the boolean and integer constant values given by the app, so the branches controlled by them are simplified.
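
As a toy illustration of the idea (not the actual Mesa patch), specializing a shader on known boolean constants amounts to replacing each branch on such a constant with the side that is taken. The shader representation and the `specialize` function below are hypothetical and chosen only for this sketch.

```python
def specialize(block, bool_consts):
    """Fold branches whose condition is a known boolean constant.

    block is a list of instructions. A branch is the tuple
    ("if", const_name, then_block, else_block); anything else is treated
    as an opaque instruction and copied through unchanged.
    """
    out = []
    for instr in block:
        if isinstance(instr, tuple) and instr[0] == "if":
            _, name, then_block, else_block = instr
            if name in bool_consts:
                # Condition is known: keep only the taken side, inlined.
                taken = then_block if bool_consts[name] else else_block
                out.extend(specialize(taken, bool_consts))
            else:
                # Condition unknown: keep the branch, but simplify inside it.
                out.append(("if", name,
                            specialize(then_block, bool_consts),
                            specialize(else_block, bool_consts)))
        else:
            out.append(instr)
    return out


# Hypothetical shader with two constant-controlled branches.
shader = [
    "mov r0, v0",
    ("if", "b0", ["texld r1, t0, s0"], ["mov r1, c1"]),
    ("if", "b1", [("if", "b0", ["add r0, r0, r1"], [])], ["nop"]),
]
specialized = specialize(shader, {"b0": True})
```

With `b0` known, both branches on `b0` disappear and only the branch on the still unknown `b1` remains, which is the kind of simplification that reduces branching in the generated code.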
Comment 21 nmr 2019-01-22 21:54:43 UTC
There may also be a bug in radeonsi, and thanks for the heads up, but every circumstance where user code causes a kernel hang is a bug.
Comment 22 nmr 2019-02-08 14:31:08 UTC
amdgpu still hangs kernel in Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
Comment 23 nmr 2019-02-18 14:40:38 UTC
Marek, is this even the right bug tracker for the kernel module or is this just for user space?
Comment 24 Alex Deucher 2019-02-18 16:02:10 UTC
(In reply to nmr from comment #23)
> Marek, is this even the right bug tracker for the kernel module or is this
> just for user space?

Same bug tracker for all components.
Comment 25 nmr 2019-02-18 16:38:09 UTC
Is it likely that this hang will get any traction with the AMDGPU team? Or should I just close it and reset my expectations?
Comment 26 nmr 2019-02-18 21:04:52 UTC
I'm getting the impression that AMD does not regard the kernel hang as the underlying issue. Is that correct?
Comment 27 Alex Deucher 2019-02-18 21:13:57 UTC
(In reply to nmr from comment #26)
> I'm getting the impression that AMD does not regard the kernel hang as the
> underlying issue. Is that correct?

The GPU hang is most likely caused by a bug in Mesa.  What kernel are you using?  GPU reset was only recently enabled by default on certain ASICs.  Even if a GPU reset is successful, user mode programs (like X or the Wayland desktop compositor) need to properly catch and handle GPU resets, which they currently don't.  Can you try 4.20 or newer?
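
When reporting back, it can help to state whether the running kernel meets the suggested minimum. A minimal sketch, assuming the release string starts with `major.minor` as in the `4.19.0-2-amd64` string quoted in this report; `kernel_at_least` is a hypothetical helper.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string such as '4.19.0-2-amd64' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


# platform.release() returns the same string as `uname -r`.
meets_minimum = kernel_at_least(platform.release(), 4, 20)
```

This only compares version numbers; it says nothing about whether GPU reset is actually enabled for a given ASIC on that kernel.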
Comment 28 nmr 2019-02-18 22:08:42 UTC
I get that it's triggered by Mesa, but don't you think it's a bug itself that user-space can hang the kernel? I can't even switch virtual consoles when it hangs. 

I'm currently running Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux

I'll report back when I upgrade to 4.20

