104362 – GPU fault detected on wine-nine Path of Exile

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Reported:	2017-12-21 18:56 UTC by Vladimir Usikov
Modified:	2019-09-25 18:02 UTC
CC List:	3 users

Attachments
dmesg log (194.02 KB, text/plain) 2017-12-21 18:56 UTC, Vladimir Usikov
dmesg (105.73 KB, text/plain) 2018-08-26 05:42 UTC, Vladimir Usikov
clean dmesg (71.20 KB, text/plain) 2018-08-29 03:08 UTC, Vladimir Usikov
glxinfo (143.02 KB, text/plain) 2018-08-29 03:08 UTC, Vladimir Usikov
dmesg_2 (71.16 KB, text/plain) 2018-09-01 15:01 UTC, Vladimir Usikov
UMR dump for PoE/gallium-nine induced AMDGPU hang (7.64 MB, text/plain) 2019-01-09 14:54 UTC, nmr
dmesg dump for hang (93.29 KB, text/plain) 2019-01-09 14:55 UTC, nmr
dmesg during reboot/recovery (108.03 KB, text/plain) 2019-01-15 12:19 UTC, nmr
UMR dump for similar dxvk hang (43.30 MB, text/plain) 2019-01-15 12:22 UTC, nmr

Description Vladimir Usikov 2017-12-21 18:56:21 UTC

Created attachment 136347 [details]
dmesg log

When I start the game Path of Exile on wine-nine, the computer freezes.
I can't switch to a kernel console.
When I connect over ssh, I can still get the dmesg output.

Radeon HD 7950 (TAHITI)

kernel 4.13.4 - 4.14.6
kernel module AMDGPU

Mesa 17.2.1 - 17.3.1

wine-nine 2.20 - 2.21

Comment 1 Vladimir Usikov 2017-12-31 16:29:00 UTC

What I mean by freezing:
The computer does not respond to the keyboard and mouse.
When I press 'Num Lock' or 'Caps Lock', the LED does not light up.
Pressing Ctrl+Alt+F# does not switch to TTY#.
Today, when I checked the hung computer, I saw a new error in dmesg:

[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292]       Tainted: G           O    4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0     D    0   660    629 0x00000000
[26173.119303] Call Trace:
[26173.119316]  ? __schedule+0x290/0x890
[26173.119320]  schedule+0x2f/0x90
[26173.119407]  amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415]  ? wait_woken+0x80/0x80
[26173.119488]  amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550]  amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612]  ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673]  amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734]  amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826]  drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851]  drm_ioctl+0x2d5/0x370 [drm]
[26173.119910]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119971]  do_vfs_ioctl+0xa1/0x610
[26173.119976]  ? SyS_futex+0x12d/0x180
[26173.119980]  SyS_ioctl+0x74/0x80
[26173.119984]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
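
For anyone triaging a similar log, here is a minimal sketch that pulls the names of blocked tasks out of dmesg text. The helper name `hung_tasks` is made up for illustration, and it assumes only the "INFO: task ... blocked for more than" message format seen in the trace above.

```shell
#!/bin/sh
# hung_tasks (hypothetical helper): read dmesg text on stdin and print the
# name of every task the kernel reported as blocked. It assumes the
# hung-task message format seen in this report:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
hung_tasks() {
    # The greedy \(.*\) keeps colons that belong to the task name
    # (e.g. "amdgpu_cs:0"); the trailing :[0-9]* strips only the PID.
    sed -n 's/.*INFO: task \(.*\):[0-9]* blocked for more than.*/\1/p'
}
```

Typical use would be `dmesg | hung_tasks` on the affected machine, or over ssh when the local console is frozen.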

Comment 2 Vladimir Usikov 2018-08-26 05:42:43 UTC

Created attachment 141280 [details]
dmesg

The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.

After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

Comment 3 Andrey Grodzovsky 2018-08-27 21:15:21 UTC

(In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details]
> dmesg
> 
> The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.
> 
> After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

Please provide a clean new dmesg log and also glxinfo.

Comment 4 Vladimir Usikov 2018-08-29 03:08:07 UTC

Created attachment 141329 [details]
clean dmesg

Comment 5 Vladimir Usikov 2018-08-29 03:08:41 UTC

Created attachment 141330 [details]
glxinfo

Comment 6 Andrey Grodzovsky 2018-08-29 13:45:17 UTC

Did you do pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.
In the latest log I see no prints of a hang.

Comment 7 Vladimir Usikov 2018-09-01 15:01:36 UTC

Created attachment 141405 [details]
dmesg_2

> Did you do pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.

Yes, several times.

> In the latest log I see no prints of a hang.

Yes, you requested a clean dmesg.

I am now attaching dmesg output without pm-hibernate.

Comment 8 Andrey Grodzovsky 2018-09-04 19:00:03 UTC

We can try to check the gfx command buffer for the latest commands and the CU status.

Clone and build our open source register analyzer from https://cgit.freedesktop.org/amd/umr/

After the hang happens, please collect the following outputs:

sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
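
Since the local console is unusable during the hang, it may help to have the four commands above ready to run in one go over ssh. The following is only a sketch: the function name `hang_dump_commands` and the output-directory argument are assumptions for illustration, and the umr invocations are copied unchanged from the list above (with `gfx[.]` quoted so the shell cannot glob it).

```shell
#!/bin/sh
# hang_dump_commands (hypothetical helper): print the capture commands from
# this comment, redirected into a chosen output directory. Nothing is run
# here; the printed list is meant to be reviewed and then piped to sh.
hang_dump_commands() {
    out=${1:-.}   # output directory, defaults to the current directory
    cat <<EOF
mkdir -p "$out"
sudo umr -lb > "$out/umr_dump"
sudo umr -O verbose,use_colour -R 'gfx[.]' >> "$out/umr_dump"
sudo umr -O halt_waves,use_colour -wa >> "$out/umr_dump"
dmesg > "$out/dmesg_dump"
EOF
}
```

For example, `hang_dump_commands /tmp/hang | sh` would run the sequence and leave `umr_dump` and `dmesg_dump` in `/tmp/hang`.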

Comment 9 Vladimir Usikov 2018-10-16 18:22:54 UTC

My Radeon 7950 is dead, so I can't test any more.

Comment 10 nmr 2019-01-09 14:54:51 UTC

Created attachment 143036 [details]
UMR dump for PoE/gallium-nine induced AMDGPU hang

I am experiencing the same bug, here is the UMR dump.

Comment 11 nmr 2019-01-09 14:55:19 UTC

Created attachment 143037 [details]
dmesg dump for hang

Comment 12 nmr 2019-01-11 13:09:06 UTC

Andrey, is there anything I can do to help resolve this? I'd be happy to help. I haven't looked at the ring buffers; is there some kind of deadlock in there?

Comment 13 Andrey Grodzovsky 2019-01-11 17:38:07 UTC

(In reply to nmr from comment #10)
> Created attachment 143036 [details]
> UMR dump for PoE/gallium-nine induced AMDGPU hang
> 
> I am experiencing the same bug, here is the UMR dump.

Marek, I am seeing wave dumps in here during the hang. Could you please take a look and advise?

Comment 14 Marek Olšák 2019-01-15 00:51:11 UTC

There is some branching and SGPR spilling, so I guess that's the problem.

Comment 15 nmr 2019-01-15 00:58:52 UTC

Marek, forgive my ignorance, but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow and the timeout resulting in a kernel module abort?

Comment 16 Marek Olšák 2019-01-15 01:34:01 UTC

The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.

Comment 17 Marek Olšák 2019-01-15 01:39:15 UTC

(In reply to nmr from comment #15)
> Marek, forgive my ignorance, but why would SGPR spilling or branching cause
> the hang? Is the shader just timing out somehow and the timeout resulting in
> a kernel module abort?

Pretty much. The shader is stuck and doesn't continue. Also the shader is insanely huge with lots of SGPR spilling and branching.

Comment 18 nmr 2019-01-15 12:19:04 UTC

Created attachment 143124 [details]
dmesg during reboot/recovery

On the basis that it may be shader induced I repro'd a similar hang with dxvk (v0.95-5-gcc38412). 

FWIW, here's the dmesg during the subsequent GPU recovery (which fails :( ) and reboot (which hangs). It appears to be hung on a DMA, and/or hung doing a modeset while acquiring the modeset lock.

Comment 19 nmr 2019-01-15 12:22:12 UTC

Created attachment 143125 [details]
UMR dump for similar dxvk hang

I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.

>   pgm[7@0x800100025000 + 0x94  ] = 0x3727c5ac      ;;

Are these timed NOPs, or something else meant to achieve the correct cycle delay and avoid load hazards?

Comment 20 Axel Davy 2019-01-22 19:43:05 UTC

From the previous comments, it looks like the problem is in radeonsi.

As a 'temporary fix', you could try this patch:
https://github.com/iXit/Mesa-3D/commit/976f3fe791b0aa34cc04eaac53147eb60089e0f7

This patch recompiles the shaders with the boolean and integer constant values given by the app, thus the branches controlled by them are simplified.

Comment 21 nmr 2019-01-22 21:54:43 UTC

There may also be a bug in radeonsi, and thanks for the heads up, but every circumstance where user code causes a kernel hang is a bug.

Comment 22 nmr 2019-02-08 14:31:08 UTC

amdgpu still hangs kernel in Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux

Comment 23 nmr 2019-02-18 14:40:38 UTC

Marek, is this even the right bug tracker for the kernel module or is this just for user space?

Comment 24 Alex Deucher 2019-02-18 16:02:10 UTC

(In reply to nmr from comment #23)
> Marek, is this even the right bug tracker for the kernel module or is this
> just for user space?

Same bug tracker for all components.

Comment 25 nmr 2019-02-18 16:38:09 UTC

Is it likely that this hang will get any traction with the AMDGPU team? Or should I just close it and reset my expectations?

Comment 26 nmr 2019-02-18 21:04:52 UTC

I'm getting the impression that AMD does not regard the kernel hang as the underlying issue. Is that correct?

Comment 27 Alex Deucher 2019-02-18 21:13:57 UTC

(In reply to nmr from comment #26)
> I'm getting the impression that AMD does not regard the kernel hang as the
> underlying issue. Is that correct?

The GPU hang is most likely caused by a bug in Mesa.  What kernel are you using?  GPU reset was only recently enabled by default on certain ASICs.  Even if a GPU reset is successful, user mode programs (like X or the Wayland desktop compositor) need to properly catch and handle GPU resets, which they currently don't.  Can you try 4.20 or newer?
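
To check quickly whether a machine already runs 4.20 or newer, a small sketch like the one below can compare the release string from `uname -r`. The function name `kernel_at_least` is made up for illustration, and it assumes the release string starts with `major.minor` as in the versions quoted in this report (for example `4.19.0-2-amd64`).

```shell
#!/bin/sh
# kernel_at_least (hypothetical helper): succeed if the kernel release string
# in $1 is at least version $2.$3. Only major and minor are compared.
kernel_at_least() {
    kal_release=$1
    kal_major=${kal_release%%.*}        # text before the first dot
    kal_rest=${kal_release#*.}          # text after the first dot
    kal_minor=${kal_rest%%[!0-9]*}      # leading digits of what remains
    [ "$kal_major" -gt "$2" ] ||
        { [ "$kal_major" -eq "$2" ] && [ "$kal_minor" -ge "$3" ]; }
}
```

Usage would look like `kernel_at_least "$(uname -r)" 4 20 && echo "4.20 or newer" || echo "older than 4.20"`.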

Comment 28 nmr 2019-02-18 22:08:42 UTC

I get that it's triggered by Mesa, but don't you think it's a bug itself that user-space can hang the kernel? I can't even switch virtual consoles when it hangs. 

I'm currently running Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux

I'll report back when I upgrade to 4.20

Comment 29 GitLab Migration User 2019-09-25 18:02:08 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1295.
