Bug 104362 - GPU fault detected on wine-nine Path of Exile
Summary: GPU fault detected on wine-nine Path of Exile
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-21 18:56 UTC by Vladimir Usikov
Modified: 2019-01-15 12:22 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg log (194.02 KB, text/plain), 2017-12-21 18:56 UTC, Vladimir Usikov
dmesg (105.73 KB, text/plain), 2018-08-26 05:42 UTC, Vladimir Usikov
clean dmesg (71.20 KB, text/plain), 2018-08-29 03:08 UTC, Vladimir Usikov
glxinfo (143.02 KB, text/plain), 2018-08-29 03:08 UTC, Vladimir Usikov
dmesg_2 (71.16 KB, text/plain), 2018-09-01 15:01 UTC, Vladimir Usikov
UMR dump for PoE/gallium-nine induced AMDGPU hang (7.64 MB, text/plain), 2019-01-09 14:54 UTC, nmr
dmesg dump for hang (93.29 KB, text/plain), 2019-01-09 14:55 UTC, nmr
dmesg during reboot/recovery (108.03 KB, text/plain), 2019-01-15 12:19 UTC, nmr
UMR dump for similar dxvk hang (43.30 MB, text/plain), 2019-01-15 12:22 UTC, nmr

Description Vladimir Usikov 2017-12-21 18:56:21 UTC
Created attachment 136347 [details]
dmesg log

When I start the game Path of Exile on wine-nine, the computer freezes.
I can't switch to the kernel console.
When I connect over SSH, I can still get dmesg output.

Radeon HD 7950 (TAHITI)

kernel 4.13.4 - 4.14.6
kernel module AMDGPU

Mesa 17.2.1 - 17.3.1

wine-nine 2.20 - 2.21
Comment 1 Vladimir Usikov 2017-12-31 16:29:00 UTC
To clarify what I mean by freezing:
The computer does not respond to the keyboard or mouse.
When I press 'Num Lock' or 'Caps Lock', the LED does not light up.
Pressing Ctrl+Alt+F# does not switch to TTY#.
Today, when I came back to the hung computer, I saw a new error in dmesg.

[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292]       Tainted: G           O    4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0     D    0   660    629 0x00000000
[26173.119303] Call Trace:
[26173.119316]  ? __schedule+0x290/0x890
[26173.119320]  schedule+0x2f/0x90
[26173.119407]  amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415]  ? wait_woken+0x80/0x80
[26173.119488]  amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550]  amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612]  ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673]  amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734]  amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826]  drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851]  drm_ioctl+0x2d5/0x370 [drm]
[26173.119910]  ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119971]  do_vfs_ioctl+0xa1/0x610
[26173.119976]  ? SyS_futex+0x12d/0x180
[26173.119980]  SyS_ioctl+0x74/0x80
[26173.119984]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
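
For anyone triaging a similar freeze over SSH, the hung-task report above can be picked out of a long dmesg log mechanically. The following is only a minimal sketch, assuming the message format shown in this comment; `hung_tasks` is a hypothetical helper name, not part of any existing tool.

```shell
# Hypothetical helper (not part of any existing tool): read kernel log text
# on stdin and print the "name:pid" field of every hung-task report.
# Assumes the message format seen in the trace above:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9][0-9]* seconds\..*/\1/p'
}
```

It would be used as `dmesg | hung_tasks`; lines that are not hung-task reports (the call trace, register dump) are ignored.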
Comment 2 Vladimir Usikov 2018-08-26 05:42:43 UTC
Created attachment 141280 [details]
dmesg

The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.

After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
Comment 3 Andrey Grodzovsky 2018-08-27 21:15:21 UTC
(In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details]
> dmesg
> 
> The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different.
> 
> After the freeze I tried cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

Please provide a clean, new dmesg log and also glxinfo.
Comment 4 Vladimir Usikov 2018-08-29 03:08:07 UTC
Created attachment 141329 [details]
clean dmesg
Comment 5 Vladimir Usikov 2018-08-29 03:08:41 UTC
Created attachment 141330 [details]
glxinfo
Comment 6 Andrey Grodzovsky 2018-08-29 13:45:17 UTC
Did you run pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.
In the latest log I see no prints of a hang.
Comment 7 Vladimir Usikov 2018-09-01 15:01:36 UTC
Created attachment 141405 [details]
dmesg_2

>Did you run pm-hibernate before the hang happened? I see a hibernate print just before the hang in the previous log.

Yes, several times.

>In the latest log I see no prints of a hang.

Yes, you requested a clean dmesg.

I am now attaching dmesg output from a session without pm-hibernate.
Comment 8 Andrey Grodzovsky 2018-09-04 19:00:03 UTC
We can try to check the gfx command buffer for the latest commands and the CU status.

Clone and build our open source register analyzer from here: https://cgit.freedesktop.org/amd/umr/

After the hang happens, please collect the following outputs:

sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
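
The four commands above can also be wrapped in one function so they are easy to run over SSH while the machine is hung. This is only a sketch: `collect_hang_dumps` and its optional output-directory argument are hypothetical names, and the umr options are copied unchanged from the commands above, so they may differ in other umr versions. The one deliberate change is quoting `gfx[.]`, so the shell cannot expand it as a glob if a file named `gfx.` exists in the current directory.

```shell
# Hypothetical wrapper around the commands listed above.
# $1 (optional): directory to write umr_dump and dmesg_dump into (default: .)
collect_hang_dumps() {
    out_dir="${1:-.}"
    mkdir -p "$out_dir"

    # Ring/IP block listing, then the gfx ring contents, then the wave dump,
    # all appended to one file in the same order as the original commands.
    sudo umr -lb > "$out_dir/umr_dump"
    sudo umr -O verbose,use_colour -R 'gfx[.]' >> "$out_dir/umr_dump"
    sudo umr -O halt_waves,use_colour -wa >> "$out_dir/umr_dump"

    # Kernel log taken after the hang.
    dmesg > "$out_dir/dmesg_dump"
}
```

For example, `collect_hang_dumps ~/hang-report` would leave `umr_dump` and `dmesg_dump` in that directory, ready to attach to the bug.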
Comment 9 Vladimir Usikov 2018-10-16 18:22:54 UTC
My Radeon 7950 has died, so I can't test any more.
Comment 10 nmr 2019-01-09 14:54:51 UTC
Created attachment 143036 [details]
UMR dump for PoE/gallium-nine induced AMDGPU hang

I am experiencing the same bug, here is the UMR dump.
Comment 11 nmr 2019-01-09 14:55:19 UTC
Created attachment 143037 [details]
dmesg dump for hang
Comment 12 nmr 2019-01-11 13:09:06 UTC
Andrey, is there anything I can do to help resolve this? I'd be happy to help. I haven't looked at the ring buffers; is there some kind of deadlock in there?
Comment 13 Andrey Grodzovsky 2019-01-11 17:38:07 UTC
(In reply to nmr from comment #10)
> Created attachment 143036 [details]
> UMR dump for PoE/gallium-nine induced AMDGPU hang
> 
> I am experiencing the same bug, here is the UMR dump.

Marek, I am seeing wave dumps in here during the hang. Could you please take a look and advise?
Comment 14 Marek Olšák 2019-01-15 00:51:11 UTC
There is some branching and SGPR spilling, so I guess that's the problem.
Comment 15 nmr 2019-01-15 00:58:52 UTC
Marek, forgive my ignorance, but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow, with the timeout resulting in a kernel module abort?
Comment 16 Marek Olšák 2019-01-15 01:34:01 UTC
The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.
Comment 17 Marek Olšák 2019-01-15 01:39:15 UTC
(In reply to nmr from comment #15)
> Marek, forgive my ignorance, but why would SGPR spilling or branching cause
> the hang? Is the shader just timing out somehow, with the timeout resulting in
> a kernel module abort?

Pretty much. The shader is stuck and doesn't continue. Also, the shader is insanely huge, with lots of SGPR spilling and branching.
Comment 18 nmr 2019-01-15 12:19:04 UTC
Created attachment 143124 [details]
dmesg during reboot/recovery

On the basis that it may be shader induced I repro'd a similar hang with dxvk (v0.95-5-gcc38412). 

FWIW, here's dmesg during the subsequent GPU recovery (which fails :( ) and reboot (which hangs). It appears to be hung on a DMA and/or hung during a modeset while acquiring the modeset lock.
Comment 19 nmr 2019-01-15 12:22:12 UTC
Created attachment 143125 [details]
UMR dump for similar dxvk hang

I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.

>   pgm[7@0x800100025000 + 0x94  ] = 0x3727c5ac      ;;                                                          

Are these timed NOPs, or something similar, inserted to achieve the correct cycle delay and avoid load hazards?

