Bug 110928

Summary: WX 5100 GPU crash
Product: DRI
Reporter: baopeng <baopeng88_com>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: critical    
Priority: medium    
Version: DRI git   
Hardware: All   
OS: Linux (All)   
Whiteboard:
Attachments (Description, Flags):
gstack info, none
situation_1_dmesg, none

Description baopeng 2019-06-17 06:56:34 UTC
When using a WX 5100 for rendering and encoding, we encountered several GPU hangs. The GPU does not recover after a hang, and the only way to restore it is to reboot the machine. The relevant log output for three separate situations follows. Please help analyze it; thank you very much.
situation 1:
2019-06-16T14:39:24.708544+08:00|err|kernel[-]|[398383.549799] amdgpu 0005:01:00.0: GPU fault detected: 146 0x04203d0c for process a.babycard.ssvs pid 330210 thread RenderThread pid 330511
2019-06-16T14:39:24.708703+08:00|err|kernel[-]|[398383.549803] amdgpu 0005:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00102184
2019-06-16T14:39:24.708812+08:00|err|kernel[-]|[398383.549805] amdgpu 0005:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D03D014
2019-06-16T14:39:24.708908+08:00|err|kernel[-]|[398383.549809] amdgpu 0005:01:00.0: VM fault (0x14, vmid 6, pasid 33627) at page 1057156, write from 'SDM1' (0x53444d31) (61)
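
For readers following the fault lines above, here is a minimal sketch of how the raw register values relate to the decoded "VM fault" line. The bit positions in `decode_fault` are assumptions, chosen because they reproduce the values the kernel printed (0x14, vmid 6, client 61, write); they are not taken from this report and may differ on other GPU generations. The function and field names are hypothetical.

```python
# Register values copied from the situation 1 log lines.
FAULT_ADDR = 0x00102184    # VM_CONTEXT1_PROTECTION_FAULT_ADDR
FAULT_STATUS = 0x0D03D014  # VM_CONTEXT1_PROTECTION_FAULT_STATUS
MC_CLIENT = 0x53444D31     # client tag printed next to 'SDM1'


def decode_fault(status, addr, mc_client):
    """Split the fault status into fields.

    The bit layout is an assumption that happens to match the
    decoded line in the log; it is not an authoritative register map.
    """
    return {
        "protections": status & 0xFF,          # assumed bits 0-7
        "client_id": (status >> 12) & 0xFF,    # assumed bits 12-19
        "is_write": bool((status >> 24) & 1),  # assumed bit 24
        "vmid": (status >> 25) & 0xF,          # assumed bits 25-28
        "page": addr,                          # fault address as a page number
        # The client tag is four ASCII bytes packed into one 32-bit value.
        "client_name": mc_client.to_bytes(4, "big").decode("ascii"),
    }


fault = decode_fault(FAULT_STATUS, FAULT_ADDR, MC_CLIENT)
for name, value in fault.items():
    print(f"{name}: {value}")
```

Under these assumptions the fields come out as protections 0x14, client 61, a write access, vmid 6, page 1057156, and the tag 'SDM1', which agrees with the kernel's own decoded line.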

After the GPU fault, about 17 seconds later:

2019-06-16T14:39:41.924400+08:00|err|kernel[-]|[398400.765123] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vce0 timeout, signaled seq=3868950, emitted seq=3868954
2019-06-16T14:39:41.924463+08:00|info|kernel[-]|[398400.765132] [drm] GPU recovery disabled.

situation 2:
[Thu Jun  6 22:00:14 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=919191055, emitted seq=919191057
[Thu Jun  6 22:00:14 2019] [drm] GPU recovery disabled.
[Thu Jun  6 22:00:16 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=101603699, emitted seq=101603701
[Thu Jun  6 22:00:16 2019] [drm] GPU recovery disabled.

situation 3:
2019-06-16T14:59:05.248325+08:00|err|kernel[-]|[399194.411704] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=230984670, emitted seq=230984673
2019-06-16T14:59:05.248404+08:00|info|kernel[-]|[399194.411708] [drm] GPU recovery disabled.
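
To compare the three situations side by side, the timeout lines can be summarized with a short script. This is only a sketch: the messages are copied from the logs above with their timestamps trimmed, the helper name `pending_jobs` is hypothetical, and reading "emitted minus signaled" as the number of jobs still outstanding on the ring is an assumption about what the two sequence numbers mean.

```python
import re

# Timeout messages from situations 1, 2, and 3 (timestamps trimmed).
LOG = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vce0 timeout, signaled seq=3868950, emitted seq=3868954
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=919191055, emitted seq=919191057
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=101603699, emitted seq=101603701
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=230984670, emitted seq=230984673
"""

TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def pending_jobs(log):
    """Return (ring, emitted - signaled) for every timeout line in log."""
    return [
        (m.group(1), int(m.group(3)) - int(m.group(2)))
        for m in TIMEOUT_RE.finditer(log)
    ]


for ring, pending in pending_jobs(LOG):
    print(f"ring {ring}: {pending} sequence numbers emitted but not signaled")
```

In all three situations only a few sequence numbers (2 to 4) separate the emitted and signaled counters, and the rings involved are vce0, gfx, and sdma1.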

Could you help analyze these three situations so that we can resolve the hangs? Thank you.
Comment 1 baopeng 2019-06-17 06:58:02 UTC
Created attachment 144567 [details]
gstack info
Comment 2 baopeng 2019-06-17 07:01:13 UTC
Created attachment 144568 [details]
situation_1_dmesg

situation 1 dmesg
Comment 3 Martin Peres 2019-11-19 09:31:27 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/829.
