Bug 102322 - System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
Summary: System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma...
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 107152 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-08-20 22:53 UTC by dwagner
Modified: 2019-09-26 12:29 UTC (History)
13 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Deadlock fix (2.53 KB, patch)
2018-06-26 15:21 UTC, Andrey Grodzovsky
no flags Details | Splinter Review
.config (204.27 KB, text/plain)
2018-08-15 14:24 UTC, Andrey Grodzovsky
no flags Details
trace-cmd induced output during 3-fps-video replay and crash (671.00 KB, application/x-xz)
2018-08-16 21:55 UTC, dwagner
no flags Details
dmesg from boot to after the 3-fps-video test crash (63.98 KB, text/plain)
2018-08-16 21:56 UTC, dwagner
no flags Details
output of umr command after 3-fps-video test crash (57.06 KB, text/plain)
2018-08-16 21:57 UTC, dwagner
no flags Details
add_debug_info.patch (3.17 KB, patch)
2018-08-17 21:25 UTC, Andrey Grodzovsky
no flags Details | Splinter Review
script used to generate the gpu_debug_3.txt (when executed via ssh -t ...) (1.55 KB, text/plain)
2018-08-18 21:37 UTC, dwagner
no flags Details
dmesg / trace / umr output from gpu_debug3.sh (1.65 MB, text/plain)
2018-08-18 21:38 UTC, dwagner
no flags Details
xz-compressed output of gpu_debug3.sh - dmesg, trace, umr (1.65 MB, application/x-xz)
2018-08-18 21:40 UTC, dwagner
no flags Details
add_debug_info2.patch (3.18 KB, patch)
2018-08-20 14:16 UTC, Andrey Grodzovsky
no flags Details | Splinter Review
latest crash trace output, without gpu_reset (1.76 MB, application/x-xz)
2018-08-22 00:26 UTC, dwagner
no flags Details
crash trace with amdgpu.vm_update_mode=3 (84.65 KB, text/plain)
2018-08-22 22:18 UTC, dwagner
no flags Details
test script (1.57 KB, application/x-shellscript)
2018-11-15 23:38 UTC, dwagner
no flags Details
gpu_debug4_output.txt.gz (650.68 KB, application/x-xz)
2018-11-15 23:39 UTC, dwagner
no flags Details

Description dwagner 2017-08-20 22:53:09 UTC
I consistently experience complete system crashes when browsing web pages using firefox for about 30 minutes, with the following dmesg output from the amdgpu driver:

[ 2330.720711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=40778, last emitted seq=40780
[ 2330.720768] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, last signaled seq=31305, last emitted seq=31306
[ 2330.720771] [drm] IP block:gmc_v8_0 is hung!
[ 2330.720774] [drm] IP block:gmc_v8_0 is hung!
[ 2330.720775] [drm] IP block:sdma_v3_0 is hung!
[ 2330.720778] [drm] IP block:sdma_v3_0 is hung!

(The messages cited above are the last to make it to a network filesystem via "dmesg -w" before the system stops responding entirely.)

I am running a kernel compiled from https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next as of "commit 94097b0f7f1bfa54b3b1f8b0d74bbd271a0564e4" (so the very latest as of today).
My GPU is an RX 460.

Notice that this bug may show the same symptom as reported in https://bugs.freedesktop.org/show_bug.cgi?id=98874

However, for me the system crashes usually occur while vertically scrolling through some (ordinary) web page.
Comment 1 dwagner 2017-11-19 16:40:30 UTC
Sadly, not only has this bug not attracted any attention, it also still occurs, seemingly even more frequently than before, on current bleeding-edge kernels from amd-staging-drm-next, and also with the now-current Firefox 57 and the now-current versions of Xorg, Mesa, etc. from Arch Linux.
Comment 2 dwagner 2018-02-24 18:36:55 UTC
Just to mention this once again: These system crashes still occur, and way too frequently to consider the amdgpu driver stable enough for professional use. Sample dmesg output from today:

Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=5430589, last emitted seq=5430591
Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=185928, last emitted seq=185930
Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out
Comment 3 dwagner 2018-06-03 21:00:01 UTC
Just for the record, others have reported similar symptoms - here is a recent example: https://bugs.freedesktop.org/show_bug.cgi?id=106666
Comment 4 dwagner 2018-06-03 21:02:41 UTC
I was asked in https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1027705-amdgpu-on-linux-4-18-to-offer-greater-vega-power-savings-displayport-1-4-fixes?p=1027933#post1027933 to mention here that I have experienced this kind of bug only when using the "new" display code (amdgpu.dc=1).

I cannot strictly rule out that it could also happen with dc=0, since I have tried dc=0 only for short periods occasionally, but during those periods I did not see this kind of crash.
Comment 5 dwagner 2018-06-25 21:43:03 UTC
Just for the record: to rule out that my personally compiled kernels are somehow "more buggy than what others compile", I tried the current Arch-Linux-supplied Linux 4.17.2-1-ARCH kernel.

It survives about 5 minutes of Firefox browsing between crashes with:

Jun 20 00:01:11 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=1895, last em>
Jun 20 00:01:11 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!

(4.13.* at least survived for days.)
Comment 6 Andrey Grodzovsky 2018-06-25 22:11:14 UTC
Verify you are using latest AMD firmware and up to date MESA/LLVM

Firmware here  (amdgpu folder) - https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

Andrey
Comment 7 dwagner 2018-06-25 23:08:10 UTC
(In reply to Andrey Grodzovsky from comment #6)
> Verify you are using latest AMD firmware and up to date MESA/LLVM

Firmware:

pacman -Q linux-firmware
linux-firmware 20180606.d114732-1

ll  /usr/lib/firmware/amdgpu/vega10_vce.bin
-rw-r--r-- 1 root root 165344 Jun  7 08:01 /usr/lib/firmware/amdgpu/vega10_vce.bin


MESA:

pacman -Q mesa
mesa 18.1.2-1


LLVM:
pacman -Q llvm-libs
llvm-libs 6.0.0-4

Is this new enough?


BTW: In a forum somebody asked what the dmesg output on crash looked like if I enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output, but still a fatal system crash:

Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=12277, last emitted seq=12279
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
Comment 8 Andrey Grodzovsky 2018-06-26 15:20:45 UTC
(In reply to dwagner from comment #7)
> (In reply to Andrey Grodzovsky from comment #6)
> > Verify you are using latest AMD firmware and up to date MESA/LLVM
> 
> Firmware:
> 
> pacman -Q linux-firmware
> linux-firmware 20180606.d114732-1
> 
> ll  /usr/lib/firmware/amdgpu/vega10_vce.bin
> -rw-r--r-- 1 root root 165344 Jun  7 08:01
> /usr/lib/firmware/amdgpu/vega10_vce.bin
> 
> 
> MESA:
> 
> pacman -Q mesa
> mesa 18.1.2-1
> 
> 
> LLVM:
> pacman -Q llvm-libs
> llvm-libs 6.0.0-4
> 
> Is this new enough?

The kernel and Mesa seem new enough; LLVM is 6, so maybe you should try 7.
The firmware also looks quite recent, but I would still advise manually overriding all firmware files with the files from here: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Just back up your existing firmware/amdgpu folder first, just in case.

> 
> 
> BTW: In a forum somebody asked what the dmesg output on crash looked like if
> I enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output,
> but still a fatal system crash:
> 
> Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx timeout, last signaled seq=12277, last emitted seq=12279
> Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!
> Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung!
> Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
> Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done
> [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
> Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> [drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
> Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out

It's a known issue; try the patch I attached to resolve the deadlock, but you will probably experience other failures after that anyway.

Andrey
Comment 9 Andrey Grodzovsky 2018-06-26 15:21:27 UTC
Created attachment 140345 [details] [review]
Deadlock fix
Comment 10 dwagner 2018-06-26 22:52:22 UTC
(In reply to Andrey Grodzovsky from comment #8)
> The kernel and MESA seems new enough, LLVM is 6 so maybe you should try 7.

LLVM 7 has not been released, and replacing LLVM 6 with the current Subversion head of LLVM 7 would mean recompiling and reinstalling half of the operating system (starting with radeonsi, then Xorg, then its dependencies...).

I'm fine with using experimental new kernels to find a more stable amdgpu driver - but if a kernel driver crashes just because some user-space application (X11) uses the wrong compiler version at run time, then some part of the driver design is very wrong.

> The firmware also looks pretty late but I still would advise to manually
> override all firmware files with files from here
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
> tree/amdgpu

I did a "diff -r" of the git files against the ones installed by Arch; they are all binary identical.

> > Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
> > [drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
> 
> It's a know issue, try the patch I attached to resolve the deadlock , but
> you will probably experience other failures after that anyway. 

Ok, thanks for the patch, will try this next time I compile a new kernel.
Comment 11 Michel Dänzer 2018-06-27 07:48:45 UTC
(In reply to Andrey Grodzovsky from comment #8)
> The kernel and MESA seems new enough, LLVM is 6 so maybe you should try 7.

LLVM 6 is fine.
Comment 12 Andrey Grodzovsky 2018-06-27 13:53:37 UTC
(In reply to dwagner from comment #2)
> Just to mention this once again: These system crashes still occur, and way
> too frequently to consider the amdgpu driver stable enough for professional
> use. Sample dmesg output from today:
> 
> Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> last signaled seq=5430589, last emitted seq=5430591
> Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung!
> Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung!
> Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, last signaled seq=185928, last emitted seq=185930
> Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung!
> Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung!
> Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR*
> [CRTC:43:crtc-0] hw_done or flip_done timed out

Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to force CPU VM update mode and see if this helps ?
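For anyone following along, a minimal sketch of how such a boot parameter can be added on a GRUB-based system. The edit is shown against a local demo copy (the file name `grub.demo` is a placeholder) so it is safe to dry-run; on the real system you would edit /etc/default/grub itself:

```shell
# Demo copy of /etc/default/grub, so this snippet is safe to run anywhere.
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n' > grub.demo

# Append amdgpu.vm_update_mode=3 to the default kernel command line.
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 amdgpu.vm_update_mode=3"/' grub.demo
cat grub.demo

# On the real system, regenerate the config and reboot:
#   sudo grub-mkconfig -o /boot/grub/grub.cfg
# and verify afterwards with:
#   grep -o 'amdgpu.vm_update_mode=[0-9]*' /proc/cmdline
```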
Comment 13 dwagner 2018-06-27 23:15:48 UTC
(In reply to Andrey Grodzovsky from comment #12)
> Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to
> force CPU VM update mode and see if this helps ?

Sure. It is too early to say "hurray", but at a current uptime of one hour, 4.17.2 with amdgpu.vm_update_mode=3 has already survived about 20 times longer than it did without that option before the first crash.

One (probably just informational) message is emitted by the kernel:
[   19.319565] CPU update of VM recommended only for large BAR system

Can you explain a little: What is a "large BAR system", and what does the vm_update_mode=3 option actually cause? Should I expect any weird side effects to look for?


BTW: A result not of that option but of the kernel version seems to be that the shader clock stays at a pretty high frequency all the time - even without any 3D or compute load, just displaying a static 4k/60Hz desktop image:

cat pp_dpm_sclk
0: 214Mhz 
1: 481Mhz 
2: 760Mhz 
3: 1020Mhz 
4: 1102Mhz 
5: 1138Mhz 
6: 1180Mhz *
7: 1220Mhz 

Much lower shader clocks are used only if I lower the refresh rate of the screen. Is there a reason why the shader clocks should stay high even in the absence of 3d/compute load?

(I would have found it easier to understand if the minimum memory clock depended on the refresh rate, but the memory clock stays as low as with the older kernels.)
Comment 14 Alex Deucher 2018-06-28 02:17:57 UTC
(In reply to dwagner from comment #13)
> 
> Much lower shader clocks are used only if I lower the refresh rate of the
> screen. Is there a reason why the shader clocks should stay high even in the
> absence of 3d/compute load?
> 

Certain display requirements can also cause the engine clock to be kept higher.
Comment 15 Andrey Grodzovsky 2018-06-28 04:17:19 UTC
(In reply to dwagner from comment #13)
> (In reply to Andrey Grodzovsky from comment #12)
> > Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to
> > force CPU VM update mode and see if this helps ?
> 
> Sure. Too early yet to say "hurray", but at an uptime of one hour,
> currently, 4.17.2 survived with amdgpu.vm_update_mode=3 already about 20
> times longer than without that option before the first crash.
> 
> One (probably just informal) message is emitted by the kernel:
> [   19.319565] CPU update of VM recommended only for large BAR system
> 
> Can you explain a little: What is a "large BAR system", and what does the
> vm_update_mode=3 option actually cause? Should I expect any weird side
> effects to look for?

I think it just means systems with large VRAM, which require a large BAR for mapping; but I am not sure on that point.
vm_update_mode=3 means GPUVM page-table updates are done using the CPU. By default we do them using the DMA engine on the ASIC; the log showed a hang in this engine, so I assumed there is something wrong with the SDMA commands we submit.
I expect more CPU utilization as a side effect, and maybe slower rendering.

> 
> 
> BTW: Not a result of that option, but of the kernel version, seems to be the
> fact that the shader clock keeps at a pretty high frequency all the time -
> even without any 3d or compute load, just displaying a quiet 4k/60Hz desktop
> image:
> 
> cat pp_dpm_sclk
> 0: 214Mhz 
> 1: 481Mhz 
> 2: 760Mhz 
> 3: 1020Mhz 
> 4: 1102Mhz 
> 5: 1138Mhz 
> 6: 1180Mhz *
> 7: 1220Mhz 
> 
> Much lower shader clocks are used only if I lower the refresh rate of the
> screen. Is there a reason why the shader clocks should stay high even in the
> absence of 3d/compute load?
> 
> (I would have better understood if the minimum memory clock was depending on
> the refresh rate, but memory clock stays as low as with the older kernels.)
Comment 16 Alex Deucher 2018-06-28 04:36:41 UTC
(In reply to Andrey Grodzovsky from comment #15)
> I think it just means systems with large VRAM so it will require large BAR
> for mapping. But I am not sure on that point.

That's correct. The updates are done with the CPU rather than the GPU (SDMA). The default BAR size on most systems is 256MB for 32-bit compatibility, so the window for CPU access to VRAM (where the page tables live) is limited.
Comment 17 Andrey Grodzovsky 2018-06-28 10:33:22 UTC
(In reply to Alex Deucher from comment #16)
> (In reply to Andrey Grodzovsky from comment #15)
> > I think it just means systems with large VRAM so it will require large BAR
> > for mapping. But I am not sure on that point.
> 
> That's correct.  the updates are done with the CPU rather than the GPU
> (SDMA).  The default BAR size on most systems is usually 256MB for 32 bit
> compatibility so the window for CPU access to vram (where the page tables
> live) is limited.

Thanks Alex.

dwagner, this is obviously just a workaround, not a fix. It points to some problem with the SDMA packets; if you want to continue exploring, we can try to dump some fence traces and the SDMA HW ring content to examine the last packets before the hang happened.
Comment 18 dwagner 2018-06-28 19:56:46 UTC
The good news: So far no crashes during normal uptime with amdgpu.vm_update_mode=3

The bad news: System crashes immediately upon S3 resume (with messages quite different from the ones I saw with earlier S3-resume crashes) - I filed bug report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this.

(In reply to Andrey Grodzovsky from comment #17)
> dwagner, this is obviously just a work around and not a fix. It points to
> some problem with SDMA packets, if you want to continue exploring we can try
> to dump some fence traces and SDMA HW ring content to examine the latest
> packets before the hang happened.

If you can include some debug output into "amd-staging-drm-next" that helps finding the root cause, I might be able to provide some output - if the kernel survives long enough after the crash to write the system journal - this has not always been the case.
Comment 19 Andrey Grodzovsky 2018-06-28 21:09:09 UTC
Can you use addr2line or gdb with the 'list' command to give the matching line number?
(In reply to dwagner from comment #18)
> The good news: So far no crashes during normal uptime with
> amdgpu.vm_update_mode=3
> 
> The bad news: System crashes immediately upon S3 resume (with messages quite
> different from the ones I saw with earlier S3-resume crashes) - I filed bug
> report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this.
> 
> (In reply to Andrey Grodzovsky from comment #17)
> > dwagner, this is obviously just a work around and not a fix. It points to
> > some problem with SDMA packets, if you want to continue exploring we can try
> > to dump some fence traces and SDMA HW ring content to examine the latest
> > packets before the hang happened.
> 
> If you can include some debug output into "amd-staging-drm-next" that helps
> finding the root cause, I might be able to provide some output - if the
> kernel survives long enough after the crash to write the system journal -
> this has not always been the case.

No need to recompile; I just need to see the content of the SDMA ring buffer when the hang occurs.

Clone and build our register analyzer from https://cgit.freedesktop.org/amd/umr/ and, once the hang happens, just run:

sudo umr -lb
sudo umr -R gfx[.]
sudo umr -R sdma0[.]
sudo umr -R sdma1[.]

I will probably need more info later but let's try this first.
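Since the window between the hang and the full system freeze is short, the dumps can be scripted to fire the moment the hang message appears. A hedged sketch, assuming umr is in PATH and root privileges; the function names and the log file name are placeholders, not part of umr itself:

```shell
# Match the amdgpu hang diagnostics quoted in this report.
is_hang_line() {
    case $1 in
        *"is hung!"*|*amdgpu_job_timedout*) return 0 ;;
        *) return 1 ;;
    esac
}

# Follow the kernel log and dump the register/ring state once, on first hang.
watch_for_hang() {
    dmesg --follow | while IFS= read -r line; do
        if is_hang_line "$line"; then
            {
                umr -lb
                umr -R 'gfx[.]'
                umr -R 'sdma0[.]'
                umr -R 'sdma1[.]'
            } > umr-dump.txt 2>&1
            break
        fi
    done
}
```

Running `watch_for_hang` from a second machine via `ssh -t` gives the dump a chance to land on disk before the box dies completely.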
Comment 20 dwagner 2018-06-28 22:56:03 UTC
(In reply to Andrey Grodzovsky from comment #19)
> No need to recompile, just need to see what is the content of SDMA ring
> buffer when the hang occurs.
> 
> Clone and build our register analyzer from here -
> https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run 
> 
> sudo umr -lb
> sudo umr -R gfx[.]
> sudo umr -R sdma0[.]
> sudo umr -R sdma1[.]
> 
> I will probably need more info later but let's try this first.

How can I run "umr" on a crashed system? I guess those register values are retained over a press of the reset button / reboot?
Comment 21 dwagner 2018-06-28 22:57:21 UTC
(I meant to write "I guess those register values are NOT retained over a reboot, right?")
Comment 22 Andrey Grodzovsky 2018-06-29 00:10:03 UTC
(In reply to dwagner from comment #21)
> (I meant to write "I guess those register values are NOT retained over a
> reboot, right?")

Yes; my assumption was that at least sometimes you still have SSH access to the system in those cases.
Comment 23 dwagner 2018-07-04 23:03:36 UTC
Just for the record: at this point I can say that with
amdgpu.vm_update_mode=3, 4.17.2-ARCH runs for at least hours,
not just the minutes it runs without this option before crashing.

I cannot, however, say that the above combination reaches the
days-between-amdgpu-crashes uptimes that 4.13.x reached;
to be able to test this, I would need S3 resumes to work,
which is the subject of bug report 107065.

Without working S3 resumes there is no way for me to test longer
uptimes, because amdgpu consistently crashes (in every version I know
of) if I just let the system run but switch off the display, and I do
not want to keep the connected 4k TV switched on all day and night.
Comment 24 Michel Dänzer 2018-07-05 13:59:56 UTC
Can you try bisecting between 4.13 and 4.17 to find where stability went downhill for you?
Comment 25 dwagner 2018-07-05 23:32:43 UTC
(In reply to Michel Dänzer from comment #24)
> Can you try bisecting between 4.13 and 4.17 to find where stability went
> downhill for you?

A bisect like that is unlikely to converge in any reasonable time, given the stochastic nature of these crashes.

While the mean time between driver crashes is dramatically different, there will be occasions on which 4.13 crashes early enough to yield a false "bad", and occasions on which 4.17 lasts the 20 minutes or so needed to assume a false "good".

What about the multitude of debug options - isn't there one that could provide more insight into when and why the driver crashes?
Comment 26 dwagner 2018-07-06 23:20:20 UTC
Today, for the first time, I had a sudden "crash while just browsing with Firefox" while using the amdgpu.vm_update_mode=3 parameter with the current-as-of-today amd-staging-drm-next (bb2e406ba66c2573b68e609e148cab57b1447095) with patch https://bugs.freedesktop.org/attachment.cgi?id=140418 applied on top.

Kernel messages different from those of previous crashes of this kind were emitted:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146 0x0c80440c
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7, pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)
Jul 07 01:08:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=75244, last emitted seq=75245
Jul 07 01:08:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!

Hope this helps somehow.
Comment 27 Michel Dänzer 2018-07-07 08:36:28 UTC
(In reply to dwagner from comment #26)
> Today for the first time I had a sudden "crash while just browsing with
> Firefox" [...]

That could be a Mesa issue, anyway it should probably be tracked separately from this report.
Comment 28 dwagner 2018-07-07 20:08:40 UTC
(In reply to Michel Dänzer from comment #27)
> That could be a Mesa issue, anyway it should probably be tracked separately
> from this report.

Created separate bug report https://bugs.freedesktop.org/show_bug.cgi?id=107152

(If that is a Mesa issue, no more than user processes / X11 should have crashed - but not the kernel amdgpu driver... right?)
Comment 29 Andrey Grodzovsky 2018-07-09 14:34:51 UTC
(In reply to dwagner from comment #28)
> (In reply to Michel Dänzer from comment #27)
> > That could be a Mesa issue, anyway it should probably be tracked separately
> > from this report.
> 
> Created separate bug report
> https://bugs.freedesktop.org/show_bug.cgi?id=107152
> 
> (If that is a Mesa issue, no more than user processes / X11 should have
> crashed - but not the kernel amdgpu driver... right?)

Not exactly, MESA could create a bad request (faulty GPU address) which would lead to this. It can even be triggered on purpose using a debug flag from MESA.
Comment 30 dwagner 2018-07-11 22:32:41 UTC
(In reply to Andrey Grodzovsky from comment #29)
> > (If that is a Mesa issue, no more than user processes / X11 should have
> > crashed - but not the kernel amdgpu driver... right?)
> 
> Not exactly, MESA could create a bad request (faulty GPU address) which
> would lead to this. It can even be triggered on purpose using a debug flag
> from MESA.

My understanding is that all parts of MESA run as user processes, outside of the kernel space. If such code is allowed to pass parameters into kernel functions that make the kernel crash, that would be a veritable security hole which attackers could exploit to stage at least denial-of-service attacks, if not worse.
Comment 31 Doctor 2018-07-15 08:56:58 UTC
I got that one too and was able to track the problem down a bit further. Chrome and video with the GPU enabled will blow it up too. Interestingly, I was able to reproduce it consistently with my rtl8188eu USB driver: plug it in, connect, and wpa_supplicant will cause it to explode.
Comment 32 Doctor 2018-07-15 09:03:01 UTC
I ended up working on a live dev CD for CodeXL, since all my machines are memory-based and use no magnetic media. Just cherry-picking the code back to the last 4.16 gives no problems; here's the working 4.16. I chased this rabbit for a while, and it pops up like the damn woodchuck in Caddyshack.


Here is the latest as of 11 hours ago, 4.19-wip:
https://github.com/tekcomm/linux-image-4.19-wip-generic


Here is the 4.16 version from three weeks ago, with no woodchucks:
https://github.com/tekcomm/linux-kernel-amdgpu-binaries
Comment 33 Doctor 2018-07-15 09:07:08 UTC
I think it may be something as stupid as a variable, too.
Comment 34 dwagner 2018-07-15 19:59:36 UTC
(In reply to Doctor from comment #32)
> Just cherry picking the code
> back to the  last 4.16 and no problems Heres the working 4.16 . I chased
> this rabbit for awhile and it pops up like the dam wood chuck in caddie
> shack.
> 
> Here is the latest as of 11 hours ago 4.19-wip
> https://github.com/tekcomm/linux-image-4.19-wip-generic

I am not sure I understand what you are trying to tell us, here.

The repository you linked does not seem to contain any relevant commits changing kernel source code.
Comment 35 Andrey Grodzovsky 2018-07-16 14:06:32 UTC
(In reply to dwagner from comment #30)
> (In reply to Andrey Grodzovsky from comment #29)
> > > (If that is a Mesa issue, no more than user processes / X11 should have
> > > crashed - but not the kernel amdgpu driver... right?)
> > 
> > Not exactly, MESA could create a bad request (faulty GPU address) which
> > would lead to this. It can even be triggered on purpose using a debug flag
> > from MESA.
> 
> My understanding is that all parts of MESA run as user processes, outside of
> the kernel space. If such code is allowed to pass parameters into kernel
> functions that make the kernel crash, that would be a veritable security
> hole which attackers could exploit to stage at least denial-of-service
> attacks, if not worse.

There is no impact on the kernel. Please note that this is a GPU page fault, not a CPU page fault, so the kernel keeps working normally: it doesn't hang and remains usable. You might get a black screen out of this and have to reset the graphics card, or maybe the entire system, to recover, but I don't see any system security or stability compromise here.
Comment 36 Roshless 2018-07-29 10:02:00 UTC
*** Bug 107311 has been marked as a duplicate of this bug. ***
Comment 37 dwagner 2018-08-08 23:07:38 UTC
In the related bug report (https://bugs.freedesktop.org/show_bug.cgi?id=107152) I noticed that this bug can be triggered very reliably and quickly by playing a video with a deliberately lowered frame rate:
 "mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"

This led me to suspect that this bug might be caused by dynamic power management, which often ramps performance up and down when a video is played at such a low frame rate.

And indeed, I found this confirmed by many experiments: If I use a script like
> #!/bin/bash
> cd /sys/class/drm/card0/device
> echo manual >power_dpm_force_performance_level
> # low
> echo 0 >pp_dpm_mclk 
> echo 0 >pp_dpm_sclk
> # medium
> #echo 1 >pp_dpm_mclk 
> #echo 1 >pp_dpm_sclk
> # high
> #echo 1 >pp_dpm_mclk 
> #echo 6 >pp_dpm_sclk
to enforce just any fixed performance level, then the crashes no longer occur - including with the "low-frame-rate video test".

So it seems that the transition from one "dpm" performance level to another causes these crashes with a certain probability. And the more often such transitions occur, the sooner one experiences them.

(BTW: for some unknown reason, invoking "xrandr" or enabling a monitor after sleep causes the above settings to be lost, so one has to invoke the above script again.)
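One way around the settings getting silently reset (a sketch, not a tested fix: the function name is mine, and card0 is an assumption that matches the script above) is to wrap the forcing in a helper that can be re-run from a timer or after every xrandr call:

```shell
# Sysfs node of the GPU; overridable so the helper can be dry-run elsewhere.
DEV=${DEV:-/sys/class/drm/card0/device}

# Re-assert the forced "low" performance level from the script above.
force_low_dpm() {
    echo manual > "$DEV/power_dpm_force_performance_level"
    echo 0 > "$DEV/pp_dpm_mclk"
    echo 0 > "$DEV/pp_dpm_sclk"
}

# Example: as root, re-apply every 30 seconds:
#   while sleep 30; do force_low_dpm; done
```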
Comment 38 dwagner 2018-08-09 20:56:06 UTC
*** Bug 107152 has been marked as a duplicate of this bug. ***
Comment 39 Andrey Grodzovsky 2018-08-14 21:27:41 UTC
(In reply to dwagner from comment #37)
> In the related bug report
> (https://bugs.freedesktop.org/show_bug.cgi?id=107152) I noticed that this
> bug can be triggered very reliably and quickly by playing a video with a
> deliberately lowered frame rate:
>  "mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"

> 
> This led me to assume this bug might be caused by the dynamic power
> management, that often ramps performance up/down when a video is played at
> such a low frame rate.

I tried exactly the same thing - reproducing with the same card model and the latest kernel, running a webm clip with mpv the same way you did - and it didn't happen.

> 
> And indeed, I found this confirmed by many experiments: If I use a script
> like
> > #!/bin/bash
> > cd /sys/class/drm/card0/device
> > echo manual >power_dpm_force_performance_level
> > # low
> > echo 0 >pp_dpm_mclk 
> > echo 0 >pp_dpm_sclk
> > # medium
> > #echo 1 >pp_dpm_mclk 
> > #echo 1 >pp_dpm_sclk
> > # high
> > #echo 1 >pp_dpm_mclk 
> > #echo 6 >pp_dpm_sclk
> to enforce just any performance level, then the crashes do not occur anymore
> - also with the "low frame rate video test".
> 
> So it seems that the transition from one "dpm" performance level to another,
> with a certain probability, causes these crashes. And the more often the
> transitions occur, the sooner one will experience them.
> 
> (BTW: For unknown reason, invoking "xrandr" or enabling a monitor after
> sleep causes the above settings to get lost, so one has to invoke above
> script again.)
Comment 40 Andrey Grodzovsky 2018-08-15 14:24:24 UTC
Created attachment 141112 [details]
.config

I uploaded my .config file; maybe something in your Kconfig flags makes this happen. You can try rebuilding the latest kernel from Alex's repository using my .config and see whether you still experience this.
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Other than that, since your system hard-hangs and you can't do any post-mortem dumps, you can at least provide output from event tracing through trace_pipe to catch live logs on the fly. Maybe we can infer something from there...

So again - 
Load the system and before starting reproduce run the following trace command -

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"

then cd /sys/kernel/debug/tracing && cat trace_pipe

When the problem happens, just copy all the output from the terminal to a log file. Make sure your terminal app has the largest possible buffer, to catch ALL the output.
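Since the machine usually dies completely, streaming the trace to a second box is more robust than relying on a local terminal buffer. A sketch of the same capture with the tail kept remotely; the host name, file names, and function name are placeholders, and the event list (including the `-v` exclusion of the chatty register-access events) is kept verbatim from the command above:

```shell
# The exact event set requested above, kept in one variable for reuse.
AMDGPU_EVENTS='-e dma_fence -e gpu_scheduler -e amdgpu -v -e amdgpu:amdgpu_mm_rreg -e amdgpu:amdgpu_mm_wreg -e amdgpu:amdgpu_iv'

# Start tracing, then stream trace_pipe to a second machine so the last
# lines before the hang survive the crash.
start_capture() {
    sudo trace-cmd start $AMDGPU_EVENTS
    sudo cat /sys/kernel/debug/tracing/trace_pipe \
        | ssh user@second-box 'cat > amdgpu-trace.log'
}
```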
Comment 41 dwagner 2018-08-15 22:03:38 UTC
(In reply to Andrey Grodzovsky from comment #40)
> Created attachment 141112 [details]
> .config
> 
> I uploaded my .config file - maybe something in your Kconfig flags makes
> this happen - you can try and rebuild latest kernel from Alex's repository
> using my .config and see if you don't experience this anymore. 
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Did just that - but the video test still crashes after at most a few minutes, and does not crash with DPM turned off. So we can rule out our .config differences (of which there are many).

> Other than that, since your system hard-hangs and you can't take any postmortem
> dumps, you can at least provide output from event tracing through trace_pipe
> to catch live logs on the fly. Maybe we can infer something from there...
> 
> So again - 
> Load the system and before starting reproduce run the following trace
> command -
> 
> sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e
> "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
> 
> then cd /sys/kernel/debug/tracing && cat trace_pipe
> 
> When the problem happens just copy all the output from the terminal to a log
> file. Make sure your terminal app has largest possible buffer to catch ALL
> the output.

Will try that on next opportunity, probably tomorrow evening.
Comment 42 dwagner 2018-08-16 21:53:57 UTC
Ok, I did the proposed debugging session with trace-cmd, with output to a different PC over ssh. Using today's amd-staging-drm-next - and BTW, Arch updated the Xorg server earlier today.

This time it took about 4 minutes until the video playback with 3 fps crashed. The symptom was the same (a one-colored blank screen and a subsequent system crash), but this time the kernel and ssh session survived the crash for some seconds - enough for me to also issue the earlier suggested "umr -O verbose -R gfx[.]" command after the amdgpu crash, so I can upload the output of that, too. This was the last command executed; the system crashed completely while running it, so its output may be partial.

Find attached dmesg, trace, and umr output.
Comment 43 dwagner 2018-08-16 21:55:49 UTC
Created attachment 141155 [details]
trace-cmd induced output during 3-fps-video replay and crash
Comment 44 dwagner 2018-08-16 21:56:38 UTC
Created attachment 141156 [details]
dmesg from boot to after the 3-fps-video test crash
Comment 45 dwagner 2018-08-16 21:57:19 UTC
Created attachment 141157 [details]
output of umr command after 3-fps-video test crash
Comment 46 Andrey Grodzovsky 2018-08-16 22:31:11 UTC
Thanks.
Comment 47 Andrey Grodzovsky 2018-08-17 21:25:08 UTC
Created attachment 141174 [details] [review]
add_debug_info.patch

I am attaching a basic debug patch; please try to apply it. It should give a bit more info in dmesg when a VM fault happens. I wasn't able to test it on my system, so it might be buggy or crash.

Reproduce again with trace-cmd like before, and once the fault happens, if possible quickly run 

sudo umr -O halt_waves -wa

and only if you still have a running system after that, run 
sudo umr -O verbose -R gfx[.]

The driver should be loaded with amdgpu.vm_fault_stop=2 set from grub.
Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly.
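For completeness, a sketch of where those parameters go on a grub-based setup (the "quiet" option is just a placeholder for whatever is already on the command line):

```shell
# /etc/default/grub -- append the amdgpu debug parameters, then regenerate
# the config, e.g. with:  grub-mkconfig -o /boot/grub/grub.cfg
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1"
```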
Comment 48 dwagner 2018-08-18 21:36:03 UTC
(In reply to Andrey Grodzovsky from comment #47)
> Created attachment 141174 [details] [review] [review]
> add_debug_info.patch
> 
> I am attaching a basic debug patch; please try to apply it.

Done.

> It should give a
> bit more info in dmesg when a VM fault happens. 

Hmm - I could not see any additional output resulting from it.

> Reproduce again like before with the cmd-trace like before and once the
> fault happens if possible try quickly run 
> 
> sudo umr -O halt_waves -wa
> 
> and only if you still have running system after that do the 
> sudo umr -O verbose -R gfx[.]
> 
> The driver should be loaded amdgpu.vm_fault_stop=2 from grub

Did that - I will attach the script "gpu_debug3.sh" and its output. This time, dmesg and trace output are in the same file; if you want to look only at the dmesg part, "grep '^\[' gpu_debug_3.txt" will get it. 

I reproduced the bug 4 times; on 2 occasions no error was emitted before crashing, and the other 2 times both umr commands could still run. Since the error messages looked the same, I'll attach the shorter file, where the crash occurred more quickly.

> Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly

I used that setting, but it did not seem to make a difference in how quickly the crash occurred - still "some seconds to some minutes".
Comment 49 dwagner 2018-08-18 21:37:20 UTC
Created attachment 141189 [details]
script used to generate the gpu_debug_3.txt (when executed via ssh -t ...)
Comment 50 dwagner 2018-08-18 21:38:10 UTC
Created attachment 141190 [details]
dmesg / trace / umr output from gpu_debug3.sh
Comment 51 dwagner 2018-08-18 21:40:01 UTC
Created attachment 141191 [details]
xz-compressed output of gpu_debug3.sh - dmesg, trace, umr
Comment 52 dwagner 2018-08-18 21:43:23 UTC
One other experiment I made: I wrote a script to quickly toggle pp_dpm_mclk and pp_dpm_sclk while playing a 3 fps video with power_dpm_force_performance_level=manual. This way I could not reproduce the crashes that happen with power_dpm_force_performance_level=auto.
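A minimal version of such a toggle script might look like this (a sketch; `toggle_dpm` is a hypothetical name, the sysfs directory is a parameter so the logic can be exercised outside /sys, and states 0/1 assume the card exposes at least two DPM levels, as the pp_dpm_* tables in this thread do):

```shell
#!/bin/bash
# Rapidly alternate the forced pp_dpm_mclk/pp_dpm_sclk states, roughly
# mimicking the clock transitions that automatic DPM performs on its own.
toggle_dpm() {
    local dev="$1" iterations="${2:-100}" delay="${3:-0.1}"
    echo manual > "$dev/power_dpm_force_performance_level"
    for ((i = 0; i < iterations; i++)); do
        local level=$((i % 2))      # alternate between DPM states 0 and 1
        echo "$level" > "$dev/pp_dpm_mclk"
        echo "$level" > "$dev/pp_dpm_sclk"
        sleep "$delay"
    done
}

# Example (run as root, while the 3 fps video is playing):
#   toggle_dpm /sys/class/drm/card0/device 600 0.1
```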
Comment 53 Andrey Grodzovsky 2018-08-20 14:16:08 UTC
Created attachment 141198 [details] [review]
add_debug_info2.patch

Try this patch instead; I might be missing some prints in the first one.
In the last log you attached I didn't see any UMR dumps or GPU fault prints in dmesg. The GPU fault has to be in the log to compare the faulting address against the debug prints in the patch.
Comment 54 dwagner 2018-08-21 08:41:52 UTC
(In reply to Andrey Grodzovsky from comment #53)
> Created attachment 141198 [details] [review] [review]
> add_debug_info2.patch
> 
> Try this patch instead; I might be missing some prints in the first one.

Can try that this evening.

> In the last log you attached I haven't seen any UMR dumps or GPU fault
> prints in dmesg. The GPU fault has to be in the log to compare the faulty
> address against the debug prints in the patch.

In above attached file "xz-compressed output of gpu_debug3.sh" there is umr output at the time of the crash (238 seconds after the reboot):

----------------------------------------------
...
          mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start: driver=drm_sched timeline=gfx context=162 seqno=87
          mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal: driver=drm_sched timeline=gfx context=162 seqno=87
     kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=210
     kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=211
[  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=32624, emitted seq=32626
[  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
[  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!

crash detected!

executing umr -O halt_waves -wa
No active waves!


executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 1792
polaris11.gfx.wptr == 1792
polaris11.gfx.drv_wptr == 1792
polaris11.gfx.ring[1761] == 0xffff1000    ... 
polaris11.gfx.ring[1762] == 0xffff1000    ... 
polaris11.gfx.ring[1763] == 0xffff1000    ... 
polaris11.gfx.ring[1764] == 0xffff1000    ... 
polaris11.gfx.ring[1765] == 0xffff1000    ... 
polaris11.gfx.ring[1766] == 0xffff1000    ... 
polaris11.gfx.ring[1767] == 0xffff1000    ... 
polaris11.gfx.ring[1768] == 0xffff1000    ... 
polaris11.gfx.ring[1769] == 0xffff1000    ... 
polaris11.gfx.ring[1770] == 0xffff1000    ... 
polaris11.gfx.ring[1771] == 0xffff1000    ... 
polaris11.gfx.ring[1772] == 0xffff1000    ... 
polaris11.gfx.ring[1773] == 0xffff1000    ... 
polaris11.gfx.ring[1774] == 0xffff1000    ... 
polaris11.gfx.ring[1775] == 0xffff1000    ... 
polaris11.gfx.ring[1776] == 0xffff1000    ... 
polaris11.gfx.ring[1777] == 0xffff1000    ... 
polaris11.gfx.ring[1778] == 0xffff1000    ... 
polaris11.gfx.ring[1779] == 0xffff1000    ... 
polaris11.gfx.ring[1780] == 0xffff1000    ... 
polaris11.gfx.ring[1781] == 0xffff1000    ... 
polaris11.gfx.ring[1782] == 0xffff1000    ... 
polaris11.gfx.ring[1783] == 0xffff1000    ... 
polaris11.gfx.ring[1784] == 0xffff1000    ... 
polaris11.gfx.ring[1785] == 0xffff1000    ... 
polaris11.gfx.ring[1786] == 0xffff1000    ... 
polaris11.gfx.ring[1787] == 0xffff1000    ... 
polaris11.gfx.ring[1788] == 0xffff1000    ... 
polaris11.gfx.ring[1789] == 0xffff1000    ... 
polaris11.gfx.ring[1790] == 0xffff1000    ... 
polaris11.gfx.ring[1791] == 0xffff1000    ... 
polaris11.gfx.ring[1792] == 0xc0032200    rwD 

trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash, flashing NUMLOCK LED.
     amdgpu_cs:0-799   [001] ....   286.852838: amdgpu_bo_list_set: list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
     amdgpu_cs:0-799   [001] ....   286.852846: amdgpu_bo_list_set: list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
...
----------------------------------------------

But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages this time. Sometimes they are emitted, sometimes not.
Comment 55 Andrey Grodzovsky 2018-08-21 14:43:24 UTC
(In reply to dwagner from comment #54)
> (In reply to Andrey Grodzovsky from comment #53)
> > Created attachment 141198 [details] [review] [review] [review]
> > add_debug_info2.patch
> > 
> > Try this patch instead, i might be missing some prints in the first one.
> 
> Can try that this evening.
> 
> > In the last log you attached I haven't seen any UMR dumps or GPU fault
> > prints in dmesg. THe GPU fault has to be in the log to compare the faulty
> > address against the debug prints in the patch.
> 
> In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> output at the time of the crash (238 seconds after the reboot):
> 
> ----------------------------------------------
> ...
>           mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start:
> driver=drm_sched timeline=gfx context=162 seqno=87
>           mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal:
> driver=drm_sched timeline=gfx context=162 seqno=87
>      kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=210
>      kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled:
> driver=amdgpu timeline=sdma1 context=11 seqno=211
> [  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=32624, emitted seq=32626
> [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> 
> crash detected!
> 
> executing umr -O halt_waves -wa
> No active waves!

Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault happened, that should have frozen the GPU's compute units, and hence the above command would produce a lot of wave info.

> 
> 
> executing umr -O verbose -R gfx[.]
> 
> polaris11.gfx.rptr == 1792
> polaris11.gfx.wptr == 1792
> polaris11.gfx.drv_wptr == 1792
> polaris11.gfx.ring[1761] == 0xffff1000    ... 
> polaris11.gfx.ring[1762] == 0xffff1000    ... 
> polaris11.gfx.ring[1763] == 0xffff1000    ... 
> polaris11.gfx.ring[1764] == 0xffff1000    ... 
> polaris11.gfx.ring[1765] == 0xffff1000    ... 
> polaris11.gfx.ring[1766] == 0xffff1000    ... 
> polaris11.gfx.ring[1767] == 0xffff1000    ... 
> polaris11.gfx.ring[1768] == 0xffff1000    ... 
> polaris11.gfx.ring[1769] == 0xffff1000    ... 
> polaris11.gfx.ring[1770] == 0xffff1000    ... 
> polaris11.gfx.ring[1771] == 0xffff1000    ... 
> polaris11.gfx.ring[1772] == 0xffff1000    ... 
> polaris11.gfx.ring[1773] == 0xffff1000    ... 
> polaris11.gfx.ring[1774] == 0xffff1000    ... 
> polaris11.gfx.ring[1775] == 0xffff1000    ... 
> polaris11.gfx.ring[1776] == 0xffff1000    ... 
> polaris11.gfx.ring[1777] == 0xffff1000    ... 
> polaris11.gfx.ring[1778] == 0xffff1000    ... 
> polaris11.gfx.ring[1779] == 0xffff1000    ... 
> polaris11.gfx.ring[1780] == 0xffff1000    ... 
> polaris11.gfx.ring[1781] == 0xffff1000    ... 
> polaris11.gfx.ring[1782] == 0xffff1000    ... 
> polaris11.gfx.ring[1783] == 0xffff1000    ... 
> polaris11.gfx.ring[1784] == 0xffff1000    ... 
> polaris11.gfx.ring[1785] == 0xffff1000    ... 
> polaris11.gfx.ring[1786] == 0xffff1000    ... 
> polaris11.gfx.ring[1787] == 0xffff1000    ... 
> polaris11.gfx.ring[1788] == 0xffff1000    ... 
> polaris11.gfx.ring[1789] == 0xffff1000    ... 
> polaris11.gfx.ring[1790] == 0xffff1000    ... 
> polaris11.gfx.ring[1791] == 0xffff1000    ... 
> polaris11.gfx.ring[1792] == 0xc0032200    rwD 
> 
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
> 
> done after crash, flashing NUMLOCK LED.
>      amdgpu_cs:0-799   [001] ....   286.852838: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
>      amdgpu_cs:0-799   [001] ....   286.852846: amdgpu_bo_list_set:
> list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
> ...
> ----------------------------------------------
> 
> But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages
> this time. Sometimes such are emitted, sometimes not.
Comment 56 dwagner 2018-08-21 21:16:52 UTC
(In reply to Andrey Grodzovsky from comment #55)
> > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> > output at the time of the crash (238 seconds after the reboot):
> > 
> > ----------------------------------------------
> > ...
> >           mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start:
> > driver=drm_sched timeline=gfx context=162 seqno=87
> >           mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal:
> > driver=drm_sched timeline=gfx context=162 seqno=87
> >      kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled:
> > driver=amdgpu timeline=sdma1 context=11 seqno=210
> >      kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled:
> > driver=amdgpu timeline=sdma1 context=11 seqno=211
> > [  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> > timeout, signaled seq=32624, emitted seq=32626
> > [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > 
> > crash detected!
> > 
> > executing umr -O halt_waves -wa
> > No active waves!
> 
> Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault happened, that
> should have frozen the GPU's compute units, and hence the above command would
> produce a lot of wave info.

Yes I did, as can be seen from the kernel command line at the very beginning of the file I attached:
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0 amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1

Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a procedure that discards whatever has been in thoses "waves" before? If yes, could amdgpu.gpu_recovery=0 prevent that from happening?
Comment 57 Andrey Grodzovsky 2018-08-21 21:29:48 UTC
(In reply to dwagner from comment #56)
> (In reply to Andrey Grodzovsky from comment #55)
> > > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> > > output at the time of the crash (238 seconds after the reboot):
> > > 
> > > ----------------------------------------------
> > > ...
> > >           mpv/vo-897   [005] ....   235.191542: dma_fence_wait_start:
> > > driver=drm_sched timeline=gfx context=162 seqno=87
> > >           mpv/vo-897   [005] d...   235.191548: dma_fence_enable_signal:
> > > driver=drm_sched timeline=gfx context=162 seqno=87
> > >      kworker/0:2-92    [000] ....   238.275988: dma_fence_signaled:
> > > driver=amdgpu timeline=sdma1 context=11 seqno=210
> > >      kworker/0:2-92    [000] ....   238.276004: dma_fence_signaled:
> > > driver=amdgpu timeline=sdma1 context=11 seqno=211
> > > [  238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> > > timeout, signaled seq=32624, emitted seq=32626
> > > [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > > [  238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > > 
> > > crash detected!
> > > 
> > > executing umr -O halt_waves -wa
> > > No active waves!
> > 
> > Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault happened, that
> > should have frozen the GPU's compute units, and hence the above command would
> > produce a lot of wave info.
> 
> Yes I did, as can be seen from the kernel command line at the very beginning
> of the file I attached:
> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd
> root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw
> cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d
> video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0
> amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2
> amdgpu.vm_debug=1
> 
> Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a
> procedure that discards whatever was in those "waves" before? If yes,
> could amdgpu.gpu_recovery=0 prevent that from happening?

Yes, missed that one. No resets.
Comment 58 dwagner 2018-08-22 00:24:35 UTC
Here comes another trace log, with your info2.patch applied.

Something must have changed since the last test, as it took pretty long this time to reproduce the crash. Could that have been caused by https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8 now being part of the kernel?

However, the latest trace, which you find attached below, is not much different from the last one; xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:

[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
[ 1510.023117] [drm] GPU recovery disabled.

     amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0

And later in the file you can find:
------------------------------------------------------
crash detected!

executing umr -O halt_waves -wa
No active waves!

executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000    ... 
polaris11.gfx.ring[ 482] == 0xffff1000    ... 
polaris11.gfx.ring[ 483] == 0xffff1000    ... 
polaris11.gfx.ring[ 484] == 0xffff1000    ... 
polaris11.gfx.ring[ 485] == 0xffff1000    ... 
polaris11.gfx.ring[ 486] == 0xffff1000    ... 
polaris11.gfx.ring[ 487] == 0xffff1000    ... 
polaris11.gfx.ring[ 488] == 0xffff1000    ... 
polaris11.gfx.ring[ 489] == 0xffff1000    ... 
polaris11.gfx.ring[ 490] == 0xffff1000    ... 
polaris11.gfx.ring[ 491] == 0xffff1000    ... 
polaris11.gfx.ring[ 492] == 0xffff1000    ... 
polaris11.gfx.ring[ 493] == 0xffff1000    ... 
polaris11.gfx.ring[ 494] == 0xffff1000    ... 
polaris11.gfx.ring[ 495] == 0xffff1000    ... 
polaris11.gfx.ring[ 496] == 0xffff1000    ... 
polaris11.gfx.ring[ 497] == 0xffff1000    ... 
polaris11.gfx.ring[ 498] == 0xffff1000    ... 
polaris11.gfx.ring[ 499] == 0xffff1000    ... 
polaris11.gfx.ring[ 500] == 0xffff1000    ... 
polaris11.gfx.ring[ 501] == 0xffff1000    ... 
polaris11.gfx.ring[ 502] == 0xffff1000    ... 
polaris11.gfx.ring[ 503] == 0xffff1000    ... 
polaris11.gfx.ring[ 504] == 0xffff1000    ... 
polaris11.gfx.ring[ 505] == 0xffff1000    ... 
polaris11.gfx.ring[ 506] == 0xffff1000    ... 
polaris11.gfx.ring[ 507] == 0xffff1000    ... 
polaris11.gfx.ring[ 508] == 0xffff1000    ... 
polaris11.gfx.ring[ 509] == 0xffff1000    ... 
polaris11.gfx.ring[ 510] == 0xffff1000    ... 
polaris11.gfx.ring[ 511] == 0xffff1000    ... 
polaris11.gfx.ring[ 512] == 0xc0032200    rwD 


trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash.
-------------------------------------------

So even without a GPU reset, still no "waves". And the error message also does not state any VM fault address.
Comment 59 dwagner 2018-08-22 00:26:06 UTC
Created attachment 141228 [details]
latest crash trace output, without gpu_reset
Comment 60 Andrey Grodzovsky 2018-08-22 14:33:03 UTC
(In reply to dwagner from comment #58)
> Here comes another trace log, with your info2.patch applied.
> 
> Something must have changed since the last test, as it took pretty long this
> time to reproduce the crash. Could that have been caused by
> https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/
> nbio_v7_4.c?h=amd-staging-drm-
> next&id=b385925f3922faca7435e50e31380bb2602fd6b8 now being part of the
> kernel?

Don't think it's related. This code is more related to virtualization.

> 
> However, the latest trace you find attached below is not much different to
> the last one, xzcat /tmp/gpu_debug5.txt.xz  | grep '^\[' will tell you:
> 
> [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=475104, emitted seq=475106
> [ 1510.023117] [drm] GPU recovery disabled.

That just means you are again running with the GPU VM update mode set to use SDMA, which is seen in your dmesg (amdgpu.vm_update_mode=0), so you are again experiencing the original SDMA hang issue. Please use amdgpu.vm_update_mode=3 to get back to VM_FAULTs issue.
 
> 
>      amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs:
> soffs=00001001a0, eoffs=00001001b9, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs:
> soffs=0000100200, eoffs=00001021e0, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs:
> soffs=0000102200, eoffs=00001041e0, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493129: amdgpu_vm_bo_cs:
> soffs=000010c1e0, eoffs=000010c2e1, flags=70
>      amdgpu_cs:0-806   [012] ....  1787.493131: drm_sched_job:
> entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job
> count:8, hw job count:0
> 
> And later in the file you can find:
> ------------------------------------------------------
> crash detected!
> 
> executing umr -O halt_waves -wa
> No active waves!
> 
> executing umr -O verbose -R gfx[.]
> 
> polaris11.gfx.rptr == 512
> polaris11.gfx.wptr == 512
> polaris11.gfx.drv_wptr == 512
> polaris11.gfx.ring[ 481] == 0xffff1000    ... 
> polaris11.gfx.ring[ 482] == 0xffff1000    ... 
> polaris11.gfx.ring[ 483] == 0xffff1000    ... 
> polaris11.gfx.ring[ 484] == 0xffff1000    ... 
> polaris11.gfx.ring[ 485] == 0xffff1000    ... 
> polaris11.gfx.ring[ 486] == 0xffff1000    ... 
> polaris11.gfx.ring[ 487] == 0xffff1000    ... 
> polaris11.gfx.ring[ 488] == 0xffff1000    ... 
> polaris11.gfx.ring[ 489] == 0xffff1000    ... 
> polaris11.gfx.ring[ 490] == 0xffff1000    ... 
> polaris11.gfx.ring[ 491] == 0xffff1000    ... 
> polaris11.gfx.ring[ 492] == 0xffff1000    ... 
> polaris11.gfx.ring[ 493] == 0xffff1000    ... 
> polaris11.gfx.ring[ 494] == 0xffff1000    ... 
> polaris11.gfx.ring[ 495] == 0xffff1000    ... 
> polaris11.gfx.ring[ 496] == 0xffff1000    ... 
> polaris11.gfx.ring[ 497] == 0xffff1000    ... 
> polaris11.gfx.ring[ 498] == 0xffff1000    ... 
> polaris11.gfx.ring[ 499] == 0xffff1000    ... 
> polaris11.gfx.ring[ 500] == 0xffff1000    ... 
> polaris11.gfx.ring[ 501] == 0xffff1000    ... 
> polaris11.gfx.ring[ 502] == 0xffff1000    ... 
> polaris11.gfx.ring[ 503] == 0xffff1000    ... 
> polaris11.gfx.ring[ 504] == 0xffff1000    ... 
> polaris11.gfx.ring[ 505] == 0xffff1000    ... 
> polaris11.gfx.ring[ 506] == 0xffff1000    ... 
> polaris11.gfx.ring[ 507] == 0xffff1000    ... 
> polaris11.gfx.ring[ 508] == 0xffff1000    ... 
> polaris11.gfx.ring[ 509] == 0xffff1000    ... 
> polaris11.gfx.ring[ 510] == 0xffff1000    ... 
> polaris11.gfx.ring[ 511] == 0xffff1000    ... 
> polaris11.gfx.ring[ 512] == 0xc0032200    rwD 
> 
> 
> trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
> trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
> 
> done after crash.
> -------------------------------------------
> 
> So even without GPU reset, still no "waves". And the error message also does
> not state any VM fault address.
Comment 61 dwagner 2018-08-22 22:18:11 UTC
> Please use amdgpu.vm_update_mode=3 to get back to VM_FAULTs issue.

The "good" news is that reproduction of the crashes with 3-fps-video-replay is very quick when using amdgpu.vm_update_mode=3.

But the bad news is that I have not been able to get useful error output when using vm_update_mode=3.

At first I tried with amdgpu.vm_debug=1 as well, and with that, in 10 crashes not a single error line was emitted to either the ssh channel or the system journal.

I then tried with amdgpu.vm_debug=0, and while a few error lines did get logged, they were not of much use - see the attached example:

[  912.447139] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12818, emitted seq=12819
[  912.447145] [drm] GPU recovery disabled.

These are the only lines indicating the error; not even the
 echo "crash detected!"
after the
 dmesg -w | tee /dev/tty | grep -m 1 -e "amdgpu.*GPU" -e "amdgpu.*ERROR"
gets emitted, much less the umr commands that should follow.

What could I do to keep the kernel from dying so quickly when using amdgpu.vm_update_mode=3?
Comment 62 dwagner 2018-08-22 22:18:49 UTC
Created attachment 141243 [details]
crash trace with amdgpu.vm_update_mode=3
Comment 63 Anthony Ruhier 2018-09-19 23:35:10 UTC
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Comment 64 Anthony Ruhier 2018-09-19 23:35:42 UTC
(In reply to Anthony Ruhier from comment #63)
> FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.

Forgot to say that I have a vega 64.
Comment 65 dwagner 2018-09-23 22:04:23 UTC
(In reply to Anthony Ruhier from comment #63)
> FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.

Unluckily, I cannot confirm either observation: the current amd-staging-drm-next git head still crashes on me quickly, still readily reproducible with the 3-fps-video-replay test.

And going into S3 suspend does not work for me with the current amd-staging-drm-next either.
Comment 66 Anthony Ruhier 2018-09-23 23:42:52 UTC
(In reply to dwagner from comment #65)
> (In reply to Anthony Ruhier from comment #63)
> > FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
> > been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
> 
> Unluckily, I cannot confirm either observation: The current
> amd-staging-drm-next git head still crashes on me quickly, still well
> reproduceable with the 3-fps-video-replay test.
> 
> And going into S3 suspend does not work for me with the current
> amd-staging-drm-next either.

Last time I tested, amd-staging-drm-next seemed to be based on 4.19-rc1, on which I had the issue too. I switched to vanilla 4.19-rc4 (now -rc5) and it was fixed.
Comment 67 Roshless 2018-09-25 12:11:29 UTC
Tried on 4.19-rc5, still crashes for me after about 2-3 days (of 6-12h use)
Comment 68 dwagner 2018-11-14 00:23:15 UTC
Tested today's current amd-staging-drm-next git head, to see if there has been any improvement over the last two months.

The bad news: The 3-fps-video-replay test still crashes the driver reproducibly after a few minutes, as long as the default automatic power management is active.

The mediocre news: At least it looks as if the Linux kernel now survives the driver crash to some extent; I found messages in the journal like this:

Nov 14 00:59:36 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:36 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:37 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=107, emitted seq=109
Nov 14 00:59:37 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:40 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:40 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:41 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=107, emitted seq=109

... and so on repeating for several minutes after the screen went blank.

Will test tomorrow if this means I can now collect the diagnostics outputs that were asked for earlier.

Some good news: S3 suspends/resumes are working fine right now. There are some scary messages emitted upon resume, but they do not seem to have bad consequences:

[  281.465654] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[  281.490719] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[  282.006225] [drm] Fence fallback timer expired on ring sdma0
[  282.512879] [drm] Fence fallback timer expired on ring sdma0
[  282.556651] [drm] UVD and UVD ENC initialized successfully.
[  282.657771] [drm] VCE initialized successfully.
Comment 69 dwagner 2018-11-15 23:37:57 UTC
As promised in the above comment, today I ran my debug script "gpu_debug4.sh" to obtain the diagnostic output after the crash, as requested above.
This output is in attached "gpu_debug4_output.txt".
Since the trace output, the "dmesg -w" output and stdout are written to the same file, they are roughly chronological.

If you want to look only at the dmesg-output, use
> grep '^\[' gpu_debug4_output.txt

(gpu_debug4.sh is a slight variation of earlier gpu_debug3.sh, just writing to a local log file.)

BTW: I ran the script multiple times; crashes occurred after 5 to 300 seconds, and the diagnostic output always looked like in the attached gpu_debug4_output.txt.
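For readers without the attachment, the crash-watch step of these scripts can be sketched as a small polling helper (a sketch; `wait_for_crash` is a hypothetical name, and the grep patterns are the ones used in the earlier comments):

```shell
#!/bin/bash
# Block until a combined dmesg/trace log contains an amdgpu GPU-reset or
# ERROR line, i.e. until the crash has been detected.
wait_for_crash() {
    local logfile="$1"
    until grep -q -m 1 -e 'amdgpu.*GPU' -e 'amdgpu.*ERROR' "$logfile" 2>/dev/null; do
        sleep 1
    done
}

# Typical use, with dmesg and ftrace feeding the same log file so the
# streams stay roughly chronological:
#   dmesg -w >> /tmp/gpu_debug.log &
#   cat /sys/kernel/debug/tracing/trace_pipe >> /tmp/gpu_debug.log &
#   wait_for_crash /tmp/gpu_debug.log && echo "crash detected!"
```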
Comment 70 dwagner 2018-11-15 23:38:29 UTC
Created attachment 142483 [details]
test script
Comment 71 dwagner 2018-11-15 23:39:44 UTC
Created attachment 142484 [details]
gpu_debug4_output.txt.gz
Comment 72 dwagner 2018-12-17 22:56:07 UTC
Just for the record, since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next within minutes. (Also using the very latest firmware files from https://people.freedesktop.org/~agd5f/radeon_ucode/ )
Comment 73 Jānis Jansons 2018-12-22 20:41:14 UTC
Someone suggested I buy a Ryzen 2400G APU, but almost every time some network lag happens while watching a TV stream through Kodi and the FPS of that video goes to 0, the display just freezes and you have to power-cycle the computer.
There is no space for an external graphics card in my case, and I don't want the increased power consumption, so at this point I'm just considering a switch to an Intel CPU.

I have been following this case for 4 months now with hope that it would move forward a bit but it seems stuck.

I can give additional dumps and test some patches if that would help, but it seems like others have already given plenty of information on how to reproduce it.
Comment 74 fin4478 2018-12-24 12:56:16 UTC Comment hidden (spam)
Comment 75 dwagner 2018-12-24 14:49:24 UTC
Audio is unrelated to this bug. In my reproduction scripts, I do not output any audio at all. 

The video-at-3-fps replay that I use for reproduction seems to trigger a certain pattern of the memory and shader clocks being increased/decreased (with dynamic power management enabled) that makes the occurrence of this bug likely. Any other GPU-usage pattern that triggers a lot of memory/shader clock changes also seems to increase the crash likelihood - manual use of a web browser, where GPU load spikes occur a few times per second, is another scenario in which this bug is triggered now and then.
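For anyone trying to correlate a crash with these clock transitions, the pp_dpm_sclk/pp_dpm_mclk sysfs files mentioned elsewhere in this report can be polled from a small shell script. This is only a sketch: it assumes the usual amdgpu sysfs layout under /sys/class/drm/card0/device, which may differ on your system, and the function names are my own.

```shell
#!/bin/sh
# Print the currently active DPM levels (the lines marked '*') for a card.
# Takes the device sysfs directory as its argument.
active_levels() {
    grep -H '\*' "$1/pp_dpm_sclk" "$1/pp_dpm_mclk" 2>/dev/null
}

# Poll and log (with a timestamp) only when the active level changes,
# so clock transitions can be matched against the time of a crash.
# Sketch only; Ctrl-C to stop.
watch_levels() {
    dev="${1:-/sys/class/drm/card0/device}"
    last=""
    while :; do
        cur=$(active_levels "$dev")
        if [ "$cur" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $cur"
            last=$cur
        fi
        sleep 0.2
    done
}

# To start polling the default card:
# watch_levels
```

Running this over ssh (so the log survives the hang) would show whether the freeze always follows a particular sclk/mclk transition.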
Comment 76 dwagner 2019-01-19 17:01:52 UTC
Just for the record, since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next within minutes.

As a bonus bug, with today's git head I also get inexplicable "minimal" memory and shader clock values - and doubled power consumption (12W instead of 6W) for my default 3840x2160 60Hz display mode, compared to last month's drm-next of the day:

> cd /sys/class/drm/card0/device

> xrandr --output HDMI-A-0 --mode 3840x2160 --rate 30
> echo manual >power_dpm_force_performance_level
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> grep -H \\* pp_dpm_mclk pp_dpm_sclk
pp_dpm_mclk:0: 300Mhz *
pp_dpm_sclk:0: 214Mhz *

> xrandr --output HDMI-A-0 --mode 3840x2160 --rate 50
> echo manual >power_dpm_force_performance_level
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> grep -H \\* pp_dpm_mclk pp_dpm_sclk
pp_dpm_mclk:1: 1750Mhz *
pp_dpm_sclk:1: 481Mhz *

> xrandr --output HDMI-A-0 --mode 3840x2160 --rate 60
> echo manual >power_dpm_force_performance_level
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> grep -H \\* pp_dpm_mclk pp_dpm_sclk
pp_dpm_mclk:0: 300Mhz *
pp_dpm_sclk:6: 1180Mhz *

But that power consumption issue is negligible in comparison to the show-stopping crashes that are the topic of this bug report.
Comment 77 dwagner 2019-02-16 15:06:38 UTC
Since another month has passed: I can still reproduce the crash with today's git head of amd-staging-drm-next (and an up-to-date Arch Linux) within minutes by replaying a video at 3 fps.

Additional new bonus bugs this time:
- system consistently hangs at soft-reboots if X11 was started before
- system crashes immediately upon X11 start if vm_update_mode=3 is used
- system crashes if the HDMI-connected TV is shut off while screen blanking

Again, the bonus bugs are either irrelevant in comparison to the instability this report is about or have been reported already by others.
Comment 78 Mauro Gaspari 2019-04-11 06:40:13 UTC
Hi, I am also affected by similar issues using the amdgpu driver on Linux, and I had opened another bug before finding this one.
You can have a look at my findings and the workarounds I am applying. So far I have had good success with those, but I am interested in your thoughts, recommendations, and feedback.

Also, if the bug I opened is a duplicate of this one, feel free to let me know and I will mark it as a duplicate.

https://bugs.freedesktop.org/show_bug.cgi?id=109955

Cheers
Mauro
Comment 79 Jaap Buurman 2019-04-12 22:11:37 UTC
I am also running into the same issue. I have two questions that might help track down why we are having issues while not all people running a Vega graphics card are.

1)

What is the output of the following command for you guys?

cat /sys/class/drm/card0/device/vbios_version 

I am running the following version:

113-D0500100-103

According to the techpowerup GPU BIOS database, this is a Vega BIOS that was replaced two days (!) later by a new version. Perhaps issues were found that required another BIOS update? I might install Windows on a spare HDD and try to flash my Vega to see if that changes anything.

2)

Memory clocking is different for people running multiple monitors. Are you guys also running multiple monitors by any chance?
Comment 80 dwagner 2019-04-12 23:00:53 UTC
(In reply to Jaap Buurman from comment #79)
> I am also running into the same issue. I have two questions that might help
> tracking down why we are having issues, but not all people that are running
> a Vega graphics card.

As you can see from my initial description, I'm running an RX460, which uses not a "Vega", but a "Polaris 11" AMD GPU.

> What is the output of the following command for you guys?
> 
> cat /sys/class/drm/card0/device/vbios_version 

"113-BAFFIN_PRO_1606"

I have not heard of any update to this from the vendor - there is just some unofficial hacked version around (which I do not use) that is said to enable some switched-off CUs.

> Memory clocking is different for people running multiple monitors. Are you
> guys also running multiple monitors by any chance?

No, I'm using just one 3840x2160 @ 60Hz HDMI display.
Comment 81 Jaap Buurman 2019-04-13 13:27:53 UTC
(In reply to Alex Deucher from comment #14)
> (In reply to dwagner from comment #13)
> > 
> > Much lower shader clocks are used only if I lower the refresh rate of the
> > screen. Is there a reason why the shader clocks should stay high even in the
> > absence of 3d/compute load?
> > 
> 
> Certain display requirements can cause the engine clock to be kept higher as
> well.

In this bug report and another similar one (https://bugs.freedesktop.org/show_bug.cgi?id=109955), everybody having the issue seems to be using a setup that requires higher engine clocks at idle, AFAIK: either high-refresh displays or, in my case, multiple monitors. Could this be part of what triggers this bug? I might be grasping at straws here, but I have had this problem for as long as I have had this Vega64 (bought at launch), while it is 100% stable under Windows 10 in the same setup.
Comment 82 Matt Coffin 2019-06-03 20:03:49 UTC
I am also experiencing this issue.

* Kernel: 5.1.3-arch2-1-ARCH
* LLVM 8.0.0
* AMDVLK (dev branch pulled 20190602)
* Mesa 19.0.4
* Card: XFX Radeon RX 590

I've seen this error, bug 105733, bug 105152, bug 107536, and bug 109955 all repeatable (which one each time appears to be non-deterministic) with the same process.

I just launch "House Flipper" from Steam (DX11 title), with DXVK 1.2.1, on either the mesa RADV or AMDVLK vulkan implementations.

At 2560x1440 resolution (both 60Hz and 144Hz refresh rates), the crash(es) occur. At 1080p@60Hz, I get no crashes, but they come back if I disable v-sync and framerate limiting.

I logged power consumption with `sensors | egrep '^power' | awk '{ print $1 " " $2; }'`, and found that the crash often occurs soon after the card hits its maximum power draw at around 190W.
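That one-liner can be extended into a timestamped polling loop, which makes it easier to see how close the last power reading was to the moment of the crash. This is a sketch under the assumption that `sensors` (from lm-sensors) reports lines beginning with "power" for the card; the exact labels depend on your hwmon driver, and `format_power` is my own helper name.

```shell
#!/bin/sh
# Prefix each power reading from sensors-style output with a timestamp,
# so the log shows how close the card was to its power limit before a hang.
# format_power reads on stdin, which makes it testable without a GPU.
format_power() {
    awk -v ts="$(date '+%H:%M:%S')" '/^power/ { print ts, $1, $2 }'
}

# Example polling loop (sketch): append one reading per second to a file.
# while :; do sensors | format_power >> power.log; sleep 1; done
```

Logging to a file (ideally over ssh or to another machine) preserves the last reading even when the display freezes.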

I don't have much experience debugging or developing software at the kernel/driver level, but I'm happy to help with providing information as I go through the learning process here. I'll compile the amd-staging-drm-next kernel later tonight and post some results and logs.

Please let me know if there's more information I could provide that may be of use here. Thanks for your hard work!
Comment 83 Wilko Bartels 2019-07-08 07:51:29 UTC
(In reply to Jaap Buurman from comment #81)
> issue seems to be using a setup that requires higher engine clocks in idle
> AFAIK. Either high refresh displays, or in my case, multiple monitors. Could
> this be part of the issue that seems to trigger this bug? I might be
> grasping at straws here, but I have had this problem for as long as I have
> this Vega64 (bought at launch), while it is 100% stable under Windows 10 in
> the same setup.

This might be true. I was running i3 with xrandr set to 144Hz when the freeze scenario began (sometime last month; I did not "game" much before). Then I switched to icewm to test and the issue was gone. Later, when I configured icewm to also have the proper xrandr setting, the issue came back. I didn't know that could be related. I will test this tonight.
Comment 84 Wilko Bartels 2019-07-09 07:38:25 UTC
(In reply to Wilko Bartels from comment #83)
> (In reply to Jaap Buurman from comment #81)
> > issue seems to be using a setup that requires higher engine clocks in idle
> > AFAIK. Either high refresh displays, or in my case, multiple monitors. Could
> > this be part of the issue that seems to trigger this bug? I might be
> > grasping at straws here, but I have had this problem for as long as I have
> > this Vega64 (bought at launch), while it is 100% stable under Windows 10 in
> > the same setup.
> 
> This might be true. I was running i3 with xrandr set to 144Hz when the
> freeze scenario began (sometime last month; I did not "game" much before).
> Then I switched to icewm to test and the issue was gone. Later, when I
> configured icewm to also have the proper xrandr setting, the issue came
> back. I didn't know that could be related. I will test this tonight.

Never mind, it crashed at 60Hz as well (once) yesterday.
Comment 85 dwagner 2019-07-09 21:50:04 UTC
(In reply to Wilko Bartels from comment #84)
> Never mind, it crashed at 60Hz as well (once) yesterday.

It sure does. This bug is now about two years old, during which time amdgpu has never been stable and has only gotten worse; every contemporary kernel, whether an "official" one or one compiled from the git heads of development trees, has this very problem, which I can reproduce within minutes.

I've given up hoping for a fix. I'll buy an Intel Xe GPU as soon as it hits the shelves.
Comment 86 Paul Ezvan 2019-09-07 05:42:21 UTC
I was also impacted by this bug (amdgpu hangs under random conditions, with messages similar to those shown above) with any kernel/Mesa version combination other than the ones in Debian Stretch (any other distro, or using Mesa from backports, would trigger the crashes).
This was on a Ryzen 1700 platform with a B450 chipset. I had this issue with both an RX480 and an RX560 (as I tried replacing the GPU in case it was faulty; I also replaced the motherboard).

I was still impacted on Fedora 30, with recurring GPU hangs. Then I replaced the CPU/motherboard with a Core i7-9700K/Z390 platform. Since then I have not had a single GPU hang on Fedora 30.

My hypothesis on why this problem is not easily reproducible is that it happens only with specific GPU/CPU combinations.
Comment 87 dwagner 2019-09-12 23:09:47 UTC
(In reply to Paul Ezvan from comment #86)
> My hypothesis on this problem not being easily reproducible is that it would
> happen only on specific GPU/CPU combinations.

... and at least a specific operating system (Linux) and a specific driver (amdgpu with dc=1).

If your hypothesis were true - do you suggest everyone plagued by this bug just buys a new main-board and an Intel CPU to evade it?

Since my Ryzen system is perfectly stable when used as a server, not displaying anything but the text console, I'm inclined to keep my main-board and CPU and just exchange the GPU for another brand that comes with stable drivers.
Comment 88 jeroenimo 2019-09-25 21:37:12 UTC
Found this thread while googling the error from the log.

AMD Ryzen 3600
Asrock B350 motherboard
ASrock RX560 Radeon GPU


Ubuntu and Xubuntu 18.04 and 19.04 both lock up, so they are not usable: after login there is an almost immediate black screen, though SSH access is still possible. Both seem to ship a newer kernel and Mesa drivers. Sometimes it takes 5 minutes, sometimes 2 seconds.

Linux Mint 19.2 seems a lot more stable, but there has still been 1 lockup with a black screen so far.

uname -a
Linux jeroenimo-amd 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


Last log from mint:

Sep 25 23:01:57 jeroenimo-amd kernel: [ 4980.207322] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out
Sep 25 23:01:57 jeroenimo-amd kernel: [ 4980.207331] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out
Sep 25 23:02:07 jeroenimo-amd kernel: [ 4990.451366] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out

I suspect I'm in the same trouble as most.

Windows 10 runs flawlessly, so it's really a software issue.
Comment 89 jeroenimo 2019-09-26 08:35:24 UTC
I found a way to crash the system with glmark2 - it crashes almost instantly.
Comment 90 jeroenimo 2019-09-26 12:29:04 UTC
I managed to run glmark2 without crashing the system by running the card manually at its lowest frequency.

From a root shell:
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 0 > /sys/class/drm/card0/device/pp_dpm_sclk

root@jeroenimo-amd:/home/jeroen# cat /sys/class/drm/card0/device/pp_dpm_sclk 
0: 214Mhz *
1: 387Mhz 
2: 843Mhz 
3: 995Mhz 
4: 1062Mhz 
5: 1108Mhz 
6: 1149Mhz 
7: 1176Mhz 
root@jeroenimo-amd:/home/jeroen# 

If I go to a higher level, e.g. 2: 843Mhz, I can still manage to crash it, although it takes a while before it crashes.

When I force the card to anything above level 4, I get an immediate crash without even starting glmark2.

I hope this helps!

