107152 – GPU fault detected: 146 / VM_CONTEXT1_PROTECTION_FAULT / ring gfx timeout

Summary: GPU fault detected: 146 / VM_CONTEXT1_PROTECTION_FAULT / ring gfx timeout

Status:	RESOLVED DUPLICATE of bug 102322

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2018-07-07 20:03 UTC by dwagner
Modified:	2019-01-24 06:45 UTC
CC List:	2 users

Attachments
dmesg from booting until the GPU fault a few minutes later (83.87 KB, text/plain) 2018-07-07 20:04 UTC, dwagner
Xorg.log from the session ending in the "gpu fault" (28.01 KB, text/plain) 2018-07-07 20:05 UTC, dwagner
test script that attempted to catch useful output after crashes - but failed (1.07 KB, text/plain) 2018-08-03 23:42 UTC, dwagner

Description dwagner 2018-07-07 20:03:25 UTC

While I was just browsing with Firefox, amdgpu and then the whole system crashed on me, with the following messages emitted to the journal:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146 0x0c80440c
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7, pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)
Jul 07 01:08:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=75244, last emitted seq=75245
Jul 07 01:08:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!

Kernel version used: amd-staging-drm-next current as of commit bb2e406ba66c2573b68e609e148cab57b1447095, with patch  https://bugs.freedesktop.org/attachment.cgi?id=140418 applied on top of it.

Mesa version: 18.1.3-1 (current from Arch Linux)

(This report was separated from https://bugs.freedesktop.org/show_bug.cgi?id=102322)

Comment 1 dwagner 2018-07-07 20:04:35 UTC

Created attachment 140497
dmesg from booting until the GPU fault a few minutes later

Comment 2 dwagner 2018-07-07 20:05:22 UTC

Created attachment 140498
Xorg.log from the session ending in the "gpu fault"

Comment 3 krzysiek 2018-07-30 21:12:38 UTC

Hi,

I get this GPU hang once or twice a day, mostly when using PhpStorm (a Java-based PHP editor).

The system is KDE Neon (Ubuntu 18.04 + latest KDE), and I use the padoka PPA (currently 1:18.2~git180730133900.0ea243d~b~padoka0).

GPU POLARIS11 0x1002:0x67EF 0x1458:0x230A 0xE5
[ 8004.993577] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000620c
[ 8004.993584] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 8004.993587] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0806200C
[ 8004.993591] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 32771) at page 0, read from 'CBC0' (0x43424330) (98)
[ 8263.966497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=194416, last emitted seq=194418
[ 8263.966510] [drm] IP block:gfx_v8_0 is hung!
[ 8263.966562] amdgpu 0000:01:00.0: GPU reset begin!

If you need more info please let me know
Krzysiek

Comment 4 dwagner 2018-07-31 21:41:14 UTC

I saw this kind of crash (still with the latest amd-staging-drm-next kernel) three times in a row today, just by playing a specific video with mpv immediately after rebooting and starting X11. Each time, the crash came before the 10-minute video ended.
The video (which just shows a static cover image) can be obtained via:

youtube-dl -f 248+251 'https://www.youtube.com/watch?v=kYKE78Pcjog'

The log messages were just like those reported above. I guess the additional "hw_done or flip_done timed out" after the "GPU reset begin!" is not really relevant:

Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 147 0x0f580402 for process Xorg pid 793 thread amdgpu_cs:0 pid 794
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010C3EB
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004002
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x02, vmid 1, pasid 32768) at page 1098731, read from 'TC3' (0x54433300) (4)
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146 0x0c984424 for process Xorg pid 793 thread amdgpu_cs:0 pid 794
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100193
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04044024
Jul 31 22:20:21 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x24, vmid 2, pasid 32768) at page 1048979, read from 'TC1' (0x54433100) (68)
Jul 31 22:20:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=26570, emitted seq=26573
Jul 31 22:20:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Jul 31 22:20:35 ryzen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:44:crtc-0] hw_done or flip_done timed out

Comment 5 dwagner 2018-07-31 21:45:29 UTC

Just in case somebody aims at reproducing this with mpv, this is the content of the .config/mpv/mpv.conf file in use:

audio-device='alsa/iec958:CARD=Generic,DEV=0'
audio-delay=0.2
fs=no
vo=gpu
gpu-api=auto
profile=gpu-hq
fbo-format=rgba16f
hwdec=no
video-align-y=-1
hidpi-window-scale=no
target-prim=bt.709
tone-mapping=hable
cache=65536
cache-initial=2048
cache-secs=20

So the amdgpu HDMI audio output was not used, and neither was hardware-accelerated video decoding.

Comment 6 Andrey Grodzovsky 2018-08-02 21:25:55 UTC

dwagner, how quickly is this reproducible?
A wild guess: what happens if you boot the kernel with the IOMMU disabled? Add iommu=off to the kernel command line in GRUB.

Comment 7 dwagner 2018-08-02 21:54:20 UTC

(In reply to Andrey Grodzovsky from comment #6)
> dwagner, how quickly is this reproducible ?

With the above video playback test (which I should refer to as the "Othan" test, because that is the name of the song in the video) actually quite fast - never took more than 10 minutes so far to get to the crash.

> A wild guess, what if you boot kernel with IOMMU disabled ? Add iommu=off to
> grub command line.

Tried this: no difference. Two attempts with current amd-staging-drm-next, one with vm_update_mode=0 and one with vm_update_mode=3, both crashed within less than 1 minute of playback.

Interestingly, the "Othan test" can even crash the 4.13 kernel quicker than the usual one or two days of uptime I can get with that old kernel.

There isn't really anything special with the video other than it being encoded at only 6 frames per second.

And by the way, the video playback crashes even with --vo=xv, so without mpv making use of OpenGL. Playback does not crash with --vo=null.

In contrast, when I play videos at the usual 24 fps, this runs much longer without crashing.

Comment 8 Andrey Grodzovsky 2018-08-03 16:54:28 UTC

dwagner, I think you already have all the trace tools installed from previous debug sessions, so this should be quick for you:

Update to the latest kernel from https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Boot the system and, before starting to reproduce the issue, run the following trace command:

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"


After the VM fault has happened, extract the log from /sys/kernel/debug/tracing.
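
A minimal sketch of that extraction step, assuming the plain-text buffer is the "trace" file in that directory and that you run it as root. The function name save_trace and the default output file name are only illustrative.

```bash
#!/bin/bash
# Sketch only: copy the kernel trace buffer to a regular file.
# save_trace and its default output name are hypothetical.
save_trace() {
    tracing_dir=${1:-/sys/kernel/debug/tracing}
    out_file=${2:-amdgpu_trace.txt}
    # Copy the text form of the ring buffer, then flush it to disk
    # so that it has a chance to survive a later hard hang.
    cat "$tracing_dir/trace" > "$out_file" && sync
}

# Example (as root, after the fault has been logged):
#   save_trace /sys/kernel/debug/tracing /root/amdgpu_trace.txt
```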

Also run:
sudo umr -O verbose -R gfx[.]
sudo umr -O halt_waves -wa

Now let's say this is your crash log:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7, pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)

Do

umr -O verbose -vm 7@100190000 1 

where 7 is the vmid value and 100190000 is the VM_CONTEXT1_PROTECTION_FAULT_ADDR value with an extra '000' appended, which converts the virtual page number into the actual virtual address (a left shift by 12 bits, because pages are 4096 bytes).
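
The conversion can also be done with shell arithmetic instead of by hand. This is just a sketch: page_to_addr and umr_vm_command are made-up helper names, and the second one only prints the command line in the same form as above, it does not run umr.

```bash
#!/bin/bash
# Sketch only: hypothetical helpers for the page-number-to-address step.

# Shift the faulting page number left by 12 bits (4096-byte pages).
page_to_addr() {
    printf '0x%x\n' "$(( $1 << 12 ))"
}

# Print (not run) the umr invocation for a given vmid and fault address value.
umr_vm_command() {
    vmid=$1
    page=$2
    printf 'umr -O verbose -vm %s@%x 1\n' "$vmid" "$(( page << 12 ))"
}

page_to_addr 0x00100190        # prints 0x100190000
umr_vm_command 7 0x00100190    # prints umr -O verbose -vm 7@100190000 1
```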

I can look at the log then and also run it by our MESA/LLVM experts to try and figure out what's going on.

Comment 9 dwagner 2018-08-03 23:42:04 UTC

Created attachment 140959
test script that attempted to catch useful output after crashes - but failed

Comment 10 dwagner 2018-08-03 23:48:14 UTC

(In reply to Andrey Grodzovsky from comment #8)
> dwagner, i think you already have all the trace tools installed from
> previous debug sessions so this should be quick for you
Yes, and I tried really hard (with above attached script run as "root" on a text console while the "othan_test.sh" script played the video on the screen) to catch any useful output, but that failed for the same reason I mentioned in
https://bugs.freedesktop.org/show_bug.cgi?id=102322#c20
- the system simply crashes too hard and too quickly to be able to do anything after amdgpu.ko crashes. The output I get in gpu_result.txt stops at the "waiting for the crash" line.

Only in about 1 out of 10 crashes does the syslog at least contain the error messages from the amdgpu crash. In the other 90% of cases the same crash occurs with no message being recorded at all.

If there is any way to let other processes survive for a while after amdgpu crashes, please let me know.

Comment 11 dwagner 2018-08-05 19:59:27 UTC

I did some additional experiments to understand what is so special about the "Othan" video that playing it causes amdgpu to crash relatively fast.

Since the only "odd" parameter of it is its "6 fps" frame rate, I tried playing other videos, first at their normal rate (like 24 fps), which did not cause quick crashes, then at an artificially lowered rate - and indeed, that causes fast crashing regardless of what video I play.

The frame rate that caused the quickest crashes seemed to be 3 fps. Running
> mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm
usually crashed amd-staging-drm-next in less than 1 minute for me.


Just a random thought: could the reason be some timed hysteresis in the power management of the GPU?
Is there some way to lock the GPU to a specific power level, to then test whether those crashes still occur?
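
One way to check whether low frame rates really cause more power level switching would be to sample the active level and count the changes. The sketch below assumes that pp_dpm_sclk lists one level per line in the form "N: <freq>Mhz" and marks the active one with a trailing "*". The helper names are made up, and nothing like this was run for this report.

```bash
#!/bin/bash
# Sketch only: hypothetical helpers for watching DPM level changes.

# Print the index of the level currently marked with '*' in a pp_dpm_* file.
current_dpm_level() {
    sed -n 's/^\([0-9][0-9]*\):.*\*[[:space:]]*$/\1/p' "$1"
}

# Read one sampled level per line on stdin and print how often it changed.
count_transitions() {
    awk 'NR > 1 && $0 != prev { n++ } { prev = $0 } END { print n + 0 }'
}

# Example: sample the level for a while into a file, then count the changes:
#   while sleep 0.1; do current_dpm_level /sys/class/drm/card0/device/pp_dpm_sclk; done > levels.txt
#   count_transitions < levels.txt
```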

Comment 12 dwagner 2018-08-08 23:13:20 UTC

Indeed, I found my theory confirmed by many experiments: If I use a script like
> #!/bin/bash
> cd /sys/class/drm/card0/device
> echo manual >power_dpm_force_performance_level
> # low
> echo 0 >pp_dpm_mclk 
> echo 0 >pp_dpm_sclk
> # medium
> #echo 1 >pp_dpm_mclk 
> #echo 1 >pp_dpm_sclk
> # high
> #echo 1 >pp_dpm_mclk 
> #echo 6 >pp_dpm_sclk
to enforce just any performance level, then the crashes do not occur anymore - also with the "low frame rate video test".

So it seems that the transition from one "dpm" performance level to another, with a certain probability, causes these crashes. And the more often the transitions occur, the sooner one will experience them.
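
For anyone who wants to repeat the test, here is the same workaround as a small parametrized sketch. The function names force_dpm_level and restore_dpm_auto are made up. It assumes the card is card0 and that writing "auto" to power_dpm_force_performance_level returns control to the driver.

```bash
#!/bin/bash
# Sketch only: hypothetical wrappers around the amdgpu DPM sysfs files.

# Pin the GPU to one sclk and one mclk level (run as root).
# Usage: force_dpm_level <device_dir> <sclk_level> <mclk_level>
force_dpm_level() {
    dev=$1
    sclk=$2
    mclk=$3
    echo manual > "$dev/power_dpm_force_performance_level"
    echo "$mclk" > "$dev/pp_dpm_mclk"
    echo "$sclk" > "$dev/pp_dpm_sclk"
}

# Hand level selection back to the driver (assumption: "auto" does this).
restore_dpm_auto() {
    echo auto > "$1/power_dpm_force_performance_level"
}

# Example, equivalent to the "low" case of the script above:
#   force_dpm_level /sys/class/drm/card0/device 0 0
```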

The dynamic power management issue can now be pursued in the original bug report https://bugs.freedesktop.org/show_bug.cgi?id=102322 for the vm_update_mode=0 case - there is probably not much sense in keeping this bug report open just because errors also occur with vm_update_mode=3, just less often.

Comment 13 Andrey Grodzovsky 2018-08-09 16:25:40 UTC

(In reply to dwagner from comment #12)
> Indeed, I found my theory confirmed by many experiments: If I use a script
> like
> > #!/bin/bash
> > cd /sys/class/drm/card0/device
> > echo manual >power_dpm_force_performance_level
> > # low
> > echo 0 >pp_dpm_mclk 
> > echo 0 >pp_dpm_sclk
> > # medium
> > #echo 1 >pp_dpm_mclk 
> > #echo 1 >pp_dpm_sclk
> > # high
> > #echo 1 >pp_dpm_mclk 
> > #echo 6 >pp_dpm_sclk
> to enforce just any performance level, then the crashes do not occur anymore
> - also with the "low frame rate video test".
> 
> So it seems that the transition from one "dpm" performance level to another,
> with a certain probability, causes these crashes. And the more often the
> transitions occur, the sooner one will experience them.
> 
> The dynamic power management issue can now be pursued with the original bug
> report https://bugs.freedesktop.org/show_bug.cgi?id=102322 for the
> vm_update_mode=0 case - there is probably not much sense in keeping this bug
> report open just because errors also occur with vm_update_mode=3, just less
> often.

Agreed.

Comment 14 dwagner 2018-08-09 20:56:06 UTC


*** This bug has been marked as a duplicate of bug 102322 ***
