Bug 112226

Summary: [HadesCanyon/regression] GPU hang causes also X server to die
Product: DRI Reporter: Eero Tamminen <eero.t.tamminen>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: critical    
Priority: not set    
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=108898
Whiteboard:
i915 platform: i915 features:
Attachments:
Description  Flags
dmesg        none
Xorg log     none

Description Eero Tamminen 2019-11-07 13:53:19 UTC
Setup:
* HW: KBL HadesCanyon (i7-8809G with Radeon RX Vega M GH)
* OS: Ubuntu 18.04 with Unity desktop (compiz)
* SW: Git builds of drm-tip kernel, Mesa and X server

Issue:
* The AMD GPU driver has stopped recovering from the KBL HadesCanyon GPU hangs described in bug 108898.

It still claims to recover from the bug:
-------------------------------------------------------
[ 1057.512690] Iteration 2/3: bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan
[ 1119.867403] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 1124.987449] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
-------------------------------------------------------
But now all 3D tests run after this error fail.
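
A minimal sketch of how a test harness could detect the situation above after each iteration, by scanning the kernel log for amdgpu error lines. The regular expression is derived only from the two message formats shown in this report, the function name `find_amdgpu_errors` is hypothetical, and other kernel versions may word these messages differently.

```python
import re

# Pattern is an assumption based on the log excerpt in this report, e.g.:
# [ 1124.987449] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
AMDGPU_ERROR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)$"
)


def find_amdgpu_errors(dmesg_text):
    """Return (timestamp, function, message) for each amdgpu *ERROR* line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = AMDGPU_ERROR.match(line.strip())
        if match:
            hits.append((float(match["ts"]), match["func"], match["msg"]))
    return hits


# Sample input: the excerpt from this report, with the first line abbreviated.
SAMPLE = """\
[ 1057.512690] Iteration 2/3: bin/testfw_app --gfx glfw --test_id gl_manhattan
[ 1119.867403] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 1124.987449] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
"""

for ts, func, msg in find_amdgpu_errors(SAMPLE):
    print(f"{ts:.6f} {func}: {msg}")
```

A harness could treat any hit as a reason to verify that the X server is still alive before starting the next test, since a "soft recovered" message alone evidently does not guarantee that.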

This started to happen between the following (drm-tip) kernel commits:
* 2019-10-28 16:01:46: 912b87256c: drm-tip: 2019y-10m-28d-16h-00m-10s UTC integration manifest
* 2019-10-29 17:58:05: a2c9f8ce2a: drm-tip: 2019y-10m-29d-17h-57m-39s UTC integration manifest

And the following Mesa commits:
* 2019-10-28 17:47:06: d298740a1c: iris: Disallow incomplete resource creation
* 2019-10-29 16:19:34: ff6e148a3d: freedreno/a6xx: add a618 support
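
As a rough illustration of how narrow these ranges are, the sketch below computes the time span between each pair of commits. It assumes the first commit of each pair is the last known-good one and the second is the first known-bad one, which the report implies but does not state outright; the helper name `window` is hypothetical.

```python
from datetime import datetime

# Commit timestamps copied from the lists above.
# Assumption: (last good, first bad) ordering within each pair.
RANGES = {
    "drm-tip kernel": ("2019-10-28 16:01:46", "2019-10-29 17:58:05"),
    "Mesa": ("2019-10-28 17:47:06", "2019-10-29 16:19:34"),
}


def window(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the time span between two commit timestamps as a timedelta."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


for name, (good, bad) in RANGES.items():
    print(f"{name}: regression window {window(good, bad)}")
```

Both windows are roughly one day long, so a bisect of either tree between these commits should need only a handful of builds.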


Note:
* I'm not seeing the same issue when using a few months old Mesa with the latest drm-tip kernel, so some change in Mesa triggers this kernel issue
* If the latest Mesa is used with drm-tip kernel 5.3, X fails to start 4/5 times.  This started to happen with a Mesa version from within a couple of days of the GPU hang recovery issue, so potentially there are more issues in Mesa (HadesCanyon) AMD support
Comment 1 Alex Deucher 2019-11-07 14:04:38 UTC
Please attach your dmesg output and xorg log if using X.  Please note that after a GPU reset, in most cases you need to restart your desktop environment because no desktop environments properly handle the loss of their contexts at the moment.
Comment 2 Eero Tamminen 2019-11-07 14:25:02 UTC
Created attachment 145908 [details]
dmesg
Comment 3 Eero Tamminen 2019-11-07 14:35:53 UTC
(In reply to Alex Deucher from comment #1)
> Please attach your dmesg output and xorg log if using X.  Please note that
> after a GPU reset, in most cases you need to restart your desktop
> environment because no desktop environments properly handle the loss of
> their contexts at the moment.

The failed tests complain about an invalid MIT-MAGIC-COOKIE-1 key, so it seems that the later failures happen because X went down (and was brought back up by the display manager).

AFAIK a reset should affect only the context that was running on the GPU when it was reset, not the others [1], and in this case the problematic client should be GfxBench (Manhattan test-case, see bug 108898), not the X server.

Btw, why doesn't the AMD kernel module tell which process / context had the issue, like i915 does?

[1] At least that's the case with i915, as long as the whole system doesn't hang. 


(In reply to Eero Tamminen from comment #0)
> * If the latest Mesa is used with drm-tip kernel 5.3, X fails to start 4/5
> times.  This started to happen with a Mesa version from within a couple of
> days of the GPU hang recovery issue, so potentially there are more issues
> in Mesa (HadesCanyon) AMD support

Correction.  That issue happens only when using the latest Mesa with a few months old X server and (5.3) drm-tip kernel. If the latest git versions of all three are used, X starts fine.  But since the indicated date, it dies later, when the Manhattan test-case causes problems.
Comment 4 Eero Tamminen 2019-11-07 14:46:55 UTC
Created attachment 145909 [details]
Xorg log

X dies in ConfigureWindow() -> miResizeWindow() -> miCopyRegion() -> glamor_create_pixmap() -> radeonsi_dri.so -> abort().

The LightDM log shows the abort to be:
X: src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1061: amdgpu_cs_check_space: Assertion `rcs->current.cdw <= rcs->current.max_dw' failed.

This is the same abort that causes the X server to fail at boot with git Mesa and a slightly older X server & drm-tip kernel.

Is the above abort due to something on the kernel side, or is it a Mesa issue?
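
For readers unfamiliar with the assertion, here is a toy Python model of the invariant it checks: the number of dwords already written into a command buffer (`cdw`) must never exceed its capacity (`max_dw`). This is only an illustration under assumed semantics. The class `ToyCommandBuffer` and its methods are hypothetical and do not mirror the actual code in Mesa's `amdgpu_cs.c`.

```python
class ToyCommandBuffer:
    """Hypothetical stand-in for a command buffer with a fixed dword budget."""

    def __init__(self, max_dw):
        self.max_dw = max_dw  # capacity in dwords
        self.cdw = 0          # dwords written so far

    def check_space(self, dw):
        """Return True if `dw` more dwords fit; the invariant must already hold."""
        # Same shape as the failed assertion: cdw <= max_dw.
        assert self.cdw <= self.max_dw, "command buffer overflowed earlier"
        return self.cdw + dw <= self.max_dw

    def emit(self, dw, checked=True):
        """Append `dw` dwords; with checked=False the space check is skipped."""
        if checked and not self.check_space(dw):
            raise BufferError("not enough space, flush first")
        self.cdw += dw


cs = ToyCommandBuffer(max_dw=8)
cs.emit(6)
cs.emit(4, checked=False)  # a writer that did not reserve space first
try:
    cs.check_space(1)
except AssertionError as err:
    print("assertion failed:", err)
```

The model shows one way such an invariant can break (a write that bypassed the space check, detected only at the next check). It says nothing about whether the real failure originates in Mesa or in the kernel.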
Comment 5 Alex Deucher 2019-11-07 17:22:42 UTC
(In reply to Eero Tamminen from comment #3)
> 
> AFAIK a reset should affect only the context that was running on the GPU
> when it was reset, not the others [1], and in this case the problematic
> client should be GfxBench (Manhattan test-case, see bug 108898), not the X
> server.
> 
> Btw, why doesn't the AMD kernel module tell which process / context had the
> issue, like i915 does?

It does, but in the case of a whole GPU reset, VRAM is lost, so the buffers of all processes that use the GPU are lost.  Depending on the nature of the hang, a whole GPU reset may be required rather than just killing the shader wave.
Comment 6 Martin Peres 2019-11-19 10:01:09 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/951.
