88301 – Dota causes GPU fault and kernel hang

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2015-01-11 20:17 UTC by Tilman Sauerbeck
Modified:	2015-01-30 18:26 UTC
CC List:	0 users

Description Tilman Sauerbeck 2015-01-11 20:17:52 UTC

I've got an apitrace of Dota that reliably causes GPU faults on radeonsi on my Bonaire XTX.

The trace: http://files.code-monkey.de/dota.trace (468797931 bytes).

I'm running Mesa from master (8d2542fc9d5af4db355b67cc2a1ff2f413685a27) and can reproduce the problem with kernel 3.18.2 and 3.19.0-rc2 (built from airlied's drm-fixes tree; 79305ec6e60d320832505e95c1a028d309fcd2b6).
agd5f's "fix VM flush" patches from 2015-01-06 don't help either.

Example kernel log excerpt:
radeon 0000:01:00.0: GPU fault detected: 147 0x000c4801
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x01000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048001
VM fault (0x01, vmid 6) at page 16777216, read from 'TC2' (0x54433200) (72)
radeon 0000:01:00.0: GPU fault detected: 146 0x000c080c
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C
VM fault (0x0c, vmid 6) at page 0, read from 'TC0' (0x54433000) (8)
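
For readers who want to cross-check the decoded "VM fault" lines against the raw register values, here is a minimal Python sketch. The bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS used here (protections in bits 0-3, memory client id in bits 12-19, read/write flag in bit 24, vmid in bits 25-28) is an assumption chosen because it reproduces the values printed in the excerpt above; the names `VMFault`, `decode_fault` and `client_name` are hypothetical helpers, not driver functions.

```python
from dataclasses import dataclass

# Assumed field layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS.
# It reproduces the decoded values shown in the log excerpt.
PROTECTIONS_MASK = 0xF           # bits 0-3
MEMORY_CLIENT_ID_SHIFT = 12      # bits 12-19
MEMORY_CLIENT_ID_MASK = 0xFF
MEMORY_CLIENT_RW_BIT = 1 << 24   # set = write, clear = read
FAULT_VMID_SHIFT = 25            # bits 25-28
FAULT_VMID_MASK = 0xF


@dataclass
class VMFault:
    protections: int
    vmid: int
    page: int
    access: str
    client_id: int


def decode_fault(status: int, addr: int) -> VMFault:
    """Split the fault status register into its fields; addr is the faulting page."""
    return VMFault(
        protections=status & PROTECTIONS_MASK,
        vmid=(status >> FAULT_VMID_SHIFT) & FAULT_VMID_MASK,
        page=addr,
        access="write" if status & MEMORY_CLIENT_RW_BIT else "read",
        client_id=(status >> MEMORY_CLIENT_ID_SHIFT) & MEMORY_CLIENT_ID_MASK,
    )


def client_name(mc_client: int) -> str:
    """Unpack the four ASCII bytes of the memory client tag, dropping NUL padding."""
    return mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    # (status, fault address, memory client tag) taken from the two faults above.
    for status, addr, tag in [
        (0x0C048001, 0x01000000, 0x54433200),
        (0x0C00800C, 0x00000000, 0x54433000),
    ]:
        f = decode_fault(status, addr)
        print(
            f"VM fault (0x{f.protections:02x}, vmid {f.vmid}) at page {f.page}, "
            f"{f.access} from '{client_name(tag)}' (0x{tag:08x}) ({f.client_id})"
        )
```

Under that assumed layout, both faults come from vmid 6 and are reads issued by 'TC' clients, which is what the kernel printed.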

Comment 1 Michel Dänzer 2015-01-15 07:43:27 UTC

Replaying this trace on my Kaveri, I get similar GPUVM faults, but no hangs.

What are the symptoms of the hangs?

Comment 2 Tilman Sauerbeck 2015-01-15 20:06:38 UTC

Sorry about that misinformation.
It's not a hang at all since I'm still able to use sysrq to reboot.

What's happening after the GPU faults is that the driver apparently attempts to get the card back into working shape, but fails to do so. In any case, X doesn't become usable again after the GPU faults.

Here's the kernel log following the GPU faults:

radeon 0000:01:00.0: ring 0 stalled for more than 10428msec
radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000107f5c last fence id 0x00000000001080f7 on ring 0)
radeon 0000:01:00.0: failed to get a new IB (-35)
[drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
radeon 0000:01:00.0: Saved 7977 dwords of commands on ring 0.
radeon 0000:01:00.0: GPU softreset: 0x00000009
[snipped list of registers that were reset (I think)]

[drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[drm] PCIE gen 2 link speeds already enabled
[drm] PCIE GART of 1024M enabled (table at 0x000000000078C000).
radeon 0000:01:00.0: WB enabled
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8800bac4cc00
radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8800bac4cc04
radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8800bac4cc08
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8800bac4cc0c
radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8800bac4cc10
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xffffc90010c36c98
radeon 0000:01:00.0: fence driver on ring 6 use gpu addr 0x0000000080000c18 and cpu addr 0xffff8800bac4cc18
radeon 0000:01:00.0: fence driver on ring 7 use gpu addr 0x0000000080000c1c and cpu addr 0xffff8800bac4cc1c
[drm] ring test on 0 succeeded in 3 usecs
[drm:cik_ring_test] *ERROR* radeon: ring 1 test failed (scratch(0x3010C)=0xCAFEDEAD)
[drm:cik_ring_test] *ERROR* radeon: ring 2 test failed (scratch(0x3010C)=0xCAFEDEAD)
[drm:cik_sdma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
[drm:cik_resume] *ERROR* cik startup failed on resume
[drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
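
To make the outcome of the reset attempt easier to see at a glance, here is a small Python sketch that summarizes a log like the one above: which ring tests passed, which failed, and how far apart the two fence ids in the lockup message are. The regular expressions are written against the message formats shown in this excerpt only, and `summarize` is a hypothetical helper. Reading the fence gap as "work submitted but not completed" is an assumption about what the two ids mean.

```python
import re

# Patterns match the message formats seen in the excerpt above.
RING_OK = re.compile(r"ring test on (\d+) succeeded")
RING_FAIL = re.compile(r"ring (\d+) test failed")
FENCE = re.compile(
    r"current fence id (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)"
)


def summarize(log: str) -> dict:
    """Collect ring test results and the fence id gap from a kernel log string."""
    ok = [int(n) for n in RING_OK.findall(log)]
    failed = [int(n) for n in RING_FAIL.findall(log)]
    match = FENCE.search(log)
    # Assumed meaning: last id minus current id = fences still outstanding.
    gap = int(match.group(2), 16) - int(match.group(1), 16) if match else None
    return {"ok": ok, "failed": failed, "fence_gap": gap}


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 summarize_reset.py
    print(summarize(sys.stdin.read()))
```

Fed the lines above, this reports ring 0 as passing, rings 1, 2 and 3 as failing, and a fence gap of 411, consistent with the reset recovering the GFX ring but not the compute and SDMA rings.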

Comment 3 smoki 2015-01-15 21:45:46 UTC

I can reproduce this on Kabini with current mesa and llvm, but not with Tom's perf-Jan-08-2015 llvm + vgpr-spilling-Jan07-2014 mesa branches; it works fine there.

Comment 4 Tilman Sauerbeck 2015-01-16 06:54:22 UTC

(In reply to smoki from comment #3)
>  I can reproduce this on Kabini with current proper mesa and llvm, but not
> with Tom's perf-Jan-08-2015 llvm + vgpr-spilling-Jan07-2014 mesa branches it
> works fine there.

Indeed, switching llvm to perf-Jan-08-2015 fixes the GPU faults.

For the record, with my Mesa installation from git master I do get
> Warning: Compiler emitted unknown config register: 0x286e8
in glretrace, but that doesn't seem to cause any visible breakage.

Should I leave the bug open until the fix hits LLVM trunk?

Comment 5 smoki 2015-01-30 16:24:03 UTC

 This is fixed now with current mesa and llvm.

 Tilman, you may want to confirm and eventually to close this bug.

Comment 6 Tilman Sauerbeck 2015-01-30 18:26:05 UTC

(In reply to smoki from comment #5)
>  This is fixed now with current mesa and llvm.
> 
>  Tilman, you may want to confirm and eventually to close this bug.

Confirmed. Thanks for the heads-up.
