I've got an apitrace of Dota that reliably causes GPU faults on radeonsi on my Bonaire XTX. The trace: http://files.code-monkey.de/dota.trace (468797931 bytes).

I'm running Mesa from master (8d2542fc9d5af4db355b67cc2a1ff2f413685a27) and can reproduce the problem with kernel 3.18.2 and 3.19.0-rc2 (built from airlied's drm-fixes tree; 79305ec6e60d320832505e95c1a028d309fcd2b6). agd5f's "fix VM flush" patches from 2015-01-06 don't help either.

Example kernel log excerpt:

radeon 0000:01:00.0: GPU fault detected: 147 0x000c4801
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x01000000
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048001
VM fault (0x01, vmid 6) at page 16777216, read from 'TC2' (0x54433200) (72)
radeon 0000:01:00.0: GPU fault detected: 146 0x000c080c
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C
VM fault (0x0c, vmid 6) at page 0, read from 'TC0' (0x54433000) (8)
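For anyone reading the fault lines above, here is a minimal sketch of how the decoded "VM fault" summary relates to the raw register values. The bit positions used for the status fields are an assumption on my part (not taken from this report); they were chosen because they reproduce the vmid, protection and client id values printed in the excerpt. The function name `decode_vm_fault` is hypothetical.

```python
def decode_vm_fault(status: int, addr: int, mc_client: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields.

    ASSUMPTION: the field layout below is inferred, not authoritative.
    It matches the two faults quoted in the kernel log excerpt.
    """
    # The memory client tag is four ASCII bytes, NUL-padded on the right.
    tag = mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")
    return {
        "protections": status & 0xF,          # low 4 bits (assumed)
        "mc_id": (status >> 12) & 0xFF,       # memory client id (assumed)
        "access": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,         # VM context id (assumed)
        "page": addr,                         # FAULT_ADDR, printed as the page
        "client": tag,
    }


# The two faults from the log excerpt.
first = decode_vm_fault(0x0C048001, 0x01000000, 0x54433200)
second = decode_vm_fault(0x0C00800C, 0x00000000, 0x54433000)
print(first)
print(second)
```

Both faults decode to reads from texture cache clients ('TC2' and 'TC0') in vmid 6, which is what the kernel's own summary lines report.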
Replaying this trace on my Kaveri, I get similar GPUVM faults, but no hangs. What are the symptoms of the hangs?
Sorry about that misinformation. It's not a hang at all since I'm still able to use sysrq to reboot. What's happening after the GPU faults is that apparently the driver attempts to get the card back into working shape, but fails to do so. X doesn't become usable again after the GPU faults anyway.

Here's the kernel log following the GPU faults:

radeon 0000:01:00.0: ring 0 stalled for more than 10428msec
radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000107f5c last fence id 0x00000000001080f7 on ring 0)
radeon 0000:01:00.0: failed to get a new IB (-35)
[drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
radeon 0000:01:00.0: Saved 7977 dwords of commands on ring 0.
radeon 0000:01:00.0: GPU softreset: 0x00000009
[snipped list of registers that were reset (I think)]
[drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[drm] PCIE gen 2 link speeds already enabled
[drm] PCIE GART of 1024M enabled (table at 0x000000000078C000).
radeon 0000:01:00.0: WB enabled
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8800bac4cc00
radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8800bac4cc04
radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8800bac4cc08
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8800bac4cc0c
radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8800bac4cc10
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xffffc90010c36c98
radeon 0000:01:00.0: fence driver on ring 6 use gpu addr 0x0000000080000c18 and cpu addr 0xffff8800bac4cc18
radeon 0000:01:00.0: fence driver on ring 7 use gpu addr 0x0000000080000c1c and cpu addr 0xffff8800bac4cc1c
[drm] ring test on 0 succeeded in 3 usecs
[drm:cik_ring_test] *ERROR* radeon: ring 1 test failed (scratch(0x3010C)=0xCAFEDEAD)
[drm:cik_ring_test] *ERROR* radeon: ring 2 test failed (scratch(0x3010C)=0xCAFEDEAD)
[drm:cik_sdma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
[drm:cik_resume] *ERROR* cik startup failed on resume
[drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
I can reproduce this on Kabini with current proper mesa and llvm, but not with Tom's perf-Jan-08-2015 llvm + vgpr-spilling-Jan07-2014 mesa branches; it works fine there.
(In reply to smoki from comment #3)
> I can reproduce this on Kabini with current proper mesa and llvm, but not
> with Tom's perf-Jan-08-2015 llvm + vgpr-spilling-Jan07-2014 mesa branches it
> works fine there.

Indeed, switching llvm to perf-Jan-08-2015 fixes the GPU faults.

For the record, with my Mesa installation from git master I do get
> Warning: Compiler emitted unknown config register: 0x286e8
in glretrace, but that doesn't seem to cause any visible breakage.

Should I leave the bug open until the fix hits LLVM trunk?
This is fixed now with current mesa and llvm. Tilman, you may want to confirm and eventually to close this bug.
(In reply to smoki from comment #5)
> This is fixed now with current mesa and llvm.
>
> Tilman, you may want to confirm and eventually to close this bug.

Confirmed. Thanks for the heads-up.