Bug 103736 - Sudden system freezes, GPU fault detected
Summary: Sudden system freezes, GPU fault detected
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2017-11-14 15:12 UTC by Shiverly
Modified: 2019-11-19 08:26 UTC


Attachments
dmesg errors (5.45 KB, text/plain)
2017-11-14 15:12 UTC, Shiverly
Crash while playing Counter-Strike: Global Offensive (41.66 KB, text/plain)
2018-01-28 11:04 UTC, Lennart Sauerbeck
Errors while playing CS:GO, crash and reboot after opening VLC (304.19 KB, text/plain)
2018-01-28 11:13 UTC, Lennart Sauerbeck

Description Shiverly 2017-11-14 15:12:32 UTC
Created attachment 135450
dmesg errors

I installed Ubuntu MATE 17.10 and the M-Bab drivers (https://github.com/M-Bab/linux-kernel-amdgpu-binaries); without them, one monitor always stays black even though it is powered on.

Almost every day the system freezes suddenly after a random amount of time, anywhere from 5 minutes to 3+ hours. Only the power button helps. No logs are saved, but dmesg has errors.

I think this is either an AMDGPU bug or something Ryzen-related (most likely not the latter, because the Ryzen problems manifested as sudden reboots, never as system freezes, and the last BIOS update stopped them 2 months ago).

Graphics:  Card: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380]
           Display Server: x11 (X.Org 1.19.5 )
           drivers: ati,amdgpu (unloaded: modesetting,fbdev,vesa,radeon)
           Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
           OpenGL: renderer: AMD Radeon R9 200 Series (TONGA / DRM 3.23.0 / 4.13.11+, LLVM 5.0.1)
           version: 4.5 Mesa 17.4.0-devel
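
For anyone comparing setups in this bug, the driver stack versions can be pulled out of the OpenGL renderer string quoted above. This is a minimal Python sketch, assuming the string always has the shape seen in this report; the regular expression and the name parse_renderer are illustrative and not part of any Mesa tooling.

```python
import re

# Renderer string copied verbatim from the report above.
RENDERER = "AMD Radeon R9 200 Series (TONGA / DRM 3.23.0 / 4.13.11+, LLVM 5.0.1)"

# Hypothetical pattern, written only against the format seen in this report:
# "(<chip> / DRM <drm version> / <kernel version>, LLVM <llvm version>)"
_PATTERN = re.compile(
    r"\((?P<chip>\w+) / DRM (?P<drm>[\d.]+) / (?P<kernel>[^,]+), "
    r"LLVM (?P<llvm>[\d.]+)\)"
)


def parse_renderer(renderer):
    """Return chip, DRM, kernel and LLVM versions, or None if the format differs."""
    match = _PATTERN.search(renderer)
    return match.groupdict() if match else None


info = parse_renderer(RENDERER)
print(info)
```

Posting the extracted chip, DRM, kernel and LLVM values alongside the Mesa version makes it easier to spot combinations such as the Mesa/LLVM snapshot mismatch discussed later in this bug.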
Comment 1 Michel Dänzer 2017-11-14 15:53:16 UTC
(In reply to Shiverly from comment #0)
> [...] dmesg has errors. 

I only see messages about failing to allocate a larger BAR, which is harmless.


> I think this is either AMDGPU bug or something ryzen related (most likely
> not, because they manifest as sudden reboots, never as system freezes. And
> last bios update stopped them 2 months ago).

FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC, and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed them for him.
Comment 2 Andres Rodriguez 2017-11-14 18:37:45 UTC
> FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC,
> and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed
> them for him.

I raised the memory and the core voltages specifically. The other voltages like SoC were left untouched.
Comment 3 Shiverly 2017-11-15 17:55:25 UTC
(In reply to Andres Rodriguez from comment #2)
> > FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC,
> > and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed
> > them for him.
> 
> I raised the memory and the core voltages specifically. The other voltages
> like SoC were left untouched.

I didn't have these symptoms in Arch or Ubuntu 16.04 LTS, only when using this driver/kernel combination (which is the only one that keeps both monitors usable). Long compilation jobs don't cause system freezes either.
Comment 4 Shiverly 2017-11-15 18:29:26 UTC
I got some logs. Maybe they are related (found them in journalctl)

Nov 15 20:09:06 tibu-pc kernel: gmc_v8_0_process_interrupt: 626 callbacks suppressed
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC5' (0x54433500) (192)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C0001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4BA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d0c001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5545728, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00549F00
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d00001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A008001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4BA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d00801
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5541638, read from 'TC9' (0x54433900) (136)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00548F06
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A008001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500801
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5541634, read from 'TC8' (0x54433800) (64)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A040002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00548F02
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504401
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC2' (0x54433200) (0)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A000001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068171, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4CB
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06584401
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504401
Nov 15 20:08:20 tibu-pc kernel: gmc_v8_0_process_interrupt: 1830 callbacks suppressed
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC5' (0x54433500) (192)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C0001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EB0FF
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x07f8c001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x092EB0FF
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x07f80801
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC11' (0x54433131) (128)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02080001
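
Because the journal repeats the same fault many times, it can help to group the "VM fault" lines by VMID and page before comparing logs. The following is a rough Python sketch, not an official tool: the regular expression is written only against the line format visible above, and the "write" alternative is an assumption, since only "read" appears in this log. The three sample lines are copied from the output above.

```python
import re
from collections import Counter

# Pattern for the "VM fault" lines as they appear in the journal output above.
# Only "read" occurs in this log; "write" is assumed to use the same layout.
VM_FAULT = re.compile(
    r"VM fault \(0x(?P<prot>[0-9a-fA-F]+), vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?P<rw>read|write) from "
    r"'(?P<client>[^']*)' \(0x[0-9a-fA-F]+\) \((?P<client_id>\d+)\)"
)


def parse_vm_faults(lines):
    """Extract vmid, page, access type and client from each VM fault line."""
    faults = []
    for line in lines:
        m = VM_FAULT.search(line)
        if m:
            faults.append({
                "vmid": int(m["vmid"]),
                "page": int(m["page"]),
                "rw": m["rw"],
                "client": m["client"],
                "client_id": int(m["client_id"]),
            })
    return faults


SAMPLE = [
    "Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC5' (0x54433500) (192)",
    "Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5545728, read from 'TC7' (0x54433700) (68)",
    "Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC0' (0x54433000) (8)",
]

faults = parse_vm_faults(SAMPLE)
# Count faults per (vmid, page); the page is shown in hex for comparison
# with the VM_CONTEXT1_PROTECTION_FAULT_ADDR lines.
by_page = Counter((f["vmid"], hex(f["page"])) for f in faults)
for (vmid, page), count in by_page.most_common():
    print(f"vmid {vmid} page {page}: {count} fault(s)")
```

In this log the decimal page number equals the VM_CONTEXT1_PROTECTION_FAULT_ADDR value printed next to it (154068154 is 0x092EE4BA), so grouping by page also groups by fault address.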
Comment 5 Shiverly 2017-11-18 09:10:44 UTC
One way to trigger the crash quickly is to play the Overpass map in CS:GO at the terrorist spawn. Textures near the stairs appear corrupted, and the system always hangs within the first 5 minutes of gameplay. I think it's 3D-related, because using a simple text editor or sitting in a Ctrl-Alt-Fx terminal never hangs. A browser can also cause a hang, but it takes longer to manifest than when playing a 3D game.
Comment 6 Lennart Sauerbeck 2018-01-28 11:04:36 UTC
Created attachment 137005
Crash while playing Counter-Strike: Global Offensive

I think I'm running into the same issues. Attached is the kernel output while playing Counter-Strike: Global Offensive. It worked during the warmup, but froze in the first round, so I'd say about 3-5 minutes after starting the game.

I'm running an up-to-date Debian unstable with Linux 4.14.13 and Mesa 17.3.3.
Comment 7 Lennart Sauerbeck 2018-01-28 11:13:23 UTC
Created attachment 137006
Errors while playing CS:GO, crash and reboot after opening VLC

Another crash pretty much right after the one from my previous comment. After rebooting the system to continue playing Counter-Strike: Global Offensive the errors kept coming, though the system did not freeze (note the timestamps in the error log).

After shutting down the game, I started VLC to watch a stream and the system froze immediately. After a short while (<5 minutes) I used the Magic SysRq keys to reboot the system safely, which can also be seen in the log.
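
For others who want to recover from a freeze the same way: whether the Magic SysRq keys work depends on the kernel.sysrq setting. Below is a small Python sketch that reads /proc/sys/kernel/sysrq and lists which functions are allowed. The bit meanings follow the kernel's sysrq documentation as I understand it; the function names are illustrative.

```python
# Bitmask meanings for kernel.sysrq values greater than 1, as described in the
# kernel's sysrq documentation (0 disables everything, 1 enables everything).
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Return the list of SysRq functions allowed by a kernel.sysrq value."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current kernel.sysrq value from procfs."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        print(describe_sysrq(read_sysrq()))
    except OSError as exc:
        # Not running on Linux, or procfs is not mounted.
        print(f"could not read sysrq setting: {exc}")
```

A safe reboot needs at least the sync, remount read-only and reboot/poweroff functions; for example, a value of 176 (16 + 32 + 128) allows exactly those three.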

A possibly important detail: My system doesn't freeze entirely, only the graphics output does. Sound still works for a time, even voice chatting continues to work. However, all X output freezes (e.g. conky on desktop).

I haven't tried going to a virtual console, so do not know whether that still works.

I also had the same issue while playing Euro Truck Simulator 2, but it never happened while playing Dota 2. Given this, it seems like some illegal instruction is passed to the graphics driver. Would an ApiTrace help? If so, I can try to record one.
Comment 8 Lennart Sauerbeck 2018-02-11 10:24:40 UTC
I was able to record an ApiTrace which shows the problem consistently. However, it's 2.5 gigabytes and contains personal information I'd rather not share on a public bug tracker. I think a trace can only be truncated at the end; removing calls from the beginning would mess up the OpenGL context, wouldn't it?

I cannot switch to the virtual console when the freeze is triggered.

I also built radeonsi from current Mesa git (9b9a89cd795fda462a6ee898ef6e5135ca79d94e) but the problem persisted.
Comment 9 Ernst Sjöstrand 2018-02-12 19:22:18 UTC
I get

[  133.978908] amdgpu 0000:09:00.0: GPU fault detected: 147 0x00198802
[  133.978911] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500003
[  133.978912] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02188002
[  133.978914] amdgpu 0000:09:00.0: VM fault (0x02, vmid 1) at page 5242883, read from 'TC4' (0x54433400) (392)

or from another boot

[  204.841497] amdgpu 0000:09:00.0: GPU fault detected: 147 0x00188402
[  204.841501] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500003
[  204.841502] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A084002
[  204.841504] amdgpu 0000:09:00.0: VM fault (0x02, vmid 5) at page 5242883, read from '' (0x00000000) (132)


I get these when I try to launch Steam. It never gets to draw any UI; the computer just freezes.
This happens with both the 4.13 (-ubuntu33) and 4.15.2 kernels with Mesa/LLVM from git (Padoka PPA).
When I reverted to Mesa 17.2.8 + LLVM 5.0.0, I could launch Steam again.
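
The VMID and client number in the "VM fault" line can be recomputed from the VM_CONTEXT1_PROTECTION_FAULT_STATUS value, which is handy when only part of a log survives. This is a Python sketch under stated assumptions: the bit positions (protections in bits 0-7, client ID in bits 12-20, read/write in bit 24, VMID in bits 25-28) are my reading of the GMC v8 register layout and have only been cross-checked against the values printed in this bug, and the function names are illustrative.

```python
def decode_fault_status(status):
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into its fields.

    Bit positions are assumptions, cross-checked only against this bug's logs.
    """
    return {
        "protections": status & 0xFF,                      # bits 0-7
        "client_id": (status >> 12) & 0x1FF,               # bits 12-20
        "rw": "write" if (status >> 24) & 1 else "read",   # bit 24
        "vmid": (status >> 25) & 0xF,                      # bits 25-28
    }


def decode_client_tag(raw):
    """Turn the 32-bit client value into its ASCII tag, e.g. 0x54433400 -> 'TC4'."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")


# Values from the two boots quoted in this comment.
for status, raw_client in [(0x02188002, 0x54433400), (0x0A084002, 0x00000000)]:
    fields = decode_fault_status(status)
    print(hex(status), fields, repr(decode_client_tag(raw_client)))
```

For the two status values in this comment the sketch yields vmid 1 with client 392 and vmid 5 with client 132, matching the logged lines. The all-zero client value of the second boot decodes to an empty tag, which is why that line shows ''.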
Comment 10 Ernst Sjöstrand 2018-02-13 21:30:28 UTC
The Vehicle Game demo seems to trigger this quite reliably for me:
https://wiki.unrealengine.com/Linux_Demos
Comment 11 Ernst Sjöstrand 2018-02-13 22:18:00 UTC
OK, the VM faults I see are caused by using the Padoka PPA, which currently has
https://cgit.freedesktop.org/mesa/mesa/commit/?id=847d0a393d7f0f967f39302900d5330f32b804c8
but not
https://reviews.llvm.org/D41663

That means it can't be the same as the original issue, and also that the solution for me is just to update to more recent versions. Sorry for the noise in this bug.
Comment 12 aceman 2018-03-31 22:41:56 UTC
Ernst, I have also traced the error you have to usage of OpenCL in the Mesa clover driver on RX560 with LLVM upgraded from 5.0.1 to 6.0. What do you say is the solution? Is Mesa using intrinsics that are only in LLVM git? Or is that LLVM changeset you posted already in the release LLVM 6.0?
Comment 13 Ernst Sjöstrand 2018-04-01 08:58:07 UTC
aceman: the problem was mismatched development snapshots; it couldn't happen if you have any real releases in the mix.
Comment 14 aceman 2018-04-01 17:28:52 UTC
I'm using Mesa git, but LLVM 6.0 release. Is that fine wrt. this mismatch?
Comment 15 Martin Peres 2019-11-19 08:26:09 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/258.

