95517 – Constant GPU VM faults

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Reported:	2016-05-21 11:08 UTC by lumetili
Modified:	2019-09-25 17:54 UTC

Attachments
dmesg (94.40 KB, text/plain) 2016-05-21 11:08 UTC, lumetili
lspci -nn (21.37 KB, text/plain) 2016-05-21 11:09 UTC, lumetili
kernel oops (157.58 KB, text/plain) 2016-05-27 19:31 UTC, lumetili
GPU softreset (21.36 KB, text/plain) 2016-06-07 20:02 UTC, lumetili

Description lumetili 2016-05-21 11:08:36 UTC

Created attachment 123957
dmesg

With some uptime I eventually start getting these GPU VM faults:

[106098.543115] VM fault (0x0c, vmid 6) at page 58321, read from TC (68)
[106098.543119] radeon 0000:02:00.0: GPU fault detected: 146 0x0a0c480c
[106098.543121] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000E3CB
[106098.543123] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C

It's always the same fault number but the memory addresses(?) change.

See attached dmesg.

My system doesn't crash, but black glitches do appear on the screen and I need to switch desktops or move windows around to make them go away.

KDE desktop with compositing enabled.

I'm running Arch Linux with the latest stable packages:

Linux carrier 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016 x86_64 GNU/Linux

Name            : xf86-video-ati
Version         : 1:7.7.0-1

P.S. I use pci=nommconf because I have massive issues with PCI-E devices otherwise (radeon goes nuts coincidentally).

Comment 1 lumetili 2016-05-21 11:09:19 UTC

Created attachment 123958
lspci -nn

Comment 2 Michel Dänzer 2016-05-24 06:49:32 UTC

What version of Mesa are you using? Did this already happen with older versions of Mesa and/or the kernel?

Comment 3 lumetili 2016-05-24 15:04:43 UTC

Name            : mesa
Version         : 11.2.2-1

I've only had this computer for a few weeks now so I can't really say if older versions worked better. I used the same GPU on my old computer until April and I don't recall encountering this (almost exact same software setup except for older packages of course).

I just rebooted and it took a while for this to occur again but it did:

[77378.984705] radeon 0000:02:00.0: GPU fault detected: 146 0x0e02440c
[77378.984717] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00010870
[77378.984722] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C
[77378.984727] VM fault (0x0c, vmid 1) at page 67696, read from TC (68)
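The four lines above look consistent with each other: the page number is the fault address printed in decimal, and the vmid, client ID and read/write flag can be read out of the status register. The sketch below decodes them under a bit layout inferred only from the log samples in this report (protection bits 0-7, client ID in bits 12-19, write flag in bit 24, vmid in bits 25-28). That layout is an assumption, not something taken from driver documentation, and decode_vm_fault is a hypothetical helper name.

```python
def decode_vm_fault(addr: int, status: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_ADDR/STATUS pair into fields.

    The bit positions are assumptions inferred from the logs in this report.
    """
    return {
        "page": addr,                        # printed in decimal in the "VM fault" line
        "protections": status & 0xFF,        # e.g. 0x0c
        "client_id": (status >> 12) & 0xFF,  # e.g. 68 for "TC (68)"
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


# Values copied from the fault logged at 77378.9847xx above.
fault = decode_vm_fault(0x00010870, 0x0204400C)
print(fault)
```

With these inputs the decoded fields agree with the kernel's own summary line (page 67696, vmid 1, client 68, read), which is the only evidence for the layout used here.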

I'm not sure if it's because of Chromium but it seems to at least be happening when browsing the web - I have "Override software rendering list" enabled in chrome://flags because performance sucks otherwise.

Comment 4 lumetili 2016-05-25 23:11:01 UTC

OK, my 3D-accelerated VirtualBox VM just got stuck and journalctl -kf started spitting out VM faults once a second in a loop. At first there's a:

May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x0042080c
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00014482
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
May 26 01:39:24 carrier kernel: VM fault (0x0c, vmid 1) at page 83074, read from TC (8)

And then a:

May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x0390350c
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000C49C
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1003500C
May 26 01:39:24 carrier kernel: VM fault (0x0c, vmid 8) at page 50332, read from VGT (53)

It devolves into looping the VGT entry, with VM_CONTEXT1_PROTECTION_FAULT_ADDR set to 0x0000C4A<NUMBER> and NUMBER cycling through 0..F.
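To make that pattern concrete, the sketch below lists the sixteen consecutive page numbers 0xC4A0 through 0xC4AF. Converting them to a byte range assumes 4 KiB GPU pages, which is not stated anywhere in this report, so treat the byte figures as illustrative only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB GPU pages; not stated in this report

base = 0x0000C4A0
pages = [base + n for n in range(0x10)]  # NUMBER cycling through 0..F

first_byte = pages[0] * PAGE_SIZE
last_byte = (pages[-1] + 1) * PAGE_SIZE - 1

print([f"0x{p:08X}" for p in pages])
print(f"byte range 0x{first_byte:X}-0x{last_byte:X} "
      f"({len(pages) * PAGE_SIZE // 1024} KiB)")
```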

Comment 5 lumetili 2016-05-27 14:11:14 UTC

I swapped in another card of the same generation (HD 7750 -> R7 250) and these errors keep happening, though they might not even be noticeable before the computer eventually freezes completely. Not even SysRq works; I have to reset with the power button.

This time there was another error included:

May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x00b2480c
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00009EF8
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1200400C
May 27 06:16:22 carrier kernel: VM fault (0x0c, vmid 9) at page 40696, read from TC (4)
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: IH ring buffer overflow (0x000019F0, 0x00001E60, 0x00001A00)

Comment 6 lumetili 2016-05-27 19:31:26 UTC

Created attachment 124132
kernel oops

Without pci=nommconf, the kernel logs PCI bus errors that get corrected, and they may be connected to this issue. I reliably get a visible kernel oops from radeon with this configuration. With nommconf, no errors seem to get logged (except VM faults), but it still eventually crashes so hard that not even SysRq works.

Attached kernel oops.

Comment 7 lumetili 2016-06-07 20:02:42 UTC

Created attachment 124390
GPU softreset

PCIe errors go away with pcie_aspm=off as well apparently.
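When comparing logs from different boots it helps to record which of these two workarounds was active. A minimal sketch, assuming a Linux system where the running kernel's command line can be read from /proc/cmdline; active_workarounds is a hypothetical helper name.

```python
from pathlib import Path

# The two kernel parameters mentioned in this report.
WORKAROUNDS = ("pci=nommconf", "pcie_aspm=off")


def active_workarounds(cmdline: str) -> list:
    """Return which of the workaround parameters appear on a kernel command line."""
    params = cmdline.split()
    return [w for w in WORKAROUNDS if w in params]


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():  # only present on Linux
        print(active_workarounds(proc.read_text()))
```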

Regardless, my computer keeps crashing so badly not even SysRq works. I don't get any kernel error messages about anything except for the GPU faults. Usually I get a whole bunch of them just before completely crashing.

Last night the computer was idle and screens were in powersaving mode and they suddenly woke up and then turned off again - apparently it crashed.

This time there's a better error in the journal from before the crash:

GPU fault -> GPU lockup -> Couldn't update BO_VA -> GPU softreset -> dead

See attachment.

Comment 8 Ismael 2016-08-23 23:14:30 UTC

I get the same error, but only when running "The Talos Principle". The system does not crash, but the performance is pretty bad on the game (15fps at high 1080p).

Ago 24 00:58:22 boreal kernel: VM fault (0x0c, vmid 7) at page 0, read from 'TC4' (0x54433400) (136)
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x000e880c
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E08800C


The error repeats a lot, several times per second.

Comment 9 Marek Olšák 2016-08-24 21:47:38 UTC

(In reply to Ismael from comment #8)
> I get the same error, but only when running "The Talos Principle". The
> system does not crash, but the performance is pretty bad on the game (15fps
> at high 1080p).
> 
> Ago 24 00:58:22 boreal kernel: VM fault (0x0c, vmid 7) at page 0, read from
> 'TC4' (0x54433400) (136)
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: GPU fault detected: 146
> 0x000e880c
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:  
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E08800C
> 
> 
> The error repeats a lot, several times per second.

This is fixed by:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=2c13abb49137d0f81b530b3c67f1ed79c58c796e

I think the original issue is unrelated.

For the original poster: I recommend testing mesa/master.

Comment 10 Timothy Arceri 2018-09-20 01:30:46 UTC

There has been no further feedback from the original reporter, so I'm assuming this is fixed and closing. Please reopen if the issues continue.

Comment 11 Dima 2018-12-20 16:20:45 UTC

I don't know whether this is relevant, since I'm using the amdgpu-pro driver, but I sometimes get these messages:
[14849.076326] gmc_v8_0_process_interrupt: 165 callbacks suppressed
[14849.076331] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14849.076336] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001152FF
[14849.076338] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[14849.076341] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[14851.218731] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14851.218736] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001152FF
[14851.218738] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07700C
[14851.218741] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[14860.154325] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14860.154330] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001152FF
[14860.154331] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0407700C
[14860.154334] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 2, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15073.787603] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15073.787608] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001152FF
[15073.787610] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07700C
[15073.787612] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15095.908340] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15095.908345] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001152FF
[15095.908347] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0607700C
[15095.908350] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15197.968706] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0cf8770c
[15197.968711] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0011D39F
[15197.968713] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15197.968715] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1168287, read from 'SDM0' (0x53444d30) (119)
[15710.487271] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15710.487275] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001094FF
[15710.487277] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15710.487279] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15759.495971] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15759.495978] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001094FF
[15759.495981] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0807700C
[15759.495985] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15768.854519] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15768.854525] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001094FF
[15768.854526] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15768.854529] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15818.316441] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15818.316447] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001094FF
[15818.316448] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0207700C
[15818.316451] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 1, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
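A long run of near-identical faults like this is easier to read once it is tallied by page and client. The sketch below does that with a regular expression written against the "VM fault" summary lines quoted in this report (both the radeon and amdgpu variants). The pattern and the tally_faults helper are assumptions based only on those samples and may not match other kernel versions.

```python
import re
from collections import Counter

# Matches the "VM fault" summary line format seen in the logs in this report.
VM_FAULT_RE = re.compile(
    r"VM fault \(0x(?P<prot>[0-9a-fA-F]+), vmid (?P<vmid>\d+)(?:, pasid \d+)?\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"(?:'(?P<name>[^']+)' \(0x[0-9a-fA-F]+\)|(?P<bare>\w+)) \((?P<client>\d+)\)"
)


def tally_faults(lines):
    """Count VM faults per (page, client name) pair."""
    counts = Counter()
    for line in lines:
        m = VM_FAULT_RE.search(line)
        if m:
            client = m.group("name") or m.group("bare")
            counts[(int(m.group("page")), client)] += 1
    return counts


# Sample lines copied from the comments in this report.
sample = [
    "[14849.076341] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)",
    "[14851.218741] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)",
    "[15197.968715] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1168287, read from 'SDM0' (0x53444d30) (119)",
    "[77378.984727] VM fault (0x0c, vmid 1) at page 67696, read from TC (68)",
]

for (page, client), n in tally_faults(sample).most_common():
    print(f"page {page} client {client}: {n}")
```

Feeding it the output of dmesg or journalctl -k line by line should show whether the faults cluster on a handful of pages, as they appear to here.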

Comment 12 GitLab Migration User 2019-09-25 17:54:28 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1231.
