Summary: | Constant GPU VM faults | ||
---|---|---|---|
Product: | Mesa | Reporter: | lumetili |
Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | Default DRI bug account <dri-devel> |
Severity: | normal | ||
Priority: | medium | CC: | keramidasceid |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: | dmesg, lspci -nn, kernel oops, GPU softreset |
Created attachment 123958 [details]
lspci -nn
What version of Mesa are you using? Did this already happen with older versions of Mesa and/or the kernel?

Name : mesa
Version : 11.2.2-1

I've only had this computer for a few weeks now so I can't really say if older versions worked better. I used the same GPU on my old computer until April and I don't recall encountering this (almost exact same software setup except for older packages of course).

I just rebooted and it took a while for this to occur again but it did:

[77378.984705] radeon 0000:02:00.0: GPU fault detected: 146 0x0e02440c
[77378.984717] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00010870
[77378.984722] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C
[77378.984727] VM fault (0x0c, vmid 1) at page 67696, read from TC (68)

I'm not sure if it's because of Chromium but it seems to at least be happening when browsing the web - I have "Override software rendering list" enabled in chrome://flags because performance sucks otherwise.

OK my 3D accelerated VirtualBox VM just got stuck and journalctl -kf started spitting out VM faults once a second in a loop, at first there's a:

May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x0042080c
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00014482
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
May 26 01:39:24 carrier kernel: VM fault (0x0c, vmid 1) at page 83074, read from TC (8)

And then a:

May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x0390350c
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000C49C
May 26 01:39:24 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1003500C
May 26 01:39:24 carrier kernel: VM fault (0x0c, vmid 8) at page 50332, read from VGT (53)

It devolves into looping the VGT entry with VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000C4A<NUMBER> with NUMBER cycling from 0..F

I swapped in another card of the same generation (HD 7750 -> R7 250), these errors keep happening, though they might not even be noticeable before the computer completely freezes eventually. Not even SysRq works, I have to reset from power button. This time there was another error included:

May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: GPU fault detected: 146 0x00b2480c
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00009EF8
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x1200400C
May 27 06:16:22 carrier kernel: VM fault (0x0c, vmid 9) at page 40696, read from TC (4)
May 27 06:16:22 carrier kernel: radeon 0000:02:00.0: IH ring buffer overflow (0x000019F0, 0x00001E60, 0x00001A00)

Created attachment 124132 [details]
kernel oops
Without pci=nommconf, the kernel logs PCI bus errors that get corrected. Could they be connected to this issue? I reliably get a visible kernel oops from radeon with this configuration. With pci=nommconf, seemingly no errors get logged (except VM faults), but regardless, it eventually crashes so hard that not even SysRq works.
Attached kernel oops.
Created attachment 124390 [details]
GPU softreset
PCIe errors go away with pcie_aspm=off as well apparently.
Regardless, my computer keeps crashing so badly not even SysRq works. I don't get any kernel error messages about anything except for the GPU faults. Usually I get a whole bunch of them just before completely crashing.
Last night the computer was idle and screens were in powersaving mode and they suddenly woke up and then turned off again - apparently it crashed.
This time there's a better error in the journal from before the crash:
GPU fault -> GPU lockup -> Couldn't update BO_VA -> GPU softreset -> dead
See attachment.
I get the same error, but only when running "The Talos Principle". The system does not crash, but the performance is pretty bad on the game (15fps at high 1080p).

Ago 24 00:58:22 boreal kernel: VM fault (0x0c, vmid 7) at page 0, read from 'TC4' (0x54433400) (136)
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x000e880c
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E08800C

The error repeats a lot, several times per second.

(In reply to Ismael from comment #8)
> I get the same error, but only when running "The Talos Principle". The
> system does not crash, but the performance is pretty bad on the game (15fps
> at high 1080p).
>
> Ago 24 00:58:22 boreal kernel: VM fault (0x0c, vmid 7) at page 0, read from
> 'TC4' (0x54433400) (136)
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0: GPU fault detected: 146
> 0x000e880c
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:
> VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
> Ago 24 00:58:22 boreal kernel: radeon 0000:01:00.0:
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E08800C
>
> The error repeats a lot, several times per second.

This is fixed by:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=2c13abb49137d0f81b530b3c67f1ed79c58c796e

I think the original issue is unrelated. For the original poster: I recommend testing mesa/master.

No further feedback from the original reporter assuming fixed and closing. Please reopen if issues continue.
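The fault lines quoted in this comment print the memory client both as a string and as a hex value: 'TC4' (0x54433400). The hex value looks like the same tag packed as ASCII bytes. A minimal Python sketch of that reading, assuming big-endian byte order and NUL padding (the helper name `decode_client_tag` is hypothetical and not part of any driver):

```python
def decode_client_tag(raw: int) -> str:
    """Read a 32-bit value as up to four ASCII characters.

    Assumption: bytes are ordered most significant first and unused
    trailing bytes are NUL, which matches the values in this report.
    """
    data = raw.to_bytes(4, "big")
    return data.rstrip(b"\x00").decode("ascii")


# Value taken from the "VM fault" line quoted above.
print(decode_client_tag(0x54433400))
```

If this reading is right, the string and the hex value carry the same information, and only the trailing number in parentheses (136 here) is a separate client ID.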
I don't know if it's relevant or not, since I'm using the amdgpu-pro driver, but I sometimes get these messages:

[14849.076326] gmc_v8_0_process_interrupt: 165 callbacks suppressed
[14849.076331] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14849.076336] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001152FF
[14849.076338] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[14849.076341] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[14851.218731] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14851.218736] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001152FF
[14851.218738] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07700C
[14851.218741] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[14860.154325] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[14860.154330] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001152FF
[14860.154331] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0407700C
[14860.154334] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 2, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15073.787603] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15073.787608] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001152FF
[15073.787610] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07700C
[15073.787612] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15095.908340] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15095.908345] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001152FF
[15095.908347] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0607700C
[15095.908350] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 3, pasid 32770) at page 1135359, read from 'SDM0' (0x53444d30) (119)
[15197.968706] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0cf8770c
[15197.968711] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0011D39F
[15197.968713] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15197.968715] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1168287, read from 'SDM0' (0x53444d30) (119)
[15710.487271] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15710.487275] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001094FF
[15710.487277] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15710.487279] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15759.495971] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15759.495978] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001094FF
[15759.495981] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0807700C
[15759.495985] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15768.854519] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15768.854525] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001094FF
[15768.854526] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[15768.854529] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)
[15818.316441] amdgpu 0000:01:00.0: GPU fault detected: 146 0x07f8770c
[15818.316447] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001094FF
[15818.316448] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0207700C
[15818.316451] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 1, pasid 32770) at page 1086719, read from 'SDM0' (0x53444d30) (119)

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1231.
Created attachment 123957 [details]
dmesg

With some uptime I eventually start getting these GPU VM faults:

[106098.543115] VM fault (0x0c, vmid 6) at page 58321, read from TC (68)
[106098.543119] radeon 0000:02:00.0: GPU fault detected: 146 0x0a0c480c
[106098.543121] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000E3CB
[106098.543123] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C

It's always the same fault number but the memory addresses(?) change. See attached dmesg.

My system doesn't crash, but black glitches do appear on the screen and I need to switch desktops or move windows around to make them go away. KDE desktop with compositing enabled.

I'm running Arch Linux with latest stable packages:

Linux carrier 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016 x86_64 GNU/Linux

Name : xf86-video-ati
Version : 1:7.7.0-1

P.S. I use pci=nommconf because I have massive issues with PCI-E devices otherwise (radeon goes nuts coincidentally).
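On the "memory addresses(?)" question: in the complete four-line groups reported in this thread, the decoded "VM fault" line can be reproduced from the two register values. The page number is the FAULT_ADDR value in decimal, and the vmid and client ID sit in fixed bit ranges of FAULT_STATUS. The Python sketch below encodes that reading. The bit positions are an assumption inferred from the numbers in this report, not taken from driver documentation, and the read/write bit in particular is a guess, since every fault here is a read. The names `VMFault` and `decode_vm_fault` are hypothetical.

```python
from typing import NamedTuple


class VMFault(NamedTuple):
    page: int         # faulting page, as printed after "at page"
    vmid: int         # VM context ID
    client_id: int    # memory client ID, the trailing number in parentheses
    protections: int  # the "0x0c" value at the start of the fault line
    is_write: bool    # assumed: bit 24 set means write, clear means read


def decode_vm_fault(addr: int, status: int) -> VMFault:
    """Decode a FAULT_ADDR / FAULT_STATUS pair.

    Bit layout is inferred from the logs in this report (assumption).
    """
    return VMFault(
        page=addr,                          # register already holds a page number
        vmid=(status >> 25) & 0xF,          # bits 25..28
        client_id=(status >> 12) & 0xFF,    # bits 12..19
        protections=status & 0xFF,          # bits 0..7
        is_write=bool((status >> 24) & 1),  # bit 24 (assumed)
    )


# Register pair from the thread: ADDR 0x00010870, STATUS 0x0204400C.
print(decode_vm_fault(0x00010870, 0x0204400C))
```

Note that in the excerpt directly above, the "VM fault" line is printed before the registers, so it most likely belongs to the preceding fault and will not match the ADDR and STATUS values that follow it.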