96906 – OpenCL program causes steady stream of GPU fault detected errors

Status:	RESOLVED MOVED

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu-pro
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2016-07-12 18:56 UTC by Jolan Luff
Modified:	2019-11-19 07:57 UTC
CC List:	2 users

Description Jolan Luff 2016-07-12 18:56:04 UTC

Hi,

I have an OpenCL program which causes a steady stream of errors:

kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x03f84801
kernel: VM fault (0x01, vmid 6) at page 135290494, read from 'TC7' (0x54433700) (132)
kernel: amdgpu 0000:05:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C084001
kernel: amdgpu 0000:05:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08105E7E
kernel: amdgpu 0000:05:00.0: GPU fault detected: 147 0x03f08401
kernel: VM fault (0x01, vmid 3) at page 135290494, read from 'TC2' (0x54433200) (200)
kernel: amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C8001
kernel: amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08105E7E
kernel: amdgpu 0000:04:00.0: GPU fault detected: 147 0x03f0c801
kernel: VM fault (0x01, vmid 6) at page 135290495, read from 'TC1' (0x54433100) (4)
kernel: amdgpu 0000:05:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C004001
kernel: amdgpu 0000:05:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08105E7F

AMDGPU-Pro: 16.30.3.306809
Kernel: 4.7.0-rc6-mainline
GPU: RX480
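
The "VM fault" lines in the log above can be decoded mechanically. The following is a minimal sketch, not a driver tool: the function name `decode_vm_fault` and the regular expression are illustrative, the 4 KiB page size is an assumption, and the "write" alternative in the pattern is a guess (every fault in this report is a read). It shows that the page number is the same value as VM_CONTEXT1_PROTECTION_FAULT_ADDR written in decimal, and that the hex word after the client name is that name packed as ASCII.

```python
import re

# Pattern for the "VM fault" line as it appears in this report.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB GPU pages


def decode_vm_fault(line):
    """Return the fields of one VM fault line, or None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    prot, vmid, page, access, tag, tag_hex, client_id = m.groups()
    page = int(page)
    # The hex word is the client tag as big-endian ASCII, NUL-padded.
    tag_bytes = int(tag_hex, 16).to_bytes(4, "big").rstrip(b"\x00")
    return {
        "protections": int(prot, 16),
        "vmid": int(vmid),
        "page": page,
        "page_hex": f"0x{page:08X}",       # comparable to FAULT_ADDR
        "byte_address": page * PAGE_SIZE,  # relies on the PAGE_SIZE assumption
        "access": access,
        "client_tag": tag,
        "client_tag_from_hex": tag_bytes.decode("ascii"),
        "client_id": int(client_id),
    }


fault = decode_vm_fault(
    "kernel: VM fault (0x01, vmid 6) at page 135290494, "
    "read from 'TC7' (0x54433700) (132)"
)
print(fault["page_hex"], fault["client_tag_from_hex"])
```

For the first fault in the log this gives page 0x08105E7E, which matches the FAULT_ADDR value printed two lines later, and the tag "TC7".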

Comment 1 Jolan Luff 2016-07-12 18:57:45 UTC

The program in question is "Claymore's Dual Ethereum+Decred/Siacoin GPU
Miner v5.0 (Windows/Linux)":

https://bitcointalk.org/index.php?topic=1433925.0

I have it running on 5 computers with 8 cards in total. Some single-GPU setups hit the problem only sporadically, but I have a dual-GPU system where the program is basically unusable. In general the program still works, despite some slowdown caused by the thousands of errors being generated and printed.

Comment 2 Jolan Luff 2016-07-12 18:59:09 UTC

I should also mention that the same program works on Fiji/Hawaii hardware with Catalyst/fglrx 15.12 and, according to various reports, seems to work error-free on Windows.

Comment 3 Rick 2016-07-17 18:21:10 UTC

This seems to happen with mining in general: different mining software produces the same error logs. Jolan Luff is mining with Claymore's; I'm using ethminer. In this forum thread, "langxxl" is also using ethminer and getting the same errors as I am: https://forum.ethereum.org/discussion/8250/ubuntu-16-04-lts-rx-480-mining-ethereum-confirmed-working?


It still hashes, just not very well: it averages 16 Mh/s when it should be around 25 Mh/s.

ASUS 8GB RX480
Ubuntu 16.04 LTS
AMDGPU-PRO v16.3
parity v1.2.2-beta
ethminer v1.2.9


[ 1084.587016] VM fault (0x01, vmid 3) at page 47326788, read from 'TC0' (0x54433000) (8)
[ 1084.587016] amdgpu 0000:07:00.0: GPU fault detected: 147 0x05708801
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00E0983C
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06088002
[ 1084.587016] VM fault (0x02, vmid 3) at page 14719036, read from 'TC6' (0x54433600) (136)
[ 1084.587016] amdgpu 0000:07:00.0: GPU fault detected: 147 0x0d78c401
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03B9780A
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C4001
[ 1084.587016] VM fault (0x01, vmid 3) at page 62486538, read from 'TC3' (0x54433300) (196)
[ 1084.587016] amdgpu 0000:07:00.0: GPU fault detected: 147 0x0c88c801
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x016D8DF6
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06044001
[ 1084.587016] VM fault (0x01, vmid 3) at page 23956982, read from 'TC5' (0x54433500) (68)
[ 1084.587016] amdgpu 0000:07:00.0: GPU fault detected: 147 0x09e08801
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x006F7D1B
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C4002
etc. etc...
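
In the excerpt above every line carries the same timestamp, and each FAULT_ADDR line appears before the VM fault line it belongs to. The two can be matched by value, because in the logs in this report the FAULT_ADDR register holds the faulting page number in hex. A minimal sketch under that observation (`pair_faults` and both regular expressions are illustrative names, not part of any existing tool):

```python
import re

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
PAGE_RE = re.compile(r"VM fault .* at page (\d+), \w+ from '([^']*)'")


def pair_faults(log_text):
    """Match each FAULT_ADDR value to VM fault lines reporting the same page."""
    # Collect client tags per decimal page number.
    clients_by_page = {}
    for m in PAGE_RE.finditer(log_text):
        clients_by_page.setdefault(int(m.group(1)), []).append(m.group(2))
    # Look up each register value (hex) among the collected pages.
    pairs = []
    for m in ADDR_RE.finditer(log_text):
        addr = int(m.group(1), 16)
        pairs.append((f"0x{addr:08X}", addr, clients_by_page.get(addr, [])))
    return pairs


# A few lines copied from the excerpt above.
sample = """\
[ 1084.587016] VM fault (0x01, vmid 3) at page 47326788, read from 'TC0' (0x54433000) (8)
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00E0983C
[ 1084.587016] VM fault (0x02, vmid 3) at page 14719036, read from 'TC6' (0x54433600) (136)
[ 1084.587016] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03B9780A
[ 1084.587016] VM fault (0x01, vmid 3) at page 62486538, read from 'TC3' (0x54433300) (196)
"""

pairs = pair_faults(sample)
for addr_hex, page, clients in pairs:
    print(addr_hex, page, clients)
```

An ADDR value with an empty client list would mean that its VM fault line is missing from the capture, for example because the excerpt was cut off.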

Comment 4 John Bridgman 2016-12-28 23:16:15 UTC

Is this still happening with the 16.50 driver ?

Comment 5 Jolan Luff 2017-01-02 20:35:31 UTC

I haven't tested with anything newer than 16.30.x yet (I use Arch Linux and the unofficial package hasn't been updated).

I didn't see any mention of OpenCL fixes in the changelog so I haven't tried to update myself.  I did just check and it looks like 16.50.X may be coming soon.  Will report back if no one else beats me to it.

Comment 6 Sebastian 2017-01-20 13:25:54 UTC

Still happens for me with 16.50

Ubuntu 16.04.1
Radeon R9 380
claymore dualminer

The exact same PC/setup works fine with an RX 470.



[  211.556980] VM fault (0x01, vmid 5) at page 135286044, read from 'TC2' (0x54433200) (0)
[  211.557252] amdgpu 0000:02:00.0: GPU fault detected: 147 0x08e0c001
[  211.557253] amdgpu 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08104D1C
[  211.557253] amdgpu 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C0001
[  211.557254] VM fault (0x01, vmid 5) at page 135286044, read from 'TC5' (0x54433500) (192)
[  211.557257] amdgpu 0000:02:00.0: GPU fault detected: 147 0x08e84401
[  211.557257] amdgpu 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08104D1D
[  211.557258] amdgpu 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
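
The VM_CONTEXT1_PROTECTION_FAULT_STATUS values in these logs appear to encode the same fields that the "VM fault" line prints. The sketch below uses a bit layout inferred from the values in this report: the low byte matches the protections value, bits 19:12 match the trailing client ID, and bits 28:25 match the vmid. The read/write flag at bit 24 is an assumption that the report cannot confirm, since every fault shown is a read. The function name `decode_fault_status` is illustrative.

```python
def decode_fault_status(status):
    """Split a FAULT_STATUS word into fields (layout inferred, see above)."""
    return {
        "protections": status & 0xFF,               # bits 7:0
        "memory_client_id": (status >> 12) & 0xFF,  # bits 19:12
        "is_write": bool((status >> 24) & 0x1),     # bit 24 (assumed)
        "vmid": (status >> 25) & 0xF,               # bits 28:25
    }


# Value taken from the log above; the next VM fault line reads
# "(0x01, vmid 5) ... read from 'TC5' ... (192)".
decoded = decode_fault_status(0x0A0C0001)
print(decoded)
```

Applied to 0x0A0C0001 this yields protections 0x01, client ID 192, and vmid 5, which agrees with the VM fault line that follows that register dump.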

Comment 7 EoD 2017-07-09 12:05:11 UTC

I don't think this is related to amdgpu-pro at all. I am getting the very same error when running the ethminer via Mesa/Clover.

> [OpenCL] Device:   AMD Radeon R9 380 Series (AMD TONGA / DRM 3.15.0 / 4.12.0, LLVM 3.9.1) / OpenCL 1.1 Mesa 17.2.0-devel (git-038c45a40e)

> [   50.993264] amdgpu 0000:01:00.0: GPU fault detected: 147 0x02e8c001
> [   50.993264] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03ABDF7A
> [   50.993264] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A080001
> [   50.993265] amdgpu 0000:01:00.0: VM fault (0x01, vmid 5) at page 61595514, read from 'TC11' (0x54433131) (128)
> [   50.993267] amdgpu 0000:01:00.0: GPU fault detected: 147 0x04200001
> [   50.993268] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x025E46F9
> [   50.993268] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A008001
> [   50.993269] amdgpu 0000:01:00.0: VM fault (0x01, vmid 5) at page 39732985, read from 'TC0' (0x54433000) (8) 
> [   50.994135] amdgpu 0000:01:00.0: IH ring buffer overflow (0x000C7820, 0x0000BE00, 0x00007830)


I ran the following code (mesa-compatible version)
  https://github.com/EoD/ethminer/tree/fix_mesa_compilation

with these parameters (demo mode)
  ./ethminer -G -Z

Comment 8 Rafael Ristovski 2018-02-04 15:11:39 UTC

I'm getting the same output when running vertminer with the open-source amdgpu kernel driver plus the OpenCL library from amdgpu-pro (an "unsupported" setup, since it mixes the binary components with open-source amdgpu).
I believe the issue is in the kernel driver, as that's the common link between all these setups.

>[ 1228.836110] amdgpu 0000:03:00.0: GPU fault detected: 147 0x09460402
>[ 1228.836110] amdgpu 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001F0B43
>[ 1228.836111] amdgpu 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06008002
>[ 1228.836112] amdgpu 0000:03:00.0: VM fault (0x02, vmid 3) at page 2034499, read from '' (0x00000000) (8)
>[ 1228.836122] amdgpu 0000:03:00.0: GPU fault detected: 147 0x0fe60402
>[ 1228.836123] amdgpu 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
>[ 1228.836123] amdgpu 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06008002
>[ 1228.836124] amdgpu 0000:03:00.0: VM fault (0x02, vmid 3) at page 0, read from '' (0x00000000) (8)

Comment 9 DanaGoyette 2019-05-20 02:20:13 UTC

It's not just miners that can cause it.  I get similar messages when running Luxmark and the LuxVR mode.  In my case, the device is a Radeon Pro WX 4100 ("Baffin").

Once these errors occur, all sorts of OpenGL applications freeze, including Chrome tabs.  After that, it seems like I can only fix it by rebooting.

Comment 10 Martin Peres 2019-11-19 07:57:48 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/8.
