Bug 87278

Summary:	Packet0 not allowed and GPU fault detected errors with Serious Engine games
Product:	Mesa	Reporter:	Daniel Scharrer <daniel>
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:
Severity:	normal
Priority:	medium	CC:	alexandre.f.demers, ashmikuz, haagch, keramidasceid, maraeo
Version:	git
Hardware:	Other
OS:	All
See Also:	https://bugs.freedesktop.org/show_bug.cgi?id=84500
Whiteboard:
i915 platform:		i915 features:
Attachments:	dmesg output with the GPU fault errors filtered out; standard output from The Talos Principle and Serious Sam 3; possible fix for VM faults; VM faults and Packet0 error when quitting the current game; sorry i should've attached it; dmesg with blizzard's heroes of the storm beta in wine; Output with R600_DEBUG=ps,vs,gs; dmesg to "Output with R600_DEBUG=ps,vs,gs"

Description Daniel Scharrer 2014-12-13 10:38:05 UTC

Created attachment 110808 [details]
dmesg output with the GPU fault errors filtered out

Running Serious Sam 3 or The Talos Principle spams dmesg with thousands of these errors:

[ 6001.212237] radeon 0000:01:00.0: GPU fault detected: 147 0x02528801
[ 6001.212243] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF02192
[ 6001.212246] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x12088001
[ 6001.212249] VM fault (0x01, vmid 9) at page 267395474, read from TC (136)

There are also a few "Packet0 not allowed" errors (followed by a hex dump):

[15446.473341] radeon 0000:01:00.0: Packet0 not allowed!

So far it's only these errors in dmesg - I haven't observed any actual rendering issues, crashes, or GPU lockups because of this.

I have only attached a filtered kernel log with the GPU fault errors removed - the full log is available at http://constexpr.org/tmp/serious-dmesg.log (140 MiB).
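A filtered log like the attached one could be produced by dropping the four lines that make up each fault record. The following is only a minimal sketch, not necessarily how the attachment was created; the function name and the marker substrings are assumptions based on the log lines quoted in this report.

```python
# Substrings identifying the four lines of one GPU fault record,
# copied from the dmesg excerpt quoted in this report.
FAULT_MARKERS = (
    "GPU fault detected:",
    "VM_CONTEXT1_PROTECTION_FAULT_ADDR",
    "VM_CONTEXT1_PROTECTION_FAULT_STATUS",
    "VM fault (",
)


def strip_gpu_faults(lines):
    """Return only the lines that are not part of a GPU fault record."""
    return [
        line for line in lines
        if not any(marker in line for marker in FAULT_MARKERS)
    ]


sample = [
    "[ 6001.212237] radeon 0000:01:00.0: GPU fault detected: 147 0x02528801",
    "[ 6001.212243] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF02192",
    "[ 6001.212246] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x12088001",
    "[ 6001.212249] VM fault (0x01, vmid 9) at page 267395474, read from TC (136)",
    "[15446.473341] radeon 0000:01:00.0: Packet0 not allowed!",
]
for line in strip_gpu_faults(sample):
    print(line)
```

Matching on substrings rather than on the timestamp keeps the filter independent of the log source (dmesg or journal).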

These games use Serious Engine 3.5 (Serious Sam 3) and 4 (The Talos Principle), respectively. This is also reproducible with The Talos Principle Public Test, which as of now is still available as a free download on Steam.

Kernel: 3.18.0-gentoo
GPU: Radeon HD 7950
Driver: radeonsi, Mesa 10.5.0-devel (git-ff96537)

This might be related to bug 84500. However, those spurious Packet0 errors had been gone for a while with updated Mesa - now I get them again, but only while running Serious Engine games.

Comment 1 Daniel Scharrer 2014-12-13 10:39:59 UTC

Created attachment 110809 [details]
standard output from The Talos Principle and Serious Sam 3

The standard output has this repeated a few times:

radeon: The kernel rejected CS, see dmesg for more information.

Comment 2 Christoph Haag 2014-12-13 18:45:42 UTC

This also happens on an HD 7970M (Pitcairn).

00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M] (rev ff)

This is on the latest Mesa git master, with both Linux 3.18 and drm-next-3.19.

Setting the game to lowest settings doesn't seem to help.

When looking closely at the ground, the VM faults seem to stop; when looking a little into the distance, they start happening again.

[47303.471209] radeon 0000:01:00.0: GPU fault detected: 147 0x0fe2c401
[47303.471211] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF03E4E
[47303.471212] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C4001
[47303.471214] VM fault (0x01, vmid 1) at page 267402830, read from TC (196)
[47303.487684] radeon 0000:01:00.0: GPU fault detected: 147 0x09c2c801
[47303.487689] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF03E4E
[47303.487691] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C8001
[47303.487694] VM fault (0x01, vmid 1) at page 267402830, read from TC (200)
[47303.487696] radeon 0000:01:00.0: GPU fault detected: 147 0x09c28401
[47303.487698] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FFFFFFF
[47303.487699] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C4001
[47303.487701] VM fault (0x01, vmid 1) at page 268435455, read from TC (196)
[47303.504293] radeon 0000:01:00.0: GPU fault detected: 147 0x09c24801
[47303.504298] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF03E4E
[47303.504300] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048001
[47303.504302] VM fault (0x01, vmid 1) at page 267402830, read from TC (72)
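To see which pages fault repeatedly, the "VM fault" summary lines can be grouped by page number. This is a rough sketch under stated assumptions: the regular expression is modelled on the lines quoted in this report, and the "write" alternative is an assumption, since only reads appear in this thread.

```python
import re
from collections import Counter

# Matches the summary line printed for each fault, e.g.
# "VM fault (0x01, vmid 1) at page 267402830, read from TC (196)"
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from (\S+) \((\d+)\)"
)


def summarize_faults(log_text):
    """Count faults per (page number, client name) pair."""
    counts = Counter()
    for match in VM_FAULT_RE.finditer(log_text):
        page = int(match.group(3))
        client = match.group(5)
        counts[(page, client)] += 1
    return counts


log = """\
[47303.471214] VM fault (0x01, vmid 1) at page 267402830, read from TC (196)
[47303.487694] VM fault (0x01, vmid 1) at page 267402830, read from TC (200)
[47303.487701] VM fault (0x01, vmid 1) at page 268435455, read from TC (196)
[47303.504302] VM fault (0x01, vmid 1) at page 267402830, read from TC (72)
"""
for (page, client), count in summarize_faults(log).items():
    print(f"page {page} (0x{page:08X}) via {client}: {count} fault(s)")
```

For the excerpt above, three of the four faults hit page 267402830 (0x0FF03E4E) and one hits page 268435455 (0x0FFFFFFF).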

Comment 3 Michel Dänzer 2014-12-16 08:48:35 UTC

Does the environment variable R600_DEBUG=nodma avoid this problem?

Comment 4 Christoph Haag 2014-12-16 08:56:39 UTC

(In reply to Michel Dänzer from comment #3)
> Does the environment variable R600_DEBUG=nodma avoid this problem?

No, R600_DEBUG=nodma does not help.

Comment 5 Alexandre Demers 2014-12-17 05:52:30 UTC

(In reply to Daniel Scharrer from comment #0)
> Created attachment 110808 [details]
> dmesg output with the GPU fault errors filtered out
> 
> Running Serious Sam 3 or The Talos Principle spams dmesg with thousands of
> these errors:
> 
> [ 6001.212237] radeon 0000:01:00.0: GPU fault detected: 147 0x02528801
> [ 6001.212243] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF02192
> [ 6001.212246] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x12088001
> [ 6001.212249] VM fault (0x01, vmid 9) at page 267395474, read from TC (136)
> 
> There are also a few "Packet0 not allowed" errors (followed by a hex dump):
> 
> [15446.473341] radeon 0000:01:00.0: Packet0 not allowed!
> 
> So far it's only these errors in dmesg - I haven't observed any actual
> rendering issues, crashes, GPU lockups because of this.
> 
> I have only attached a filtered kernel log with the GPU fault errors removed
> - the full log is available at http://constexpr.org/tmp/serious-dmesg.log
> (140 MiB).
> 
> Both of these games use the Serious Engine 3.5 (Serious Sam 3) or 4 (The
> Talos Principle). This is also reproducible with The Talos Principle Public
> Test which as of now is still available as a free download on Steam.
> 
> Kernel: 3.18.0-gentoo
> GPU: Radeon HD 7950
> Driver: radeonsi, Mesa 10.5.0-devel (git-ff96537)
> 
> This might be related to bug 84500 - however those spurious Packet0 have
> been gone for a while now with updated Mesa - now I got them again but only
> while running Serious Engine games.

I haven't looked at the log when launching SS3 yet, but it definitely crashes almost immediately once in a game. It could be related to your bug. However, I think the VM faults and the Packet0 errors are different bugs. I'll check the logs to see whether I get something similar.

Comment 6 Michel Dänzer 2014-12-17 09:39:33 UTC

(In reply to Christoph Haag from comment #2)
> [47303.471211] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF03E4E

The fact that bits 32-39 of the faulting addresses are FF indicates incorrect shader code generation resulting in those address bits being clobbered.
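The arithmetic behind this observation can be sketched as follows. The register value matches the page number in the "VM fault" line (0x0FF03E4E is 267402830), so the byte address is the register value shifted left by the page shift. The 4 KiB page size and the helper names are assumptions made for this illustration.

```python
GPU_PAGE_SHIFT = 12  # assumption: 4 KiB GPUVM pages


def fault_byte_address(fault_addr_reg):
    """Convert the page number in the FAULT_ADDR register to a byte address."""
    return fault_addr_reg << GPU_PAGE_SHIFT


def bits_32_to_39(byte_addr):
    """Extract bits 32-39 of a byte address."""
    return (byte_addr >> 32) & 0xFF


# Register values taken from the logs in comment 0 and comment 2.
for reg in (0x0FF02192, 0x0FF03E4E, 0x0FFFFFFF):
    addr = fault_byte_address(reg)
    print(f"reg 0x{reg:08X} -> address 0x{addr:010X}, "
          f"bits 32-39 = 0x{bits_32_to_39(addr):02X}")
```

Under that assumption, every faulting address quoted so far has 0xFF in bits 32-39.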

Can you generate an apitrace which reproduces the Packet0 error and/or GPUVM faults?

Comment 7 Daniel Scharrer 2014-12-17 10:34:09 UTC

(In reply to Michel Dänzer from comment #3)
> Does the environment variable R600_DEBUG=nodma avoid this problem?

No change here either. But now that you mention it: IIRC SS3 did sometimes lock up the GPU before http://cgit.freedesktop.org/mesa/mesa/commit/?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4

(In reply to Michel Dänzer from comment #6)
> Can you generate an apitrace which reproduces the Packet0 error and/or GPUVM
> faults?

Here is one with VM faults from starting The Talos Principle Public Test, just up to the main menu:

http://constexpr.org/tmp/Talos_Demo.trace (149 MiB)

Sometimes there is also a Packet0 error at the end. I didn't get it while recording, but got it in 2 of 3 replays.

Comment 8 Michel Dänzer 2014-12-19 03:36:47 UTC

(In reply to Daniel Scharrer from comment #7)
> Here is one with VM faults from starting The Talos Principle Public Test,
> just up to the main menu:
> 
> http://constexpr.org/tmp/Talos_Demo.trace (149 MiB)

Thanks. The VM faults generated by this apitrace turned out to be a Mesa regression. I bisected it:

5e0fbe1b631d883eb0e033938a534a259c8d95fd is the first bad commit
commit 5e0fbe1b631d883eb0e033938a534a259c8d95fd
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Sat Oct 4 20:41:03 2014 +0200

    radeonsi: remove vs.ucps_enabled from the shader key
    
    Written CLIPDIST outputs are simply disabled in PA_CL_VS_OUT_CNTL.

Comment 9 Michel Dänzer 2014-12-19 03:39:03 UTC

Note that I'm only getting the VM faults with my Cape Verde card, not with my Kaveri. Seems to be SI specific.

I haven't been able to reproduce the Packet0 error with this apitrace.

Comment 10 Marek Olšák 2014-12-19 11:13:22 UTC

Created attachment 111036 [details] [review]
possible fix for VM faults

Does this patch fix the VM faults?

Comment 11 Christoph Haag 2014-12-19 13:32:53 UTC

(In reply to Marek Olšák from comment #10)
> Created attachment 111036 [details] [review]
> possible fix for VM faults
> 
> Does this patch fix the VM faults?

I don't think it does for me. I still get

radeon 0000:01:00.0: GPU fault detected: 147 0x0342c401
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF0081A
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C4001
VM fault (0x01, vmid 1) at page 267388954, read from TC (196)
radeon 0000:01:00.0: GPU fault detected: 147 0x03424401
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF0081A
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02044001
VM fault (0x01, vmid 1) at page 267388954, read from TC (68)
radeon 0000:01:00.0: GPU fault detected: 147 0x03428401
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF0081A
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02084001
VM fault (0x01, vmid 1) at page 267388954, read from TC (132)
radeon 0000:01:00.0: GPU fault detected: 147 0x03420401
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF0081A
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004001
VM fault (0x01, vmid 1) at page 267388954, read from TC (4)
radeon 0000:01:00.0: GPU fault detected: 147 0x03428401
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF0081A
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02084001
VM fault (0x01, vmid 1) at page 267388954, read from TC (132)
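The status values above differ only in the memory client ID, which can be seen by splitting the register into fields. The bit layout used here is an assumption inferred from the values printed in this thread (for example, status 0x020C4001 is reported as protection 0x01, vmid 1, client 196); it is not taken from the driver source.

```python
def decode_fault_status(status):
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields.

    Assumed layout, inferred from the logs in this report:
      bits 0-3    protection flags
      bits 12-19  memory client ID
      bit  24     0 = read, 1 = write
      bits 25-28  VMID
    """
    return {
        "protections": status & 0xF,
        "client_id": (status >> 12) & 0xFF,
        "access": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


# Status values from the log above.
for status in (0x020C4001, 0x02044001, 0x02084001, 0x02004001):
    print(f"0x{status:08X} -> {decode_fault_status(status)}")
```

With this layout the four distinct status values decode to client IDs 196, 68, 132 and 4, all as reads on vmid 1, matching the accompanying "VM fault" lines.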

Comment 12 Alexandre Demers 2014-12-28 16:29:25 UTC

Sorry, it has been a couple of weeks, but I confirm I have the same problem with my R9 270X. I'm using latest mesa, drm, ddx from git repositories on a 3.19-rc1 kernel.

I'll be attaching my log. I'll also test the patch.

Comment 13 Alexandre Demers 2014-12-28 16:30:59 UTC

Created attachment 111430 [details]
VM faults and Packet0 error when quitting the current game

on R9 270X with today's latest mesa, drm, ddx from git repositories and kernel 3.19-rc1

Comment 14 Alexandre Demers 2014-12-30 05:15:52 UTC

(In reply to Alexandre Demers from comment #13)
> Created attachment 111430 [details]
> VM faults and Packet0 error when quitting the current game
> 
> on R9 270X with today's latest mesa, drm, ddx from git repositories and
> kernel 3.19-rc1

I tested the patch and still get the same errors as before, like Christoph.

Comment 15 Arash 2015-01-05 07:14:58 UTC

same issue with dota2, but with no Packet0 message.
kernel 3.18
llvm git
mesa git
r9-270x/radeonsi

it happens fairly frequently (or every time i pick 'morphling')  :(
i haven't had this problem with 6570 (r600g)...

...
Jan 05 06:25:08 -- kernel: switching to power state:
Jan 05 06:25:08 -- kernel:         ui class: performance
Jan 05 06:25:08 -- kernel:         internal class: none
Jan 05 06:25:08 -- kernel:         caps: 
Jan 05 06:25:08 -- kernel:         uvd    vclk: 0 dclk: 0
Jan 05 06:25:08 -- kernel:                 power level 0    sclk: 30000 mclk: 15000 vddc: 875 vddci: 850 pcie gen: 1
Jan 05 06:25:08 -- kernel:                 power level 1    sclk: 45000 mclk: 140000 vddc: 950 vddci: 1025 pcie gen: 1
Jan 05 06:25:08 -- kernel:                 power level 2    sclk: 105000 mclk: 140000 vddc: 1163 vddci: 1025 pcie gen: 1
Jan 05 06:25:08 -- kernel:                 power level 3    sclk: 112000 mclk: 140000 vddc: 1206 vddci: 1025 pcie gen: 1
Jan 05 06:25:08 -- kernel:         status: c r 
Jan 05 06:25:40 -- kernel: IPVS: Creating netns size=2056 id=21
Jan 05 06:25:40 -- kernel: IPVS: ftp: loaded support on port[0] = 21
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0: GPU fault detected: 147 0x00044401
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x01000000
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04044001
Jan 05 06:28:59 -- kernel: VM fault (0x01, vmid 2) at page 16777216, read from TC (68)
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0: GPU fault detected: 147 0x00044401
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x01000000
Jan 05 06:28:59 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04044001
Jan 05 06:28:59 -- kernel: VM fault (0x01, vmid 2) at page 16777216, read from TC (68)
Jan 05 06:29:10 -- kernel: radeon 0000:01:00.0: ring 0 stalled for more than 10293msec
Jan 05 06:29:10 -- kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000002165c7 last fence id 0x00000000002165e9 on ring 0)
Jan 05 06:29:10 -- kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Jan 05 06:29:10 -- kernel: [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Jan 05 06:29:11 -- kernel: [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: Saved 1355 dwords of commands on ring 0.
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: GPU softreset: 0x0000004D
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0xF7D24028
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xEFC00000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0xEFC00000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008006
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100100
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Jan 05 06:29:11 -- kernel: [drm] probing gen 2 caps for device 8086:2e31 = 2212501/0
Jan 05 06:29:11 -- kernel: [drm] PCIE GART of 1024M enabled (table at 0x0000000000277000).
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: WB enabled
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00 and cpu addr 0xffff8800db08fc00
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04 and cpu addr 0xffff8800db08fc04
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08 and cpu addr 0xffff8800db08fc08
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c and cpu addr 0xffff8800db08fc0c
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10 and cpu addr 0xffff8800db08fc10
Jan 05 06:29:11 -- kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900048b5a18
Jan 05 06:29:11 -- kernel: [drm] ring test on 0 succeeded in 3 usecs
Jan 05 06:29:11 -- kernel: [drm] ring test on 1 succeeded in 1 usecs
Jan 05 06:29:11 -- kernel: [drm] ring test on 2 succeeded in 1 usecs
Jan 05 06:29:11 -- kernel: [drm] ring test on 3 succeeded in 6 usecs
Jan 05 06:29:11 -- kernel: [drm] ring test on 4 succeeded in 6 usecs
Jan 05 06:29:11 -- kernel: [drm] ring test on 5 succeeded in 2 usecs
Jan 05 06:29:11 -- kernel: [drm] UVD initialized successfully.
Jan 05 06:29:11 -- kernel: switching from power state:
Jan 05 06:29:11 -- kernel:         ui class: none
Jan 05 06:29:11 -- kernel:         internal class: boot 
Jan 05 06:29:11 -- kernel:         caps: 
Jan 05 06:29:11 -- kernel:         uvd    vclk: 0 dclk: 0
Jan 05 06:29:11 -- kernel:                 power level 0    sclk: 15000 mclk: 15000 vddc: 950 vddci: 950 pcie gen: 1
Jan 05 06:29:11 -- kernel:         status: c b 
...
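The timestamps in this excerpt allow a rough check of when the ring stall began relative to the VM faults. The sketch below uses only values from the log above plus the date of this comment; since the journal timestamps have one-second resolution, the result is approximate and does not by itself show that the faults caused the lockup.

```python
from datetime import datetime, timedelta

# Timestamps from the log above (date taken from the comment header).
fault_seen = datetime(2015, 1, 5, 6, 28, 59)      # "GPU fault detected"
stall_reported = datetime(2015, 1, 5, 6, 29, 10)  # "ring 0 stalled ..."

# "ring 0 stalled for more than 10293msec"
stall_duration = timedelta(milliseconds=10293)

# Estimated moment at which the ring stopped making progress.
stall_began = stall_reported - stall_duration
gap = stall_began - fault_seen

print("stall began around", stall_began.strftime("%H:%M:%S.%f"))
print("seconds after the logged fault:", gap.total_seconds())
```

The estimate places the start of the stall within the same second as the logged VM faults.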

Comment 16 Arash 2015-01-05 08:16:02 UTC

Created attachment 111751 [details]
sorry i should've attached it

Comment 17 Marek Olšák 2015-02-07 13:36:25 UTC

This seems to be fixed with current Mesa git. Can you confirm?

Comment 18 Christoph Haag 2015-02-07 14:56:40 UTC

(In reply to Marek Olšák from comment #17)
> This seems to be fixed with current Mesa git. Can you confirm?

I played for a while and the problems were gone for me.

The performance is still very bad and sometimes there is some graphics/texture corruption flickering. But this specific issue here seems to be fixed.

Comment 19 Alexandre Demers 2015-02-07 15:57:51 UTC

(In reply to Marek Olšák from comment #17)
> This seems to be fixed with current Mesa git. Can you confirm?

Indeed, no GPU faults and no Packet 0 observed. SS3 doesn't crash the whole desktop anymore.

Do you have any idea what was pushed that may have fixed the bug we were seeing?

Comment 20 Daniel Scharrer 2015-02-07 19:48:36 UTC

I can confirm that the GPU fault errors are gone, but still get Packet0 errors (both in game and in the apitrace from Comment 7).

Also, there were still GPU fault errors in The Talos Principle and demo (but not the apitrace) until I also updated LLVM.

Comment 21 Alexandre Demers 2015-02-07 23:08:14 UTC

(In reply to Daniel Scharrer from comment #20)
> I can confirm that the GPU fault errors are gone, but still get Packet0
> errors (both in game and in the apitrace from Comment 7).
> 
> Also, there were still GPU fault errors in The Talos Principle and demo (but
> not the apitrace) until I also updated LLVM.

Good point about LLVM, because I'm also using yesterday's svn LLVM code.

Comment 22 Marek Olšák 2015-02-08 10:18:11 UTC

(In reply to Alexandre Demers from comment #19)
> (In reply to Marek Olšák from comment #17)
> > This seems to be fixed with current Mesa git. Can you confirm?
> 
> Indeed, no GPU faults and no Packet 0 observed. SS3 doesn't crash the
> whole desktop anymore.
> 
> Do you have any idea what was pushed that may have fixed the bug we were
> seeing?

Sorry, I have absolutely no idea. It could have been something in LLVM or perhaps something here:
http://cgit.freedesktop.org/mesa/mesa/log/?id=d8185aa9a8e3588fe014faef8afaeae56d45e90b

Thanks for the feedback. I'm closing the bug.

Comment 23 Daniel Scharrer 2015-02-18 18:26:24 UTC

(Some of) the GPU faults are back with Mesa git-8a71fd8 and LLVM r229671:

 [11047.892869] radeon 0000:01:00.0: GPU fault detected: 147 0x04088801
 [11047.892875] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0FF00820
 [11047.892878] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08088001
 [11047.892881] VM fault (0x01, vmid 4) at page 267388960, read from TC (136)
 [...]

There are also plenty of Packet0 errors still/again.

This happens after 4db985a5fa9ea985616a726b1770727309502d81 which reverts 0e9cdedd2e3943bdb7f3543a3508b883b167e427 "radeon/llvm: enable unsafe math for graphics shaders" as mentioned in bug 89069 comment 21. Unlike before this bug was closed, now there are only GPU faults after actually loading a level, which is not covered in the above trace. Here is the new, longer trace from the other bug report - maybe it will also allow others to better reproduce the Packet0 errors:

 http://constexpr.org/tmp/TalosDemo-radeonsi.2.trace.xz (83 MiB)

Comment 24 ArneJ 2015-03-10 18:59:36 UTC

I also see a lot of these messages in The Talos Principle on a R9 270X here:

...
[416091.177464] radeon 0000:01:00.0: GPU fault detected: 147 0x000a0401
[416091.177467] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.177468] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004001
[416091.177469] VM fault (0x01, vmid 5) at page 116400256, read from TC (4)
[416091.195605] radeon 0000:01:00.0: GPU fault detected: 147 0x000a0401
[416091.195608] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.195610] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004001
[416091.195611] VM fault (0x01, vmid 5) at page 116400256, read from TC (4)
[416091.213688] radeon 0000:01:00.0: GPU fault detected: 147 0x000a4801
[416091.213692] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.213693] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048001
[416091.213694] VM fault (0x01, vmid 5) at page 116400256, read from TC (72)
[416091.231852] radeon 0000:01:00.0: GPU fault detected: 147 0x002a4801
[416091.231855] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02081
[416091.231857] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048001
[416091.231858] VM fault (0x01, vmid 5) at page 116400257, read from TC (72)
[416091.250052] radeon 0000:01:00.0: GPU fault detected: 147 0x000a8801
[416091.250056] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.250057] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088001
[416091.250058] VM fault (0x01, vmid 5) at page 116400256, read from TC (136)
[416091.268150] radeon 0000:01:00.0: GPU fault detected: 147 0x000a8801
[416091.268153] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.268154] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088001
[416091.268156] VM fault (0x01, vmid 5) at page 116400256, read from TC (136)
[416091.286178] radeon 0000:01:00.0: GPU fault detected: 147 0x002a4801
[416091.286181] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02081
[416091.286182] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048001
[416091.286183] VM fault (0x01, vmid 5) at page 116400257, read from TC (72)
[416091.304253] radeon 0000:01:00.0: GPU fault detected: 147 0x000a4801
[416091.304256] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x06F02080
[416091.304257] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048001
[416091.304259] VM fault (0x01, vmid 5) at page 116400256, read from TC (72)
...

It looks like the game stutters when new textures are loaded, or something like that. For example, when I go to a new area and walk straight ahead, everything is smooth. When I start looking around, I get stuttering. This happens only once. After the initial stuttering, the game runs at normal speed again.
I also see some graphics corruption like in https://bugs.freedesktop.org/show_bug.cgi?id=88978, which I can also see in Dota itself.

I'm running mesa 5750595ca97b2f8f18d22af35b431a6c66dd899a and llvm r231783.

lspci says:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT [Radeon R9 270X]

Comment 25 Christoph Haag 2015-03-31 22:33:28 UTC

Created attachment 114793 [details]
dmesg with blizzard's heroes of the storm beta in wine

When playing Heroes of the Storm (I think you need a beta key to play) with wine (wine-staging with csmt, I admit), I get a lot of GPU problems with radeonsi.

I've seen "radeon 0000:01:00.0: Packet0 not allowed!" and a lot of GPU faults, so perhaps it is related.

It's kinda unplayable with radeonsi because it often hangs and it takes several seconds for it to recover.

recent llvm 3.7 svn, recent mesa git, linux 3.19-ck

Comment 26 Tom Stellard 2015-04-01 02:08:04 UTC

Can you run the game with R600_DEBUG=ps,vs,gs and post the output?
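One way to collect such output is to launch the game from a small wrapper that sets the variable and redirects the process output to a file. This is only a sketch: the helper names are hypothetical, and capturing both stdout and stderr is an assumption, since the thread does not say which stream the shader dumps are written to.

```python
import os
import subprocess


def debug_env(flags, base=None):
    """Return a copy of the environment with R600_DEBUG set to the given flags."""
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = ",".join(flags)
    return env


def run_with_shader_dump(command, log_path, flags=("ps", "vs", "gs")):
    """Run `command` with R600_DEBUG set, writing stdout and stderr to `log_path`."""
    # Binary mode so that non-text bytes in the output are stored unchanged.
    with open(log_path, "wb") as log:
        result = subprocess.run(
            command,
            env=debug_env(flags),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

A call such as `run_with_shader_dump(["./game"], "r600_debug.log")` would then produce a single file to attach, where `./game` stands in for the actual launcher.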

Comment 27 Christoph Haag 2015-04-01 11:54:40 UTC

Created attachment 114806 [details]
Output with R600_DEBUG=ps,vs,gs

Uhm, good luck with that 7 megabyte file. Not sure what's the binary garbage at the beginning.

Comment 28 Christoph Haag 2015-04-01 11:55:35 UTC

Created attachment 114807 [details]
dmesg to "Output with R600_DEBUG=ps,vs,gs"

Comment 29 Daniel Scharrer 2015-05-08 23:09:26 UTC

With Mesa git-3bdbc1e, LLVM r236436 and Linux 4.0.1-gentoo my previous Talos traces don't produce any GPU VM faults anymore. However, the game still does. Here is a new trace:

 http://constexpr.org/tmp/Talos-radeonsi.3.trace.xz (147 MiB)

This trace still produces VM faults even when re-enabling unsafe-fp-math optimizations (see bug 89069).

There is also some junk being rendered at the end of the trace.

Comment 30 Daniel Scharrer 2015-08-01 18:38:40 UTC

I no longer get any GPU faults or Packet0 errors with current LLVM and mesa (2b83133, even with unsafe math disabled again).

Comment 31 Julien Isorce 2018-08-09 17:35:20 UTC

(In reply to Marek Olšák from comment #17)
> This seems to be fixed with current Mesa git. Can you confirm?

It looks like it is not fixed in Mesa git, as I can reproduce it with the apitrace from comment #29 on Cape Verde with the radeon driver, Mesa 18+, kernel 4.15.0-15-generic, LLVM 7.0.0, xorg 1.20.99.1, and xf86-video-ati 18.0.1.

(same result with kernel 4.4, mesa 12.0.6, llvm 3)

radeon 0000:08:00.0: GPU fault detected: 146 0x0d64520c
radeon 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001196EB
radeon 0000:08:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0405200C
VM fault (0x0c, vmid 2) at page 1152747, read from CB_CMASK (82)

I always get the errors above; sometimes I also get a GPU lockup, and sometimes the "Packet0 not allowed!" error as well.

The possible fix from comment #10 does not help.

Comment 32 GitLab Migration User 2019-09-25 17:51:36 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1213.
