Bug 97039

Summary: The Talos Principle and Serious Sam 3 GPU faults
Product: Mesa
Reporter: smoki <smoki00790>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: 0xe2.0x9a.0x9b, vedran
Version: git
Hardware: Other
OS: All
Whiteboard:
Attachments: Likely fix - RW_BUFFERS pointer is not written for LS stage
apitrace

Description smoki 2016-07-22 04:09:26 UTC
One more regression I forgot to file a bug about: The Talos Principle and Serious Sam 3 produce GPU faults. A lot of these show up right after starting either game, and they keep coming...

[  823.723639] radeon 0000:00:01.0: GPU fault detected: 146 0x0006200c
[  823.723649] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  823.723652] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0602000C

A Mesa bisect points to commit 860b658b97f859ee7d0dd076a8ac0332601ffa65:
 
 radeonsi: move clip plane constant buffer to RW buffers
Comment 1 Vedran Miletić 2016-07-22 08:30:09 UTC
Which GPU? Don't remember seeing those on Tonga.
Comment 2 smoki 2016-07-23 01:24:44 UTC
The test is on a Kabini APU (I have Bonaire and Kaveri, so I might test those too), but someone on IRC already mentioned it happens with amdgpu on Bonaire... so yes, if it does not happen with Tonga, it might be CIK-related.
Comment 3 Nicolai Hähnle 2016-07-23 14:18:52 UTC
Created attachment 125277 [details] [review]
Likely fix - RW_BUFFERS pointer is not written for LS stage

Could you please try whether the attached patch fixes the problem for you?
Comment 4 smoki 2016-07-23 22:28:07 UTC
No, the same faults still happen.
Comment 5 smoki 2016-07-26 18:15:48 UTC
 This works before 860b658b97f859ee7d0dd076a8ac0332601ffa65

 So fixed.
Comment 6 Nicolai Hähnle 2016-07-27 06:14:29 UTC
I'm confused. You first wrote that 860b658b97f859ee7d0dd076a8ac0332601ffa65 is the commit which started the faults. Which one is it? Is Mesa master okay?
Comment 7 smoki 2016-07-29 09:42:31 UTC
 
I have no idea why I closed this... please ignore Comment 4

Yes Nicolai, the bug is still there with current git of Mesa and LLVM.
Comment 8 Jan Ziak (http://atom-symbol.net) 2016-08-16 18:13:45 UTC
(In reply to smoki from comment #2)
> The test is on a Kabini APU (I have Bonaire and Kaveri, so I might test
> those too), but someone on IRC already mentioned it happens with amdgpu on
> Bonaire... so yes, if it does not happen with Tonga, it might be CIK-related.

I have a Kaveri iGPU and The Talos Principle. I can try to run The Talos Principle on Kaveri later today (the machine needs to be rebooted to enable the iGPU).

Were you able to reproduce the issue on your Kaveri iGPU, or is it just reproducible on Kabini?
Comment 9 Jan Ziak (http://atom-symbol.net) 2016-08-17 08:16:22 UTC
(In reply to Jan Ziak from comment #8)
> (In reply to smoki from comment #2)
> > The test is on a Kabini APU (I have Bonaire and Kaveri, so I might test
> > those too), but someone on IRC already mentioned it happens with amdgpu on
> > Bonaire... so yes, if it does not happen with Tonga, it might be CIK-related.
> 
> I have a Kaveri iGPU and The Talos Principle. I can try to run The Talos
> Principle on Kaveri later today (the machine needs to be rebooted to enable
> the iGPU).
> 
> Were you able to reproduce the issue on your Kaveri iGPU, or is it just
> reproducible on Kabini?

"Talos Principle [publicbeta]" runs fine on my machine's Kaveri iGPU

Mesa 12.1.0-devel (git-e988999)
Linux 4.8.0-rc2, amdgpu.ko
Comment 10 smoki 2016-08-17 10:58:12 UTC
Hmmm, do you have any GPU faults with it? Check dmesg.

That is what this bug is about: those two games run, but they produce constant GPU faults.
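
A minimal sketch of the dmesg check, assuming the fault lines look like the ones quoted in this report. The function name `count_gpu_faults` and the regular expression are illustrative only and are not part of any driver tooling; the input is whatever text `dmesg` printed.

```python
import re

# Matches lines such as:
#   [  823.723639] radeon 0000:00:01.0: GPU fault detected: 146 0x0006200c
# The pattern is an assumption based on the log excerpts in this report.
FAULT_RE = re.compile(
    r"\b(radeon|amdgpu) \S+ GPU fault detected: (\d+) (0x[0-9a-fA-F]+)"
)


def count_gpu_faults(dmesg_text):
    """Return a dict mapping kernel driver name to the number of fault lines."""
    counts = {}
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            driver = match.group(1)
            counts[driver] = counts.get(driver, 0) + 1
    return counts
```

Feeding it the saved output of `dmesg` before and after a game session gives a rough idea of how many faults the session produced.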
Comment 11 Jan Ziak (http://atom-symbol.net) 2016-08-17 11:30:28 UTC
(In reply to smoki from comment #10)
> Hmmm, do you have any GPU faults with it? Check dmesg.
> 
> That is what this bug is about: those two games run, but they produce
> constant GPU faults.

You are right:

[18409.862642] VM fault (0x02, vmid 5) at page 0, read from 'TC4' (0x54433400) (136)
[18409.917406] amdgpu 0000:01:00.0: GPU fault detected: 147 0x000a8802
[18409.917409] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[18409.917410] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088002

The GPU is R9 390.
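
The status word in the log above can be split into fields. The sketch below assumes a CIK-style bit layout (protections in bits 0-3, memory client ID in bits 12-19, read/write flag in bit 24, VMID in bits 25-28). That layout is an assumption, not something stated in this report, although it does reproduce the values printed in the `VM fault` line above (0x02, vmid 5, read, client 136). The function name `decode_fault_status` is hypothetical.

```python
def decode_fault_status(status):
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields.

    The bit positions are an assumption (see the text above); they are
    not taken from this bug report.
    """
    return {
        "protections": status & 0xF,
        "memory_client_id": (status >> 12) & 0xFF,
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


# Status value from the amdgpu log lines above.
print(decode_fault_status(0x0A088002))
# → {'protections': 2, 'memory_client_id': 136, 'access': 'read', 'vmid': 5}
```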
Comment 12 smoki 2016-08-17 11:37:22 UTC
 
Exactly, I have those faults with Kabini, Kaveri and Bonaire using radeon. And you have them with Kaveri and Hawaii (Grenada) using amdgpu.

So for now, only GCN 1.1 is reported to be affected, regardless of the kernel driver used.
Comment 13 Jan Ziak (http://atom-symbol.net) 2016-08-17 12:03:14 UTC
Created attachment 125846 [details]
apitrace

The last glDrawArrays in the trace prints a VM fault to dmesg.
Comment 14 Marek Olšák 2016-08-18 12:17:30 UTC
Thanks. I know where the problem is and I'm working on it.

Just FYI, the VM fault is completely harmless.
Comment 15 smoki 2016-08-18 13:59:42 UTC
It is not harmless: on a lower-end machine (an APU) you lose something like 20% of performance to these constant, not-so-harmless messages, on top of already low perf.

Also, I observed one random GPU lockup when I kept playing Serious Sam 3 with it, but let's just say that is a separate issue for now.
Comment 16 Jan Ziak (http://atom-symbol.net) 2016-08-18 14:13:43 UTC
(In reply to smoki from comment #15)
> It is not harmless: on a lower-end machine (an APU) you lose something like
> 20% of performance to these constant, not-so-harmless messages, on top of
> already low perf.

Of course, I (and probably also you) will do a performance measurement after Marek's patch is available.

(There is a lot of work to be done in Mesa to make it perform better in general and approach Nvidia's Linux performance. Considering the amount of work required, I do not expect it to materialize this year.)
Comment 17 smoki 2016-08-18 14:37:59 UTC
I already did the measurement a month ago, when I bisected this.

I would like to have a 2-core Temash APU to show an even bigger issue, which I think is highly recommended for perf measurements, but Kabini is the slowest I have :D If I didn't care I wouldn't be here; otherwise I would run a Pascal Titan X with the blob and be as ignorant as I can.

But no, I am not, not me. Here I am actually not pushing for higher perf at all, but the opposite: just for things not to regress this much... so it is a small regression-testing contribution from me.
Comment 18 Marek Olšák 2016-08-24 14:04:08 UTC
Fixed by 2c13abb49137d0f81b530b3c67f1ed79c58c796e. Closing.
