Bug 79659 - R9 270X lockup with unigine valley since radeonsi: enable ARB_sample_shading
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2014-06-04 22:58 UTC by Andy Furniss
Modified: 2014-11-11 02:41 UTC (History)

Description Andy Furniss 2014-06-04 22:58:51 UTC
R9 270X since -

commit f98a7d89be5d307c7a80fbde028a610f4377c3b9
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Wed May 7 13:15:41 2014 +0200

    radeonsi: enable ARB_sample_shading

unigine valley run like -

vblank_mode=0 MESA_GLSL_VERSION_OVERRIDE=330 MESA_GL_VERSION_OVERRIDE=3.3 ./valley

will GPU lock, then hard lock if I don't SysRq S-U-B quickly enough.

Jun  4 21:59:31 ph4 kernel: radeon 0000:01:00.0: ring 3 stalled for more than 10003msec
Jun  4 21:59:31 ph4 kernel: radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000000d8a9 last fence id 0x000000000000d8a7 on ring 3)
Jun  4 21:59:31 ph4 kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0: Saved 1677 dwords of commands on ring 0.
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0: GPU softreset: 0x0000004D
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0xF7D20028
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xEC400000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0xEDC00000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200400C0
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008006
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun  4 21:59:32 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100100
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200400C0
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Jun  4 21:59:33 ph4 kernel: [drm] probing gen 2 caps for device 1022:9603 = 300d02/0
Jun  4 21:59:33 ph4 kernel: [drm] PCIE gen 2 link speeds already enabled
Jun  4 21:59:33 ph4 kernel: [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: WB enabled
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8800cc194c00
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8800cc194c04
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8800cc194c08
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8800cc194c0c
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8800cc194c10
Jun  4 21:59:33 ph4 kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900105b5a18
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 0 succeeded in 3 usecs
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 1 succeeded in 1 usecs
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 2 succeeded in 1 usecs
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 3 succeeded in 2 usecs
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 4 succeeded in 1 usecs
Jun  4 21:59:33 ph4 kernel: [drm] ring test on 5 succeeded in 2 usecs
Jun  4 21:59:33 ph4 kernel: [drm] UVD initialized successfully.
Jun  4 21:59:43 ph4 kernel: radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
Jun  4 21:59:43 ph4 kernel: radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000002dc5a last fence id 0x000000000002dc3f on ring 0)
Jun  4 21:59:43 ph4 kernel: [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
Jun  4 21:59:43 ph4 kernel: [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
Jun  4 21:59:43 ph4 kernel: radeon 0000:01:00.0: ib ring test failed (-35).
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0: GPU softreset: 0x00000048
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200400C0
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84010243
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun  4 21:59:44 ph4 kernel: SysRq : Emergency Sync
Jun  4 21:59:44 ph4 kernel: Emergency Sync complete
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200400C0
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun  4 21:59:44 ph4 kernel: radeon 0000:01:00.0: GPU reset succeeded, trying to resume
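
For anyone comparing logs across cards, the lockup lines above can be picked out mechanically. The following is a minimal sketch, assuming the message format is exactly as it appears in this log; the function name `find_lockups` and the `sample` string are illustrative only and not part of any radeon or Mesa tooling.

```python
import re

# Matches the radeon kernel driver's lockup message as it appears in the log above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def find_lockups(log_text):
    """Return a (ring, waiting_fence, last_fence) tuple for each lockup line."""
    events = []
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            waiting, last, ring = match.groups()
            events.append((int(ring), int(waiting, 16), int(last, 16)))
    return events


# Two lines copied from the log in this report.
sample = (
    "Jun  4 21:59:31 ph4 kernel: radeon 0000:01:00.0: GPU lockup "
    "(waiting for 0x000000000000d8a9 last fence id 0x000000000000d8a7 on ring 3)\n"
    "Jun  4 21:59:43 ph4 kernel: radeon 0000:01:00.0: GPU lockup "
    "(waiting for 0x000000000002dc5a last fence id 0x000000000002dc3f on ring 0)\n"
)

for ring, waiting, last in find_lockups(sample):
    print(f"ring {ring}: waiting for {waiting:#x}, last fence {last:#x}")
```

In this log the first lockup is reported on ring 3 and the second, after the reset, on ring 0.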
Comment 1 Michel Dänzer 2014-06-05 07:35:52 UTC
Looks like it's failing to compile some (all?) fragment shaders:

GLShader::loadFragment(): error in "core/shaders/default/sky/fragment_volume_ambient.shader" file
defines: UNKNOWN,QUALITY_LOW,QUALITY_MEDIUM,QUALITY_HIGH,MULTISAMPLE_0,USE_INSTANCING,USE_GEOMETRY_SHADER,USE_TEXTURE_3D,USE_TEXTURE_ARRAY,USE_ALPHA_FADE,USE_REFLECTION,USE_OCCLUSION,HAS_DEFERRED_COLOR,HAS_DEFERRED_NORMAL,USE_RGB10A2,USE_ENVIRONMENT,USE_NORMALIZATION,USE_DIRECTIONAL_LIGHTMAPS,USE_SHADOW_KERNEL,OPENGL,HAS_ARB_DRAW_INSTANCED,HAS_ARB_TEXTURE_SNORM,SHADING_LANGUAGE=330,USE_ARB_BLEND_FUNC_EXTENDED,USE_ARB_SHADER_BIT_ENCODING,USE_ARB_SAMPLE_SHADING,,TURBULENCE
0:170(1): error: syntax error, unexpected EXTENSION, expecting $end

... and so on.
Comment 2 José Suárez 2014-06-05 18:50:31 UTC
I have been experiencing the same problem with both Unigine Heaven and Unigine Valley since the 2 June git version. I had not been able to identify the commit causing the problem, but mesa 10.3 git of 28 May works OK while git of 2 June (and later) does not, so I presume it is caused by one of the radeonsi commits applied on 2 June. In particular, the 'radeonsi: enable ARB_sample_shading' commit was applied on 2 June, so I presume Andy Furniss' guess is correct (not sure if he had bisected).

I'm also getting the "UNKNOWN,QUALITY_LOW,QUALITY_MEDIUM,QUALITY_HIGH,MULTISAMPLE_0,USE_INSTANCING,USE_GEOMETRY_SHADER,USE_TEXTURE_3D,USE_TEXTURE_ARRAY,USE_ALPHA_FADE,USE_REFLECTION,USE_OCCLUSION,HAS_DEFERRED_COLOR,HAS_DEFERRED_NORMAL,USE_RGB10A2,USE_ENVIRONMENT,USE_NORMALIZATION,USE_DIRECTIONAL_LIGHTMAPS,USE_SHADOW_KERNEL,OPENGL,HAS_ARB_DRAW_INSTANCED,HAS_ARB_TEXTURE_SNORM,SHADING_LANGUAGE=330,USE_ARB_BLEND_FUNC_EXTENDED,USE_ARB_SHADER_BIT_ENCODING,USE_ARB_SAMPLE_SHADING,,TURBULENCE
0:170(1): error: syntax error, unexpected EXTENSION, expecting $end" kind of log on the console from which I launch heaven or valley.

I'm using a Radeon HD 7870.
Comment 3 Marek Olšák 2014-06-05 22:52:48 UTC
What happens if you set this environment variable?

force_glsl_extensions_warn=true
Comment 4 Michel Dänzer 2014-06-06 03:03:28 UTC
(In reply to comment #3)
> force_glsl_extensions_warn=true

That's enabled by default for Heaven in /etc/drirc, but I just tried setting it explicitly just in case. Doesn't help.
Comment 5 Marek Olšák 2014-06-11 00:05:18 UTC
The problem is Unigine don't know how to use GLSL, again.

There is "#extension GL_ARB_sample_shading : enable" in the middle of (all?) shaders. This is not allowed by any GLSL specification. All #extension directives must occur before any non-preprocessor tokens, which pretty much means "at the beginning of shader code".
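
The rule above can be checked mechanically on a shader source string. This is a minimal sketch, not how Mesa's GLSL preprocessor works: it only strips comments and treats any line starting with `#` as a preprocessor line, so it ignores line continuations and conditional blocks. The function name `misplaced_extension_lines` and the two sample shaders are made up for illustration.

```python
import re

# GLSL comments: "// ..." to end of line, or "/* ... */" possibly spanning lines.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def _blank_comment(match):
    # Keep the newlines inside a comment so reported line numbers stay correct.
    return "\n" * match.group(0).count("\n")


def misplaced_extension_lines(source):
    """Return 1-based line numbers of #extension directives that follow real code."""
    stripped = COMMENT_RE.sub(_blank_comment, source)
    seen_code = False
    misplaced = []
    for number, line in enumerate(stripped.splitlines(), start=1):
        text = line.strip()
        if not text:
            continue
        if text.startswith("#"):
            if seen_code and re.match(r"#\s*extension\b", text):
                misplaced.append(number)
        else:
            # First non-preprocessor token: later #extension lines are invalid.
            seen_code = True
    return misplaced


good = (
    "#version 330\n"
    "#extension GL_ARB_sample_shading : enable\n"
    "void main() {}\n"
)
bad = (
    "#version 330\n"
    "uniform float x;\n"
    "#extension GL_ARB_sample_shading : enable\n"
    "void main() {}\n"
)

print(misplaced_extension_lines(good))  # prints []
print(misplaced_extension_lines(bad))   # prints [3]
```

The second sample has the same shape as the failing Unigine shaders: a declaration appears before the `#extension` line, which is what triggers the "unexpected EXTENSION" syntax error.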

What I see: Valley is loading. Then there is a hang and it recovers successfully. After that, Valley seems to have exited. That's all.
Comment 6 Matt Turner 2014-06-11 02:20:24 UTC
If you only want to run the application and don't care about a fix, you can run with

MESA_EXTENSION_OVERRIDE=-GL_ARB_sample_shading

We should implement a driconf workaround for this.
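
If the benchmark is started from a script, the override can be added to the environment without clobbering a value that is already set. This is only a sketch: it assumes `MESA_EXTENSION_OVERRIDE` takes a space-separated list of extension names prefixed with `+` or `-`, and the helper names `env_with_extension_disabled` and `run_valley` are hypothetical.

```python
import os
import subprocess


def env_with_extension_disabled(extension, base_env=None):
    """Return a copy of the environment with `extension` disabled for Mesa."""
    env = dict(os.environ if base_env is None else base_env)
    entries = env.get("MESA_EXTENSION_OVERRIDE", "").split()
    # Drop any earlier entry for the same extension before adding the "-" form.
    entries = [e for e in entries if e.lstrip("+-") != extension]
    entries.append("-" + extension)
    env["MESA_EXTENSION_OVERRIDE"] = " ".join(entries)
    return env


def run_valley(binary="./valley"):
    """Launch the benchmark with GL_ARB_sample_shading hidden from it."""
    env = env_with_extension_disabled("GL_ARB_sample_shading")
    return subprocess.call([binary], env=env)
```

Calling `run_valley()` from the Valley directory is then equivalent to prefixing the command line with the variable shown above.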
Comment 7 Andy Furniss 2014-06-11 09:50:54 UTC
(In reply to comment #6)
> If you only want to run the application and don't care about a fix, you can
> run with
> 
> MESA_EXTENSION_OVERRIDE=-GL_ARB_sample_shading
> 
> We should implement a driconf workaround for this.

Thanks, that works and is also needed for heaven 4.0
Comment 8 Andy Furniss 2014-06-11 09:58:33 UTC
(In reply to comment #5)
> The problem is Unigine don't know how to use GLSL, again.
> 
> There is "#extension GL_ARB_sample_shading : enable" in the middle of (all?)
> shaders. This is not allowed by any GLSL specification. All #extension
> directives must occur before any non-preprocessor tokens, which pretty much
> means "at the beginning of shader code".
> 
> What I see: Valley is loading. Then there is a hang and it recovers
> successfully. After that, Valley seems to have exited. That's all.

It's repeatedly more serious than that for me - maybe because I am fullscreen?

But anyway if I don't sysrq quickly enough when the monitor goes off I am in ext4 bitching about disk errors territory after I hard reset, so no waiting to see if the GPU reset works for me (which it never seems to do on SI - but then I haven't had this card for long).

Heaven 4.0 is also affected, but I don't lock with that - it renders junk but I can quit OK, after that there is a 90% chance my display is mostly trash. fbcon is OK when I quit X, but restarting X will still result in trashed display.
Comment 9 Marek Olšák 2014-07-11 14:49:51 UTC
The hangs are gone if I apply my workaround which fixes the compile failures.
Comment 10 Andy Furniss 2014-07-11 16:21:42 UTC
(In reply to comment #9)
> The hangs are gone if I apply my workaround which fixes the compile failures.

If you mean -

st/mesa, gallium: add a workaround for Unigine Heaven 4.0 and Valley 1.0

I hadn't tried, I assumed they would go in, and now it looks like the stuff in common has moved up a level.

Checking patch src/gallium/state_trackers/dri/common/dri_context.c...
error: src/gallium/state_trackers/dri/common/dri_context.c: No such file or directory
Comment 11 Ed Tomlinson 2014-07-13 18:05:11 UTC
On an R7 260X this bug leads to a dead system and a reboot. From my POV it's fine if the demo fails, but it's NOT fine if it brings down my box...
Comment 12 Andy Furniss 2014-07-18 13:12:46 UTC
(In reply to comment #9)
> The hangs are gone if I apply my workaround which fixes the compile failures.

Working for me now the workaround is in.

One nit WRT drirc, I don't know what the expected behavior is, but Mesa doesn't use the configured/installed location.

I configure with --prefix=/usr and so get the default --sysconfdir=PREFIX/etc, which means drirc ends up in /usr/etc/ but is not read from there by the same Mesa. Is that expected?
Comment 13 Marek Olšák 2014-07-18 21:59:48 UTC
(In reply to comment #12)
> (In reply to comment #9)
> > The hangs are gone if I apply my workaround which fixes the compile failures.
> 
> Working for me now the workaround is in.
> 
> One nit WRT drirc, I don't know what the expected behavior is, but Mesa
> doesn't use the configured/installed location.
> 
> So for me who configures --prefix=/usr and so gets the default
> --sysconfdir=PREFIX/etc drirc ends up in /usr/etc/ but doesn't get read from
> there by the same mesa - is that expected?

This is weird. It should have been installed in /etc.
Comment 14 Alexander Tsoy 2014-10-08 13:18:11 UTC
Interesting.. My Bonaire XTX (R7 260X) is not affected by this bug. How is this possible? Cape Verde PRO (HD 7750) is affected and workaround from comment 6 fixes the problem. On both systems I have mesa-10.3 which contains the commit mentioned in comment 0.
Comment 15 Alexander Tsoy 2014-10-09 06:08:59 UTC
Ah.. "st/mesa,gallium: add a workaround for Unigine Heaven 4.0 and Valley 1.0" is also included in mesa-10.3. So both Heaven 4.0 and Valley 1.0 should just work. The question is why ARB_sample_shading is causing GPU lockup on VERDE. Should I open a separate bug for this issue?
Comment 16 Andy Furniss 2014-10-09 08:35:05 UTC
(In reply to Alexander Tsoy from comment #15)
> Ah.. "st/mesa,gallium: add a workaround for Unigine Heaven 4.0 and Valley
> 1.0" is also included in mesa-10.3. So both Heaven 4.0 and Valley 1.0 should
> just work. The question is why ARB_sample_shading is causing GPU lockup on
> VERDE. Should I open a separate bug for this issue?

Maybe check that your /etc/drirc has the workaround, and, if you have a .drirc in your home dir, that it has it too. I haven't tested whether a .drirc under $HOME without the workaround overrides one in /etc that has it.
Comment 17 Alexander Tsoy 2014-10-09 13:25:46 UTC
(In reply to Andy Furniss from comment #16)

drirc was the first thing I checked. I filed a new bug 84836.
Comment 18 Marek Olšák 2014-10-10 16:30:36 UTC
(In reply to Alexander Tsoy from comment #15)
> Ah.. "st/mesa,gallium: add a workaround for Unigine Heaven 4.0 and Valley
> 1.0" is also included in mesa-10.3. So both Heaven 4.0 and Valley 1.0 should
> just work. The question is why ARB_sample_shading is causing GPU lockup on
> VERDE. Should I open a separate bug for this issue?

I can re-test VERDE when I get home.

I haven't investigated why the hw hangs, because it only happens if shader compilation fails and so none of the sample_shading shader stuff makes it to the driver. I think the likely cause is that Unigine attempted to do rendering with a shader that hasn't actually been compiled and things went wrong after that.
Comment 19 Michel Dänzer 2014-11-11 02:41:53 UTC
Resolving per comment #12.

