Bug 91865

Summary: [r600g] GPU hang in 'gsraytrace' - NI/Turks (6670)
Product: Mesa
Reporter: Dieter Nützel <Dieter>
Component: Drivers/Gallium/r600
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: critical
Priority: medium
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=93706
Attachments: dmesg-4.2.0-1.gefc468a-desktop.log
gsraytrace_31672_00000000
dmesg-4.3.0-17.g6a48ac7-default.log
glretrace_2862_00000000
apitrace gsraytrace.trace.xz

Description Dieter Nützel 2015-09-03 13:07:17 UTC

    
Comment 1 Dieter Nützel 2015-09-03 13:15:53 UTC
The hang occurs even with R600_DEBUG=nosb:

mesa-demos/glsl> ./gsraytrace 
Gallium debugger active. The hang detection timout is 1000 ms.
ATTENTION: default value of option vblank_mode overridden by environment.
GL_RENDERER = Gallium 0.4 on AMD TURKS (DRM 2.43.0, LLVM 3.8.0)

ESC                 = exit demo
left mouse + drag   = rotate camera

dd: GPU hang detected!
dd: Aborting the process...
Abort

OpenGL renderer string: Gallium 0.4 on AMD TURKS (DRM 2.43.0, LLVM 3.8.0)
OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.1.0-devel (git-6e37304)

I'll attach
dmesg-4.2.0-1.gefc468a-desktop.log
gsraytrace_31672_00000000 (Marek's GREAT Gallium debugger log)

[ 5676.604919] radeon 0000:01:00.0: ring 0 stalled for more than 31033msec
[ 5676.604988] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000a9cea last fence id 0x00000000000a9d18 on ring 0)
[ 5676.647958] radeon 0000:01:00.0: Saved 1479 dwords of commands on ring 0.
[ 5676.647983] radeon 0000:01:00.0: GPU softreset: 0x0000000D
[ 5676.647986] radeon 0000:01:00.0:   GRBM_STATUS               = 0xF7631028
[ 5676.647988] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xF8000002
[ 5676.647990] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[ 5676.647992] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 5676.647994] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[ 5676.647996] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 5676.647998] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x400C0000
[ 5676.648000] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00048002
[ 5676.648002] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80268647
[ 5676.648004] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
[ 5676.664891] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[ 5676.664945] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100100
[ 5676.666105] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[ 5676.666107] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[ 5676.666109] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[ 5676.666111] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 5676.666113] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[ 5676.666115] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 5676.666117] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 5676.666119] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 5676.666121] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[ 5676.666124] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 5676.666158] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 5676.689479] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[ 5676.692387] [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
[ 5676.692506] radeon 0000:01:00.0: WB enabled
[ 5676.692509] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff88003723cc00
[ 5676.692510] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff88003723cc0c
[ 5676.694278] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc90003432118
[ 5676.710865] [drm] ring test on 0 succeeded in 2 usecs
[ 5676.710877] [drm] ring test on 3 succeeded in 7 usecs
[ 5676.887988] [drm] ring test on 5 succeeded in 2 usecs
[ 5676.887997] [drm] UVD initialized successfully.
[ 5676.935630] [drm] ib test on ring 0 succeeded in 0 usecs
[ 5676.935687] [drm] ib test on ring 3 succeeded in 0 usecs
[ 5677.587054] [drm] ib test on ring 5 succeeded
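
The "GPU lockup" line above carries two fence ids, and the gap between them gives a rough idea of how much work was still queued when the ring stalled. The following is only a minimal sketch, not part of the original report: it assumes that "current fence id" is the last fence the GPU signalled and "last fence id" is the last one the driver emitted, and the names `LOCKUP_RE` and `pending_fences` are made up for illustration.

```python
import re

# Matches the radeon kernel message that reports a lockup with both fence ids.
# Assumption: the message format is exactly as pasted in this report.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def pending_fences(dmesg_text):
    """Return a list of (ring, outstanding fence count), one per lockup line."""
    results = []
    for match in LOCKUP_RE.finditer(dmesg_text):
        current = int(match.group(1), 16)  # assumed: last signalled fence
        last = int(match.group(2), 16)     # assumed: last emitted fence
        results.append((int(match.group(3)), last - current))
    return results


# The lockup line from the dmesg excerpt in this comment.
SAMPLE = (
    "[ 5676.604988] radeon 0000:01:00.0: GPU lockup (current fence id "
    "0x00000000000a9cea last fence id 0x00000000000a9d18 on ring 0)\n"
)

if __name__ == "__main__":
    for ring, count in pending_fences(SAMPLE):
        print(f"ring {ring}: {count} fences outstanding")
```

Feeding it a whole saved dmesg log instead of `SAMPLE` would list every lockup in that log, which makes it easier to compare hangs across kernel versions.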
Comment 2 Dieter Nützel 2015-09-03 13:16:37 UTC
Created attachment 118067
dmesg-4.2.0-1.gefc468a-desktop.log
Comment 3 Dieter Nützel 2015-09-03 13:17:11 UTC
Created attachment 118068
gsraytrace_31672_00000000
Comment 4 Dieter Nützel 2015-09-03 13:46:09 UTC
dmesg snippet from another hang:

[ 1361.853214] radeon 0000:01:00.0: ring 0 stalled for more than 10099msec
[ 1361.853222] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000b21b6 last fence id 0x00000000000b21ea on ring 0)
[ 1361.873984] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[ 1361.874010] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-35).
[ 1361.921903] radeon 0000:01:00.0: GPU softreset: 0x00000009
[ 1361.921906] radeon 0000:01:00.0:   GRBM_STATUS               = 0xB2737828
[ 1361.921909] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x1E000007
[ 1361.921911] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[ 1361.921913] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 1361.921915] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[ 1361.921917] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 1361.921919] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x400C0000
[ 1361.921921] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00048006
[ 1361.921923] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80268647
[ 1361.921925] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 1361.922215] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[ 1361.922270] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 1361.923429] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[ 1361.923431] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[ 1361.923433] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[ 1361.923435] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 1361.923437] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[ 1361.923439] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 1361.923441] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 1361.923443] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 1361.923445] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[ 1361.923448] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 1361.923480] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 1361.946777] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[ 1361.949613] [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
[ 1361.949732] radeon 0000:01:00.0: WB enabled
[ 1361.949734] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff88036f42ac00
[ 1361.949736] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff88036f42ac0c
[ 1361.951504] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc90003432118
[ 1361.968087] [drm] ring test on 0 succeeded in 2 usecs
[ 1361.968099] [drm] ring test on 3 succeeded in 7 usecs
[ 1362.145206] [drm] ring test on 5 succeeded in 2 usecs
[ 1362.145218] [drm] UVD initialized successfully.
[ 1362.165411] [drm] ib test on ring 0 succeeded in 0 usecs
[ 1362.165468] [drm] ib test on ring 3 succeeded in 0 usecs
[ 1362.817284] [drm] ib test on ring 5 succeeded
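
Each soft reset prints the status registers twice, once before and once after the reset, so a quick diff shows which blocks were still busy at the time of the hang. The following is a rough sketch under stated assumptions, not something taken from the driver: it treats the `GRBM_SOFT_RESET=` line as the boundary between the two dumps, and the helper names `split_dumps` and `changed_registers` are hypothetical.

```python
import re

# Matches register lines such as
# "radeon 0000:01:00.0:   GRBM_STATUS               = 0xB2737828".
REG_RE = re.compile(r"radeon [0-9a-f:.]+:\s+(\w+)\s+= (0x[0-9A-Fa-f]+)")


def split_dumps(dmesg_text):
    """Return (before, after) dicts mapping register name to value.

    Assumption: everything printed before the GRBM_SOFT_RESET= line is the
    pre-reset dump and everything after it is the post-reset dump.
    """
    before, after = {}, {}
    target = before
    for line in dmesg_text.splitlines():
        if "GRBM_SOFT_RESET=" in line:
            target = after
            continue
        match = REG_RE.search(line)
        if match:
            target[match.group(1)] = int(match.group(2), 16)
    return before, after


def changed_registers(before, after):
    """Return {name: (old, new)} for registers whose value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# A few lines taken from the dmesg excerpt in this comment.
SAMPLE = """\
[ 1361.921906] radeon 0000:01:00.0:   GRBM_STATUS               = 0xB2737828
[ 1361.921913] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 1361.921923] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80268647
[ 1361.922215] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[ 1361.923429] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[ 1361.923435] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[ 1361.923445] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
"""

if __name__ == "__main__":
    before, after = split_dumps(SAMPLE)
    for name, (old, new) in changed_registers(before, after).items():
        print(f"{name}: 0x{old:08X} -> 0x{new:08X}")
```

Registers that keep the same value across the reset (SRBM_STATUS in the sample) drop out of the result, leaving only the ones the reset actually cleared.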
Comment 5 Barto 2015-10-08 16:25:57 UTC
Same bug with an AMD Radeon HD 4650 (PCIe) and the r600 driver on 64-bit Arch Linux:

OpenGL renderer string: Gallium 0.4 on AMD RV730 (DRM 2.43.0, LLVM 3.6.2)
OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.0.2
OpenGL core profile shading language version string: 3.30
OpenGL version string: 3.0 Mesa 11.0.2
OpenGL shading language version string: 1.30
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 11.0.2
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
Comment 6 Dieter Nützel 2015-11-27 03:19:07 UTC
Created attachment 120157
dmesg-4.3.0-17.g6a48ac7-default.log

New log files, plus an apitrace.

I hope this helps us get further.
Comment 7 Dieter Nützel 2015-11-27 03:19:36 UTC
Created attachment 120158
glretrace_2862_00000000
Comment 8 Dieter Nützel 2015-11-27 03:20:57 UTC
Created attachment 120159
apitrace gsraytrace.trace.xz
Comment 9 Dieter Nützel 2015-12-18 04:30:23 UTC
Hurray!

What an additional gift beside your impending new arrival, Dave ;-)

This is finally fixed by the following commit:

commit 2239f3eaff5c72c4cb1d4a5be97feb4af3d08d25
Author: Dave Airlie <airlied@redhat.com>
Date:   Mon Nov 30 15:48:22 2015 +1000

    r600/shader: emit tessellation factors to GDS at end of TCS.
    
    When we are finished the shader, we read back all the tess factors
    from LDS and write them to special global memory storage using
    GDS instructions.
    
    This also handles adding NOP when GDS or ENDLOOP end the TCS.
    
    Signed-off-by: Dave Airlie <airlied@redhat.com>

:040000 040000 101f51186ea311e90fa8423ee772f2b1076737bf b01929ff47ca5035660b2c84ca2fdeb6604549fa M  src

Tomorrow I'll test this on RV730 (AGP) too, and CLOSE both bugs if all goes smoothly.

(For the rendering issues with R600_DEBUG=nosb I'll open a new ticket.)

Merry Christmas!
And all the best for your family.
Comment 10 Dieter Nützel 2016-01-14 17:01:03 UTC
OK, was this SOLVED by 'accident'?

For the RV730 GPU hang (Bug 83319), the commit identified above does NOT solve the hang.
(That bug has now been updated.)

The issues observed with R600_DEBUG=nosb in all three 'raytrace' variants
(vsraytrace/fsraytrace/gsraytrace) remain.
Latest version tested: Mesa 11.2.0-devel (git-6470435)

I'll open a new ticket for this.
