Bug 69328

Summary:	Recoverable and unrecoverable lockups with opencl-example on trinity APU
Product:	Mesa	Reporter:	slicksam
Component:	Drivers/DRI/R600	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	normal
Priority:	medium
Version:	git
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	Don't set DB_DEST or CB_DEST* bit on cp_coher_cntl

Description slicksam 2013-09-13 17:10:25 UTC

Software in use:

Up to date mesa, llvm, clang, libclc, firmware, as of 20130910.  Gentoo's Linux 3.11. opencl-example-12905ac620b83713b07ece763ff3c36fb3c2e7e5.

Hardware in use:  AMD A8 5600K APU (Radeon HD 7560D, Aruba), 32GB system RAM, Biostar Hi-Fi A85W motherboard.

Steps to reproduce: 

Run hello_world program from opencl-example.  First run works correctly.  Second run either completely locks up the machine or locks up the GPU (which recovers after a short time).  Same behavior for other tests - the first test completes and the second causes problems.


This is what the second opencl-example run looks like:

localhost opencl-example-12905ac620b83713b07ece763ff3c36fb3c2e7e5 # ./hello_world 
There are 1 platforms.
There are 1 GPU devices.
clCreateContext() succeeded.
clCreateCommandQueue() succeeded.
clCreateProgramWithSource() suceeded.
clBuildProgram() suceeded.
clCreateKernel() suceeded.
clCreateBuffer() succeeded.
clSetKernelArg() succeeded.
clEnqueueNDRangeKernel() suceeded.
((( 10 second hang here, or forever if the machine is toast )))
clEnqueueReadBuffer() suceeded.
pi = 3.141590


And, here is the dmesg output from the recoverable lockups:

[ 1365.806285] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
[ 1365.806292] radeon 0000:00:01.0: GPU lockup (waiting for 0x0000000000007ec3 last fence id 0x0000000000007ec2)
[ 1365.821261] radeon 0000:00:01.0: Saved 559 dwords of commands on ring 0.
[ 1365.821293] radeon 0000:00:01.0: GPU softreset: 0x00000008
[ 1365.821297] radeon 0000:00:01.0:   GRBM_STATUS               = 0xB0003828
[ 1365.821300] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
[ 1365.821304] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
[ 1365.821307] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
[ 1365.821332] radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
[ 1365.821335] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 1365.821338] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
[ 1365.821341] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00010002
[ 1365.821344] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80220243
[ 1365.821347] radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 1365.821350] radeon 0000:00:01.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[ 1365.821354] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_ADDR   0x00000000
[ 1365.821357] radeon 0000:00:01.0:   VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[ 1365.821360] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1365.821363] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1365.827029] radeon 0000:00:01.0: GRBM_SOFT_RESET=0x00004001
[ 1365.827083] radeon 0000:00:01.0: SRBM_SOFT_RESET=0x00000100
[ 1365.828237] radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828
[ 1365.828240] radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
[ 1365.828243] radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
[ 1365.828246] radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
[ 1365.828271] radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
[ 1365.828274] radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 1365.828277] radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 1365.828280] radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 1365.828283] radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000
[ 1365.828286] radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 1365.828289] radeon 0000:00:01.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[ 1365.828317] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
[ 1365.843638] [drm] PCIE GART of 512M enabled (table at 0x0000000000276000).
[ 1365.843775] radeon 0000:00:01.0: WB enabled
[ 1365.843781] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff8807dea6bc00
[ 1365.844520] radeon 0000:00:01.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900057b5a18
[ 1365.844524] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000020000c04 and cpu addr 0xffff8807dea6bc04
[ 1365.844528] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000020000c08 and cpu addr 0xffff8807dea6bc08
[ 1365.844532] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffff8807dea6bc0c
[ 1365.844536] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000020000c10 and cpu addr 0xffff8807dea6bc10
[ 1365.863416] [drm] ring test on 0 succeeded in 2 usecs
[ 1365.863477] [drm] ring test on 3 succeeded in 2 usecs
[ 1365.863485] [drm] ring test on 4 succeeded in 1 usecs
[ 1365.909450] [drm] ring test on 5 succeeded in 1 usecs
[ 1365.909454] [drm] UVD initialized successfully.
[ 1365.936524] [drm] ib test on ring 0 succeeded in 0 usecs
[ 1365.937056] [drm] ib test on ring 3 succeeded in 0 usecs
[ 1365.937581] [drm] ib test on ring 4 succeeded in 0 usecs
[ 1365.958519] [drm] ib test on ring 5 succeeded

Comment 1 Alex Deucher 2013-09-13 17:17:37 UTC

Is this a regression?  If so, can you bisect?

Comment 2 Tom Stellard 2013-09-13 23:29:26 UTC

Created attachment 85794 [details] [review]
Don't set DB_DEST or CB_DEST* bit on cp_coher_cntl

This patch fixes the hangs for me and all the run_test.sh tests pass.  However, this is just a hack and not a proper solution.  Can you test this patch?

Comment 3 slicksam 2013-09-15 06:09:44 UTC

The patch seems to make it work!

rotl tests fail however:

Running ./math-int rotl 1 1 2
Failed
Running ./math-int rotl 1 32 1
Failed
Running ./math-int rotl -1 5 -1
Failed
Running ./math-int rotl 4096 23 8
Failed

Comment 4 Marek Olšák 2013-09-15 10:14:19 UTC

The component Drivers/DRI/R600 doesn't exist in Mesa.

Comment 5 slicksam 2013-09-15 18:00:41 UTC

@Marek Olšák

Sorry, should be gallium/r600 right?

Comment 6 Tom Stellard 2013-09-15 22:50:43 UTC

(In reply to comment #3)
> The patch seems to make it work!
> 
> rotl tests fail however:
> 
> Running ./math-int rotl 1 1 2
> Failed
> Running ./math-int rotl 1 32 1
> Failed
> Running ./math-int rotl -1 5 -1
> Failed
> Running ./math-int rotl 4096 23 8
> Failed

Do you the latest code from my opencl-example repo?  Older versions were missing the rotl.cl file.

Comment 7 Alex Deucher 2013-09-17 13:39:17 UTC

I wonder if this may also be related to bug 69321.  Does reverting f0435ebb07d01a77ca0d98967a002898811a5206 also help?

Comment 8 slicksam 2013-09-17 16:04:47 UTC

> Do you the latest code from my opencl-example repo?  Older versions were missing the rotl.cl file

Yes, that was it.  It passes all tests here now!

Comment 9 slicksam 2013-09-17 16:09:42 UTC

@Alex Deucher:

I'm not that familiar with git, can you give me the command to do that, starting with the checked out branch-master gentoo grabs to the build directory?  It keeps .git there at least.

Comment 10 Alex Deucher 2013-09-17 16:16:10 UTC

git revert f0435ebb07d01a77ca0d98967a002898811a5206

Comment 11 slicksam 2013-09-17 16:43:04 UTC

@Alex Deucher:

Reverting f0435ebb07d01a77ca0d98967a002898811a5206 also makes it work.

Comment 12 Alex Deucher 2013-09-17 16:50:19 UTC


*** This bug has been marked as a duplicate of bug 69321 ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.