Bug 103586 - OpenCL/Clover: AMD Turks: corrupt output buffer (depending on dimension order?)
Summary: OpenCL/Clover: AMD Turks: corrupt output buffer (depending on dimension order?)
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600 (show other bugs)
Version: 17.2
Hardware: Other All
: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 99553
  Show dependency treegraph
 
Reported: 2017-11-06 00:20 UTC by Dave Gilbert
Modified: 2018-09-12 05:32 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
foo.ll from debug run (132.57 KB, text/plain)
2017-11-08 17:03 UTC, Dave Gilbert
Details
foo.link-0.asm (292.35 KB, text/plain)
2017-11-08 17:03 UTC, Dave Gilbert
Details
foo.link-0.ll (132.16 KB, text/plain)
2017-11-08 17:04 UTC, Dave Gilbert
Details
annotated asm dump (3.54 KB, text/plain)
2017-11-08 20:58 UTC, Jan Vesely
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dave Gilbert 2017-11-06 00:20:53 UTC
I've got a trivial kernel that draws a sphere in a voxel cube; each voxel should end up as 0 or 1; if I use global id 0 as z, 1 as y, 2 as x  I get corruptions where some voxels have random junk in; if I reverse the order so that global id 0 is x, 1 is y and 2 is z then it's happy.
(Confirmed the code is clean with oclgrind and happy on Intel.

Versions:

Number of devices                                 1
  Device Name                                     AMD TURKS (DRM 2.50.0 / 4.13.0-1-amd64, LLVM 5.0.0)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 17.2.4
  Driver Version                                  17.2.4
  Device OpenCL C Version                         OpenCL C 1.1 

(on debian testing, was on stable, but same behaviour)

01:00.0 0300: 1002:6841
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Thames [Radeon HD 7550M/7570M/7650M] (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company Thames [Radeon HD 7550M/7570M/7650M]
	Flags: bus master, fast devsel, latency 0, IRQ 37
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d4300000 (64-bit, non-prefetchable) [size=128K]
	I/O ports at 4000 [size=256]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: radeon
	Kernel modules: radeon

in an HP Elitebook laptop.

Code that triggers this:
https://github.com/penguin42/opencl-play/commit/c98470685874769e4a59975791459180564b6f6e

build and run with:
g++ -O2 ocl.cpp -lOpenCL && ./a.out 2> z
then check output with:
tr '01' '  ' <z|grep -v '^ *$'|egrep -v 'got_dev|^Z'
which should be empty,

(In some builds I've found I've had to increase the SIZE constant to 256 to trigger it)

Then my commit e89fe62 fixes it with:
diff --git a/sphere.ocl b/sphere.ocl
index b4f23af..c89ecb9 100644
--- a/sphere.ocl
+++ b/sphere.ocl
@@ -1,10 +1,10 @@
 __kernel void hello(__global uint* o) {
-  int z = get_global_id(0);
+  int z = get_global_id(2);
   int y = get_global_id(1);
-  int x = get_global_id(2);
-  int zr = get_global_size(0);
+  int x = get_global_id(0);
+  int zr = get_global_size(2);
   int yr = get_global_size(1);
-  int xr = get_global_size(2);
+  int xr = get_global_size(0);
   float zf = ((float)z - ((float)zr)/2) / (float)zr;
   float yf = ((float)y - ((float)yr)/2) / (float)yr;
   float xf = ((float)x - ((float)xr)/2) / (float)xr;

by just swapping z/x around - which should make no difference given it's a cube.

But....hmm, I've seen it fail in that direction now as well.

The corruptions all seem to be near the maximum x/y/z value - almost like one small chunk in the max corner.

Here's the kernel:
__kernel void hello(__global uint* o) {
  int z = get_global_id(0);
  int y = get_global_id(1);
  int x = get_global_id(2);
  int zr = get_global_size(0);
  int yr = get_global_size(1);
  int xr = get_global_size(2);
  float zf = ((float)z - ((float)zr)/2) / (float)zr;
  float yf = ((float)y - ((float)yr)/2) / (float)yr;
  float xf = ((float)x - ((float)xr)/2) / (float)xr;

  o[z*yr*xr + y*xr + x] = ((zf * zf) + (yf * yf) + (xf * xf)) <  0.25;
}
Comment 1 Jan Vesely 2017-11-08 15:23:09 UTC
can you run using CLOVER_DEBUG=llvm,native CLOVER_DEBUG_FILE=foo and attach both llvm and isa dumps?
Comment 2 Dave Gilbert 2017-11-08 17:03:33 UTC
Created attachment 135311 [details]
foo.ll from debug run
Comment 3 Dave Gilbert 2017-11-08 17:03:52 UTC
Created attachment 135312 [details]
foo.link-0.asm
Comment 4 Dave Gilbert 2017-11-08 17:04:44 UTC
Created attachment 135313 [details]
foo.link-0.ll

That's all 3 of the debug files it produced.
(I wasn't sure which were the llvm and which the isa dumps; I guess the asm is the isa? and the ll's are both llvm dumps?)
Comment 5 Jan Vesely 2017-11-08 20:44:44 UTC
(In reply to Dave Gilbert from comment #4)
> Created attachment 135313 [details]
> foo.link-0.ll
> 
> That's all 3 of the debug files it produced.
> (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> is the isa? and the ll's are both llvm dumps?)

yes. the first .ll is from compilation step, the other one is from linking step.

.ll dump looks correct.
.asm also looks correct.

you can try producing multiple asm dumps for working and non-working runs. But I don't think that the llvm is the culprit here.

Can you try waiting for the kernel execution to complete explicitly before mapping the buffer?
Ideally call clFinish() on line 63.
Comment 6 Dave Gilbert 2017-11-08 20:56:13 UTC
(In reply to Jan Vesely from comment #5)
> (In reply to Dave Gilbert from comment #4)
> > Created attachment 135313 [details]
> > foo.link-0.ll
> > 
> > That's all 3 of the debug files it produced.
> > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > is the isa? and the ll's are both llvm dumps?)
> 
> yes. the first .ll is from compilation step, the other one is from linking
> step.
> 
> .ll dump looks correct.
> .asm also looks correct.
> 
> you can try producing multiple asm dumps for working and non-working runs.
> But I don't think that the llvm is the culprit here.
> 
> Can you try waiting for the kernel execution to complete explicitly before
> mapping the buffer?
> Ideally call clFinish() on line 63.

Since I'm on the C++ binding (probably a mistake) I used:
  queue.finish();

and it seems to be working.

(This also corresponds possibly to what I'm seeing on a more complex kernel; with a more complex kernel I'm seeing on a whole pile of data on the last few Z slices as being bogus suggesting it's not finished).

Dave
Comment 7 Jan Vesely 2017-11-08 20:58:16 UTC
Created attachment 135318 [details]
annotated asm dump
Comment 8 Jan Vesely 2017-11-08 22:25:07 UTC
(In reply to Dave Gilbert from comment #6)
> (In reply to Jan Vesely from comment #5)
> > (In reply to Dave Gilbert from comment #4)
> > > Created attachment 135313 [details]
> > > foo.link-0.ll
> > > 
> > > That's all 3 of the debug files it produced.
> > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > is the isa? and the ll's are both llvm dumps?)
> > 
> > yes. the first .ll is from compilation step, the other one is from linking
> > step.
> > 
> > .ll dump looks correct.
> > .asm also looks correct.
> > 
> > you can try producing multiple asm dumps for working and non-working runs.
> > But I don't think that the llvm is the culprit here.
> > 
> > Can you try waiting for the kernel execution to complete explicitly before
> > mapping the buffer?
> > Ideally call clFinish() on line 63.
> 
> Since I'm on the C++ binding (probably a mistake) I used:
>   queue.finish();
> 
> and it seems to be working.
> 
> (This also corresponds possibly to what I'm seeing on a more complex kernel;
> with a more complex kernel I'm seeing on a whole pile of data on the last
> few Z slices as being bogus suggesting it's not finished).
> 
> Dave

thanks for testing. I see you are using mesa 17.2.

there were few changes to blocking call synchronization that went to mesa 17.3:
02f8ac6b70033a1b240d497c4664c359d2398cc3 (clover: Wrap event::wait_count in a method taking care of the required locking.)
bc4000ee40c78efe1e5e8a6244d4bb55389d8418 (clover: Run the associated action before an event is signalled.)
3a5b69c09ba355c616c274b0c7f5aba3bd21fd54 (clover: Wait for requested operation if blocking flag is set)

which might help address the issue. Can you test mesa 17.3?
Comment 9 Dave Gilbert 2017-11-08 23:07:56 UTC
(In reply to Jan Vesely from comment #8)
> (In reply to Dave Gilbert from comment #6)
> > (In reply to Jan Vesely from comment #5)
> > > (In reply to Dave Gilbert from comment #4)
> > > > Created attachment 135313 [details]
> > > > foo.link-0.ll
> > > > 
> > > > That's all 3 of the debug files it produced.
> > > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > > is the isa? and the ll's are both llvm dumps?)
> > > 
> > > yes. the first .ll is from compilation step, the other one is from linking
> > > step.
> > > 
> > > .ll dump looks correct.
> > > .asm also looks correct.
> > > 
> > > you can try producing multiple asm dumps for working and non-working runs.
> > > But I don't think that the llvm is the culprit here.
> > > 
> > > Can you try waiting for the kernel execution to complete explicitly before
> > > mapping the buffer?
> > > Ideally call clFinish() on line 63.
> > 
> > Since I'm on the C++ binding (probably a mistake) I used:
> >   queue.finish();
> > 
> > and it seems to be working.
> > 
> > (This also corresponds possibly to what I'm seeing on a more complex kernel;
> > with a more complex kernel I'm seeing on a whole pile of data on the last
> > few Z slices as being bogus suggesting it's not finished).
> > 
> > Dave
> 
> thanks for testing. I see you are using mesa 17.2.
> 
> there were few changes to blocking call synchronization that went to mesa
> 17.3:
> 02f8ac6b70033a1b240d497c4664c359d2398cc3 (clover: Wrap event::wait_count in
> a method taking care of the required locking.)
> bc4000ee40c78efe1e5e8a6244d4bb55389d8418 (clover: Run the associated action
> before an event is signalled.)
> 3a5b69c09ba355c616c274b0c7f5aba3bd21fd54 (clover: Wait for requested
> operation if blocking flag is set)
> 
> which might help address the issue. Can you test mesa 17.3?

Yeh, I'll figure out how to get 17.3 built on this box.
Comment 10 Dave Gilbert 2017-11-09 03:13:13 UTC
I believe I'm still seeing this:

dg@hath:~/ocl2$ clinfo 
Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 17.4.0-devel (git-a16dc04ad5)
....
dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
/home/dg/mesa/try/lib:

so I *think* it's using my build.

and I believe I'm still seeing it.
Is my test valid or do I really need that finish?
Comment 11 Jan Vesely 2017-11-09 16:55:06 UTC
(In reply to Dave Gilbert from comment #10)
> I believe I'm still seeing this:
> 
> dg@hath:~/ocl2$ clinfo 
> Number of platforms                               1
>   Platform Name                                   Clover
>   Platform Vendor                                 Mesa
>   Platform Version                                OpenCL 1.1 Mesa
> 17.4.0-devel (git-a16dc04ad5)
> ....
> dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
> /home/dg/mesa/try/lib:
> 
> so I *think* it's using my build.

yes, that looks OK.

> and I believe I'm still seeing it.
> Is my test valid or do I really need that finish?

it should be OK. Can you replace the clFinish with clWaitForEvents (or the respective C++ method) to wait for kernel execution?
It looks to me that clover creates new map without waiting for all the dep events.
Comment 12 Dave Gilbert 2017-11-13 19:56:12 UTC
(In reply to Jan Vesely from comment #11)
> (In reply to Dave Gilbert from comment #10)
> > I believe I'm still seeing this:
> > 
> > dg@hath:~/ocl2$ clinfo 
> > Number of platforms                               1
> >   Platform Name                                   Clover
> >   Platform Vendor                                 Mesa
> >   Platform Version                                OpenCL 1.1 Mesa
> > 17.4.0-devel (git-a16dc04ad5)
> > ....
> > dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
> > /home/dg/mesa/try/lib:
> > 
> > so I *think* it's using my build.
> 
> yes, that looks OK.
> 
> > and I believe I'm still seeing it.
> > Is my test valid or do I really need that finish?
> 
> it should be OK. Can you replace the clFinish with clWaitForEvents (or the
> respective C++ method) to wait for kernel execution?
> It looks to me that clover creates new map without waiting for all the dep
> events.

It doesn't seem to help, if I add:
--- a/ocl.cpp
+++ b/ocl.cpp
@@ -74,6 +74,7 @@ static int got_dev(cl::Platform &plat, std::vector<cl::Device> &devices, cl::Dev
     cl::Event eventBarrier2;
     queue.enqueueBarrierWithWaitList(NULL,&eventBarrier2);
     std::cerr << __func__ << "enqueueMapBuffer gave: " << err << std::endl;
+    event.wait();
     eventMap.wait();
     eventBarrier2.wait();
 

that doesn't seem to help and I think event is the event triggered by the kernel.
Comment 13 Vedran Miletić 2017-11-13 20:36:27 UTC
(In reply to Dave Gilbert from comment #6)
> (In reply to Jan Vesely from comment #5)
> > (In reply to Dave Gilbert from comment #4)
> > > Created attachment 135313 [details]
> > > foo.link-0.ll
> > > 
> > > That's all 3 of the debug files it produced.
> > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > is the isa? and the ll's are both llvm dumps?)
> > 
> > yes. the first .ll is from compilation step, the other one is from linking
> > step.
> > 
> > .ll dump looks correct.
> > .asm also looks correct.
> > 
> > you can try producing multiple asm dumps for working and non-working runs.
> > But I don't think that the llvm is the culprit here.
> > 
> > Can you try waiting for the kernel execution to complete explicitly before
> > mapping the buffer?
> > Ideally call clFinish() on line 63.
> 
> Since I'm on the C++ binding (probably a mistake) I used:
>   queue.finish();
> 
> and it seems to be working.
> 
> (This also corresponds possibly to what I'm seeing on a more complex kernel;
> with a more complex kernel I'm seeing on a whole pile of data on the last
> few Z slices as being bogus suggesting it's not finished).
> 
> Dave

This reminds me of a certain issue I experienced with OpenMM. Is it limited to Turks, or it happens on SI+ cards?
Comment 14 Jan Vesely 2017-11-16 06:21:06 UTC
(In reply to Dave Gilbert from comment #12)
> 
> It doesn't seem to help, if I add:
> --- a/ocl.cpp
> +++ b/ocl.cpp
> @@ -74,6 +74,7 @@ static int got_dev(cl::Platform &plat,
> std::vector<cl::Device> &devices, cl::Dev
>      cl::Event eventBarrier2;
>      queue.enqueueBarrierWithWaitList(NULL,&eventBarrier2);
>      std::cerr << __func__ << "enqueueMapBuffer gave: " << err << std::endl;
> +    event.wait();
>      eventMap.wait();
>      eventBarrier2.wait();
>  
> 
> that doesn't seem to help and I think event is the event triggered by the
> kernel.

can you move it few lines up? (before the call to mapBuffer).
Comment 15 Dave Gilbert 2017-11-19 03:21:58 UTC
Hi Jan,
  Yes, doing:
--- a/ocl.cpp
+++ b/ocl.cpp
@@ -65,6 +65,7 @@ static int got_dev(cl::Platform &plat, std::vector<cl::Device> &devices, cl::Dev
     events.push_back(event);
     cl::Event eventMap;
     queue.enqueueBarrierWithWaitList(&events);
+    event.wait();
     mapped = (cl_uint*)queue.enqueueMapBuffer(output, CL_TRUE /* blocking */, CL_MAP_READ,
                            0 /* offset */, 
                            SIZE * SIZE * SIZE * sizeof(cl_uint) /* size */,

does seem to work.

Vedran: I've only got a Turks to play with; feel free to try my test on something else.
Comment 16 Jan Vesely 2017-11-22 23:44:01 UTC
(In reply to Dave Gilbert from comment #15)
> Hi Jan,
>   Yes, doing:
> --- a/ocl.cpp
> +++ b/ocl.cpp
> @@ -65,6 +65,7 @@ static int got_dev(cl::Platform &plat,
> std::vector<cl::Device> &devices, cl::Dev
>      events.push_back(event);
>      cl::Event eventMap;
>      queue.enqueueBarrierWithWaitList(&events);
> +    event.wait();
>      mapped = (cl_uint*)queue.enqueueMapBuffer(output, CL_TRUE /* blocking
> */, CL_MAP_READ,
>                             0 /* offset */, 
>                             SIZE * SIZE * SIZE * sizeof(cl_uint) /* size */,
> 
> does seem to work.

thanks, that means the kernel work event works correctly.
I'll need to double check the specs wrt synchronization points. we either miss a wait, or fail to update mapped buffers after kernel finishes execution.

> 
> Vedran: I've only got a Turks to play with; feel free to try my test on
> something else.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.