103586 – OpenCL/Clover: AMD Turks: corrupt output buffer (depending on dimension order?)

Bug 103586 - OpenCL/Clover: AMD Turks: corrupt output buffer (depending on dimension order?)

Summary: OpenCL/Clover: AMD Turks: corrupt output buffer (depending on dimension order?)

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/r600 (show other bugs)
Version:	17.2
Hardware:	Other All

Importance:	medium normal
Assignee:	mesa-dev
QA Contact:	mesa-dev

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	99553
	Show dependency tree / graph

Reported:	2017-11-06 00:20 UTC by Dave Gilbert
Modified:	2019-09-18 19:24 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:
i915 features:

Attachments
foo.ll from debug run (132.57 KB, text/plain) 2017-11-08 17:03 UTC, Dave Gilbert	Details
foo.link-0.asm (292.35 KB, text/plain) 2017-11-08 17:03 UTC, Dave Gilbert	Details
foo.link-0.ll (132.16 KB, text/plain) 2017-11-08 17:04 UTC, Dave Gilbert	Details
annotated asm dump (3.54 KB, text/plain) 2017-11-08 20:58 UTC, Jan Vesely	Details
View All

Description Dave Gilbert 2017-11-06 00:20:53 UTC

I've got a trivial kernel that draws a sphere in a voxel cube; each voxel should end up as 0 or 1; if I use global id 0 as z, 1 as y, 2 as x  I get corruptions where some voxels have random junk in; if I reverse the order so that global id 0 is x, 1 is y and 2 is z then it's happy.
(Confirmed the code is clean with oclgrind and happy on Intel.

Versions:

Number of devices                                 1
  Device Name                                     AMD TURKS (DRM 2.50.0 / 4.13.0-1-amd64, LLVM 5.0.0)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 17.2.4
  Driver Version                                  17.2.4
  Device OpenCL C Version                         OpenCL C 1.1 

(on debian testing, was on stable, but same behaviour)

01:00.0 0300: 1002:6841
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Thames [Radeon HD 7550M/7570M/7650M] (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company Thames [Radeon HD 7550M/7570M/7650M]
	Flags: bus master, fast devsel, latency 0, IRQ 37
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d4300000 (64-bit, non-prefetchable) [size=128K]
	I/O ports at 4000 [size=256]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: radeon
	Kernel modules: radeon

in an HP Elitebook laptop.

Code that triggers this:
https://github.com/penguin42/opencl-play/commit/c98470685874769e4a59975791459180564b6f6e

build and run with:
g++ -O2 ocl.cpp -lOpenCL && ./a.out 2> z
then check output with:
tr '01' '  ' <z|grep -v '^ *$'|egrep -v 'got_dev|^Z'
which should be empty,

(In some builds I've found I've had to increase the SIZE constant to 256 to trigger it)

Then my commit e89fe62 fixes it with:
diff --git a/sphere.ocl b/sphere.ocl
index b4f23af..c89ecb9 100644
--- a/sphere.ocl
+++ b/sphere.ocl
@@ -1,10 +1,10 @@
 __kernel void hello(__global uint* o) {
-  int z = get_global_id(0);
+  int z = get_global_id(2);
   int y = get_global_id(1);
-  int x = get_global_id(2);
-  int zr = get_global_size(0);
+  int x = get_global_id(0);
+  int zr = get_global_size(2);
   int yr = get_global_size(1);
-  int xr = get_global_size(2);
+  int xr = get_global_size(0);
   float zf = ((float)z - ((float)zr)/2) / (float)zr;
   float yf = ((float)y - ((float)yr)/2) / (float)yr;
   float xf = ((float)x - ((float)xr)/2) / (float)xr;

by just swapping z/x around - which should make no difference given it's a cube.

But....hmm, I've seen it fail in that direction now as well.

The corruptions all seem to be near the maximum x/y/z value - almost like one small chunk in the max corner.

Here's the kernel:
__kernel void hello(__global uint* o) {
  int z = get_global_id(0);
  int y = get_global_id(1);
  int x = get_global_id(2);
  int zr = get_global_size(0);
  int yr = get_global_size(1);
  int xr = get_global_size(2);
  float zf = ((float)z - ((float)zr)/2) / (float)zr;
  float yf = ((float)y - ((float)yr)/2) / (float)yr;
  float xf = ((float)x - ((float)xr)/2) / (float)xr;

  o[z*yr*xr + y*xr + x] = ((zf * zf) + (yf * yf) + (xf * xf)) <  0.25;
}

Comment 1 Jan Vesely 2017-11-08 15:23:09 UTC

can you run using CLOVER_DEBUG=llvm,native CLOVER_DEBUG_FILE=foo and attach both llvm and isa dumps?

Comment 2 Dave Gilbert 2017-11-08 17:03:33 UTC

Created attachment 135311 [details]
foo.ll from debug run

Comment 3 Dave Gilbert 2017-11-08 17:03:52 UTC

Created attachment 135312 [details]
foo.link-0.asm

Comment 4 Dave Gilbert 2017-11-08 17:04:44 UTC

Created attachment 135313 [details]
foo.link-0.ll

That's all 3 of the debug files it produced.
(I wasn't sure which were the llvm and which the isa dumps; I guess the asm is the isa? and the ll's are both llvm dumps?)

Comment 5 Jan Vesely 2017-11-08 20:44:44 UTC

(In reply to Dave Gilbert from comment #4)
> Created attachment 135313 [details]
> foo.link-0.ll
> 
> That's all 3 of the debug files it produced.
> (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> is the isa? and the ll's are both llvm dumps?)

yes. the first .ll is from compilation step, the other one is from linking step.

.ll dump looks correct.
.asm also looks correct.

you can try producing multiple asm dumps for working and non-working runs. But I don't think that the llvm is the culprit here.

Can you try waiting for the kernel execution to complete explicitly before mapping the buffer?
Ideally call clFinish() on line 63.

Comment 6 Dave Gilbert 2017-11-08 20:56:13 UTC

(In reply to Jan Vesely from comment #5)
> (In reply to Dave Gilbert from comment #4)
> > Created attachment 135313 [details]
> > foo.link-0.ll
> > 
> > That's all 3 of the debug files it produced.
> > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > is the isa? and the ll's are both llvm dumps?)
> 
> yes. the first .ll is from compilation step, the other one is from linking
> step.
> 
> .ll dump looks correct.
> .asm also looks correct.
> 
> you can try producing multiple asm dumps for working and non-working runs.
> But I don't think that the llvm is the culprit here.
> 
> Can you try waiting for the kernel execution to complete explicitly before
> mapping the buffer?
> Ideally call clFinish() on line 63.

Since I'm on the C++ binding (probably a mistake) I used:
  queue.finish();

and it seems to be working.

(This also corresponds possibly to what I'm seeing on a more complex kernel; with a more complex kernel I'm seeing on a whole pile of data on the last few Z slices as being bogus suggesting it's not finished).

Dave

Comment 7 Jan Vesely 2017-11-08 20:58:16 UTC

Created attachment 135318 [details]
annotated asm dump

Comment 8 Jan Vesely 2017-11-08 22:25:07 UTC

(In reply to Dave Gilbert from comment #6)
> (In reply to Jan Vesely from comment #5)
> > (In reply to Dave Gilbert from comment #4)
> > > Created attachment 135313 [details]
> > > foo.link-0.ll
> > > 
> > > That's all 3 of the debug files it produced.
> > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > is the isa? and the ll's are both llvm dumps?)
> > 
> > yes. the first .ll is from compilation step, the other one is from linking
> > step.
> > 
> > .ll dump looks correct.
> > .asm also looks correct.
> > 
> > you can try producing multiple asm dumps for working and non-working runs.
> > But I don't think that the llvm is the culprit here.
> > 
> > Can you try waiting for the kernel execution to complete explicitly before
> > mapping the buffer?
> > Ideally call clFinish() on line 63.
> 
> Since I'm on the C++ binding (probably a mistake) I used:
>   queue.finish();
> 
> and it seems to be working.
> 
> (This also corresponds possibly to what I'm seeing on a more complex kernel;
> with a more complex kernel I'm seeing on a whole pile of data on the last
> few Z slices as being bogus suggesting it's not finished).
> 
> Dave

thanks for testing. I see you are using mesa 17.2.

there were few changes to blocking call synchronization that went to mesa 17.3:
02f8ac6b70033a1b240d497c4664c359d2398cc3 (clover: Wrap event::wait_count in a method taking care of the required locking.)
bc4000ee40c78efe1e5e8a6244d4bb55389d8418 (clover: Run the associated action before an event is signalled.)
3a5b69c09ba355c616c274b0c7f5aba3bd21fd54 (clover: Wait for requested operation if blocking flag is set)

which might help address the issue. Can you test mesa 17.3?

Comment 9 Dave Gilbert 2017-11-08 23:07:56 UTC

(In reply to Jan Vesely from comment #8)
> (In reply to Dave Gilbert from comment #6)
> > (In reply to Jan Vesely from comment #5)
> > > (In reply to Dave Gilbert from comment #4)
> > > > Created attachment 135313 [details]
> > > > foo.link-0.ll
> > > > 
> > > > That's all 3 of the debug files it produced.
> > > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > > is the isa? and the ll's are both llvm dumps?)
> > > 
> > > yes. the first .ll is from compilation step, the other one is from linking
> > > step.
> > > 
> > > .ll dump looks correct.
> > > .asm also looks correct.
> > > 
> > > you can try producing multiple asm dumps for working and non-working runs.
> > > But I don't think that the llvm is the culprit here.
> > > 
> > > Can you try waiting for the kernel execution to complete explicitly before
> > > mapping the buffer?
> > > Ideally call clFinish() on line 63.
> > 
> > Since I'm on the C++ binding (probably a mistake) I used:
> >   queue.finish();
> > 
> > and it seems to be working.
> > 
> > (This also corresponds possibly to what I'm seeing on a more complex kernel;
> > with a more complex kernel I'm seeing on a whole pile of data on the last
> > few Z slices as being bogus suggesting it's not finished).
> > 
> > Dave
> 
> thanks for testing. I see you are using mesa 17.2.
> 
> there were few changes to blocking call synchronization that went to mesa
> 17.3:
> 02f8ac6b70033a1b240d497c4664c359d2398cc3 (clover: Wrap event::wait_count in
> a method taking care of the required locking.)
> bc4000ee40c78efe1e5e8a6244d4bb55389d8418 (clover: Run the associated action
> before an event is signalled.)
> 3a5b69c09ba355c616c274b0c7f5aba3bd21fd54 (clover: Wait for requested
> operation if blocking flag is set)
> 
> which might help address the issue. Can you test mesa 17.3?

Yeh, I'll figure out how to get 17.3 built on this box.

Comment 10 Dave Gilbert 2017-11-09 03:13:13 UTC

I believe I'm still seeing this:

dg@hath:~/ocl2$ clinfo 
Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 17.4.0-devel (git-a16dc04ad5)
....
dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
/home/dg/mesa/try/lib:

so I *think* it's using my build.

and I believe I'm still seeing it.
Is my test valid or do I really need that finish?

Comment 11 Jan Vesely 2017-11-09 16:55:06 UTC

(In reply to Dave Gilbert from comment #10)
> I believe I'm still seeing this:
> 
> dg@hath:~/ocl2$ clinfo 
> Number of platforms                               1
>   Platform Name                                   Clover
>   Platform Vendor                                 Mesa
>   Platform Version                                OpenCL 1.1 Mesa
> 17.4.0-devel (git-a16dc04ad5)
> ....
> dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
> /home/dg/mesa/try/lib:
> 
> so I *think* it's using my build.

yes, that looks OK.

> and I believe I'm still seeing it.
> Is my test valid or do I really need that finish?

it should be OK. Can you replace the clFinish with clWaitForEvents (or the respective C++ method) to wait for kernel execution?
It looks to me that clover creates new map without waiting for all the dep events.

Comment 12 Dave Gilbert 2017-11-13 19:56:12 UTC

(In reply to Jan Vesely from comment #11)
> (In reply to Dave Gilbert from comment #10)
> > I believe I'm still seeing this:
> > 
> > dg@hath:~/ocl2$ clinfo 
> > Number of platforms                               1
> >   Platform Name                                   Clover
> >   Platform Vendor                                 Mesa
> >   Platform Version                                OpenCL 1.1 Mesa
> > 17.4.0-devel (git-a16dc04ad5)
> > ....
> > dg@hath:~/ocl2$ echo $LD_LIBRARY_PATH 
> > /home/dg/mesa/try/lib:
> > 
> > so I *think* it's using my build.
> 
> yes, that looks OK.
> 
> > and I believe I'm still seeing it.
> > Is my test valid or do I really need that finish?
> 
> it should be OK. Can you replace the clFinish with clWaitForEvents (or the
> respective C++ method) to wait for kernel execution?
> It looks to me that clover creates new map without waiting for all the dep
> events.

It doesn't seem to help, if I add:
--- a/ocl.cpp
+++ b/ocl.cpp
@@ -74,6 +74,7 @@ static int got_dev(cl::Platform &plat, std::vector<cl::Device> &devices, cl::Dev
     cl::Event eventBarrier2;
     queue.enqueueBarrierWithWaitList(NULL,&eventBarrier2);
     std::cerr << __func__ << "enqueueMapBuffer gave: " << err << std::endl;
+    event.wait();
     eventMap.wait();
     eventBarrier2.wait();
 

that doesn't seem to help and I think event is the event triggered by the kernel.

Comment 13 Vedran Miletić 2017-11-13 20:36:27 UTC

(In reply to Dave Gilbert from comment #6)
> (In reply to Jan Vesely from comment #5)
> > (In reply to Dave Gilbert from comment #4)
> > > Created attachment 135313 [details]
> > > foo.link-0.ll
> > > 
> > > That's all 3 of the debug files it produced.
> > > (I wasn't sure which were the llvm and which the isa dumps; I guess the asm
> > > is the isa? and the ll's are both llvm dumps?)
> > 
> > yes. the first .ll is from compilation step, the other one is from linking
> > step.
> > 
> > .ll dump looks correct.
> > .asm also looks correct.
> > 
> > you can try producing multiple asm dumps for working and non-working runs.
> > But I don't think that the llvm is the culprit here.
> > 
> > Can you try waiting for the kernel execution to complete explicitly before
> > mapping the buffer?
> > Ideally call clFinish() on line 63.
> 
> Since I'm on the C++ binding (probably a mistake) I used:
>   queue.finish();
> 
> and it seems to be working.
> 
> (This also corresponds possibly to what I'm seeing on a more complex kernel;
> with a more complex kernel I'm seeing on a whole pile of data on the last
> few Z slices as being bogus suggesting it's not finished).
> 
> Dave

This reminds me of a certain issue I experienced with OpenMM. Is it limited to Turks, or it happens on SI+ cards?

Comment 14 Jan Vesely 2017-11-16 06:21:06 UTC

(In reply to Dave Gilbert from comment #12)
> 
> It doesn't seem to help, if I add:
> --- a/ocl.cpp
> +++ b/ocl.cpp
> @@ -74,6 +74,7 @@ static int got_dev(cl::Platform &plat,
> std::vector<cl::Device> &devices, cl::Dev
>      cl::Event eventBarrier2;
>      queue.enqueueBarrierWithWaitList(NULL,&eventBarrier2);
>      std::cerr << __func__ << "enqueueMapBuffer gave: " << err << std::endl;
> +    event.wait();
>      eventMap.wait();
>      eventBarrier2.wait();
>  
> 
> that doesn't seem to help and I think event is the event triggered by the
> kernel.

can you move it few lines up? (before the call to mapBuffer).

Comment 15 Dave Gilbert 2017-11-19 03:21:58 UTC

Hi Jan,
  Yes, doing:
--- a/ocl.cpp
+++ b/ocl.cpp
@@ -65,6 +65,7 @@ static int got_dev(cl::Platform &plat, std::vector<cl::Device> &devices, cl::Dev
     events.push_back(event);
     cl::Event eventMap;
     queue.enqueueBarrierWithWaitList(&events);
+    event.wait();
     mapped = (cl_uint*)queue.enqueueMapBuffer(output, CL_TRUE /* blocking */, CL_MAP_READ,
                            0 /* offset */, 
                            SIZE * SIZE * SIZE * sizeof(cl_uint) /* size */,

does seem to work.

Vedran: I've only got a Turks to play with; feel free to try my test on something else.

Comment 16 Jan Vesely 2017-11-22 23:44:01 UTC

(In reply to Dave Gilbert from comment #15)
> Hi Jan,
>   Yes, doing:
> --- a/ocl.cpp
> +++ b/ocl.cpp
> @@ -65,6 +65,7 @@ static int got_dev(cl::Platform &plat,
> std::vector<cl::Device> &devices, cl::Dev
>      events.push_back(event);
>      cl::Event eventMap;
>      queue.enqueueBarrierWithWaitList(&events);
> +    event.wait();
>      mapped = (cl_uint*)queue.enqueueMapBuffer(output, CL_TRUE /* blocking
> */, CL_MAP_READ,
>                             0 /* offset */, 
>                             SIZE * SIZE * SIZE * sizeof(cl_uint) /* size */,
> 
> does seem to work.

thanks, that means the kernel work event works correctly.
I'll need to double check the specs wrt synchronization points. we either miss a wait, or fail to update mapped buffers after kernel finishes execution.

> 
> Vedran: I've only got a Turks to play with; feel free to try my test on
> something else.

Comment 17 GitLab Migration User 2019-09-18 19:24:29 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/615.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.