Bug 75400 - regression in OpenCL since commit cc3aeac
Summary: regression in OpenCL since commit cc3aeac
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2014-02-23 13:48 UTC by Bruno Jiménez
Modified: 2014-02-24 16:16 UTC (History)

Description Bruno Jiménez 2014-02-23 13:48:56 UTC
Hi,

This morning I recompiled Mesa and found that OpenCL support was broken. I have managed to bisect the regression back to commit cc3aeac ( http://cgit.freedesktop.org/mesa/mesa/commit/?id=cc3aeacab64a6928a903f1dbfeaa7c880a8de5a6 ). Strangely, that commit has nothing to do with Clover.

I am using Arch Linux with kernel 3.13.4 and an AMD HD5470. There is nothing interesting in dmesg or the Xorg logs.

If I can give you any more information, just ask.
Comment 1 Emil Velikov 2014-02-23 15:35:02 UTC
Hi Bruno

What do you mean by "broken" here? If you're talking about a compilation problem, take a look at bug 75356, which has a patch to resolve it.

If you are having a different problem, let us know what it is :)
Comment 2 Bruno Jiménez 2014-02-23 16:10:25 UTC
Hi Emil,

No, it's not a compilation error, neither in Mesa nor in the OpenCL code. It's just that OpenCL programs crash with segfaults.

Every test from http://cgit.freedesktop.org/~tstellar/opencl-example/ fails, and its 'hello_world' program crashes with a segfault.

As the code changed in that bug has nothing to do with clover, maybe the problem is with my configuration?

Here's what I pass to autogen.sh. There are surely some options I don't need, but I took them from a PKGBUILD:

  --prefix=/usr \
  --sysconfdir=/etc \
  --with-dri-driverdir=/usr/lib/xorg/modules/dri \
  --with-gallium-drivers=r600,swrast \
  --with-dri-drivers=swrast \
  --enable-gallium-llvm \
  --enable-egl \
  --enable-gallium-egl \
  --with-egl-platforms=x11,drm,wayland \
  --enable-shared-glapi \
  --enable-gbm \
  --enable-gallium-gbm \
  --enable-glx-tls \
  --enable-dri \
  --enable-glx \
  --enable-osmesa \
  --enable-texture-float \
  --enable-xa \
  --enable-vdpau \
  --enable-omx \
  --with-llvm-shared-libs \
  --enable-opencl --enable-opencl-icd \
  --with-clang-libdir=/usr/lib

If there's anything else I can do to help, just ask.
Thanks!
Comment 3 Emil Velikov 2014-02-23 16:31:34 UTC
Strange, I do not see how that commit could cause anything other than compilation issues. FWIW, it might be worth double-checking that the bisect went fine and attaching a backtrace of the segfault.
Comment 4 Bruno Jiménez 2014-02-23 18:35:42 UTC
I am also very surprised by which commit seems to have started this. I did the bisect by building Arch packages, installing them and then testing them. So, unless I have missed something, which is also possible, that's it.

I have recompiled at commit cc3aeac with debug information, but for some strange reason, gdb doesn't want to step into OpenCL functions.

Here's what I have guessed:

- Actually, the segfault comes from a fprintf with a "%s" and a null pointer. It can be solved by just adding a default case to 'clUtilErrorString'.

- The real problem happens with 'clGetPlatformIDs', which returns an error value of '-1001'.

I have triggered the return of 'CL_INVALID_VALUE', and tried various combinations of parameters to see if it changed anything. The result always seems to be one error or the other.

I have checked the code at mesa/src/gallium/state_trackers/clover/api/platform.cpp (where clGetPlatformIDs is) and have no clue how that is possible.

Sorry if this isn't enough information, but I am completely clueless about what could be happening.

I will check my packages again to see whether I compiled one version and labelled it as another.

If I can help with anything else, just ask.
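
The null-pointer crash described in this comment can be avoided by giving the error-to-string helper a default case. The following is a minimal sketch, not the actual code from cl_util.c: `error_string` and `report_error` are hypothetical names, and the constants are redefined locally (with the values the Khronos headers assign) so the sketch compiles without <CL/cl.h>. In those headers, -1001 is CL_PLATFORM_NOT_FOUND_KHR, the code an ICD loader returns from clGetPlatformIDs when it finds no usable platform.

```c
#include <stdio.h>

/* Values as assigned in the Khronos OpenCL headers; redefined here only
 * so that this sketch builds without <CL/cl.h>. */
#define CL_SUCCESS                     0
#define CL_INVALID_VALUE             -30
#define CL_PLATFORM_NOT_FOUND_KHR  -1001

/* Hypothetical stand-in for clUtilErrorString. The default case
 * guarantees that the function never returns NULL. */
const char *error_string(int err)
{
    switch (err) {
    case CL_SUCCESS:                return "CL_SUCCESS";
    case CL_INVALID_VALUE:          return "CL_INVALID_VALUE";
    case CL_PLATFORM_NOT_FOUND_KHR: return "CL_PLATFORM_NOT_FOUND_KHR";
    default:                        return "unknown OpenCL error";
    }
}

/* Hypothetical helper: report a failed call. Passing the result of
 * error_string to "%s" is safe for any error code. */
int report_error(FILE *out, const char *call, int err)
{
    return fprintf(out, "%s failed: %s (%d)\n", call, error_string(err), err);
}
```

Calling `report_error(stderr, "clGetPlatformIDs", -1001)` would then print the symbolic name together with the numeric code instead of handing a null pointer to fprintf.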
Comment 5 Francisco Jerez 2014-02-23 19:45:02 UTC
(In reply to comment #4)
> I am also very surprised by which commit seems to have started this. I did
> the bisect by building Arch packages, installing them and then testing them.
> So, unless I have missed something, which is also possible, that's it.
> 
> I have recompiled at commit cc3aeac with debug information, but for some
> strange reason, gdb doesn't want to step into OpenCL functions.
> 
> Here's what I have guessed:
> 
> - Actually, the segfault comes from a fprintf with a "%s" and a null
> pointer. It can be solved by just adding a default case to
> 'clUtilErrorString'.
> 
> - The real problem happens with 'clGetPlatformIDs', which returns an error
> value of '-1001'.
> 
> I have triggered the return of 'CL_INVALID_VALUE', and tried various
> combinations of parameters to see if it changed anything. The result always
> seems to be one error or the other.
> 
> I have checked the code at
> mesa/src/gallium/state_trackers/clover/api/platform.cpp (where
> clGetPlatformIDs is) and have no clue how it can be possible.
> 
> Sorry if this isn't enough information, but I am completely clueless about
> what could be happening.
> 
> I will check my packages again to see whether I compiled one version and
> labelled it as another.
> 
> If I can help with anything else, just ask.

Most likely you're getting that segfault somewhere in the ICD loader because it's unable to load Mesa's ICD library.  I guess that this hunk:

+if NEED_WINSYS_XLIB
+AM_CPPFLAGS += -DHAVE_WINSYS_XLIB
+endif

pulls in the XLIB pipe-loader back-end that was previously ifdef-ed out in Clover builds, leading to undefined symbols in the resulting library.
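
One way to test the diagnosis in this comment is to load the library directly and ask the dynamic linker to resolve every symbol up front. The following is a minimal sketch under stated assumptions: `check_library_loads` is a hypothetical helper, the library path is whatever the installed .icd file points to, and an ICD loader may open the library with different flags, so this only approximates what the loader does.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to load a shared library with all symbols
 * resolved immediately (RTLD_NOW). Returns 0 on success and -1 on
 * failure, printing the dynamic linker's explanation to stderr. */
int check_library_loads(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        /* dlerror() describes the failure, e.g. an undefined symbol. */
        const char *reason = dlerror();
        fprintf(stderr, "cannot load %s: %s\n",
                path, reason ? reason : "unknown error");
        return -1;
    }
    dlclose(handle);
    return 0;
}
```

If the library has unresolved symbols, the message printed from dlerror() typically names the first one. On older glibc versions the program has to be linked with -ldl.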
Comment 6 Emil Velikov 2014-02-23 20:21:34 UTC
(In reply to comment #5)
> 
> Most likely you're getting that segfault somewhere in the ICD loader because
> it's unable to load Mesa's ICD library.  I guess that this hunk:
> 
> +if NEED_WINSYS_XLIB
> +AM_CPPFLAGS += -DHAVE_WINSYS_XLIB
> +endif
> 
> pulls in the XLIB pipe-loader back-end that was previously ifdef-ed out in
> Clover builds, leading to undefined symbols in the resulting library.

Would that not cause the build/link to fail? Hmm, I guess not, since the opencl target is missing -no-undefined.

Francisco,
Is there any particular reason why we do not use -no-undefined for opencl?

Bruno,
Feel free to grab the patch from bug 75356, which should handle the symbol problems and continue from there.
Comment 7 Bruno Jiménez 2014-02-23 20:26:12 UTC
Hi Francisco,

The segfaults were caused because 'clGetPlatformIDs' returned a strange error (-1001), which 'clUtilErrorString' (from 'cl_util.c') does not handle. So it returned nothing, and when fprintf tried to print that result, it segfaulted.


Emil,

I'll try that patch as soon as I can.


Thanks!
Comment 8 Bruno Jiménez 2014-02-23 22:19:21 UTC
Hi,

I'm afraid that patch doesn't help. I have also tried the patch you sent to the mailing list ( http://lists.freedesktop.org/archives/mesa-dev/2014-February/054780.html ), but that didn't help either.

If there's anything else I can do, just ask.
Thanks!
Comment 9 Emil Velikov 2014-02-24 15:04:46 UTC
(In reply to comment #8)
> Hi,
> 
> I'm afraid that patch doesn't help. I have also tried the patch you sent
> to the mailing list (
> http://lists.freedesktop.org/archives/mesa-dev/2014-February/054780.html ),
> but that didn't help either.
> 
Interesting, the patch you've linked should have caused a build breakage, as there is yet another missing symbol/reference :\

I just pushed a few patches that should resolve the missing symbols within pipe-loader, which is used by OpenCL. Check out the latest master and give it a try.
Comment 10 Bruno Jiménez 2014-02-24 16:16:26 UTC
Hi,

The latest master branch works perfectly.

Thanks a lot!

