Bug 108272 - [polaris10] opencl-mesa: Anything using OpenCL segfaults, XFX Radeon RX 580
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: 18.2
Hardware: x86-64 (AMD64) All
Importance: medium critical
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Blocks: 99553
Reported: 2018-10-08 08:26 UTC by jamespharvey20
Modified: 2018-12-17 17:46 UTC

Attachments
clinfo seems happy (6.32 KB, text/plain)
2018-10-08 08:26 UTC, jamespharvey20
glxinfo output (143.77 KB, text/plain)
2018-10-08 08:27 UTC, jamespharvey20
gdb backtrace of luxmark with opencl-mesa 18.2.2 in debug mode (20.94 KB, text/plain)
2018-10-08 08:28 UTC, jamespharvey20
gdb backtrace of luxmark with opencl-mesa 18.2.2 Arch binary (not debug, no symbols) (18.33 KB, text/plain)
2018-10-08 08:28 UTC, jamespharvey20
gdb backtrace of IndigoBenchmark (4.20 KB, text/plain)
2018-10-08 08:29 UTC, jamespharvey20

Description jamespharvey20 2018-10-08 08:26:57 UTC
Created attachment 141931 [details]
clinfo seems happy

Up to date Arch Linux, including: linux 4.18.12.arch1-1, xf86-video-amdgpu 18.1.0-1, mesa 18.2.2-1, opencl-mesa 18.2.2-1, xorg-server 1.20.1-1, and plasma-desktop 5.13.5-1.

(Recently installed system that STARTED with: linux 4.18.9.arch1-1, mesa 18.2.1-1, and opencl-mesa 18.2.1-1.)

I have not tried AMDGPU-PRO.  (Not really interested anyway, and the Arch AUR package for it is 17.40, which requires downgrading to linux 4.9 and Xorg 1.18.)

$ lspci -k | grep VGA
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Subsystem: XFX Pine Group Inc. Ellesmere [Radeon RX 470/480/570/570X/580/580X]
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

It's the: XFX AMD Radeon RX 580 GTS Black Edition 8GB GDDR5 PCI Express 3.0

clinfo seems happy, see full output attached or here: http://termbin.com/iiow

See glxinfo output attached or here: http://termbin.com/lgqd

Anything using OpenCL immediately segfaults.

Everything that segfaults runs just fine if I uninstall opencl-mesa and instead use opencl-amd (an Arch Linux package that extracts the OpenCL library from AMDGPU-PRO **18.30** but allows it to run with the open source AMDGPU driver).  Everything also runs fine if I instead use intel-opencl-runtime to just run on the CPUs.

-----

luxmark 3.1 crashes in pipe_radeonsi.so, called by libMesaOpenCL.so.1, called by libOpenCL.so.1.

A gdb backtrace with opencl-mesa 18.2.2 that I compiled in debug mode with symbols is attached or here: http://termbin.com/to7v

A gdb backtrace with opencl-mesa 18.2.2 (Arch binary, so without buildtype specified and no symbols) is here: http://termbin.com/m1yx

Setting environment variable MESA_DEBUG causes no additional output.

-----

Granted, IndigoBenchmark_x64_v4.0.64 crashes by calling /usr/lib/libOpenCL.so::clGetPlatformIDs() which is ocl-icd, but without opencl-mesa and instead with opencl-amd, it runs through it just fine.

A gdb backtrace is attached or here: http://termbin.com/junz6
Comment 1 jamespharvey20 2018-10-08 08:27:19 UTC
Created attachment 141932 [details]
glxinfo output
Comment 2 jamespharvey20 2018-10-08 08:28:13 UTC
Created attachment 141933 [details]
gdb backtrace of luxmark with opencl-mesa 18.2.2 in debug mode
Comment 3 jamespharvey20 2018-10-08 08:28:38 UTC
Created attachment 141934 [details]
gdb backtrace of luxmark with opencl-mesa 18.2.2 Arch binary (not debug, no symbols)
Comment 4 jamespharvey20 2018-10-08 08:29:06 UTC
Created attachment 141935 [details]
gdb backtrace of IndigoBenchmark
Comment 5 Jan Vesely 2018-10-09 02:54:49 UTC
These look like two separate problems. The luxmark failure is known. Luxmark requires more global buffers than the 22 currently supported by radeonsi. Without asserts (src/gallium/drivers/radeonsi/si_compute.c:298), it accesses the global buffer array out of bounds.
Just bumping MAX_GLOBAL_BUFFERS to 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc rounding conversions are incorrect.
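
To illustrate the failure mode described above, here is a minimal sketch of a fixed-size buffer table with an explicit range check. It is not the radeonsi code: `struct compute_state` and `set_global_buffers` are hypothetical names, and only the `MAX_GLOBAL_BUFFERS` limit of 32 comes from this thread. In a build without asserts, a check like this is what keeps an oversized request from writing past the end of the array.

```c
#include <stddef.h>

/* Limit taken from the thread; everything else here is a hypothetical stand-in. */
#define MAX_GLOBAL_BUFFERS 32

struct compute_state {
    const void *global_buffers[MAX_GLOBAL_BUFFERS];
};

/*
 * Bind n buffers starting at slot `first`.
 * Returns 0 on success, -1 if the requested range would overrun the table.
 * The comparison is written so that first + n cannot overflow.
 */
int set_global_buffers(struct compute_state *st, unsigned first, unsigned n,
                       const void *const *bufs)
{
    if (first > MAX_GLOBAL_BUFFERS || n > MAX_GLOBAL_BUFFERS - first)
        return -1;

    for (unsigned i = 0; i < n; i++)
        st->global_buffers[first + i] = bufs ? bufs[i] : NULL;

    return 0;
}
```

With this check, a request for 33 buffers is rejected with an error instead of corrupting adjacent memory, while anything up to 32 is accepted.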



The second problem is harder to assess. Since platform evaluation works OK with clinfo, the failure seems to be in LLVM initialization code. Is IndigoBenchmark linking to LLVM (directly or via OpenGL)? If yes, is it linked to the same version as clover?
Comment 6 jamespharvey20 2018-10-09 08:32:40 UTC
Understood about Luxmark.

Doesn't look like it links against LLVM directly, but it links against libGL, libGLX, and libGLdispatch.

Arch is on LLVM 7.0.0-1, and I wouldn't be surprised if that is newer than IndigoBenchmark had on whatever distribution they compiled on.  I'm reaching out to them to see if I can get their attention to come here, or answer what they linked against.

IndigoBenchmark does work with opencl-amd, but I understand maybe that doesn't link against or at least use the llvm initialization code in the same way.

I made a post on their support forum (indigorenderer.com/forum) which is pending moderator approval.  Hopefully I can get them to share more information.

$ ldd ./indigo_benchmark
linux-vdso.so.1 (0x00007fffb2f65000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f761adda000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f761adb9000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f761aba2000)
libpng12.so.0 => /usr/lib/libpng12.so.0 (0x00007f761a979000)
libQt5Gui.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Gui.so.5 (0x00007f761a404000)
libQt5Core.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Core.so.5 (0x00007f7619e5d000)
libQt5Widgets.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Widgets.so.5 (0x00007f76197fd000)
libQt5Network.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Network.so.5 (0x00007f76196f1000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f7619562000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f76193dd000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f76193c3000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f76191ff000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f761ae0c000)
libGL.so.1 => /usr/lib/libGL.so.1 (0x00007f761916a000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f7619160000)
libGLX.so.0 => /usr/lib/libGLX.so.0 (0x00007f761912d000)
libX11.so.6 => /usr/lib/libX11.so.6 (0x00007f7618fef000)
libXext.so.6 => /usr/lib/libXext.so.6 (0x00007f7618ddd000)
libGLdispatch.so.0 => /usr/lib/libGLdispatch.so.0 (0x00007f7618d1f000)
libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00007f7618cf5000)
libXau.so.6 => /usr/lib/libXau.so.6 (0x00007f7618af1000)
libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00007f76188eb000)
Comment 7 jamespharvey20 2018-10-09 08:34:24 UTC
Regarding Luxmark, do you know with MAX_GLOBAL_BUFFERS set to 32, is it merely that many pixels will be shown wrong?  Are the benchmark results valid, or with the wrong pixels, are the results garbage?
Comment 8 Jan Vesely 2018-10-09 17:42:29 UTC
(In reply to jamespharvey20 from comment #7)
> Regarding Luxmark, do you know with MAX_GLOBAL_BUFFERS set to 32, is it
> merely that many pixels will be shown wrong?  Are the benchmark results
> valid, or with the wrong pixels, are the results garbage?

The image looks slightly garbled to me; the results say that ~30% of pixels are incorrect on my raven GPU. I only checked luxball.
Comment 9 Jan Vesely 2018-10-19 18:00:24 UTC
Hm, I thought I sent this out yesterday...

The luxball issue should be fixed by 06bf56725db1827dfcb86b1d0bcd71d195fda1d2 ("radeonsi: Bump number of allowed global buffers to 32").

The indigo benchmark crash might just be a manifestation of earlier memory corruption.
It has been the case before that OpenCL apps don't allocate a large enough buffer for the device name [0,1]. clover uses rather lengthy device names (~80 chars in your case).
You can try modifying the name string in src/gallium/drivers/radeonsi/si_get.c:964 to see if it helps hide the issue.


[0] https://github.com/JPaulMora/Pyrit/pull/572/files
[1] https://github.com/Theano/libgpuarray/pull/531/files
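
For reference, the application-side way to avoid the device-name overflow is to ask for the required size first and only then allocate. The sketch below shows that pattern without needing an OpenCL installation: `query_device_name` is a hypothetical stub that mimics how a size-reporting query such as `clGetDeviceInfo(..., CL_DEVICE_NAME, ...)` behaves (it reports the needed size and fails if the supplied buffer is too small), and the name string in it is made up.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a size-reporting query.
 * Always reports the required size (including the NUL terminator) via `needed`.
 * If `buf` is non-NULL, copies the name only when `cap` is large enough.
 */
int query_device_name(size_t cap, char *buf, size_t *needed)
{
    /* Made-up name, deliberately long like the ones clover reports. */
    static const char name[] =
        "Example GPU with a long driver-generated name (DRM x.y.z, kernel, LLVM a.b.c)";

    if (needed)
        *needed = sizeof(name);
    if (buf) {
        if (cap < sizeof(name))
            return -1; /* buffer too small: report an error, never truncate silently */
        memcpy(buf, name, sizeof(name));
    }
    return 0;
}

/* Query the size first, then allocate exactly that much. The caller frees the result. */
char *get_device_name(void)
{
    size_t needed = 0;

    if (query_device_name(0, NULL, &needed) != 0 || needed == 0)
        return NULL;

    char *buf = malloc(needed);
    if (!buf)
        return NULL;

    if (query_device_name(needed, buf, NULL) != 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

An application that instead passes a fixed 16- or 64-byte array and ignores the return value is the kind of code the two linked pull requests fixed.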
Comment 10 Juan A. Suarez 2018-10-31 18:59:20 UTC
Mesa 18.2.4 has been released. Could you check if that version fixes this bug? If so, please, close it.
Comment 11 jamespharvey20 2018-11-01 04:33:33 UTC
When I originally filed this, I assumed it was 1 bug since I tried 2 things with OpenCL, and both failed with opencl-mesa but worked with opencl-amd.

Jan Vesely was correct that there were two separate problems.

I'm hoping Jan Vesely can give guidance on whether to leave this bug open for any of the reasons below, or if I should close it and potentially open up 1-2 new bugs.

The original luxmark bug (segfault) is solved, but that exposes 2 new opencl-mesa bugs when running luxmark.

The original IndigoBenchmark bug (segfault) isn't solved, but as explained below, I understand if we have to consider that unsolvable for now.

I don't think this affects any of these bugs, but I'll mention a few weeks ago, I switched back to my Asus Radeon R9 390.  The same behaviors discussed in this entire bug report occur.  (i.e. 18.2.3 and before crash luxmark.)  If someone really wants me to do so, I can switch back to the RX 580 to test 18.2.4, but I'm betting since it works properly with the R9 390 that the problem is fixed.

ORIGINAL LUXMARK BUG #1
-----------------------------------------

Using mesa 18.2.4, the luxmark segfault is solved.

NEW - LUXMARK BUG #2
------------------------------------

Jan Vesely's comment on 2018-10-09 mentions: "bumping MAX_GLOBAL_BUFFERS to 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc rounding conversions are incorrect."

That's what I'm seeing out of 18.2.4.  Using LuxBall HDR (Simple Benchmark):

MESA 18.2.4: 40626 (Image validation OK (65739 different pixels, 10.27%)

AMDGPU-PRO: 15739 (Image validation OK (5736 different pixels, 0.90%)

There are no typos there.  opencl-mesa scores almost unbelievably higher than opencl-amd, but the different pixels percentage increases by a factor of 11.4.
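
As a quick check of the arithmetic, the sketch below recomputes the ratios from the two result lines above. The helper name `ratio` is just for illustration; the inputs are the figures quoted in this comment.

```c
/* Plain ratio of two measurements; b must be non-zero. */
double ratio(double a, double b)
{
    return a / b;
}

/*
 * Using the figures quoted above:
 *   ratio(10.27, 0.90)        is the increase in the "different pixels" percentage
 *   ratio(65739.0, 5736.0)    is the same increase computed from raw pixel counts
 *   ratio(40626.0, 15739.0)   is how much higher the opencl-mesa score is
 */
```

The first two come out at roughly 11.4 and 11.5 (the small gap is rounding in the reported percentages), and the score ratio is roughly 2.6.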

As Jan's other comment on 2018-10-09 mentions, the image looks garbled and the results are incorrect.

Not sure if this bug should be left open for this issue, or if I should create a new bug.  (Or, if there is a bug already open for it.)  Or, if mesa will say it's purely libclc's problem, and to go to them about it.

NEW - LUXMARK BUG #3
------------------------------------

Although luxmark can now benchmark, when doing so, all input becomes unusably awful.  It reminds me of when Windows has too many things open, suddenly decides it can't cope, and you're waiting to see if it's going to recover or crash.  Keystrokes take too long to be printed, and the mouse becomes slow and jumpy.  Top shows cpu and memory usage are fine, which was my first thought.  BTW, running xf86-video-amdgpu 18.1.0, and when I upgraded mesa, it was both mesa and opencl-mesa.

In comparison, if I use opencl-amd, input is not affected.  I wouldn't even know the GPU is being slammed.

Using the program radeontop, I can see when using mesa, "Graphics pipe", "Texture Addresser", and "Shader Interpolator" are between 95-100%, usually 98-100%.

When using opencl-amd, radeontop shows the same.  (Granted, Vertex Grouper + Tesselator / Shader Export/Scan Converter/Depth Block/Color Block bounce between 5-20% vs on opencl-mesa, they bounce between 1-5%.)

INDIGO BUG
------------------

I edited 18.2.4's si_get.c to be very short:

    snprintf(sscreen->renderer_string, sizeof(sscreen->renderer_string),
       "%s",
       chip_name);

And compiled/installed it, but it didn't affect the crash.

IndigoBenchmark said they're statically linking with LLVM 3.4, which is quite old.  But, it runs fine with opencl-amd, and only crashes on opencl-mesa.  I just posted a followup "where do we go from here"-ish comment there which has to be moderator approved so isn't showing yet. 
 https://www.indigorenderer.com/forum/viewtopic.php?f=37&t=14986

Part of me thinks it needs to be given up on, being a closed-source precompiled binary statically linked against LLVM 3.4.

Part of me thinks since it only crashes with opencl-mesa, and runs perfectly fine with opencl-amd, there's probably (but not definitely) a bug in opencl-mesa.

But, I understand since they don't seem to be paying this any attention, we may have to give up on the Indigo Bug as being unable to be realistically investigated further.
Comment 12 Jan Vesely 2018-12-17 17:46:12 UTC
Hi,

Sorry for the delay. Somehow I missed the notifications.
(In reply to jamespharvey20 from comment #11)
> When I originally filed this, I assumed it was 1 bug since I tried 2 things
> with OpenCL, and both failed with opencl-mesa but worked with opencl-amd.
> 
> Jan Vesely was correct that there were two separate problems.
> 
> I'm hoping Jan Vesely can give guidance on whether to leave this bug open
> for any of the reasons below, or if I should close it and potentially open
> up 1-2 new bugs.
> 
> The original luxmark bug (segfault) is solved, but that exposes 2 new
> opencl-mesa bugs when running luxmark.
> 
> The original IndigoBenchmark bug (segfault) isn't solved, but as explained
> below, I understand if we have to consider that unsolvable for now.
> 
> I don't think this affects any of these bugs, but I'll mention a few weeks
> ago, I switched back to my Asus Radeon R9 390.  The same behaviors discussed
> in this entire bug report occur.  (i.e. 18.2.3 and before crash luxmark.) 
> If someone really wants me to do so, I can switch back to the RX 580 to test
> 18.2.4, but I'm betting since it works properly with the R9 390 that the
> problem is fixed.
> 
> ORIGINAL LUXMARK BUG #1
> -----------------------------------------
> 
> Using mesa 18.2.4, the luxmark segfault is solved.

As this was the first bug, I'd close this one and open new bugs for both indigo and incorrect rendering in luxmark.

> 
> NEW - LUXMARK BUG #2
> ------------------------------------
> 
> Jan Vesely's comment on 2018-10-09 mentions: "bumping MAX_GLOBAL_BUFFERS to
> 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc
> rounding conversions are incorrect."
> 
> That's what I'm seeing out of 18.2.4.  Using LuxBall HDR (Simple Benchmark):
> 
> MESA 18.2.4: 40626 (Image validation OK (65739 different pixels, 10.27%)
> 
> AMDGPU-PRO: 15739 (Image validation OK (5736 different pixels, 0.90%)
> 
> There are no typos there.  opencl-mesa scores almost unbelievably higher than
> opencl-amd, but the different pixels percentage increases by a factor of
> 11.4.
> 
> As Jan's other comment on 2018-10-09 mentions, the image looks garbled and
> the results are incorrect.
> 
> Not sure if this bug should be left open for this issue, or if I should
> create a new bug.  (Or, if there is a bug already open for it.)  Or, if mesa
> will say it's purely libclc's problem, and to go to them about it.

I'd say this is probably a purely libclc problem, but feel free to open the bug against clover on freedesktop. 10% is rather good; I usually saw ~30% wrong pixels on my machines.

> 
> NEW - LUXMARK BUG #3
> ------------------------------------
> 
> Although luxmark can now benchmark, when doing so, all input becomes
> unusably awful.  It reminds me of when Windows has too many things open,
> suddenly decides it can't cope, and you're waiting to see if it's going to
> recover or crash.  Keystrokes take too long to be printed, and the mouse
> becomes slow and jumpy.  Top shows cpu and memory usage are fine, which was
> my first thought.  BTW, running xf86-video-amdgpu 18.1.0, and when I
> upgraded mesa, it was both mesa and opencl-mesa.
> 
> In comparison, if I use opencl-amd, input is not affected.  I wouldn't even
> know the GPU is being slammed.
> 
> Using the program radeontop, I can see when using mesa, "Graphics pipe",
> "Texture Addresser", and "Shader Interpolator" are between 95-100%, usually
> 98-100%.
> 
> When using opencl-amd, radeontop shows the same.  (Granted, Vertex Grouper +
> Tesselator / Shader Export/Scan Converter/Depth Block/Color Block bounce
> between 5-20% vs on opencl-mesa, they bounce between 1-5%.)

This sounds like a GPU priority/scheduling problem. I haven't looked into whether it can be solved by opening a lower-priority pipe for compute, or whether we need to enable advanced features like CWSR. Please open a separate bug. Hogging a large portion of the GPU might explain some of that high score.

> 
> INDIGO BUG
> ------------------
> 
> I edited 18.2.4's si_get.c to be very short:
> 
>     snprintf(sscreen->renderer_string, sizeof(sscreen->renderer_string),
>        "%s",
>        chip_name);
> 
> And compiled/installed it, but it didn't affect the crash.
> 
> IndigoBenchmark said they're statically linking with LLVM 3.4, which is
> quite old.  But, it runs fine with opencl-amd, and only crashes on
> opencl-mesa.  I just posted a followup "where do we go from here"-ish
> comment there which has to be moderator approved so isn't showing yet. 
>  https://www.indigorenderer.com/forum/viewtopic.php?f=37&t=14986
> 
> Part of me thinks it needs to be given up on, being a closed-source
> precompiled binary statically linked against LLVM 3.4.
> 
> Part of me thinks since it only crashes with opencl-mesa, and runs perfectly
> fine with opencl-amd, there's probably (but not definitely) a bug in
> opencl-mesa.
> 
> But, I understand since they don't seem to be paying this any attention, we
> may have to give up on the Indigo Bug as being unable to be realistically
> investigated further.

Can you check if indigo exports any LLVM symbols? It might be that we end up using those instead of the new ones from libLLVM.*
If that's the case, one solution would be to link mesa/clover with static LLVM.
Enabling symbol versioning for LLVM should work as well.

