| Field | Value |
|---|---|
| Summary | [polaris10] opencl-mesa: Anything using OpenCL segfaults, XFX Radeon RX 580 |
| Product | Mesa |
| Reporter | jamespharvey20 |
| Component | Drivers/Gallium/radeonsi |
| Assignee | Default DRI bug account <dri-devel> |
| Status | RESOLVED MOVED |
| QA Contact | Default DRI bug account <dri-devel> |
| Severity | critical |
| Priority | medium |
| Version | 18.2 |
| Hardware | x86-64 (AMD64) |
| OS | All |
| Whiteboard | |
| i915 platform | |
| i915 features | |
| Bug Depends on | |
| Bug Blocks | 99553 |

Attachments:
clinfo seems happy
glxinfo output
gdb backtrace of luxmark with opencl-mesa 18.2.2 in debug mode
gdb backtrace of luxmark with opencl-mesa 18.2.2 Arch binary (not debug, no symbols)
gdb backtrace of IndigoBenchmark

Description
jamespharvey20
2018-10-08 08:26:57 UTC
Created attachment 141932 [details]
glxinfo output
Created attachment 141933 [details]
gdb backtrace of luxmark with opencl-mesa 18.2.2 in debug mode
Created attachment 141934 [details]
gdb backtrace of luxmark with opencl-mesa 18.2.2 Arch binary (not debug, no symbols)
Created attachment 141935 [details]
gdb backtrace of IndigoBenchmark
These look like two separate problems.

The luxmark failure is known. Luxmark requires more global buffers than the 22 currently supported by radeonsi. Without asserts (src/gallium/drivers/radeonsi/si_compute.c:298), it accesses the global buffer array out of bounds. Just bumping MAX_GLOBAL_BUFFERS to 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc rounding conversions are incorrect.

The second problem is harder to assess. Since platform evaluation works OK with clinfo, the failure seems to be in LLVM initialization code. Is IndigoBenchmark linking to LLVM (directly or via OpenGL)? If yes, is it linked to the same version as clover?

Understood about Luxmark.

It doesn't look like it links against LLVM directly, but it links against libGL, libGLX, and libGLdispatch. Arch is on LLVM 7.0.0-1, and I wouldn't be surprised if that is newer than what IndigoBenchmark had on whatever distribution they compiled on. I'm reaching out to them to see if I can get their attention to come here, or answer what they linked against.

IndigoBenchmark does work with opencl-amd, but I understand that maybe it doesn't link against, or at least doesn't use, the LLVM initialization code in the same way.

I made a post on their support forum (indigorenderer.com/forum), which is pending moderator approval. Hopefully I can get them to share more information.

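The question of whether the binary links LLVM dynamically can be checked by filtering `ldd` output. The following is a minimal sketch, not a definitive diagnostic: the helper names (`parse_ldd`, `find_libs`, `ldd`) are made up for illustration, and the binary path is whatever the caller supplies. One caveat, as far as I understand the loader: `ldd` only lists link-time dependencies, so a driver that libGL loads at runtime with `dlopen` (and any LLVM it pulls in) will not show up here.

```python
import subprocess


def parse_ldd(ldd_output: str) -> dict:
    """Map each library name in `ldd` output to its resolved path (or None)."""
    libs = {}
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            name, rest = line.split("=>", 1)
            # Drop the trailing "(0x...)" load address.
            path = rest.split("(", 1)[0].strip() or None
        else:
            # Entries such as linux-vdso.so.1 have no "=> path" part.
            name, path = line.split("(", 1)[0], None
        libs[name.strip()] = path
    return libs


def find_libs(libs: dict, needle: str) -> list:
    """Return library names containing `needle`, case-insensitively."""
    return sorted(n for n in libs if needle.lower() in n.lower())


def ldd(binary: str) -> str:
    """Run `ldd` on a binary and return its text output."""
    result = subprocess.run(["ldd", binary], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

For example, `find_libs(parse_ldd(ldd("./indigo_benchmark")), "LLVM")` would list any directly linked LLVM libraries. An empty result only rules out dynamic linking, not a statically linked copy.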
$ ldd ./indigo_benchmark
    linux-vdso.so.1 (0x00007fffb2f65000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f761adda000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f761adb9000)
    libz.so.1 => /usr/lib/libz.so.1 (0x00007f761aba2000)
    libpng12.so.0 => /usr/lib/libpng12.so.0 (0x00007f761a979000)
    libQt5Gui.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Gui.so.5 (0x00007f761a404000)
    libQt5Core.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Core.so.5 (0x00007f7619e5d000)
    libQt5Widgets.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Widgets.so.5 (0x00007f76197fd000)
    libQt5Network.so.5 => /home/jamespharvey20/Downloads/IndigoBenchmark_x64_v4.0.64/./libQt5Network.so.5 (0x00007f76196f1000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f7619562000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007f76193dd000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f76193c3000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f76191ff000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f761ae0c000)
    libGL.so.1 => /usr/lib/libGL.so.1 (0x00007f761916a000)
    librt.so.1 => /usr/lib/librt.so.1 (0x00007f7619160000)
    libGLX.so.0 => /usr/lib/libGLX.so.0 (0x00007f761912d000)
    libX11.so.6 => /usr/lib/libX11.so.6 (0x00007f7618fef000)
    libXext.so.6 => /usr/lib/libXext.so.6 (0x00007f7618ddd000)
    libGLdispatch.so.0 => /usr/lib/libGLdispatch.so.0 (0x00007f7618d1f000)
    libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00007f7618cf5000)
    libXau.so.6 => /usr/lib/libXau.so.6 (0x00007f7618af1000)
    libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00007f76188eb000)

Regarding Luxmark, do you know whether, with MAX_GLOBAL_BUFFERS set to 32, it is merely that many pixels will be shown wrong? Are the benchmark results valid, or with the wrong pixels, are the results garbage?

(In reply to jamespharvey20 from comment #7)
> Regarding Luxmark, do you know with MAX_GLOBAL_BUFFERS set to 32, is it
> merely that many pixels will be shown wrong? Are the benchmark results
> valid, or with the wrong pixels, are the results garbage?

The image looks slightly garbled to me; the results say that ~30% of pixels are incorrect on my raven GPU. I only checked luxball.

Hm, I thought I sent this out yesterday...

The luxball issue should be fixed by 06bf56725db1827dfcb86b1d0bcd71d195fda1d2 ("radeonsi: Bump number of allowed global buffers to 32").

The indigo benchmark crash might just be a manifestation of earlier memory corruption. It has been the case before that OpenCL apps don't allocate a large enough buffer for the device name [0,1]. clover uses rather lengthy device names (~80 chars in your case). You can try modifying the name string in src/gallium/drivers/radeonsi/si_get.c:964 to see if it helps hide the issue.

[0] https://github.com/JPaulMora/Pyrit/pull/572/files
[1] https://github.com/Theano/libgpuarray/pull/531/files

Mesa 18.2.4 has been released. Could you check if that version fixes this bug? If so, please close it.

When I originally filed this, I assumed it was 1 bug since I tried 2 things with OpenCL, and both failed with opencl-mesa but worked with opencl-amd.

Jan Vesely was correct that there were two separate problems.

I'm hoping Jan Vesely can give guidance on whether to leave this bug open for any of the reasons below, or if I should close it and potentially open up 1-2 new bugs.

The original luxmark bug (segfault) is solved, but that exposes 2 new opencl-mesa bugs when running luxmark.

The original IndigoBenchmark bug (segfault) isn't solved, but as explained below, I understand if we have to consider that unsolvable for now.

I don't think this affects any of these bugs, but I'll mention that a few weeks ago, I switched back to my Asus Radeon R9 390. The same behaviors discussed in this entire bug report occur (i.e. 18.2.3 and before crash luxmark). If someone really wants me to do so, I can switch back to the RX 580 to test 18.2.4, but I'm betting since it works properly with the R9 390 that the problem is fixed.

ORIGINAL LUXMARK BUG #1
-----------------------------------------

Using mesa 18.2.4, the luxmark segfault is solved.

NEW - LUXMARK BUG #2
------------------------------------

Jan Vesely's comment on 2018-10-09 mentions: "bumping MAX_GLOBAL_BUFFERS to 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc rounding conversions are incorrect."

That's what I'm seeing out of 18.2.4. Using LuxBall HDR (Simple Benchmark):

MESA 18.2.4: 40626 (Image validation OK (65739 different pixels, 10.27%))

AMDGPU-PRO: 15739 (Image validation OK (5736 different pixels, 0.90%))

There are no typos there. opencl-mesa scores almost unbelievably higher than opencl-amd, but the different pixels percentage increases by a factor of 11.4.

As Jan's other comment on 2018-10-09 mentions, the image looks garbled and the results are incorrect.

Not sure if this bug should be left open for this issue, or if I should create a new bug. (Or if there is a bug already open for it.) Or if mesa will say it's purely libclc's problem, and to go to them about it.

NEW - LUXMARK BUG #3
------------------------------------

Although luxmark can now benchmark, when doing so, all input becomes unusably awful. It reminds me of when Windows has too many things open, suddenly decides it can't cope, and you're waiting to see if it's going to recover or crash. Keystrokes take too long to be printed, and the mouse becomes slow and jumpy. Top shows CPU and memory usage are fine, which rules out my first suspicion. BTW, I'm running xf86-video-amdgpu 18.1.0, and when I upgraded mesa, it was both mesa and opencl-mesa.

In comparison, if I use opencl-amd, input is not affected. I wouldn't even know the GPU is being slammed.

Using the program radeontop, I can see that when using mesa, "Graphics pipe", "Texture Addresser", and "Shader Interpolator" are between 95-100%, usually 98-100%.

When using opencl-amd, radeontop shows the same.
(Granted, Vertex Grouper + Tesselator / Shader Export / Scan Converter / Depth Block / Color Block bounce between 5-20% with opencl-amd, versus 1-5% with opencl-mesa.)

INDIGO BUG
------------------

I edited 18.2.4's si_get.c to make the renderer string very short:

    snprintf(sscreen->renderer_string, sizeof(sscreen->renderer_string),
             "%s",
             chip_name);

I compiled and installed it, but it didn't affect the crash.

IndigoBenchmark said they're statically linking with LLVM 3.4, which is quite old. But it runs fine with opencl-amd, and only crashes on opencl-mesa. I just posted a followup "where do we go from here"-ish comment there, which has to be moderator approved so isn't showing yet. https://www.indigorenderer.com/forum/viewtopic.php?f=37&t=14986

Part of me thinks it needs to be given up on, being a closed-source precompiled binary statically linked against LLVM 3.4.

Part of me thinks since it only crashes with opencl-mesa, and runs perfectly fine with opencl-amd, there's probably (but not definitely) a bug in opencl-mesa.

But I understand that since they don't seem to be paying this any attention, we may have to give up on the Indigo Bug as being unable to be realistically investigated further.

Hi, sorry for the delay. Somehow I missed the notifications.

(In reply to jamespharvey20 from comment #11)
> When I originally filed this, I assumed it was 1 bug since I tried 2 things
> with OpenCL, and both failed with opencl-mesa but worked with opencl-amd.
>
> Jan Vesely was correct that there were two separate problems.
>
> I'm hoping Jan Vesely can give guidance on whether to leave this bug open
> for any of the reasons below, or if I should close it and potentially open
> up 1-2 new bugs.
>
> The original luxmark bug (segfault) is solved, but that exposes 2 new
> opencl-mesa bugs when running luxmark.
>
> The original IndigoBenchmark bug (segfault) isn't solved, but as explained
> below, I understand if we have to consider that unsolvable for now.
>
> I don't think this affects any of these bugs, but I'll mention a few weeks
> ago, I switched back to my Asus Radeon R9 390. The same behaviors discussed
> in this entire bug report occur. (i.e. 18.2.3 and before crash luxmark.)
> If someone really wants me to do so, I can switch back to the RX 580 to test
> 18.2.4, but I'm betting since it works properly with the R9 390 that the
> problem is fixed.
>
> ORIGINAL LUXMARK BUG #1
> -----------------------------------------
>
> Using mesa 18.2.4, the luxmark segfault is solved.

As this was the first bug, I'd close this one and open new bugs for both indigo and incorrect rendering in luxmark.

>
> NEW - LUXMARK BUG #2
> ------------------------------------
>
> Jan Vesely's comment on 2018-10-09 mentions: "bumping MAX_GLOBAL_BUFFERS to
> 32 allows luxmark to run, albeit still with many incorrect pixels -- libclc
> rounding conversions are incorrect."
>
> That's what I'm seeing out of 18.2.4. Using LuxBall HDR (Simple Benchmark):
>
> MESA 18.2.4: 40626 (Image validation OK (65739 different pixels, 10.27%)
>
> AMDGPU-PRO: 15739 (Image validation OK (5736 different pixels, 0.90%)
>
> There's no typos there. opencl-mesa scores almost unbelievably higher than
> opencl-amd, but the different pixels percentage increases by a factor of
> 11.4.
>
> As Jan's other comment on 2018-10-09 mentions, the image looks garbled and
> the results are incorrect.
>
> Not sure if this bug should be left open for this issue, or if I should
> create a new bug. (Or, if there is a bug already open for it.) Or, if mesa
> will say it's purely libclc's problem, and to go to them about it.

I'd say this is probably a purely libclc problem, but feel free to open the bug against clover on freedesktop. 10% is rather good; I usually saw ~30% wrong pixels on my machines.

>
> NEW - LUXMARK BUG #3
> ------------------------------------
>
> Although luxmark can now benchmark, when doing so, all input becomes
> unusably awful.
> It reminds me of when Windows has too many things open,
> suddenly decided it can't cope, and you're waiting to see if it's going to
> recover or crash. Keystrokes take too long to be printed, and the mouse
> becomes slow and jumpy. Top shows cpu and memory usage are fine, which was
> my first thought. BTW, running xf86-video-amdgpu 18.1.0, and when I
> upgraded mesa, it was both mesa and opencl-mesa.
>
> In comparison, if I use opencl-amd, input is not affected. I wouldn't even
> know the GPU is being slammed.
>
> Using the program radeontop, I can see when using mesa, "Graphics pipe",
> "Texture Addresser", and "Shader Interpolator" are between 95-100%, usually
> 98-100%.
>
> When using opencl-amd, radeontop shows the same. (Granted, Vertex Grouper +
> Tesselator / Shader Export/Scan Converter/Depth Block/Color Block bounce
> between 5-20% vs on opencl-mesa, they bounce between 1-5%.)

This sounds like a GPU priority/scheduling problem. I haven't looked into whether it can be solved by opening a lower-priority pipe for compute, or whether we need to enable advanced features like CWSR. Please open a separate bug. Hogging a large portion of the GPU might explain some of that high score.

>
> INDIGO BUG
> ------------------
>
> I edited 18.2.4's si_get.c to be very short:
>
> snprintf(sscreen->renderer_string, sizeof(sscreen->renderer_string),
> "%s",
> chip_name);
>
> And compiled/installed it, but it didn't affect the crash.
>
> IndigoBenchmark said they're statically linking with LLVM 3.4, which is
> quite old. But, it runs fine with opencl-amd, and only crashes on
> opencl-mesa. I just posted a followup "where do we go from here"-ish
> comment there which has to be moderator approved so isn't showing yet.
> https://www.indigorenderer.com/forum/viewtopic.php?f=37&t=14986
>
> Part of me thinks it needs to be given up on, being a closed-source
> precompiled binary statically linked against LLVM 3.4.
>
> Part of me thinks since it only crashes with opencl-mesa, and runs perfectly
> fine with opencl-amd, there's probably (but not definitely) a bug in
> opencl-mesa.
>
> But, I understand since they don't seem to be paying this any attention, we
> may have to give up on the Indigo Bug as being unable to be realistically
> investigated further.

Can you check if indigo exports any LLVM symbols? It might be that we end up using those instead of the new ones from libLLVM.* If that's the case, one solution would be to link mesa/clover with static LLVM. Enabling symbol versioning for LLVM should work as well.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1333.
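The last technical question in the thread, whether the Indigo binary exports LLVM symbols, could be checked roughly as follows. This is a sketch under stated assumptions: GNU `nm` is available, the helper names are invented for illustration, and the name filter is a heuristic (LLVM C API names start with `LLVM`; Itanium-mangled C++ names in namespace `llvm` contain `4llvm`). The symbol names in the usage comment are illustrative samples, not output from the actual binary.

```python
import subprocess


def exported_llvm_symbols(nm_output: str) -> list:
    """Return defined dynamic symbols that look like LLVM C or C++ names."""
    found = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols print as "<address> <type> <name>";
        # undefined ones have no address, so they yield two fields.
        if len(fields) != 3:
            continue
        _addr, sym_type, name = fields
        if sym_type in ("U", "w", "v"):
            continue
        if name.startswith("LLVM") or "4llvm" in name:
            found.append(name)
    return found


def dynamic_symbols(binary: str) -> str:
    """Run `nm -D --defined-only` on a binary and return its text output."""
    result = subprocess.run(["nm", "-D", "--defined-only", binary],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical usage:
#   exported_llvm_symbols(dynamic_symbols("./indigo_benchmark"))
# A non-empty list would support the symbol-clash theory; an empty list
# would suggest the statically linked LLVM is not visible to the loader.
```

A non-empty result would make the static-LLVM or symbol-versioning workarounds suggested above worth trying.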