Bug 108879 - [CIK] [regression] All OpenCL apps hang indefinitely in si_create_context
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: 18.3
Hardware: All All
Importance: medium critical
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Duplicates: 108572
Blocks: 99553
Reported: 2018-11-27 12:48 UTC by Vedran Miletić
Modified: 2019-04-22 20:07 UTC (History)

Attachments
clinfo output with mesa-opencl-icd downgraded to 18.2.8 (6.45 KB, text/plain)
2019-01-03 17:24 UTC, Vasily Galkin
Stacktrace from debian's 18.3.0 build (4.84 KB, text/plain)
2019-01-03 17:30 UTC, Vasily Galkin

Description Vedran Miletić 2018-11-27 12:48:26 UTC
Bisected to 9b331e462e5021d994859756d46cd2519d9c9c6e.

Tracing calls done by clinfo and pressing Ctrl+C gives:

#0  0x00007ffff7e9cead in syscall () from /lib64/libc.so.6
#1  0x00007ffff143422c in sys_futex (val3=-1, addr2=0x0, timeout=0x0, val1=2, op=9, addr1=0x555555899b08) at ../../src/util/futex.h:50                                                
#2  futex_wait (timeout=0x0, value=2, addr=0x555555899b08) at ../../src/util/futex.h:50
#3  do_futex_fence_wait (fence=fence@entry=0x555555899b08, timeout=timeout@entry=false, abs_timeout=abs_timeout@entry=0) at u_queue.c:115                                             
#4  0x00007ffff1434739 in _util_queue_fence_wait (fence=fence@entry=0x555555899b08) at u_queue.c:130                                                                                  
#5  0x00007ffff146cd79 in util_queue_fence_wait (fence=0x555555899b08) at ../../../../src/util/u_queue.h:161                                                                          
#6  si_bind_compute_state (ctx=0x5555558648c0, state=0x555555899af0) at si_compute.c:277
#7  0x00007ffff146eb79 in si_compute_do_clear_or_copy (sctx=sctx@entry=0x5555558648c0, dst=dst@entry=0x5555558a3910, dst_offset=dst_offset@entry=0, src=src@entry=0x0,                
    src_offset=src_offset@entry=0, size=size@entry=16, clear_value=0x7fffffffcea0, clear_value_size=4, coher=SI_COHERENCY_SHADER) at si_compute_blit.c:131                            
#8  0x00007ffff146ecb0 in si_clear_buffer (sctx=sctx@entry=0x5555558648c0, dst=0x5555558a3910, offset=offset@entry=0, size=16, clear_value=clear_value@entry=0x7fffffffcea0,          
    clear_value_size=clear_value_size@entry=4, coher=SI_COHERENCY_SHADER) at si_compute_blit.c:217                                                                                    
#9  0x00007ffff147de40 in si_create_context (screen=screen@entry=0x5555555e07c0, flags=flags@entry=0) at si_pipe.c:624                                                                
#10 0x00007ffff147e8d1 in radeonsi_screen_create (ws=<optimized out>, config=<optimized out>) at si_pipe.c:1123                                                                       
#11 0x00007ffff143edfa in radeon_drm_winsys_create (fd=fd@entry=4, config=config@entry=0x7fffffffd038, screen_create=screen_create@entry=0x7ffff147e1c0 <radeonsi_screen_create>)     
    at radeon_drm_winsys.c:941
#12 0x00007ffff131b49d in create_screen (fd=4, config=0x7fffffffd038) at pipe_radeonsi.c:18                                                                                           
#13 0x00007ffff7cc606b in pipe_loader_create_screen (dev=0x5555555ccc80) at pipe_loader.c:137                                                                                         
#14 0x00007ffff7ce744c in clover::device::device (this=0x5555555b9f50, platform=..., ldev=<optimized out>) at core/device.cpp:47                                                      
#15 0x00007ffff7cf1584 in clover::create<clover::device, clover::platform&, pipe_loader_device*&> () at ./util/pointer.hpp:229                                                        
#16 clover::platform::platform (this=0x7ffff7d9edc0 <(anonymous namespace)::_clover_platform>) at core/platform.cpp:36                                                                
#17 0x00007ffff7cc5a26 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at api/platform.cpp:141                                                      
#18 _GLOBAL__sub_I_platform.cpp(void) () at api/platform.cpp:141
#19 0x00007ffff7fe1dea in call_init.part () from /lib64/ld-linux-x86-64.so.2
#20 0x00007ffff7fe1eea in _dl_init () from /lib64/ld-linux-x86-64.so.2
#21 0x00007ffff7fe5edf in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#22 0x00007ffff7edf0d7 in _dl_catch_exception () from /lib64/libc.so.6
#23 0x00007ffff7fe574e in _dl_open () from /lib64/ld-linux-x86-64.so.2
#24 0x00007ffff7f6c39a in dlopen_doit () from /lib64/libdl.so.2
#25 0x00007ffff7edf0d7 in _dl_catch_exception () from /lib64/libc.so.6
#26 0x00007ffff7edf173 in _dl_catch_error () from /lib64/libc.so.6
#27 0x00007ffff7f6caf9 in _dlerror_run () from /lib64/libdl.so.2
#28 0x00007ffff7f6c43a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#29 0x00007ffff7f76d11 in _initClIcd_real () from /lib64/libOpenCL.so.1
#30 0x00007ffff7f7935c in clGetPlatformIDs () from /lib64/libOpenCL.so.1
#31 0x000055555555a4ba in main ()

Removing the CIK-specific function call under /* Clear the NULL constant buffer, because loads should return zeros. */ makes clinfo work normally.
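
The trace above shows the calling thread parked in a futex wait on a fence, with no timeout. As a rough illustration only (a toy Python model, not Mesa's u_queue code; `Fence` and `run_job` are hypothetical names), a wait without a timeout on a fence that nothing signals never returns, whereas a bounded wait makes the same condition observable:

```python
import threading


class Fence:
    """Toy stand-in for a queue fence: unsignalled until a job completes."""

    def __init__(self):
        self._event = threading.Event()

    def signal(self):
        self._event.set()

    def wait(self, timeout=None):
        # timeout=None blocks indefinitely, like the timeout-less wait in the trace.
        # Returns True if the fence was signalled, False if the timeout expired.
        return self._event.wait(timeout)


def run_job(fence, job):
    """Hypothetical worker body: run the job, then signal its fence."""
    job()
    fence.signal()


# Case 1: a worker thread runs the job, so the wait returns.
done = Fence()
worker = threading.Thread(target=run_job, args=(done, lambda: None))
worker.start()
signalled = done.wait(timeout=5.0)
worker.join()

# Case 2: nothing ever signals the fence; only the timeout lets the caller return.
stuck = Fence()
timed_out = not stuck.wait(timeout=0.1)

print(signalled, timed_out)
```

This says nothing about why the fence in the driver is never signalled; it only models the symptom seen in frames #0 to #6.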
Comment 1 Vedran Miletić 2018-11-27 13:50:12 UTC
Confirmed on both:

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R5 Graphics] [1002:1315]
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT GL [FirePro W9100] [1002:67a0]
Comment 2 Vasily Galkin 2019-01-03 17:22:35 UTC
Hit the same problem on a CIK GPU: clinfo has hung at startup since 18.3.0.
The stack trace is the same: the sys_futex call never returns.

The issue reproduces every time. Most importantly, it affects ALL of the OpenCL applications I tried (clinfo, a fresh manual build of https://github.com/ihaque/memtestCL, and the closed-source Geeks3D GpuTest). They all hang at initialization with a similar stack trace.
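
One way to check whether two hangs share the same signature is to reduce each gdb backtrace to its list of function names. The following is a minimal sketch under that assumption; the helper names `frame_functions` and `waits_on_fence_during_context_creation` are made up for illustration, and the excerpt reuses a few frames from the description:

```python
import re

# Matches gdb frame lines with or without a leading address, for example:
#   "#6  si_bind_compute_state (ctx=...) at si_compute.c:277"
#   "#0  0x00007ffff7e9cead in syscall () from /lib64/libc.so.6"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.+?) \(")

EXCERPT = """\
#0  0x00007ffff7e9cead in syscall () from /lib64/libc.so.6
#2  futex_wait (timeout=0x0, value=2, addr=0x555555899b08) at ../../src/util/futex.h:50
#5  0x00007ffff146cd79 in util_queue_fence_wait (fence=0x555555899b08) at ../../../../src/util/u_queue.h:161
#6  si_bind_compute_state (ctx=0x5555558648c0, state=0x555555899af0) at si_compute.c:277
#9  0x00007ffff147de40 in si_create_context (screen=screen@entry=0x5555555e07c0, flags=flags@entry=0) at si_pipe.c:624
"""


def frame_functions(backtrace):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # continuation lines of wrapped frames are skipped
            names.append(match.group(2))
    return names


def waits_on_fence_during_context_creation(names):
    """True if a fence wait appears inside (below) si_create_context."""
    if "util_queue_fence_wait" not in names or "si_create_context" not in names:
        return False
    return names.index("util_queue_fence_wait") < names.index("si_create_context")


names = frame_functions(EXCERPT)
print(names)
print(waits_on_fence_during_context_creation(names))
```

Comparing function names rather than whole lines ignores addresses and argument values, which differ between runs and builds.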

I'm renaming the bug to indicate that all apps are affected.

GPU is

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81) (prog-if 00 [VGA controller])

("Tobago" is a less common variant of the "Bonaire" GPUs.)

Reproduced on two different motherboards (both are PCIe 2.0/1.1, so there are no PCIe 3.0 atomics, in case that is related). Changing kernels within the 4.17-4.20 range makes no difference; for example, vanilla 4.20.0 with the Ubuntu config (4.20.0-042000-generic) can be used to reproduce the issue. The distro makes no difference either: I tried Debian and a vanilla Mesa build on Arch Linux.

Kernel parameters are: BOOT_IMAGE=/boot/vmlinuz-4.20.0-042000-generic root=UUID=7917286f-3223-4003-8d58-a2bff30a7730 ro quiet intel_iommu=on amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1 acpi_enforce_resources=off radeon.si_support=0 radeon.cik_support=0 radeon.modeset=1 nouveau.modeset=1 zswap.enabled=1 zswap.zpool=zsmalloc zswap.compressor=lz4hc zswap.max_pool_percent=42

(intel_iommu is actually DISABLED in the BIOS, so I don't think it is related.)
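
Since the `*.cik_support` switches decide which kernel module is asked to drive a CIK GPU, it can help to read them back mechanically. This is a small sketch that only inspects the command line quoted above; `parse_cmdline` and `cik_kernel_driver` are hypothetical helper names, and the function deliberately does not guess kernel defaults when the switches are absent:

```python
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-4.20.0-042000-generic "
    "root=UUID=7917286f-3223-4003-8d58-a2bff30a7730 ro quiet intel_iommu=on "
    "amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1 "
    "acpi_enforce_resources=off radeon.si_support=0 radeon.cik_support=0 "
    "radeon.modeset=1 nouveau.modeset=1 zswap.enabled=1 zswap.zpool=zsmalloc "
    "zswap.compressor=lz4hc zswap.max_pool_percent=42"
)


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=UUID=..." keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def cik_kernel_driver(params):
    """Report which module the *.cik_support switches select, if they say so."""
    amdgpu = params.get("amdgpu.cik_support")
    radeon = params.get("radeon.cik_support")
    if amdgpu == "1" and radeon == "0":
        return "amdgpu"
    if radeon == "1" and amdgpu == "0":
        return "radeon"
    return "unspecified"


print(cik_kernel_driver(parse_cmdline(CMDLINE)))
```

On a running system the same string can be read from /proc/cmdline instead of being pasted in.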

Unlike OpenCL, both Vulkan and OpenGL work completely fine.

Downgrading mesa-opencl-icd to 18.2.8 fixes the problem.
This deb package downgrades only:
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_nouveau.so
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_r300.so
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_r600.so
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_radeonsi.so
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_swrast.so
/usr/lib/x86_64-linux-gnu/gallium-pipe/pipe_vmwgfx.so
/usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1.0.0

The other Mesa libraries are kept at 18.3.0.
Comment 3 Vasily Galkin 2019-01-03 17:24:34 UTC
Created attachment 142960
clinfo output with mesa-opencl-icd downgraded to 18.2.8
Comment 4 Vasily Galkin 2019-01-03 17:30:00 UTC
Created attachment 142961
Stacktrace from debian's 18.3.0 build

dmesg is completely clean, with no errors at all, and the system keeps working fine. Even the hung clinfo can be interrupted with Ctrl+C.
Comment 5 Jan Vesely 2019-02-11 22:59:21 UTC
(In reply to Vasily Galkin from comment #2)
> GPU is
> 
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81) (prog-if 00 [VGA
> controller])
> 
> ("Tobago" is a less common variant of the "Bonaire" GPUs.)
> 
> Reproduced on two different motherboards (both are PCIe 2.0/1.1, so there
> are no PCIe 3.0 atomics, in case that is related). Changing kernels within
> the 4.17-4.20 range makes no difference; for example, vanilla 4.20.0 with
> the Ubuntu config (4.20.0-042000-generic) can be used to reproduce the
> issue. The distro makes no difference either: I tried Debian and a vanilla
> Mesa build on Arch Linux.
> 
> Kernel parameters are: BOOT_IMAGE=/boot/vmlinuz-4.20.0-042000-generic
> root=UUID=7917286f-3223-4003-8d58-a2bff30a7730 ro quiet intel_iommu=on
> amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1
> acpi_enforce_resources=off radeon.si_support=0 radeon.cik_support=0
> radeon.modeset=1 nouveau.modeset=1 zswap.enabled=1 zswap.zpool=zsmalloc
> zswap.compressor=lz4hc zswap.max_pool_percent=42

Have you tried using the radeon module?
Comment 6 Jan Vesely 2019-02-18 01:48:03 UTC
*** Bug 108572 has been marked as a duplicate of this bug. ***
Comment 7 Jan Vesely 2019-02-18 01:48:55 UTC
Has this patch affected the status?
https://lists.freedesktop.org/archives/mesa-dev/2019-February/215057.html
Comment 8 Marco 2019-02-18 20:37:44 UTC
I applied the two patches:
https://lists.freedesktop.org/archives/mesa-dev/2019-February/215057.html
https://lists.freedesktop.org/archives/mesa-dev/2019-February/215058.html

but the problem persists on my cards:
AMD KABINI (DRM 3.27.0, 4.20.8-bfq-zstd+, LLVM 8.0.0)
AMD Radeon HD 8500M Series (HAINAN, DRM 3.27.0, 4.20.8-bfq-zstd+, LLVM 8.0.0)

The result is the same: clinfo freezes with the same stack trace.
Comment 9 Steffen Klee 2019-04-04 16:51:25 UTC
AMD R9 390 (Linux 4.14, LLVM 8.0.0, AMDGPU kernel driver, Mesa 19.0.1)

Also experiencing hangs when running clinfo and other OpenCL software.
Applying the mentioned patches results in segfaults when starting graphical applications as well as OpenCL software.

However, when just applying the workaround in duplicate bug 108572, comment 6, clinfo and other OpenCL software start working again.
Comment 10 Jan Vesely 2019-04-09 16:05:16 UTC
(In reply to Steffen Klee from comment #9)
> AMD R9 390 (Linux 4.14, LLVM 8.0.0, AMDGPU kernel driver, Mesa 19.0.1)
> 
> Also experiencing hangs when running clinfo and other OpenCL software.
> Applying the mentioned patches results in segfaults when starting
> graphical applications as well as OpenCL software.
> 
> However, when just applying the workaround in duplicate bug 108572, comment
> 6, clinfo and other OpenCL software start working again.

Thanks for the update. Can you try running piglit cl-api-enqueue-copy-buffer after applying the workaround? It might be just an early initialization issue rather than a problem with compute shader clears in general.
Comment 11 Steffen Klee 2019-04-09 23:39:12 UTC
cl-api-enqueue-copy-buffer passes when using the workaround.
Comment 12 Marek Olšák 2019-04-22 20:07:24 UTC
Should be fixed by b58e5fb6f317be771326f98d498483e45942beaf
Closing.

