Summary: | drm_intel_gem_bo_context_exec() failed: Device or resource busy .... even after patch | ||
---|---|---|---|
Product: | Beignet | Reporter: | Michal <developer.m3> |
Component: | Beignet | Assignee: | Zhigang Gong <zhigang.gong> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | rong.r.yang |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
output from clinfo
patch for clFFT to identify wrong execution of OpenCL 1.2 on Haswell |
Description
Michal
2017-04-10 07:11:12 UTC
I have the same error message on Ivybridge [8086:0166] with the jessie-backports (4.9) linux-image, but not with the jessie (3.16) one (which is why I didn't notice before). I haven't tried the patch linked to above. Disabling softpin (by commenting out its CHECK_LIBRARY_EXISTS line in top-level CMakeLists.txt - https://bugs.freedesktop.org/show_bug.cgi?id=98647#c9 ) fixes this; it also disables OpenCL 2.0, but that isn't supported on Ivybridge/Haswell anyway. This might also explain the Linux version dependency, as softpin requires a recent Linux (post https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/drm/i915_drm.h?id=506a8e87d8d2746b9e9d2433503fe237c54e4750). Attempting it on older ones fails without doing anything (http://sources.debian.net/src/libdrm/2.4.74-1/intel/intel_bufmgr_gem.c/#L3658 , http://sources.debian.net/src/libdrm/2.4.74-1/intel/intel_bufmgr.c/?hl=266#L263 ), and beignet doesn't check for failure. (Does this imply a Linux version requirement for OpenCL 2.0, given that we require softpin for SVM, and if so should we check for working softpin on init?) Hi Rebecca, does it mean that Haswell platform should be error-less in some certain conditions ? i didn't notice that state ... i'm not sure if i understood your comment ... please could anybody confirm my issue that it's a bug and it will be analyzed ? It looks like a bug to me, given that it makes beignet mostly unusable (most of the test suite fails, and I suspect the bits that don't might be compile-only tests) when recent linux(>=4.5)+libdrm(>=2.4.66) is installed on older hardware. (I'm guessing this doesn't affect Skylake, given that those are also the minimum versions for OpenCL 2.0, but would appreciate explicit confirmation of this.) The workaround is to delete the line CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_set_softpin_offset" ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_BO_SET_SOFTPIN) from top-level CMakeLists.txt and rebuild (starting from the cmake step). Is this likely to have any negative effects (e.g. on performance) beyond the obvious "disables OpenCL 2"? If not, a potential solution is to use softpin only on 2.0-capable devices, i.e. make HAS_BO_SET_SOFTPIN a runtime check. The numbers in the error message mean "out of resources on the device trying to execute an NDRangeKernel (i.e. an ordinary OpenCL computation)". well ... let's recap ... i have intel i5 4250U - Haswell, intel graphics HD 5000, with OpenCL 1.2, i would like to use clFFT which needs OpenCL 1.X, not OpenCL 2.0 if i compile beignet directly downloaded - as it is - without being patched, version 1.3.1 i get results: ./Beignet-1.3.1-Source/build/utests# ./utest_run ... > test_load_program_from_bin_file()drm_intel_gem_bo_context_exec() failed: Device or resource busy > Beignet: "Exec event 0x1443ba0 error, type is 4592, error status is -5" > [FAILED] > Error: ((float *)buf_data[1])[i] == cpu_dst[i] > at file /root/moje/Beignet-1.3.1-Source/utests/load_program_from_bin_file.cpp, function test_load_program_from_bin_file, line 76 > enqueue_built_in_kernels() [SUCCESS] > builtin_acos_float()drm_intel_gem_bo_context_exec() failed: Device or resource busy > Beignet: "Exec event 0x170cf60 error, type is 4592, error status is -5" > [FAILED] > Error: input_data1:0.000000e+00 -> gpu:0.000000e+00 cpu:1.570796e+00 diff:1.570796e+00 expect:1.907349e-06 > > at file /root/moje/Beignet-1.3.1-Source/utests/generated/builtin_acos_float.cpp, function builtin_acos_float, line 125 > builtin_acos_float2()drm_intel_gem_bo_context_exec() failed: Device or resource busy > Beignet: "Exec event 0xfc0e50 error, type is 4592, error status is -5" > [FAILED] > Error: input_data1:0.000000e+00 -> gpu:1.175566e-38 cpu:1.570796e+00 diff:1.570796e+00 expect:1.907349e-06 then i compile still nonpatched version with commented out line: CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_set_softpin_offset" ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_BO_SET_SOFTPIN) ./Beignet-1.3.1-Source/build/benchmark# ./benchmark_run ... > benchmark_copy_image_uint() [Result: 143.895 FPS] [SUCCESS] > benchmark_filter_image_uchar() [Result: 301.592 FPS] [SUCCESS] > benchmark_filter_image_ushort() [Result: 179.282 FPS] [SUCCESS] > benchmark_filter_image_uint() [Result: 89.425 FPS] [SUCCESS] > benchmark_workgroup_broadcast_1D_int()ASSERTION FAILED: 0 > at file /root/moje/Beignet-1.3.1-Source/backend/src/backend/gen_context.cpp, function virtual void gbe::GenContext::emitUnpackLongInstruction(const gbe::SelectionInstruction&), line 2313 > Trace/breakpoint trap and also clFFT fails , it seems that failure comes out when more than 1 kernel is loaded, single kernel runs correctly ... when i enable HAVE_DRM_INTEL_BO_SET_SOFTPIN and apply that patch i get same results like disabled HAVE_DRM_INTEL_BO_SET_SOFTPIN, Did you try utest_run (not benchmark_run) with the no-softpin version?
> it seems that failure comes out when more than 1 kernel is loaded, single kernel runs correctly
Do you mean before or after disabling softpin (i.e. commenting out that line)? For me (Ivy Bridge), with softpin on *any* kernel triggers "Exec event [...] error" and doesn't produce any results (i.e. as-distributed beignet is totally unusable), and with softpin off I haven't been able to trigger it at all (in particular, running two kernels together is fine).
The "ASSERTION FAILED: 0" in benchmark_run appears to be a separate issue: it tries to run all the benchmarks, including ones that aren't supported on the current hardware.
Hi Rebecca, as i wrote above ... for me it's NOT important to have working utests or benchmarks, and as i wrote : clFFT is not working - that's the point for me, clFFT needs OpenCL 1.X , multikernel load, and this is not working in any configuration of softpin or patch, i can do any test , but requester should be anybody who is able to solve my issue from development side ... i already tried all configurations of Makefiles and patches and it didn't help, and i hope Rebecca that you are not extending this thread to make it hard to study for the developers that should solve this bug, Kind regards, How exactly are you triggering the error with clFFT? Their 1D example https://sources.debian.net/src/clfft/2.12.2-1/src/examples/fft1d.c/ with the size increased to 16000 doesn't crash for me (with softpin disabled; I haven't checked whether the results are correct), but that might be because we have different hardware. please check results ... by the way i use clFFT with size of multiplication of 2 ... it means size 8192 or 16384 or else ... 16000 is not by the rule ... i'm successfully using clFFT on ARM platform with Mali GPU OpenCL 1.0, running smoothly ... Intel Haswell OpenCL 1.X platform errrors and bad results ... (In reply to Michal from comment #8) > please check results ... by the way i use clFFT with size of multiplication > of 2 ... it means size 8192 or 16384 or else ... 16000 is not by the rule > ... > > i'm successfully using clFFT on ARM platform with Mali GPU OpenCL 1.0, > running smoothly ... > Intel Haswell OpenCL 1.X platform errrors and bad results ... How about small size 1D FFT? If "drm_intel_gem_bo_context_exec() failed"only happens when large 1D FFT, maybe GPU timeout, you could disable the timeout check by "echo 0 > /sys/module/i915/parameters/enable_hangcheck". Hi rongyang, after: echo 0 > /sys/module/i915/parameters/enable_hangcheck still same error, single kernel solution works :-( ... single kernel solution is useless for me ... > size 8192 or 16384 Those sizes don't crash for me either (with softpin disabled), and the size 8192 agrees with scipy to within reasonable numerical error (relative error <1 in 10^4 max, <2 in 10^6 mean). I suspect this is a hardware-dependent bug, but to be sure, can you post the exact code you're using to trigger the error? > please check results Does that mean you've seen wrong results *without* the error message? Hi Rebecca,
i added messages to clFFT:
> (err: 0)after clGetPlatformIDs
> (err: 0) Platform found: Intel Gen OCL Driver
> drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x16cf2b0 error, type is 4592, error staus is -5"
> (err: 0)after clGetDeviceIDs
> (err: 0)Device found on the above platform: Intel(R) HD Graphics Haswell Ultrabook GT3 Mobile
> (err: 0) after clCreateContext
>
> Performing fft on an one dimensional array of size N = 8192
> (err: 0)after clCreateBuffer
> (err: 0)after clfftCreateDefaultPlan
> (err: 0)after clfftSetPlanPrecision
> (err: 0)after clfftSetLayout
> (err: 0)after clfftSetResultLocation
> (err: 0)after clfftBakePlan
> (err: 0)after clEnqueueWriteBuffer
> drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x154f9c0 error, type is 4592, error staus is -5"
> clEnqueueNDRangeKernel failed second... -14 params: 1 1024 128 0
> ERROR executing the kernel -5
> OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (659): clEnqueueNDRangeKernel failed
> OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (556): clfftEnqueueTransform large1D second column failed
> (err: -14)after clfftEnqueueTransform
> (err: -14)after clfftEnqueueTransform = CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
> (err: 0)after clFinish
> (err: 0)after clEnqueueReadBuffer
today i found that Skylake (HD Graphics 530 Skylake GT2 on i7-6700HQ) works correctly with multikernel solution,
currently only Haswell does NOT work with multikernel solution ... (that i was able to test)
is there a chance that the bug will be fixed even if it's related to one platform ?
Does this bug also exist in beignet git master (i.e. git clone https://anongit.freedesktop.org/git/beignet.git )? (In reply to Michal from comment #12) > Hi Rebecca, > > i added messages to clFFT: > > > > (err: 0)after clGetPlatformIDs > > (err: 0) Platform found: Intel Gen OCL Driver > > drm_intel_gem_bo_context_exec() failed: Device or resource busy > > Beignet: "Exec event 0x16cf2b0 error, type is 4592, error staus is -5" > > (err: 0)after clGetDeviceIDs > > (err: 0)Device found on the above platform: Intel(R) HD Graphics Haswell Ultrabook GT3 Mobile > > (err: 0) after clCreateContext > > > > Performing fft on an one dimensional array of size N = 8192 > > (err: 0)after clCreateBuffer > > (err: 0)after clfftCreateDefaultPlan > > (err: 0)after clfftSetPlanPrecision > > (err: 0)after clfftSetLayout > > (err: 0)after clfftSetResultLocation > > (err: 0)after clfftBakePlan > > (err: 0)after clEnqueueWriteBuffer > > drm_intel_gem_bo_context_exec() failed: Device or resource busy > > Beignet: "Exec event 0x154f9c0 error, type is 4592, error staus is -5" > > clEnqueueNDRangeKernel failed second... -14 params: 1 1024 128 0 > > ERROR executing the kernel -5 > > OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (659): clEnqueueNDRangeKernel failed > > OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (556): clfftEnqueueTransform large1D second column failed > > (err: -14)after clfftEnqueueTransform > > (err: -14)after clfftEnqueueTransform = CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > > (err: 0)after clFinish > > (err: 0)after clEnqueueReadBuffer > > today i found that Skylake (HD Graphics 530 Skylake GT2 on i7-6700HQ) works > correctly with multikernel solution, > > currently only Haswell does NOT work with multikernel solution ... (that i > was able to test) > > is there a chance that the bug will be fixed even if it's related to one > platform ? Can you share or simplify your case for us to reproduce? If can reproduce it, we try to fix it. Hi,
@Rebecca : i tried to use beignet from git, i noticed same behavior, but i added more checks and i found that clFFT was NOT executed at all also for single kernel and also for multikernel, reason why i thought that single kernel runs was that i didn't get back any error message as a result of function, but when i checked results from calculation i found that it was not executed at all :-(
that means beignet is completely useless for Haswell's OpenCL 1.2
@rongyang : please find attached my patch to confirm functionality of clFFT with OpenCL 1.2,
to reproduce my bug please run following:
> git clone https://github.com/clMathLibraries/clFFT
> cd clFFT
> patch -p1 < ../patch_clfft_haswell.patch
> mkdir build ; cd build
> cmake -DCMAKE_BUILD_TYPE:STRING=Debug -DCMAKE_C_FLAGS:STRING=-DCL_USE_DEPRECATED_OPENCL_1_2_APIS=1 ../src/
> make -j4
> ./examples/examples/fft1d > /dev/null
currently in my opinion there is wrong function's execution result value for single kernel solution - so my program is not getting any errors in case if something really went wrong ...
Created attachment 131090 [details]
patch for clFFT to identify wrong execution of OpenCL 1.2 on Haswell
Thanks for posting an exact test case: it doesn't fail for me, but this is probably because we have different hardware. To be clear, since I (as Debian maintainer) have to decide whether a partial fix is better than nothing: - Do you get different results with the patch and with disabling softpin, or are *both* of those "some things (e.g. utest_run) work, but clFFT still doesn't"? - Do the silently wrong results happen with as-downloaded beignet, or do *only* the partial fixes get that far? How can we reproduce this? Has anyone else been able to reproduce this? On some distributions (e.g. Debian sid), the test case requires a slight modification: -DCMAKE_C_FLAGS:STRING="-O3 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS=1" as linking to an __inline symbol (clfftInitSetupData) without optimization fails in recent gcc. The Debian discussion is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=860805 (anyone can post there, but be warned that Debian does NOT spam-protect email addresses). Hi guys, i have news ... i got possibility to make some tests on different machine : i5-4460 CPU @ 3.20GHz debian jessie same debian version as buggy instance ... what i made: download beignet from current git, compile, then compile clFFT, and results: running successfuly on i5-4460, WOW !!! so i had a look what is different ... and i found that package: beignet-opencl-icd:amd64 1.3.0-2~bpo8+1 is causing this issue ... so i removed that package from everywhere and successfully running clFFT with current beignet from git, tested size N = 4M ... thanks everybody that supported me ... I am glad the problem has been solved. For benchmark fail, I will fix it. -- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/beignet/beignet/issues/68. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.