Bug 100639

Summary: drm_intel_gem_bo_context_exec() failed: Device or resource busy .... even after patch
Product: Beignet Reporter: Michal <developer.m3>
Component: BeignetAssignee: Zhigang Gong <zhigang.gong>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium CC: rong.r.yang
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: output from clinfo
patch for clFFT to identify wrong execution of OpenCL 1.2 on Haswell

Description Michal 2017-04-10 07:11:12 UTC
Created attachment 130772 [details]
output from clinfo

Hi everybody, 

i'm getting errors: 

drm_intel_gem_bo_context_exec() failed: Device or resource busy
Beignet: "Exec event 0x1dda2a0 error, type is 4592, error staus is -5"

i applied patch : 
https://www.mail-archive.com/beignet@lists.freedesktop.org/msg07315.html


my hardware: intel i5-4250U , intel graphics HD 5000,
i'm using debian ,
installed packages:
libdrm  2.4.74-1
llvm 3.5, 3.8
kernel 4.9.13-1~bpo8+1 (2017-02-27) x86_64

root@debian:~# cat /sys/module/i915/parameters/enable_ppgtt
1

my issue has raised with use of clFFT large 1D FFT , error raised is mentioned above, 

i ran included benchmark tools in beignet package, not sure if it could be related : 
... skipping successes ...
benchmark_filter_image_uchar()    [Result: 301.701 FPS]    [SUCCESS]
benchmark_filter_image_ushort()    [Result: 179.831 FPS]    [SUCCESS]
benchmark_filter_image_uint()    [Result: 89.430 FPS]    [SUCCESS]
benchmark_workgroup_broadcast_1D_int()ASSERTION FAILED: 0
  at file .../Beignet-1.3.1-Source/backend/src/backend/gen_context.cpp,
function virtual void gbe::GenContext::emitUnpackLongInstruction(const
gbe::SelectionInstruction&), line 2313
Trace/breakpoint trap


using clinfo is raising also same error message ...
i'm attaching output from clinfo ... 

please could you help me to fix that error ? so i could use clFFT ... 

kind regards,
Comment 1 Rebecca Palmer 2017-04-10 21:46:26 UTC
I have the same error message on Ivybridge [8086:0166] with the jessie-backports (4.9) linux-image, but not with the jessie (3.16) one (which is why I didn't notice before).  I haven't tried the patch linked to above.

Disabling softpin (by commenting out its CHECK_LIBRARY_EXISTS line in top-level CMakeLists.txt - https://bugs.freedesktop.org/show_bug.cgi?id=98647#c9 ) fixes this; it also disables OpenCL 2.0, but that isn't supported on Ivybridge/Haswell anyway.

This might also explain the Linux version dependency, as softpin requires a recent Linux (post https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/drm/i915_drm.h?id=506a8e87d8d2746b9e9d2433503fe237c54e4750).  Attempting it on older ones fails without doing anything (http://sources.debian.net/src/libdrm/2.4.74-1/intel/intel_bufmgr_gem.c/#L3658 , http://sources.debian.net/src/libdrm/2.4.74-1/intel/intel_bufmgr.c/?hl=266#L263 ), and beignet doesn't check for failure.

(Does this imply a Linux version requirement for OpenCL 2.0, given that we require softpin for SVM, and if so should we check for working softpin on init?)
Comment 2 Michal 2017-04-11 19:47:08 UTC
Hi Rebecca, 

does it mean that Haswell platform should be error-less in some certain conditions ? i didn't notice that state ... 
i'm not sure if i understood your comment ... 

please could anybody confirm my issue that it's a bug and it will be analyzed ?
Comment 3 Rebecca Palmer 2017-04-11 21:16:05 UTC
It looks like a bug to me, given that it makes beignet mostly unusable (most of the test suite fails, and I suspect the bits that don't might be compile-only tests) when recent linux(>=4.5)+libdrm(>=2.4.66) is installed on older hardware.  (I'm guessing this doesn't affect Skylake, given that those are also the minimum versions for OpenCL 2.0, but would appreciate explicit confirmation of this.)

The workaround is to delete the line
  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_set_softpin_offset" ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_BO_SET_SOFTPIN)
from top-level CMakeLists.txt and rebuild (starting from the cmake step).

Is this likely to have any negative effects (e.g. on performance) beyond the obvious "disables OpenCL 2"?  If not, a potential solution is to use softpin only on 2.0-capable devices, i.e. make HAS_BO_SET_SOFTPIN a runtime check. 

The numbers in the error message mean "out of resources on the device trying to execute an NDRangeKernel (i.e. an ordinary OpenCL computation)".
Comment 4 Michal 2017-04-12 07:12:08 UTC
well ... let's recap ... 
i have intel i5 4250U - Haswell, intel graphics HD 5000, with OpenCL 1.2,

i would like to use clFFT which needs OpenCL 1.X, not OpenCL 2.0

if i compile beignet directly downloaded - as it is - without being patched, version 1.3.1 
i get results: 
./Beignet-1.3.1-Source/build/utests# ./utest_run 
...
> test_load_program_from_bin_file()drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x1443ba0 error, type is 4592, error status is -5"
>     [FAILED]
>     Error: ((float *)buf_data[1])[i] == cpu_dst[i]
>   at file /root/moje/Beignet-1.3.1-Source/utests/load_program_from_bin_file.cpp, function test_load_program_from_bin_file, line 76
> enqueue_built_in_kernels()    [SUCCESS]
> builtin_acos_float()drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x170cf60 error, type is 4592, error status is -5"
>     [FAILED]
>     Error: input_data1:0.000000e+00  -> gpu:0.000000e+00  cpu:1.570796e+00 diff:1.570796e+00 expect:1.907349e-06
> 
>   at file /root/moje/Beignet-1.3.1-Source/utests/generated/builtin_acos_float.cpp, function builtin_acos_float, line 125
> builtin_acos_float2()drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0xfc0e50 error, type is 4592, error status is -5"
>     [FAILED]
>     Error: input_data1:0.000000e+00  -> gpu:1.175566e-38  cpu:1.570796e+00 diff:1.570796e+00 expect:1.907349e-06

then i compile still nonpatched version with commented out line:
CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_set_softpin_offset" ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_BO_SET_SOFTPIN)

./Beignet-1.3.1-Source/build/benchmark# ./benchmark_run
...
> benchmark_copy_image_uint()    [Result: 143.895 FPS]    [SUCCESS]
> benchmark_filter_image_uchar()    [Result: 301.592 FPS]    [SUCCESS]
> benchmark_filter_image_ushort()    [Result: 179.282 FPS]    [SUCCESS]
> benchmark_filter_image_uint()    [Result: 89.425 FPS]    [SUCCESS]
> benchmark_workgroup_broadcast_1D_int()ASSERTION FAILED: 0
>   at file /root/moje/Beignet-1.3.1-Source/backend/src/backend/gen_context.cpp, function virtual void gbe::GenContext::emitUnpackLongInstruction(const gbe::SelectionInstruction&), line 2313
> Trace/breakpoint trap

and also clFFT fails , it seems that failure comes out when more than 1 kernel is loaded, single kernel runs correctly ... 

when i enable HAVE_DRM_INTEL_BO_SET_SOFTPIN and apply that patch i get same results like disabled HAVE_DRM_INTEL_BO_SET_SOFTPIN,
Comment 5 Rebecca Palmer 2017-04-12 22:11:30 UTC
Did you try utest_run (not benchmark_run) with the no-softpin version?

> it seems that failure comes out when more than 1 kernel is loaded, single kernel runs correctly

Do you mean before or after disabling softpin (i.e. commenting out that line)?  For me (Ivy Bridge), with softpin on *any* kernel triggers "Exec event [...] error" and doesn't produce any results (i.e. as-distributed beignet is totally unusable), and with softpin off I haven't been able to trigger it at all (in particular, running two kernels together is fine).

The "ASSERTION FAILED: 0" in benchmark_run appears to be a separate issue: it tries to run all the benchmarks, including ones that aren't supported on the current hardware.
Comment 6 Michal 2017-04-13 07:33:19 UTC
Hi Rebecca,

as i wrote above ... for me it's NOT important to have working utests or benchmarks, and as i wrote : clFFT is not working - that's the point for me, 
clFFT needs OpenCL 1.X , multikernel load, and this is not working in any configuration of softpin or patch, 

i can do any test , but requester should be anybody who is able to solve my issue from development side ... i already tried all configurations of Makefiles and patches and it didn't help, 

and i hope Rebecca that you are not extending this thread to make it hard to study for the developers that should solve this bug, 

Kind regards,
Comment 7 Rebecca Palmer 2017-04-13 20:48:59 UTC
How exactly are you triggering the error with clFFT?  Their 1D example https://sources.debian.net/src/clfft/2.12.2-1/src/examples/fft1d.c/ with the size increased to 16000 doesn't crash for me (with softpin disabled; I haven't checked whether the results are correct), but that might be because we have different hardware.
Comment 8 Michal 2017-04-14 07:16:05 UTC
please check results ... by the way i use clFFT with size of multiplication of 2 ... it means size 8192 or 16384 or else ... 16000 is not by the rule ... 

i'm successfully using clFFT on ARM platform with Mali GPU OpenCL 1.0, running smoothly ... 
Intel Haswell OpenCL 1.X platform errrors and bad results ...
Comment 9 rongyang 2017-04-19 04:54:05 UTC

(In reply to Michal from comment #8)
> please check results ... by the way i use clFFT with size of multiplication
> of 2 ... it means size 8192 or 16384 or else ... 16000 is not by the rule
> ... 
> 
> i'm successfully using clFFT on ARM platform with Mali GPU OpenCL 1.0,
> running smoothly ... 
> Intel Haswell OpenCL 1.X platform errrors and bad results ...

How about small size 1D FFT?
If "drm_intel_gem_bo_context_exec() failed"only happens when large 1D FFT, maybe GPU timeout, you could disable the timeout check by "echo 0 > /sys/module/i915/parameters/enable_hangcheck".
Comment 10 Michal 2017-04-19 06:52:11 UTC
Hi rongyang,

after: echo 0 > /sys/module/i915/parameters/enable_hangcheck
still same error,

single kernel solution works :-( ... single kernel solution is useless for me ...
Comment 11 Rebecca Palmer 2017-04-19 07:23:15 UTC
> size 8192 or 16384
Those sizes don't crash for me either (with softpin disabled), and the size 8192 agrees with scipy to within reasonable numerical error (relative error <1 in 10^4 max, <2 in 10^6 mean).

I suspect this is a hardware-dependent bug, but to be sure, can you post the exact code you're using to trigger the error?

> please check results
Does that mean you've seen wrong results *without* the error message?
Comment 12 Michal 2017-04-19 07:38:07 UTC
Hi Rebecca, 

i added messages to clFFT: 


> (err: 0)after clGetPlatformIDs
> (err: 0) Platform found: Intel Gen OCL Driver
> drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x16cf2b0 error, type is 4592, error staus is -5"
> (err: 0)after clGetDeviceIDs
> (err: 0)Device found on the above platform: Intel(R) HD Graphics Haswell Ultrabook GT3 Mobile
> (err: 0) after clCreateContext
> 
> Performing fft on an one dimensional array of size N = 8192
> (err: 0)after clCreateBuffer
> (err: 0)after clfftCreateDefaultPlan
> (err: 0)after clfftSetPlanPrecision
> (err: 0)after clfftSetLayout
> (err: 0)after clfftSetResultLocation
> (err: 0)after clfftBakePlan
> (err: 0)after clEnqueueWriteBuffer
> drm_intel_gem_bo_context_exec() failed: Device or resource busy
> Beignet: "Exec event 0x154f9c0 error, type is 4592, error staus is -5"
> clEnqueueNDRangeKernel failed second... -14 params: 1 1024 128 0
> ERROR executing the kernel -5
> OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (659): clEnqueueNDRangeKernel failed
> OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (556): clfftEnqueueTransform large1D second column failed
> (err: -14)after clfftEnqueueTransform
> (err: -14)after clfftEnqueueTransform = CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
> (err: 0)after clFinish
> (err: 0)after clEnqueueReadBuffer

today i found that Skylake (HD Graphics 530 Skylake GT2 on i7-6700HQ) works correctly with multikernel solution, 

currently only Haswell does NOT work with multikernel solution ... (that i was able to test)

is there a chance that the bug will be fixed even if it's related to one platform ?
Comment 13 Rebecca Palmer 2017-04-25 22:48:30 UTC
Does this bug also exist in beignet git master (i.e. git clone https://anongit.freedesktop.org/git/beignet.git )?
Comment 14 rongyang 2017-04-27 06:12:57 UTC
(In reply to Michal from comment #12)
> Hi Rebecca, 
> 
> i added messages to clFFT: 
> 
> 
> > (err: 0)after clGetPlatformIDs
> > (err: 0) Platform found: Intel Gen OCL Driver
> > drm_intel_gem_bo_context_exec() failed: Device or resource busy
> > Beignet: "Exec event 0x16cf2b0 error, type is 4592, error staus is -5"
> > (err: 0)after clGetDeviceIDs
> > (err: 0)Device found on the above platform: Intel(R) HD Graphics Haswell Ultrabook GT3 Mobile
> > (err: 0) after clCreateContext
> > 
> > Performing fft on an one dimensional array of size N = 8192
> > (err: 0)after clCreateBuffer
> > (err: 0)after clfftCreateDefaultPlan
> > (err: 0)after clfftSetPlanPrecision
> > (err: 0)after clfftSetLayout
> > (err: 0)after clfftSetResultLocation
> > (err: 0)after clfftBakePlan
> > (err: 0)after clEnqueueWriteBuffer
> > drm_intel_gem_bo_context_exec() failed: Device or resource busy
> > Beignet: "Exec event 0x154f9c0 error, type is 4592, error staus is -5"
> > clEnqueueNDRangeKernel failed second... -14 params: 1 1024 128 0
> > ERROR executing the kernel -5
> > OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (659): clEnqueueNDRangeKernel failed
> > OPENCL_V< CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST > (556): clfftEnqueueTransform large1D second column failed
> > (err: -14)after clfftEnqueueTransform
> > (err: -14)after clfftEnqueueTransform = CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
> > (err: 0)after clFinish
> > (err: 0)after clEnqueueReadBuffer
> 
> today i found that Skylake (HD Graphics 530 Skylake GT2 on i7-6700HQ) works
> correctly with multikernel solution, 
> 
> currently only Haswell does NOT work with multikernel solution ... (that i
> was able to test)
> 
> is there a chance that the bug will be fixed even if it's related to one
> platform ?

Can you share or simplify your case for us to reproduce?
If can reproduce it, we try to fix it.
Comment 15 Michal 2017-04-27 11:51:29 UTC
Hi,

@Rebecca : i tried to use beignet from git, i noticed same behavior, but i added more checks and i found that clFFT was NOT executed at all also for single kernel and also for multikernel, reason why i thought that single kernel runs was that i didn't get back any error message as a result of function, but when i checked results from calculation i found that it was not executed at all :-( 

that means beignet is completely useless for Haswell's OpenCL 1.2 

@rongyang : please find attached my patch to confirm functionality of clFFT with OpenCL 1.2, 
to reproduce my bug please run following: 

> git clone https://github.com/clMathLibraries/clFFT
> cd clFFT
> patch -p1 < ../patch_clfft_haswell.patch
> mkdir build ; cd build 
> cmake -DCMAKE_BUILD_TYPE:STRING=Debug -DCMAKE_C_FLAGS:STRING=-DCL_USE_DEPRECATED_OPENCL_1_2_APIS=1 ../src/
> make -j4 
> ./examples/examples/fft1d > /dev/null

currently in my opinion there is wrong function's execution result value for single kernel solution - so my program is not getting any errors in case if something really went wrong ...
Comment 16 Michal 2017-04-27 11:52:36 UTC
Created attachment 131090 [details]
patch for clFFT to identify wrong execution of OpenCL 1.2 on Haswell
Comment 17 Rebecca Palmer 2017-04-28 07:28:22 UTC
Thanks for posting an exact test case: it doesn't fail for me, but this is probably because we have different hardware.

To be clear, since I (as Debian maintainer) have to decide whether a partial fix is better than nothing:

- Do you get different results with the patch and with disabling softpin, or are *both* of those "some things (e.g. utest_run) work, but clFFT still doesn't"?

- Do the silently wrong results happen with as-downloaded beignet, or do *only* the partial fixes get that far?  How can we reproduce this?
Comment 18 Rebecca Palmer 2017-04-29 11:52:03 UTC
Has anyone else been able to reproduce this?

On some distributions (e.g. Debian sid), the test case requires a slight modification:
-DCMAKE_C_FLAGS:STRING="-O3 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS=1"
as linking to an __inline symbol (clfftInitSetupData) without optimization fails in recent gcc.

The Debian discussion is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=860805 (anyone can post there, but be warned that Debian does NOT spam-protect email addresses).
Comment 19 Michal 2017-05-04 10:18:46 UTC
Hi guys, 

i have news ... 
i got possibility to make some tests on different machine : 
i5-4460  CPU @ 3.20GHz
debian jessie same debian version as buggy instance ... 

what i made: 
download beignet from current git, 
compile, 
then compile clFFT, 

and results: 
running successfuly on i5-4460, 
WOW !!!

so i had a look what is different ... and i found that package: 
beignet-opencl-icd:amd64               1.3.0-2~bpo8+1

is causing this issue ... 

so i removed that package from everywhere and successfully running clFFT with current beignet from git, 
tested size N = 4M ... 

thanks everybody that supported me ...
Comment 20 rongyang 2017-05-05 07:53:56 UTC
I am glad the problem has been solved.
For benchmark fail, I will fix it.
Comment 21 GitLab Migration User 2018-10-12 21:27:00 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/beignet/beignet/issues/68.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.