89914 – kernel not running when global_size is large

Status:	CLOSED WONTFIX

Alias:	None

Product:	Beignet
Classification:	Unclassified
Component:	Beignet
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Zhigang Gong

Reported:	2015-04-06 04:04 UTC by bugReporter92
Modified:	2015-04-17 00:06 UTC
CC List:	0 users

Attachments
test code to reproduce problem (2.32 KB, application/binary) 2015-04-06 13:35 UTC, bugReporter92

Description bugReporter92 2015-04-06 04:04:49 UTC

Environment: i5-3230M

When running the attached code against the latest git dev code of beignet, and also the latest release (llvm 3.5 for both), the code does not run all of the kernels in their entirety.
On my machine, the end of output is a whole bunch of zeros indicating that some kernels did not run. When running with a smaller POINTS macro (in test.cpp) (on the order of 500 * 64), all of the data is collected correctly.

So the problem is that with a very large number of kernels, the program stops working correctly. I would expect that all of the kernels would run correctly, or at least that an "out of resources" error would be thrown, if that is indeed what is happening. This is just speculation.

Thanks,
Matt
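
A trailing run of zeros is ambiguous when zero is also a legitimate result. A minimal sketch of one way to tell "never written" apart from "computed zero": pre-fill the output buffer with a sentinel before the enqueue and scan for it after the read-back. This is written in Python for brevity and is not taken from the attached test.cpp; `SENTINEL` and `find_unwritten` are hypothetical names.

```python
SENTINEL = -1.0  # hypothetical marker; pick a value the kernel can never produce


def find_unwritten(results, sentinel=SENTINEL):
    """Return (start, end) index ranges, end exclusive, that still hold the sentinel."""
    ranges = []
    start = None
    for i, value in enumerate(results):
        if value == sentinel:
            if start is None:
                start = i  # a run of untouched elements begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended just before index i
            start = None
    if start is not None:
        ranges.append((start, len(results)))  # the run extends to the end of the buffer
    return ranges
```

An empty list means every element was overwritten; a single range reaching the end of the buffer would match the symptom described above.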

Comment 1 bugReporter92 2015-04-06 13:35:06 UTC

Created attachment 114886
test code to reproduce problem

Sorry, I thought I'd attached this yesterday.

Comment 2 Zhigang Gong 2015-04-07 09:10:49 UTC

The bug has been confirmed, and we are working on it. Thanks for reporting it.

Comment 3 Zhigang Gong 2015-04-15 05:58:39 UTC

The root cause is that drm_intel_gem_bo_context_exec() failed to bind the command buffer when there is a very large array, and Beignet forgot to check the return status. This bug has been fixed in the current master branch. Please verify. Thanks.
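
The general pattern behind this kind of fix is to turn a silently ignored status code into a visible error. The sketch below is not the actual Beignet patch; it only illustrates the idea in Python, assuming the submit call follows the C convention of returning 0 on success and a negative errno on failure. `EnqueueError` and `check_exec_status` are hypothetical names; `CL_OUT_OF_RESOURCES` uses the value -5 from the OpenCL headers, and mapping every failure to it is a simplification made for this example.

```python
import os

CL_SUCCESS = 0
CL_OUT_OF_RESOURCES = -5  # value defined in the OpenCL headers


class EnqueueError(RuntimeError):
    """Hypothetical error type carrying an OpenCL-style error code."""

    def __init__(self, cl_code, detail):
        super().__init__(f"enqueue failed (CL error {cl_code}): {detail}")
        self.cl_code = cl_code


def check_exec_status(ret):
    """Raise instead of ignoring a failed submit.

    Assumes ret is 0 on success and a negative errno on failure.
    """
    if ret == 0:
        return CL_SUCCESS
    # Report the failure to the caller rather than dropping it.
    raise EnqueueError(CL_OUT_OF_RESOURCES, os.strerror(abs(ret)))
```

With a check like this, a failed submission would reach the application as an error instead of leaving the output buffer untouched.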

Comment 4 bugReporter92 2015-04-15 20:14:36 UTC

What behaviour am I supposed to be seeing now? I did a git pull, and ensured that my ICD was picking up the newly built code, but I can't see any difference in behaviour for the test code.

Comment 5 Zhigang Gong 2015-04-16 03:41:56 UTC

(In reply to bugReporter92 from comment #4)
> What behaviour am I supposed to be seeing now? I did a git pull, and ensured
> that my ICD was picking up the newly built code, but I can't see any
> difference in behaviour for the test code.

I thought this was a duplicate of the Debian bug at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=781875 "beignet: silently does nothing on large arrays".

After double-checking your test case, I think it may not be the same issue. You may be running into a known GPU hang issue. Please refer to the known issues section of README.md; one item there describes how to check whether a GPU hang occurred and how to disable the GPU hang check so you can try your kernel again.
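
A test harness can report whether the hang check is active before running a long kernel. The sketch below is a minimal Python example that assumes the i915 module parameter lives at /sys/module/i915/parameters/enable_hangcheck; that path is recalled from the Beignet README rather than stated in this thread, so verify it against README.md on your system. `hangcheck_enabled` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed location of the i915 hang check switch; confirm against README.md.
HANGCHECK_PARAM = Path("/sys/module/i915/parameters/enable_hangcheck")


def hangcheck_enabled(param_path=HANGCHECK_PARAM):
    """Return True or False for the hang check state, or None if it cannot be determined."""
    try:
        text = Path(param_path).read_text().strip()
    except OSError:
        return None  # file missing or unreadable (for example, no i915 module loaded)
    if text in ("0", "N", "n"):
        return False
    if text in ("1", "Y", "y"):
        return True
    return None  # unexpected content; do not guess
```

The function only reads the setting. Changing it requires root privileges and affects the whole system, so it is best treated as a debugging aid.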

Comment 6 bugReporter92 2015-04-16 21:15:01 UTC

You're right. I didn't check before whether the GPU was hanging. Hopefully there will be a more graceful solution for handling larger kernels later on.

After turning off the hang check, the kernel works as expected.

Does the driver code get some warning when the GPU hangs? It might be nice to throw an "out of resources" message to the callback in this case as well, as I didn't even suspect that the GPU was hanging.

Comment 7 Zhigang Gong 2015-04-17 00:06:26 UTC

(In reply to bugReporter92 from comment #6)
> You're right. I didn't check before whether the GPU was hanging. Hopefully
> there will be a more graceful solution for handling larger kernels later on.
> 
> After turning off the hang check, the kernel works as expected.
> 
> Does the driver code get some warning when the GPU hangs? It might be nice
> to throw an "out of resources" message to the callback in this case as well,
> as I didn't even suspect that the GPU was hanging.

A GPU hang is an asynchronous error event that occurs in the KMD (kernel-mode driver). From user space, there is no elegant way to catch this error efficiently. Checking dmesg after every kernel run is obviously not a good idea, right? So for now we have to list this under the known issues. If you have a better idea, please feel free to share it with us here or send it to the mailing list. Thanks for your feedback.
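
Scanning the kernel log is too costly to do inside the driver for every kernel, but it can still serve as a one-off check at the end of a debugging run. The sketch below filters log text for lines that look like hang reports. The pattern is an assumption about how i915 words these messages and may need adjusting for a given kernel version; `find_hang_lines` is a hypothetical helper name.

```python
import re

# Assumed wording of i915 hang messages; adjust to match your kernel's log output.
HANG_PATTERN = re.compile(r"gpu hang|hangcheck", re.IGNORECASE)


def find_hang_lines(log_text):
    """Return the lines of log_text that mention a GPU hang or the hang check."""
    return [line for line in log_text.splitlines() if HANG_PATTERN.search(line)]
```

The input would typically be the captured output of dmesg, which may require elevated privileges on some systems.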
