Bug 91078 - [BSW]OpenCL/utests hang sporadically
Summary: [BSW]OpenCL/utests hang sporadically
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: All Linux (All)
Importance: high critical
Assignee: meng
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-24 06:10 UTC by meng
Modified: 2017-07-24 22:46 UTC

See Also:
i915 platform: BSW/CHT
i915 features: GEM/Other


Attachments
dmesg (38.31 KB, text/plain)
2015-06-24 06:10 UTC, meng

Description meng 2015-06-24 06:10:56 UTC
Created attachment 116682
dmesg

==Regression==
--------------------------
Regression: No. 
Ubuntu: 14.04

==kernel==
--------------------------
drm-intel-next-queued: git-8c6cda

==Test cases==
Beignet: git://anongit.freedesktop.org/git/beignet (master git-e64445f)


==Bug detailed description==
-----------------------------
OpenCL/utests hang sporadically on BSW (in roughly 20% of runs). The hang is not tied to any specific test; a different test may hang on each run. The issue does not occur on other platforms (IVB/HSW/BDW).
Please see the attached dmesg.

(gdb) bt
==================
#0  0x00007f6fb5bb1337 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fb4cd6e74 in drmIoctl (fd=6, request=request@entry=1074553951, arg=arg@entry=0x7ffdca8d7840) at xf86drm.c:164
#2  0x00007f6fb4ee68f7 in drm_intel_gem_bo_map (bo=0x1505f50, write_enable=1) at intel_bufmgr_gem.c:1325
#3  0x00007f6fb5880446 in cl_mem_map (mem=0x14dc4a0, write=write@entry=1) at /home/OpenCL/beignet/src/cl_mem.c:1908
#4  0x00007f6fb586f223 in clMapBufferIntel (mem=<optimized out>, errcode_ret=0x7ffdca8d790c) at /home/OpenCL/beignet/src/cl_api.c:3215
#5  0x00007f6fb65af04e in test_copy_buf (sz=1024, cb=512, dst_off=0, src_off=<optimized out>) at /home/OpenCL/beignet/utests/enqueue_copy_buf.cpp:24
#6  enqueue_copy_buf () at /home/OpenCL/beignet/utests/enqueue_copy_buf.cpp:61
#7  0x00007f6fb65af4bd in __ANON__enqueue_copy_buf__ () at /home/OpenCL/beignet/utests/enqueue_copy_buf.cpp:66
#8  0x00007f6fb63b92df in UTest::runAllNoIssue () at /home/OpenCL/beignet/utests/utest.cpp:169
#9  0x0000000000401786 in main (argc=1, argv=0x7ffdca8d8308) at /home/OpenCL/beignet/utests/utest_run.cpp:104
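
The `request` value in frame #1 can be decoded to see which DRM ioctl the process is blocked in. The following is a minimal sketch, assuming the generic Linux `_IOC` bit layout (as used on x86) and the i915 uapi numbering (`DRM_COMMAND_BASE` 0x40, `DRM_I915_GEM_SET_DOMAIN` 0x1f). The helper name `decode_ioctl` is hypothetical and not part of libdrm or Beignet.

```python
# Decode a Linux ioctl request number into its _IOC fields.
# Assumes the generic asm-generic/ioctl.h layout (x86, ARM):
#   bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.
DRM_COMMAND_BASE = 0x40          # first driver-private DRM ioctl number
DRM_I915_GEM_SET_DOMAIN = 0x1F   # i915 command index (assumed from i915_drm.h)

_DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}


def decode_ioctl(request):
    """Split an ioctl request number into nr, type, size and direction."""
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        "dir": _DIRECTIONS[(request >> 30) & 0x3],
    }


fields = decode_ioctl(1074553951)  # value taken from frame #1 of the backtrace
print(hex(1074553951), fields)
print("driver command:", hex(fields["nr"] - DRM_COMMAND_BASE))
```

Under those assumptions the request is a type 'd' (DRM) ioctl with a 12-byte argument and driver command 0x1f, which appears to match `DRM_IOCTL_I915_GEM_SET_DOMAIN`. That would mean `drm_intel_gem_bo_map` is blocked waiting for outstanding GPU work on the buffer.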

==Reproduce steps==
---------------------------- 
1. utests/utest_run
Comment 1 Gordon Jin 2015-07-02 00:36:51 UTC
This blocks our OpenCL testing.
Comment 2 meng 2015-07-02 01:01:49 UTC
The issue is that a test case hangs.
Running "utests/utest_run" reproduces the issue.
Note: the issue cannot be reproduced when the subcases are run one at a time (utests/utest_run -c "subcase").
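
Since the hang is sporadic, the hang rate can be estimated by running the full suite repeatedly under a timeout. This is a rough sketch only: the function name `count_hangs` is hypothetical, and the timeout is an arbitrary placeholder that should be set well above the normal duration of a full run.

```python
import subprocess


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times and return how many runs exceeded timeout_s seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses before the command exits.
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

For example, `count_hangs(["utests/utest_run"], runs=20, timeout_s=600)` would count how many of 20 full runs did not finish within 10 minutes. Passing the `-c "subcase"` form instead allows a comparison with single-subcase runs.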
Comment 3 Ville Syrjala 2015-07-02 08:18:00 UTC
So no GPU hang?

Does the problem happen with i915.enable_execlists=0 too?
Comment 4 meng 2015-07-02 08:33:05 UTC
(In reply to Ville Syrjala from comment #3)
> So no GPU hang?
> 
> Does the problem happen with i915.enable_execlists=0 too?

With i915.enable_execlists=0, the issue still exists.
For OpenCL testing, we need to disable the i915 hangcheck because an OCL kernel may take 6 seconds or even longer.
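
When reporting results like these, it helps to record the module parameters that were actually in effect. The sketch below reads them from sysfs. It assumes the standard `/sys/module/<module>/parameters` layout, takes `enable_execlists` from comment 3, and assumes `enable_hangcheck` is the name of the hangcheck option. The helper name `read_module_params` is hypothetical.

```python
from pathlib import Path


def read_module_params(names, base="/sys/module/i915/parameters"):
    """Return {name: value or None} for each module parameter file under base."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Parameter not present on this kernel, or not readable.
            values[name] = None
    return values


print(read_module_params(["enable_execlists", "enable_hangcheck"]))
```

A value of `None` means the parameter file was missing or unreadable, for example on a kernel that does not expose that option.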
Comment 5 meng 2015-07-02 08:40:29 UTC
(In reply to meng from comment #4)
When the test case hangs, attaching gdb to the process lets it run to completion. So it is not a GPU hang.
Comment 6 Chris Wilson 2015-07-02 08:41:11 UTC
(In reply to meng from comment #4)
> (In reply to Ville Syrjala from comment #3)
> > So no GPU hang?
> > 
> > Does the problem happen with i915.enable_execlists=0 too?
> 
> With i915.execlist=0, the issue still exists. 
> For OpenCL testing, we need to disable i915 hang check because OCL kernel
> may cost 6 seconds or even more.

6 seconds of monopolizing the GPU sounds like a DoS worthy of being banned ;-)

So not even the grace period given to looping kernels is enough to prevent hangcheck firing? I would strongly suggest you file a bug with the bare minimum required to reproduce (that is, an igt).
Comment 7 Chris Wilson 2015-07-02 08:42:00 UTC
(In reply to meng from comment #5)
> (In reply to meng from comment #4)
> When the case hang, gdb attach that, then it could finish. So it's not GPU
> hang.

No, that would be a "missed interrupt" which is normally detected by hangcheck.
Comment 8 cprigent 2015-07-28 17:04:35 UTC
Bug scrub:
Assigned to Jani
Comment 9 cprigent 2016-03-25 16:59:26 UTC
Assigned to Mengmeng
Hi Mengmeng,
Is it still reproduced?
Comment 10 Jani Nikula 2016-06-17 16:33:21 UTC
Timeout, closing. Please reopen if the problem persists on latest kernels.

