Bug 26691 - Spurious hangcheck whilst executing a long shader over a large vertex buffer
Summary: Spurious hangcheck whilst executing a long shader over a large vertex buffer
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Chris Wilson
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-21 18:07 UTC by Kristof Ralovich
Modified: 2017-07-24 23:08 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
the test case to reproduce the issue (15.23 KB, text/x-csrc)
2010-02-21 18:07 UTC, Kristof Ralovich
hopefully relevant part of kern.log (9.91 KB, text/plain)
2010-02-21 18:10 UTC, Kristof Ralovich
log of crashing X server (29.82 KB, text/plain)
2010-02-21 18:10 UTC, Kristof Ralovich
log from kdm (17.58 KB, text/plain)
2010-02-21 18:11 UTC, Kristof Ralovich
Include instdone in hangcheck. (2.70 KB, patch)
2010-05-10 12:12 UTC, Chris Wilson

Description Kristof Ralovich 2010-02-21 18:07:12 UTC
Created attachment 33479
the test case to reproduce the issue

After running the test case "vsraytrace", the X server crashes and the screen remains black. I cannot switch back to the console. The kernel is not hung; I can soft-shut down the machine.

kernel 2.6.32.8
X 1.7.5-1
libdrm 2.4.17-1
xf86-video-intel 2.9.1-2
mesa 7.7-3 (same issue with 7.6.1-1)

Everything except the kernel comes from official Debian unstable/experimental packages.
Comment 1 Kristof Ralovich 2010-02-21 18:08:35 UTC
Comment on attachment 33479
the test case to reproduce the issue

The test case uses shaderutil.c/h from progs/util/ in the Mesa sources.
Comment 2 Kristof Ralovich 2010-02-21 18:10:01 UTC
Created attachment 33480
hopefully relevant part of kern.log
Comment 3 Kristof Ralovich 2010-02-21 18:10:46 UTC
Created attachment 33481
log of crashing X server
Comment 4 Kristof Ralovich 2010-02-21 18:11:26 UTC
Created attachment 33482
log from kdm
Comment 5 Kristof Ralovich 2010-02-23 19:21:09 UTC
The issue persists with Debian's libdrm 2.4.18-1.
Comment 6 Kristof Ralovich 2010-02-23 19:42:20 UTC
The problem is still there with the 2.6.32.9 kernel.
Comment 7 Kristof Ralovich 2010-03-13 10:32:13 UTC
If you are here, have a look at https://bugs.freedesktop.org/show_bug.cgi?id=27060 too please.
Comment 8 Kristof Ralovich 2010-03-18 19:42:56 UTC
I have re-run the test case with today's Mesa master
8df65e98998b4c104db30cbba8a38be7eb2a9acd (including the patch referred to above)
and drm master c1c8bbf80b1f734e23996bf805dc78f32ebaf56f, and the X server crash still occurs!
Comment 9 Kristof Ralovich 2010-03-18 19:44:19 UTC
I provided the test case that reliably reproduces the problem for me, so would someone from Intel please have a look into this?
Comment 10 Kristof Ralovich 2010-03-18 19:56:55 UTC
(In reply to comment #9)
> I provided the test case that reliably reproduces the problem for me, so would
> someone from Intel please have a look into this?
> 

Please run the test case with LIBGL_ALWAYS_SOFTWARE=1 ./vsraytrace to see what
the correct rendering should be.
Comment 11 Chris Wilson 2010-03-20 10:29:49 UTC
The trigger seems to be the number of vertices in the pointset.

Using 10x10 or 50x50 points does not trigger the hang, but using 250x250 points does. I'm not certain what the significance of this is yet...
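
To put those grid sizes in proportion, here is a minimal sketch of the arithmetic. The helper names and the floats-per-vertex parameter are hypothetical; the actual vertex layout of the test case is not stated in this thread.

```c
#include <stddef.h>

/* Number of vertices in an n-by-n point set. */
static size_t pointset_vertices(size_t n)
{
    return n * n;
}

/*
 * Bytes needed for the point set, ASSUMING each vertex stores
 * floats_per_vertex floats. The real test case may use another layout.
 */
static size_t pointset_bytes(size_t n, size_t floats_per_vertex)
{
    return pointset_vertices(n) * floats_per_vertex * sizeof(float);
}
```

A 10x10 grid has 100 vertices, 50x50 has 2500, and 250x250 has 62500, so the failing case runs the vertex shader 25 times as often as the largest passing one within a single draw.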
Comment 12 Jesse Barnes 2010-04-06 11:24:19 UTC
Could be that our hangcheck timer is too aggressive, given that there's no error reported but we get a hangcheck timeout...

Tag you're it Chris!
Comment 13 Kristof Ralovich 2010-04-18 14:14:13 UTC
Also, it would be very nice if the X server were able to restart afterwards.
Comment 14 fangxun 2010-04-27 02:54:38 UTC
Piglit case glsl-vs-raytrace-bug26691 failed with error message:
intel_bufmgr_gem.c:1070: Error setting domain 598: Input/output error
intel_bufmgr_gem.c:1247: Error setting memory domains 598 (00000040 00000000): Input/output error .

X is still alive and does not crash. After running this case, all the remaining piglit cases failed.
Comment 15 Gordon Jin 2010-04-27 19:15:36 UTC
Promoting to P1, as it impacts the rest of the piglit run.
Comment 16 Chris Wilson 2010-05-10 12:12:47 UTC
Created attachment 35551
Include instdone in hangcheck.

This is a patch that I've been using to reduce the number of spurious errors.
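
The idea named in the patch title can be sketched roughly as follows. This is a simplified userspace model and not the attached kernel patch: the struct names, the stall limit, and the choice of sampled values are assumptions for illustration. It assumes the command streamer's head pointer can sit on a single long-running draw while the INSTDONE status bits still change, so a check that looks only at the head pointer would report a hang too early.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of GPU progress indicators, taken at each timer tick. */
struct hang_sample {
    uint32_t acthd;    /* active head pointer of the command streamer */
    uint32_t instdone; /* per-unit "instruction done" status bits */
};

struct hangcheck {
    struct hang_sample last;  /* sample seen at the previous tick */
    unsigned stalled_ticks;   /* consecutive ticks with no visible progress */
};

/* Assumed threshold: report a hang only after this many stalled ticks. */
#define HANGCHECK_STALL_LIMIT 2u

/* Returns true if the GPU should be treated as hung at this tick. */
static bool hangcheck_tick(struct hangcheck *hc, struct hang_sample now)
{
    /* Progress means EITHER value moved since the previous tick. */
    bool progressed = now.acthd != hc->last.acthd ||
                      now.instdone != hc->last.instdone;

    hc->last = now;
    if (progressed) {
        hc->stalled_ticks = 0;
        return false;
    }
    return ++hc->stalled_ticks >= HANGCHECK_STALL_LIMIT;
}
```

In this model, a tick where the head pointer is unchanged but instdone differs resets the stall counter, so a long shader that is still making progress is not flagged. Only consecutive ticks with both values frozen are reported as a hang.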
Comment 17 Chris Wilson 2010-05-25 00:53:50 UTC
Dropping priority: as far as we can tell, this is a bug in the hangcheck firing spuriously, and the attached kernel patch should reduce the error rate.
Comment 18 Chris Wilson 2010-07-06 01:03:18 UTC
The hangcheck change is now upstream.

[gm45] Running the test case on the old compiler throws an error that it does not handle multiple returns from a function.

So it appears that this residual failure will hopefully be fixed by the glsl2 compiler work, which is being tracked at bug 28748.

Marking this bug as closed as the test case is now part of piglit (and the test suite) and is being tracked separately.

Thanks for the bug report and the excellent test case!
Comment 19 Kristof Ralovich 2010-08-18 21:14:13 UTC
I can confirm it is working on GM45 with upstream Mesa. Thank you!

