Bug 25902 - [g45] GPU hang with specific fragment shader
Summary: [g45] GPU hang with specific fragment shader
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Eric Anholt
Reported: 2010-01-05 10:41 UTC by Nick Bowler
Modified: 2010-03-10 11:13 UTC

Attachments
Test case which locks up Intel KMS (23.23 KB, application/x-bzip2)
2010-01-05 10:41 UTC, Nick Bowler
License patch (2.37 KB, patch)
2010-01-06 11:26 UTC, Nick Bowler

Description Nick Bowler 2010-01-05 10:41:03 UTC
Created attachment 32462
Test case which locks up Intel KMS

I originally posted this to the dri-devel mailing list, and it was suggested that I file a bug report here.  The test case attached here is updated slightly from the one posted to the mailing list.

Mesa 7.7 manages to lock up the Intel KMS driver for me in some
circumstances.  After running the test case (attached), the display
locks up and my kernel logs are spammed with

  [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
  render error detected, EIR: 0x00000000
  [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 5628 at 5621)
  [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
  render error detected, EIR: 0x00000000
  [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 5633 at 5621)
  [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
  render error detected, EIR: 0x00000000
  ...

until I reset the box (other tasks continue to run on the machine).  At
least kernel versions 2.6.32 through the latest Linus git at the time
of posting are affected, but Mesa 7.5.2 does not trigger the lockup.
Another interesting bit that appears in my kernel log is

  mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
  [drm] MTRR allocation failed.  Graphics performance may suffer.
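
In the i915_do_wait_request lines quoted earlier, the awaited number climbs (5628, 5633) while the number after "at" stays at 5621. As I read the message format, the second number is the last sequence number the GPU reported, so a value that never moves is consistent with a hung GPU. A minimal C sketch for pulling the two numbers out of such a line follows; the helper name parse_wait_request is made up for illustration and is not part of the kernel or of the attached test case.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the two numbers from the
 * "(awaiting N at M)" fragment of an i915_do_wait_request log line.
 * Returns 1 on success, 0 if the fragment is absent or malformed.
 */
static int parse_wait_request(const char *line, unsigned *awaiting, unsigned *at)
{
	const char *p;

	assert(line != NULL && awaiting != NULL && at != NULL);

	/* Skip the variable prefix and find the part we care about. */
	p = strstr(line, "(awaiting ");
	if (p == NULL)
		return 0;

	/* Both conversions must succeed for the line to count. */
	return sscanf(p, "(awaiting %u at %u)", awaiting, at) == 2;
}
```

Fed the two i915_do_wait_request lines above, it yields 5628/5621 and then 5633/5621; the "render error detected" lines do not match and return 0.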

The Xorg log reports my graphics card as Intel(R) G45/G43. The lockup
occurs when I try to run the following fragment shader:

  uniform bool useSecondary;
  uniform sampler2D tex0, tex1; /* Assigned values 0 and 1, respectively. */
  void main(void)
  {
  	vec4 primary   = texture2D(tex0, gl_TexCoord[0].st);
  	vec4 secondary = texture2D(tex1, gl_TexCoord[0].st);
  	vec3 colour    = (1-primary.a)*gl_Color.rgb + primary.a*primary.rgb;
  
  	/*
  	 * Removing the "if useSecondary" here (but keeping the
  	 * multiplication) causes the shader to work.
  	 * The failure does not depend on the value assigned to useSecondary.
  	 */
  	if (useSecondary) {
  		colour *= secondary.rgb;
  	}
  
  	gl_FragColor = vec4(colour, 1);
  }

Since textures seem to be required to trigger the issue, I have attached
an archive containing the test case -- glew, glut and libpng are
required.  The test case works correctly with the software rasterizer.
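
For comparing against the software rasterizer's output, the per-fragment arithmetic of the shader above can be sketched on the CPU. This is an illustrative C reference only and is not part of the attached test case; the struct and function names (rgb, rgba, shade) are invented here, and texture sampling is assumed to have already produced the two texel values.

```c
#include <assert.h>
#include <stdbool.h>

struct rgb  { float r, g, b; };
struct rgba { float r, g, b, a; };

/*
 * CPU sketch of the shader's arithmetic: blend the vertex colour with
 * the primary texel by the primary alpha, then optionally modulate by
 * the secondary texel.  The returned colour is the RGB part of
 * gl_FragColor; the shader always writes 1 for alpha.
 */
static struct rgb shade(struct rgba primary, struct rgba secondary,
                        struct rgb vertex_colour, bool use_secondary)
{
	struct rgb colour;

	assert(primary.a >= 0.0f && primary.a <= 1.0f);

	colour.r = (1.0f - primary.a) * vertex_colour.r + primary.a * primary.r;
	colour.g = (1.0f - primary.a) * vertex_colour.g + primary.a * primary.g;
	colour.b = (1.0f - primary.a) * vertex_colour.b + primary.a * primary.b;

	/* The branch that the GPU hang depends on. */
	if (use_secondary) {
		colour.r *= secondary.r;
		colour.g *= secondary.g;
		colour.b *= secondary.b;
	}

	return colour;
}
```

Nothing in this arithmetic is unusual, which fits the observation that the hang depends on the presence of the branch and not on the value of useSecondary.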

Since posting to the list, I also managed to trigger the same behaviour by a different means (ostensibly by creating very large vertex buffer objects, but I haven't really spent much time investigating).  If desired, I can try to produce another test case for that.
Comment 1 Eric Anholt 2010-01-06 10:32:39 UTC
Would you be willing to license your code under http://www.opensource.org/licenses/mit-license.php so we can include it in the open-source testsuite?
Comment 2 Nick Bowler 2010-01-06 11:26:51 UTC
Created attachment 32477
License patch

Go for it.
Comment 3 Eric Anholt 2010-03-10 10:45:29 UTC
Thanks so much!  Having the small testcase on hand to hack on made a huge difference here.  I've trimmed it down a bit more and integrated it as a piglit testcase (glsl-fs-bug25902).  Turns out the bug was in masked sample operations in the GLSL dispatch path -- confirming both your observations that having the if present was important and that having texturing present was important.  The fix is in Mesa master and will land in 7.8.

commit f6d210c284751ac50a8d6358de7e75a1ff1e4ac7
Author: Eric Anholt <eric@anholt.net>
Date:   Wed Mar 10 10:38:20 2010 -0800

    i965: Fix the response len of masked sampler messages for 8-wide dispatch.
    
    The bad response length would hang the GPU with a masked sample in a
    shader using control flow.  For 8-wide, the response length is always
    4, and masked slots are just not written to.  brw_wm_glsl.c already
    allocates registers in the right locations.
    
    Fixes piglit glsl-fs-bug25902 (fd.o bug #25902).
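
The rule stated in the commit message can be sketched as a pair of small C functions. This is an illustration under stated assumptions, not the code from brw_wm_glsl.c: the function names are hypothetical, and the write mask is assumed to be a 4-bit value with one bit per channel, a set bit meaning the channel is enabled.

```c
#include <assert.h>

#define SAMPLER_CHANNELS 4u

/*
 * Response length, in registers, of an 8-wide sampler message.
 * Per the commit message it is always 4, whatever the write mask.
 */
static unsigned response_len_8wide(unsigned writemask)
{
	assert((writemask & ~0xfu) == 0);
	(void)writemask; /* deliberately unused: the mask does not shrink the response */
	return SAMPLER_CHANNELS;
}

/*
 * Whether the response register for a channel receives data.
 * Masked slots are still part of the response but are not written to.
 */
static int slot_written(unsigned writemask, unsigned channel)
{
	assert((writemask & ~0xfu) == 0);
	assert(channel < SAMPLER_CHANNELS);
	return (int)((writemask >> channel) & 1u);
}
```

With a mask of 0x7 (alpha disabled, as when only .rgb is used), the response length is still 4 and only the fourth slot is left untouched.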
Comment 4 Nick Bowler 2010-03-10 11:13:11 UTC
Thanks for taking the time to fix this; I'll try out git Mesa tonight.

It still worries me that a user program can take out a kernel driver so thoroughly, though: I have to reboot to get working graphics after running the test with Mesa 7.7.

However, I suppose that's actually a different bug?

