Bug 36821 - [bisected SNB]oglc api-texcoord causes GPU hang
Status: VERIFIED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: All Linux (All)
Importance: high critical
Assigned To: Eric Anholt
Depends on:
Blocks:
Reported: 2011-05-03 23:08 UTC by fangxun
Modified: 2011-08-24 22:36 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
oglc api-mtexcoord output (5.96 KB, text/plain)
2011-05-26 02:06 UTC, fangxun
i915_error_state (286.35 KB, text/plain)
2011-05-26 02:11 UTC, fangxun

Description fangxun 2011-05-03 23:08:59 UTC
System Environment:
--------------------------
Arch:           x86_64
Platform:       Huronriver
Libdrm:         (master)2.4.25
Mesa:           (master)929b3d82334e217641c3f39e7914a90dadc6e6b2
Xserver:        (master)xorg-server-1.10.0-313-g5cb31cd0cbf83fff5f17a475e7b0e45246b19bf3
Xf86_video_intel: (master)2.15.0-6-g67e5a74e997f199327f9115c7ba867df3c49da8d
Kernel:         (drm-intel-next)daab1470018f025e0b1c8731dfb825ff421ffd9b

Bug detailed description:
-------------------------
This case failed and caused a system hang on Huronriver. The last known good commit is 1a447749ed421db8eb6ba20012630785aef9bb12, and the last known bad commit is bb7ff01deb5c1eb813b90da6f40d987a67e2793b. We were unable to bisect to the first bad commit because of Mesa build errors, so the first bad commit could be any of the following:
bb7ff01deb5c1eb813b90da6f40d987a67e2793b
588cebce2d5b6afd24b72603d744d390481310dd
04e3f1d3c29c68343e709d566b7fe13d617f8d13
a82a43e8d99e1715dd11c9c091b5ab734079b6a6
855f56ca13c1003396a81da1a110357d624a2101

Reproduce steps:
-------------------------
1. start X
2. INTEL_STRICT_CONFORMANCE=1 oglconform -z -s -suite all -v 2 -D 33 -test api-texcoord
Comment 1 Ian Romanick 2011-05-24 17:45:03 UTC
Something fishy is happening with the bisect.  For example, commit 04e3f1d3 adds some comments and moves a structure field.  There is no way that commit broke the build.  I recommend doing 'git clean -fxd' at each bisect step.  Mesa's build system is notoriously bad, so there could be garbage left over from previous builds that cause later builds to fail.
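A minimal sketch of that advice as a helper for `git bisect run`, assuming the build and the test can each be driven by a single command. The helper names and the idea of passing in `build_cmd` and `test_cmd` are illustrative assumptions, not anything from Mesa. Exit status 125 is the standard way to tell `git bisect run` that a commit cannot be tested and should be skipped.

```python
import subprocess

# Exit codes understood by `git bisect run`:
#   0 = good, 1..127 (except 125) = bad, 125 = cannot test, skip this commit.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(build_ok, test_ok):
    """Map build/test outcomes to a `git bisect run` exit code."""
    if not build_ok:
        return SKIP  # a build failure says nothing about the GPU hang
    return GOOD if test_ok else BAD


def run(cmd):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def bisect_step(build_cmd, test_cmd):
    """Clean the tree, build, test, and return the exit code for this commit."""
    # Remove untracked and ignored files so stale objects cannot break the build.
    run(["git", "clean", "-fxd"])
    build_ok = run(build_cmd)
    test_ok = build_ok and run(test_cmd)
    return bisect_exit_code(build_ok, test_ok)
```

A wrapper script would end with `sys.exit(bisect_step(...))` using the real build and test commands. It has to live outside the working tree, since `git clean -fxd` deletes untracked files.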
Comment 2 Eric Anholt 2011-05-25 12:23:29 UTC
Please retest with the kernel updated to keithp's drm-intel-next with the forcewake patches.  Incorrect forcewake handling could add up to intermittent GPU hangs.
Comment 3 fangxun 2011-05-26 02:06:40 UTC
Created attachment 47180 [details]
oglc api-mtexcoord output

Tested with keithp's drm-intel-next kernel 9e3c25. The system doesn't hang, but the GPU hangs while the case is running, and the case still fails.
dmesg shows:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
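When checking many runs, a small filter over captured kernel log text can pick out this message. This is only a sketch: the function name is hypothetical, and it assumes the log has already been saved as text (for example from `dmesg`).

```python
import re

# Matches the i915 hangcheck message quoted above, with or without a
# leading dmesg timestamp.
HANG_RE = re.compile(
    r"\[drm:i915_hangcheck_elapsed\] \*ERROR\* Hangcheck timer elapsed"
)


def gpu_hang_lines(log_text):
    """Return the log lines that report a GPU hang."""
    return [line for line in log_text.splitlines() if HANG_RE.search(line)]
```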
Comment 4 fangxun 2011-05-26 02:11:47 UTC
Created attachment 47182 [details]
i915_error_state
Comment 5 fangxun 2011-05-26 02:18:49 UTC
(In reply to comment #1)
> Something fishy is happening with the bisect.  For example, commit 04e3f1d3
> adds some comments and moves a structure field.  There is no way that commit
> broke the build.  I recommend doing 'git clean -fxd' at each bisect step. 
> Mesa's build system is notoriously bad, so there could be garbage left over
> from previous builds that cause later builds to fail.

I ran 'git clean -fxd' at each bisect step, but the four Mesa commits still fail to build.
Comment 6 Eric Anholt 2011-06-19 11:25:30 UTC
Reproduced the failure (./oglconform -D 33 -s -v 3 -1 api-texcoord.c), and found a workaround (always_flush_batch=true in environment).
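A minimal sketch of running the reproduction with that workaround applied, assuming the variable is read from the process environment exactly as named above. The helper name `workaround_env` is hypothetical.

```python
import os

# The reproduction command from the comment above.
REPRO_CMD = ["./oglconform", "-D", "33", "-s", "-v", "3", "-1", "api-texcoord.c"]


def workaround_env(base_env=None):
    """Return a copy of the environment with the workaround variable set."""
    env = dict(os.environ if base_env is None else base_env)
    env["always_flush_batch"] = "true"
    return env

# To run the test with the workaround:
#   subprocess.run(REPRO_CMD, env=workaround_env())
```

Copying the environment instead of mutating `os.environ` keeps the workaround scoped to the one test run.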
Comment 7 Eric Anholt 2011-07-19 13:49:05 UTC
I did the bisect by rebasing the change sequence with the build failure so that the build never failed (squash last commit into first build-failing commit).  That leads the bisect to:

commit a82a43e8d99e1715dd11c9c091b5ab734079b6a6
Author: Eric Anholt <eric@anholt.net>
Date:   Fri Apr 22 16:00:14 2011 -0700

    i965/gen6: Use the dynamic state base address to reduce relocations.

The specific hunk that triggers the failure is moving brw_state_base_address, and bisecting *that* shows that it's when brw_state_base_address moves above gen6_vs_state.
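The second bisect, over where brw_state_base_address sits in the ordering, can be sketched as moving one entry up a list one slot at a time until the failure appears. This is an illustration, not the driver's code: the `placeholder_*` entries are made up, and `hangs` stands in for actually building and running the test with that ordering.

```python
def first_failing_order(atoms, moved, hangs):
    """Move `moved` upward one slot at a time; return the first ordering
    for which hangs(order) is true, or None if no position fails."""
    rest = [a for a in atoms if a != moved]
    start = atoms.index(moved)
    for pos in range(start, -1, -1):
        order = rest[:pos] + [moved] + rest[pos:]
        if hangs(order):
            return order
    return None


# Toy stand-in for running the test: it encodes the finding above, that the
# hang appears once brw_state_base_address precedes gen6_vs_state.
def toy_hangs(order):
    return order.index("brw_state_base_address") < order.index("gen6_vs_state")


atoms = [
    "placeholder_a",           # hypothetical entry
    "gen6_vs_state",
    "placeholder_b",           # hypothetical entry
    "brw_state_base_address",
]
result = first_failing_order(atoms, "brw_state_base_address", toy_hangs)
```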
Comment 8 Eric Anholt 2011-07-20 11:56:47 UTC
commit 3e5d36267d8c9536490c902f785137a7fa0637fc
Author: Eric Anholt <eric@anholt.net>
Date:   Tue Jul 19 15:06:15 2011 -0700

    i965: Apply a homebrew workaround for GPU hang in OGLC api-texcoord.
    
    The behavior of flushes in the hardware is a maze of twisty passages,
    and strangely the VS constants appear to be loaded during a pipeline
    flush instead of at the time of the packet emit according to the
    simulator.  On moving the STATE_BASE_ADDRESS packet to where it really
    needed to live (in order for data loads by other packets to be
    correct), we sometimes no longer got a flush between those packets
    where we apparently needed it.  This replicates the flushes implied by
    a STATE_BASE_ADDRESS update, fixing the GPU hangs in OGLC and the
    "engine" demo.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36821
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=39257
    Tested-by: Keith Packard <keithp@keithp.com> (bzflag and etracer fixed)
    Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Comment 9 fangxun 2011-08-24 22:36:47 UTC
Verified with master commit 1284d5b25507a56634519ac385cbc00a00b94417.