50405 – *ERROR* Hangcheck timer elapsed... GPU hung


Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

Reported:	2012-05-27 18:19 UTC by Linus Torvalds
Modified:	2012-05-28 00:03 UTC
CC List:	0 users

Attachments
/debug/dri/0/i915_error_state file contents (1.41 MB, application/octet-stream) 2012-05-27 18:19 UTC, Linus Torvalds

Description Linus Torvalds 2012-05-27 18:19:36 UTC

Created attachment 62149
/debug/dri/0/i915_error_state file contents

On Westmere (Intel Core-i5 960), I just got an X.org hang for the first time in a long time, with the kernel saying just

  [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
  [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
  [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
  [drm:intel_pipe_set_base] *ERROR* pin & fence failed

with F14 and current kernel -git as of May 27, 2012: commit 1e2aec873ad6d.

i915_error_state file attached as per kernel dmesg.

Nothing much was going on, reading email and running "Bejeweled" in chrome (with WebGL forced on - it shows some artifacts but mostly works).

I could switch to another VT and kill the X server, but when a new one started, it showed the cursor while the screen otherwise remained black. So things were really hung on the GPU side.

Comment 1 Linus Torvalds 2012-05-27 18:31:58 UTC

The driver versions in bugzilla don't make sense. From Xorg.0.log:

  X.Org X Server 1.9.5
  Release Date: 2011-03-17
  ..
  (II) Module intel: vendor="X.Org Foundation"
     compiled for 1.9.0, module version = 2.12.0

fwiw. I realize F14 is old, but this system really *has* been stable, so I think this is a new i915 drm kernel bug.

Comment 2 Linus Torvalds 2012-05-27 18:37:07 UTC

Oh wow. Even a warm-reboot didn't fix it. After a warm-reboot, X came up fine, but starting bejeweled in chrome immediately resulted in the hangcheck timer elapsing, and then

 [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
 [drm:i915_reset] *ERROR* Failed to reset chip.

and I needed to cold-reboot to get it all to fully work again.

Comment 3 Chris Wilson 2012-05-28 00:03:23 UTC

Looks to be the old userspace bug:

commit 3c5b1399e29ef577b8b91655b5e1c215d1b6dfbb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 9 20:20:06 2010 +0000

    i915: Disable maximum state addresses
    
    As the kernel controls the relocation of state buffers, we should not
    hard code the maximum permissible value for them.
    
    Fixes an eventual hang with full-gtt.
    
    Reported-by: Peter Clifton <pcjc2@cam.ac.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which will strike eventually (mostly depending on how fragmented the GTT is). 2.14.901 is the first release that won't run afoul of that bug.
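
For anyone wanting to check whether a given system is exposed, here is a rough sketch in Python. It assumes that the 2.14.901 figure refers to the same intel X driver module whose version is printed in Xorg.0.log (as in the excerpt in comment 1), and that the log keeps that two-line "Module intel: ... module version = X.Y.Z" layout. The function names and the sample text are made up for illustration and are not part of any driver tooling.

```python
import re

# First release said in this report to be unaffected by the bug.
FIXED_VERSION = (2, 14, 901)

# Matches the line shown in the Xorg.0.log excerpt:
#   "compiled for 1.9.0, module version = 2.12.0"
_VERSION_RE = re.compile(r"module version = (\d+)\.(\d+)\.(\d+)")


def intel_module_version(log_text):
    """Return the intel module version as a tuple of ints, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "Module intel:" not in line:
            continue
        # The version is printed on the same line or on the next one.
        for candidate in lines[i:i + 2]:
            match = _VERSION_RE.search(candidate)
            if match:
                return tuple(int(part) for part in match.groups())
    return None


def has_state_address_fix(version):
    """True if the version is at or past the first unaffected release."""
    return version >= FIXED_VERSION


# Hypothetical sample mirroring the excerpt from comment 1.
sample = """\
(II) Module intel: vendor="X.Org Foundation"
   compiled for 1.9.0, module version = 2.12.0
"""
version = intel_module_version(sample)
print(version, has_state_address_fix(version))
```

Comparing the version as a tuple of integers avoids the usual string-comparison trap, where "2.9.0" would sort after "2.14.901". To run it against a real system, pass in the contents of the X server log (commonly /var/log/Xorg.0.log, though the location varies by distribution).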
