Bug 65764 - i915 hangcheck
Summary: i915 hangcheck
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: 9.1
Hardware: Other Linux (All)
Importance: medium normal
Assignee: Ian Romanick
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-14 19:47 UTC by Martin Weinberg
Modified: 2016-08-30 11:51 UTC

See Also:
i915 platform:
i915 features:


Attachments
Relevant lines from dmesg (479 bytes, text/plain)
2013-06-14 20:38 UTC, Martin Weinberg
debug info from kernel (2.15 MB, text/plain)
2013-06-14 20:39 UTC, Martin Weinberg
The i915_error_state file after a series of frequent GPU hangs (2.17 MB, text/plain)
2013-06-17 20:40 UTC, Martin Weinberg
Error state from GPU hang on the 3.9.6 kernel. (2.14 MB, text/plain)
2013-06-23 15:18 UTC, Martin Weinberg
more w/a flushes for gen6 blorb (565 bytes, patch)
2013-06-24 18:48 UTC, Daniel Vetter
Error state after patching the git mesa drivers. (2.17 MB, text/plain)
2013-06-25 14:36 UTC, Martin Weinberg
error state for old Mesa 9.1.1 drivers (2.17 MB, text/plain)
2013-06-25 20:44 UTC, Martin Weinberg
Relevant dmesg lines (486 bytes, text/plain)
2013-10-14 20:08 UTC, Martin Weinberg
The i915_error_state file after recent GPU hang (2.15 MB, text/plain)
2013-10-14 20:11 UTC, Martin Weinberg
Xorg system log, with possibly relevant info to the hang (146.86 KB, text/plain)
2013-10-14 20:13 UTC, Martin Weinberg

Description Martin Weinberg 2013-06-14 19:47:06 UTC

    
Comment 1 Daniel Vetter 2013-06-14 20:29:25 UTC
We need a notch more information here ... see https://01.org/linuxgraphics/documentation/how-report-bugs-0
Comment 2 Martin Weinberg 2013-06-14 20:38:55 UTC
Created attachment 80827 [details]
Relevant lines from dmesg
Comment 3 Martin Weinberg 2013-06-14 20:39:40 UTC
Created attachment 80828 [details]
debug info from kernel
Comment 4 Martin Weinberg 2013-06-14 21:26:15 UTC
I've been fighting with this problem since 3.8.x and appreciate that the latest kernel (3.10.x) seems to have this well under control.  In particular for 3.8.0, pounding heavily on the drm causes hang checks that require an X restart or a full reboot.

With 3.8.3 and esp. 3.10.x the hangs are more gracefully handled (yay!) but they still occur.  I can reliably (although still randomly) cause a GPU hang by running three glxgears and then watching a youtube video at HD res.

Additional system info gathered using apport:

ProblemType: Bug
DistroRelease: Ubuntu 13.04
Package: xorg 1:7.7+1ubuntu4
Uname: Linux 3.10.0-994-generic x86_64
.tmp.unity.support.test.0:

ApportVersion: 2.9.2-0ubuntu8.1
Architecture: amd64
CompizPlugins: [core,composite,opengl,compiztoolbox,decor,vpswitch,snap,mousepoll,resize,place,move,wall,grid,regex,imgpng,session,gnomecompat,animation,fade,unitymtgrabhandles,workarounds,scale,expo,ezoom,unityshell]
CompositorRunning: compiz
CompositorUnredirectDriverBlacklist: '(nouveau|Intel).*Mesa 8.0'
CompositorUnredirectFSW: true
Date: Fri Jun 14 16:02:29 2013
DistUpgraded: Fresh install
DistroCodename: raring
DistroVariant: ubuntu
EcryptfsInUse: Yes
ExtraDebuggingInterest: Yes
GpuHangFrequency: Several times a day
GpuHangReproducibility: Yes, I can easily reproduce it
GpuHangStarted: Immediately after installing this version of Ubuntu
GraphicsCard:
 Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0126] (rev 09) (prog-if 00 [VGA controller])
   Subsystem: Lenovo Device [17aa:21da]
InstallationDate: Installed on 2013-04-27 (48 days ago)
InstallationMedia: Ubuntu 13.04 "Raring Ringtail" - Release amd64 (20130424)
Lsusb:
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 04f2:b217 Chicony Electronics Co., Ltd Lenovo Integrated Camera (0.3MP)
MachineType: LENOVO 4286CTO
MarkForUpload: True
PlymouthDebug: Error: [Errno 13] Permission denied: '/var/log/plymouth-debug.log'
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.10.0-994-generic root=UUID=f9722d0d-2787-4da4-8c83-23da91112a32 ro crashkernel=384M-2G:64M,2G-:128M quiet splash vt.handoff=7
SourcePackage: xorg
Symptom: display
Title: Xorg freeze
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/11/2013
dmi.bios.vendor: LENOVO
dmi.bios.version: 8DET68WW (1.38 )
dmi.board.asset.tag: Not Available
dmi.board.name: 4286CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr8DET68WW(1.38):bd04/11/2013:svnLENOVO:pn4286CTO:pvrThinkPadX220:rvnLENOVO:rn4286CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 4286CTO
dmi.product.version: ThinkPad X220
dmi.sys.vendor: LENOVO
version.compiz: compiz 1:0.9.9~daily13.04.18.1~13.04-0ubuntu1
version.ia32-libs: ia32-libs 20090808ubuntu36
version.libdrm2: libdrm2 2.4.45+git20130607.a0178c00-0ubuntu0sarvatt~raring
version.libgl1-mesa-dri: libgl1-mesa-dri 9.2.0~git20130612.adf324ad-0ubuntu0sarvatt~raring
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 9.2.0~git20130612.adf324ad-0ubuntu0sarvatt~raring
version.xserver-xorg-core: xserver-xorg-core 2:1.13.4~git20130508+server-1.13-branch.10c42f57-0ubuntu0ricotz~raring
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.7.3-0ubuntu2b2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:7.1.99+git20130531.bd2557ea-0ubuntu0sarvatt~raring
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.21.9+git20130612.1f180b89-0ubuntu0sarvatt~raring
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.7+git20130516.bf72ae1f-0ubuntu0sarvatt~raring
xserver.bootTime: Fri Jun 14 15:06:42 2013
xserver.configfile: default
xserver.errors:

xserver.logfile: /var/log/Xorg.0.log
xserver.version: 2:1.13.4~git20130508+server-1.13-branch.10c42f57-0ubuntu0ricotz~raring
xserver.video_driver: intel
Comment 5 Martin Weinberg 2013-06-17 20:40:06 UTC
I thought I should mention, just in case it's not clear, that the GPU hangs randomly, so my prescription of running multiple instances of glxgears etc. was just a recipe for forcing the issue.

Earlier today, I was getting GPU hangs every 3 minutes or so.  I'm including the error_state file from one of these in case it's useful.

I was doing nothing "outrageous" at the time, e.g. editing a source file in emacs and looking at a gnuplot window.  Rebooting seemed to improve the situation.
Comment 6 Martin Weinberg 2013-06-17 20:40:52 UTC
Created attachment 80963 [details]
The i915_error_state file after a series of frequent GPU hangs
Comment 7 Chris Wilson 2013-06-18 13:01:17 UTC
Can you please grab a few more error states? That first one looks to be a blorp (mesa/i965) failure.
Comment 8 Martin Weinberg 2013-06-18 16:49:56 UTC
That's curious.  When these issues began with the stock Ubuntu 13.04 kernel, I first tried upgrading the intel mesa stuff from the bleeding edge X repository.  That didn't help.  Then I tried newer and newer kernels.  Maybe you guys have fixed the kernel issue and I've made things worse with the experimental mesa drivers.

I will downgrade and see how the original "stable" mesa libs perform with the new kernel, and grab any error states.
Comment 9 Martin Weinberg 2013-06-22 18:42:02 UTC
Having downgraded to Mesa 9.1.1 from the stable repository I've found that the problems are gone (so far) for kernels 3.9.6 and 3.10.0rc6, although still present at 3.8.x.

Sorry about that confusion.  But I'm grateful for the attention and help.
Comment 10 Martin Weinberg 2013-06-23 15:17:07 UTC
Looks like I spoke too soon: here is another error state on kernel 3.9.6.  Required an X11 restart.
Comment 11 Martin Weinberg 2013-06-23 15:18:13 UTC
Created attachment 81269 [details]
Error state from GPU hang on the 3.9.6 kernel.
Comment 12 Chris Wilson 2013-06-23 15:49:19 UTC
And the death is still caused by a mesa blorp operation.
Comment 13 Martin Weinberg 2013-06-23 16:50:46 UTC
Ok, that is good to know.  I'm glad to hear that the kernel issues are really fixed.

But what to do about these mesa blorps?  I guess I'll file a report with the Ubuntu folks.
Comment 14 Martin Weinberg 2013-06-23 20:32:22 UTC
This keeps happening with kernel 3.9.6.  The last hang was not recoverable, so there is no error state.  Even if it's a blorp, there is clearly kernel dependence.

Can I use the i915_error_txt myself to get some insight?  Or at least tell if the problem is due to a mesa blorp?

I'd sure like to get to the bottom of this.  What a nuisance!
Comment 15 Daniel Vetter 2013-06-24 15:53:50 UTC
Mesa's blorp is just the fancy copypixel engine i965_dri.so uses. Upgrading to latest mesa git should resolve this.
Comment 16 Martin Weinberg 2013-06-24 17:42:58 UTC
I tried using Mesa from git; the hangs are worse.  Seems that the best strategy is to use kernel 3.10 rcX with Mesa 9.1.1.

Does that make sense to you in any way?
Comment 17 Daniel Vetter 2013-06-24 18:48:59 UTC
Created attachment 81357 [details] [review]
more w/a flushes for gen6 blorb

Please try out the attached mesa patch, thanks.
Comment 18 Martin Weinberg 2013-06-25 14:35:30 UTC
That patch may be helping: drm is still reporting hangchecks, but all have recovered so far.  See attached error state.

The best still seems to be 3.10.rc6 with Mesa 9.1.1.  I do not believe that this combo has hung yet.
Comment 19 Martin Weinberg 2013-06-25 14:36:44 UTC
Created attachment 81413 [details]
Error state after patching the git mesa drivers.
Comment 20 Chris Wilson 2013-06-25 14:59:38 UTC
Can you please try:

diff --git a/src/mesa/drivers/dri/i965/brw_misc_state.c b/src/mesa/drivers/dri/i965/brw_misc_state.c
index 7e41c84..798c727 100644
--- a/src/mesa/drivers/dri/i965/brw_misc_state.c
+++ b/src/mesa/drivers/dri/i965/brw_misc_state.c
@@ -1079,7 +1079,7 @@ static void upload_state_base_address( struct brw_context *brw )
 	* If this isn't programmed to a real bound, the sampler border color
 	* pointer is rejected, causing border color to mysteriously fail.
 	*/
-       OUT_BATCH(0xfffff001);
+       OUT_BATCH(0x7ffff001);
        OUT_BATCH(1); /* Indirect object upper bound */
        OUT_BATCH(1); /* Instruction access upper bound */
        ADVANCE_BATCH();
diff --git a/src/mesa/drivers/dri/i965/gen6_blorp.cpp b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
index 3ccd90e..a0ed34c 100644
--- a/src/mesa/drivers/dri/i965/gen6_blorp.cpp
+++ b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
@@ -97,7 +97,7 @@ gen6_blorp_emit_state_base_address(struct brw_context *brw,
     * If this isn't programmed to a real bound, the sampler border color
     * pointer is rejected, causing border color to mysteriously fail.
     */
-   OUT_BATCH(0xfffff001);
+   OUT_BATCH(0x7ffff001);
    OUT_BATCH(1); /* IndirectObjectUpperBound*/
    OUT_BATCH(1); /* InstructionAccessUpperBound */
    ADVANCE_BATCH();
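
For anyone reading the diff above, here is a minimal sketch of what the changed constant appears to encode. It assumes (this is not stated anywhere in this thread) that the DWord packs a 4 KiB-aligned upper bound in bits 31:12 and a "modify enable" flag in bit 0. The macro and function names are hypothetical and are not taken from Mesa.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout (not stated in this thread): bits 31:12 hold a
 * 4 KiB-aligned upper bound, bit 0 is the "modify enable" flag. */
#define BOUND_ADDRESS_MASK  0xfffff000u
#define BOUND_MODIFY_ENABLE 0x00000001u

/* Extract the page-aligned bound from a packed DWord. */
static uint32_t bound_address(uint32_t dword)
{
    return dword & BOUND_ADDRESS_MASK;
}

/* Return 1 if the modify-enable flag is set, 0 otherwise. */
static int bound_modify_enabled(uint32_t dword)
{
    return (dword & BOUND_MODIFY_ENABLE) != 0;
}

/* Sanity check: the proposed value lowers the bound but keeps the flag. */
static void check_proposed_value(void)
{
    const uint32_t before = 0xfffff001u; /* value removed by the diff */
    const uint32_t after  = 0x7ffff001u; /* value added by the diff */

    assert(bound_modify_enabled(before) && bound_modify_enabled(after));
    assert(bound_address(before) == 0xfffff000u);
    assert(bound_address(after)  == 0x7ffff000u);
}
```

Under that assumption, the change leaves the flag bit alone and only clears the top address bit, so the bound drops from just under 4 GiB to just under 2 GiB.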
Comment 21 Martin Weinberg 2013-06-25 16:44:31 UTC
I tried both patches, both singly and together, and the unpatched git drivers (three tests). AFAICT, they all lead to a similar amount of hang checking under graphics load. 

The details: my test consists of running three glxgears instances and then opening up firefox and trying to watch a youtube video.  Of course, this is not something I generally do but it does seem to generate GPU hangs so it's a good test.  In some cases, simply using compiz transitions was sufficient to get a hang, once the glxgears processes were running.  In all three tests with the git drivers, I saw 4-5 hangs in a few minutes.  All of them recovered.

I then downgraded the drivers, restarted and performed the same test: no hangs at all.
Comment 22 Martin Weinberg 2013-06-25 20:44:17 UTC
Created attachment 81425 [details]
error state for old Mesa 9.1.1 drivers

Experienced the first hangcheck using 3.10-rc6 and Mesa 9.1.1.  Included here just in case it's helpful.
Comment 23 Chris Wilson 2013-07-20 14:52:45 UTC
The last hangcheck is not associated with a blorp...
Comment 24 Martin Weinberg 2013-10-14 20:08:38 UTC
Created attachment 87627 [details]
Relevant dmesg lines
Comment 25 Martin Weinberg 2013-10-14 20:11:28 UTC
Created attachment 87628 [details]
The i915_error_state file after recent GPU hang
Comment 26 Martin Weinberg 2013-10-14 20:13:08 UTC
Created attachment 87629 [details]
Xorg system log, with possibly relevant info to the hang
Comment 27 Martin Weinberg 2013-10-14 20:15:40 UTC
It's been a while since I reported this problem, and it's less often fatal (i.e. requiring a full reboot) with recent kernels and Mesa packages.  Hangs requiring a reboot occur about once a week in normal usage (still way too frequent, yes?).

I'm currently using Kernel 3.12.0-rc3 and the latest Mesa packages compiled by the xorg-edgers team (obtained from the xorg-edgers ppa).

Any advice??
Comment 28 Daniel Vetter 2013-10-28 18:16:57 UTC
Please test Ken's snb blorp fixes from

http://cgit.freedesktop.org/~kwg/mesa/log/?h=snbfixes
Comment 29 Matt Turner 2016-08-29 22:42:14 UTC
Please let us know whether this is still a problem with the latest Mesa (12.0.1).
Comment 30 Martin Weinberg 2016-08-30 11:51:02 UTC
The problem seems to be gone with Mesa 11.2.0.  Thanks for following up.

