Bug 48105

Summary: [uxa ilk] GPU hang when running CPU intensive tasks
Product: xorg Reporter: spb.nevill
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: medium CC: hramrach
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
/var/log/syslog | none
dmesg | none
Xorg.0.log | none
intel_reg_dumper output | none
i915_error_state | none
GPU hang screenshot | none
i915_error_state untruncated | none
Detect writes to address 0 | none
Detect writes to address 0 | none

Description spb.nevill 2012-03-30 15:09:30 UTC
Created attachment 59295 [details]
/var/log/syslog

Bug description: 
GPU hangs randomly, most often when CPU is heavily used. I can reproduce it 100% in about an hour when using boinc client that puts a heavy load on CPU.

System environment: 
-- chipset: i915
-- system architecture: 64-bit
-- xf86-video-intel: 2.17.0
-- xorg-server: 1.11.4
-- Mesa: 8.0.2
-- libdrm: 2.4.32
-- kernel: 3.2.0-20-generic
-- Kubuntu 12.04
-- Acer TRAVELMATE 5744-383G32Mikk

Additional info:

I have not seen gpu hang in Windows 
"i915.semaphores=0" or "i915.semaphores=1" in GRUB makes no difference
syslog, Xorg.0, dmesg and intel_reg_dumper output attached. i915_error_state shows no errors.
Comment 1 spb.nevill 2012-03-30 15:10:44 UTC
Created attachment 59296 [details]
dmesg
Comment 2 spb.nevill 2012-03-30 15:11:15 UTC
Created attachment 59297 [details]
Xorg.0.log
Comment 3 spb.nevill 2012-03-30 15:11:54 UTC
Created attachment 59298 [details]
intel_reg_dumper output
Comment 4 Chris Wilson 2012-03-30 15:15:41 UTC
Describe the symptoms please, because we've established that it is not what we call a GPU hang.
Comment 5 spb.nevill 2012-03-30 15:34:45 UTC
I've read this string and assumed it was a GPU hang:

Mar 29 03:37:50 Nevill-Acer kernel: [ 7982.136794] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

The interface elements lock up. Most of the windows can no longer be maximized/minimized, or interacted with at all. Buttons on the screen cease to respond. I usually have Psi+ running when it happens, and it becomes the only window I can interact with. I can type, but nothing more. I have to switch to another console to get the logs.

I have also noticed artifacts on the screen a couple of times before the incident, something like what was described here:
https://bugs.freedesktop.org/show_bug.cgi?id=36459
There are rectangular blocks of pixels on the screen where colors are off. I tried to take a screenshot the first time it happened, but Ksnapshot crashed every time while doing so.
Comment 6 spb.nevill 2012-03-30 15:42:35 UTC
It might be important to add that I was running applications with wine at the time I saw the artifacts, and the colors were off only in those windows.
Comment 7 Chris Wilson 2012-03-30 16:19:30 UTC
Ok, that's a GPU hang. (Neither your Xorg.0.log nor dmesg contains a reference to the hang, and you said the error-state was empty, so I presumed we were looking for something else.) Please can you attach a copy of /sys/kernel/debug/dri/0/i915_error_state before you reboot (otherwise we lose the debug info)? Is it consistently dying when you use the wine applications and not at other times?
Comment 8 spb.nevill 2012-03-30 16:26:32 UTC
No, it's dying pretty much every time I use the CPU extensively. I've only seen artifacts directly prior to the hang in wine, but then again, most of the time I wasn't there when it hung. I'll try to reproduce it tonight and will attach the corresponding file.
Comment 9 spb.nevill 2012-03-31 23:32:38 UTC
Created attachment 59324 [details]
i915_error_state

It took a bit longer than expected. Maybe turning semaphores off helps somewhat?

The file itself reports as an empty 0 byte file with no content in mc, that's why I thought it was empty. However, when I copied it to another folder, I've got a 'Can't read memory' error and ended up with the file in attachment.

I have also managed to take a screenshot this time. Notice that the only window that has minimize/maximize/close buttons left is the Chromium browser. The rest of the windows cannot be moved or affected in any way.
Comment 10 spb.nevill 2012-03-31 23:33:38 UTC
Created attachment 59325 [details]
GPU hang screenshot
Comment 11 Chris Wilson 2012-04-01 10:12:30 UTC
That truncated error-state isn't particularly useful. You may need to free up some kernel memory before it can be read; exiting X is usually required in those circumstances.
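The capture step discussed in the comments above can be sketched as a small shell function. This is only an illustration: the function name save_error_state and the default destination file are hypothetical, and the only path taken from this report is /sys/kernel/debug/dri/0/i915_error_state. It reads the file's contents directly, because files under debugfs report a size of 0 bytes and can look empty in a file manager even when they hold data.

```shell
# Sketch: copy the i915 error state out of debugfs before rebooting.
# save_error_state and the default destination are hypothetical names.
save_error_state() {
    src="${1:-/sys/kernel/debug/dri/0/i915_error_state}"
    dest="${2:-$HOME/i915_error_state.txt}"

    if [ ! -r "$src" ]; then
        echo "cannot read $src (run as root; debugfs must be mounted)" >&2
        return 1
    fi

    # Read the contents directly: debugfs files report a size of 0 bytes,
    # so the listed size says nothing about whether there is data.
    if ! cat "$src" > "$dest"; then
        echo "read failed; try exiting X to free kernel memory, then retry" >&2
        return 1
    fi

    # Print the number of bytes actually captured.
    wc -c < "$dest"
}
```

Called from a root shell with no arguments, it would read the debugfs path above; a non-zero return suggests the copy is missing or truncated and should be retried before rebooting.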
Comment 12 spb.nevill 2012-04-02 22:52:34 UTC
Created attachment 59407 [details]
i915_error_state untruncated
Comment 13 Chris Wilson 2012-04-03 00:34:13 UTC
So something clobbered the ringbuffer with a wild (tiled) write; a write to address 0 would do the trick:

ringbuffer (render ring) at 0x00001000:
0x00001000:      0xffededed: UNKNOWN
0x00001004:      0xffededed: UNKNOWN
0x00001008:      0xffededed: UNKNOWN
0x0000100c:      0xffededed: UNKNOWN
0x00001010:      0xffededed: UNKNOWN
0x00001014:      0xffededed: UNKNOWN
0x00001018:      0xffededed: UNKNOWN
0x0000101c:      0xffededed: UNKNOWN
0x00001020:      0xffededed: UNKNOWN
0x00001024:      0xffededed: UNKNOWN
0x00001028:      0xffededed: UNKNOWN
0x0000102c:      0xffededed: UNKNOWN
0x00001030:      0xffededed: UNKNOWN
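A quick way to gauge how much of a dump carries a pattern like the one above is to count the matching dwords. The following is a rough sketch, not part of any Intel tooling: count_dword is a hypothetical helper, and it assumes the error state has already been decoded into the "address: value: decode" line format shown in this comment.

```shell
# Sketch: count how many decoded dwords in a dump equal a given value.
# count_dword is a hypothetical helper; it expects lines shaped like
#   0x00001000:      0xffededed: UNKNOWN
count_dword() {
    value="$1"   # dword to look for, e.g. 0xffededed
    file="$2"    # decoded error-state text
    # Field 2 is the dword followed by a colon; compare it as a string.
    awk -v want="${value}:" '$2 == want { n++ } END { print n + 0 }' "$file"
}
```

For example, `count_dword 0xffededed decoded_error_state.txt` (the file name is a placeholder) prints the number of lines whose value field is 0xffededed. A long run of one repeated value where ring commands are expected is consistent with the ring having been overwritten.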
Comment 14 Chris Wilson 2012-04-03 06:49:20 UTC
Created attachment 59421 [details] [review]
Detect writes to address 0

If you have the opportunity, can you try applying this patch and attaching the resultant error-state? Hopefully this will catch the culprit red-handed.
Comment 15 spb.nevill 2012-04-03 07:06:33 UTC
I can try, if you would be so kind as to provide a step-by-step instruction on how to apply it. I am not a programmer, after all.
Comment 16 Chris Wilson 2012-04-03 07:16:30 UTC
First it requires a kernel checkout,

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git remote add -f intel git://people.freedesktop.org/~danvet/drm-intel
git merge intel/drm-intel-next-queued

Then apply the patch with

git am -3 < ~/detect-bad-writes.patch

Copy your distro config and build

cp /boot/config-`uname -r` .config
make

sudo make modules_install install, or use make-kpkg (depending upon personal preference or distro)

Boot into new kernel.
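Since a full kernel build can take hours on a slow machine, it may be worth confirming that the patch applies to the checked-out tree before starting the build. A minimal sketch follows, assuming it is run from the top of the kernel checkout; patch_applies is a hypothetical helper name, and `git apply --check` only tests the patch without changing any files.

```shell
# Sketch: report whether a patch would apply to the current tree
# without modifying any files. patch_applies is a hypothetical helper.
patch_applies() {
    if git apply --check "$1" 2>/dev/null; then
        echo "applies"
    else
        echo "conflicts"
    fi
}
```

Running `patch_applies ~/detect-bad-writes.patch` before `git am` prints "applies" when the patch matches the tree. If it prints "conflicts", the tree and the patch are probably based on different revisions, and building would not test the intended change.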
Comment 17 spb.nevill 2012-04-03 11:01:49 UTC
I have downloaded the patch, saved it as "detect-bad-writes.patch" and tried to do as you said. I got the following error:
> git am -3 < ~/detect-bad-writes.patch

Applying: agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_gem_gtt.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0001 agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort"

What do I do now? Do I "git add" the file and then run "git am --resolved"?
Comment 18 Chris Wilson 2012-04-03 11:14:33 UTC
Created attachment 59443 [details] [review]
Detect writes to address 0

Patch rebased against intel/drm-intel-next-queued
Comment 19 spb.nevill 2012-04-05 09:57:19 UTC
The kernel fails to build for me. I get the following error:
ERROR: "__modver_version_show" [drivers/staging/rts5139/rts5139.ko] undefined!
WARNING: modpost: Found 5 section mismatch(es).

I only have a core-i3 processor, thus the process takes a long time. Basically, I can only try and build once a day, when I am going to work. Could you please rebase your patch against the latest stable kernel so as to keep experimenting and unsuccessful tries to a minimum?

http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.1.tar.bz2

If you do that, how will this change the instructions? Would I still need to merge intel/drm-intel-next-queued?
Comment 20 Chris Wilson 2012-10-30 09:59:58 UTC
The wild write was tracked down to mesa incorrectly clearing its depth/stencil buffers... Definitely fixed by 8.0.4/9.0.
