|Summary:||[uxa ilk] GPU hang when running CPU intensive tasks|
|Component:||Driver/intel||Assignee:||Chris Wilson <chris>|
|Status:||RESOLVED FIXED||QA Contact:||Xorg Project Team <xorg-team>|
|i915 platform:||i915 features:|
Description spb.nevill 2012-03-30 15:09:30 UTC
Created attachment 59295 [details] /var/log/syslog

Bug description: GPU hangs randomly, most often when CPU is heavily used. I can reproduce it 100% in about an hour when using boinc client that puts a heavy load on CPU.

System environment:
-- chipset: i915
-- system architecture: 64-bit
-- xf86-video-intel: 2.17.0
-- xorg-server: 1.11.4
-- Mesa: 8.0.2
-- libdrm: 2.4.32
-- kernel: 22.214.171.124-20-generic
-- Kubuntu 12.04
-- Acer TRAVELMATE 5744-383G32Mikk

Additional info:
I have not seen a GPU hang in Windows.
"i915.semaphores=0" or "i915.semaphores=1" in GRUB makes no difference.
syslog, Xorg.0, dmesg and intel_reg_dumper output attached.
i915_error_state shows no errors.
Comment 3 spb.nevill 2012-03-30 15:11:54 UTC
Created attachment 59298 [details] intel_reg_dumper output
Comment 4 Chris Wilson 2012-03-30 15:15:41 UTC
Describe the symptoms please, because we've established that it is not what we call a GPU hang.
Comment 5 spb.nevill 2012-03-30 15:34:45 UTC
I've read this string and assumed it was a GPU hang:

Mar 29 03:37:50 Nevill-Acer kernel: [ 7982.136794] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

The interface elements lock up. Most of the windows can no longer be maximized or minimized, or interacted with at all. Buttons on the screen cease to respond. I usually have Psi+ running when it happens, and it becomes the only window I can interact with. I can type, but nothing more. I have to switch to another console to get the logs.

I have also noticed artifacts on the screen a couple of times before the incident, something like what was described here: https://bugs.freedesktop.org/show_bug.cgi?id=36459 There are rectangular blocks of pixels on the screen where the colors are off. I tried to take a screenshot the first time it happened, but Ksnapshot crashed every time I tried.
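Editorial sketch: the hangcheck message quoted above can be searched for in a saved syslog or dmesg dump. The snippet below is a minimal illustration, not part of the original report; the function name `find_gpu_hangs` and the regular expression are assumptions based only on the single log line quoted in this comment, so other kernel versions may word the message differently.

```python
import re

# Assumed pattern, derived from the one log line quoted in the comment above.
HANGCHECK_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+\[drm:i915_hangcheck_elapsed\].*GPU hung"
)


def find_gpu_hangs(log_text):
    """Return the kernel uptime stamps (in seconds) of every hangcheck report."""
    return [float(m.group("uptime")) for m in HANGCHECK_RE.finditer(log_text)]


sample = (
    "Mar 29 03:37:50 Nevill-Acer kernel: [ 7982.136794] "
    "[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung\n"
)
print(find_gpu_hangs(sample))
```

An empty result would be consistent with Comment 7, where the attached Xorg.0.log and dmesg were found to contain no reference to the hang.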
Comment 6 spb.nevill 2012-03-30 15:42:35 UTC
It might be important to add that I was running applications under wine at the time I saw the artifacts, and the colors were off only in those windows.
Comment 7 Chris Wilson 2012-03-30 16:19:30 UTC
Ok, that's a GPU hang. (Neither your Xorg.0.log nor dmesg contain a reference to the hang and you said error-state was empty so I presumed we were looking for something else.) Please can you attach a copy of /sys/kernel/debug/dri/0/i915_error_state before you reboot (otherwise we lose the debug info)? Is it consistently dying when you use the wine applications and not at other times?
Comment 8 spb.nevill 2012-03-30 16:26:32 UTC
No, it's dying pretty much every time I use the CPU extensively. I've only seen artifacts directly prior to the hang in wine, but then again, most of the time I wasn't there when it hung. I'll try to reproduce it tonight and will attach the corresponding file.
Comment 9 spb.nevill 2012-03-31 23:32:38 UTC
Created attachment 59324 [details] i915_error_state

It took a bit longer than expected. Maybe turning semaphores off helps somewhat? In mc the file shows up as an empty, 0-byte file, which is why I thought it was empty. However, when I copied it to another folder, I got a 'Can't read memory' error and ended up with the file in the attachment.

I have also managed to take a screenshot this time. Notice that the only window that still has minimize/maximize/close buttons is the Chromium browser; the rest of the windows cannot be moved or affected in any way.
Comment 10 spb.nevill 2012-03-31 23:33:38 UTC
Created attachment 59325 [details] GPU hang screenshot
Comment 11 Chris Wilson 2012-04-01 10:12:30 UTC
That truncated error-state isn't particularly useful. You may need to free up some kernel memory before it can be read; exiting X is usually required in those circumstances.
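Editorial sketch: because the debugfs entry reports a size of 0 bytes, the reported size says nothing about how much data is there, and one way to save it is to read until end-of-file. The snippet below is a minimal illustration under that assumption; `save_error_state` is a hypothetical helper, and only the source path comes from this thread. Reading the real file normally needs root, and a read can still fail part-way if the kernel is short of memory, which would leave a truncated copy like the one discussed above.

```python
import os
import shutil

# Path taken from Comment 7; everything else here is illustrative.
ERROR_STATE = "/sys/kernel/debug/dri/0/i915_error_state"


def save_error_state(dst, src=ERROR_STATE):
    """Copy src to dst by reading until EOF; return the number of bytes saved.

    An OSError raised part-way through is not caught, so the caller can tell
    that the copy on disk is incomplete.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return os.path.getsize(dst)
```

A call such as `save_error_state("/tmp/i915_error_state.txt")` would have to be made before rebooting, since the debug information is lost after that.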
Comment 12 spb.nevill 2012-04-02 22:52:34 UTC
Created attachment 59407 [details] i915_error_state untruncated
Comment 13 Chris Wilson 2012-04-03 00:34:13 UTC
So something clobbered the ringbuffer with a wild (tiled) write; a write to address 0 would do the trick:

ringbuffer (render ring) at 0x00001000:
0x00001000: 0xffededed: UNKNOWN
0x00001004: 0xffededed: UNKNOWN
0x00001008: 0xffededed: UNKNOWN
0x0000100c: 0xffededed: UNKNOWN
0x00001010: 0xffededed: UNKNOWN
0x00001014: 0xffededed: UNKNOWN
0x00001018: 0xffededed: UNKNOWN
0x0000101c: 0xffededed: UNKNOWN
0x00001020: 0xffededed: UNKNOWN
0x00001024: 0xffededed: UNKNOWN
0x00001028: 0xffededed: UNKNOWN
0x0000102c: 0xffededed: UNKNOWN
0x00001030: 0xffededed: UNKNOWN
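Editorial sketch: the sign of corruption in the dump above is that every dword in the ring holds the same value (0xffededed) instead of decodable commands. The snippet below shows one way to check a decoded dump for that; it is a heuristic illustration only. `dominant_dword` is a hypothetical helper, and the assumed input format is one "address: value: decode" entry per line, as in the dump quoted in this comment.

```python
import re
from collections import Counter

# Matches lines such as "0x00001000: 0xffededed: UNKNOWN" (assumed format).
DWORD_RE = re.compile(
    r"^[ \t]*0x[0-9a-fA-F]{8}:[ \t]+0x([0-9a-fA-F]{8}):", re.MULTILINE
)


def dominant_dword(dump_text):
    """Return (value, fraction) for the most frequent dword in the dump."""
    values = [int(v, 16) for v in DWORD_RE.findall(dump_text)]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


sample = """ringbuffer (render ring) at 0x00001000:
0x00001000: 0xffededed: UNKNOWN
0x00001004: 0xffededed: UNKNOWN
0x00001008: 0xffededed: UNKNOWN
"""
value, fraction = dominant_dword(sample)
print(hex(value), fraction)
```

A fraction close to 1.0 suggests the ring was overwritten with a fill pattern; it does not identify which write was responsible.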
Comment 14 Chris Wilson 2012-04-03 06:49:20 UTC
Created attachment 59421 [details] [review] Detect writes to address 0

If you have the opportunity, can you try applying this patch and attaching the resultant error-state? Hopefully this will catch the culprit red-handed.
Comment 15 spb.nevill 2012-04-03 07:06:33 UTC
I can try, if you would be so kind as to provide a step-by-step instruction on how to apply it. I am not a programmer, after all.
Comment 16 Chris Wilson 2012-04-03 07:16:30 UTC
First it requires a kernel checkout,

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    cd linux
    git remote add -f intel git://people.freedesktop.org/~danvet/drm-intel
    git merge intel/drm-intel-next-queued

Then apply the patch with

    git am -3 < ~/detect-bad-writes.patch

Copy your distro config and build

    cp /boot/config-`uname -r` .config
    make
    sudo make modules_install install

or make kpkg (depending upon personal preference or distro)

Boot into new kernel.
Comment 17 spb.nevill 2012-04-03 11:01:49 UTC
I have downloaded the patch, saved it as a "detect-bad-writes.patch" and tried to do as you said. I've got the following error:

> git am -3 < ~/detect-bad-writes.patch
Applying: agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_gem_gtt.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0001 agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort"

What do I do now? Do I "git add" the file and then run "git am --resolved"?
Comment 18 Chris Wilson 2012-04-03 11:14:33 UTC
Created attachment 59443 [details] [review] Detect writes to address 0 Patch rebased against intel/drm-intel-next-queued
Comment 19 spb.nevill 2012-04-05 09:57:19 UTC
The distro fails to build for me. I get the following error:

ERROR: "__modver_version_show" [drivers/staging/rts5139/rts5139.ko] undefined!
WARNING: modpost: Found 5 section mismatch(es).

I only have a Core i3 processor, so the build takes a long time. Basically, I can only try to build once a day, while I am at work. Could you please rebase your patch against the latest stable kernel, to keep the experimenting and unsuccessful tries to a minimum? http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.1.tar.bz2

If you did that, how would the instructions change? Would I still need to merge intel/drm-intel-next-queued?
Comment 20 Chris Wilson 2012-10-30 09:59:58 UTC
The wild write was tracked down to mesa incorrectly clearing its depth/stencil buffers... Definitely fixed by 8.0.4/9.0.