Created attachment 59295 [details]
The GPU hangs randomly, most often when the CPU is under heavy load. I can reproduce it reliably within about an hour by running the BOINC client, which puts a heavy load on the CPU.
-- chipset: i915
-- system architecture: 64-bit
-- xf86-video-intel: 2.17.0
-- xorg-server: 1.11.4
-- Mesa: 8.0.2
-- libdrm: 2.4.32
-- kernel: 220.127.116.11-20-generic
-- Kubuntu 12.04
-- Acer TRAVELMATE 5744-383G32Mikk
I have not seen a GPU hang in Windows.
Setting "i915.semaphores=0" or "i915.semaphores=1" in GRUB makes no difference.
syslog, Xorg.0, dmesg and intel_reg_dumper output attached. i915_error_state shows no errors.
Created attachment 59296 [details]
Created attachment 59297 [details]
Created attachment 59298 [details]
Describe the symptoms please, because we've established that it is not what we call a GPU hang.
I've read this string and assumed it was a GPU hang:
Mar 29 03:37:50 Nevill-Acer kernel: [ 7982.136794] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
The interface elements lock up. Most of the windows can no longer be maximized, minimized, or interacted with at all. Buttons on the screen cease to respond. I usually have Psi+ running when it happens, and it becomes the only window I can interact with. I can type, but nothing more. I have to switch to another console to get the logs.
I have also noticed artifacts on the screen a couple of times before the incident, something like what was described here:
There are rectangular blocks of pixels on the screen where colors are off. I tried to take a screenshot the first time it happened, but Ksnapshot crashed every time while doing so.
It might be important to add that I was running applications with wine at the time I saw the artifacts, and the colors were off only in those windows.
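For anyone checking whether their own logs show the same hangcheck message, here is a minimal sketch that counts matching lines. The function name `count_gpu_hangs` and the default log path `/var/log/syslog` are assumptions for illustration; the search string is the one quoted in the log line above.

```shell
# Count kernel log lines that report an i915 hangcheck timeout.
# The default path is an assumption; pass another log file as the first argument.
count_gpu_hangs() {
    # -c prints the number of matching lines (0 if none),
    # -F treats the pattern as a fixed string rather than a regex.
    grep -cF 'Hangcheck timer elapsed... GPU hung' "${1:-/var/log/syslog}"
}

# Example: count_gpu_hangs /var/log/syslog
```

A non-zero count only says the hangcheck timer fired; the error state is still needed to see why.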
Ok, that's a GPU hang. (Neither your Xorg.0.log nor dmesg contains a reference to the hang and you said error-state was empty, so I presumed we were looking for something else.) Please can you attach a copy of /sys/kernel/debug/dri/0/i915_error_state before you reboot (otherwise we lose the debug info)? Is it consistently dying when you use the wine applications and not at other times?
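One possible way to grab that file is sketched below. It assumes debugfs is mounted at /sys/kernel/debug and that the GPU is DRM device 0, as in the path above; the function name `save_error_state` and the default output file are made up for illustration. Files in debugfs report a size of 0, so the sketch streams the content with `cat` rather than trusting the displayed size.

```shell
# Copy the i915 error state to a regular file before rebooting.
# Assumes debugfs is mounted at /sys/kernel/debug and the GPU is DRM device 0.
save_error_state() {
    src="${1:-/sys/kernel/debug/dri/0/i915_error_state}"
    dest="${2:-$HOME/i915_error_state.txt}"
    # debugfs files report a size of 0, so read the stream with cat
    # instead of relying on the size a file manager shows.
    cat "$src" > "$dest" || return 1
    # Succeed only if something was actually captured.
    [ -s "$dest" ]
}

# Example (reading debugfs needs root, e.g. from a root shell):
# save_error_state
```

If the function returns non-zero, either the read failed or the kernel had no error state recorded.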
No, it's dying pretty much every time I use the CPU extensively. I've only seen artifacts directly prior to the hang in wine, but then again, most of the time I wasn't there when it hung. I'll try to reproduce it tonight and will attach the corresponding file.
Created attachment 59324 [details]
It took a bit longer than expected. Maybe turning semaphores off helps somewhat?
The file itself shows up in mc as an empty 0-byte file, which is why I thought it was empty. However, when I copied it to another folder, I got a 'Can't read memory' error and ended up with the file in the attachment.
I have also managed to take a screenshot this time. Notice that the only window that still has minimize/maximize/close buttons is the Chromium browser; the rest of the windows cannot be moved or affected in any way.
Created attachment 59325 [details]
GPU hang screenshot
That truncated error-state isn't particularly useful. You may need to free up some kernel memory before it can be read; exiting X is usually required in those circumstances.
Created attachment 59407 [details]
So something clobbered the ringbuffer with a wild (tiled) write; a write to address 0 would do the trick:
ringbuffer (render ring) at 0x00001000:
0x00001000: 0xffededed: UNKNOWN
0x00001004: 0xffededed: UNKNOWN
0x00001008: 0xffededed: UNKNOWN
0x0000100c: 0xffededed: UNKNOWN
0x00001010: 0xffededed: UNKNOWN
0x00001014: 0xffededed: UNKNOWN
0x00001018: 0xffededed: UNKNOWN
0x0000101c: 0xffededed: UNKNOWN
0x00001020: 0xffededed: UNKNOWN
0x00001024: 0xffededed: UNKNOWN
0x00001028: 0xffededed: UNKNOWN
0x0000102c: 0xffededed: UNKNOWN
0x00001030: 0xffededed: UNKNOWN
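A rough way to spot this kind of fill pattern in a decoded ring dump is to look for a single dword value that dominates the listing. The sketch below assumes lines of the form "0xADDRESS: 0xVALUE: DESCRIPTION" starting in the first column, as in the excerpt above; the function name `dominant_dword` is hypothetical.

```shell
# Print the most frequent dword in a decoded ring dump as "VALUE COUNT/TOTAL".
# Assumes lines shaped like "0x00001000: 0xffededed: UNKNOWN"; other lines are skipped.
dominant_dword() {
    awk -F': *' '
        # Only count lines whose first two fields are both hex numbers.
        $1 ~ /^0x[0-9a-fA-F]+$/ && $2 ~ /^0x[0-9a-fA-F]+$/ { count[$2]++; total++ }
        END {
            best = ""; max = 0
            for (v in count) if (count[v] > max) { max = count[v]; best = v }
            if (total > 0) printf "%s %d/%d\n", best, max, total
        }
    ' "$1"
}

# Example: dominant_dword ring-excerpt.txt
```

A healthy ring normally holds a mix of commands, so one value filling nearly every slot suggests the buffer was overwritten rather than executed.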
Created attachment 59421 [details] [review]
Detect writes to address 0
If you have the opportunity, can you try applying this patch and attaching the resultant error-state? Hopefully this will catch the culprit red-handed.
I can try, if you would be so kind as to provide step-by-step instructions on how to apply it. I am not a programmer, after all.
First it requires a kernel checkout:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git remote add -f intel git://people.freedesktop.org/~danvet/drm-intel
git merge intel/drm-intel-next-queued
Then apply the patch with
git am -3 < ~/detect-bad-writes.patch
Copy your distro config and build:
cp /boot/config-$(uname -r) .config
make oldconfig
make
Then install with either
sudo make modules_install install
or make-kpkg (depending upon personal preference or distro).
Boot into the new kernel.
I have downloaded the patch, saved it as "detect-bad-writes.patch" and tried to do as you said. I got the following error:
> git am -3 < ~/detect-bad-writes.patch
Applying: agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_gem_gtt.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0001 agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort"
What do I do now? Do I "git add" the file and then run "git am --resolved"?
Created attachment 59443 [details] [review]
Detect writes to address 0
Patch rebased against intel/drm-intel-next-queued
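To retry with a rebased patch after a failed `git am`, one possible sequence is to abandon the stuck session first and then apply the new file. This is only a sketch: the function name `reapply_patch` is made up, and the patch path is whatever name the rebased attachment was saved under.

```shell
# Abandon a half-applied "git am" session, then apply a replacement patch.
# Run from inside the kernel checkout; $1 is the path to the rebased patch file.
reapply_patch() {
    patch_file="$1"
    # "git am --abort" restores the branch to its state before the failed attempt.
    # It errors out if no session is in progress, which is harmless here.
    git am --abort 2>/dev/null || true
    # -3 lets git fall back to a three-way merge if the patch does not apply cleanly.
    git am -3 < "$patch_file"
}

# Example: reapply_patch ~/detect-bad-writes.patch
```

Running `git am --resolved` is only appropriate after conflicts have been fixed by hand, which is not the situation when the patch never applied at all.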
The build fails for me. I get the following error:
ERROR: "__modver_version_show" [drivers/staging/rts5139/rts5139.ko] undefined!
WARNING: modpost: Found 5 section mismatch(es).
I only have a Core i3 processor, so the build takes a long time. Basically, I can only try to build once a day, while I am at work. Could you please rebase your patch against the latest stable kernel, to keep experimenting and unsuccessful tries to a minimum?
If you do that, how will it change the instructions? Would I still need to merge intel/drm-intel-next-queued?
The wild write was tracked down to mesa incorrectly clearing its depth/stencil buffers... Definitely fixed by 8.0.4/9.0.
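To check whether an installed Mesa is at or past the fixed release, a version comparison along these lines may help. It is a sketch that relies on GNU `sort -V`; the function names `version_ge` and `mesa_has_fix` are hypothetical, and the version string has to be supplied by hand (for instance from the "OpenGL version string" line of glxinfo).

```shell
# Succeed if version $1 is greater than or equal to version $2 (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 8.0.4 is the release named above as containing the fix.
mesa_has_fix() {
    version_ge "$1" 8.0.4
}

# Example: mesa_has_fix 8.0.2 || echo "Mesa predates the fix"
```

With the 8.0.2 reported at the top of this bug, the check fails, which matches the version the hang was seen on.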