Summary: | [uxa ilk] GPU hang when running CPU intensive tasks | |
---|---|---|---
Product: | xorg | Reporter: | spb.nevill
Component: | Driver/intel | Assignee: | Chris Wilson <chris>
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team>
Severity: | major | |
Priority: | medium | CC: | hramrach
Version: | unspecified | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Attachments: |
|
Created attachment 59296 [details]
dmesg
Created attachment 59297 [details]
Xorg.0.log
Created attachment 59298 [details]
intel_reg_dumper output
Describe the symptoms please, because we've established that it is not what we call a GPU hang.
I read this line and assumed it was a GPU hang:
Mar 29 03:37:50 Nevill-Acer kernel: [ 7982.136794] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
The interface elements lock up. Most of the windows can no longer be maximized or minimized, or interacted with at all. Buttons on the screen cease to respond. I usually have Psi+ running when it happens, and it becomes the only window I can interact with. I can type, but nothing more. I have to switch to another console to get the logs.
I have also noticed artifacts on the screen a couple of times before the incident, something like what was described here: https://bugs.freedesktop.org/show_bug.cgi?id=36459 There are rectangular blocks of pixels on the screen where the colors are off. I tried to take a screenshot the first time it happened, but Ksnapshot crashed every time I tried. It might be important to add that I was running applications with wine when I saw the artifacts, and the colors were off only in those windows.
Ok, that's a GPU hang. (Neither your Xorg.0.log nor dmesg contains a reference to the hang, and you said the error-state was empty, so I presumed we were looking for something else.)
Please can you attach a copy of /sys/kernel/debug/dri/0/i915_error_state before you reboot (otherwise we lose the debug info)?
Is it consistently dying when you use the wine applications and not at other times?
No, it dies pretty much every time I use the CPU heavily. I've only seen artifacts directly before the hang in wine, but then again, most of the time I wasn't there when it hung. I'll try to reproduce it tonight and will attach the corresponding file.
Created attachment 59324 [details]
i915_error_state
It took a bit longer than expected. Maybe turning semaphores off helps somewhat?
The file itself shows up in mc as an empty 0-byte file, which is why I thought it was empty. However, when I copied it to another folder, I got a 'Can't read memory' error and ended up with the file in the attachment.
I also managed to take a screenshot this time. Notice that the only window that still has minimize/maximize/close buttons is the Chromium browser; the rest of the windows cannot be moved or affected in any way.
Created attachment 59325 [details]
GPU hang screenshot
That truncated error-state isn't particularly useful. You may need to free up some kernel memory before it can be read; exiting X is usually required in those circumstances.
Created attachment 59407 [details]
i915_error_state untruncated
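The earlier confusion came from the error-state file showing up as 0 bytes in a file manager even though it had content. The following is a minimal sketch, not the procedure used in this thread, of capturing such a file by reading it to the end instead of trusting the reported size. The function name `save_error_state` and the default destination are hypothetical; the source path is the one quoted above, and reading it normally requires root.

```shell
#!/bin/sh
# Hypothetical helper: copy a debugfs-style file by reading it to the end,
# rather than relying on the size the filesystem reports for it.
save_error_state() {
    src=${1:-/sys/kernel/debug/dri/0/i915_error_state}   # path quoted in the thread
    dst=${2:-$HOME/i915_error_state.txt}                 # example destination (assumption)
    # cat keeps reading until end-of-file, so a reported size of 0 does not matter.
    # If the read fails part-way (for example, out of kernel memory), report failure.
    cat "$src" > "$dst" || return 1
    # Print how many bytes were actually captured.
    wc -c < "$dst"
}
```

If the function reports a failure or a suspiciously small byte count, the capture is probably truncated, and the advice above about freeing kernel memory (for example by exiting X) applies before retrying.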
So something clobbered the ringbuffer with a wild (tiled) write; a write to address 0 would do the trick:
ringbuffer (render ring) at 0x00001000:
0x00001000: 0xffededed: UNKNOWN
0x00001004: 0xffededed: UNKNOWN
0x00001008: 0xffededed: UNKNOWN
0x0000100c: 0xffededed: UNKNOWN
0x00001010: 0xffededed: UNKNOWN
0x00001014: 0xffededed: UNKNOWN
0x00001018: 0xffededed: UNKNOWN
0x0000101c: 0xffededed: UNKNOWN
0x00001020: 0xffededed: UNKNOWN
0x00001024: 0xffededed: UNKNOWN
0x00001028: 0xffededed: UNKNOWN
0x0000102c: 0xffededed: UNKNOWN
0x00001030: 0xffededed: UNKNOWN
Created attachment 59421 [details] [review]
Detect writes to address 0
If you have the opportunity, can you try applying this patch and attaching the resultant error-state? Hopefully this will catch the culprit red-handed.
I can try, if you would be so kind as to provide step-by-step instructions on how to apply it. I am not a programmer, after all.
First it requires a kernel checkout:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git remote add -f intel git://people.freedesktop.org/~danvet/drm-intel
git merge intel/drm-intel-next-queued
Then apply the patch with:
git am -3 < ~/detect-bad-writes.patch
Copy your distro config and build:
cp /boot/config-`uname -r` .config
make
sudo make modules_install install
or make kpkg (depending upon personal preference or distro).
Boot into the new kernel.
I have downloaded the patch, saved it as "detect-bad-writes.patch" and tried to do as you said. I got the following error:
> git am -3 < ~/detect-bad-writes.patch
Applying: agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_gem_gtt.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0001 agp/intel,drm/i915: Catch address-0 writes using invalid PTEs
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort"
What do I do now? Do I "git add" the file and then run "git am --resolved"?
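The failure above indicates that the patch did not match the checked-out tree and that git lacked the blobs needed for a three-way merge. Staging the file by hand would not supply them. A minimal sketch, assuming a plain git checkout and using the hypothetical helper name `patch_applies`, of testing whether a patch fits the tree before running `git am`:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the patch would apply cleanly to the
# files in the current directory. Nothing is modified by the check.
patch_applies() {
    # $1: path to the patch (a plain diff or a mail-formatted patch)
    git apply --check "$1" 2>/dev/null
}

# Usage sketch (paths are examples taken from the thread):
#   cd linux
#   git am --abort                      # discard the half-applied attempt, if any
#   if patch_applies ~/detect-bad-writes.patch; then
#       git am -3 < ~/detect-bad-writes.patch
#   else
#       echo "patch does not match this tree; a rebased patch is needed"
#   fi
```

When the check fails, the practical options are a patch rebased against the tree in use or a tree that matches the patch.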
Created attachment 59443 [details] [review]
Detect writes to address 0
Patch rebased against intel/drm-intel-next-queued
The build fails for me. I get the following error:
ERROR: "__modver_version_show" [drivers/staging/rts5139/rts5139.ko] undefined!
WARNING: modpost: Found 5 section mismatch(es).
I only have a Core i3 processor, so the build takes a long time. Basically, I can only try one build a day, when I am leaving for work. Could you please rebase your patch against the latest stable kernel, to keep the experimenting and unsuccessful attempts to a minimum? http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.1.tar.bz2 If you do that, how will the instructions change? Would I still need to merge intel/drm-intel-next-queued?
The wild write was tracked down to mesa incorrectly clearing its depth/stencil buffers... Definitely fixed by 8.0.4/9.0. |
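The diagnosis earlier in the thread rested on the render ring holding the same value at every offset. A minimal sketch, assuming dump lines of the form `0x00001000: 0xffededed: UNKNOWN` with lowercase hex, of finding the most frequent dword in a saved error-state; the helper name `dominant_dword` is hypothetical and is not part of any i915 tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the most frequent dword value in an
# error-state style dump, followed by how many times it occurs.
dominant_dword() {
    # $1: path to a saved dump; lines that do not match the assumed
    # "offset: value: decode" layout are ignored.
    awk '$1 ~ /^0x[0-9a-f]+:$/ && $2 ~ /^0x[0-9a-f]+:$/ { count[$2]++ }
         END {
             for (v in count) if (count[v] > best) { best = count[v]; val = v }
             if (best) { sub(/:$/, "", val); print val, best }
         }' "$1"
}
```

A single value accounting for nearly every entry, as with 0xffededed in the dump above, suggests the ring contents were overwritten rather than containing real commands.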
Created attachment 59295 [details]
/var/log/syslog
Bug description:
GPU hangs randomly, most often when the CPU is heavily used. I can reproduce it 100% of the time within about an hour when using the boinc client, which puts a heavy load on the CPU.
System environment:
-- chipset: i915
-- system architecture: 64-bit
-- xf86-video-intel: 2.17.0
-- xorg-server: 1.11.4
-- Mesa: 8.0.2
-- libdrm: 2.4.32
-- kernel: 7.3.2.0-20-generic
-- Kubuntu 12.04
-- Acer TRAVELMATE 5744-383G32Mikk
Additional info:
I have not seen a GPU hang in Windows.
"i915.semaphores=0" or "i915.semaphores=1" in GRUB makes no difference.
syslog, Xorg.0, dmesg and intel_reg_dumper output attached.
i915_error_state shows no errors.