Bug 63221 - [865G] GPU hung - stale data/cacheline in batch
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2013-04-07 03:44 UTC by Götz
Modified: 2017-07-24 22:58 UTC

Attachments
/debug/dri/0/i915_error_state file content (102.71 KB, text/plain)
2013-04-07 03:44 UTC, Götz
Xorg.0.log (33.01 KB, text/plain)
2013-04-07 03:45 UTC, Götz
vbios.dump (48.00 KB, text/plain)
2013-04-07 03:46 UTC, Götz
Opera colors using UXA (293.47 KB, image/png)
2013-04-07 22:32 UTC, Götz
i915_error_state and Xorg.0.log (123.70 KB, application/octet-stream)
2013-04-08 20:53 UTC, Götz

Description Götz 2013-04-07 03:44:28 UTC
Created attachment 77543
/debug/dri/0/i915_error_state file content

Hi, today while browsing I got a GPU hang:

[39670.002714] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[39670.002725] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[39671.989404] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[39671.989579] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[39671.989585] [drm:i915_reset] *ERROR* Failed to reset chip.

System environment:
-- chipset: 865G
-- system architecture: 32-bit
-- xf86-video-intel: 2.21.5
-- xserver: 1.14.0
-- mesa: 9.1.1
-- libdrm: 2.4.43
-- kernel: 3.8.5-1-ARCH
-- Linux distribution: Arch Linux
Comment 1 Götz 2013-04-07 03:45:22 UTC
Created attachment 77544
Xorg.0.log
Comment 2 Götz 2013-04-07 03:46:09 UTC
Created attachment 77545
vbios.dump
Comment 3 Chris Wilson 2013-04-07 07:41:54 UTC
So in the command stream, there is

0x0100040c: HEAD 0x3e800000: UNKNOWN

instead of 0x6b800000. Similar but not close enough to suggest a physical memory error. A single word is incorrect which does not suggest tiling or GPU overwrites. So either that dword was incorrectly generated by UXA or it got left behind in the upload. The placement suggests that it is not a pwrite error, and again a single dword error suggests that it is not stale data trapped in the CPU cache.

As impossible as it seems, that leaves UXA. Try SNA, see what happens.
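
To make the comparison in comment 3 concrete, here is a minimal sketch in C that counts how many bits differ between the dword found in the batch (0x3e800000) and the expected one (0x6b800000). The function name bits_differing is hypothetical, and the idea that a physical memory error would usually show up as a single flipped bit is an assumption added here for illustration, not something stated in the comment.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from any driver or tool): count the bit
 * positions in which two 32-bit command words differ.
 */
static unsigned bits_differing(uint32_t found, uint32_t expected)
{
    uint32_t diff = found ^ expected; /* set bits mark mismatches */
    unsigned count = 0;

    while (diff) {
        diff &= diff - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Calling bits_differing(0x3e800000u, 0x6b800000u) gives 4, because the two values XOR to 0x55000000. Several scattered bit differences confined to one byte fit the description "similar but not close enough" better than a one-bit flip would.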
Comment 4 Götz 2013-04-07 22:32:45 UTC
Created attachment 77566
Opera colors using UXA

When using UXA, the colors and fonts in the Opera browser look very bad.
Comment 5 Chris Wilson 2013-04-08 07:44:49 UTC
Yet another broken swrasterizer; one might sympathize with Opera when they realized they had no idea how to write a good rendering engine.
Comment 6 Götz 2013-04-08 16:56:36 UTC
So, is this a bug in Opera?
Comment 7 Chris Wilson 2013-04-08 17:03:10 UTC
Yes, Opera's rendering presumes an r8g8b8 pixel format, which the ddx does not provide by default. You can override the choice of framebuffer bit depth using either the Xorg -depth 24 command-line option or an xorg.conf.d snippet:
  Section "Screen"
    Identifier "Screen0"
    DefaultDepth 24
  EndSection
Comment 8 Götz 2013-04-08 17:54:23 UTC
I see I made a mistake: the Opera problem occurred when using SNA, not UXA.

Thanks Chris Wilson, I added the "DefaultDepth 24" option, Opera is now ok, and I will continue testing SNA to see if the hang can be reproduced.
Comment 9 Götz 2013-04-08 20:53:06 UTC
Created attachment 77636
i915_error_state and Xorg.0.log

Another GPU hang, now using SNA. This time the error in dmesg was shorter:

[18901.991466] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[18901.991477] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state

The software is the same, except that xf86-video-intel was updated to 2.21.6.
Comment 10 Chris Wilson 2013-04-08 21:09:43 UTC
That has a more obviously corrupt cacheline right at the start of the batch. Do you still have an old kernel to retest?
Comment 11 Götz 2013-04-09 16:53:39 UTC
I'm now testing with Linux 3.7.9. Do you want me to test any specific version?
Comment 12 Chris Wilson 2013-04-09 17:05:32 UTC
(In reply to comment #11)
> I'm now testing with Linux 3.7.9. Do you want me to test any specific
> version?

Might need to try with an even older kernel if available. Basically I'm just trying to see if behaviour has changed and if so what triggered the bug.
Comment 13 Chris Wilson 2013-06-22 15:15:58 UTC
Just checking in to see if this is still a problem, and if we were able to determine if there was a singular regressing commit.
Comment 14 Chris Wilson 2013-07-02 07:56:29 UTC
I now know what caused this regression for you; the fix is:

commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno
    
    In the introduction of the non-blocking wait, I cut'n'pasted the wait
    completion code from normal locked path. Unfortunately, this neglected
    that the normal path returned early if the wait returned early. The
    result is that read-only waits may return whilst the GPU is still
    writing to the bo.
    
    Fixes regression from
    commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Aug 24 09:35:09 2012 +0100
    
        drm/i915: Use a non-blocking wait for set-to-domain ioctl
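
The failure mode described in the commit message can be sketched as a small standalone simulation in C. This is only one reading of the commit text, not the actual patch: every name below (fake_bo, fake_wait, wait_buggy, wait_fixed, read_sees_stale_data, FAKE_EINTR) is hypothetical, and the real driver tracks write domains and seqnos in more detail than a pair of flags.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the i915 driver. */
struct fake_bo {
    bool write_pending;     /* bookkeeping: a GPU write is outstanding */
    bool gpu_still_writing; /* ground truth inside this simulation */
};

#define FAKE_EINTR 4

/* Simulated wait: returns early with an error if interrupted. */
static int fake_wait(struct fake_bo *bo, bool interrupted)
{
    if (interrupted)
        return -FAKE_EINTR;
    bo->gpu_still_writing = false; /* wait completed, GPU is done */
    return 0;
}

/* Buggy pattern: bookkeeping is cleared even when the wait failed. */
static int wait_buggy(struct fake_bo *bo, bool interrupted)
{
    int ret = fake_wait(bo, interrupted);

    bo->write_pending = false;
    return ret;
}

/* Fixed pattern: clear bookkeeping only after a successful wait. */
static int wait_fixed(struct fake_bo *bo, bool interrupted)
{
    int ret = fake_wait(bo, interrupted);

    if (ret)
        return ret;
    bo->write_pending = false;
    return 0;
}

/* A later read-only access waits only if a write is recorded as pending. */
static bool read_sees_stale_data(struct fake_bo *bo)
{
    if (bo->write_pending)
        fake_wait(bo, false);
    return bo->gpu_still_writing;
}
```

In this model, an interrupted wait_buggy call leaves write_pending cleared while the simulated GPU is still writing, so the next read_sees_stale_data call returns true. With wait_fixed the flag survives the failed wait, the later read waits properly, and the function returns false.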

