61477 – [965g] batch corruption, clflush?

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

Reported:	2013-02-26 05:23 UTC by Norman Yarvin
Modified:	2017-07-24 22:58 UTC
CC List:	0 users

Attachments
i915_error_state (822.34 KB, text/plain) 2013-02-26 05:23 UTC, Norman Yarvin
Xorg.0.log (38.62 KB, text/plain) 2013-02-26 05:24 UTC, Norman Yarvin
dmesg (from a normal boot) (65.21 KB, text/plain) 2013-02-26 05:26 UTC, Norman Yarvin
Screenshot showing window border corruption (28.80 KB, image/png) 2013-06-24 15:11 UTC, Norman Yarvin
Another crash (811.39 KB, text/plain) 2013-07-21 04:59 UTC, Norman Yarvin
Another error_state (340.57 KB, application/octet-stream) 2013-12-17 23:36 UTC, Norman Yarvin
Xorg.0.log (20.30 KB, text/plain) 2014-02-13 00:24 UTC, Norman Yarvin
yet another i915_error_state (781.39 KB, text/plain) 2014-04-26 20:19 UTC, Norman Yarvin

Description Norman Yarvin 2013-02-26 05:23:34 UTC

Created attachment 75544
i915_error_state

About once a week or so, randomly, I've been getting an error whose main visible manifestation is that the mouse pointer disappears.  (The mouse motion logic is still active, and I can successfully click on things and change window focus; it's just the pointer that disappears.)  This is on a G965 chip:

00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 514d
        Flags: bus master, fast devsel, latency 0, IRQ 47
        Memory at e0400000 (32-bit, non-prefetchable) [size=1M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 3410 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 2
        Kernel driver in use: i915

and with Linux kernel versions 3.7 and 3.8 (Gentoo), and xf86-video-intel version 2.21.3 (and one or two prior versions, I'm not sure exactly which).  I don't think I had the bug with kernel 3.6; certainly not before that.  When the bug occurs, the system log has the lines:

      Feb 25 22:23:00 muttonhead kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
      Feb 25 22:23:00 muttonhead kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
      Feb 25 22:23:01 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

and on another occasion there was also the line:

      Feb 19 22:05:56 muttonhead kernel: [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

before the "failed to reset" line.  After the bug occurs, I can still do pretty much everything I've tried (if I can manage it without seeing the mouse pointer), but I'm still using fvwm, and things might be different if I were using a compositing window manager.  The one other failure is that shutting down X doesn't restore the console properly: it just gives me a blank screen.  (But not a hang; control-alt-delete still shuts down the machine properly.)
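For anyone wanting to spot recurrences without watching the log by hand, here is a minimal sketch of a filter for the kernel messages quoted above. The regular expression and the function names are assumptions based only on the lines in this report, not on any documented log format.

```python
import re

# Matches lines of the form "[drm:function] *ERROR* message".
# The pattern is inferred from the log excerpts in this report (assumption).
DRM_ERROR = re.compile(r"\[drm:(?P<func>\w+)\] \*ERROR\* (?P<msg>.*)$")


def drm_errors(lines):
    """Return (function, message) pairs for every drm *ERROR* line."""
    found = []
    for line in lines:
        m = DRM_ERROR.search(line)
        if m:
            found.append((m.group("func"), m.group("msg").strip()))
    return found


def looks_like_failed_reset(lines):
    """True if a hangcheck report is later followed by a failed reset."""
    funcs = [func for func, _ in drm_errors(lines)]
    return (
        "i915_hangcheck_hung" in funcs
        and "i915_reset" in funcs
        and funcs.index("i915_hangcheck_hung") < funcs.index("i915_reset")
    )


sample = [
    "Feb 25 22:23:00 muttonhead kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung",
    "Feb 25 22:23:00 muttonhead kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state",
    "Feb 25 22:23:01 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.",
]

for func, msg in drm_errors(sample):
    print(func, "->", msg)
```

The informational "[drm] capturing error event" line is deliberately ignored, since it carries no function name and no *ERROR* marker.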

Comment 1 Norman Yarvin 2013-02-26 05:24:47 UTC

Created attachment 75545
Xorg.0.log

Comment 2 Norman Yarvin 2013-02-26 05:26:28 UTC

Created attachment 75546
dmesg (from a normal boot)

This doesn't contain error messages from the bug; it's just from a normal boot, in case the initialization messages are useful.

Comment 3 Chris Wilson 2013-02-26 08:43:54 UTC

Hmm, batch buffer corruption - looks like failed cacheline flushes.

Can you recompile -intel with USE=debug (or something like that) to enable assertions to double check that it is not doing anything silly?

Comment 4 Norman Yarvin 2013-02-26 18:26:46 UTC

There's no use flag for debug (funny, there used to be), but I turned it on by editing the ebuild (--enable-debug passed to configure).  Let me know if there's any log file I should be watching for failed assertions.

Comment 5 Chris Wilson 2013-02-26 18:39:15 UTC

To catch the assertions you need to watch stderr, which will be logged via your display manager. e.g. for gdm, look in /var/log/gdm/:0.log

Comment 6 Norman Yarvin 2013-02-27 00:45:43 UTC

I just hit the bug again.  I didn't see anything that looked like an assertion triggering, just this (in both stderr and Xorg.0.log):

(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x34) [0x597174]
(EE) 1: /usr/bin/X (mieqEnqueue+0x263) [0x577ad3]
(EE) 2: /usr/bin/X (0x400000+0x4fe84) [0x44fe84]
(EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fc0fdb9b000+0x6208) [0x7fc0fdba1208]
(EE) 4: /usr/bin/X (0x400000+0x7a707) [0x47a707]
(EE) 5: /usr/bin/X (0x400000+0xa57e7) [0x4a57e7]
(EE) 6: /lib64/libpthread.so.0 (0x36aaa00000+0x10b80) [0x36aaa10b80]
(EE) 7: /lib64/libc.so.6 (ioctl+0x7) [0x36aa2e3b07]
(EE) 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x39262040f8]
(EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc0ff8d6000+0x1be30) [0x7fc0ff8f1e30]
(EE) 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc0ff8d6000+0x1d5f7) [0x7fc0ff8f35f7]
(EE) 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc0ff8d6000+0x4ee5d) [0x7fc0ff924e5d]
(EE) 12: /usr/bin/X (BlockHandler+0x44) [0x43f374]
(EE) 13: /usr/bin/X (WaitForSomething+0x11d) [0x59463d]
(EE) 14: /usr/bin/X (0x400000+0x3af32) [0x43af32]
(EE) 15: /usr/bin/X (0x400000+0x29cba) [0x429cba]
(EE) 16: /lib64/libc.so.6 (__libc_start_main+0xed) [0x36aa22491d]
(EE) 17: /usr/bin/X (0x400000+0x2a011) [0x42a011]
(EE) 
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.

repeated several times with "overflow continuing".

Sorry, I neglected to save i915_error_state this time.

Comment 7 Chris Wilson 2013-02-27 18:33:01 UTC

Took another look at the error-state to look for patterns; I still think an invalid cacheline is the best answer so far.

Norman, do you mind going back to 3.6 and confirming that it is stable? A bisect will take months if mtbf remains at a week, so please also keep updating us with what was running on the machine at the time and the i915_error_state, and so try to come up with a test case.

Comment 8 Norman Yarvin 2013-02-27 19:34:25 UTC

I don't mind running 3.6.  I can switch either right away, or after hitting the bug again (to give you another copy of i915_error_state), whichever you'd prefer.  As for what I was running to provoke it, mainly just xterms, fvwm, and Firefox; it seems to happen when I'm doing a lot, and switching between windows.  Which is of course not much of a pattern; coming up with a repeatable test case doesn't seem practical yet.

The main local peculiarity is that I'm running dual monitors on the desktop version of this chipset (via an ADD2 card).

Comment 9 Daniel Vetter 2013-03-03 22:29:46 UTC

The only interesting thing I've found is the pattern of 0x0040a0c0 in the batch. It repeats with a stride divisible by 128 and hits every odd dword. Vertical light blue lines in a y-tiled picture? Unfortunately, there is no active y-tiled bo we could blame this on.

I need to take another look at this tomorrow with more coffee ...
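A rough sketch of how the pattern described in this comment could be checked mechanically, given a batch as a list of 32-bit dwords. The helper names and the synthetic batch are illustrative only (nothing here is read from the attached error state), and the stride is assumed to be measured in bytes.

```python
def find_pattern(dwords, value):
    """Return the dword indices at which `value` occurs."""
    return [i for i, d in enumerate(dwords) if d == value]


def describe(hits):
    """Summarise where the hits fall: dword parity and byte strides."""
    byte_offsets = [4 * i for i in hits]
    strides = sorted({b - a for a, b in zip(byte_offsets, byte_offsets[1:])})
    return {
        "all_odd_dwords": all(i % 2 == 1 for i in hits),
        "strides_bytes": strides,
        "strides_multiple_of_128": all(s % 128 == 0 for s in strides),
    }


# Synthetic batch (assumption): the value planted at every 32nd dword,
# starting from dword 1, i.e. one hit per 128 bytes.
batch = [0] * 256
for i in range(1, 256, 32):
    batch[i] = 0x0040A0C0

hits = find_pattern(batch, 0x0040A0C0)
print(hits)
print(describe(hits))
```

On real data the same two functions would be fed the dwords of the batch buffer section of i915_error_state, once those have been parsed out.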

Comment 10 Norman Yarvin 2013-03-04 01:16:39 UTC

Before any of you spend more effort searching through code for something that might be wrong, let me try to bisect it.  Yes, I think I can -- because there's another anomaly that cropped up around the same time, but which I'd put out of mind, as it is just cosmetic.  That was that window borders get drawn incorrectly sometimes: they are normally green for the focused window and blue for the unfocused ones, but I've been seeing some mixtures of green and blue.  Also sometimes a piece of window border pops up where it doesn't belong at all, like in the middle of an xterm popup menu.  I'm guessing that on rare occasion it gets misdrawn in a really bad place, and overwrites something important, leading to the GPU hangs.  And yes, the blue window border color is 0x40a0c0.  The GPU hangs are too rare to bisect, but the misdrawn borders are repeatable, so I can bisect that, and see where it leads.
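As a quick check of the colour match mentioned above, the dword can be split into its channels. This sketch assumes the usual 32-bit xRGB layout (unused byte, then red, green, blue), which is an assumption about the framebuffer format rather than something stated in this report.

```python
def unpack_xrgb(dword):
    """Split a 32-bit xRGB value into (red, green, blue)."""
    return ((dword >> 16) & 0xFF, (dword >> 8) & 0xFF, dword & 0xFF)


print(unpack_xrgb(0x0040A0C0))  # (64, 160, 192)
```

Blue is the strongest channel and red the weakest, which fits both the "light blue" reading of the batch contents and the unfocused border colour.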

I also seem to remember that for a while, window borders were misdrawn even worse; now the only ones that are misdrawn are quite thin (just a single pixel), but for a while fat ones, too, were being misdrawn.  Not for long, though.  (I forget what got upgraded to half-fix that.)

Comment 11 Norman Yarvin 2013-03-04 01:57:43 UTC

Okay, bisecting the kernel didn't work: I went back all the way to 3.3.0 (long before I saw any of this), and the borders were still being misdrawn.  So even if it is fundamentally a kernel bug, something else changed to trigger it.  Next up: xf86-video-intel.

Comment 12 Norman Yarvin 2013-03-04 02:38:30 UTC

Aha... if I enable the UXA use flag to xf86-video-intel (keeping everything else the same), the window border corruption goes away.

But this is presumably not a desired solution, since previously (when it was getting corruption) it was using SNA, the more recent of the two.

Comment 13 Norman Yarvin 2013-03-04 02:50:46 UTC

And indeed, if (with UXA compiled in) I tell it (via xorg.conf) to use SNA, the window border corruption comes back.  Now I'm not sure if I can bisect this; the change may have been Gentoo switching use flags to give me SNA by default (which they did between 2.19.0 and 2.20.13), rather than being a change in upstream code.

Comment 14 Norman Yarvin 2013-03-04 05:01:57 UTC

Okay, I bisected xf86-video-intel (for the corruption of window borders), with the result:

ab01fd696e1137ddfb9a85ae68c15c05900f0e8e is the first bad commit
commit ab01fd696e1137ddfb9a85ae68c15c05900f0e8e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jan 12 09:17:03 2013 +0000

    sna: Experiment with a CPU mapping for certain fallbacks
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 426b226c68c934f111df2cdba328a05d14ead501 5290f258c0ef8b8649f18a9eebce4ba7a6d76fb7 Msrc

Comment 15 Chris Wilson 2013-03-04 09:44:04 UTC

(In reply to comment #9)
> The only interesting thing I've find is the pattern of 0x0040a0c0 in the
> batch. It repeats with a stride divisible by 128 and hits every odd dword.
> Vertical light blue lines in a y-tiled picture? Unfortunately no active
> y-tiled bo we could blame this on.

Yup, but the repeating data is not pixel data; it looks more like the contents of a stale batch. And I can easily convince myself that each time it is just one cacheline's worth of bad data.

(In reply to comment #14)
> Okay, I bisected xf86-video-intel (for the corruption of window borders),
> with the result:
> 
> ab01fd696e1137ddfb9a85ae68c15c05900f0e8e is the first bad commit
> commit ab01fd696e1137ddfb9a85ae68c15c05900f0e8e
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sat Jan 12 09:17:03 2013 +0000
> 
>     sna: Experiment with a CPU mapping for certain fallbacks

That ties right in with the theory that coherency on 965g is hosed in certain situations - as I already know snooping is broken, and this could just be a very similar effect.

Comment 16 Chris Wilson 2013-03-04 09:45:04 UTC

Or another theory I've been entertaining is that the read-read mappings are broken.

Comment 17 Chris Wilson 2013-03-05 11:25:21 UTC

I presume you also checked with master in case it was fixed?

Comment 18 Norman Yarvin 2013-03-05 16:30:06 UTC

Yes, I'm running master now (just pulled again), and it's still buggy.

Comment 19 Chris Wilson 2013-03-07 00:20:53 UTC

So I've pushed a patch to disable the read-read optimisations, which should help the corruption you've seen. I still think the batch corruption is another issue peculiar to 965g.

Comment 20 Norman Yarvin 2013-03-07 03:17:29 UTC

Yes, the window border corruption is fixed now.

Comment 21 Norman Yarvin 2013-05-07 07:35:41 UTC

Well, it's been a couple of months now, with no more crashes, so it seems reasonably safe to say that this one has been fixed.

Comment 22 Norman Yarvin 2013-06-24 03:56:56 UTC

And now it's back, after I installed a new version of the driver.  (Well, the window border corruption is back; I haven't had any crashes, but didn't test for long.)  Git bisect tells me that the first bad commit is, unsurprisingly:


commit 8e42637050275945200797538a34c13c90b295cc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 21 11:13:03 2013 +0100

    sna: Re-enable read-read optimisations
    
    Coacher is optimistic that the issue is no longer reproducible on his
    machine - and whilst I do not understand the root cause, I am confident
    that the kernel code is correct as is our use.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=61628

Comment 23 Chris Wilson 2013-06-24 08:09:18 UTC

You haven't mentioned whether you tested any of the later fixes.

Comment 24 Norman Yarvin 2013-06-24 13:39:32 UTC

I tested 2.21.8, 2.21.9, and current git (as of yesterday evening), and all of them gave the border corruption.  Reverting that one patch (on top of current git) eliminates it again.

Comment 25 Chris Wilson 2013-06-24 13:49:52 UTC

Have you seen any more hangs/error-states?

Comment 26 Norman Yarvin 2013-06-24 14:10:48 UTC

No, but I didn't test for long: I ran 2.21.9 for about a day, and the other versions only briefly, and the hangs I'd seen were always only occasional.

Comment 27 Chris Wilson 2013-06-24 14:29:14 UTC

Can you please refresh my memory on what the corruption looks like with a screenshot or photograph? It would also be interesting to note if the corruption is not apparent in a screenshot.

To undo the optimisation again, apply:

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index a3e4ed4..4b091f3 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -58,7 +58,7 @@
 #define FORCE_INPLACE 0
 #define FORCE_FALLBACK 0
 #define FORCE_FLUSH 0
-#define FORCE_FULL_SYNC 0 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
+#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
 
 #define DEFAULT_TILING I915_TILING_X

Comment 28 Norman Yarvin 2013-06-24 15:11:50 UTC

Created attachment 81346
Screenshot showing window border corruption

Here's a screenshot showing the border corruption.  It's visible on the left edge of the largest of the three xterm windows.  That window has only a single-pixel border, which changes from blue when unfocused to green when focused.  Here it's part blue and part green.  The pattern of blue and green is different each time, and sometimes it is entirely normal.  On rare occasions I've also seen corruption in the other xterms (with larger borders), but somehow that happens much less often; I haven't seen it recently.

Comment 29 Chris Wilson 2013-06-28 16:53:53 UTC

Found an issue with the read-read optimisation in the kernel:

commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno
    
    In the introduction of the non-blocking wait, I cut'n'pasted the wait
    completion code from normal locked path. Unfortunately, this neglected
    that the normal path returned early if the wait returned early. The
    result is that read-only waits may return whilst the GPU is still
    writing to the bo.
    
    Fixes regression from
    commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Aug 24 09:35:09 2012 +0100
    
        drm/i915: Use a non-blocking wait for set-to-domain ioctl
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=66163
    Cc: stable@vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I'm confident, so I'm going to hope this is the right fix. Please reopen if the issue remains.
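To make the failure mode in that commit message concrete, here is a toy model of it. It is not the kernel code: the class, the function names, and the error value are all hypothetical, and the only thing taken from the commit message is the idea that the write-domain bookkeeping must be left alone when the wait does not complete.

```python
EINTR = 4  # stand-in error code for a wait that was cut short (illustrative)


class Bo:
    """Toy buffer object: records whether the GPU may still be writing."""

    def __init__(self):
        self.gpu_write_pending = True


def wait_seqno(interrupted):
    """Pretend wait: 0 on success, a negative error if it returned early."""
    return -EINTR if interrupted else 0


def wait_buggy(bo, interrupted):
    ret = wait_seqno(interrupted)
    bo.gpu_write_pending = False  # cleared even though the wait failed
    return ret


def wait_fixed(bo, interrupted):
    ret = wait_seqno(interrupted)
    if ret:
        return ret  # early return: bookkeeping left untouched
    bo.gpu_write_pending = False
    return 0


a, b = Bo(), Bo()
print(wait_buggy(a, interrupted=True), a.gpu_write_pending)
print(wait_fixed(b, interrupted=True), b.gpu_write_pending)
```

In the buggy variant a later read-only access would see no pending GPU write and proceed, which is the "read-only waits may return whilst the GPU is still writing" case.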

Comment 30 Norman Yarvin 2013-06-30 23:57:16 UTC

I tested that patch against kernel 3.9.7, with the latest git version of xf86-video-intel (as of about 45 minutes ago), and I still get the border corruption.

Comment 31 Chris Wilson 2013-07-01 07:31:06 UTC

Ho hum. Does #define FORCE_FULL_SYNC 1 still make a difference?

Comment 32 Norman Yarvin 2013-07-01 13:35:50 UTC

Yes, that still cleans it up.

Comment 33 Norman Yarvin 2013-07-05 16:57:34 UTC

Looking again at that kernel patch, it seems to do nothing at all, since 'ret' was tested to be zero just a few lines above:

        if (ret)
                return ret;

Comment 34 Chris Wilson 2013-07-05 17:09:32 UTC

The kernel you are looking at already has the bug fix (or you are not looking in the right function).

Comment 35 Norman Yarvin 2013-07-05 19:12:50 UTC

Ah, okay; yes, I'd applied the patch to i915_gem_object_wait_rendering, not to i915_gem_object_wait_rendering__nonblocking where it was supposed to be.  (The lines of context in the diff were identical, and I didn't know where to get this via git or as a patch file, so I hand-edited it in... to the wrong place.)  Alas, applying it to the right place doesn't help either (in kernel 3.10.0, and with latest git master xf86-video-intel).

In any case, as regards this being a regression from 3.7, I tested all the way back to 3.3 (see above), and still had the window border corruption.

Comment 36 Chris Wilson 2013-07-05 19:16:35 UTC

There is a good chance that the render corruption is bug 55500. If we don't see any more hangs, then I suspect it will only be resolved once I've beaten gen4 into submission. Or vice versa.

Comment 37 Norman Yarvin 2013-07-21 04:59:45 UTC

Created attachment 82763
Another crash

Okay, after running since my last comment, using the versions of the kernel and xf86-video-intel mentioned there (not updated since), I got another crash, with the same sorts of symptoms as before, resulting in the attached i915_error_state, and, in the system logs, the following messages:

Jul 21 00:42:04 muttonhead kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 21 00:42:04 muttonhead kernel: [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
Jul 21 00:42:05 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

Comment 38 Jani Nikula 2013-12-17 12:03:21 UTC

Timeout. Chris, any news wrt comment #36? Norman, still seeing issues with later releases?

Comment 39 Norman Yarvin 2013-12-17 23:18:15 UTC

For the past couple of months, I've been running versions patched to have FORCE_FULL_SYNC=1, without problems.  Before I started using that patch I had three or four more crashes.  I have three more i915_error_state files from those, and can put them up; I didn't do so because I sort of figured you guys have better things to do than to try to get six-year-old hardware working at maximum speed.  (At least I presume that's the downside of FORCE_FULL_SYNC, though I've never perceived any speed difference myself.)

I just checked the current git head, and it gives the window border corruption.

Comment 40 Norman Yarvin 2013-12-17 23:36:53 UTC

Created attachment 90906
Another error_state

Well, that was quick.  Right after I posted that (and still running the current git head), I got another crash:

Dec 17 18:18:55 muttonhead kernel: [drm] stuck on render ring
Dec 17 18:18:55 muttonhead kernel: [drm] capturing error event; look for more information in /sys/class/drm/card0/error
Dec 17 18:18:55 muttonhead kernel: [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x6d0b000 ctx 0) at 0x6d0bcd8
Dec 17 18:18:56 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.
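The "hung inside bo" line gives two addresses, and their difference says how far into the buffer the ring had got. A small sketch of that arithmetic follows; it assumes the first number is the start of the bo and the second the address at which the ring stopped, which is a reading of the message rather than something confirmed in this thread, and the regular expression is inferred from this one line.

```python
import re

# Pattern inferred from the log line quoted above (assumption).
HUNG = re.compile(r"hung inside bo \((0x[0-9a-f]+) ctx \d+\) at (0x[0-9a-f]+)")


def hang_offset(line):
    """Return (byte_offset, dword_index) of the hang within the bo, or None."""
    m = HUNG.search(line)
    if not m:
        return None
    base, addr = int(m.group(1), 16), int(m.group(2), 16)
    offset = addr - base
    return offset, offset // 4


line = (
    "Dec 17 18:18:55 muttonhead kernel: [drm:i915_set_reset_status] "
    "*ERROR* render ring hung inside bo (0x6d0b000 ctx 0) at 0x6d0bcd8"
)
print(hang_offset(line))  # (3288, 822)
```

The dword index is the position to look at first in the batch dump of the matching error state.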

Comment 41 Norman Yarvin 2013-12-17 23:55:49 UTC

Well, with that git version (9289e2c56b7f0cc78c5123691ad96611f0e04bed), FORCE_FULL_SYNC didn't help; even enabling it, I got another crash with the same symptoms -- again, right after posting the previous message here.

Comment 42 Chris Wilson 2014-01-23 11:26:21 UTC

That last one is likely to be 

commit 9d8473c5d9489db439aca73f470bda29a22ebab6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 7 13:43:35 2014 +0000

    sna/gen4: Check for available batch space before restoring state after CA pass
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73348
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I don't think there are any unique aspects left to this bug now.

Comment 43 Norman Yarvin 2014-01-23 16:58:03 UTC

Well, whatever that last crash I reported was, it's now fixed in the current git head. (But not before making it out into 2.99.907, which is why I tried the current git head.) I can be pretty confident about this, since those crashes happened so fast, whereas the current version has been stable for a couple of days. But I still am getting the window border corruption when I turn off FORCE_FULL_SYNC. As for whether this might be a manifestation of bug 55500, I doubt it. I wasn't getting any of the character corruption reported in that bug, back when it was first reported (nor even later when I first reported this bug); then sometime starting a few months ago I started getting a bit of it; then the git head as of a couple of days ago increased it to where it became bothersome; and now the current git head seems to have fixed it completely. And I was getting the character corruption even with FORCE_FULL_SYNC, whereas the window border corruption has always disappeared with it.

In any case, this seems to me to be three different bugs: the 55500 one, the fast crash one, and this one, the one I've been seeing all along. Of course, given that I'm the only one reporting this one, one might wonder whether there's just something wrong with my hardware: some circuit that just didn't get built right, and results in occasional glitches when stressed by the new SNA code. My guess would be that it's my configuration that is odd, not my hardware, but either way, this is about the hairiest sort of debugging there is, and if it is to benefit only one person it can't be worth doing. So I'm not going to re-open it...

Comment 44 Chris Wilson 2014-01-23 17:39:49 UTC

Just found another issue that could be relevant here as well:

commit e916c922ce3913712cd8a9b76ab037840b7f07f1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 23 17:30:29 2014 +0000

    sna: Avoid erroneous discarding operations for partial composites
    
    Composite operations were presumed to cover their entire width x height
    area. However, a few paths submit boxes that do not cover the clip
    region and so the optimisation made during prepare to discard completely
    overwritten data is incorrect (and leads to corruption - stale data is
    seen which the client expected to have been overdrawn). So along these
    more unusual paths, we must add a flag to prevent the overzealous
    discard. Notably, xfce4 triggers this as it uses a lot of unantialiased
    trapezoids in its theme drawing.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=69528
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 45 Norman Yarvin 2014-01-24 03:01:22 UTC

I just pulled to try that fix, and unfortunately no, I still get the border corruption.  (I checked the git log first, to make sure that that fix was included.)  I'll keep running it to see whether I get a crash.

Comment 46 Chris Wilson 2014-02-10 12:09:30 UTC

Sigh. Sorry about this shotgun style of debugging, but there was another potentially relevant logic flip causing damage to be wrongfully discarded fixed in 2.99.910. It is worth retesting.

Comment 47 Norman Yarvin 2014-02-11 19:46:09 UTC

I pulled, and no joy; I'm still getting window border corruption.

Comment 48 Chris Wilson 2014-02-12 23:23:48 UTC

Hmm, then I am still stumped. There's a couple of things that are unique to 965gm, so it could just be hitting code paths no one else does.

Can you please compile xf86-video-intel.git with ./configure --enable-debug and attach your Xorg.0.log? (The log is just to check that assertions are enabled and the software versions are up to date.) --enable-debug will enable assertions, which, I hope, will lead to X crashing when it detects the error. The error will be on stderr (recorded in something like /var/log/gdm/:0.log), and if you can capture the backtrace using gdb, that would be invaluable.

Comment 49 Norman Yarvin 2014-02-13 00:24:56 UTC

Created attachment 93982
Xorg.0.log

Here's the Xorg.0.log file.  More data if/when I get a crash.

Comment 50 Norman Yarvin 2014-02-13 00:30:50 UTC

By the way, this is the desktop version of the 965 chipset, not the mobile version.

Comment 51 Norman Yarvin 2014-04-26 20:18:45 UTC

Well, it took over two months for it to crash this time, but it did.  From the system log:

Apr 26 15:36:21 muttonhead kernel: [drm] stuck on render ring
Apr 26 15:36:21 muttonhead kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Apr 26 15:36:21 muttonhead kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Apr 26 15:36:21 muttonhead kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Apr 26 15:36:21 muttonhead kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Apr 26 15:36:21 muttonhead kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Apr 26 15:36:21 muttonhead kernel: [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x2380000 ctx 0) at 0x2380310
Apr 26 15:36:21 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

From Xorg.0.log (I can post the whole thing, but these are what look like the interesting bits):

[    20.562] (II) intel(0): SNA compiled from 2.99.910-49-g0b92b12
[    20.562] (II) intel(0): SNA compiled with assertions enabled
[    20.565] (--) intel(0): Integrated Graphics Chipset: Intel(R) 965G

....

[2205130.336] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[2205130.336] batch[1/0]: 8170 8170 65344, nreloc=16, nexec=12, nfence=0, aperture=3067, fenced=0, high=98304,131072: errno=22
[2205130.336] exec[0] = handle:216, presumed offset: 6189000, size: 7987200, tiling 1, fenced 0, snooped 0, deleted 0
[2205130.336] exec[1] = handle:23, presumed offset: 2203000, size: 90112, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[2] = handle:154, presumed offset: 81fe000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[3] = handle:109, presumed offset: af53000, size: 4, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[4] = handle:25, presumed offset: 1dfb000, size: 4194304, tiling 1, fenced 0, snooped 0, deleted 0
[2205130.336] exec[5] = handle:44, presumed offset: 83b8000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[6] = handle:90, presumed offset: 8a0b000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[7] = handle:45, presumed offset: e905000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[8] = handle:94, presumed offset: 8a0d000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.336] exec[9] = handle:210, presumed offset: 8225000, size: 4096, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.337] exec[10] = handle:53, presumed offset: 8470000, size: 262144, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.337] exec[11] = handle:73, presumed offset: 0, size: 36864, tiling 0, fenced 0, snooped 0, deleted 0
[2205130.337] reloc[0] = pos:16, target:0, delta:0, read:2, write:2, offset:6189000
[2205130.337] reloc[1] = pos:40, target:0, delta:0, read:2, write:2, offset:6189000
[2205130.337] reloc[2] = pos:64, target:0, delta:0, read:2, write:2, offset:6189000
[2205130.337] reloc[3] = pos:140, target:1, delta:1, read:10, write:0, offset:2203000
[2205130.337] reloc[4] = pos:144, target:11, delta:-225279, read:10, write:0, offset:0
[2205130.337] reloc[5] = pos:36772, target:0, delta:0, read:2, write:2, offset:6189000
[2205130.337] reloc[6] = pos:36740, target:2, delta:0, read:4, write:0, offset:81fe000
[2205130.337] reloc[7] = pos:36676, target:3, delta:80, read:4, write:0, offset:af53000
[2205130.337] reloc[8] = pos:36644, target:4, delta:0, read:4, write:0, offset:1dfb000
[2205130.337] reloc[9] = pos:36580, target:5, delta:0, read:4, write:0, offset:83b8000
[2205130.337] reloc[10] = pos:36420, target:6, delta:0, read:4, write:0, offset:8a0b000
[2205130.337] reloc[11] = pos:36324, target:7, delta:0, read:4, write:0, offset:e905000
[2205130.337] reloc[12] = pos:36228, target:8, delta:0, read:4, write:0, offset:8a0d000
[2205130.337] reloc[13] = pos:36132, target:9, delta:0, read:4, write:0, offset:8225000
[2205130.337] reloc[14] = pos:288, target:10, delta:0, read:20, write:0, offset:8470000
[2205130.337] reloc[15] = pos:416, target:10, delta:0, read:20, write:0, offset:8470000
[2205130.337] Aperture size 536870912, available 513679360

(and there the file ends)

That second excerpt was also printed on stdout or stderr, but other than that, there were no assertions triggered.  I'll attach the i915_error_state file in the next comment.
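The reloc[] lines in that excerpt are easier to scan once parsed. The sketch below is only a convenience for reading the dump, not a diagnosis: the field layout is inferred from the lines above, the offset is assumed to be printed in hex without a prefix, and the read/write fields are kept as raw strings because their base is not stated.

```python
import re

# Field layout inferred from the Xorg.0.log excerpt above (assumption).
RELOC = re.compile(
    r"reloc\[(?P<idx>\d+)\] = pos:(?P<pos>\d+), target:(?P<target>\d+), "
    r"delta:(?P<delta>-?\d+), read:(?P<read>\w+), write:(?P<write>\w+), "
    r"offset:(?P<offset>[0-9a-f]+)"
)


def parse_relocs(lines):
    """Parse reloc[] lines into dictionaries; other lines are skipped."""
    relocs = []
    for line in lines:
        m = RELOC.search(line)
        if m:
            relocs.append({
                "idx": int(m.group("idx")),
                "pos": int(m.group("pos")),
                "target": int(m.group("target")),
                "delta": int(m.group("delta")),
                "read": m.group("read"),    # base unknown, kept as text
                "write": m.group("write"),  # base unknown, kept as text
                "offset": int(m.group("offset"), 16),  # assumed hex
            })
    return relocs


def out_of_range_targets(relocs, nexec):
    """Indices of relocs whose target does not name one of the exec entries."""
    return [r["idx"] for r in relocs if r["target"] >= nexec]


sample = [
    "[2205130.337] reloc[4] = pos:144, target:11, delta:-225279, read:10, write:0, offset:0",
    "[2205130.337] reloc[7] = pos:36676, target:3, delta:80, read:4, write:0, offset:af53000",
    "[2205130.337] Aperture size 536870912, available 513679360",
]
relocs = parse_relocs(sample)
print(relocs)
print(out_of_range_targets(relocs, nexec=12))
```

With the full dump, nexec would be taken from the batch line (nexec=12 here), so every target from 0 to 11 is in range.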

Comment 52 Norman Yarvin 2014-04-26 20:19:56 UTC

Created attachment 98057
yet another i915_error_state

Comment 53 Norman Yarvin 2014-12-22 01:02:32 UTC

I haven't seen any crashes in quite a while now, so it looks like the bug is fixed.  Since I've been the only person posting complaints, I'll go ahead and change the status.
