Bug 27187

Summary: [855GM] gtt chipset flush is not cache coherent
Product: xorg Reporter: Daniel Vetter <daniel>
Component: Driver/intel Assignee: Daniel Vetter <daniel>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: 2points, andrej, arthapex, axet, beier, bonbons, brian, computer-lazarett, fdesk, gomyhr, hege, indan, legolas558, masse, mnowak, moikkis, mrmazda, mtomich, nobled, nomnex, olafur.g, pva, rainy6144, remi, renegabriels, sh29112911, smf.linux, springtidenz, stefan, stenten, thorsten, va
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
cleanup up version of my gtt cache coherency patch
none
new gtt cache coherency patch
none
i855/Acer TM66x: Test results with patch in attachment #34194 from bug #26345
none
BUG in i915_gem.c - obj_priv->pages_refcount is zero in i915_gem_object_put_pages()
none
new patch with totally reworked chipset flush
none
BUG again, with flush retries before
none
dmesg with gtt flush v4 patch
none
error state after freeze
none
new patch, improved retry flushing
none
dmesg with gtt flush v5 patch
none
error state after freeze, v5
none
Clip solid fills
none
dmesg with gtt flush v5 patch
none
gtt chipset flush v6
none
dmesg of 2 failures with v6 patch
none
debugfs dumps after 2 failures with v6 patch
none
dmesg of early flush failure with 2000 retries
none
dmesg of early failure with 1000 retries, 128 whack pages
none
dmesg with gtt flush v6 patch
none
dmesg of multiple flush failure with v6 patch
none
memory check patch for legolas
none
dmesg of failure with v6 patch + gtt/cpu readback info
none
gtt chipset flush v7
none
NEC P520: dmesg output
none
dmesg output (HP Pavilion dv1000)
none
Kernel logs with 3 BUGs in i915_gem.c:1456 / i915_gem_object_put_pages()
none
fix locking around chipset flushing
none
i915_gem_object_put_pages crash happening with libdrm 2.4.18
none
Everything.log for i845 freezes
none
dmesg, i915_error_state and intel_gpu_dump
none
All BUGs I've seen in i915_gem.c since February
none
dri debugfs snapshots taken every second + generator script used
none
excerpt from dmesg containing crash dump for i915_gem_object_put_pages+0x10b/0x110
none
dmesg output after suspend to disk
none
dmesg v7 + locking patch, lockdep enabled
none
only call put_pages when gtt_space != NULL
none
dmesg output snippet with i915_gem_tiling.c warning
none
new patch against current drm-intel-next
none
dmesg of a failed chipset flush warning
none
failed flush with
none
dmesg of deadlock with v8 patch
none
DRI debugfs after overlay crash
none
gtt flush failures (not fatal)
none
GPU hang with newest driver and libdrm. v8 without extra flushing patch.
none
failure after starting xfce4-panel
none
v8 patch rebased against latest drm-intel (anholt repository)
none
v9 against latest drm-intel-next
none
dmesg + i915_error_state
none
2 flush failures with latest version of patch
none
dmesg debufs and X server output
none
Test Program
none
v9 against latest git linus tree (manually fixed)
none
Failed chipset flush backtrace
none
v9.1: v9 patch updated for 2.6.36-rc3
none
v9.3: v9 patch ported to kernel 2.6.36
none
Output from dmesg after a X.org crash on ArchLinux stock kernel
none
Try a magic GWB bit in the HIC register
none
Error state after x11perf test case
none
Poke the GWB bit.
none
Poke the GWB bit.
none
Poke HIC bit + wbinv + cache coherency checker
none
kern.log from GPU freeze with latest patch
none
Xorg.log after freeze
none
freeze log dmesg
none
FF3.6 persona background thrashed (only top!)
none
i915 gpu freeze report
none
Flush failures with Chris's patch
none
Font thrashing on vanilla 2.6.38-rc6
none
font garbled upon scrolling
none

Description Daniel Vetter 2010-03-18 15:58:14 UTC
This is just a bug report for myself, to keep track of what I'm doing.
Comment 1 Daniel Vetter 2010-03-18 16:02:50 UTC
This is just a bug report so that I can keep track of my own trials at fixing this issue. I'll upload a clean-up version of my patch soon
Comment 2 Daniel Vetter 2010-03-18 16:13:40 UTC
Created attachment 34223 [details] [review]
cleanup up version of my gtt cache coherency patch
Comment 3 Daniel Vetter 2010-03-18 16:53:15 UTC
Forget about the patch for the moment; I've just noticed that it doesn't work on my i855GM either.
Comment 5 Daniel Vetter 2010-03-19 10:29:45 UTC
> --- Comment #4 from legolas558 <legolas558@email.it>  2010-03-19 05:01:23 PST ---
> I think that your patch lets other bugs come out and become apparent, like the
> hangcheck timer bug.

Thanks for the list (especially the downstream bugs). btw the hangcheck
timer is the same bug, this is the kernel code noticing that the gpu just
died. Of course, if the dying gpu takes along the complete system, you're
not going to see this.
Comment 6 Daniel Vetter 2010-03-19 10:37:47 UTC
Created attachment 34247 [details] [review]
new gtt cache coherency patch

New version of my patch. Changes:
- improved cache coherency checker. Instead of always using the same address, it now uses a constantly changing one. This should increase the chance of catching an inconsistency. But still, this won't catch everything and there's still the chance that the gpu reads crap and dies before we detect that the chipset flush doesn't work.

- increased the size of the magic gtt write. I don't like this, but it seems to help. Given that testing of the last patch showed that I've only implemented a fancy delay, I've tried different techniques. Unfortunately none worked.
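
The rotating-address idea in the first item can be illustrated with a minimal sketch. This is not the code from the patch: the struct, its fields and `canary_check` are hypothetical names, and the two "views" are ordinary pointers standing in for the CPU and GTT mappings of one canary page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker state; none of these names come from the patch. */
struct canary_checker {
    volatile uint32_t *cpu_view;  /* CPU-side mapping of the canary page */
    volatile uint32_t *gtt_view;  /* GTT-side mapping of the same page */
    size_t slots;                 /* number of 32-bit slots in the page */
    size_t next_slot;             /* slot used by the next check */
    uint32_t seq;                 /* last value written */
};

/*
 * Write a fresh value through the GTT view, read it back through the CPU
 * view, then advance to the next slot so that consecutive checks do not
 * hit the same address. Returns 1 if both views agree, 0 otherwise.
 */
static int canary_check(struct canary_checker *c)
{
    size_t slot = c->next_slot;
    uint32_t expected = ++c->seq;
    uint32_t seen;

    c->gtt_view[slot] = expected;
    /* A real implementation would perform the chipset flush here. */
    seen = c->cpu_view[slot];

    c->next_slot = (slot + 1) % c->slots;
    return seen == expected;
}
```

When both views point at the same plain buffer, as in a user-space test, the check always passes; a mismatch can only show up on hardware where the two mappings are not coherent.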

Testing feedback highly welcome. If you report results, please add some details about your machine: processor (model and frequency) and RAM (type, speed and size). Perhaps there's a pattern there.

Oh, and legolas created a nice small script that calculates the cache failure rate of the running kernel:

https://bugs.freedesktop.org/attachment.cgi?id=34240
Comment 7 Bruno 2010-03-20 10:57:18 UTC
Created attachment 34262 [details]
i855/Acer TM66x: Test results with patch in attachment #34194 [details] [review] from bug #26345
Comment 8 Daniel Vetter 2010-03-20 15:23:31 UTC
> --- Comment #7 from Bruno <bonbons67@internet.lu>  2010-03-20 10:57:18 PST ---
> Created an attachment (id=34262)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=34262)
> i855/Acer TM66x: Test results with patch in attachment #34194 [details] [review] from bug #26345

Thanks for testing. Unfortunately the patch you tested is outdated and
known not to work. At least the failure rates you're getting on your
Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
please retest with my latest patch (attached to this bug)?
Comment 9 Bruno 2010-03-20 16:20:41 UTC
Created attachment 34270 [details]
BUG in i915_gem.c - obj_priv->pages_refcount is zero in i915_gem_object_put_pages()

> Thanks for testing. Unfortunately the patch you tested is outdated and
> known not to work. At least the failure rates you're getting on your
> Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> please retest with my latest patch (attached to this bug)?

It seems to be very much a matter of mood, on friday evening I saw no single bad message while today there were a lot.

Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for sd.
Both with previous patch and current one I'm hitting a BUG() in i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No idea if it's related or not but it kills i915 driver!

Attached is kernel log of system including BUG(), but no bad chipset flush.
Comment 10 Daniel Vetter 2010-03-21 02:06:17 UTC
> --- Comment #9 from Bruno <bonbons67@internet.lu>  2010-03-20 16:20:41 PST ---
> Created an attachment (id=34270)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=34270)
> BUG in i915_gem.c - obj_priv->pages_refcount is zero in
> i915_gem_object_put_pages()
> 
> > Thanks for testing. Unfortunately the patch you tested is outdated and
> > known not to work. At least the failure rates you're getting on your
> > Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> > please retest with my latest patch (attached to this bug)?
> 
> It seems to be very much a matter of mood, on friday evening I saw no single
> bad message while today there were a lot.

Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
here, not the "chipset flush failed" warning? And this is still on the old
version of the patch, right?

> Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> sd.
> Both with previous patch and current one I'm hitting a BUG() in
> i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No idea
> if it's related or not but it kills i915 driver!

Known issue, I get these here from time to time, too. Looks like my
unmap-inactive hack, intended to stress gtt cache flushing is uncovering
another problem somewhere else. I'll look into this more seriously now.
Comment 11 Bruno 2010-03-21 03:10:07 UTC
> > It seems to be very much a matter of mood, on friday evening I saw no single
> > bad message while today there were a lot.
> 
> Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
> here, not the "chipset flush failed" warning? And this is still on the old
> version of the patch, right?

The matter of mood was regarding the chipset flush failed warnings (as for GPU wedges before testing with your patch). Some day everything is running smooth, some days I have to reboot every hour (no noticeable usage difference on my side).

> > Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> > sd.
> > Both with previous patch and current one I'm hitting a BUG() in
> > i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No
> > idea if it's related or not but it kills i915 driver!
> 
> Known issue, I get these here from time to time, too. Looks like my
> unmap-inactive hack, intended to stress gtt cache flushing is uncovering
> another problem somewhere else. I'll look into this more seriously now.

I got it with both versions of your patch, let's see what today will bring, flush warnings or BUGs (or both).
Comment 12 Daniel Vetter 2010-03-23 03:02:41 UTC
Created attachment 34353 [details] [review]
new patch with totally reworked chipset flush

I've taken a few of my previous, non-working ideas and combined them into this new patch. Hopefully this adapts better to different cpu/chipset combinations (i.e. there's a better chance that it works on more than just my machine). The new chipset flush is a magic dance involving a flock of canaries ;)

The magic values (look at include/drm/intel-gtt.h, but not at the comments, they're stale) probably need some tuning. But this new chipset flush is adaptive (fyi it reports the maximum number of retries as part of the regular chipset flush number reporting), so I hope it works out of the box.

Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason I can't reproduce it here anymore and reviewing the code hasn't revealed anything yet.

As usual, testing reports highly welcome.
Comment 13 Bruno 2010-03-23 05:22:10 UTC
Created attachment 34362 [details]
BUG again, with flush retries before

Daniel, here is the result of one run with your latest patch (attachment #34353 [details] [review]) at the time it BUGed but also having a few flush retries listed before.
I had the previous version running yesterday without any visible issue up to 2M flushes (but I might not have run the right actions)

It happened while surfing with firefox, some site doing fancy things with javascript and transparency for fade-in/out of images and the like.
Comment 14 Bruno 2010-03-23 06:50:05 UTC
(In reply to comment #12)
> Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason
> I can't reproduce it here anymore and reviewing the code hasn't revealed
> anything yet.

Seems I can reproduce it damn easily...

In Firefox on page http://habiter.luxweb.com/NR58686_Mersch_Vente_Maison.html
it's sufficient to click on the image thumbnails to hit the BUG_ON (running Firefox 3.6-r2 + Flash plugin 10.0.45.2 (Gentoo)).
Hovering over the thumbnail images also seems to help in getting flush retries.
Comment 15 Daniel Vetter 2010-03-23 09:53:55 UTC
> --- Comment #13 from Bruno <bonbons67@internet.lu>  2010-03-23 05:22:10 PST ---
> Created an attachment (id=34362)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=34362)
> BUG again, with flush retries before
> 
> Daniel, here is the result of one run with your latest patch (attachment
> #34353) at the time it BUGed but also having a few flush retries listed before.
> I had the previous version running yesterday without any visible issue up to 2M
> flushes (but I might not have run the right actions)

Thanks for testing. Don't worry about the retries, that's an integral part
of the new chipset flush. As long as it's a small number, everything's
fine (I've gotten higher numbers than you while testing).

The important thing is whether you still get backtraces about failed
chipset flushes (there's none in the dmesg you attached). iirc the
previous patch still had problems there for you.

If the BUG is causing you too much grief while testing, undo the two
changes to drivers/gpu/drm/i915/i915_gem.c the patch makes. But that will
greatly reduce the number of chipset flushes, so massively hampering
testing. Otherwise just check your dmesg for any "chipset flush failed"
messages and hit the reset knob ;)
Comment 16 2points 2010-03-23 11:57:47 UTC
Created attachment 34375 [details]
dmesg with gtt flush v4 patch

dmesg summary: Complaints about chipset flush timeout, subsequent log entries show that the maximum number of retries has been reached. Eventual freeze.
Comment 17 2points 2010-03-23 12:18:05 UTC
(In reply to comment #16)
> dmesg summary: Complaints about chipset flush timeout

While scanning over relevant code, I also noticed this: It should probably be canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of canary_cpu_read, canary_cpu_read.
Comment 18 Daniel Vetter 2010-03-23 12:43:17 UTC
> --- Comment #17 from 2points@gmx.org  2010-03-23 12:18:05 PST ---
> (In reply to comment #16)
> > dmesg summary: Complaints about chipset flush timeout
> 
> While scanning over relevant code, I also noticed this: It should probably be
> canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of
> canary_cpu_read, canary_cpu_read.

Yep, you're right. Fixed in my local version.

I've looked at your dmesg and the chipset flush clearly doesn't work for
your hw. Can you please try to hang the gpu again (with my latest patch)
and then capture i915_error_state from <debugfs>/dri/0. Just to make sure
that your gpu is crashing due to a cache coherency bug and not due to
something else (rather unlikely). Meanwhile I'll try to come up with a new
idea to fix your problem.
Comment 19 2points 2010-03-23 13:15:20 UTC
Created attachment 34376 [details]
error state after freeze

Here you go. 

Chris Wilson suggested elsewhere that this particular discrepancy between patch results of the same hardware might be related to chipset/hardware revision and corresponding quirks, so this may or may not explain oddities. (lspci claims rev 02 for my GPU)
Comment 20 Daniel Vetter 2010-03-23 14:43:16 UTC
Created attachment 34377 [details] [review]
new patch, improved retry flushing

New patch to (hopefully) tackle the problems uncovered by 2points. Please retest.

Bruno, I haven't yet found the problem with the BUG in put_pages. I'm hitting it about once every few days (with the same backtrace as you), but I haven't got a clue yet what's causing it.
Comment 21 Brian Rogers 2010-03-24 08:21:43 UTC
Daniel, is this patch worth testing on i845 as well?
Comment 22 Daniel Vetter 2010-03-24 09:12:37 UTC
> --- Comment #21 from Brian Rogers <brian@xyzw.org>  2010-03-24 08:21:43 PST ---
> Daniel, is this patch worth testing on i845 as well?

Since Chris Wilson last tested this patch on his i845 (didn't work), the
patch has changed quite a bit. So it might be worth retesting it, if you
have the time to spare. I certainly appreciate any testing feedback I can
get, because every machine (even seemingly similar 855 boxes) seems to act
differently.
Comment 23 2points 2010-03-24 10:21:07 UTC
Created attachment 34416 [details]
dmesg with gtt flush v5 patch

I've made these observations so far: the number of reported failed chipset flushes has gone down drastically. In four test runs so far, I've spotted a chipset flush failure warning in only one. However, the GPU will still hang after a while, often without any flush-related warnings in dmesg.

I'd also like to add to the speculation that the failure rate might be related to CPU or possibly IO load. X runs for a lot longer if I just leave glxgears running by itself, but as soon as I start doing some work (Kontact, Opera, Flash plugin etc.) a GPU freeze seems more likely.
Comment 24 2points 2010-03-24 10:22:47 UTC
Created attachment 34417 [details]
error state after freeze, v5
Comment 25 Chris Wilson 2010-03-24 10:41:46 UTC
Created attachment 34418 [details] [review]
Clip solid fills

So I am working on the assumption that the residual hangs I am seeing after enabling/disable the GMCH to force the GTT flush are real batch buffer bugs...

For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we attempt to write well beyond the end of the buffer. In which case the attached should workaround the issue.
Comment 26 2points 2010-03-24 12:07:44 UTC
Created attachment 34419 [details]
dmesg with gtt flush v5 patch

(In reply to comment #25)
> Created an attachment (id=34418) [details]
> Clip solid fills
> 
> So I am working on the assumption that the residual hangs I am seeing after
> enabling/disable the GMCH to force the GTT flush are real batch buffer bugs...

Thanks, doing some tests with git HEAD and this patch. As far as chipset flushes go, I'm at slightly over 4.5M flushes right now. Since this is about four times as long as the machine usually lasts without freezing, I'd conclude that the fix is pretty effective.

As for failed chipset flushes, dmesg records three occurrences now after about one hour of glxgears and various other tasks.
Comment 27 Geir Ove Myhr 2010-03-24 23:03:18 UTC
(In reply to comment #25)
> For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we
> attempt to write well beyond the end of the buffer. 

Could you explain how to see this from the file or intel_error_decode output? I'm trying to make some documentation for downstream on how to interpret the dumps.

Is it related to this? I'm not sure what those numbers mean, except that the first is obviously the start address of the batch buffer.
Buffers [13]:
...
  02821000    16384 00000048 00000000 000e354b dirty purgeable
Comment 28 legolas558 2010-03-25 02:14:44 UTC
I forgot to add myself to CC list, will now try latest patch!
Comment 29 legolas558 2010-03-25 10:14:16 UTC
Sorry, I have reported my findings on bug 26345:

- debugfs DRI dumps (see attachment 34436 [details])
- dmesg with GTT failures (attachment 34437 [details])

http://bugs.freedesktop.org/show_bug.cgi?id=26345#c76
Comment 30 Daniel Vetter 2010-03-26 01:18:26 UTC
Created attachment 34469 [details] [review]
gtt chipset flush v6

Changes since v5:
- tuned magic values, hopefully fixing problems seen by 2points and legolas
- some debug checks trying to catch the put_pages BUG_ON problem while it's happening.

As usual, testing feedback highly welcome.
Comment 31 legolas558 2010-03-26 03:17:20 UTC
Created attachment 34473 [details]
dmesg of 2 failures with v6 patch
Comment 32 legolas558 2010-03-26 03:19:48 UTC
Created attachment 34474 [details]
debugfs dumps after 2 failures with v6 patch

I have just tested v6 patch with 3 glxgears and I got 2 flush failures; it is not possible to state if anything sensibly changed when compared with previous patch
Comment 33 Daniel Vetter 2010-03-26 03:44:08 UTC
> --- Comment #32 from legolas558 <legolas558@email.it>  2010-03-26 03:19:48 PST ---
> Created an attachment (id=34474)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=34474)
> debugfs dumps after 2 failures with v6 patch
> 
> I have just tested v6 patch with 3 glxgears and I got 2 flush failures; it is
> not possible to state if anything sensibly changed when compared with previous
> patch

Indeed, not much changed at all. The real problem is reported in the first
backtrace (grep for "chipset flush timed out" in your dmesg). It
essentially means that the code has given up. All further failed flushes
are most likely just a result of this.

The problem seems to be writes to the gtt just don't show up at the cpu
side on your box. There's a tunable in my patch in the file
drivers/char/agp/intel-gtt.c

#define I830_GTT_MAX_RETRIES 100

Can you try out whether increasing this to a ridiculous number (like 1000)
helps? This might cause your machine to stall sometimes. As soon as you
get the chipset flush timed out message (and the max retries hits the
value you've defined), give up.
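
To make the role of the tunable concrete, here is a minimal sketch of a bounded retry loop. Only the `I830_GTT_MAX_RETRIES` name and its value of 100 come from the patch; `flush_with_retries`, `coherent_fn` and the demo callback are hypothetical, and the real code in intel-gtt.c may be structured quite differently.

```c
#include <stddef.h>

#define I830_GTT_MAX_RETRIES 100  /* tunable quoted above */

/* Hypothetical callback: returns nonzero once CPU and GTT views agree. */
typedef int (*coherent_fn)(void *ctx);

/*
 * Repeat the coherency check until it passes or max_retries is exceeded.
 * Returns the number of retries that were needed (0 means the first check
 * passed), or -1 when the limit is hit, which would correspond to the
 * "chipset flush timed out" message.
 */
static int flush_with_retries(coherent_fn is_coherent, void *ctx,
                              int max_retries)
{
    for (int retries = 0; retries <= max_retries; retries++) {
        if (is_coherent(ctx))
            return retries;
        /* A real implementation would redo its flush work here. */
    }
    return -1;
}

/* Demo callback: reports coherency after a fixed number of failed checks. */
static int coherent_after(void *ctx)
{
    int *remaining = ctx;

    if (*remaining == 0)
        return 1;
    (*remaining)--;
    return 0;
}
```

With this shape, raising the limit only helps if the hardware becomes coherent within the larger budget; otherwise the loop simply takes longer before giving up.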

Thanks alot for testing this stuff.

btw, has video stability increased further with v6?
Comment 34 legolas558 2010-03-26 04:07:57 UTC
(In reply to comment #33)
> Indeed, not much changed at all. The real problem is reported in the first
> backtrace (grep for "chipset flush timed out" in your dmesg). It
> essentially means that the code has given up. All further failed flushes
> are most likely just a result of this.
> 
I have an uptime of 50 minutes (it's the same session of reported dump/dmesg files) and my ratio is still 2 /  229376 (using gttqual script in attachment 34435 [details]), so it is indeed as you said.

> The problem seems to be writes to the gtt just don't show up at the cpu
> side on your box. There's a tunable in my patch in the file
> drivers/char/agp/intel-gtt.c
> 
> #define I830_GTT_MAX_RETRIES 100
> 
> Can you try out whether increasing this to a ridiculous number (like 1000)
> helps? This might cause your machine to stall sometimes. As soon as you
> get the chipset flush timed out message (and the max retries hits the
> value you've defined), give up.
> 
> Thanks alot for testing this stuff.
> 
Yes I am gonna make this test on next reboot and see what happens.

> btw, has video stability increased further with v6?
> 
I would say definitely yes. I have tried with all videos which triggered the bug and no crash yet. 

Other important notes:
* I am using only the v6 patch on drm-intel from git, and nothing else, neither the patched xf86-video-intel
* my kernel command line parameters are: lapic=yes hpet=force clocksource=hpet i8042.nomux=1

This hardware (Fujitsu Amilo clone) has a known problem with ACPI/i8042 controller; the i8042.nomux=1 is used to prevent touchpad glitches (http://bugzilla.kernel.org/show_bug.cgi?id=8740), while the battery, thermal and ac modules are always unloaded because otherwise this kernel bug would be triggered: http://bugzilla.kernel.org/show_bug.cgi?id=9147

Apart from these facts, I don't know anything else which could be relevant
Comment 35 legolas558 2010-03-26 04:37:11 UTC
Created attachment 34476 [details]
dmesg of early flush failure with 2000 retries

Bug triggered instantly, no sensible (additional) slowdown. This looks like a dead-end...

Shall I attach also the debugfs dri data? Shall I make any special test?
Comment 36 legolas558 2010-03-26 04:51:08 UTC
what if we "give up" on coherency of the first n flushes? something like the screen test of arcade machines, except that we just don't check if test is OK, pretending that coherency becomes consistent after that; I have often found that the last graphics contents of the LVDS pipe are persistent between a reboot (when using a liveCD, for example) and you can actually see the last screenshot a while before screen gets properly cleared and initialized.

Sorry but I am shooting in the dark here
Comment 37 Daniel Vetter 2010-03-26 06:12:51 UTC
> --- Comment #36 from legolas558 <legolas558@email.it>  2010-03-26 04:51:08 PST ---
> what if we "give up" on coherency of the first n flushes? something like the
> screen test of arcade machines, except that we just don't check if test is OK,
> pretending that coherency becomes consistent after that; I have often found
> that the last graphics contents of the LVDS pipe are persistent between a
> reboot (when using a liveCD, for example) and you can actually see the last
> screenshot a while before screen gets properly cleared and initialized.

That's exactly what my patch currently does - it simply gives up after too
many retries. Now if this corrupts a pixmap/texture, it just yields visual
corruptions on the screen. But this can also corrupt the gpu command
buffer (and some other vital things). And if the gpu reads crap from
these, it usually just hangs itself.

You've increased max_retries to 2000, which equals to about 1ms of delay.
And it hasn't helped at all, i.e. the chipset takes probably even longer
to reach a coherent state again. And 1 ms is an eternity for computer hw,
so this will crash your box - sooner or later.
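
As a rough illustration of the timing argument, the "2000 retries is about 1 ms" figure can be turned into a per-retry estimate. The sketch assumes the delay scales linearly with the retry count, which the comment does not state; the macro and function names are hypothetical.

```c
/* Calibration point taken from the comment: 2000 retries ~ 1 ms. */
#define CAL_RETRIES   2000LL
#define CAL_DELAY_NS  1000000LL

/* Estimated worst-case delay in nanoseconds, assuming linear scaling. */
static long long retry_delay_ns(long long retries)
{
    return retries * CAL_DELAY_NS / CAL_RETRIES;
}
```

Under that assumption one retry costs roughly 500 ns, so the default limit of 100 retries would bound the delay at about 50 microseconds and 1000 retries at about 0.5 ms.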

> Sorry but I am shooting in the dark here

We all are ;) But there are some more constants to tune, this time in
include/drm/intel-gtt.h

#define I830_CC_GTT_WHACK_PAGES		16

Try to increase this (doubling it each step is sensible, the algo only
uses as much as required, this is just an upper bound). But don't go above
128, that'd be crazy (and I would have to figure out a new trick).
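
The suggested test sequence (double the value each step, stop at 128) can be written down as a small helper. The default of 16 and the limit of 128 come from the comment; `WHACK_PAGES_LIMIT` and `whack_page_steps` are hypothetical names used only for this sketch.

```c
#define I830_CC_GTT_WHACK_PAGES 16   /* default quoted above */
#define WHACK_PAGES_LIMIT       128  /* "don't go above 128" */

/*
 * Fill out[] with the values to try after the default, doubling each step
 * and never exceeding limit. Returns the number of values written.
 */
static int whack_page_steps(int start, int limit, int *out, int max_out)
{
    int n = 0;

    for (int v = start * 2; v <= limit && n < max_out; v *= 2)
        out[n++] = v;
    return n;
}
```

Starting from the default, this yields three values to test: 32, 64 and 128.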

btw, dmesg is usually enough - I'll ask if I need anything else.
Comment 38 legolas558 2010-03-26 07:08:31 UTC
(In reply to comment #37)
> > --- Comment #36 from legolas558 <legolas558@email.it>  2010-03-26 04:51:08 PST ---
> > what if we "give up" on coherency of the first n flushes? something like the
> > screen test of arcade machines, except that we just don't check if test is OK,
> > pretending that coherency becomes consistent after that; I have often found
> > that the last graphics contents of the LVDS pipe are persistent between a
> > reboot (when using a liveCD, for example) and you can actually see the last
> > screenshot a while before screen gets properly cleared and initialized.
> 
> That's exactly what my patch currently does - it simply gives up after too
> many retries. Now if this corrupts a pixmap/texture, it just yields visual
> corruptions on the screen. But this can also corrupt the gpu command
> buffer (and some other vital things). And if the gpu reads crap from
> these, it usually just hangs itself.
> 
Yes I can understand this; there's always a state machine behind the scenes.

> You've increased max_retries to 2000, which equals to about 1ms of delay.
> And it hasn't helped at all, i.e. the chipset takes probably even longer
> to reach a coherent state again. And 1 ms is an eternity for computer hw,
> so this will crash your box - sooner or later.
> 
I have done some other tests and it seems that 1000 or 2000 is high enough to *never* cause a failure with mild usage, while if I make 2 glxgear windows have a clipping rectangle, the failure immediately happens. Might this help? Perhaps openGL is altering the GPU in some way that we cannot forecast?

> > Sorry but I am shooting in the dark here
> 
> We all are ;) But there are some more constants to tune, this time in
> include/drm/intel-gtt.h
> 
> #define I830_CC_GTT_WHACK_PAGES         16
> 
> Try to increase this (doubling it each step is sensible, the algo only
> uses as much as required, this is just an upper bound). But don't go above
> 128, that'd be crazy (and I would have to figure out a new trick).
> 
> btw, dmesg is usually enough - I'll ask if I need anything else.
> 
Ok, thank you Daniel, I will revert the retries to 1000 and make the next possible 3 tests with the whack pages constant.

Can we state that the pre-KMS driver was working good enough because bug was harder to trigger in those conditions? I can say I experienced lockups when watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but it was rare.
Comment 39 Daniel Vetter 2010-03-26 07:19:15 UTC
> --- Comment #38 from legolas558 <legolas558@email.it>  2010-03-26 07:08:31 PST ---
> Can we state that the pre-KMS driver was working good enough because bug was
> harder to trigger in those conditions? I can say I experienced lockups when
> watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but
> it was rare.

Yep, the cache coherency bug was most likely always there. kms code tends
to be faster in certain circumstances and therefore tends to hit cache
coherency problems with a higher probability. That's also why Eric's patch
from half a year ago managed to break a few i855 chipsets by slightly
changing the timings. But that's just coincidental because that piece of
code helps cache coherency on i865 chipsets.

btw, I've tried your two-glxgears-with-clipping test. No ill effects, here
...
Comment 40 legolas558 2010-03-26 07:49:31 UTC
Created attachment 34479 [details]
dmesg of early failure with 1000 retries, 128 whack pages

(In reply to comment #39)
> > --- Comment #38 from legolas558 <legolas558@email.it>  2010-03-26 07:08:31 PST ---
> btw, I've tried your two-glxgears-with-clipping test. No ill effects, here
> ...
> 

I have tested with 32, 64, 128 whack pages and I have found the following:

- with 32/64 whack pages, moving around one glxgears window was enough to trigger the flush failure
- with 128 whack pages, moving around didn't seem to work but a resize of the glxgears window triggered the bug
- bug becomes progressively harder to trigger when increasing whack pages (qualitative sensation)

In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2 failures are found in dmesg (because as you said it gives up after that).

I am worried about this fact that our hardware, apparently the same, is not showing same behaviour...my .config is here:

http://www.iragan.com/linux/i855GM/legolas558.config
Comment 41 Daniel Vetter 2010-03-26 13:28:21 UTC
> --- Comment #40 from legolas558 <legolas558@email.it>  2010-03-26 07:49:31 PST ---
> In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2
> failures are found in dmesg (because as you said it gives up after that).

I've overlooked this, but now that I've checked, this is _very_ curious.
With v6 you only ever see 2 chipset flush failures, no matter how hard you
abuse your machine?

With the three dmesgs you've posted, these two failures are always in the
same chipset flush, just opposite directions (gtt->cpu and cpu->gtt
transfers).  They'll also coincide with the chipset flush timed out
message. Can you please check that this is indeed the case (with the other
dmesgs you've got lying around) with the other test runs, too? Just
compare the "expected: xxx" value on each of the three backtraces.

This is strange because my code only gives up on the _current_ chipset
flush and doesn't bother to report any further timeouts. It still executes
all chipset flushes and still reports about failed ones. So if your hw
only ever reports one failure where everything fails (timeout+paranoia
check failures in both directions) and never fails again, this would be
_very_ strange indeed.

> I am worried about this fact that our hardware, apparently the same, is not
> showing same behaviour...my .config is here:

I've compared our configs and tried changing a few relevant ones to your
setting. Still can't reproduce your failures.
Comment 42 legolas558 2010-03-26 14:49:10 UTC
(In reply to comment #41)
> > --- Comment #40 from legolas558 <legolas558@email.it>  2010-03-26 07:49:31 PST ---
> > In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2
> > failures are found in dmesg (because as you said it gives up after that).
> 
> I've overlooked this, but now that I've checked, this is _very_ curious.
> With v6 you only ever see 2 chipset flush failures, no matter how hard you
> abuse your machine?
> 
Yes. I have never seen more than 2 since I started using the v6 patch, but I might be wrong because I never did more than 300k flushes in a session with a v6-patched kernel.

> With the three dmesgs you've posted, these two failures are always in the
> same chipset flush, just opposite directions (gtt->cpu and cpu->gtt
> transfers).  They'll also coincide with the chipset flush timed out
> message. Can you please check that this is indeed the case (with the other
> dmesgs you've got lying around) with the other test runs, too? Just
> compare the "expected: xxx" value on each of the three backtraces.
> 
Yes, you can also see it with the v5 patch dmesg in attachment 34233.

From my dmesg logs:
~~ session1 - v6 patch
[   79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
[   79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
~~ session2 - v6 patch
[  101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
[  101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
~~ session3 - v5 patch
[ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
[ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
[ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
[ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
[ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
[ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975

I am gonna make more intensive tests later.
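
The pairing check requested in comment #41 can also be done mechanically instead of by eye. A minimal Python sketch, assuming the failure message format seen in the log lines above; the helper names `group_failures` and `paired_failures` are made up for illustration and are not part of any patch:

```python
import re
from collections import defaultdict

# Matches lines such as:
# [   79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
FAILURE_RE = re.compile(
    r"i8xx chipset flush failed, expected: (\d+), (cpu|gtt)_read: (\d+)"
)

# Sessions 1 and 2 (v6 patch) from the logs quoted above.
SAMPLE = """\
[   79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
[   79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
[  101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
[  101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
"""


def group_failures(dmesg_text):
    """Map each expected value to {side: value actually read}."""
    groups = defaultdict(dict)
    for line in dmesg_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            expected, side, actual = match.groups()
            groups[int(expected)][side] = int(actual)
    return dict(groups)


def paired_failures(groups):
    """Expected values for which both the cpu and the gtt read failed."""
    return sorted(
        expected for expected, sides in groups.items()
        if set(sides) >= {"cpu", "gtt"}
    )


if __name__ == "__main__":
    for expected in paired_failures(group_failures(SAMPLE)):
        print("both directions failed for expected value", expected)
```

Feeding a whole dmesg file to `group_failures` shows at a glance whether every failure comes as a cpu/gtt pair with the same expected value.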

> This is strange because my code only gives up on the _current_ chipset
> flush and doesn't bother to report any further timeouts. It still executes
> all chipset flushes and still reports about failed ones. So if your hw
> only ever reports one failure where everything fails (timeout+paranoia
> check failures in both directions) and never fails again, this would be
> _very_ strange indeed.
> 
Occam would say: perhaps it didn't fail at all and we are just not being informed correctly.

My raw guess is that somebody between us and the GPU is touching something it shouldn't, and I am inclined to always blame the i8042 controller, since I already see keyboard port corruption when battery ACPI is in use. But it is hard to link the i8042 and the GPU (and the modules which cause the i8042 keyboard glitch are never loaded), so I am still out of bullets.

> > I am worried about this fact that our hardware, apparently the same, is not
> > showing same behaviour...my .config is here:
> 
> I've compared our configs and tried changing a few relevant ones to your
> setting. Still can't reproduce your failures.
> 
As already stated, I am not using the "clip solid fills" patch, if that might be relevant, but I doubt it.

Just FYI, the crashes with hangcheck timer still happen (this time with a wine application, not a video) with the original v6 patch (no custom tuning of mine).
Comment 43 2points 2010-03-26 14:56:10 UTC
Created attachment 34497
dmesg with gtt flush v6 patch

Intermediate results for v6: No reported failures after 3.5M flushes, the number of maximum retries seems to have gone back slightly since v5 (now at 7, down from 9)

Meanwhile, I upgraded Xorg from 1.6.5 to 1.7.4 and noticed that the GPU hangs almost instantly once X is started up. Could attach relevant error state files, but it's probably not directly related to this bug, since no messages concerning failed flushes turn up in dmesg. Gone back to 1.6.5, and the problem disappeared (with the clip solid fills patch, that is).

> As already stated, I am not using "clip solid fills" patch, if that might be
> relevant, but I doubt.
Maybe you should. It was already merged in git, too.
Comment 44 legolas558 2010-03-26 15:02:25 UTC
Created attachment 34498
dmesg of multiple flush failure with v6 patch

I was wrong. Playing with 4-5 glxgears windows finally triggered the bug, so the situation is unchanged versus the v5 patch (except that failures happen less frequently).
Comment 45 legolas558 2010-03-26 15:06:03 UTC
(In reply to comment #43)
> Created an attachment (id=34497)
> dmesg with gtt flush v6 patch
> 
> Intermediate results for v6: No reported failures after 3.5M flushes, the
> number of maximum retries seems to have gone back slightly since v5 (now at 7,
> down from 9)
> 
> Meanwhile, I upgraded Xorg from 1.6.5 to 1.7.4 and noticed that the GPU hangs
> almost instantly once X is started up. Could attach relevant error state files,
> but it's probably not directly related to this bug, since no messages
> concerning failed flushes turn up in dmesg. Gone back to 1.6.5, and the problem
> disappeared (with the clip solid fills patch, that is).
> 
Wait, are you using Xorg 1.6.5 with KMS and the most recent intel driver? Is that possible?

I have always had the X hangup issue with 1.7.x series of Xorg. Right now I am using Xorg 1.7.5.902 and it crashes only when playing videos or using intensive graphics applications (wine).

> > As already stated, I am not using "clip solid fills" patch, if that might be relevant, but I doubt.
> Maybe you should. It was already merged in git, too.
> 
The problem is that my freedesktop git stack doesn't work, so I don't know how to get a patched xf86-video-intel.
Comment 46 Bruno 2010-03-26 15:15:46 UTC
On my side, nothing interesting to report, except maybe this (though I've not done aggressive tests, mostly just usual desktop use): in day-long sessions I've either hit the BUG_ON (not yet with the v6 patch) or only rarely seen the number of retried flushes increase (for today, v6):
...
[  384.945981] chipset flush no. 16384, max retries 0
[  554.507090] chipset flush no. 32768, max retries 0
[  686.938325] chipset flush no. 49152, max retries 0
[  946.820082] chipset flush no. 65536, max retries 0
[ 1531.186977] chipset flush no. 81920, max retries 0
[ 2157.786320] chipset flush no. 98304, max retries 0
...
[21448.487379] chipset flush no. 1097728, max retries 0
[21910.290869] chipset flush no. 1114112, max retries 0
[22345.259438] chipset flush no. 1130496, max retries 0
[22858.071707] chipset flush no. 1146880, max retries 1
[23118.534467] chipset flush no. 1163264, max retries 1
[23299.123857] chipset flush no. 1179648, max retries 1
...

with 4 being the highest number seen.
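
For long sessions it can help to extract only the points where the retry count goes up. A rough Python sketch, assuming the "chipset flush no. N, max retries M" progress line format shown above (printed every 16384 flushes in this log); the function name `retry_increases` is hypothetical:

```python
import re

# Matches lines such as:
# [22858.071707] chipset flush no. 1146880, max retries 1
PROGRESS_RE = re.compile(
    r"\[\s*([\d.]+)\] chipset flush no\. (\d+), max retries (\d+)"
)

# Three consecutive progress lines from the excerpt above.
SAMPLE = """\
[22345.259438] chipset flush no. 1130496, max retries 0
[22858.071707] chipset flush no. 1146880, max retries 1
[23118.534467] chipset flush no. 1163264, max retries 1
"""


def retry_increases(dmesg_text):
    """Return (timestamp, flush_no, max_retries) for each rise in max retries."""
    increases = []
    highest = 0
    for line in dmesg_text.splitlines():
        match = PROGRESS_RE.search(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        flush_no = int(match.group(2))
        retries = int(match.group(3))
        if retries > highest:
            highest = retries
            increases.append((timestamp, flush_no, retries))
    return increases


if __name__ == "__main__":
    for timestamp, flush_no, retries in retry_increases(SAMPLE):
        print(f"{timestamp:.6f}s: max retries rose to {retries} by flush {flush_no}")
```

Because the progress line is only printed periodically, the flush number reported is an upper bound on when the retry actually happened, not the exact flush.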
Comment 47 legolas558 2010-03-26 15:28:37 UTC
If I don't open any glxgears and use the laptop for browsing and editing files, my max retries only reaches 6 at most.

When I open glxgears and resize it till flush failure, max retries count is updated to 1000 in next "chipset flush no." line.
Comment 48 Daniel Vetter 2010-03-26 16:13:26 UTC
> --- Comment #42 from legolas558 <legolas558@email.it>  2010-03-26 14:49:10 PST ---
> From my dmesg logs:
> ~~ session1 - v6 patch
> [   79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
> [   79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
> ~~ session2 - v6 patch
> [  101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
> [  101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
> ~~ session3 - v5 patch
> [ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
> [ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
> [ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
> [ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
> [ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
> [ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975

Yet again I was totally blind. All your failed flushes report an actual
value that's only one off from the expected one. But since v2 I'm moving
around the check value on the check page, so each position is only used
every 1024th cache flush. Which means that if the flush doesn't work and
the old value is still there, it should be "expected_value - 1024".

Furthermore your system seems to be the only one where chipset flushes
fail in pairs (always both directions in the same chipset flush). I
haven't seen this in any other dmesg, neither mine nor any other
tester's.

In other words I highly suspect that something is (very rarely) corrupting
the last two bits of a 4 byte block. This would also explain why the
correct value never shows up, even after extensive gtt whacking.
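
To make the "expected_value - 1024" argument concrete, here is a small Python model of the rotating check value. It is only a sketch under assumptions that are not confirmed by the patch itself: that the check value is the flush counter, that the slot on the check page is the counter modulo 1024, and that the page starts out zeroed. The function names are made up for illustration.

```python
SLOTS = 1024  # positions on the check page, as described above


def simulate_missed_flush(failing_flush, slots=SLOTS):
    """Value found in the slot of `failing_flush` if that write never lands."""
    page = [0] * slots  # assumption: check page starts zeroed
    for counter in range(1, failing_flush):
        page[counter % slots] = counter  # assumption: slot = counter % slots
    # The write for `failing_flush` itself is skipped to model a failed flush.
    return page[failing_flush % slots]


def classify(expected, actual, slots=SLOTS):
    """Say whether a mismatch looks like a stale slot or something else."""
    if actual == expected:
        return "ok"
    if actual == simulate_missed_flush(expected, slots):
        return "stale"
    return "unexplained"


if __name__ == "__main__":
    # A genuinely missed flush leaves the value from 1024 flushes earlier.
    print(simulate_missed_flush(5807))  # 4783, i.e. 5807 - 1024
    # The reported reading (5806) is not what a stale slot would contain.
    print(classify(5807, 5806))
```

Under this model, a reading of 5806 for an expected 5807 cannot be explained by a flush that simply failed to become visible, which is what points towards corruption instead.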

Please test your box with memtest86+. If that doesn't turn anything up
I'll write a testpatch (memtest86+ doesn't check the gtt, wherein the
problem might be, too).
Comment 49 legolas558 2010-03-27 05:48:46 UTC
(In reply to comment #48)
> > --- Comment #42 from legolas558 <legolas558@email.it>  2010-03-26 14:49:10 PST ---
> > From my dmesg logs:
> > ~~ session1 - v6 patch
> > [   79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
> > [   79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
> > ~~ session2 - v6 patch
> > [  101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
> > [  101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
> > ~~ session3 - v5 patch
> > [ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
> > [ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
> > [ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
> > [ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
> > [ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
> > [ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975
> 
In the session3, v5 might be v2 actually.

> Yet again I was totally blind. All your failed flushes report an actual
> value that's only one off from the expected one. But since v2 I'm moving
> around the check value on the check page, so each position is only used
> every 1024th cache flush. Which means that if the flush doesn't work and
> the old value is still there, it should be "expected_value - 1024".
> 
Well I hope this will be useful to improve the patch.

> Furthermore your system seems to be the only one where chipset flushes
> fail in pairs (always both directions in the same chipset flush). I
> haven't seen this on any other dmesg neither by me nor by any other
> tester.
> 
Yes, I admit I feel lonely recently...it would be nice to find another guy with my exact hardware.

My only custom option for intel driver in xorg.conf is:
Option "XvMC" "true"

but I doubt this could be relevant.

> In other words I highly suspect that something is (very rarely) corrupting
> the last two bits of a 4 byte block. This would also explain why the
> correct value never shows up, even after extensive gtt whacking.
> 
I have tried booting with acpi=off, but seems that KMS depends on ACPI. I can only think that some ACPI or "gone wild" IRQ is causing the corruption, or that there is a broken GTT memory cell as you hypothesized.

> Please test your box with memtest86+. If that doesn't turn anything up
> I'll write a testpatch (memtest86+ doesn't check the gtt, wherein the
> problem might be, too).
> 
I completed 2 passes (ECC off) with memtest86+ v4.00 and no errors were found in my 503M (I suppose the missing memory is shadowed). So the corruption might lie in the GTT area, but I don't know how to test that... and if I have understood correctly, the i855GM does not make this kind of consistency check easy. I am waiting for your testpatch because unfortunately I am far from being able to write such a GTT testpatch myself.
Comment 50 Daniel Vetter 2010-03-27 06:19:19 UTC
Created attachment 34505
memory check patch for legolas

legolas, please apply this patch on top of v6. When a flush fails, this will read back the check values written to system mem (cpu) and gtt via the same path as they have been written to and print them out. If the readback value equals the expected value (look at the chipset fail message right before), everything is fine. If they equal the values as read on the other side (i.e. gtt readback = cpu read), something is corrupting memory when (ab)using the gtt.

To test just stress your system until you get a cache flush failure.
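
The decision rule above can be written down as a tiny Python helper. This is only an illustration of how to read the output, not code from the patch; the function name and the result strings are invented, and the sample numbers are the ones reported later in comment #52.

```python
def interpret_readback(expected, other_side_read, readback):
    """Apply the rule from this comment to one failed flush.

    expected        -- value from the "chipset flush failed, expected: N" line
    other_side_read -- value seen through the other path (gtt_read or cpu_read)
    readback        -- value read back via the same path it was written through
    """
    if readback == expected:
        # The write itself landed; only the view through the other path was off.
        return "fine"
    if readback == other_side_read:
        # Even the writing side no longer sees what it wrote.
        return "corruption suspected"
    return "inconclusive"


if __name__ == "__main__":
    # expected: 5031, gtt_read: 5029, gtt/cpu readback: 5029
    print(interpret_readback(5031, 5029, 5029))
```

The "inconclusive" branch is an addition for completeness; the comment above only describes the first two cases.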
Comment 51 legolas558 2010-03-27 10:39:55 UTC
Created attachment 34506
dmesg of failure with v6 patch + gtt/cpu readback info

mumble mumble...
Comment 52 legolas558 2010-03-27 11:21:51 UTC
(In reply to comment #51)
> Created an attachment (id=34506)
> dmesg of failure with v6 patch + gtt/cpu readback info
> 
An excerpt from the above file:

[   85.216591] i8xx chipset flush failed, expected: 5031, gtt_read: 5029
[   85.216771] gtt readback: 5029, cpu reaback: 5029
[   85.231559] chipset flush timed out, gtt_read: 5031, cpu_read: 5031, expected: 5032, gtt_pos :2421, cpu_pos: 0
[   85.231854] gtt readback: 5031, cpu reaback: 5031
[  113.151457] gtt readback: 20845, cpu reaback: 20845
[  116.806192] gtt readback: 23135, cpu reaback: 23135

I'd say it's the second case, and narrowing down the cause seems scary. Some notes regarding patch v6:

1) there are some lags (even mouse cursor hiccups), this might be normal
2) the ability to switch to VT when system fatally crashes (e.g. video failure or other program failure, with hangcheck error or Xorg I/O errors) is gone
3) if I don't open glxgears and just use thunderbird, firefox and XFCE normally, no flush failures are ever reported in dmesg. Otherwise if I open glxgears and play with it a bit, I get the 2 failures (getting more is more difficult but indeed possible as shown before)

Regarding (2): drm and i915 modules are loaded automatically by Xorg and not loaded during kernel boot-up. I might try compiling drm+i915 as built-in, and perhaps this will fix the weird GTT corruption I am experiencing.

So it happens that after quite a while of normal usage I look at dmesg and I find no failure; in this status I have to expect a sudden (unrecoverable) crash from time to time, which requires hard shutdown, and that has almost become my usual way of closing the session (50% crashes, 50% normal I'd say).
Comment 53 legolas558 2010-03-27 11:35:16 UTC
(In reply to comment #52)
> Regarding (2): drm and i915 modules are loaded automatically by Xorg and not
> loaded during kernel boot-up. I might try compiling drm+i915 as built-in, and
> perhaps this will fix the weird GTT corruption I am experiencing.
> 
Bah, it didn't work. I have just compiled drm and i915 as built-in and it creates almost the same output:

[   71.198038] chipset flush timed out, gtt_read: 0, cpu_read: 3932, expected: 3933, gtt_pos :1, cpu_pos: 1
[   71.198389] i8xx chipset flush failed, expected: 3933, gtt_read: 3932
[   71.198533] gtt readback: 3932, cpu reaback: 3932
[   71.521156] gtt readback: 4138, cpu reaback: 4138
[   74.132310] gtt readback: 5803, cpu reaback: 5803
[   75.552605] gtt readback: 6813, cpu reaback: 6813
[   89.246701] gtt readback: 15633, cpu reaback: 15633
[   95.692300] gtt readback: 20097, cpu reaback: 20097
Comment 54 Tony White 2010-03-29 14:12:51 UTC
Hi guys,
Really appreciate the work you guys have been doing to try to fix this issue. And I truly mean that. This is a horrible issue.

If you haven't read it, the intel data sheet for this graphics card is here :
http://www.intel.com/Assets/PDF/datasheet/252615.pdf

And it does make for a reasonably interesting read when you consider it from the aspect of trying to hunt down what's causing this problem, although needle - haystack, much?

The SMM space restrictions look like a place of interest to me and also the very liberal way in which they have allowed bios manufacturers choose certain things related to the address registers.
Because it's Centrino technology the 855 is like a three in one deal, bios, 855gm chip and processor, all linked into together to render the graphics to screen and it looks like bios manufacturers have done whatever they thought was the best way to make that combo work, so any number of 855gm cards can work any number of different weird and wonderful ways using each possibly unique bios implementation. I've certainly seen evidence of that on my 855GM.
I have used Linux on this machine both with its original bios and after updating to the latest bios from the Asus website, and the two bios versions behave differently.
The first one only required the nolapic parameter to boot, while the latest one requires mem=1001M (but the memory is supposed to be 1024M and memtest says 1016M).
Without the mem parameter the kernel boots, but very slowly; with it, it boots fine.
My wild stab in the dark here is that there is an undetected memory hole and that's causing the problem. The actual memory modules are fine.

As far as what you guys have been testing, I have experienced the same thing in regards to the symptoms. It will work, I can use a browser, watch flash video and it all seems fine but after an hour, it will lock up and need a manual power down to restore a working system.

If it is the case that it is the memory and more specifically the memory buffer which is causing the problem because the buffer is filling up and it is not being flushed in time, does the card not use any sort of compression to compress any parts of the buffer which would require a different type of flush to empty? (Multiple overlay?)
Could it be that because bios manufacturers have had such liberal choices on their bios implementations with this card that the memory addresses to flush are being detected incorrectly or could it be that the flush is trying to flush a part of the memory which it is not allowed to (Maybe because it's detected the addresses incorrectly) And that in turn triggers some kind of stop in the hardware, which prevents any further flushes.

I am of course clutching at straws, my knowledge is limited and it would be a tall order for me to learn C, fork the code, go through the data sheet and write a proper driver for the Linux kernel for this card. Although I would dearly love to do that.

I wish you guys luck on fixing this problem. You have made some impressive improvements so far compared to the way it was, and in its current form the driver using your patches is very nearly suitable as a fix.

Please don't give up!
 
Comment 55 Daniel Vetter 2010-03-30 01:55:33 UTC
> --- Comment #54 from Tony White <tonywhite100@googlemail.com>  2010-03-29 14:12:51 PST ---
> Hi guys,
> Really appreciate the work you guys have been doing to try to fix this issue.
> And I truly mean that. This is a horrible issue.
> 
> If you haven't read it, the intel data sheet for this graphics card is here :
> http://www.intel.com/Assets/PDF/datasheet/252615.pdf

I know about this specsheet and I've read through it already a few weeks
ago. It only contains detailed feature descriptions of the gmch plus a few
non-graphics-core related register definitions. There's nothing in there
that could give us a hint as to how gtt<->cpu caches work and how to fix
it.

If you want to help, please test my latest patch (v6) and report how it
fares on your machine.

To everyone else who has already tested this: Thanks a lot. Small summary
on the state of i855 cache coherency (please correct me if I'm wrong):

- Latest version works on three boxes (from Bruno, 2points and mine).
- It also seems to work on legolas' box, but that machine has some other
  issue. Pardon for being the messenger, but looks like your machine is
  toast :(
- The most likely unrelated BUG in put_pages hasn't surfaced again.

I won't submit this as is anytime soon for two reasons:
- The code is still rather ugly atm. I need to clean it up.
- This patch is a way too horrible hack to submit it for inclusion with a
  straight face.

To fix the latter, I want some more test reports. Given how many bug
reports downstream gathered, it shouldn't be a problem to find more
testers (distro maintainers: hint, hint). Please only test on i855
chipsets, i845 still seems to have some problems. If it works (check the
dmesg for chipset flush related backtraces), please add your tested-by
line with a small blurb about your machine to this bug report, like this:

Tested-by: Daniel Vetter <daniel@ffwll.ch> (IBM Thinkpad X40)

If it doesn't work, hit me with your dmesg and i915_error_state output ;)

Bruno, legolas, 2points, please add your tested-by, too.
Comment 56 Geir Ove Myhr 2010-03-30 03:59:57 UTC
(In reply to comment #55)
> - This patch is a way too horrible hack to submit it for inclusion with a
>   straight face.
> 
> To fix the latter, I want some more test reports. Given how many bug
> reports downstream gathered, it shouldn't be a problem to find more
> testers (distro maintainers: hint, hint). 

I have been meaning to build test kernels for Ubuntu users for a while, but the Ubuntu wiki documentation for building kernel packages [1] that I have used before seems to not work that well anymore. Btw, which kernel version is it best to patch? A recent drm-intel-next or 2.6.34-rc2? On -rc2 I get problems with intel-agp.c:

gomyhr@storhaugen:~/src/linux-2.6.34-rc2$ patch -p1 --dry-run <../fix-i8xx-gtt-cache-coherency-v6.patch 
patching file drivers/char/agp/Makefile
patching file drivers/char/agp/agp.h
patching file drivers/char/agp/efficeon-agp.c
patching file drivers/char/agp/intel-agp-gart.c
patching file drivers/char/agp/intel-agp.c
Reversed (or previously applied) patch detected!  Assume -R? [n] 
Apply anyway? [n] 
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/char/agp/intel-agp.c.rej
patching file drivers/char/agp/intel-agp.h
patching file drivers/char/agp/intel-gtt.c
patching file drivers/gpu/drm/i915/i915_dma.c
patching file drivers/gpu/drm/i915/i915_gem.c
patching file include/drm/intel-gtt.h

and I get the same with drm-intel-next. Should I just delete this file, since that is what the patch seems to do? (I tried this, and the build seemed to work, but then some of the Ubuntu-packaging failed).

[1]: https://wiki.ubuntu.com/KernelTeam/GitKernelBuild
Comment 57 Daniel Vetter 2010-03-30 04:13:35 UTC
> --- Comment #56 from Geir Ove Myhr <gomyhr@gmail.com>  2010-03-30 03:59:57 PST ---
> I have been meaning to build test kernels for Ubuntu users for a while, but
> the Ubuntu wiki documentation for building kernel packages [1] that I have used
> before seems to not work that well anymore. Btw, which kernel version is it
> best to patch? A recent drm-intel-next or 2.6.34-rc2? On -rc2 I get problems
> with intel-agp.c:

Sorry for the confusion. Patch is actually based upon rc1, but that one's
missing some recent intel bugfixes. I'll rebase that thing on something
more current later today.
Comment 58 legolas558 2010-03-30 05:47:52 UTC
(In reply to comment #55)
> To everyone else who has already tested this: Thanks a lot. Small summary
> on the state of i855 cache coherency (please correct me if I'm wrong):
> 
> - Latest version works on three boxes (from Bruno, 2points and mine).
> - It also seems to work on legolas' box, but that machine has some other
>   issue. Pardon for being the messenger, but looks like your machine is
>   toast :(

I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't be able to use it with WindowsXP (never had a crash there) nor with Xorg 1.6 (which works, but only because, as you said, the driver hits the cache less frequently).

Has Intel ever released the WindowsXP driver sources? Yes, I know..just dreaming..

Is there any other way that could explain the GTT<->CPU bug?

I'll add my Tested-By with next patch, since this one doesn't seem to be in-sync with latest drm-intel
Comment 59 Daniel Vetter 2010-03-30 07:08:54 UTC
> --- Comment #58 from legolas558 <legolas558@email.it>  2010-03-30 05:47:52 PST ---
> I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't
> be able to use it with WindowsXP (never had a crash there) and neither with
> Xorg 1.6 (that works but because as you said the driver hits less frequently
> the cache).

The patch I intend to submit hopefully works better, too. By killing all
the gtt stress-testing hacks I've added your box is probably on par with
the other solutions.

> Has Intel ever released the WindowsXP driver sources? Yes, I know..just
> dreaming..

Likely won't help. The i8xx chipsets were designed without a kernel memory
manager in mind (Windows only gained that with Vista). So the XP driver
probably just implements a static gtt (that doesn't need any chipset
flushes) and copies textures back and forth. That works, but performance
will suck, especially with kernel-managed graphics memory allocation,
where spills happen rather often.

In other words, we're coaxing these chipsets into a framework they're not
designed for (but which is the only sane thing to do considering modern
graphics apis), trying to paper over any hw deficiencies with horrible
hacks like mine.

> Is there any other way that could explain the GTT<->CPU bug?

As long as there's no other report of the same problem, hw flakiness is
the only likely option. After all it only happens when hitting the gtt
really hard, something XP (and the old ums driver) are not likely to do.
Comment 60 legolas558 2010-03-31 02:01:45 UTC
(In reply to comment #59)
> > --- Comment #58 from legolas558 <legolas558@email.it>  2010-03-30 05:47:52 PST ---
> > I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't
> > be able to use it with WindowsXP (never had a crash there) and neither with
> > Xorg 1.6 (that works but because as you said the driver hits less frequently
> > the cache).
> 
> The patch I intend to submit hopefully works better, too. By killing all
> the gtt stress-testing hacks I've added your box is probably on par with
> the other solutions.
> 
Yes, because I am led to think that the sudden crash happening later (even after 1 hour of uptime) is not related to these rare cache failures (easily verified with glxgears, but not otherwise during normal usage). The fact is that the total Xorg crash will happen even if no cache failure has happened yet... so perhaps that's another bug, but this patch is still necessary even as-is.

Are you planning to submit the patch here before sending it upstream?

Anyway my signature is:

Tested-by: Daniele Castellitto <legolas558@users.sourceforge.net> (Maxdata Pro 7000X)

> > Has Intel ever released the WindowsXP driver sources? Yes, I know..just
> > dreaming..
> 
> Likely won't help. The i8xx chipsets were designed without a kernel memory
> manager in mind (Windows only gained that with Vista). So the XP driver
> probably just implements a static gtt (that doesn't need any chipset
> flushes) and copies textures back and forth. That works, but performance
> will suck, especially with kernel-managed graphics memory allocation,
> where spills happen rather often.
> 
> In other words, we're coaxing these chipsets into a framework they're not
> designed for (but which is the only sane thing to do considering modern
> graphics apis), trying to paper over any hw deficiencies with horrible
> hacks like mine.
> 
Thank you Daniel for telling us so much - very appreciated. I now see the bigger picture.

> > Is there any other way that could explain the GTT<->CPU bug?
> 
> As long as there's no other report of the same problem, hw flakiness is
> the only likely option. After all it only happens when hitting the gtt
> really hard, something XP (and the old ums driver) are not likely to do.
> 
I see. I have also tried adding delays before reading back the values, but that does not help. It must be indeed some hardware glitch. I'd like to try KMS without ACPI, to see if it still happens, but that doesn't seem possible either.

So for now I'll keep this patch; and perhaps we can focus on the sudden crash bug later.
Comment 61 2points 2010-03-31 04:35:57 UTC
Thanks for all your work on this. Even if inclusion in mainline is still pending, patches here finally fix problems I've had for two years now. Looks like I can finally upgrade from 2.6.27 without fear of random failures interrupting my work every now and then.

Tested-by: Moritz Brunner <2points@gmx.org> (Asus M2400N)
Comment 62 Thorsten Vollmer 2010-03-31 11:24:00 UTC
Daniel, I tested your patch (v6) on an 852GME. Being very similar to the 855, the 852 also suffers from this bug. The GPU hangs frequently with all recent kernels. With your patch applied, I have not seen any hangs, nor any messages about failed flushes. The graphics performance is noticeably reduced though.

Tested-by: Thorsten Vollmer <thorsten@thvo.de> (DFI-ACP G5M150-N w/ 852GME)

BTW: I was surprised to read that you are using an X40, because I would never have discovered this bug on my X40, my second machine. I have been using it for weeks with unpatched kernels and never saw any hangs. At first I hoped that my X40 was not affected and we could compare register settings. But with your patch applied, the kernel reports some flush retries. The frequency of failed flushes must be very low though.

I appreciate your work on this issue. Thanks.
Comment 63 Daniel Vetter 2010-03-31 14:54:57 UTC
> --- Comment #60 from legolas558 <legolas558@email.it>  2010-03-31 02:01:45 PST ---
> Are you planning to submit the patch here before sending it upstream?

As soon as I post the patch for inclusion, I'll add a patch with all the
debugging stuff removed to this bug report.

> --- Comment #62 from Thorsten Vollmer <thorsten@thvo.de>  2010-03-31 11:24:00 PST ---
> BTW: I was surprised to read that you are using an X40, because I would never
> have discovered this bug on my X40, my second machine. I have been using it for
> weeks with unpatched kernels and never saw any hangs. At first I hoped that my
> X40 was not affected and we could compare register settings. But with your
> patch applied, the kernel reports some flush retries. The frequency of failed
> flushes must be very low though.

Yep, my X40 is very stable with stock kernels. So I could never understand
all these bug reports about "intel drivers totally suck on i855GM"
because, hey, it works here! But a discussion with Chris Wilson about a
very strange bug report got me thinking. A few debug hacks later (to
stress the gtt) I've reduced the lifetime expectancy of my X40 to half a
minute :( With this, I could then start hacking on solutions.

btw, these hacks are included in the patches posted here to really make
sure it works now. I'll drop them for the final rev.
Comment 64 Daniel Vetter 2010-04-01 03:54:55 UTC
Created attachment 34595
gtt chipset flush v7

Rebased against 2.6.34-rc3 (_not_ drm-intel-next, that one doesn't have all the latest fixes). No other changes from v6.
Comment 65 René Gabriëls 2010-04-01 09:31:56 UTC
Created attachment 34597
NEC P520: dmesg output
Comment 66 René Gabriëls 2010-04-01 09:38:40 UTC
The v7 patch appears to be working on my NEC versa P520: no crashes or screen corruptions yet!   I've been running glxgears for 40 minutes to stress test it.  The dmesg output (see previous post) reports a number of inconsistencies though.  No idea if this is a problem, I haven't been following this discussion very closely.

PS: Performance is also way better than with previous patch (https://bugs.freedesktop.org/attachment.cgi?id=33593), which I have been running for a couple of weeks now without any problems (except performance).
Comment 67 Bruno 2010-04-01 11:40:04 UTC
(In reply to comment #55)
> Tested-by: Daniel Vetter <daniel@ffwll.ch> (IBM Thinkpad X40)
> 
> If it doesn't work, hit me with your dmesg and i915_error_state output ;)
> 
> Bruno, legolas, 2points, please add your tested-by, too.
> 

Here it is, also had v7 running this evening with 3x glxgears  and dmesg lists same kind of results as in comment #46 (reached about 5M-flushes with 1 retry until now)

Tested-by: Bruno Prémont <bonbons@linux-vserver.org> (Acer TM66x)
Comment 68 Rémi Cardona 2010-04-01 15:34:11 UTC
Created attachment 34602 [details]
dmesg output (HP Pavilion dv1000)

Here's my dmesg after running 2 glxgears at the same time... Doesn't look too good.

I'll see if the various corruptions I've seen so far reappear less often or not.

Thanks
Comment 69 legolas558 2010-04-03 05:09:03 UTC
I am now using patch v7, it is much more performant, max retries never seen above 5.

The 2 flush failures still happen (although it's harder to trigger them) when resizing glxgears window. Also if you insist a bit you will get a total system crash.

Always better than the vanilla kernel; please send upstream.

Tested-by: Daniele Castellitto <legolas558@users.sourceforge.net> (Maxdata Pro 7000X)
Comment 70 legolas558 2010-04-07 02:24:10 UTC
(In reply to comment #68)
> Created an attachment (id=34602) [details]
> dmesg output (HP Pavilion dv1000)
> 
@Daniel Vetter: maybe I and Rémi have the same issue? Would it be possible to store somewhere the instructions executed just before the GTT failure or much more importantly before the total system crash? Perhaps there are opcodes which do not fully reset the internal state machine, and adding some null operation in the flow will fix it (this would also explain why Xorg 1.6 crashes once in a year...)
Comment 71 Indan Zupancic 2010-04-07 16:37:09 UTC
(In reply to comment #70)
> If you want to help, please test my latest patch (v6) and report how it
> fares on your machine.

I've got a Thinkpad X40 with a rev 2 855GM chipset. With stock kernels my system is mostly stable, but X crashes now and then (depending on what I do, a couple times a week). Since KMS it's just the screen that freezes, except for the mouse pointer, and I can switch to console.

With your v7 patch applied on current git I get a full system hang fairly soon and easily. Because the system is stuck I can't get any error messages, but I didn't get any in dmesg before the hang (max retries was always 0, printed twice).

I do have the debugfs output, dmesg and Xorg.log after a hang with a plain current git kernel (2.6.34-rc3, HEAD at 0fdf86). Start of i915_error_state says:

Time: 1270672918 s 546381 us
PCI ID: 0x3582
EIR: 0x00000000
  PGTBL_ER: 0x00000000
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x41100000
  INSTDONE: 0x037fefc1
  ACTHD: 0x07c2a814
seqno: 0x000298f1

xorg-server 1.7.5.902, intel driver 2.10.0,

If there's anything I can do to help, please ask.

As I'm not interested in 3D, I'm starting to wonder why I shouldn't switch to the VGA driver. Surely that one is stable and not that much slower for 2D?
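
For anyone comparing several i915_error_state dumps, here is a minimal Python sketch that pulls the `NAME: 0x...` register lines out of the header quoted above. The function name `parse_error_state` and the regular expression are illustrative assumptions based only on the lines shown in this comment. Other kernels may print more or differently formatted fields.

```python
import re

# Matches header lines such as "  IPEHR: 0x41100000" or "PCI ID: 0x3582".
# The pattern is an assumption derived from the dump quoted in this comment.
_REG_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z_ ]*):\s*0x([0-9a-fA-F]+)\s*$")


def parse_error_state(text):
    """Return {name: int} for every 'NAME: 0x<hex>' line in text."""
    regs = {}
    for line in text.splitlines():
        match = _REG_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


# The header from this comment, copied verbatim.
SAMPLE = """\
Time: 1270672918 s 546381 us
PCI ID: 0x3582
EIR: 0x00000000
  PGTBL_ER: 0x00000000
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x41100000
  INSTDONE: 0x037fefc1
  ACTHD: 0x07c2a814
seqno: 0x000298f1
"""

if __name__ == "__main__":
    for name, value in parse_error_state(SAMPLE).items():
        print(f"{name:10s} 0x{value:08x}")
```

The `Time:` line is skipped on purpose because it is not a hex register value. Diffing the resulting dictionaries across hangs makes it easier to see which fields change.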
Comment 72 Daniel Vetter 2010-04-08 04:46:43 UTC
> --- Comment #71 from Indan Zupancic <indan@nul.nu> 2010-04-07 16:37:09 PDT ---
> If there's anything I can do to help, please ask.

Thanks for testing. Please update to xf86-video-intel 2.11 (just released
a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
for gpu hangs on i8xx hw.  If your gpu still hangs, please attach the
output of i915_error_state, that's usually sufficient to get a clue about
what's going on.
Comment 73 Daniel Vetter 2010-04-08 04:52:22 UTC
> --- Comment #70 from legolas558 <legolas558@email.it> 2010-04-07 02:24:10 PDT ---
> @Daniel Vetter: maybe I and Rémi have the same issue?

Yep, it looks like Rémi, René and you all suffer from the same. In other
words, this can't be explained by broken hw anymore. I have a few ideas
about what's going on (that would also explain the put_pages BUG seen by
Bruno). But don't hold your breath waiting for a fix, for this bug really
is a specialist in evasive maneuvers ;(
Comment 74 legolas558 2010-04-08 05:22:16 UTC
(In reply to comment #73)
> > --- Comment #70 from legolas558 <legolas558@email.it> 2010-04-07 02:24:10 PDT ---
> > @Daniel Vetter: maybe I and Rémi have the same issue?
> 
> Yep, it looks like Rémi, René and you all suffer from the same. In other
> words, this can't be explained by broken hw anymore. I have a few ideas
> about what's going on (that would also explain the put_pages BUG seen by
> Bruno). But don't hold your breath waiting for a fix for this bug really
> is a specialist in evasive maneuvers ;(

Oh sure it's broken, but its brokenness is consistent within the same model at least. How mad/stupid would it be to analyze the flow of GPU instructions to detect differences between a normal non-crashing flow (pre-KMS) and a crashing flow (KMS)? Also it would be nice to be able to "pack" these GPU flows in runnable batchsets so that one can eventually find the glitching sequence.
I have done this kind of sorcery with JTAG so I thought it might be a "tool" for us too.
Comment 75 Bruno 2010-04-08 11:50:54 UTC
Created attachment 34823 [details]
Kernel logs with 3 BUGs in i915_gem.c:1456 / i915_gem_object_put_pages()
Comment 76 Daniel Vetter 2010-04-08 12:29:18 UTC
Created attachment 34824 [details] [review]
fix locking around chipset flushing

legolas, Rémi, René, this patch should fix the problems you've encountered with timed-out chipset flushes. It was a bug in my code. Please test extensively.
Comment 77 legolas558 2010-04-08 14:35:51 UTC
(In reply to comment #76)
> Created an attachment (id=34824) [details]
> fix locking around chipset flushing
> 
> legolas, Rémi, René, this patch should fixed the problems you've encountered
> with timed-out chipset flushes. It was a bug in my code. Please test
> extensively.

Everything nominal with the new bugfix, as expected :-)

failures / flushes:
0 /  475136
max retries:
38.512510	0
160.138951	6
232.302424	8
402.559573	11

No flush failures with 6 glxgears windows (my 1.6Ghz hardware was almost hung up by running them).

The 8,11 max retries happened when opening Firefox, not when running glxgears; now it really seems stable, my raw feeling is that the bug is totally fixed.

Thanks for finding it :)

Now I'll try if playing videos still triggers a hangup, or if it hangs after hours of uptime.
Comment 78 Indan Zupancic 2010-04-08 15:00:35 UTC
(In reply to comment #72)
> Thanks for testing. Please update to xf86-video-intel 2.11 (just released
> a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
> for gpu hangs on i8xx hw.  If your gpu still hangs, please attach the
> output of i915_error_state, that's usually sufficient to get a clue about
> what's going on.

Okay, running 2.11 and 2.4.20 now. I'll report if I get any hangs, but that can take a while.

Thank you for all your work.

(As a side note, I still get slightly corrupted text in Firefox sometimes. If that has anything to do with this then the issue isn't totally solved yet.)
Comment 79 legolas558 2010-04-08 18:37:55 UTC
4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash after watching several videos, max retries is still 11. No graphics corruption anywhere. Possibly never experienced a more stable Xorg (neither with UMS).

All i855GM bugs are fixed for me; I am using xf86-video-intel v2.10.0 and an old libdrm compiled from git

The code which prints chipset flush stats can be taken out; please push patch upstream.

Many thanks for all the hard work!
Comment 80 Daniel Vetter 2010-04-08 23:51:54 UTC
> --- Comment #79 from legolas558 <legolas558@email.it> 2010-04-08 18:37:55 PDT ---
> 4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash
> after watching several videos, max retries is still 11. No graphics corruption
> anywhere. Possibly never experienced a more stable Xorg (neither with UMS).

Great! Rémi & René, can you please retest v7 with this fix applied and add
your tested-by line here?
Comment 81 legolas558 2010-04-09 00:53:49 UTC
Created attachment 34836 [details]
i915_gem_object_put_pages crash happening with libdrm 2.4.18

Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a different cause.

Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git (pulled today), however I could only retrieve the syslog message for the older libdrm (with the new one I only got nul bytes printed to syslog)

The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder to trigger with the most recent libdrm, but it is indeed there.
Comment 82 Scott Hansen 2010-04-09 10:04:27 UTC
Created attachment 34849 [details]
Everything.log for i845 freezes

Ok, for this hardware: 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

Running this kernel: 2.6.34-rc2-59809-g22f2d3a-dirty #1 SMP PREEMPT Thu Apr 8 21:41:20 PDT 2010 i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux

from drm-intel-next with Daniel's patch from yesterday, running xorg-server 1.7.6, libdrm-newest 2.4.19, and xf86-video-intel-git 20100408,

it appears that the freezing up is gone -- almost. I have been running for several hours with a combination of glxgears, playing .mp4 and .wmv movies from my hard drive, surfing with chromium and firefox, flash movie playback, and switching virtual terminals between dwm and xfce and have no freezes yet

EXCEPT -- running Tuxpaint under xfce, I got a freeze (see attached log for details -- just a single error message). Also, trying to play a dvd (using mplayer or gnome-mplayer) I got lots of graphics errors and then the same freeze (see log for details.)  Not sure if these freezes are from this bug or an unrelated bug with the 845 chips.

Otherwise, nice work! Thank you :)

Scott
Comment 83 Daniel Vetter 2010-04-09 10:15:55 UTC
> --- Comment #82 from Scott Hansen <scottandchrystie@comcast.net> 2010-04-09 10:04:27 PDT ---
> Otherwise, nice work! Thank you :)

Sorry to disappoint you, but that's just placebo. It looks like you've
only applied the small kernel patch from yesterday. But that's just an
incremental fix, i.e. you need the v7 patch _plus_ this small fix.

Can you please retest? If it still reports hangs, please add
i915_error_state in addition to the dmesg to this bug report.
Comment 84 Scott Hansen 2010-04-09 14:11:38 UTC
Created attachment 34857 [details]
dmesg, i915_error_state and intel_gpu_dump

Agh, sorry! Well, with both the v7 and lock patches, my machine locked up instantly on starting X. I've attached (hopefully) the logs you requested. Let me know if you need more.

Thanks,
Scott
Comment 85 René Gabriëls 2010-04-09 17:24:04 UTC
(In reply to comment #80)
> > --- Comment #79 from legolas558 <legolas558@email.it> 2010-04-08 18:37:55 PDT ---
> > 4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash
> > after watching several videos, max retries is still 11. No graphics corruption
> > anywhere. Possibly never experienced a more stable Xorg (neither with UMS).
> 
> Great! Rémi & Réne, can you please retest v7 with this fix applied and add
> your tested-by line here?

OK the following setup is still working after 45 min of stress-test (3x glxgears, 1x x11perf, 1x youtube):

- kernel 2.6.34-rc3 + v7 patch + mutex patch
- libdrm 2.4.20
- xorg-server 1.7.6
- xf86-video-intel 2.11.0
- mesa 7.8.1

No errors in dmesg whatsoever.  The last entry reads "chipset flush no. 3637248, max retries 3".

There's just 1 thing wrong right now: the GNOME panel refuses to draw text after the stress test, but that's probably a panel issue.
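
For testers who want to report these numbers without scrolling through dmesg by hand, here is a small Python sketch that extracts the highest flush count and retry count from a log. It assumes every stats line looks like the one quoted above ("chipset flush no. N, max retries M"). The exact printk format in the patch may differ, and `summarize_flush_log` is a made-up helper name.

```python
import re

# Assumed line format, taken from the single dmesg line quoted in this comment.
_FLUSH = re.compile(r"chipset flush no\. (\d+), max retries (\d+)")


def summarize_flush_log(dmesg_text):
    """Return (highest flush count, highest max-retries) or None if no line matches."""
    flushes = 0
    retries = 0
    seen = False
    for match in _FLUSH.finditer(dmesg_text):
        seen = True
        flushes = max(flushes, int(match.group(1)))
        retries = max(retries, int(match.group(2)))
    return (flushes, retries) if seen else None


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 flush_stats.py
    print(summarize_flush_log(sys.stdin.read()))
```

The search is not anchored to the start of a line, so dmesg timestamp prefixes do not get in the way.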
Comment 86 Rémi Cardona 2010-04-09 23:49:09 UTC
Like I told Daniel yesterday on IRC, this works brilliantly (v7+locking patch). No more corruption and no more messages/crashes in dmesg.

Thanks again Daniel!

Tested-by: Rémi Cardona <remi@gentoo.org> (HP Pavilion dv1000)

Cheers
Comment 87 arthapex 2010-04-10 01:56:02 UTC
(In reply to comment #86)
> Like I told Daniel yesterday on IRC, this works brilliantly (v7+locking patch).
> No more corruption and no more messages/crashes in dmesg.
> 
> Thanks again Daniel!
> 
> Tested-by: Rémi Cardona <remi@gentoo.org> (HP Pavilion dv1000)
> 
> Cheers
@Rémi: Could you please create an ebuild with the working patchset and put it in the x11-overlay or portage? I would love to test it here on my Acer Travelmate 663
Comment 88 Daniel Vetter 2010-04-11 14:57:12 UTC
> --- Comment #81 from legolas558 <legolas558@email.it> 2010-04-09 00:53:49 PDT ---
> Created an attachment (id=34836)
>  --> (https://bugs.freedesktop.org/attachment.cgi?id=34836)
> i915_gem_object_put_pages crash happening with libdrm 2.4.18
> 
> Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a
> different cause.
> 
> Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git
> (pulled today), however I could only retrieve the syslog message for the older
> libdrm (with the new one I only got nul bytes printed to syslog)
> 
> The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder
> to trigger with the most recent libdrm, but it is indeed there.

I've tried to again reproduce this problem on my box by downgrading to
libdrm 2.4.18 (and a few other hacks). But that bug simply refuses to show
up again, here. Can you and Bruno please gather a few backtraces (as many
as you have lying around in your logs) and upload them to this bug?

Perhaps I can see a pattern and get a clue what's going on - at least that
way I've managed to fix the other problem with the stuck chipset flush.
Comment 89 Bruno 2010-04-12 09:46:47 UTC
Created attachment 34917 [details]
All BUGs I've seen in i915_gem.c since February
Comment 90 René Gabriëls 2010-04-12 09:56:23 UTC
After a number of days testing this patch, I haven't seen any crashes or (render) errors.  Thanks for the hard work Daniel!

Tested-by: René Gabriëls <renegabriels@gmail.com> (NEC Versa P520)
Comment 91 legolas558 2010-04-12 10:13:19 UTC
(In reply to comment #88)
> > --- Comment #81 from legolas558 <legolas558@email.it> 2010-04-09 00:53:49 PDT ---
> > Created an attachment (id=34836) [details]
> >  --> (https://bugs.freedesktop.org/attachment.cgi?id=34836)
> > i915_gem_object_put_pages crash happening with libdrm 2.4.18
> > 
> > Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a
> > different cause.
> > 
> > Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git
> > (pulled today), however I could only retrieve the syslog message for the older
> > libdrm (with the new one I only got nul bytes printed to syslog)
> > 
> > The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder
> > to trigger with the most recent libdrm, but it is indeed there.
> 
> I've tried to again reproduce this problem on my box by downgrading to
> libdrm 2.4.18 (and a few other hacks). But that bug simply refuses to show
> up again, here. Can you and Bruno please gather a few backtraces (as many
> as you have lying around in your logs) and upload them to this bug?
> 
> Perhaps I can see a pattern and get a clue what's going on - at least that
> way I've managed to fix the other problem with the stuck chipset flush.
Yes, I confirm that the chipset flushes are all OK: I never got a GTT failure again.

I can't be sure that the crash with libdrm-2.4.20 is also due to i915_gem_object_put_pages, because no log lines are stored on /var/log/messages (only nul bytes).

I can trigger this total system crash only with a windows application running inside wine, with all other normal linux usage (even 3D) there is no crash.

I'll try catching the i915 debugfs data right after the crash, but I am not sure that the init process is still alive after that.

Unfortunately I don't have other dump files lying around; however, I am sure that this is a new bug on this hardware, i.e. it is not the crash happening when watching videos.
Comment 92 legolas558 2010-04-12 11:03:58 UTC
Created attachment 34922 [details]
dri debugfs snapshots taken every second + generator script used
Comment 93 legolas558 2010-04-12 11:13:50 UTC
Created attachment 34923 [details]
excerpt from dmesg containing crash dump for i915_gem_object_put_pages+0x10b/0x110

A few updates. I have used a script running in background to gather DRI debugfs dumps.

1) the system is not totally hung up, because background scripts still run. Only keyboard/mouse die
2) it is the same crash happening with libdrm-2.4.18 and libdrm-git, so it's not libdrm-dependent
3) the nul bytes were due to write buffers not being flushed before hard shutdown, and with the 'sync' call in the daemon script (available in tbz archive) it correctly puts the crash dump in /var/log/messages
4) it does not only depend on wine but also on some other application, because I had to run wine and firefox to trigger it; anyway it looks deterministic and not totally random
5) the crash must have happened within the last 10 snapshots (i.e. seconds), sorry but I can't be more precise, I hope you can guess where it crashed from the DRI debugfs dumps

If I can be of some help I'd be glad to make other tests/reports; looks like I am able to reproduce this bug at will, so I can actually make manual tests.
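
For others who want to collect the same kind of data, the approach described above (snapshot the DRI debugfs files every second, then sync so the data survives a hard reset) can be sketched in Python roughly as follows. This is not the script from the attachment. The function names, the debugfs path `/sys/kernel/debug/dri/0` and the output directory are assumptions to adjust for your system.

```python
import os
import time
from pathlib import Path


def snapshot_dir(src, dst_root, index):
    """Copy every readable regular file in src into dst_root/<index>/."""
    dst = Path(dst_root) / f"{index:06d}"
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(Path(src).iterdir()):
        if not entry.is_file():
            continue
        try:
            data = entry.read_bytes()
        except OSError:
            continue  # some debugfs entries may refuse reads; skip them
        (dst / entry.name).write_bytes(data)
        copied.append(entry.name)
    os.sync()  # push the snapshot to disk before the machine can lock up
    return copied


def run(src="/sys/kernel/debug/dri/0", dst_root="/var/tmp/dri-snapshots",
        interval=1.0, count=None):
    """Take one snapshot per interval; count=None loops until interrupted."""
    index = 0
    while count is None or index < count:
        snapshot_dir(src, dst_root, index)
        index += 1
        time.sleep(interval)


if __name__ == "__main__":
    run()  # needs root and a mounted debugfs
```

The `os.sync()` after each snapshot is the important part: without it the last files may contain only nul bytes after a hard shutdown, as in point 3 above.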
Comment 94 Daniel Vetter 2010-04-13 03:42:23 UTC
> --- Comment #91 from legolas558 <legolas558@email.it> 2010-04-12 10:13:19 PDT ---
> Unfortunately I don't have other dump files lying around, however I am sure
> that this is a new bug on this hardware e.g. it is not the crash happening when
> watching videos.

Concerning your overlay problem: Can you open a new bug report for that
and put me on the cc: (I'm the overlay guy)? I have another report from an
i965G hanging when using the overlay, perhaps there's some pattern. Please
add the usual amount of information so that I (or anyone else) doesn't
have to hunt around in various bug reports. Thanks.
Comment 95 legolas558 2010-04-13 03:49:52 UTC
(In reply to comment #94)
> > --- Comment #91 from legolas558 <legolas558@email.it> 2010-04-12 10:13:19 PDT ---
> > Unfortunately I don't have other dump files lying around, however I am sure
> > that this is a new bug on this hardware e.g. it is not the crash happening when
> > watching videos.
> 
> Concerning your overlay problem: Can you open a new bug report for that
> and put me on the cc: (I'm the overlay guy)? A have another report from a
> i965G hanging when using the overlay, perhaps there's some pattern. Please
> add the usual amount of information so that I (or anyone else) doesn't
> have to hunt around in various bug reports. Thanks.

Sorry, I ought to have used the past tense: the crash *that was* happening when watching videos. I am no longer experiencing the overlay bug when watching videos with v7+locking patch.

If you want I can go back to an older patch which still exhibits the crash with videos and make the report as I did for the i915_gem_object_put_pages issue.

I am confident that it is fixed now because I am no longer seeing a psychedelic fuchsia/rainbow overlay fill, which was interleaved in frames from time to time and preceded the final crash by some minutes.

The only bug remaining for me is i915_gem_object_put_pages
Comment 96 Indan Zupancic 2010-04-13 14:14:05 UTC
(In reply to comment #72)
> > --- Comment #71 from Indan Zupancic <indan@nul.nu> 2010-04-07 16:37:09 PDT ---
> > If there's anything I can do to help, please ask.
> 
> Thanks for testing. Please update to xf86-video-intel 2.11 (just released
> a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
> for gpu hangs on i8xx hw.  If your gpu still hangs, please attach the
> output of i915_error_state, that's usually sufficient to get a clue about
> what's going on.

Okay, I have been running this combination for five days now without any hangs, it's looking pretty stable, I think my problems are fixed now.

Unpatched kernel 2.6.34-rc3
xf86-video-intel 2.11.0
libdrm 2.4.20
xorg-server 1.7.5.902
intel-dri 7.7.1

Do you want me to test your v7 patch + locking fixes to make sure it causes no regressions? It appeared to make things worse for Scott, and v7 on its own didn't work for me before either. Or are the good bits already upstream?

I think Scott hit the same bug as I did, so it might be fixed for him too now with the new libdrm. (No idea what actually caused my problems, nor what fixed it. Was it the EINTR versus EAGAIN bugfix? Or all the intel driver fixes?)

Thanks,

Indan
Comment 97 Daniel Vetter 2010-04-13 14:47:51 UTC
> --- Comment #96 from Indan Zupancic <indan@nul.nu> 2010-04-13 14:14:05 PDT ---
> Do you want me to test your v7 patch + locking fixes to make sure it causes no
> regressions? It appeared to make things worse for Scott, and v7 on its own
> didn't work for me before either. Or are the good bits already upstream?

Nope, nothing upstream yet (but the first patch series should hit
drm-intel-next in a few days). It looks like X40s are not really affected
by these gtt inconsistencies in day-to-day use. But the problem exists
there, too. So yes, please test v7+locking fix and beat it up for a few
days. If it works and doesn't report any failed chipset flushes, please
add your tested-by line, too.
Comment 98 Christian Beier 2010-04-14 03:27:27 UTC
(In reply to comment #76)
> Created an attachment (id=34824) [details]
> fix locking around chipset flushing
> 
> legolas, Rémi, René, this patch should fixed the problems you've encountered
> with timed-out chipset flushes. It was a bug in my code. Please test
> extensively.

Hi,
The v7 + locking patches work fine here on a JVC MP-XP731 (with an Intel 855GM rev 02), more than 3M flushes and no hangs. Before (normal debian squeeze stack) X would crash shortly after login.

However, it feels slightly more sluggish than with the old intel 2.3 driver and Xorg 7.3, but that's probably because the patch does some extra debug checking?

My working setup right now is:

Linux 2.6.34-rc2 from drm-intel-next + v7 + locking patch
libdrm 2.4.18
intel driver 2.9.1
Xserver 1.7.6
Mesa 7.7.1-devel

So the only exchanged component is the kernel. I've made a package which is available for others to test here: http://www2.informatik.hu-berlin.de/~beier/tmp/linux-image-2.6.34-rc2_2.6.34-rc2-10.00.Custom_i386.deb


Thumbs up for the hard work!

Christian
Comment 99 Christian Beier 2010-04-14 03:37:32 UTC
Created attachment 34996 [details]
dmesg output after suspend to disk
Comment 100 Christian Beier 2010-04-14 03:39:37 UTC
Comment on attachment 34996 [details]
dmesg output after suspend to disk

Oops, just after thinking everything's fine. X got stuck shortly after a wakeup from suspend to disk. dmesg output attached. Dunno if this is related at all...

Cheers,
   Christian
Comment 101 Daniel Vetter 2010-04-14 03:47:53 UTC
Thanks a lot for testing (this goes to everyone, not just Christian)!

> --- Comment #98 from Christian Beier <beier@informatik.hu-berlin.de> 2010-04-14 03:27:27 PDT ---
> However, it feels slightly more sluggish than with the old intel 2.3 driver and
> Xorg 7.3, but that's probably because the patch does some extra debug checking?

Yep, that's to be expected. My patch currently completely trashes the gtt
(to really exercise the chipset flush - no way to get to a few million flushes
within just a few hours of testing without this). But that also kills
performance. Final version should be about on par with older drivers.

btw, is the following tested-by line correct?

Tested-by: Christian Beier <beier@informatik.hu-berlin.de> (JVC MP-XP731)
Comment 102 Christian Beier 2010-04-14 03:50:29 UTC
> btw, is the following tested-by line correct?
> 
> Tested-by: Christian Beier <beier@informatik.hu-berlin.de> (JVC MP-XP731)

Oops, forgot that. Yeah, correct!
Comment 103 Daniel Vetter 2010-04-14 04:56:12 UTC
> --- Comment #100 from Christian Beier <beier@informatik.hu-berlin.de> 2010-04-14 03:39:37 PDT ---
> (From update of attachment 34996 [details])
> Oops, just after thinking evrything's fine. X got stuck shortly after a wakeup
> from suspend to disk. dmesg output attached. Dunno if this is related at all...

Great, everyone is stuck on the dev->struct_mutex lock. Sigh. One more
hint that the locking is fishy. Can you please enable lockdep (Kernel
hacking -> Lock debugging: prove locking correctness) in your kernel
config and retest? Lockdep should shed some light on what the heck is
going on here.
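
The menu entry named above corresponds to the CONFIG_PROVE_LOCKING option. As a quick check that a rebuilt kernel really has it enabled, a small Python sketch like the following can be run over the kernel's .config text. The helper name `config_enabled` is made up for this example, and where the config file lives (build tree, /boot, ...) depends on the distribution.

```python
def config_enabled(config_text, option):
    """Return True if the kernel .config text sets option to 'y'."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    import sys

    # Usage: python3 check_lockdep.py /path/to/.config
    with open(sys.argv[1]) as handle:
        text = handle.read()
    state = "enabled" if config_enabled(text, "CONFIG_PROVE_LOCKING") else "NOT enabled"
    print(f"CONFIG_PROVE_LOCKING is {state}")
```

A disabled option appears in .config as a comment ("# CONFIG_PROVE_LOCKING is not set"), which this check treats as not enabled.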
Comment 104 Christian Beier 2010-04-14 15:52:09 UTC
Created attachment 35042 [details]
dmesg v7 + locking patch, lockdep enabled

This one's different from my last dmesg, no more hung tasks, rather looks like the i915_gem_object_put_pages() bug the others experienced.

HTH anyway,
Christian
Comment 105 legolas558 2010-04-14 16:56:43 UTC
I have enabled lockdep debugging and I also have an early BUG dump "BUG: key dd9e5288 not in .data!" like Christian, so that can be ignored.

Fixing this last bug has become very important because with updates of last week the Xorg v1.6 (and related packages and old intel driver) is badly crashing, so it can no more be used.

I am now using Xorg 1.7.6 with the VESA driver, and that is rock-solid
Comment 106 Daniel Vetter 2010-04-15 03:19:00 UTC
Good news about the put_pages BUG_ON: There's another bug report
indicating that this is not a problem in my patch but also exists in the
mainline kernel:

https://bugzilla.kernel.org/show_bug.cgi?id=15664

My patch (especially the hack to stress test the gtt) just makes it more
likely.

Bad news: I still have no clue what's going on.

To all those who are hitting this problem: What mesa release are you using
and are you using a compositing window manager that uses OpenGL? I have
an idea ...
Comment 107 legolas558 2010-04-15 03:41:55 UTC
(In reply to comment #106)
> To all those who are hitting this problem: What mesa release are you using
> and are you using a compositioning window manager that uses OpenGL? I have
> an idea ...

Using XFCE with mesa/libgl 7.7.1, no compositing at all. If you want I can enable Option  "Composite"     "Disable" in xorg.conf
Comment 108 Daniel Vetter 2010-04-15 04:33:15 UTC
> --- Comment #107 from legolas558 <legolas558@email.it> 2010-04-15 03:41:55 PDT ---
> (In reply to comment #106)
> > To all those who are hitting this problem: What mesa release are you using
> > and are you using a compositioning window manager that uses OpenGL? I have
> > an idea ...
> 
> Using XFCE with mesa/libgl 7.7.1, no compositing at all. If you want I can
> enable Option  "Composite"     "Disable" in xorg.conf

Arrgh, whatever, my theory just went bust.
Comment 109 Christian Beier 2010-04-15 05:06:55 UTC
(In reply to comment #106)
> To all those who are hitting this problem: What mesa release are you using
> and are you using a compositioning window manager that uses OpenGL? I have
> an idea ...

Mesa 7.7.1 and running compiz 0.8.4...
Comment 110 Daniel Vetter 2010-04-15 08:07:20 UTC
Created attachment 35065 [details] [review]
only call put_pages when gtt_space != NULL

Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone who's hitting this problem, please apply this patch on top of whatever kernel most easily reproduces the problem.
Comment 111 legolas558 2010-04-15 09:37:43 UTC
(In reply to comment #110)
> Created an attachment (id=35065) [details]
> only call put_pages when gtt_space != NULL
> 
> Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone
> who's hitting this problem, please apply this patch on top of whatever kernel
> most easily reproduces the problem.

It seems much more stable now.

gtt failures:
0 /  114688
max retries:
66.696711	0
203.340132	5
284.831640	6
570.264962	7

I also tested it through Murphy's law by trying to do something important with the wine application: no hangups up to now.

Let's see what happens in the next couple of days. For now I'd say FIXED.
Comment 112 Christian Beier 2010-04-15 15:19:40 UTC
Created attachment 35073 [details]
dmesg output snippet with i915_gem_tiling.c warning

With all three patches (the v7, locking and gtt_space!=NULL) atop a drm-intel-next kernel it seems to run stable. Did not run into any crashes (by now...). However, while playing around with RandR rotation, I got the attached warning. X continues running.

Again, I don't know if this is in any way related, but maybe it helps...
Comment 113 Daniel Vetter 2010-04-16 00:11:29 UTC
> --- Comment #112 from Christian Beier <beier@informatik.hu-berlin.de> 2010-04-15 15:19:40 PDT ---
> With all three patches (the v7, locking and gtt_space!=NULL) atop a
> drm-intel-next kernel it seems to run stable. Did not run into any crashes (by
> now...). However, while playing around with RandR rotation, i got the attached
> warning. X continues running. 

Looks like the ddx is not properly disabling bo reuse on the framebuffer.
Please retest with the latest version of xf86-video-intel and libdrm. If
the problem persists, please open a new bug report, this is definitely
something else.
Comment 114 Daniel Vetter 2010-04-16 05:23:07 UTC
Created attachment 35087 [details] [review]
new patch against current drm-intel-next

I've beaten the patch into shape and killed all the debug hacks. Patch is against current drm-intel-next. But portions of it are already submitted upstream, so it might no longer apply in a few days. I'll try to rebase asap when that happens.

Performance should be about the same as old ums code or unpatched kernel - but stable ;)

Everyone who's still testing these patches and has not yet reported their tested-by line, please do so now. I intend to submit this sometime next week.

Please test this patch thoroughly.
Comment 115 legolas558 2010-04-16 10:33:05 UTC
(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next
> 
> Performance should be about the same as old ums code or unpatched kernel - but
> stable ;)
> 
General performance has indeed increased, however I can clearly notice "hiccups" during major load on the GPU, like when starting mozilla apps or wine apps; this was not noticeable with the previous v7+locking patchset.

As you said it is like old UMS code, and actually glxgears FPS is comparable (if not better); so the only minor issue is the hiccups that I am experiencing. These hiccups also hang the mouse for some seconds, so I suppose it is some locking on the GPU pipe, but I can't read the changes in your v8 patch, so these are just suppositions.

Anyway the patch is perfectly mature for upstream in my opinion, and is indeed the best one we have up to now. Please grab my Tested-By line from previous comments.
Comment 116 Christian Beier 2010-04-18 04:01:13 UTC
(In reply to comment #114)
> Performance should be about the same as old ums code or unpatched kernel - but
> stable ;)

Yeah, with the v8 patch everything is more or less on par with the old ums code performance-wise. Seems to be stable as well, running with compiz for a few days, no crashes or hiccups.

Thumbs up! 
Christian
Comment 117 Stefan Glasenhardt 2010-04-18 15:32:41 UTC
(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next

Hi Daniel,

Is it possible to merge the changes into a clean patch which applies to kernel version 2.6.32 and 2.6.33 (.32 preferred)?

I have spent two hours fiddling around to get your patch cleanly applied to the latest lucid kernel. It compiles, but the patch changes so many things (splitting of files, etc.) that I haven't the slightest idea what you have really changed and what the patch really does.

Can you please post the code snippets which were introduced by you to solve the crashes?

Greetings Stefan
Comment 118 arthapex 2010-04-18 23:12:56 UTC
(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next
> ...
> Please test this patch thoroughly.
I've tested it on my Travelmate 660 since Friday evening, and it worked wonderfully. Here are my specs:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)

I've started xcompmgr on top of openbox. Every night four glxgears were running. In daily use, firefox (with a lot of flash videos), a high-resolution trailer of IronMan2 in mplayer, eclipse and a transparent console were running, without a freeze.
I would say it works! Thanks!

Tested-By: Arthur Spitzer <arthapex@gmail.com> (Acer Travelmate 660)
Comment 119 Geir Ove Myhr 2010-04-19 00:14:39 UTC
(In reply to comment #117)
> Is it possible to merge the changes into a clean patch which applies to kernel
> version 2.6.32 and 2.6.33 (.32 preferred).
> Since two hours fiddling around to get your patch cleanly applied to the latest
> lucid kernel. 

Stefan, for the Lucid kernel we would want a patch against the 2.6.33 kernel, since Ubuntu (and some other distros) use 2.6.32 kernels with drm from 2.6.33.
Comment 120 legolas558 2010-04-19 02:11:12 UTC
(In reply to comment #115)
> (In reply to comment #114)
> > Created an attachment (id=35087) [details] [details]
> > new patch against current drm-intel-next
> > 
> > Performance should be about the same as old ums code or unpatched kernel - but
> > stable ;)
> > 
> General performance has indeed increased, however I can clearly notice
> "hickups" during major load on the GPU. Like when starting mozilla apps or wine
> apps; this was not noticeable with previous v7+locking patchset.
> 
> As you said it is like old UMS code, and actually glxgears FPS is comparable
> (if not better); so the only minor issue is the hickups that I am experiencing.
> These hickups also hang the mouse for some seconds, so I suppose it is some
> locking on the GPU pipe, but I can't read the changes in your v8 patch so just
> suppositions.
> 
Daniel, the patch fixed every bug; I almost forgot that I was running a testing stack. Regarding the hiccups: they are most probably perfectly normal and were concealed in previous versions of the patch because the overall performance was slower, so the hiccups could not be "felt".

Please add my Tested-by line, it's ready for me.
Comment 121 Daniel Vetter 2010-04-19 03:25:36 UTC
> --- Comment #117 from Stefan Glasenhardt <stefan@glasen-hardt.de> 2010-04-18 15:32:41 PDT ---
> Can you please post the code snippets which where introduced by you to solve
> the crashes?

As a preview, my local topic branch is available at

http://cgit.freedesktop.org/~danvet/drm/log/?h=stuff/i8xx_cache_coherency_for_oga

The relevant patches start after "Enable distcc". On my further merge
plans: As already said, I hope to send the last patch pile (containing the
real fix) for review in a few days, pending merging of the previous
submissions. If it survives review intact I'll backport just the fix for
.34 and earlier kernels.

So taking testing/release delays on each stage (-next, .34, -stable) into
account, expect a few weeks before this hits a stable kernel near you,
best-case scenario.
Comment 122 Indan Zupancic 2010-04-19 05:36:35 UTC
Created attachment 35159 [details]
dmesg of a failed chipset flush warning

Since the libdrm and intel driver updates my system seems to be rock solid.
That said, I tried your patches to see if it made any change. v7 had horrible
performance, as expected, but v8 is considerably slower than unpatched too. E.g.
"time dmesg" in rxvt takes a lot longer (2x or more) and it all feels slightly
sluggish, not snappy as it is when unpatched. On the upside, it seems the text corruption is fixed by v8, though I'm not totally sure.

On the downside, just when I thought everything was fine, I got the following warning (for the first time):

WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
Hardware name: 2371GHG
i8xx chipset flush failed, expected: 827, cpu_read: 315

So it seems we're not there yet.
Comment 123 Daniel Vetter 2010-04-19 06:08:58 UTC
> --- Comment #122 from Indan Zupancic <indan@nul.nu> 2010-04-19 05:36:35 PDT ---
> Since the libdrm and intel driver updates my system seems to be rock solid.
> That said, I tried your patches to see if it made any change. v7 had horrible
> performance, as expected, but v8 is considerably slower than unpatched too.
> E.g.
> "time dmesg" in rxvt takes a lot longer (2x or more) and it all feels slightly
> sluggish, not snappy as it is when unpatched. On the upside, it seems the text
> corruption is fixed by v8, though I'm not totally sure.

This is just with the kernel changed, right? Because 2.11 has taken a
rather severe hit against 2.10 for i8xx chipsets on some workloads (I'm
working on fixing it).

> On the downside, just when I thought everything was fine, I got the following
> warning (for the first time):
> 
> WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
> Hardware name: 2371GHG
> i8xx chipset flush failed, expected: 827, cpu_read: 315
> 
> So it seems we're not there yet.

Depends. It's definitely just a failed chipset flush (I've checked the
offset). But given enough time and testers, this is somewhat expected
because this patch just implements a probabilistic chipset flush. Tallying
all the chipset flushes of all testers easily gives on the order of 100
million successful ones. Now if yours is the only one that failed, that's
not a problem. Please keep an eye on this and report any recurrences - some
more tuning might be called for (perhaps even a module parameter).
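
To make "probabilistic" concrete, here is a minimal sketch in plain userspace C of a flush that is retried until a readback check passes or a retry budget runs out. The names flush_with_retries, flush_attempt_fn and succeeds_after are hypothetical and are not taken from the patch; the real code in intel-gtt.c may well be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: perform one flush attempt and read the check
 * value back; returns true if the expected value became visible. */
typedef bool (*flush_attempt_fn)(void *ctx);

/* Retry until an attempt succeeds or the budget is exhausted.
 * Returns the number of retries that were needed (0 means the first
 * attempt worked), or -1 if every attempt failed. The -1 case is the
 * one a "chipset flush failed" warning would report. */
static inline int flush_with_retries(flush_attempt_fn attempt, void *ctx,
                                     int max_retries)
{
	assert(attempt != NULL && max_retries >= 0);
	for (int tries = 0; tries <= max_retries; tries++) {
		if (attempt(ctx))
			return tries;
	}
	return -1;
}

/* Stand-in for the hardware: fails *remaining times, then succeeds. */
static inline bool succeeds_after(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

With a retry budget of 7, a simulated flush that needs 3 retries succeeds, while one that would need 10 is reported as failed. A larger budget lowers the failure rate but raises the worst-case time per flush, which is the tuning trade-off mentioned above.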

Also please report if the glyph corruptions show up again.
Comment 124 Indan Zupancic 2010-04-19 11:26:15 UTC
(In reply to comment #123)
> This is just with the kernel changed, right? Because 2.11 has taken a
> rather severe hit against 2.10 for i8xx chipsets on some workloads (I'm
> working on fixing it).

Yes, all userspace is unchanged since I switched to 2.11 and newer libdrm.

I didn't notice any regressions with 2.11 compared to 2.10 though, but I'm
only using 2D with xcompmgr -a running.

> Depends. It's definitely just a failed chipset flush (I've checked the
> offset). But given enough time and testers, this is somewhat expected
> because this patch just implements a probabilistic chipset flush. Tallying
> all the chipset flushes of all testers easily gives on the order of 100mm
> successful ones. Now if yours is the only one that failed, that's not a
> problem. Please keep an eye on this and report any reoccurences - some
> more tuning might be called for (perhaps even a module parameter).

Well, it's curious I never got it with the v7 patch, while I ran that one for days.

A module parameter to dis/enable this canary stuff would be good, it just seems to slow things down for me without improving anything.

I wonder if it's in any way significant that the difference between the expected 827 and cpu_read 315 is precisely 512... Did anyone try to increase I830_MCH_WRITE_BUFFER_SIZE to something bigger?

Looking at the commit, especially the description, it seems like there's no way to do proper chipset flushes. Maybe hunt down and confront an Intel developer? Or avoid the need to do flushes, but that's probably unrealistic. On the other hand, if you can't really flush, you can't really depend on it either.

> Also please report if the glyph corruptions show up again.

I will.

Okay, while writing this I got a second warning:

i8xx chipset flush failed, expected: 4043, cpu_read: 3531

The difference is again exactly 512.

Maybe the chipset flushing is fine, but there's a different bug making it seem to fail?
Comment 125 Bruno 2010-04-19 11:34:57 UTC
Created attachment 35167 [details]
failed flush with

Yesterday I had a failed flush as well, the only one since I applied the patch in attachment #35087 [details] [review].
Currently running:
  x11-base/xorg-server-1.7.6
  x11-libs/libdrm-2.4.20
  media-libs/mesa-7.8.1
  xf86-video-intel at commit 80f52482c7cde000a76b91fe3d8b6c16baf2141f
                             XvMC: fix memory overflow
                             8 April 2010, by Daniel Vetter
Comment 126 Geir Ove Myhr 2010-04-19 11:49:25 UTC
(In reply to comment #125)
> Created an attachment (id=35167) [details]
[30125.064301] i8xx chipset flush failed, expected: 648447, cpu_read: 647935
The difference is again 512. Suggests that it is usually/always bit 9 (counting from 0) that comes out wrong(?)
Comment 127 Daniel Vetter 2010-04-19 12:04:41 UTC
> --- Comment #124 from Indan Zupancic <indan@nul.nu> 2010-04-19 11:26:15 PDT ---
> > Depends. It's definitely just a failed chipset flush (I've checked the
> > offset). But given enough time and testers, this is somewhat expected
> > because this patch just implements a probabilistic chipset flush. Tallying
> > all the chipset flushes of all testers easily gives on the order of 100mm
> > successful ones. Now if yours is the only one that failed, that's not a
> > problem. Please keep an eye on this and report any reoccurences - some
> > more tuning might be called for (perhaps even a module parameter).
> 
> Well, it's curious I never got it with the v7 patch, while I ran that one for
> days.
> 
> A module parameter to dis/enable this canary stuff would be good, it just seems
> to slow things down for me without improving anything.
> 
> I wonder if it's in any way significant that the difference between the
> expected 827 and cpu_read 315 is precisely 512... Did anyone try to increase
> I830_MCH_WRITE_BUFFER_SIZE to something bigger?

The fact that it's 512 shows that the problem is a failed cacheflush and
nothing else (this is actually what I've checked). The chipset flush
checker changes the place it writes the check value every chipset flush.
And it reuses the same place every 512th chipset flush. So when the
chipset flush fails, the old value is there, which should be exactly 512
less than what's expected.
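
The slot reuse described above can be sketched in a few lines of userspace C. This is only an illustration under the assumption of 512 check slots used round-robin; CANARY_SLOTS, canary_slot and is_stale_canary are made-up names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 512 check slots, one per flush, reused round-robin. */
#define CANARY_SLOTS 512u

/* Slot that flush number flush_no writes its check value into. */
static inline uint32_t canary_slot(uint32_t flush_no)
{
	return flush_no % CANARY_SLOTS;
}

/* True if cpu_read is what the previous user of the same slot wrote,
 * i.e. the newest write never became visible to the reader. */
static inline bool is_stale_canary(uint32_t expected, uint32_t cpu_read)
{
	assert(expected >= CANARY_SLOTS); /* slot must have been used before */
	return cpu_read == expected - CANARY_SLOTS;
}
```

For the values reported so far (expected 827 / read 315, 4043 / 3531, 648447 / 647935), both numbers of each pair map to the same slot and the readback equals the stale value.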

> Looking at the commit, especially the description, it seems like there's no way
> to do proper chipset flushes. Maybe hunt down and confront an Intel developer?
> Or avoid the need to do flushes, but that's probably unrealistic. On the other
> hand, if you can't really flush, you can't really depend on it either.

Well, there is _no_ way to do a reliable flush. And the hw docs explicitly
say so.  But we need to move stuff in/out of the graphics mem (i.e. the
gtt). The other option would be to copy stuff in/out, which is even worse:
- Wastes memory (actually simply uses twice as much for everything).
- Would be even slower than what my hack currently does.

And to add insult to injury, some of the chipsets from the 2nd gen (i8xx)
suffer from other cache coherency problems in addition to this.

> > Also please report if the glyph corruptions show up again.
> 
> I will.
> 
> Okay, while writing this I got a second warning:
> 
> i8xx chipset flush failed, expected: 4043, cpu_read: 3531

Ok, that's bad. Can you change the following define in
include/drm/intel-gtt.h and see whether you still get failed chipset
flushes?

-#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
+#define I830_CC_CANARY_FLOCK_GTT_PAGES 16

The whole stuff makes somewhat more sense this way around, anyway.

Oh, and add some details about your box, please (brand&model + cpu,
mostly, the rest is all in the dmesg, anyway).
Comment 128 Tony White 2010-04-19 13:04:24 UTC
OK Guys, I've tried the :
fix-i8xx-gtt-cache-coherency-v7.patch
locking_for_chipset_flush.patch
&
gtt_space_null_means_no_pages_ref.patch
patches against 2.6.34-rc3 for about a week now.

I have :
Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00 [VGA controller])

libdrm 2.4.18
xorg-x11-drv-intel-2.9.1
xorg-x11-server-Xorg-1.7.6

The patch seems quite stable, although I just experienced a (Non fatal) Crash.
So nice job guys, it's much better.

The crash I just got looks like :

Apr 19 20:28:48 m3n kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Apr 19 20:28:48 m3n kernel: render error detected, EIR: 0x00000000
Apr 19 20:28:51 m3n kdm[1368]: X server for display :0 terminated unexpectedly

It is the first time it has crashed using the patches. The screen just went black and input was totally lost, but I was able to safely shut down the machine by pressing the power button once. I am also seeing lots of stuff like :

Apr 19 16:35:59 m3n kernel: chipset flush no. 2752512, max retries 7
Apr 19 16:42:43 m3n kernel: chipset flush no. 2768896, max retries 7
Apr 19 16:45:03 m3n kernel: chipset flush no. 2785280, max retries 7
Apr 19 16:45:58 m3n kernel: chipset flush no. 2801664, max retries 7
Apr 19 16:47:54 m3n kernel: chipset flush no. 2818048, max retries 7
Apr 19 16:49:08 m3n kernel: chipset flush no. 2834432, max retries 7
Apr 19 16:50:26 m3n kernel: chipset flush no. 2850816, max retries 7
Apr 19 16:52:57 m3n kernel: chipset flush no. 2867200, max retries 7
Apr 19 16:56:29 m3n kernel: chipset flush no. 2883584, max retries 7
Apr 19 16:57:46 m3n kernel: chipset flush no. 2899968, max retries 7
Apr 19 16:58:27 m3n kernel: chipset flush no. 2916352, max retries 7
Apr 19 17:02:14 m3n kernel: chipset flush no. 2932736, max retries 7
Apr 19 17:05:27 m3n kernel: chipset flush no. 2949120, max retries 7
Apr 19 17:10:07 m3n kernel: chipset flush no. 2965504, max retries 7
Apr 19 17:12:00 m3n kernel: chipset flush no. 2981888, max retries 7
Apr 19 17:15:20 m3n kernel: chipset flush no. 2998272, max retries 7
Apr 19 17:17:07 m3n kernel: chipset flush no. 3014656, max retries 7
Apr 19 17:17:54 m3n kernel: chipset flush no. 3031040, max retries 7
Apr 19 17:18:57 m3n kernel: chipset flush no. 3047424, max retries 7
Apr 19 17:26:08 m3n kernel: chipset flush no. 3063808, max retries 7
Apr 19 17:26:37 m3n kernel: chipset flush no. 3080192, max retries 7
Apr 19 17:28:20 m3n kernel: chipset flush no. 3096576, max retries 7
Apr 19 17:28:55 m3n kernel: chipset flush no. 3112960, max retries 7
Apr 19 17:29:58 m3n kernel: chipset flush no. 3129344, max retries 7
Apr 19 17:30:20 m3n kernel: chipset flush no. 3145728, max retries 7
Apr 19 17:30:44 m3n kernel: chipset flush no. 3162112, max retries 7
Apr 19 17:31:26 m3n kernel: chipset flush no. 3178496, max retries 7
Apr 19 17:31:54 m3n kernel: chipset flush no. 3194880, max retries 7
Apr 19 17:35:13 m3n kernel: chipset flush no. 3211264, max retries 7
Apr 19 17:35:34 m3n kernel: chipset flush no. 3227648, max retries 7
Apr 19 17:37:11 m3n kernel: chipset flush no. 3244032, max retries 7
Apr 19 17:38:20 m3n kernel: chipset flush no. 3260416, max retries 7
Apr 19 17:38:48 m3n kernel: chipset flush no. 3276800, max retries 7
Apr 19 17:40:20 m3n kernel: chipset flush no. 3293184, max retries 7

In the message log.
Flash video in full screen also shows a noticeable degradation in performance compared to the user mode driver.

So, I've not had the kernel lock up, just the xserver die. The patches are an improvement to 2.6.34-rc3 but it's still a regression in comparison to the user mode i915 driver found in previous versions.
When they push xserver 1.8 and the newer xorg-intel driver into Fedora's rawhide, I'll grab it and test it out.

Thanks guys
Comment 129 Daniel Vetter 2010-04-19 13:21:26 UTC
> --- Comment #128 from Tony White <tonywhite100@googlemail.com> 2010-04-19 13:04:24 PDT ---
> OK Guys, I've tried the :
> fix-i8xx-gtt-cache-coherency-v7.patch
> locking_for_chipset_flush.patch
> &
> gtt_space_null_means_no_pages_ref.patch
> patches against 2.6.34-rc3 for about a week now.
> 
> I have :
> Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00
> [VGA controller])
> 
> libdrm 2.4.18
> xorg-x11-drv-intel-2.9.1
> xorg-x11-server-Xorg-1.7.6

Thanks a lot for testing. Unfortunately the versions you're using are
known-broken. Please retest with the latest&greatest (currently libdrm
2.4.20 and xf86-video-intel 2.11). Also, when the gpu hangs (as indicated
by the hangcheck timer elapsed error in the dmesg) always grab an
i915_error_state (from the dri directory of the debugfs filesystem). That
file contains a dump of the gpu state when it died with all the
necessary info to debug such a hang (the dmesg only tells that the gpu
died, but misses all the other needed info).
Comment 130 Geir Ove Myhr 2010-04-19 13:53:18 UTC
(In reply to comment #129)
> > --- Comment #128 from Tony White 2010-04-19 
> > libdrm 2.4.18
> > xorg-x11-drv-intel-2.9.1
> Thanks alot for testing. Unfortunately the versions you're using are
> known-broken. Please retest with the latest&greatest (currently libdrm
> 2.4.20 and xf86-video-intel 2.11). 

Daniel, would you be able to give a list of commits that fix this kind of bug? Ubuntu is currently frozen and has those versions, but it would be nice to have a list of candidate patches for updates. We already have 
[0c47195ca805881e3fbd5b9224be5c930feeeb8c]  i830: Clip solid fills to surface
Comment 131 Daniel Vetter 2010-04-19 14:12:30 UTC
> --- Comment #130 from Geir Ove Myhr <gomyhr@gmail.com> 2010-04-19 13:53:18 PDT ---
> Daniel, would you be able to give a list of commits that fix this kind of bugs?
> Ubuntu is currently frozen and has those versions, but it would be nice to have
> a list of candidate patches for updates. We already have 
> [0c47195ca805881e3fbd5b9224be5c930feeeb8c]  i830: Clip solid fills to surface

For a definite answer, please ask Chris, but a quick scan shows the
following commit since libdrm 2.4.18 as an important fix
 - a4041e096ce0faea3dd39b4d78014d45a8cacec0 (intel: Repeat execbuffer if
   interrupted by signal)
Comment 132 Indan Zupancic 2010-04-19 15:26:36 UTC
(In reply to comment #127)
> The fact that it's 512 shows that the problem is a failed cacheflush and
> nothing else (this is actually what I've checked). The chipset flush
> checker changes the place it writes the check value every chipset flush.
> And it reuses the same place every 512th chipset flush. So when the
> chipset flush failes, the old value is there, which should be exactly 512
> less than what's expected.

Yeah, I figured it would be that, reading through your old comments.

By the way, I think I got those failed flushes without xcompmgr running.
(I killed it to see if there was any difference.) That might explain why I 
didn't see failed flushes before, xcompmgr is more or less always running.

My case might be related to suspend, because both failures happened within 
a minute or so from resume.

I wish I knew a way to trigger it easily, now it takes days to test anything.

> Well, there is _no_ way to do a reliable flush. And the hw docs explicitly
> says so.  But we need to move stuff in/out of the graphics mem (i.e. the
> gtt). The other option would be to copy stuff in/out, which is even worse:
> - Wastes memory (actually simply uses twice as much for everything).
> - Would be even slower than what my hack currently does.
> 
> And to add insult to injury, some of the chipsets from the 2nd gen (i8xx)
> suffer from other cache coherency problems in addition to this.

What I don't understand is why your patch slows things down so much for me,
it seems to do only a few thousand flushes anyway.

I guess copying around is what the old drivers did?

Some random ideas:

- Increase I830_MCH_WRITE_BUFFER_SIZE?

- Instead of writing zeroes, actually change the content of the flush page.
  Flushing caches doesn't seem to do much if the new content is the same as
  the old one?

- The text you quoted in one of your commit messages said that the memory 
  content isn't coherent, but it didn't say anything about the mapping itself.
  Can't you update the gtt mapping to effectively flush it? I mean, if you 
  move pages out of the gtt and back in, shouldn't that flush the old content?
  Maybe move it to a different index, e.g. insert new mapping to the start
  instead of the end, in case the hw caches it by address+index. Similar to
  Chris Wilson's gtt disabling thing, but instead of disabling, altering it
  in a smart, flush causing way.

If the problem is that the flush is needed to prevent the hardware from writing
stale data to old gtt mapped physical memory:

- If an entry is added, there should be no need for a flush, because all the
  memory is still valid. If an entry is removed, the gpu can continue to write
  to those pages. What about copying the content to a new physical page and 
  keeping the original page for a while until the gpu is done with it?

(I don't know what I'm talking about, just trying to inspire you to come up
with some genius plan to solve all problems. :-)

> Ok, that's bad. Can you change the following define in
> include/drm/intel-gtt.h and see whether you still get failed chipset
> flushes?
> 
> -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
> 
> The whole stuff make somewhat more sense this way around, anyway.

I will try this later, first I'm going to try without your latest commit
("fix i85x gtt chipset flush") to see how it behaves without that stuff,
both performance and amount of failed flushes.

> Oh, and add some details about your box, please (brand&model + cpu,
> mostly, the rest is all in the dmesg, anyway).

See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
stepping 6: It has clflush).

But the hangs are gone, so I'm happy. I prefer slight glyph corruption that goes
away when I cause a refresh (e.g. increase text size) with snappy performance to
the sluggishness caused by the current patch.
Comment 133 legolas558 2010-04-19 15:48:18 UTC
(In reply to comment #132)
> (In reply to comment #127)
> > Oh, and add some details about your box, please (brand&model + cpu,
> > mostly, the rest is all in the dmesg, anyway).
> 
> See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> stepping 6: It has clflush).
> 
> But the hangs are gone, so I'm happy. I prefer slight glyph corruption that
> goes
> away when I cause a refresh (e.g. increase text size) with snappy performance
> to
> the sluggishness caused by the current patch.

I also own an 855GM (rev 02), but I had no glyph corruption with patch v6; without the locking patch I experienced crashes, so the most recent patch is really necessary for me, although I'd also like to see it more performant. But first comes reliability, and right now it's not crashing anymore.
Comment 134 Daniel Vetter 2010-04-19 15:56:38 UTC
> --- Comment #132 from Indan Zupancic <indan@nul.nu> 2010-04-19 15:26:36 PDT ---
> What I don't understand is why your patch slows things down so much for me,
> it seems to do only a few thousand flushes anyway.

Well, worst-case a flush can take 1 ms.

> I guess copying around is what the old drivers did?

Nope. But for various reasons it changed mappings _much_ less. So much
less likely to crash.

> Some random ideas:
> 
> - Increase I830_MCH_WRITE_BUFFER_SIZE?

Tried. Gave up at 64 kB.

> - Instead of writing zeroes, actually change the content of the flush page.
>   Flushing caches doesn't seem to do much if the new content is the same as
>   the old one?

Patch does that atm for all writes. Furthermore I've never seen hw that
clever (it's a totally worthless optimization, usually).

> - The text you quoted in one of your commit messages said that the memory 
>   content isn't coherent, but it didn't say anything about the mapping itself.
>   Can't you update the gtt mapping to effectively flush it? I mean, if you 
>   move pages out of the gtt and back in, shouldn't that flush the old content?
>   Maybe move it to a different index, e.g. insert new mapping to the start
>   instead of the end, in case the hw caches it by address+index. Similar to
>   Chris Wilson's gtt disabling thing, but instead of disabling, altering it
>   in a smart, flush causing way.

Well, that's exactly where the shit usually hits the fan. Furthermore, at
least on i845 there is a chipset erratum that says (no joke) if you change a
mapping shortly before the gpu reads stuff from it, it may read adjacent
pages. Chris is trying to battle that one. Oh, and no, rewriting the gtt
entries doesn't flush data (only tlb, but not everywhere, see above).

> If the problem is that the flush is needed to avoid the hardware from writing
> stale data to old gtt mapped physical memory:
> 
> - If an entry is added, there should be no need for a flush, because the all
>   memory is still valid. If an entry is removed, the gpu can continue to write
>   to those pages. What about copying the content to a new physical page and 
>   keeping the original page for a while until the gpu is done with it?

Something similar is already done. Look for scratch_page in intel-gtt.c

> > Ok, that's bad. Can you change the following define in
> > include/drm/intel-gtt.h and see whether you still get failed chipset
> > flushes?
> > 
> > -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> > +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
> > 
> > The whole stuff make somewhat more sense this way around, anyway.
> 
> I will try this later, first I'm going to try without your latest commit
> ("fix i85x gtt chipset flush") to see how it behaves without that stuff,
> both performance and amount of failed flushes.

If your X40 is anything like mine, you're in for a bad surprise :(

> > Oh, and add some details about your box, please (brand&model + cpu,
> > mostly, the rest is all in the dmesg, anyway).
> 
> See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> stepping 6: It has clflush).

Thanks, I'm regularly losing my overview with all the different testers on
this bug ;)
Comment 135 Daniel Vetter 2010-04-19 16:00:52 UTC
> --- Comment #133 from legolas558 <legolas558@email.it> 2010-04-19 15:48:18 PDT ---
> I also own an 855GM (rev 02), but I had no glyph corruption with patch v6;
> without the locking patch I experienced crashes, so the most recent patch is
> really necessary for me, although I'd also like to see it more performant. But
> first comes reliability, and right now it's not crashing anymore.

Yep, I want to get this right first before performance tuning starts. But
I already have a few ideas on how to improve the current situation.
- The current chipset flush always flushes both directions, but we usually
  only need one direction flushed. This is especially important because
  the slower flush is in gtt->cpu direction, which isn't performance
  critical at all.
- atm the driver executes enormous amounts of unnecessary flushes.
  Batching them up should fix this.
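
Both ideas can be sketched together in userspace C: callers only record which direction they need flushed, and one real flush is issued at a commit point for the union of the requests. This is a rough illustration, not the planned implementation; enum flush_dir, struct flush_batcher, flush_request and flush_commit are hypothetical names, and where the commit point would sit in the driver is left open.

```c
#include <assert.h>

/* Hypothetical flush directions, usable as a bitmask. */
enum flush_dir {
	FLUSH_CPU_TO_GTT = 1u << 0,
	FLUSH_GTT_TO_CPU = 1u << 1,
};

struct flush_batcher {
	unsigned pending;     /* directions requested since the last real flush */
	unsigned long issued; /* number of real (expensive) flushes performed */
};

/* Cheap: only remember that a flush in these directions is needed. */
static inline void flush_request(struct flush_batcher *b, unsigned dirs)
{
	assert((dirs & ~(unsigned)(FLUSH_CPU_TO_GTT | FLUSH_GTT_TO_CPU)) == 0);
	b->pending |= dirs;
}

/* Expensive: perform at most one real flush for everything requested.
 * Returns the directions flushed, or 0 if nothing was pending. */
static inline unsigned flush_commit(struct flush_batcher *b)
{
	unsigned dirs = b->pending;

	if (dirs == 0)
		return 0;
	/* A real driver would flush the hardware for `dirs` here. */
	b->issued++;
	b->pending = 0;
	return dirs;
}
```

Three requests for the cpu->gtt direction followed by one commit cost a single real flush and never touch the slower gtt->cpu direction.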
Comment 136 René Gabriëls 2010-04-19 18:23:43 UTC
(In reply to comment #110)
> Created an attachment (id=35065) [details]
> only call put_pages when gtt_space != NULL
> 
> Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone
> who's hitting this problem, please apply this patch on top of whatever kernel
> most easily reproduces the problem.

I've hit this bug at least once in the last couple of days (my system crashed a couple of times, and only once was I able to extract logs).  With this patch applied, I haven't seen this bug resurface so far.

However, my system crashes when starting doomsday or warzone2100 (both OpenGL games).  I hadn't noticed this before, and it is probably unrelated to this coherency bug.  Where should I file this bug report? Mesa?  Dmesg says:

[drm:i915_gem_do_execbuffer] *ERROR* Invalid object handle 48 at index 0

X log says:

[  5132.404] (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Bad file descriptor.
Comment 137 Brian Rogers 2010-04-19 20:00:07 UTC
OpenGL hangs could be bug 26557.

Try reverting commit b4a6169412819cc3a027c6a118f0537911145a30.
Comment 138 Brian Rogers 2010-04-19 20:03:08 UTC
That's a Mesa commit, BTW.
Comment 139 rainy6144 2010-04-19 20:32:56 UTC
(I don't own an 855 and my 845 machine is not available right now, so this is just wild speculation.)

Could some of the chipset buffers be indexed by the SDRAM bank number (and maybe even the row (side) number)?  I'm imagining a scenario where the CPU and the GTT sides have separate SDRAM write buffers that are not kept coherent (their access to the actual RAM can be arbitrated), and each write buffer has one or two cache lines for each bank; this might be a relatively easy way to allow access to different banks in parallel.  There seem to be 4 banks on the 845, and the bank number can be anywhere between bits 11-12 and bits 14-15, depending on the DRAM modules installed; perhaps the situation is similar on the 855 as well.  If this is the case, 16 physically contiguous pages should cover all banks, while non-contiguous ones might not if we are particularly unlucky in intel_i830_setup_flush(), which is called when resuming.

To test this theory, maybe we can print the physical addresses (those within the System RAM range in /proc/iomem) of the allocated i8xx_pages.  Then, when we see retried or even failed flushes, perhaps some patterns can be observed.
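
As a rough way to look for such patterns, here is a small userspace C sketch that computes which banks a set of page addresses touches. It assumes 4 KiB pages, 4 banks, and a two-bit bank field whose lowest bit sits at bank_shift (11 to 14 in the speculation above); banks_covered and the constants are made-up names, and the real address decoding of the chipset may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u                       /* assumed 4 KiB pages */
#define SKETCH_PAGE_BYTES (1ull << SKETCH_PAGE_SHIFT)
#define SKETCH_NUM_BANKS  4u                        /* assumed bank count */

/* Returns a bitmask with bit b set if any byte of the given pages falls
 * into bank b, where the bank number is taken from the two physical
 * address bits starting at bank_shift. */
static inline unsigned banks_covered(const uint64_t *page_phys, size_t n,
                                     unsigned bank_shift)
{
	unsigned mask = 0;
	uint64_t step = 1ull << bank_shift;

	assert(page_phys != NULL && bank_shift < 62);
	for (size_t i = 0; i < n; i++) {
		uint64_t base = page_phys[i] & ~(SKETCH_PAGE_BYTES - 1);

		/* Walk the page in bank-sized steps; this is a single step
		 * when one bank stripe is at least as large as a page. */
		for (uint64_t off = 0; off < SKETCH_PAGE_BYTES; off += step)
			mask |= 1u << (((base + off) >> bank_shift) &
				       (SKETCH_NUM_BANKS - 1));
	}
	return mask;
}
```

With the bank number in bits 14-15, four pages spaced 16 KiB apart give the mask 0xF (all banks), whereas four adjacent pages starting at a 64 KiB boundary give 0x1 (bank 0 only).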
Comment 140 Daniel Vetter 2010-04-20 01:52:56 UTC
> --- Comment #139 from rainy6144@gmail.com 2010-04-19 20:32:56 PDT ---
> Could some of the chipset buffers be indexed by the SDRAM bank number (and
> maybe even the row (side) number)?  I'm imagining a scenario where the CPU and
> the GTT sides have separate SDRAM write buffers that are not kept coherent
> (their access to the actual RAM can be arbitrated), and each write buffer has
> one or two cache lines for each bank; this might be a relatively easy way to
> make simultaneous access to different banks in parallel.  There seems to be 4
> banks on the 845, and the bank number can be between bits 11-12 and 14-15,
> depending on the DRAM modules installed; perhaps the situation is similar on
> the 855 as well.  If this is the case, 16 physically contiguous pages should
> cover all banks, while non-contiguous ones might not be so if we are
> particularly unlucky in intel_i830_setup_flush(), which is called when
> resuming.

Neat idea. I'll look into allocating the pages as one big chunk (i.e. a
higher order alloc). But that doesn't explain why the problem seems to happen
only after a resume - the pages don't get reallocated on resume (look for
"goto setup" in intel_i830_setup_flush).
Comment 141 legolas558 2010-04-20 16:00:31 UTC
18 chipset failures in 1h of uptime with a resume from hibernation (seems totally unrelated for me).

[  140.678789] i8xx chipset flush failed, expected: 4642, cpu_read: 4130
[  382.334636] i8xx chipset flush failed, expected: 32422, cpu_read: 31910
[  916.360151] i8xx chipset flush failed, expected: 85629, cpu_read: 85117
[ 1461.747517] i8xx chipset flush failed, expected: 142082, cpu_read: 141570
[ 2256.590632] i8xx chipset flush failed, expected: 196727, cpu_read: 196215
[ 4106.345442] i8xx chipset flush failed, expected: 267271, cpu_read: 266759
[ 5147.195196] i8xx chipset flush failed, expected: 309181, cpu_read: 308669
[ 6185.589716] i8xx chipset flush failed, expected: 354133, cpu_read: 353621
[ 8005.430094] i8xx chipset flush failed, expected: 437064, cpu_read: 436552
[ 8114.898367] i8xx chipset flush failed, expected: 444113, cpu_read: 443601

no "max retries" line.

Xorg 1.7.6
libdrm 2.4.19
mesa 7.7.1
Comment 142 Christian Beier 2010-04-21 10:08:31 UTC
Created attachment 35211 [details]
dmesg of deadlock with v8 patch

Hi,
with the v8 patch applied atop a drm-intel-next kernel (which is not too recent, around 2.5 weeks old), I got an apparent deadlock some time after resume. Lockdep is actually turned on (according to dmesg), so I really don't know why it says it's off.

Kernel 2.6.34-rc2 from drm-intel-next with v8 patch applied, lockdep enabled.
Xserver 1.7.6
libdrm 2.4.18
intel-drv 2.11

Dunno if it's related to the patch...

Cheers,
   Christian
Comment 143 Christian Beier 2010-04-21 10:10:01 UTC
Comment on attachment 35073 [details]
dmesg output snippet with i915_gem_tiling.c warning

resolved by updating to intel 2.11.
Comment 144 legolas558 2010-04-21 13:48:55 UTC
Created attachment 35214 [details]
DRI debugfs after overlay crash
Comment 145 legolas558 2010-04-21 13:53:11 UTC
Created attachment 35215 [details]
gtt flush failures (not fatal)

I have attached the flush failures found in dmesg (probably not related to the overlay crash) and the DRI debugfs after a crash happening when watching videos.

Looks like the overlay bug is not yet fixed. I had to watch the entire video collection of The Rockets, but finally I got an overlay filled with a nice blue, music still playing but Xorg inevitably dead. I could access a VT and take the dump, but any further attempt to restart Xorg was failing miserably.

The Xorg log was being filled indefinitely with these lines:

(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

Nothing else was relevant there.

I think the flush failures are not tied to the overlay crash (I also have them when the system doesn't crash), but anyway I have added them here.
Comment 146 legolas558 2010-04-23 04:21:36 UTC
(In reply to comment #145)
> I think the flush failures are not tied to the overlay crash (I also have them
> when the system doesn't crash), but anyway I have added them here.

Some more information: while watching videos there are occasional glitching overlay frames (psychedelic colors, only 1 frame), sometimes interleaved with a blue fill like the one appearing when it gives up in a total Xorg crash. The blue frames seem to appear more frequently when Xorg is more prone to fatally giving up, but I have never seen them more than twice before a total Xorg crash.

Shall we split this bug into a new one?
Comment 147 Indan Zupancic 2010-04-23 05:49:43 UTC
(In reply to comment #146)
> (In reply to comment #145)
> > I think the flush failures are not tied to the overlay crash (I also have them
> > when the system doesn't crash), but anyway I have added them here.
> 
> Some more information: while watching videos there are occasional glitching
> overlay frames (psychedelic colors, only 1 frame), sometimes interleaved with a
> blue fill like the one appearing when it gives up in a total Xorg crash. The
> blue frames seem to appear more frequently when Xorg is more prone to fatally
> giving up, but never seen them more than twice before a total Xorg crash.
> 
> Shall we split this bug into a new one?

I think you're hitting the bug I had, which should be fixed by upgrading to
the 2.11 intel driver and libdrm 2.4.20.
Comment 148 Indan Zupancic 2010-04-29 02:37:57 UTC
Created attachment 35329 [details]
GPU hang with newest driver and libdrm. v8 without extra flushing patch.

(In reply to comment #147)
> I think you're hitting the bug I had, which should be fixed by upgrading to
> the 2.11 intel driver and libdrm 2.4.20.

Okay, after a week or two (?) of running v8 without the last extra flushing commit I finally got a hung GPU again. So it seems there is a corner case left for this particular bug somewhere.

Last time I counted I got around 2% failed flushes, but otherwise the system was rock solid. Text corruption was rare too, though I think it did happen the day the GPU hung.

Dump was taken with this script:

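#!/bin/bash
# Poll i915_wedged; once the GPU is reported wedged, collect dmesg,
# the X log and the DRI debugfs files into a tarball, then exit.
PATH="/bin:/usr/bin"
mount /mnt/debug
cd /tmp/

while true; do
	# A 0 in i915_wedged means the GPU is not wedged: keep polling.
	if grep -q 0 /mnt/debug/dri/0/i915_wedged; then
		sleep 1;
	else
		mkdir dump
		dmesg > dump/dmesg
		cp /var/log/Xorg.0.log dump/
		cp -a /mnt/debug/dri/0/* dump/
		tar czf dump.tgz dump
		rm -rf dump
		mv dump.tgz /home/indan/
		# Flush to disk in case the machine has to be reset.
		sync;
		exit;
	fi
done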
Comment 149 Indan Zupancic 2010-04-29 02:49:34 UTC
(In reply to comment #134)
> > --- Comment #132 from Indan Zupancic <indan@nul.nu> 2010-04-19 15:26:36 PDT ---
> > What I don't understand is why your patch slows things down so much for me,
> > it seems to do only a few thousand flushes anyway.
> 
> Well, worst-case a flush can take 1 ms.

That would explain it yes.

[cut]
> > If the problem is that the flush is needed to avoid the hardware from writing
> > stale data to old gtt mapped physical memory:
> > 
> > - If an entry is added, there should be no need for a flush, because the all
> >   memory is still valid. If an entry is removed, the gpu can continue to write
> >   to those pages. What about copying the content to a new physical page and 
> >   keeping the original page for a while until the gpu is done with it?
> 
> Something similar is already done. Look for scratch_page in intel-gtt.c

But if done properly the need for flushing would go away altogether. Considering it's quite stable here without those extra flushes, perhaps it's easier to fix the corner cases that still need flushing instead of getting flushing reliable?

> > > Ok, that's bad. Can you change the following define in
> > > include/drm/intel-gtt.h and see whether you still get failed chipset
> > > flushes?
> > > 
> > > -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> > > +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
> > > 
> > > The whole stuff make somewhat more sense this way around, anyway.
> > 
> > I will try this later, first I'm going to try without your latest commit
> > ("fix i85x gtt chipset flush") to see how it behaves without that stuff,
> > both performance and amount of failed flushes.
> 
> If your X40 is anything like mine, you're in for a bad surprise :(

Dmesg is full of backtraces, but other than that it's quite stable.
Performance is good too again.

Next week I should have a bit more time to read the code and do more testing.

> > > Oh, and add some details about your box, please (brand&model + cpu,
> > > mostly, the rest is all in the dmesg, anyway).
> > 
> > See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> > stepping 6: It has clflush).
> 
> Thanks, I'm regularly losing my overview with all the different testers on
> this bug ;)

No problem, you're doing great. :-)
Comment 150 Daniel Vetter 2010-05-05 12:37:49 UTC
> --- Comment #136 from René Gabriëls <renegabriels@gmail.com> 2010-04-19 18:23:43 PDT ---
> However, my system crashes when starting doomsday or warzone2100 (both OpenGL
> games).  I hadn't noticed this before, and is probably unrelated to this
> coherency bug.  Where should I file this bug report? Mesa?  Dmesg says:
> 
> [drm:i915_gem_do_execbuffer] *ERROR* Invalid object handle 48 at index 0
> 
> X log says:
> 
> [  5132.404] (EE) intel(0): Failed to submit batch buffer, expect rendering
> corruption or even a frozen display: Bad file descriptor.

Looks like an (unrelated) bug in xf86-video-intel - it's submitting a
batchbuffer with a corrupt object/reloc table.
Comment 151 legolas558 2010-05-05 23:46:28 UTC
Created attachment 35450 [details]
failure after starting xfce4-panel

I have just pulled drm-intel, recompiled it (without patch v8, which seems to be already there) and now I can no longer use Xorg; I instantly get these errors when starting XFCE:

intel_bufmgr_gem.c:1052: Error setting domain 69: Input/output error
intel_bufmgr_gem.c:1052: Error setting domain 65: Input/output error
intel_bufmgr_gem.c:1052: Error setting domain 89: Input/output error

and then the usual waterfall of I/O errors. I am on vesa now.

Versions of my packages:

xorg-server 1.7.6
libdrm 2.4.19
xf86-video-intel 2.10.0

Looks like Arch Linux hasn't upgraded these yet, nor am I able to run a freedesktop git development stack.
Comment 152 Indan Zupancic 2010-05-06 02:17:52 UTC
(In reply to comment #151)
> Created an attachment (id=35450) [details]
> failure after starting xfce4-panel
> 
> I have just pulled drm-intel, recompiled it (without patch v8, which seems to
> be already there) and then I can no more use Xorg, I instantly get these errors
> when starting XFCE:
> 
> intel_bufmgr_gem.c:1052: Error setting domain 69: Input/output error
> intel_bufmgr_gem.c:1052: Error setting domain 65: Input/output error
> intel_bufmgr_gem.c:1052: Error setting domain 89: Input/output error
> 
> and then the usual waterfall of I/O errors. I am on vesa now.
> 
> Versions of my packages:
> 
> xorg-server 1.7.6
> libdrm 2.4.19
> xf86-video-intel 2.10.0
> 
> Looks like Arch Linux hasn't yet upgraded these, nor I am able to run a
> freedesktop git development stack

All the new stuff is in the testing repository.
Comment 153 legolas558 2010-05-06 05:46:34 UTC
(In reply to comment #152)
> (In reply to comment #151)
> > Looks like Arch Linux hasn't yet upgraded these, nor I am able to run a
> > freedesktop git development stack
> 
> All the new stuff is in the testing repository.

I got the new testing packages:

xf86-video-intel 2.11.0-1
xorg-server 1.8.902-1
libdrm 2.4.20-2
mesa 7.8.1-2

And exactly the same crash at startup. Some regression here?
Comment 154 Daniel Vetter 2010-05-07 10:26:27 UTC
> --- Comment #151 from legolas558 <legolas558@email.it> 2010-05-05 23:46:28 PDT ---
> I have just pulled drm-intel, recompiled it (without patch v8, which seems to
> be already there) and then I can no more use Xorg, I instantly get these errors
> when starting XFCE:

Nope the patch is not yet there, at least not yet fully. So it's expected
that the kernel you've tested is rather crash-happy ;)

I've hoped that a few patches more would go in before I rebase, but atm
stuff is stalling. I'll post a rebased version of the patch asap.
Comment 155 Stefan Glasenhardt 2010-05-09 15:12:57 UTC
Hi Daniel,

I just wanted to ask if there is any news on fixing the performance problems when using your patch?

I'm using your patch on the latest Lucid kernel (backported by comparing and copy-and-pasting every single line) and my notebook now works perfectly stable, even with the older Lucid intel drivers (in UMS and KMS mode).

P.S. :

I've disabled "intel_wait_for_canary_flocks" after the flush in the function "intel_i830_chipset_flush". This greatly improves performance, and I have had only a single crash in the last few days (which might be related to another bug, because I could switch to the console and restart the system without problems).
Comment 156 legolas558 2010-05-10 06:52:36 UTC
Created attachment 35546 [details]
v8 patch rebased against latest drm-intel (anholt repository)

(In reply to comment #154)
> > --- Comment #151 from legolas558 <legolas558@email.it> 2010-05-05 23:46:28 PDT ---
> > I have just pulled drm-intel, recompiled it (without patch v8, which seems to
> > be already there) and then I can no more use Xorg, I instantly get these errors
> > when starting XFCE:
> 
> Nope the patch is not yet there, at least not yet fully. So it's expected
> that the kernel you've tested is rather crash-happy ;)
> 
> I've hoped that a few patches more would go in before I rebase, but atm
> stuff is stalling. I'll post a rebased version of the patch asap.

I have made an attempt to rebase your patch vs latest drm-intel-next; I hope the result is good (some people on Arch Linux forums were asking me about the patch, so now I am pointing them to this bug tracker)
Comment 157 legolas558 2010-05-10 10:49:27 UTC
I can't say if it's due to my badly rebased patch or to some recent change to software, but Firefox persona's background is badly garbled, VLC shows a still image in place of the video overlay and videos played with mplayer have nice psychedelic glitches
Comment 158 Daniel Vetter 2010-05-10 11:32:22 UTC
Created attachment 35548 [details] [review]
v9 against latest drm-intel-next

Sorry for the delay, but I want to test new patches a little before posting (especially now that quite a few people are on this bug's CC list). Changes vs v8:

- rebased against latest drm-intel-next (patch shrunk quite decently, yeah!).
- increased the gtt flock size to 16 pages. Perhaps this helps.

Plans going forward:
- I haven't yet started on the performance work. I don't really like mucking around in a very delicate and hard to debug part of gem. So I still hope that this problem somehow magically fixes itself ;) More honestly: Correctness first, performance later (and if I'm very lucky, other ongoing work by other people will make this much easier).
- Merging plans: Due to the (new) failures reported with v8 I'm reluctant to submit the patch as-is. I'm definitely pushing everything up to the cache coherency checker for inclusion into -next (already submitted). But the actual fix probably needs to wait some more.
- I haven't yet had time to research/implement the RAM bank idea by rainy6144.
Comment 159 legolas558 2010-05-11 05:57:23 UTC
(In reply to comment #158)
> Created an attachment (id=35548) [details]
> v9 against latest drm-intel-next
> 
> Sorry for the delay, but I want to test new patches a little before posting
> (especially now that quite a few people are on this bugs cc list). Changes vs
> v8:
> 
> - rebased against latest drm-intel-next (patch shrunk quite decently, yeah!).
> - increased the gtt flock size to 16 pages. Perhaps this helps.
> 
The Firefox persona's background glitch is still there, I strongly think that it can be a new bug in intel driver or libdrm.

> - Merging plans: Due to the (new) failures reported with v8 I'm reluctant to
> submit the patch as-is. I'm definitely pushing everything up to the cache
> coherency checker for inclusion into -next (already submitted). But the actual
> fix probably needs to wait some more.
I think the patch should get critical priority even as-is because the vanilla kernel (also drm-intel-next) crashes in a few seconds without it.
Comment 160 Indan Zupancic 2010-05-11 15:33:24 UTC
Didn't take long:

[ 2111.864905] WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
[ 2111.864912] Hardware name: 2371GHG
[ 2111.864917] i8xx chipset flush failed, expected: 118451, cpu_read: 117939
[ 2111.864922] Modules linked in: pl2303 usbserial usb_storage uhci_hcd ehci_hcd usbcore
[ 2111.864940] Pid: 788, comm: X Not tainted 2.6.34-rc6-v9 #52
[ 2111.864945] Call Trace:
[ 2111.864956]  [<c101ea80>] ? warn_slowpath_common+0x5d/0x70
[ 2111.864964]  [<c101eac6>] ? warn_slowpath_fmt+0x26/0x2a
[ 2111.864973]  [<c1141b1b>] ? intel_i830_chipset_flush+0x2e3/0x32d
[ 2111.864984]  [<c113da68>] ? agp_flush_chipset+0xc/0xd
[ 2111.864994]  [<c115bdae>] ? i915_gem_flush+0x1a/0xbb
[ 2111.865003]  [<c115fae5>] ? i915_gem_do_execbuffer+0x9bb/0xe3f
[ 2111.865023]  [<c115d187>] ? i915_gem_object_set_to_gtt_domain+0x33/0x5c
[ 2111.865032]  [<c116004d>] ? i915_gem_execbuffer2+0xe4/0x164
[ 2111.865041]  [<c114826f>] ? drm_ioctl+0x1cf/0x27a
[ 2111.865049]  [<c115ff69>] ? i915_gem_execbuffer2+0x0/0x164
[ 2111.865060]  [<c10695d9>] ? do_sync_read+0x9d/0xd2
[ 2111.865069]  [<c11480a0>] ? drm_ioctl+0x0/0x27a
[ 2111.865078]  [<c1073701>] ? vfs_ioctl+0x1c/0x7d
[ 2111.865086]  [<c1073c97>] ? do_vfs_ioctl+0x478/0x4bc
[ 2111.865096]  [<c1030e62>] ? hrtimer_try_to_cancel+0x43/0x60
[ 2111.865105]  [<c1021c12>] ? do_setitimer+0xa4/0x17f
[ 2111.865113]  [<c1021d35>] ? sys_setitimer+0x48/0x73
[ 2111.865121]  [<c1034f41>] ? ktime_get_ts+0xb3/0xbb
[ 2111.865129]  [<c1073d08>] ? sys_ioctl+0x2d/0x44
[ 2111.865138]  [<c10025d0>] ? sysenter_do_call+0x12/0x26
[ 2111.865144] ---[ end trace d90ca0d623dcc2a3 ]---
[ 2934.051532] ------------[ cut here ]------------
[ 2934.051547] WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
[ 2934.051551] Hardware name: 2371GHG
[ 2934.051554] i8xx chipset flush failed, expected: 156295, cpu_read: 155783
[ 2934.051557] Modules linked in: pl2303 usbserial usb_storage uhci_hcd ehci_hcd usbcore
[ 2934.051569] Pid: 788, comm: X Tainted: G        W  2.6.34-rc6-v9 #52
[ 2934.051572] Call Trace:
[ 2934.051580]  [<c101ea80>] ? warn_slowpath_common+0x5d/0x70
[ 2934.051584]  [<c101eac6>] ? warn_slowpath_fmt+0x26/0x2a
[ 2934.051589]  [<c1141b1b>] ? intel_i830_chipset_flush+0x2e3/0x32d
[ 2934.051596]  [<c113da68>] ? agp_flush_chipset+0xc/0xd
[ 2934.051602]  [<c115bdae>] ? i915_gem_flush+0x1a/0xbb
[ 2934.051607]  [<c115fae5>] ? i915_gem_do_execbuffer+0x9bb/0xe3f
[ 2934.051614]  [<c116620f>] ? intel_mark_busy+0x9b/0x177
[ 2934.051619]  [<c115d187>] ? i915_gem_object_set_to_gtt_domain+0x33/0x5c
[ 2934.051624]  [<c116004d>] ? i915_gem_execbuffer2+0xe4/0x164
[ 2934.051629]  [<c114826f>] ? drm_ioctl+0x1cf/0x27a
[ 2934.051634]  [<c115ff69>] ? i915_gem_execbuffer2+0x0/0x164
[ 2934.051641]  [<c1007a62>] ? restore_i387_fxsave+0x4c/0x5c
[ 2934.051647]  [<c1034fa4>] ? ktime_get+0x5b/0xcf
[ 2934.051652]  [<c11480a0>] ? drm_ioctl+0x0/0x27a
[ 2934.051658]  [<c1073701>] ? vfs_ioctl+0x1c/0x7d
[ 2934.051662]  [<c1073c97>] ? do_vfs_ioctl+0x478/0x4bc
[ 2934.051669]  [<c1031655>] ? hrtimer_start+0xd/0x11
[ 2934.051674]  [<c1021c91>] ? do_setitimer+0x123/0x17f
[ 2934.051678]  [<c1034f41>] ? ktime_get_ts+0xb3/0xbb
[ 2934.051683]  [<c1073d08>] ? sys_ioctl+0x2d/0x44
[ 2934.051687]  [<c10025d0>] ? sysenter_do_call+0x12/0x26
[ 2934.051691] ---[ end trace d90ca0d623dcc2a4 ]---

There's also a small copy&paste bug in your patch:

	for (i = 0; i < I830_CC_CANARY_FLOCK_PAGES; i++) {
		intel_private.i8xx_cpu_canary_pages[i]
			= kmap(intel_private.i8xx_pages[i+2]);
		if (!intel_private.i8xx_cpu_flush_page) {
			WARN_ON(1);
			intel_i830_fini_flush();
			return;
		}
	}
That should be if (!intel_private.i8xx_cpu_canary_pages[i]).
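
To show why the check matters, here is a small userspace model of that loop with the corrected test. It is only a sketch: `sketch_map()` is a hypothetical stand-in for kmap() that can be told to fail at one index, and `SKETCH_FLOCK_PAGES` stands in for I830_CC_CANARY_FLOCK_PAGES. None of these names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FLOCK_PAGES 4	/* stand-in for I830_CC_CANARY_FLOCK_PAGES */

/* Hypothetical stand-in for kmap(): fails (returns NULL) at one index. */
char sketch_backing[SKETCH_FLOCK_PAGES];
int sketch_fail_index = -1;	/* -1: never fail */

void *sketch_map(int i)
{
	return (i == sketch_fail_index) ? NULL : &sketch_backing[i];
}

/* Map all canary pages, testing the entry that was just mapped.
 * Returns the index of the first failed mapping, or -1 on success. */
int sketch_map_canaries(void *canary[SKETCH_FLOCK_PAGES])
{
	assert(canary != NULL);

	for (int i = 0; i < SKETCH_FLOCK_PAGES; i++) {
		canary[i] = sketch_map(i);
		if (!canary[i])		/* not some unrelated pointer */
			return i;
	}
	return -1;
}
```

With the original test on i8xx_cpu_flush_page, a failed canary mapping would go unnoticed as long as the flush page itself had been mapped earlier; checking canary[i] catches the failure at the index where it happens.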

I don't understand this bit:

	/* Don't map the first page, we only write via its physical address
	 * into it. */
	for (i = 0; i < I830_CC_DANCE_PAGES; i++) {
		writel(agp_bridge->driver->mask_memory(agp_bridge,
				page_to_phys(intel_private.i8xx_pages[i+1]), 0),
		       intel_private.registers+I810_PTE_BASE+((num_entries+i)*4));
	}

The first page is i8xx_cpu_flush_page, but if it isn't mapped, the gmch doesn't know about it, and intel_flush_mch_write_buffer() has no effect, has it? Or is any write at any address sufficient to fill the write buffer?

We seem to have mysterious behaviour here, all the canary pages ended up coherent, but that one write somehow didn't?!

I guess the gmch has a local cache that hides writes. If you know that cache's design (associativity etc.) then you can probably flush it out by doing a read or write to the right address. The canary stuff seems to work most of the time, so the cache can't be too big.

Or maybe you can flush it out by putting the chip in D1-3 and back to D0 quickly, or something crazy like that.
Comment 161 Indan Zupancic 2010-05-12 00:06:39 UTC
Oh, forgot to mention: The above failed flush was without having done a suspend.
Comment 162 Thorsten Vollmer 2010-05-15 10:40:23 UTC
(In reply to comment #158)
> Due to the (new) failures reported with v8 I'm reluctant to submit the
> patch as-is.

Patch v8 performed well on my 852GME. In three weeks of regular usage and limited stress testing the kernel reported exactly one flush failure. I did not notice any slowdown compared to unpatched kernels, and there is no measurable difference as reported in comment #122.
Patch v9 is just as good of course.

(In reply to comment #140)
> I'll look into allocating the pages as one big chunk (ie higher order
> alloc)

There is a decent chance that the pages are already allocated in one chunk. At least on my machine the pages happen to be allocated consecutively. (verified with page_to_phys(intel_private.i8xx_pages[i]))
You could make your next patch print the physical addresses. If other machines behave like mine but still show more failures, then you do not need to bother with explicit higher order allocations.
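
The check itself is simple. A minimal userspace sketch, assuming the physical addresses have already been collected (for example from such a printout) and assuming 4 KiB pages; `pages_contiguous` and `SKETCH_PAGE_SIZE` are hypothetical names, not driver code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Returns 1 if each address is exactly one page after the previous
 * one, 0 otherwise. An empty or single-entry list counts as contiguous. */
int pages_contiguous(const uint64_t *phys, size_t n)
{
	assert(phys != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		if (phys[i] != phys[i - 1] + SKETCH_PAGE_SIZE)
			return 0;
	}
	return 1;
}
```

If this returns 1 for the addresses a machine prints, the pages behave like a single higher order allocation for this purpose.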
Comment 163 René Gabriëls 2010-05-17 06:57:42 UTC
(In reply to comment #158)
> Created an attachment (id=35548) [details]
> v9 against latest drm-intel-next
> 
> Sorry for the delay, but I want to test new patches a little before posting
> (especially now that quite a few people are on this bugs cc list). Changes vs
> v8:
> 
> - rebased against latest drm-intel-next (patch shrunk quite decently, yeah!).
> - increased the gtt flock size to 16 pages. Perhaps this helps.

So far (5 days of testing) v9 works flawlessly here: no crashes or artefacts.
Comment 164 nepo 2010-05-18 02:35:35 UTC
Hello,
I do have a 855 chipset as well but unfortunately I am not an advanced user - could somebody explain in some short words how to install the Patch? Is there a git software necessary?
Thank you so much!! D.
Comment 165 Indan Zupancic 2010-05-18 04:25:58 UTC
(In reply to comment #164)
> Hello,
> I do have a 855 chipset as well but unfortunately I am not an advanced user -
> could somebody explain in some short words how to install the Patch? Is there a
> git software necessary?
> Thank you so much!! D.

The patch is against drm-intel-next, so git is probably easiest.

# Get Linux git tree (takes a while):
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

cd linux-2.6

# Add the drm-intel-next branch from drm-intel:
git remote add -t drm-intel-next drm-intel-next git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git

# Fetch it (git remote add alone does not download anything):
git fetch drm-intel-next

# Change working dir to this new stuff:
git checkout drm-intel-next

# Apply patch:
patch --dry-run -p1 < ../fix-i855-cache-coherency-v9.patch

# If that succeeds redo without the --dry-run bit.

Good luck!
Comment 166 dnjl 2010-05-19 06:10:43 UTC
(In reply to comment #164)
> Hello,
> I do have a 855 chipset as well but unfortunately I am not an advanced user -
> could somebody explain in some short words how to install the Patch? Is there a
> git software necessary?
> Thank you so much!! D.

If you are looking for updated packages, this depends on your distribution.

For Ubuntu Lucid you will find
- updated kernel module (the module only) here:
  https://launchpad.net/~glasen/+archive/855gm-fix/
  (you have to remove this package when the bug is fixed in ubuntu kernel)
- or a whole kernel update here:
  https://launchpad.net/~dnjl/+archive/kernel/
  (which will be superseded/updated in case it's fixed in the ubuntu kernel)

Also, don't forget to install all other provided updates!
On my systems no updates for drm or xorg-intel are needed anymore.

For other distributions I don't know...
Comment 167 Christian Beier 2010-05-19 06:32:22 UTC
(In reply to comment #166)
> (In reply to comment #164)
> > Hello,
> > I do have a 855 chipset as well but unfortunately I am not an advanced user -
> > could somebody explain in some short words how to install the Patch? Is there a
> > git software necessary?
> > Thank you so much!! D.
> 
> if you are looking for updated packages this depends on your distribution.
> 
> For Ubuntu Lucid you will find
> - updated kernel module (the module only) here:
>   https://launchpad.net/~glasen/+archive/855gm-fix/
>   (you have to remove this package when the bug is fixed in ubuntu kernel)
> - or a whole kernel update here:
>   https://launchpad.net/~dnjl/+archive/kernel/
>   (which will be superseded/updated in the case its fixed in ubuntu kernel)
> 
> Also, dont forget to install all other provided updates!
> On my systems no updates for drm or xorg-intel are needed anymore.
> 
> For other distrobutions I don't now...

For Debian Squeeze on i686 a kernel package is here:
http://www2.informatik.hu-berlin.de/~beier/tmp/linux-image-2.6.34gtt-fix-v9_2.6.34gtt-fix-v9-10.00.Custom_i386.deb
Comment 168 Daniel Vetter 2010-05-19 13:21:59 UTC
> --- Comment #160 from Indan Zupancic <indan@nul.nu> 2010-05-11 15:33:24 PDT ---
> There's also a small copy&paste bug in your patch:
> 
>     for (i = 0; i < I830_CC_CANARY_FLOCK_PAGES; i++) {
>         intel_private.i8xx_cpu_canary_pages[i]
>             = kmap(intel_private.i8xx_pages[i+2]);
>         if (!intel_private.i8xx_cpu_flush_page) {
>             WARN_ON(1);
>             intel_i830_fini_flush();
>             return;
>         }
>     }
> That should be if (!intel_private.i8xx_cpu_canary_pages[i]).

Thanks for spotting this. Fixed in my local version.

> I don't understand this bit:
> 
>     /* Don't map the first page, we only write via its physical address
>      * into it. */
>     for (i = 0; i < I830_CC_DANCE_PAGES; i++) {
>         writel(agp_bridge->driver->mask_memory(agp_bridge,
>                 page_to_phys(intel_private.i8xx_pages[i+1]), 0),
>                intel_private.registers+I810_PTE_BASE+((num_entries+i)*4));
>     }
> 
> The first page is i8xx_cpu_flush_page, but if it isn't mapped, the gmch doesn't
> know about it, and intel_flush_mch_write_buffer() has no effect, has it? Or is
> any write at any address sufficient to fill the write buffer?

This just implements the gtt mapping. The direct mapping using the physical
address is done a few lines before. And because
intel_flush_mch_write_buffer only needs a direct mapping, I've decided to
save one gtt page.
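
To make the indexing concrete, here is a small userspace sketch of the arithmetic in the loop quoted above. It only mirrors the expression `I810_PTE_BASE + (num_entries + i) * 4` and the `i8xx_pages[i+1]` index; the function names are hypothetical, and the base offset and entry count are passed in as parameters rather than taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset (from the register base) of the GTT entry used for
 * dance slot i: 4-byte entries placed right after the regular ones. */
uint32_t dance_pte_offset(uint32_t pte_base, uint32_t num_entries, uint32_t i)
{
	uint32_t entry = num_entries + i;

	assert(entry >= num_entries);	/* no wrap-around */
	return pte_base + entry * 4u;
}

/* Index into i8xx_pages[] backing dance slot i. Page 0 (the flush
 * page) is skipped, so it never gets a GTT entry of its own. */
uint32_t dance_backing_page(uint32_t i)
{
	return i + 1;
}
```

With made-up inputs of a 0x10000 base and 32768 regular entries, slot 0 lands at offset 0x30000 and is backed by i8xx_pages[1].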

> We seem to have mysterious behaviour here, all the canary pages ended up
> coherent, but that one write somehow didn't?!
> 
> I guess the gmch has a local cache that hides writes. If you know that cache's
> design (associativity etc.) then you can probably flush it out by doing a read
> or write to the right address. The canary stuff seems to work most of the time,
> so the cache can't be too big.

Well, that's exactly the problem. No one knows how it works exactly ...

> Or maybe you can flush it out by putting the chip in D1-3 and back to D0
> quickly, or something crazy like that.

That one probably takes even longer than what I'm doing here ...
Comment 169 René Gabriëls 2010-05-20 14:44:21 UTC
(In reply to comment #163)
> So far (5 days of testing) v9 works flawlessly here: no crashes or artefacts.

Meh, font-render errors in Emacs with v9 patch.  This was not the case with v7 patch AFAIK.  Some fonts aren't drawn at all, some aren't cleared after deleting, and some fonts are rendered too fat (twice maybe, with slightly different position?)
Comment 170 Branimir 2010-05-21 01:59:23 UTC
Created attachment 35778 [details]
dmesg + i915_error_state

dmesg + i915_error_state
Comment 171 Branimir 2010-05-21 02:02:38 UTC
Hey guys I would really like to thank you All and especially Daniel for your hard work for solving this problem. I was struggling with it since I upgraded from Slackware 12.2 to 13.0. I've been making long searches in internet and finding many people with the same problem and not a single solution. I'm really happy that finally there is a real chance to get this solved. 

Yesterday I built a kernel with patch v9 and the crashes stopped. Finally I could upgrade the intel driver from 2.3.2-legacy to 2.11. Unfortunately there were several problems:

1.The video performance slowed down and now I can't watch HD videos any more (I worked so hard last month to get HD 720p working on my old laptop :( )
If I boot without KMS and the legacy driver the video is faster, but no 3D accel:

(EE) AIGLX error: i915 does not export required DRI extension
(EE) AIGLX: reverting to software rendering
(EE) AIGLX error: dlopen of /usr/lib/xorg/modules/dri/swrast_dri.so failed (/usr/lib/xorg/modules/dri/swrast_dri.so: cannot open shared object file: No such file or directory)
(EE) GLX: could not load software renderer
(II) GLX: no usable GL providers found for screen 0

2. When I tried to open a game with wine I had a crash again:

(WW) intel(0): i830_uxa_pixmap_swap_bo_with_image: bo map failed 
(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
(WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

I would like to help you if I can. I don't have knowledge of low level programming but at least with testing.

I have attached dmesg and i915_error_state (sorry that it's in the previous post)

And one more thing:
I have 
kernel-2.6.34-rc6 from drm-intel-next + v9 patch
libdrm-2.4.20
xf86-video-intel-2.11
mesa-7.8.1
xorg-server-1.6.3

I would just like to ask how you are upgrading xorg-server, because there are many packages related to it and I don't feel like spending days building all of X11.

Thanks!
Comment 172 Daniel Vetter 2010-05-21 09:59:30 UTC
On Fri, May 21, 2010 at 02:02:41AM -0700, bugzilla-daemon@freedesktop.org wrote:
> 1.The video performance slowed down and now I can't watch HD videos any more (I
> worked so hard last month to get HD 720p working on my old laptop :( )
> If I boot without KMS and the legacy driver the video is faster, but no 3D
> accel:

The mesa you have doesn't support non-kms anymore. So if you haven't
downgraded that, too, dead-slow opengl is expected ;)

> (EE) AIGLX error: i915 does not export required DRI extension
> (EE) AIGLX: reverting to software rendering
> (EE) AIGLX error: dlopen of /usr/lib/xorg/modules/dri/swrast_dri.so failed
> (/usr/lib/xorg/modules/dri/swrast_dri.so: cannot open shared object file: No
> such file or directory)
> (EE) GLX: could not load software renderer
> (II) GLX: no usable GL providers found for screen 0
> 
> 2. When I tried to open a game with wine I had again crash:
> 
> (WW) intel(0): i830_uxa_pixmap_swap_bo_with_image: bo map failed 
> (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
> (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
> (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error 
> (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or
> even a frozen display: Input/output error.
> 
> I would like to help you if I can. I don't have knowledge of low level
> programming but at least with testing.
> 
> I have attached dmesg and i915_error_state (sorry that it's in the previous
> post)

I've taken a quick look. The gpu jumped to a location where there's no
batchbuffer. Likely some memory corruption, but can't say for sure.

> And one more thing:
> I have 
> kernel-2.6.34-rc6 from drm-intel-next + v9 patch
> libdrm-2.4.20
> xf86-video-intel-2.11
> mesa-7.8.1
> xorg-server-1.6.3
> 
> just I would like to ask how are you upgrading xorg-server cause there are many
> packages related to it and I don't feel like spending days to build all X11.

You don't have to. But it would be great if you could upgrade to latest
libdrm and xf86-video-intel from git master, there have been quite a few
bug-fixes since the last release. Get them from
http://cgit.freedesktop.org/mesa/drm
http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Compile&install libdrm first (otherwise you get a non-working system).

Thanks for testing, Daniel
Comment 173 Brian Rogers 2010-05-27 01:33:27 UTC
I've set up an Ubuntu PPA with linux 2.6.34 + drm-intel-next + fix-i855-cache-coherency-v9.patch at https://launchpad.net/~brian-rogers/+archive/graphics-fixes

I just posted this PPA to some downstream bug reports. Where should I direct the feedback? Should I tell people to report here directly, or should I try to collect and aggregate feedback and report totals here?

And what's the status of this patch? Waiting for more feedback, in need of revision, or what?
Comment 174 Daniel Vetter 2010-05-27 09:15:12 UTC
> --- Comment #173 from Brian Rogers <brian@xyzw.org> 2010-05-27 01:33:27 PDT ---
> I just posted this PPA to some downstream bug reports. Where should I direct
> the feedback? Should I tell people to report here directly, or should I try to
> collect and aggregate feedback and report totals here?

IMHO gathering the feedback and aggregating interesting/special stuff here
is the best option. This bug report is already rather crowded as-is.

> And what's the status of this patch? Waiting for more feedback, in need
> of revision, or what?
The preparatory stuff missed the .35 merge window, so currently nothing's
gonna happen. I intend to submit it for .36 - but I'm slightly uneasy
with the fact that some systems still show reports of failed gtt flushes.
Comment 175 Lee Matheson 2010-05-27 12:23:08 UTC
(In reply to comment #174)
> > --- Comment #173 from Brian Rogers <brian@xyzw.org> 2010-05-27 01:33:27 PDT ---
 
> IMHO gathering the feedback and aggregating interesting/special stuff here
> is the best option. This bug report is already rather crowded as-is.

Apologies, but I cannot see where else to post the feedback.

I downloaded and burned the Ubuntu liveCD referenced here: http://glasen-hardt.de/?p=568 which I believe has this patch. I then booted my Fujitsu-Siemens Amilo 7400M (w/1.25 GB RAM) with Intel i855 GM graphics from that "unofficial" Ubuntu community liveCD that has the patch, and it booted and worked with the Intel driver (I played for 2 hours with xrandr (driving an external display), wireless, and special desktop effects). openSUSE-11.1 w/2.6.27 kernel was the last Linux that worked on this laptop. Neither openSUSE-11.3 Milestone7 nor the released Fedora-13 works with the Intel driver on this laptop.  I am in favour of this patch being sent upstream.
Comment 176 legolas558 2010-05-29 11:26:32 UTC
Created attachment 35946 [details]
2 flush failures with latest version of patch

These are 2 failures that happened during normal usage. I have the latest version of the patch, with the copy/paste bug fixed manually.

These flush failures happen without anything being noticeable on the user side. I am also using xv_overlay_mode_fix.diff to limit the crashes when watching videos (the 1st flush failure in the attachment happened without it, but does not seem related), and it seems to work (crashes reduced from 1/hour to 0.5/day).
Comment 177 legolas558 2010-05-30 08:47:53 UTC
*** Bug 24789 has been marked as a duplicate of this bug. ***
Comment 178 nepo 2010-05-31 14:32:35 UTC
Hey hey, I've been trying the current glasen Kubuntu CD and thought it was working fine, until I tried to surf to:
http://www.ardmediathek.de/
The massive use of Flash etc. seems to be too much for the patch: first strange colour stripes appear on screen, and then the X server crashes. Sorry, I don't know which log file to post! D.
Comment 179 nepo 2010-05-31 14:37:37 UTC
Kubuntu 10.04
xserver-xorg-video-intel: 2:2.11.0+git20100531~glasen~ppa1
855gm-fix-dkms: 0.6.2~glasen~ppa1
Kernel 2.6.32-22.33
Comment 180 René Gabriëls 2010-06-03 15:50:05 UTC
(In reply to comment #179)
> Hey hey, i've been trying the present glasen-Kubuntu-CD and thought it's
> working fine, until i tried to surf to:
> http://www.ardmediathek.de/
> The massive use of flash etc seems to be too much for the patch: First strange
> colour stripes on screen and then crash of Xserver. Sorry - I don't know to
> post which log file! D.
> 
> Kubuntu 10.04
> xserver-xorg-video-intel: 2:2.11.0+git20100531~glasen~ppa1
> 855gm-fix-dkms: 0.6.2~glasen~ppa1
> Kernel 2.6.32-22.33

This website also freezes my Xserver.  Cursor is still moving, but that's about it.  Xorg.log states:

[ 10152.357] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[ 10152.357] 
Backtrace:
[ 10152.357] 0: /usr/bin/X (xorg_backtrace+0x3c) [0x80ebabc]
[ 10152.357] 1: /usr/bin/X (mieqEnqueue+0x1f5) [0x80eb3d5]
[ 10152.357] 2: /usr/bin/X (xf86PostMotionEventP+0xc8) [0x80c5f68]
[ 10152.357] 3: /usr/lib/xorg/modules/input/evdev_drv.so (0xb73de000+0x34ff) [0xb73e14ff]
[ 10152.357] 4: /usr/lib/xorg/modules/input/evdev_drv.so (0xb73de000+0x37e6) [0xb73e17e6]
[ 10152.357] 5: /usr/bin/X (0x8048000+0x6c8cf) [0x80b48cf]
[ 10152.357] 6: /usr/bin/X (0x8048000+0x127b2a) [0x816fb2a]
[ 10152.357] 7: (vdso) (__kernel_sigreturn+0x0) [0xb7849400]
[ 10152.357] 8: /usr/lib/libpixman-1.so.0 (0xb760b000+0x5de3a) [0xb7668e3a]
[ 10152.357] 9: /usr/lib/libpixman-1.so.0 (0xb760b000+0x17193) [0xb7622193]
[ 10152.357] 10: /usr/lib/libpixman-1.so.0 (pixman_blt+0x78) [0xb7648108]
[ 10152.357] 11: /usr/lib/xorg/modules/libfb.so (fbCopyNtoN+0x1ad) [0xb737777d]
[ 10152.357] 12: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb7384000+0x32691) [0xb73b6691]
[ 10152.357] 13: /usr/bin/X (miCopyRegion+0x1ba) [0x81a140a]
[ 10152.357] 14: /usr/bin/X (miDoCopy+0x475) [0x81a19b5]
[ 10152.357] 15: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb7384000+0x31ebf) [0xb73b5ebf]
[ 10152.357] 16: /usr/bin/X (0x8048000+0xdf21f) [0x812721f]
[ 10152.358] 17: /usr/bin/X (0x8048000+0x27dbc) [0x806fdbc]
[ 10152.358] 18: /usr/bin/X (0x8048000+0x294c7) [0x80714c7]
[ 10152.358] 19: /usr/bin/X (0x8048000+0x1da8b) [0x8065a8b]
[ 10152.358] 20: /lib/libc.so.6 (__libc_start_main+0xe2) [0xb7495bb2]
[ 10152.358] 21: /usr/bin/X (0x8048000+0x1d641) [0x8065641]

The problem with all these kinds of bugs is that as long as Daniel's patches aren't upstream, it's hard to work out where the problem is.  My guess is that this particular bug has nothing to do with Daniel's patches, but is a bug in xf86-video-intel.  I can try to test without Daniel's patches, but chances are I won't even be able to start Firefox before X.org crashes.
Comment 181 legolas558 2010-06-03 17:13:43 UTC
(In reply to comment #180)
> (In reply to comment #179)
> > Hey hey, i've been trying the present glasen-Kubuntu-CD and thought it's
> > working fine, until i tried to surf to:
> > http://www.ardmediathek.de/
> > The massive use of flash etc seems to be too much for the patch: First strange
> > colour stripes on screen and then crash of Xserver. Sorry - I don't know to
> > post which log file! D.
[SNIP]
> 
> The problem with al these kind of bugs is that as long as Daniel's patches
> aren't upstream, it's hard to work out where the problem is.  My guess is that
> this particular bug has nothing to do with Daniel's patches, but is a bug in
> xf86-video-intel.  I can try to test without Daniel's patches, but chances are
> I won't even be able to start Firefox before X.org crashes.

No crash for my 855GM rev02; I am using mainstream git linux with v9 patch.
Comment 182 Daniel Vetter 2010-06-04 01:42:25 UTC
> --- Comment #180 from René Gabriëls <renegabriels@gmail.com> 2010-06-03 15:50:05 PDT ---
> The problem with al these kind of bugs is that as long as Daniel's patches
> aren't upstream, it's hard to work out where the problem is.  My guess is that
> this particular bug has nothing to do with Daniel's patches, but is a bug in
> xf86-video-intel.  I can try to test without Daniel's patches, but chances are
> I won't even be able to start Firefox before X.org crashes.

As long as there's nothing in dmesg about the gpu hanging it's rather
likely that this is a different bug. Not really surprising given that this
cache coherency problem seems to prevent tons of users from testing the
latest & greatest.
Comment 183 René Gabriëls 2010-06-07 08:42:43 UTC
(In reply to comment #182)
> As long as there's nothing in dmesg about the gpu hanging it's rather
> likely that this is a different bug. Not really suprising given that this
> cache coherency problem seems to prevent tons of users from testing the
> latest & greatest.

I have tested this site with a variety of kernels:

2.6.33 + Firefox: works
2.6.33 + Opera: works
2.6.34 + Firefox: hang
2.6.34 + Opera: works
2.6.34-rc6 + v9 + Firefox: hang (sometimes works)
2.6.34-rc6 + v9 + Opera: works

In other words, there seems to be no correlation between this hang and either this bug or Daniel's patch for it.  It is probably a regression introduced between kernel 2.6.33 and 2.6.34-rc6.
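
The reasoning behind that conclusion can be checked mechanically. The following is only a sketch: the table is transcribed from the list above, the "hang (sometimes works)" row is recorded simply as a hang, and the `RESULTS` and `values_seen` names are made up for illustration.

```python
# Observations transcribed from the list above:
# (kernel, v9 patch applied, browser, outcome)
RESULTS = [
    ("2.6.33", False, "Firefox", "works"),
    ("2.6.33", False, "Opera", "works"),
    ("2.6.34", False, "Firefox", "hang"),
    ("2.6.34", False, "Opera", "works"),
    ("2.6.34-rc6", True, "Firefox", "hang"),  # "hang (sometimes works)"
    ("2.6.34-rc6", True, "Opera", "works"),
]


def values_seen(results, outcome, column):
    """Return the set of values in `column` among rows with the given outcome."""
    return {row[column] for row in results if row[3] == outcome}


hang_kernels = values_seen(RESULTS, "hang", 0)
hang_patch_states = values_seen(RESULTS, "hang", 1)
hang_browsers = values_seen(RESULTS, "hang", 2)

# The hang shows up both with and without the patch, only with Firefox,
# and never on 2.6.33.
print("kernels with hangs:", sorted(hang_kernels))
print("patch states with hangs:", sorted(hang_patch_states))
print("browsers with hangs:", sorted(hang_browsers))
```

Since hangs occur with the patch both applied and not applied, the patch state alone does not explain them in this small sample.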
Comment 184 Chris Wilson 2010-06-29 01:02:19 UTC
*** Bug 28796 has been marked as a duplicate of this bug. ***
Comment 185 SMF 2010-06-30 01:20:42 UTC
(In reply to comment #184)
> *** Bug 28796 has been marked as a duplicate of this bug. ***

Chris Wilson directed me here while I was diagnosing cache coherency problems with my laptop setup. I have looked at the V9 patch and I am not convinced that it is appropriate for use with my stock 2.6.34 kernel. I am looking for advice as to which kernel (and/or other components) I should use to get fully up to speed with this issue. My objective is to contribute to a solution if possible, or at least provide a test resource for others.

thanks

My System Spec:
IBM ThinkPad R51, model 2889SG1
CPU: Intel(R) Pentium(R) M processor 1.70GHz stepping 06
agpgart-intel 0000:00:00.0: Intel 855GM Chipset
agpgart-intel 0000:00:00.0: detected 8060K stolen memory
agpgart-intel 0000:00:00.0: AGP aperture is 128M @ 0xe0000000
Kernel 2.6.34
Running recent Development LFS system with the following X11 components:
XServer 1.8.1
Mesa 7.8.2
libdrm-2.4.21
xf86-video-intel-2.12
Comment 186 Brian Rogers 2010-07-01 07:14:50 UTC
The v9 patch will not apply to plain 2.6.34. It was based on the drm-intel-next branch which has since been merged. So the easiest thing to do is apply the patch to 2.6.35-rc3.
Comment 187 Stefan Glasenhardt 2010-07-01 10:56:23 UTC
Hi,

I've backported the v9-patch to several kernel versions. You can download them from my homepage :

http://glasen-hardt.de/?page_id=707
Comment 188 SMF 2010-07-02 01:53:21 UTC
(In reply to comment #187)
> Hi,
> 
> I've backported the v9-patch to several kernel versions. You can download them
> from my homepage :
> 
> http://glasen-hardt.de/?page_id=707

Hi,

Thanks but I have already set my system up with 2.6.35-rc3 + V9.
I have also updated my X server to 1.8.2.

My test scenario is to run two 3D apps (glxgears and atlantis from xscreensaver) and an Xv mplayer AVI loop with fvwm2 as the window manager. After about four hours my kernel reported the flush failure that others have seen earlier. CPU usage was low (< 10%) at all times (CPU throttled back to 600 MHz by acpi_cpufreq), but the displays were not smooth and stalled on occasion.
Is my system showing the expected behaviour for the current state of development?

thanks.

WARNING: at drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2f5/0x400()
Hardware name: 2889SG1
i8xx chipset flush failed, expected: 113920, cpu_read: 113408
Modules linked in: nfs nfsd lockd sunrpc exportfs microcode usbhid uhci_hcd pcmcia thinkpad_acpi ehci_hcd hwmon rfkill rtc_cmos led_class snd_intel8x0 8250_pnp yenta_socket floppy usbcore 8250_pci rtc_core nvram battery ac ide_cd_mod rtc_lib 8250 pcmcia_rsrc e100 snd_ac97_codec psmouse nls_base pcmcia_core cdrom ac97_bus serial_core i2c_i801 thermal rng_core snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore acpi_cpufreq processor mperf unix
Pid: 2552, comm: X Not tainted 2.6.35-rc3 #1
Call Trace:
 [<c1027748>] ? warn_slowpath_common+0x78/0xb0
 [<c119a2b5>] ? intel_i830_chipset_flush+0x2f5/0x400
 [<c119a2b5>] ? intel_i830_chipset_flush+0x2f5/0x400
 [<c1027813>] ? warn_slowpath_fmt+0x33/0x40
 [<c119a2b5>] ? intel_i830_chipset_flush+0x2f5/0x400
 [<c119502c>] ? agp_flush_chipset+0xc/0x10
 [<c11bdaac>] ? i915_gem_object_flush_cpu_write_domain+0x2c/0x40
 [<c11bfbba>] ? i915_gem_object_set_to_gtt_domain+0x3a/0x80
 [<c11d8d1f>] ? intel_overlay_do_put_image+0x7f/0x7c0
 [<c11d9c4c>] ? intel_overlay_put_image+0x58c/0x770
 [<c11cacb6>] ? intel_mark_busy+0x1d6/0x1e0
 [<c11a2c67>] ? drm_ioctl+0x157/0x330
 [<c11d96c0>] ? intel_overlay_put_image+0x0/0x770
 [<c1073288>] ? handle_mm_fault+0x208/0x7a0
 [<c109486f>] ? do_vfs_ioctl+0x8f/0x610
 [<c101deb7>] ? do_page_fault+0x197/0x3b0
 [<c11a2b10>] ? drm_ioctl+0x0/0x330
 [<c10424ca>] ? ktime_get_ts+0x10a/0x140
 [<c112b763>] ? copy_to_user+0x33/0x70
 [<c1094e2d>] ? sys_ioctl+0x3d/0x70
 [<c1002b10>] ? sysenter_do_call+0x12/0x26
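
For anyone collecting these warnings from several machines, a small script can pull the two counters out of a saved dmesg. This is a minimal sketch that assumes the message keeps exactly the wording shown in the trace above; `FLUSH_RE` and `parse_flush_failures` are illustrative names, not part of any existing tool.

```python
import re

# Matches the warning text exactly as it appears in the trace above.
FLUSH_RE = re.compile(
    r"i8xx chipset flush failed, expected: (\d+), cpu_read: (\d+)"
)


def parse_flush_failures(log_text):
    """Return (expected, cpu_read) pairs for every flush-failure line found."""
    return [(int(exp), int(got)) for exp, got in FLUSH_RE.findall(log_text)]


sample = "i8xx chipset flush failed, expected: 113920, cpu_read: 113408"
for expected, cpu_read in parse_flush_failures(sample):
    # Print both counters and how far the CPU read lags behind.
    print(expected, cpu_read, expected - cpu_read)
```

For the line quoted above, the two values differ by 512. The script only extracts the numbers; it does not interpret them.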
Comment 189 SMF 2010-07-07 01:25:15 UTC
Created attachment 36797 [details]
dmesg debufs and X server output

Updated the kernel to 2.6.35-rc4 + V9. Running a small benchmark program of mine that exercises X, the file system and the CPU results in this GPU hang every time (on the first run, so far).

Is this the same issue, or have I drifted off the thread?

thanks
Comment 190 Andrej Podzimek 2010-07-09 08:12:49 UTC
I appreciate the effort of all the people trying to fix this problem, but please let me ask a question: Would it be possible to make the "legacy" driver work with current X.org again? Wouldn't that be easier than trying to fix the bug? It is hard to believe the bug will ever get fixed. It only affects old hardware, so I don't think Intel (or any other employer) is willing to spend money on this...

I remember there was a "legacy" driver in early 2009 that *just* *worked*. (ArchLinux had a package called xf86-video-intel-legacy.) It was meant to be a temporary workaround until this bug was *fixed*. Unfortunately, and despite the fact that the bug has not been fixed, all support for the legacy driver was dropped months ago.

There has been (de facto) no support for old Intel graphic chipsets on Linux since April 2009. :-( Using 1-year-old packages is not an option for most people. If the legacy driver could work with the latest X.org again, it would be just great...
Comment 191 Szabo, Akos 2010-07-09 12:30:18 UTC
(In reply to comment #190)
> please let me ask a question: Would it be possible to make the "legacy" driver
> work with current X.org again? Wouldn't that be easier than trying to fix the

I totally agree with this!
My Intel video driver _used to work_ with my hardware (00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)) with XAA. I could use compiz, xvideo, etc., anything I wanted. Now I get a crappy but "graphical" boot screen with KMS (I could do the same thing with intelfb many years ago), but if I want to use X, I must use the fbdev driver in xorg.conf. With the newest kernel (2.6.35-rcX-gitY) I can use the intel driver, but have no luck with compiz, for example.

So a legacy driver would be nice :)
Comment 192 Tony White 2010-07-09 14:54:27 UTC
@Andrej Podzimek & Szabo, Akos :
Did you actually try the patch or are you both just being objective?
Comment 193 Andrej Podzimek 2010-07-09 17:26:37 UTC
(In reply to comment #192)
> @Andrej Podzimek & Szabo, Akos :
> Did you actually try the patch or are you both just being objective?

In my case, the v9 patch improved the situation. Instead of freezing the whole kernel (beyond magic SysRq) right after X.org startup (which is what the mainline kernel + current X.org does), I could "only" see the well-known X-server freeze immediately after login. (I saw it the second my mouse cursor touched an icon on the desktop. Highlighting the icon obviously triggered the freeze.) But there was nothing that could be called a usable desktop.

Furthermore, testing the patches is *extremely* difficult for me, since all the Intel laptops I care about use the Reiser4 file system. Reiser4 patches are available for mainline kernels only. So in fact I had to give up testing most versions of the patch. (The mainline kernel was so different that patching it manually was simply impossible for someone unfamiliar with the code.) That said, it is well possible that I applied the patch incorrectly and caused some other problems due to a typo...

To sum up, one of the following would help:
1) A patch against the mainline kernel (such as 2.6.34.1) to which Reiser4 could be applied as well.
2) A live distro that would use the patch, just to test it. (Does it exist?)
Comment 194 Szabo, Akos 2010-07-10 08:07:42 UTC
(In reply to comment #192)
> @Andrej Podzimek & Szabo, Akos :
> Did you actually try the patch or are you both just being objective?

Yep. As I wrote: now I can use X with the latest patches, which I think no Fedora kernel has ever contained (I use the rawhide kernel now).
But I can't use X the way I could with Fedora 10: no compiz, no (minimal) 3D, and any complex video output makes the system unusable. A year and a half ago I could play Fallout 2 with Wine; now it just freezes the display, sometimes with a nice blue screen, sometimes not. And every time X freezes, I can still log in through ssh.
Comment 195 legolas558 2010-07-10 13:36:44 UTC
The v9 patch works perfectly with the latest kernel. However, I have not tested compiz or any other complex 3D, so it is possible that it crashes in such cases.
Comment 196 Andrej Podzimek 2010-07-10 14:10:42 UTC
(In reply to comment #195)
> v9 patch works perfectly with latest kernel.

Well, it might work fine on *your* hardware, but that does not imply it "works perfectly" in general. Believe it or not, there may still be (and in fact really *are*) many people observing hangs and crashes.

BTW, what do you mean by "latest kernel"? I can only patch against the mainline, due to Reiser4. If there were a patch against the mainline 2.6.34.x kernel, I could easily test every single patch version with every single kernel release.

> However I have not tested compiz and any other complex 3D...

Neither have I, but it crashes anyway. Furthermore, compositing has become a standard feature. Obviously, switching all acceleration off is *not* a usable workaround. The legacy driver was *perfectly* stable, and weeks of uptime with (accelerated) KDE 4, DVB-T and simple 3D games were not a problem.
Comment 197 legolas558 2010-07-11 02:59:00 UTC
(In reply to comment #196)
> (In reply to comment #195)
> > v9 patch works perfectly with latest kernel.
> 
> Well, it might work fine on *your* hardware, but that does not imply it "works
> perfectly" in general. Believe it or not, the may still be (and in fact really
> *are*) many people observing hangs and crashes.
> 
I thought it was implied that I had not changed my hardware since last test, it's still i855GM rev02.

I know that there are hangs and crashes; it's just that since the v9 patch was created I am no longer experiencing them. Overlays are also working fine.

> BTW, what do you mean by "latest kernel"? I can only patch against the
> mainline, due to Reiser4. If there was a patch agains the mainline 2.6.34.x
> kernel, I could easily test every single patch version with every single kernel
> release.
> 

2.6.35-rc4

There are ports of the patch to other versions as well; you should check them out in comment 187.

> > However I have not tested compiz and any other complex 3D...
> 
> Neither have I, but it crashes anyway. Furthermore, compositting has become a
> standard feature. Obviously, switching all acceleration off is *not* a usable
> workaround. The legacy driver was *perfectly* stable and weeks of uptime with
> (accelerated) KDE 4, DVB-T and simple 3D games were not a problem.

The legacy driver is no longer an option; this was explained in this bug and in bug 26345.

I am sure you can track down the cause of the crash and the related bug; it could be some part of the Xorg stack, or a new bug.
Comment 198 legolas558 2010-07-12 03:52:51 UTC
(In reply to comment #189)
> Created an attachment (id=36797) [details]
> dmesg debufs and X server output
> 
> Updated kernel to 2.6.35.rc4 + V9 and running a small benchmark program that I
> have that exercises X, the file system and the CPU results in this GPU hang
> every time (on first run - sofar).
> 
> Is this the same issue or have I drifted off the thread ?
> 
> thanks

Can you provide the source of that program? If I can reproduce the same crash on my i855GM rev02, it would make a good test case.
Comment 199 SMF 2010-07-15 03:02:10 UTC
(In reply to comment #198)
> (In reply to comment #189)
> > Created an attachment (id=36797) [details]
> > dmesg debufs and X server output
> > 
> > Updated kernel to 2.6.35.rc4 + V9 and running a small benchmark program that I
> > have that exercises X, the file system and the CPU results in this GPU hang
> > every time (on first run - sofar).
> > 
> > Is this the same issue or have I drifted off the thread ?
> > 
> > thanks
> 
> can you provide sources of such program? If I can verify the same crash with my
> i855GM rev02 then it would be a testcase program

I have discovered that if I run just the X server (no window manager etc.) and my benchmark program, the GPU does not hang. So I think I will explore why the window manager (fvwm2) and the other desktop apps that I use contribute to the regular GPU hang I originally reported. If I can find a solid, repeatable scenario, I will be happy to release my test program.

thanks
Comment 200 Chris Wilson 2010-07-17 07:35:46 UTC
*** Bug 25086 has been marked as a duplicate of this bug. ***
Comment 201 SMF 2010-07-17 11:41:57 UTC
Created attachment 37157 [details]
Test Program
Comment 202 SMF 2010-07-17 11:44:33 UTC
Comment on attachment 37157 [details]
Test Program

See README for context and usage.
Comment 203 SMF 2010-07-23 01:38:17 UTC
(In reply to comment #202)
> (From update of attachment 37157 [details])
> See README for context and usage.

Updated the kernel to 2.6.35-rc6 + V9; the GPU hang is still reported, as before, with the test program.
Comment 204 George Pichurov 2010-07-30 15:07:44 UTC
1. Sometimes the title bar of windows disappears completely, and when the windows are closed, they still remain on the task bar. It is impossible to switch to another desktop in such a case.

2. Other times the system hangs. Nothing helps but a hard reset.
Comment 205 legolas558 2010-07-31 08:15:57 UTC
Seems like we are back to square 1 with most recent 2.6.35 git update.

I don't know what has been touched, but the v9 patch is no longer effective.

I can only get a black screen (display is ON but no output, only a plain black surface at a  possibly low resolution).
Comment 206 René Gabriëls 2010-07-31 12:59:16 UTC
(In reply to comment #205)
> Seems like we are back to square 1 with most recent 2.6.35 git update.

I'm running 2.6.35-rc6 (July 22) + v9 patch, which works for me. There have been a number of Intel DRM updates since though.  I'll see if I encounter the same problem with latest git and then bisect.

PS: do any of you have problems with full screen video or OpenGL apps? My system crashes immediately after starting such apps (flash, doom, warzone, etc.).
Comment 207 Indan Zupancic 2010-07-31 16:00:37 UTC
(In reply to comment #206)
> (In reply to comment #205)
> > Seems like we are back to square 1 with most recent 2.6.35 git update.
> 
> I'm running 2.6.35-rc6 (juli 22) + v9 patch, which works for me. There have
> been a number of Intel DRM updates since though.  I'll see if I encounter the
> same problem with latest git and then bisect.
> 
> PS: do any of you have problems with full screen video or OpenGL apps? My
> system crashes immediately after starten such apps (flash, doom, warzone,
> etc.).

I've been running the v9 patch on the same system since 11-5-2010, on 2.6.34-rc6 with recently updated userspace, and it has been very stable. No glitches, no crashes, no errors in dmesg. So it seems that something like the v9 patch should be pushed upstream, because it seems to work. It doesn't fix all bugs, but it makes things a lot more stable.

I can run video fine in full screen in VLC. My laptop isn't fast enough to play flash full screen in a smooth way, nor to do real 3D stuff.
Comment 208 Andrej Podzimek 2010-07-31 18:13:06 UTC
<off_topic>OpenSolaris is affected by this problem as well. Seems like last hope is gone. :-D</off_topic>
Comment 209 René Gabriëls 2010-07-31 18:53:58 UTC
(In reply to comment #206)
> I'm running 2.6.35-rc6 (juli 22) + v9 patch, which works for me. There have
> been a number of Intel DRM updates since though.  I'll see if I encounter the
> same problem with latest git and then bisect.

Latest git + v9 patch works for me. However, another bug (invisible cursor) was introduced, that I needed to track down and fix.

Legolas: you can do a git-bisect of the kernel source to find out which patch introduced the bug that affects your machine.
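
For readers unfamiliar with bisecting, the idea is a binary search over the commit history between a known-good and a known-bad kernel. The sketch below is a simplified model on a linear list of made-up commit names; real git history has merges, which git bisect handles itself, and `first_bad` is an illustrative helper rather than anything git provides.

```python
def first_bad(commits, is_bad):
    """Binary-search an ordered commit list (oldest first) for the first bad one.

    Assumes commits[0] is good, commits[-1] is bad, and that the bug stays
    present once it has been introduced.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one kernel build + boot + test per iteration
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical history of 1000 commits where commit 637 introduced the bug.
commits = [f"c{i}" for i in range(1000)]
culprit, tests = first_bad(commits, lambda c: int(c[1:]) >= 637)
print(culprit, "found after", tests, "tests")
```

Each test halves the remaining range, so even a thousand candidate commits need only about ten build-and-boot cycles.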

Indan: for me the v9 patch primarily gets rid of font rendering errors.  Without it, my system is quite stable now modulo opengl/flash. I don't care too much for them either, but it also means compositing (and thus compiz/GNOME 3) is out of the question.

Let's hope Intel will not decide to simply stop supporting the 855 chipset, which Phoronix is hinting at.
Comment 210 legolas558 2010-08-02 04:54:21 UTC
(In reply to comment #206)
> (In reply to comment #205)
> > Seems like we are back to square 1 with most recent 2.6.35 git update.
> 
> I'm running 2.6.35-rc6 (juli 22) + v9 patch, which works for me. There have
> been a number of Intel DRM updates since though.  I'll see if I encounter the
> same problem with latest git and then bisect.
> 
> PS: do any of you have problems with full screen video or OpenGL apps? My
> system crashes immediately after starten such apps (flash, doom, warzone,
> etc.).

Bisect just completed, the bad commit is:

592d32cc4156ee512e55c5bc052fdece215f52b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6

Strangely, it modifies the i915 driver although that is not a USB thing. I am inspecting the diff right now, but it seems like a blunder.

I'll post my findings upstream as soon as I have finished with it.
Comment 211 legolas558 2010-08-02 05:08:19 UTC
(In reply to comment #210)
> (In reply to comment #206)
> > (In reply to comment #205)
> > > Seems like we are back to square 1 with most recent 2.6.35 git update.
> > 
> > I'm running 2.6.35-rc6 (juli 22) + v9 patch, which works for me. There have
> > been a number of Intel DRM updates since though.  I'll see if I encounter the
> > same problem with latest git and then bisect.
> > 
> > PS: do any of you have problems with full screen video or OpenGL apps? My
> > system crashes immediately after starten such apps (flash, doom, warzone,
> > etc.).
> 
> Bisect just completed, the bad commit is:
> 
> 592d32cc4156ee512e55c5bc052fdece215f52b2 Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
> 
> It modifies (strangely) the i915 driver although it's not an USB-thing. I am
> inspecting the diff right now but it seems like a blunder.
> 
> I'll post my findings upstream as soon as I have finished with it

Most of the patch has already been reverted, except the change reverted by the following small patch (which I am now testing):

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index 5dde80f..8608462 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -806,7 +806,6 @@ intel_dp_dpms(struct drm_encoder *encoder, int mode)
                        intel_dp_link_train(intel_encoder, dp_priv->DP, dp_priv->link_configuration);
                        if (IS_eDP(intel_encoder)) {
                                ironlake_edp_panel_on(dev);
-                               ironlake_edp_backlight_on(dev);
                        }
                }
        }

This is possibly the only (related) difference.
Comment 212 legolas558 2010-08-02 07:04:47 UTC
Sorry for the noise, I can't reproduce the problem anymore. It was verifiable before, but must have been a compilation glitch of some sort. FTR, this was my bisection log:

git bisect start
# good: [1afaab90e8c0317170a53967064a934a77a59c16] Input: w90p910_keypad - change platfrom driver name to 'nuc900-kpi'
git bisect good 1afaab90e8c0317170a53967064a934a77a59c16
# bad: [a63ecd835f075b21d7d5cef9580447f5fbb36263] Merge master.kernel.org:/home/rmk/linux-2.6-arm
git bisect bad a63ecd835f075b21d7d5cef9580447f5fbb36263
# good: [2aa72f612144a0a7d4b0b22ae7c122692ac6a013] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
git bisect good 2aa72f612144a0a7d4b0b22ae7c122692ac6a013
# good: [4609a179c97ae60fef173547a9bbb214359808ce] ARM: Fix csum_partial_copy_from_user()
git bisect good 4609a179c97ae60fef173547a9bbb214359808ce
# good: [4609a179c97ae60fef173547a9bbb214359808ce] ARM: Fix csum_partial_copy_from_user()
git bisect good 4609a179c97ae60fef173547a9bbb214359808ce
# bad: [592d32cc4156ee512e55c5bc052fdece215f52b2] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
git bisect bad 592d32cc4156ee512e55c5bc052fdece215f52b2
# good: [0e1cf38889110a7188999388614aef17a84d9d25] Merge branch 'bugzilla-16396' into release
git bisect good 0e1cf38889110a7188999388614aef17a84d9d25
# good: [809cd1cb80d7dffe75dc94bc94ef2aab3dadc86a] USB: Fix USB3.0 Port Speed Downgrade after port reset
git bisect good 809cd1cb80d7dffe75dc94bc94ef2aab3dadc86a
# good: [b690e96cf9e6a6cde6f0393de47bdd6317ddb5de] drm/i915: add pipe A force quirks to i915 driver
git bisect good b690e96cf9e6a6cde6f0393de47bdd6317ddb5de
# good: [4afb93b4211b3f65ebd8ea0d9018426dd9e8693e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
git bisect good 4afb93b4211b3f65ebd8ea0d9018426dd9e8693e
# good: [c30c791c946a14a03e87819eced562ed28711961] USB: xhci: Set Mult field in endpoint context correctly.
git bisect good c30c791c946a14a03e87819eced562ed28711961
# good: [63ab71deae67b031045bb28bf8cff45180089f8f] USB: add quirk for Broadcom BT dongle
git bisect good 63ab71deae67b031045bb28bf8cff45180089f8f
# good: [2b795ea00c2bbb077a1199a4d729c8ac03a6bded] USB: musb: tusb6010: fix compile error with n8x0_defconfig
git bisect good 2b795ea00c2bbb077a1199a4d729c8ac03a6bded
Comment 213 Craig73 2010-08-03 10:20:46 UTC
Tried to evaluate the patch... found my system more unstable after installing the (Ubuntu) recommended patches:
 - font colours change fairly quickly to something unreadable, making browsing, text editing, log viewing and window titles unreadable
 - logging out/switching users goes to a black screen that flickers, as if it's cycling through trying to change the video mode, or trying to start the driver and failing repeatedly
 - didn't get to fully test GL apps (once I saw things going downhill I backed out)

Was (and am now back to) running the .34 kernel + X-Updates (Intel 2.11), and it was running OK (when doing nothing GL related)... stable 2D, and video largely OK (not entirely smooth but pretty good).

Note: the installed 2.12/libdrm was enough to cause me the grief above... I didn't notice it until I had the 855GM patch installed, but it still persisted after I uninstalled the 855GM patch.
Comment 214 SMF 2010-08-03 11:03:51 UTC
(In reply to comment #203)
> (In reply to comment #202)
> > (From update of attachment 37157 [details])
> > See README for context and usage.
> 
> Updated kernel to 2.6.35.rc6 + V9 gpu hang reported, as before, with test
> program.

Further updated to latest 2.6.35 kernel, gpu hang reported as before (both with and without V9 patch) using test program.
Comment 215 2points 2010-08-04 13:35:09 UTC
René Gabriëls: Did you happen to come across something regarding the mouse cursor mysteriously being on strike? Upgraded drm-intel-next kernel/v9 patch and userspace drivers (xf86-video-intel/libdrm) to git head and ran into the same problem (I assume the problem is somewhere in kernel code, as going back to 2.6.34 brings back the cursor).

Also +1 to GPU hanging, but couldn't find any flush-related messages in dmesg, so I assume I'm simply running into some other problem once again. Much to my delight I also noticed that the intel driver now seems to fall back to software rendering when the GPU is hung, so at least I can still work without everything going down the drain. Very nice work.
Comment 216 René Gabriëls 2010-08-04 18:06:00 UTC
(In reply to comment #215)
> René Gabriëls: Did you happen to come across something regarding the mouse
> cursor mysteriously being on strike? Upgraded drm-intel-next kernel/v9 patch
> and userspace drivers (xf86-video-intel/libdrm) to git head and ran into the
> same problem (I assume the problem is somewhere in kernel code, as going back
> to 2.6.34 brings back the cursor).

Yes. I tracked down the problem: if you comment out the following lines in drivers/gpu/drm/i915/intel_display.c, the problem will probably be gone (it worked for me).

/* 855 & before need to leave pipe A & dpll A up */
{ 0x3582, PCI_ANY_ID, PCI_ANY_ID, quirk_pipea_force },
{ 0x2562, PCI_ANY_ID, PCI_ANY_ID, quirk_pipea_force },

> Also +1 to GPU hanging, but couldn't find any flush-related messages in dmesg,
> so I assume I'm simply running into some other problem once again. Much to my
> delight I also noticed that the intel driver now seems to fall back to software
> rendering when the GPU is hung, so at least I can still work without everything
> going down the drain. Very nice work.

My system locks up hard when I fullscreen certain video/GL apps.  I also get render errors with progressively rendered images in Firefox.  It's hard to stay optimistic about the state of the Linux desktop considering that graphics has never really worked as it should on this laptop since the day I bought it (almost 7 years ago!).
Comment 217 arthapex 2010-08-05 01:19:27 UTC
(In reply to comment #215)
> René Gabriëls: Did you happen to come across something regarding the mouse
> cursor mysteriously being on strike? Upgraded drm-intel-next kernel/v9 patch
> and userspace drivers (xf86-video-intel/libdrm) to git head and ran into the
> same problem (I assume the problem is somewhere in kernel code, as going back
> to 2.6.34 brings back the cursor).
I'm experiencing this behavior, too. The problem does not occur with 2.6.35-rc4. I'll try your fix.
Comment 218 arthapex 2010-08-05 01:43:17 UTC
Fix from comment #216 worked here, too. Now "Initializing HW Cursor" is back in Xorg.0.log.
Comment 219 nomnex 2010-08-10 20:05:26 UTC
This is probably of little help in the context, but I have installed Fedora 13, kernel 2.6.33.6-147.2.4.fc13.i686 on my old let's note Panasonic CF-W2 (i855 int(h)el(l) GPU).

This distribution installed without a glitch (live CD). It is perfectly stable. None of the installation and stability problems I had encountered with Ubuntu 10.04 has affected me (on this machine, with this type of chipset).
Comment 220 Andrej Podzimek 2010-08-11 12:03:36 UTC
I have just tested patch v9 with 2.6.34.3. The patch breaks all the support for Intel graphics. How is this "solution" supposed to work? (Sounds like a bad joke.)

This is what I can see in dmesg:

[drm:i915_init] *ERROR* drm/i915 can't work without intel_agp module!

Presumably, intel_agp is present and loaded. I tried the patch with intel_agp compiled into the kernel and as a loadable module. It failed the same way in both cases. Without the patch, there is no such problem. (But X.org hangs.)

With the patch applied, the i915 driver never works. When compiled as a module, it cannot be loaded manually and says "No such device".

This is an Asus M2400N laptop with an Intel 82852/855GM (rev 02) GPU.

I wanted to switch to another operating system, but it seems that the Intel driver bug is omnipresent:

Linux: unusable (X.org freezes after a while)
FreeBSD: unusable (kernel panic)
OpenSolaris: unusable (X.org freezes during initialization)

Avoiding Intel graphics chipsets is probably the only solution. That's the most important lesson I have learned from all this. A perfectly working driver was discontinued more than a year ago (April 2009) and no replacement is available so far.

BTW, has anyone tested a recent version of OpenBSD or NetBSD? Is there at least one reasonable (UNIX-like) system that would support old Intel chipsets?
Comment 221 2points 2010-08-11 13:00:14 UTC
Andrej Podzimek: I have exactly the same hardware. Previously tested 2.6.34-rc3 and v6 of this patchset, now running 2.6.35+ (from drm-intel-next) and v9, and have had no such problems. I'd suggest checking up on kernel configuration and if the patch was actually applied correctly before going on a rampage here.
Comment 222 legolas558 2010-08-12 03:23:00 UTC
Created attachment 37807 [details] [review]
v9 against latest git linus tree (manually fixed)
Comment 223 Dennis Nail 2010-08-14 15:50:00 UTC
I installed the GTT Incoherency Patch as described at:
https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch
I have a HP ze4904us laptop with 00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02) graphics card.
I'm running Ubuntu 10.04 Lucid Lynx with kernel 2.6.32-24.
Everything works so far, I used to get a freeze-up during boot but now it works properly.
For a time I used the change in /etc/default/grub, adding "i915.modeset=1" after "quiet splash". During that time I could operate normally, except that playing a video in the Totem movie player would crash the computer.
Now everything seems to work.

Thank you all for all your effort.
Comment 224 SMF 2010-08-16 02:24:38 UTC
(In reply to comment #214)
> (In reply to comment #203)
> > (In reply to comment #202)
> > > (From update of attachment 37157 [details])
> > > See README for context and usage.
> > 
> > Updated kernel to 2.6.35.rc6 + V9 gpu hang reported, as before, with test
> > program.
> 
> Further updated to latest 2.6.35 kernel, gpu hang reported as before (both with
> and without V9 patch) using test program.

Updated kernel to 2.6.35.2 + V9 patch; GPU hang reported on first run of my stress test program:

Linux Mars 2.6.35.2 #5 Mon Aug 16 06:47:43 BST 2010 i686 i686 i386 GNU/Linux

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 828897 at 828892)
Comment 225 SMF 2010-08-22 01:57:29 UTC
(In reply to comment #224)
> (In reply to comment #214)
> > (In reply to comment #203)
> > > (In reply to comment #202)
> > > > (From update of attachment 37157 [details])
> > > > See README for context and usage.
> > > 
> > > Updated kernel to 2.6.35.rc6 + V9 gpu hang reported, as before, with test
> > > program.
> > 
> > Further updated to latest 2.6.35 kernel, gpu hang reported as before (both with
> > and without V9 patch) using test program.
> 
> Updated kernel to 2.6.35.2 + V9 patch GPU hang reported on first run of my
> stress test program:
> 
> Linux Mars 2.6.35.2 #5 Mon Aug 16 06:47:43 BST 2010 i686 i686 i386 GNU/Linux
> 
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> 828897 at 828892)

Updated xserver to 1.9.0 and kernel to 2.6.35.3 and ran my stress test over night with NO GPU hangs reported.
Some issues with Xv and mplayer but that is probably another story.
Comment 226 SMF 2010-08-22 01:59:30 UTC
(In reply to comment #225)
> (In reply to comment #224)
> > (In reply to comment #214)
> > > (In reply to comment #203)
> > > > (In reply to comment #202)
> > > > > (From update of attachment 37157 [details])
> > > > > See README for context and usage.
> > > > 
> > > > Updated kernel to 2.6.35.rc6 + V9 gpu hang reported, as before, with test
> > > > program.
> > > 
> > > Further updated to latest 2.6.35 kernel, gpu hang reported as before (both with
> > > and without V9 patch) using test program.
> > 
> > Updated kernel to 2.6.35.2 + V9 patch GPU hang reported on first run of my
> > stress test program:
> > 
> > Linux Mars 2.6.35.2 #5 Mon Aug 16 06:47:43 BST 2010 i686 i686 i386 GNU/Linux
> > 
> > [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> > [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> > 828897 at 828892)
> 
> Updated xserver to 1.9.0 and kernel to 2.6.35.3 and ran my stress test over
> night with NO GPU hangs reported.
> Some issues with Xv and mplayer but that is probably another story.

Sorry, the kernel was 2.6.35.3 + V9 patch.
Comment 227 Alkis Georgopoulos 2010-08-23 02:14:58 UTC
I tested https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch on my laptop with:
00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)

While it allowed the system to boot, and KMS / 3D acceleration were working fine, it had problems with XV. E.g. playing a video with totem/vlc or testing XV with gstreamer-properties resulted in a crash:
$ gstreamer-properties
gstreamer-properties-Message: Skipping unavailable plugin 'artsdsink'
gstreamer-properties-Message: Skipping unavailable plugin 'esdsink'
gstreamer-properties-Message: Skipping unavailable plugin 'glimagesink'
gstreamer-properties-Message: Skipping unavailable plugin 'sdlvideosink'
gstreamer-properties-Message: Skipping unavailable plugin 'v4lmjpegsrc'
gstreamer-properties-Message: Skipping unavailable plugin 'qcamsrc'
gstreamer-properties-Message: Skipping unavailable plugin 'esdmon'
The program 'gstreamer-properties' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadAlloc (insufficient resources for operation)'.
  (Details: serial 60 error_code 11 request_code 132 minor_code 19)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)


So the best workaround for me so far is to use the https://launchpad.net/~brian-rogers/+archive/experimental kernel which gives me KMS, 3D and XV with no problems.
Comment 228 Andrej Podzimek 2010-08-24 23:10:16 UTC
Tested 2.6.35.3 and the V9 patch.

This time the Intel adapter *works* and the KDE desktop is usable when compositing is switched off.

With compositing switched on, freezes *do* occur as usual, but this time they do not block virtual console switching. This means that the frozen machine could be rebooted gracefully if there wasn't another (possibly related) bug (see below).

Unfortunately, something gets broken inside the kernel. An attempt to sync the file systems (and suspend/hibernate/reboot) gets stuck forever. There is a kernel process called 'flush-8:0' that consumes 100% of CPU time. Existing sessions remain usable, but no new sessions can be established. The Magic SysRq is the only solution here.
Comment 229 Andrej Podzimek 2010-08-24 23:19:08 UTC
Created attachment 38135 [details]
Failed chipset flush backtrace

A backtrace from dmesg. Occurs with the V9 patch and 2.6.35.3.
Comment 230 Andrej Podzimek 2010-08-25 09:47:02 UTC
(In reply to comment #229)
> Created an attachment (id=38135) [details]
> Failed chipset flush backtrace
> 
> A backtrace from dmesg. Occurs with the V9 patch and 2.6.35.3.

Some extra notes on the issues mentioned above:

The "flush hang" problem (the 'flush-8:0' kernel process taking up all the CPU time) occurs no matter if compositing is off, no matter if intensive disk operations take place and no matter if the desktop is actually used. (It can easily happen right after boot when only KDM is displayed.)

The "flush hang" problem does not occur immediately in the moment when the backtrace appears. Minutes to tens of minutes elapse between the warning and the flush-8:0 process going out of control.

I tried the Reiser4 patch alone (without the V9 patch (and without X.org, of course)) and had no issues. Everything worked even after hours of uptime. But this could have just happened by chance...

I see there's some VFS related stuff in the backtrace. If I understand it well, a scheduling clock tick is involved. Could this be a bug related to interrupt disabling and other synchronization? If the VFS data integrity is compromised, it could possibly explain some of these issues.

I use a fully preemptible kernel and a 300Hz scheduling clock. The machine is an Asus M2400N (uniprocessor Pentium M).

What should I try next? Any suggestions?
Comment 231 Jean-Michel Grimaldi 2010-08-27 14:05:28 UTC
(In reply to comment #187)
> http://glasen-hardt.de/?page_id=707

I used to have crashes in less than 2 minutes under Ubuntu karmic and lucid when I simply ran firefox or even a gnome-term.
I applied fix-i8xx-gtt-cache-coherency-v9-2.6.35.1.patch to Ubuntu maverick kernel 2.6.35-18.24 from http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-maverick.git (the patch needed adaptation, but 'patch' did the magic alone).

I have no freeze anymore. Thanks for the patch!
Please put it in the main trunk for maverick. Judging from other test reports it might not be perfect, but it is certainly better than the current state.
I run xserver-xorg-video-intel 2:2.9.1-3ubuntu5 from lucid, and KMS is activated through the boot parameter i915.modeset=1.
Comment 232 Michal Nowak 2010-08-30 02:38:12 UTC
Daniel, could you be so kind and give us an update on this bug? From what I can see you created a decent patch which solves or enhances the situation for most of 855gm users but it seems that this patch did not make it upstream, is it correct? I also read on other approaches elsewhere: 1) dumping 855gm as unsupported, 2) reverting to user-mode setting... Thank You.
Comment 233 Daniel Vetter 2010-08-30 09:15:41 UTC
> --- Comment #232 from Michal Nowak <mnowak@redhat.com> 2010-08-30 02:38:12 PDT ---
> Daniel, could you be so kind and give us an update on this bug? From what I can
> see you created a decent patch which solves or enhances the situation for most
> of 855gm users but it seems that this patch did not make it upstream, is it
> correct? I also read on other approaches elsewhere: 1) dumping 855gm as
> unsupported, 2) reverting to user-mode setting... Thank You.

Ok, the long overdue status report: I haven't upstreamed the patch for a
few reasons:
- It's an extremely ugly approach, involving way too much duct-tape. That
  might be acceptable if it actually worked reliably, but that's not the case.
- It has (under certain circumstances) rather severe performance
  implications (mostly because the eviction code is not clever enough).
Hence why I'm not satisfied and of the opinion that upstreaming might
cause more harm than good. Different story for distros, though.

I have a few ideas as how to amend this, but that requires a complete
rewrite of the gtt code. I've finally found time to start hacking on this,
see

http://cgit.freedesktop.org/~danvet/drm/log/?h=intel_gtt_rework

Don't try this on an i8xx, no cache coherency stuff in there (yet). I'll
give updates as soon as there is stuff to try out.

On other approaches for the short/medium term, as I see them (take this
with a grain of salt, I'm just doing this for fun and leisure and I'm not
an Intel employee):

- Keep the old ums stuff around. Not supported by intel (and I don't think
  this will ever happen). You're basically on your own.
- Keep this patch as a band-aid. Thankfully all the nice people here have
  been awesome with forward-porting and helping each other out, so this
  basically maintains itself ;)
- Chris Wilson's shadowfb branch. See

http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=shadow

  This won't give opengl, though. But that looks like a good approach
  until the i8xx cache coherency nightmare is fixed for real. And it has
  the chance of being merged to master.
- Burn your i855 on a pyre ;)

-Daniel
Comment 234 René Gabriëls 2010-08-30 15:14:07 UTC
Created attachment 38324 [details] [review]
v9.1: v9 patch updated for 2.6.36-rc3

Good news Daniel, keep up the good work!

In the meantime, I've ported the v9 patch to Linus's most recent git-tree (2.6.36-rc3 atm).  Some Sandy Bridge patches interfered with the old patch, so I merged them.  I hope the result is correct (it works for me).
Comment 235 Michal Nowak 2010-08-30 23:55:51 UTC
Thanks Daniel, for both the work on this issue and the status report, both really appreciated.

Fedora 13 with 2.6.34 kernel works 100% in terms of stability for me. (But the performance is... ~50 fps in glxgears -- I can live with that for sure.)
Comment 236 Indan Zupancic 2010-09-03 08:00:30 UTC
(In reply to comment #233)
> I have a few ideas as how to amend this, but that requires a complete
> rewrite of the gtt code. I've finally found time to start hacking on this,
> see
> 
> http://cgit.freedesktop.org/~danvet/drm/log/?h=intel_gtt_rework

What I like about that is that it removes a lot of lines of code.
I hope you can get it small and simple enough that it becomes very
stable.

Tell us when you want some testing and we'll provide.

> Don't try this on an i8xx, no cache coherency stuff in there (yet). I'll
> give updates as soon as there is stuff to try out.

Any new ideas how to get cache coherency, or how to avoid the need for it?

> On other approaches for the short/medium term, as I see them (take this
> with a grain of salt, I'm just doing this for fun and leisure and I'm not
> an Intel employee):
> 
> - Keep the old ums stuff around. Not supported by intel (and I don't think
>   this will ever happen). You're basically on your own.

Chris Wilson reintegrated the legacy UMS and put it in his "legacy" branch,
to make it easier for people who are on their own to work on it together.

> - Keep this patch as a band-aid. Thankfully all the nice people here have
>   been awesome with forward-porting and helping each other out, so this
>   basically maintains itself ;)
> - Chris Wilson's shadowfb branch. See
> 
> http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=shadow
> 
>   This won't give opengl, though. But that looks like a good approach
>   until the i8xx cache coherency nightmare is fixed for real. And it has
>   the chance of being merged to master.

I don't care about opengl, this thing can't do anything opengl for real anyway.
All I want is stable and fast 2D rendering. :-( Full-screen video playback is 
nice to have, but not a must. I used to have DRI disabled and only the UMS 2D 
intel driver enabled.

I think I'll give Chris' shadowfb branch a try, it looks very promising.
Any idea where I can give feedback about it?

> - Burn your i855 on a pyre ;)

When unplugging my mother's iPod nano it burnt the USB part of the chip, 
so almost got there. ;-) I have a ThinkPad X40 and it's a wonderful machine,
but it also gives no choice as far as the i855 goes.

> -Daniel

Thank you for all your work Daniel.
Comment 237 legolas558 2010-09-03 13:49:33 UTC
I have an i855GM rev.02 and with linus git tree + v9 patch I get a lovely black screen.

What spell shall I use to get it working again?

Thanks
Comment 238 Ryan Lovett 2010-09-08 16:32:53 UTC
A ThinkPad R51 was freezing at boot after upgrading to Lucid. I used Stefan Glasenhardt's Ubuntu PPA as described here:

https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch

and it solved the problem.
Comment 239 legolas558 2010-09-08 23:37:38 UTC
(In reply to comment #238)
> A ThinkPad R51 was freezing at boot after upgrading to Lucid. I used Stefan
> Glasenhardt's Ubuntu PPA as described here:
> 
> https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch
> 
> and it solved the problem.

Stefan uses the same patch which is here, which is not effective for me. I am using git linus tree; I think the patch should be sent upstream as it only improves things for all use-cases I have seen so far.

@Chris Wilson: is there a cumulative patch for your UMS "legacy" work?
Comment 240 Kristijan Vrban 2010-09-12 12:57:37 UTC
Hello, I tried the latest patch with 2.6.36-rc3, and it is basically working, but every ~9-10s the mouse pointer hangs, and dmesg is flooded with a lot of these messages:
   
[  214.024020] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 43, available = 42.
[  214.024029] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 43, available = 42.

my GPU:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00 [VGA controller])
	Subsystem: Acer Incorporated [ALI] Device 0064
	Flags: bus master, fast devsel, latency 0, IRQ 6
	Memory at e8000000 (32-bit, prefetchable) [size=128M]
	Memory at e0000000 (32-bit, non-prefetchable) [size=512K]
	I/O ports at 1800 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: <access denied>
	Kernel driver in use: i915
Comment 241 Daniel Vetter 2010-09-12 13:37:05 UTC
> --- Comment #240 from Kristijan Vrban <vrban.lkml@googlemail.com> 2010-09-12 12:57:37 PDT ---
> Hello, i tried the latest patch with 2.6.36-rc3, and it basically is working,
> but every ~9-10s the mouse pointer hangs, and dmesg is flooded with a lot of
> this messages:
> 
> [  214.024020] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane,
> expect flickering: entries required = 43, available = 42.
> [  214.024029] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane,
> expect flickering: entries required = 43, available = 42.

Both known problems: The hang every 10s is hotplug code wasting too much
time (and hence stalling mouse updates). The warning is harmless, the code
wasn't changed at all, it just started reporting possible causes for
flicker. Both problems have patches in drm-intel/drm-intel-fixes that
should land in -stable sooner or later.

-Daniel
Comment 242 Brian Rogers 2010-09-17 08:18:41 UTC
Daniel, are you interested in people testing your gtt rework branch on non-i8xx hardware right now, or is it in an early enough state that the feedback would just be noise?

I could test that branch on my i965 laptop if you'd find that helpful.
Comment 243 korniko 2010-09-22 05:44:05 UTC
Workaround A from https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes alone worked for a Dell Latitude D505 with this GPU (from $ lspci | grep Intel):
00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
Workaround A is:
To turn KMS back on, run these commands in a Terminal window and reboot:
echo options i915 modeset=1 | sudo tee /etc/modprobe.d/i915-kms.conf
sudo update-initramfs -u
Previously, Lucid with kernel 2.6.31-22 generic and all previous releases (Karmic and earlier) worked fine, but Lucid kernel 2.6.32-24 generic showed some graphics initially on startup and then went to a blank screen.
The problem was first seen after upgrading from Karmic to Lucid Xubuntu. 22Sep2010
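
To double-check that Workaround A is actually in place before rebooting, something like the following could be used. This is only a sketch: the function name is invented for illustration, and it only inspects the config file written by the tee command above, not what the running kernel ended up with.

```shell
# Sketch: succeed if the given modprobe config file sets modeset=1 for i915.
kms_enabled_in_conf() {
    grep -Eq '^[[:space:]]*options[[:space:]]+i915[[:space:]].*modeset=1' "$1"
}

# Example call, using the file name from Workaround A:
# kms_enabled_in_conf /etc/modprobe.d/i915-kms.conf && echo "KMS option present"
```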
Comment 244 Anthony Ruhier 2010-10-02 12:18:46 UTC
Hello,
I have tested your fix and your driver for the i855GM chipset on Lucid.
I can turn on compiz, but if I watch a video in VLC, compiz crashes. If I don't activate hardware acceleration (sorry, I'm French, so I don't know if that's the right term), it works. But if I activate hardware acceleration, VLC crashes after 30 minutes.

So I tested Ubuntu 10.10, because I had read that it works better there. But hardware acceleration isn't activated, so videos lag when I watch them... I activated it manually, but it isn't stable, so I tried your driver and your fix, but I couldn't install the fix because of the 2.6.35 kernel.

Can you adapt your fix to kernel 2.6.35 for Maverick?
Or can you think of another solution?
Comment 245 glapo21 2010-10-11 09:49:23 UTC
Hey, I have Lubuntu 10.04 and have been trying to get an external monitor working with it. I currently use the vesa driver because it was the only one that worked. I tried using your patch at this link: 

http://bugs.freedesktop.org/createaccount.cgi?login=glapo21%40gmail.com

I could get to the loading screen and the resolution seemed fine. The external monitor was actually working, but I was unable to boot into X. It would stop at some "process:258 GLIB warning", which I would get before too, but then it would still load up after a second. Sometimes it loads further and stops while loading some vmware player configs. I feel I am very close; do you know any way to get it fully working? I apologize if this is not clear.

Thanks a lot it for your awesome patch!

I am working with:

 $ lspci -nn | grep VGA
Quote:
00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
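
Since many reports in this thread hinge on whether the machine really has the 82852/855GM [8086:3582], here is a small sketch for pulling that ID out of `lspci -nn` output. The function names are made up for illustration; the only fact relied on is the bracketed vendor:device format shown in the line above.

```shell
# Sketch: read `lspci -nn` output on stdin and print the vendor:device ID
# of the first VGA controller line, e.g. 8086:3582.
vga_pci_id() {
    grep 'VGA compatible controller' | sed -n '1s/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}

# Sketch: succeed if the first VGA controller on stdin is the 855GM ID seen in this bug.
is_855gm() {
    [ "$(vga_pci_id)" = "8086:3582" ]
}

# Example call:
# lspci -nn | is_855gm && echo "82852/855GM detected"
```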
Comment 246 glapo21 2010-10-11 10:47:06 UTC
I posted the wrong link, sorry. 

Okay so this solution may only work for me but this is what I did.

Everything is based off of https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes
1. Workaround A: Re-enable KMS by editing the .conf file
2. Reboot
3. X fails to load; modify the xorg.conf file and change vesa to intel 
4. At this point it automatically started X and I was in Lubuntu
5. Celebrate and plug in your external monitor. If it does not start look under Preferences -> Monitor Settings. You may have to turn it on.
6. Reboot
7. At this point my external monitor was freaking out, so I then applied the GTT Incoherency Patch.
8. Reboot
9. External monitor works with intel drivers and no "freaking out".
 
Thanks for your awesome PATCH!!!!
Comment 247 Anthony Ruhier 2010-10-11 14:39:25 UTC
Hello,
With the latest driver (2.13.0) I can't turn on compiz, and I can't turn on graphics acceleration for Flash either.
Is that normal, or do I have to do something else to turn them on?
Comment 248 nepo 2010-10-26 13:42:47 UTC
Unfortunately I am still experiencing blue screens especially when playing avi-files fullscreen with VLC. vob-files however, seem to be stable. I am using the latest glasen-hardt.de patch for kubuntu 10.04, kernel 2.6.32-25.45.

The following lines are probably too short to figure out the origin:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 331959 at 331958)

What information is needed to solve it?

Thanks, J.
Comment 249 la.ouvrard 2010-11-17 13:19:34 UTC
Hi,
I have an i855 chipset. I installed Ubuntu a few days ago, but I cannot read DVDs. I have tried the GTT Incoherency Patch, but this:
http://ppa.launchpad.net/glaen/855gm-fix/ubuntu/dists/lucid/main/binary-i386/Packages.gz
does not work. And I wonder if that could explain why I still cannot use DVDs on my laptop. Is there another solution?
Cheers
Comment 250 legolas558 2010-12-01 15:30:00 UTC
git linus master v2.6.37-rc4 here, and it seems FUBAR...I couldn't adapt the patch
Comment 251 Dave Lahr 2010-12-04 05:47:22 UTC
Patch worked for me - thank you!  I used the instructions here ("GTT Incoherency Patch")
https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes

I'm running on an IBM Thinkpad G41 2886-5TU

Ubuntu 10.04 LTS

uname -r 
2.6.32-26-generic

lspci
00:00.0 Host bridge: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller (rev 02)
00:00.1 System peripheral: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller (rev 02)
00:00.3 System peripheral: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
00:1f.6 Modem: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Modem Controller (rev 01)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5901 100Base-TX (rev 01)
02:01.0 CardBus bridge: Texas Instruments PCI1520 PC card Cardbus Controller (rev 01)
02:01.1 CardBus bridge: Texas Instruments PCI1520 PC card Cardbus Controller (rev 01)
02:02.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)
Comment 252 2points 2010-12-04 06:19:18 UTC
Could you please avoid marking this bug as fixed while the patch is not integrated in mainline yet, or in the drm-intel dev sources at least?
Comment 253 legolas558 2010-12-06 05:32:50 UTC
I have tried v9.1 patch on git linus kernel 2.6.36-rc3, but screen goes all black. Keyboard is still responsive as I can reboot with Ctrl+Alt+Del but nothing else.
Comment 254 René Gabriëls 2010-12-11 06:13:39 UTC
> --- Comment #253 from legolas558 <legolas558@email.it> 2010-12-06 05:32:50 PST ---
> I have tried v9.1 patch on git linus kernel 2.6.36-rc3, but screen goes all
> black. Keyboard is still responsive as I can reboot with Ctrl+Alt+Del but
> nothing else.

I have ported the patch to v2.6.36 (see attachment).  I had to edit quite a bit
of code so I hope it works for you.  I have been running this patch for a couple
of weeks now and it appears to be working fine for me at least.
Comment 255 René Gabriëls 2010-12-11 06:16:09 UTC
Created attachment 41004 [details] [review]
v9.3: v9 patch ported to kernel 2.6.36
Comment 256 legolas558 2010-12-12 05:12:49 UTC
(In reply to comment #254)
> > --- Comment #253 from legolas558 <legolas558@email.it> 2010-12-06 05:32:50 PST ---
> > I have tried v9.1 patch on git linus kernel 2.6.36-rc3, but screen goes all
> > black. Keyboard is still responsive as I can reboot with Ctrl+Alt+Del but
> > nothing else.
> 
> I have ported the patch to v2.6.36 (see attachment).  I had to edit quite a bit
> of code so I hope it works for you.  I have been running this patch for a
> couple
> of weeks now and it appears to be working fine for me at least.

Thank you very much René, it is much appreciated. I have applied your patch to git linus v2.6.36-rc8; the patch applies perfectly, but I am still experiencing the black screen (same issue as in comment 237). I have now run 'make clean' and am rebuilding the kernel, and will report when I have more information about this black screen glitch.
Comment 257 legolas558 2010-12-12 08:03:34 UTC
Ok, looks like each time I re-apply this patch I need to run 'make clean' otherwise I get the black screen glitch.

I confirm that patch v9.3 works perfectly for i855GM rev.02
Comment 258 SMF 2010-12-13 01:24:40 UTC
(In reply to comment #255)
> Created an attachment (id=41004) [details]
> v9.3: v9 patch ported to kernel 2.6.36

Tried V9.3 patch with X server 1.9.2, xf86-video-intel-2.13.0 driver and my test program for this issue (on an IBM Thinkpad R51 with Linux Kernel 2.6.36.2) and the GPU hung after a few minutes:

dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 677107 at 677109)

Xorg.0.log:
[  1448.685] (EE) intel(0): Detected a hung GPU, disabling acceleration.
Comment 259 René Gabriëls 2010-12-13 14:05:09 UTC
> Tried V9.3 patch with X server 1.9.2, xf86-video-intel-2.13.0 driver and my
> test program for this issue (on an IBM Thinkpad R51 with Linux Kernel 2.6.36.2)
> and the GPU hung after a few minutes:
> 
> dmesg:
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> 677107 at 677109)
> 
> Xorg.0.log:
> [  1448.685] (EE) intel(0): Detected a hung GPU, disabling acceleration.

Does this also happen without the patch or with an older kernel + v9 patch?  I
haven't changed anything in the v9 patch, I've just ported it to newer kernels.
Also: I'm not a kernel hacker, so I have no idea what I'm doing anyway :-)
Comment 260 legolas558 2010-12-13 14:35:40 UTC
(In reply to comment #259)
> > Tried V9.3 patch with X server 1.9.2, xf86-video-intel-2.13.0 driver and my
> > test program for this issue (on an IBM Thinkpad R51 with Linux Kernel 2.6.36.2)
> > and the GPU hung after a few minutes:
> > 
> > dmesg:
> > [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> > [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> > 677107 at 677109)
> > 
> > Xorg.0.log:
> > [  1448.685] (EE) intel(0): Detected a hung GPU, disabling acceleration.
> 
> Does this also happen without the patch or with an older kernel + v9 patch?  I
> haven't changed anything in the v9 patch, I've just ported it to newer kernels.
>  Also: I'm not a kernel hacker, so in I have no idea what I'm doing anyway :-)

I can confirm that patch v9 has a problem: everything is fine for a few minutes, but eventually Xorg will lock up the entire system and I can only reboot with Ctrl+Alt+Del.

This is the same problem that I have always experienced, more or less, since UMS was dropped; it looks like v9.1, and v9.3 which is 100% equivalent, are less tuned for my rev.02 hardware, so the crash shows up sooner.
Comment 261 SMF 2010-12-14 01:47:55 UTC
(In reply to comment #260)
> (In reply to comment #259)
> > > Tried V9.3 patch with X server 1.9.2, xf86-video-intel-2.13.0 driver and my
> > > test program for this issue (on an IBM Thinkpad R51 with Linux Kernel 2.6.36.2)
> > > and the GPU hung after a few minutes:
> > > 
> > > dmesg:
> > > [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> > > [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> > > 677107 at 677109)
> > > 
> > > Xorg.0.log:
> > > [  1448.685] (EE) intel(0): Detected a hung GPU, disabling acceleration.
> > 
> > Does this also happen without the patch or with an older kernel + v9 patch?  I
> > haven't changed anything in the v9 patch, I've just ported it to newer kernels.
> >  Also: I'm not a kernel hacker, so in I have no idea what I'm doing anyway :-)
> 
> I can confirm that patch v9 has a problem: everything is fine for a few
> minutes, but eventually Xorg will lock up the entire system and I can only
> reboot with Ctrl+Alt+Del.
> 
> This is the same problem that I have always experienced - less or more - since
> UMS was dropped; looks like v9.1, and v9.3 which is 100% equivalent, are less
> tuned for my rev.02 hardware so the crash pops out sooner.

I agree. I have tested the various releases of the patch with the latest available kernel, Xorg server and intel driver. The stability and quality have improved significantly over time, but for me my test program still triggers this fault after a few minutes every time (see attachment 37157 [details] for an early version of my test).
Comment 262 Andrej Podzimek 2010-12-19 14:56:46 UTC
Applying the patch to 2.6.36.2 results in a black screen and a total system freeze. All these reports about "improving stability" of the Intel GPU driver just sound like April Fool's Day jokes to me...

The transition from UMS to KMS changed a reliable GPU that worked perfectly under Linux into a piece of crap.

BTW, it freezes even in text consoles, when no X server is running. I have never understood what all this KMS nonsense is good for. I used to hate the NVidia blob, but I must say it *just* *works*, unlike all the recent Intel stuff.
Comment 263 Brian Rogers 2010-12-19 17:46:08 UTC
KMS has nothing to do with the stability problems. The driver became unstable before KMS entered the picture. So the sentiment that the driver was destabilized to chase a frivolous pursuit is wrong.

Not that KMS is frivolous, though, as it is what got suspend and resume working stably on some systems, among other things.
Comment 264 Andrej Podzimek 2010-12-20 11:03:36 UTC
> Not that KMS is frivolous, though, as it is what got suspend and resume working
> stably on some systems, among other things.

On two newer Intel systems (965 and X3100) I often come across, the situation got much worse two years ago and it has not improved in any way so far. :-( With older drivers, suspend & resume worked just fine. With the current drivers and KMS, it does not work. There are random freezes. There is a black screen on boot from the moment the i915 stuff gets initialized to the moment when X.org takes control. Switching from X to a text console and back twice yields a reliable crash. 3D acceleration used to work just fine with older drivers in KDE 4. Basic compositing effects were really smooth. With the current driver, KDE 4 is a choppy and painful experience.

Needless to say that all the above only holds for chipsets much newer than my good old 855GM, which simply does *not* work at all and freezes even in the KMS text consoles, with no X.org at all... Starting X.org leads to a black screen and an undefined system state.
Comment 265 Andrej Podzimek 2010-12-20 11:12:30 UTC
Created attachment 41306 [details]
Output from dmesg after a X.org crash on ArchLinux stock kernel

Surprisingly, the ArchLinux stock kernel seems to work much much better than a vanilla kernel with just the v9 patch applied. (I don't know if they add this patch to the stock kernel or not...)

Unlike a v9-patched vanilla kernel, the ArchLinux stock kernel

* does not randomly crash early on boot
* is able to run X.org (ugly, choppy, full of crazy color artifacts, but it works at least for a while)

Unfortunately, X.org will eventually crash after a couple of minutes to hours of uptime. (This seems to be random.)

When the screen goes black (backlight off!), there is no response to keyboard or mouse. However, the system seems to survive, so I can ssh into the machine and get a backtrace from dmesg. And here it is.
Comment 266 legolas558 2010-12-20 12:00:23 UTC
For all those who think this is a good place to rant: you are wrong. Here you will find the few who have worked *on their own* to fix the intel driver.

The pure, evil truth is that there is no cash flowing to Intel for the i8xx hardware, which is long out of production and support; when a company has no direct economic advantage in supporting a product, it will support neither the product nor its driver.

We can only hope that somebody with the expertise can finally iron out the remaining issues (since the hardware seems to work for 1~15 minutes in the best cases); BUT there is also the possibility that this hardware will never work with KMS because of "design glitches" which are hardly documented anywhere and make the hardware behave erratically during the precise and fast operations required by the new kernel architecture around video drivers.

Use VESA or the old kernel + driver, or else let's pay somebody to fix the driver - there are no other solutions, and it looks like praying together does not fix drivers :(
Comment 267 Indan Zupancic 2010-12-21 00:15:50 UTC
The latest driver is buggy as hell, for Archlinux you need package 
xf86-video-intel-2.12.0-3-i686.pkg.tar.xz, the newer one doesn't work.
Comment 268 sh29112911 2010-12-21 12:37:20 UTC
I have been running the latest stable version of the driver (2.13.0) on 855GM with the shadow framebuffer enabled and this seems to have fixed most of the stability issues for me. KMS is working (without the patch!) and I have not seen any artifacts or other weirdness.

I do not really experience any loss in performance although 2D acceleration is disabled for basic drawing operations. Xv overlay is working, so watching videos fullscreen can still be accelerated.

The only glitch I have found is that changing resolution sometimes crashes X, but this affects normal desktop usage rarely.

In the development version of the driver (2.13.902) it is even possible to enable both 3D acceleration and shadowfb, but 3D programs like compiz sometimes (not often though) hang the GPU. I have had no stability issues with 2D programs.

The other components of my setup are vanilla kernel 2.6.36.1, xorg-server 1.9.2 and mesa 7.9.
Comment 269 Chris Wilson 2010-12-22 03:29:47 UTC
Created attachment 41368 [details] [review]
Try a magic GWB bit in the HIC register

Some old documents turned up, that briefly mention bit31 in a 16bit register...
Comment 270 arthapex 2010-12-22 05:29:55 UTC
(In reply to comment #269)
> Created an attachment (id=41368) [details]
> Try a magic GWB bit in the HIC register
> 
> Some old documents turned up, that briefly mention bit31 in a 16bit register...

On which kernel version?
Comment 271 Chris Wilson 2010-12-22 05:34:53 UTC
drm-intel-next, should apply to -fixes as well (2.6.37).
Comment 272 arthapex 2010-12-23 07:44:59 UTC
(In reply to comment #271)
> drm-intel-next, should apply to -fixes as well (2.6.37).

I'm trying -fixes (2.6.37-rc5) with your patch since yesterday. Had no crash since then. The Shadow option works without crashes here, too.
I tried xcompmgr, zsnes, and Extreme Tux Racer, though the latter only at about 5 fps.
Waiting for instructions on further information or tests.
Comment 273 arthapex 2010-12-23 09:11:28 UTC
I'm using your patch without the Shadow option, of course.
Comment 274 Chris Wilson 2010-12-23 09:22:15 UTC
That's excellent news! So far:

i830: untested
i845: still fails the yes wtf > /tmp/wtf test
i855: it possibly works
i865: untested, might also work - anyone feeling brave?

The stress test that I'd like people to try as well is:

$ while :; do yes wtf > /tmp/wtf; done & # will fill up the entire disk
$ while :; do x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1; done
Comment 275 2points 2010-12-23 11:10:49 UTC
Created attachment 41404 [details]
Error state after x11perf test case

855GM (rev 2) report on drm-intel-fixes, 4d302442: Slight font corruption in comparison to 2.6.36 vanilla (occasional horizontal "lines" through letters, barely noticeable). No crash during normal desktop activity (mplayer, Opera, Thunderbird, etc.). The yes/x11perf test case ran for a while, but eventually gave up and crashed the GPU. Xorg log prints "(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error" and "(EE) intel(0): failed to set cursor: Input/output error" after GPU hung, but this is probably expected behavior. Attached error state.
Comment 276 Chris Wilson 2010-12-23 11:46:20 UTC
Well that error state is atypical of the "wtf" failure and the previous cache flushing errors. Progress. ;-)
Comment 277 Stefan Glasenhardt 2010-12-25 07:35:58 UTC
(In reply to comment #269)
> Some old documents turned up, that briefly mention bit31 in a 16bit register...

Did the folks at Intel smoke some weird things when designing this chipset?

P.S: 

I tried to backport your patch to kernel-version 2.6.35 but i'm not 100% sure if the following lines are necessary :

> +			intel_private.gmch_chip_id =
> +				intel_gtt_chipsets[i].gmch_chip_id;

Greetings and Merry Christmas

Stefan
Comment 278 2points 2010-12-25 09:22:40 UTC
> I tried to backport your patch to kernel-version 2.6.35 but i'm not 100% sure
> if the following lines are necessary :
> 
> > +			intel_private.gmch_chip_id =
> > +				intel_gtt_chipsets[i].gmch_chip_id;
That field is used in the IS_855GM macro defined earlier in the patch, and by extension in i830_chipset_flush.
Comment 279 Daniel Vetter 2010-12-28 12:53:11 UTC
Chris, I've tested your patch together with my resurrected cache-coherency
checker. Code available at

git://anongit.freedesktop.org/~danvet/drm i855-cache-coherency-checker

Unfortunately the coherency failure rate is as high as ever :( and varies by
about 1-2 orders of magnitude (depending upon what's running). Any
improvement is definitely down in the noise. So either my cache coherency
checker is bust (please check the code) or your patch doesn't flush (all)
the right caches.
-Daniel
Comment 280 Chris Wilson 2010-12-30 08:17:11 UTC
Created attachment 41530 [details] [review]
Poke the GWB bit.

If I invoke a wbinvd before poking the HIC, my system is stable and even survives Daniel's cache-coherency checker.

* Update: cache flush: 0 fails / 10240 flushes, then a GPU hang:

IPEHR: 0x00000029
...
0x01435000:      0x54300004: XY_COLOR_BLT (rgb enabled, alpha enabled, dst tile 0)
0x01435004: HEAD 0x03f01000:    format 8888, pitch 4096, clipping disabled

But this might be just the craptastic 845G...
Comment 281 Chris Wilson 2010-12-30 08:19:19 UTC
Created attachment 41531 [details] [review]
Poke the GWB bit.
Comment 282 2points 2010-12-30 10:51:52 UTC
Good going. Survived about 30 minutes of yes/x11perf so far without any traces of errors recorded. Stopped after that to do some regular work, which was also unproblematic. Upgrading Xorg from 1.7 to 1.9 also fixed font corruption, so that was unrelated to kernel bugs. 

For comparison's sake, I'm currently running:
855GM rev2
drm-intel-fixes, cc6455f8
Xorg 1.9.2
Mesa 7.9
xf86-video-intel from git, 7667ad84
libdrm also from git, bad5242a
Comment 283 Jim Rees 2011-01-01 06:55:43 UTC
Everything seems to work with this GWB patch, but it's too soon to tell whether it fixes my main problem, which is failure to resume from suspend. There is one unwanted side-effect. On resume, the LCD screen brightness gets set to some random value. This is on a x40 with 855GM, running the Ubuntu version of 2.6.37-rc5.
Comment 284 Jim Rees 2011-01-01 09:52:49 UTC
Still not working with GWB patch. I just had the backlight go off and nothing I could do would bring it back. I had to power cycle and go back to the vesa driver. But thanks for trying, keep up the good work.

My xorg is 1.9.0 with the intel driver from https://launchpad.net/~glasen/+archive/intel-driver .
Comment 285 Julien Cristau 2011-01-01 10:09:19 UTC
> --- Comment #283 from Jim Rees <rees@umich.edu> 2011-01-01 06:55:43 PST ---
> Everything seems to work with this GWB patch, but it's too soon to tell whether
> it fixes my main problem, which is failure to resume from suspend.

resume failure is not what this bug is about, though.
Comment 286 Indan Zupancic 2011-01-01 16:58:23 UTC
Jim, the brightness and backlight problems are new in 2.6.37. See:

https://bugzilla.kernel.org/show_bug.cgi?id=23472
https://bugzilla.kernel.org/show_bug.cgi?id=22672

For me 22672 is fixed, but it doesn't look like 23472 will be fixed before the 2.6.37 release. Looks like something fiddles with the brightness while it shouldn't touch it at all, but all patches just fiddle more with the brightness.

I haven't had the chance to try out all the new patches yet, will report back when I do (though everything is fairly stable now for me without any patches).
Comment 287 Jim Rees 2011-01-01 17:18:00 UTC
The brightness problem is new for me, although there was something similar in 2.6.32; see http://kubuntuforums.net/forums/index.php?topic=3111005.0 .

I've had the backlight problem, or one with identical symptoms, ever since KMS was introduced. I haven't had a working intel driver since 2.6.31.
Comment 288 Rémi Cardona 2011-01-02 01:58:57 UTC
@Jim and @Indan,

Please don't use this bug as a forum. The issue being handled here is _only_ about visual corruptions due to i8xx chipsets' odd cache-handling.

If you guys have other issues, please report them in _other_ bug reports. Don't spam the rest of us.

Thank you
Comment 289 Thorsten Vollmer 2011-01-03 06:03:39 UTC
Chris, I have tested your latest patch on my 852GME with kernel 2.6.37-rc8 and Daniel's cache-coherency checker. After accumulating more than 10 million cache flushes I am confident that your patch is a significant improvement or even the solution to this problem. No flush failures have been reported, nor did I notice any visual corruptions that appear with unpatched kernels.

Calling WBINVD is absolutely necessary. But if I remove the manipulation of the GWB bit, no failures occur. Do you have evidence that the GWB bit has any effect? I am not implying that WBINVD is enough on its own.
Comment 290 Chris Wilson 2011-01-03 06:09:56 UTC
On Mon,  3 Jan 2011 06:03:44 -0800 (PST), bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=27187
> 
> --- Comment #289 from Thorsten Vollmer <thorsten@thvo.de> 2011-01-03 06:03:39 PST ---
> Calling WBINVD is absolutely necessary. But if I remove the manipulation of the
> GWB bit, no failures occur. Do you have evidence that the GWB bit has any
> effect? I am not implying that WBINVD is enough on its own.

Just a nudge and a wink from one of the win32 driver architects who
vaguely recalled that this is necessary, and the equivalent to the Global
Write Buffers flush in later generations. As always one is left baffled just
which writes are buffered... Might this be a GPU -> CPU coherency flush
rather than CPU -> GPU, which is what we need here...

But we're getting closer!
-Chris
Comment 291 Daniel Vetter 2011-01-04 14:37:35 UTC
Created attachment 41651 [details] [review]
Poke HIC bit + wbinv + cache coherency checker

Chris Wilson's latest patch with my cache coherency checker added. Spills the number of chipset flushes regularly into the dmesg and bails loudly if one fails.

Tested-by lines (like for the previous patch attempts by me) highly welcome.
Comment 292 Brian Rogers 2011-01-04 22:53:59 UTC
I'm setting up an Ubuntu PPA kernel build with that patch to get some more widespread testing. Do I also need to supply libdrm or xserver-xorg-video-intel updates to get all the relevant stability fixes, or will the kernel cover it all?

Stock Lucid has xserver-xorg-video-intel 2.9.1 and libdrm 2.4.18.
Stock Maverick has xserver-xorg-video-intel 2.12.0 and libdrm 2.4.21.
Comment 293 Daniel Vetter 2011-01-05 00:24:57 UTC
> --- Comment #292 from Brian Rogers <brian@xyzw.org> 2011-01-04 22:53:59
> I'm setting up an Ubuntu PPA kernel build with that patch to get some more
> widespread testing. Do I also need to supply libdrm or
> xserver-xorg-video-intel
> updates to get all the relevant stability fixes, or will the kernel cover
> it
> all?
>
> Stock Lucid has xserver-xorg-video-intel 2.9.1 and libdrm 2.4.18.
> Stock Maverick has xserver-xorg-video-intel 2.12.0 and libdrm 2.4.21.

There have been a bunch of fixes since 2.12, iirc. But the important thing
is cache-coherency, which is being checked with this kernel patch. If boxes
hang without any failed chipset flushes at all, it's probably just a bug
in the userspace driver. So upgrading userspace shouldn't hurt, but is
also not really required for testing. Like in the previous patches, the
important metric is failed chipset flushes per totally executed chipset
flushes.
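
As a rough illustration of that metric (not part of the patch): the checker
periodically logs lines of the form "cache flush num N", so the number of
executed flushes can be read from the last such line in the kernel log. A
minimal sketch in shell follows. The function names are made up for this
example, and since the exact text of the failure message is not quoted in
this report, the pattern to count is left to the caller.

```shell
#!/bin/sh
# Hypothetical helpers for reading the coherency checker's log output.
# Both read a kernel log (dmesg or syslog text) from stdin.

# Print the N from the last "cache flush num N" line, or 0 if there is none.
latest_flush_count() {
    awk '/cache flush num [0-9]+/ { n = $NF } END { print n + 0 }'
}

# Count the log lines matching a caller-supplied pattern. The checker's
# failure message is an unknown here, so it is passed in as "$1".
count_failures() {
    # grep -c exits non-zero when the count is 0; the count is still printed.
    grep -c -- "$1" || true
}
```

For example, "dmesg | latest_flush_count" gives the denominator; the numerator
would come from "dmesg | count_failures PATTERN" with whatever message the
checker prints on a failed flush.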

-Daniel (who fears that due to the sorry state of i8xx regressions crept
into newer versions ...)
Comment 294 Thorsten Vollmer 2011-01-05 00:29:34 UTC
(In reply to comment #291)
> Chris Wilson's latest patch with my cache coherency checker added.

Tested-by: Thorsten Vollmer <thorsten@thvo.de> (DFI-ACP G5M150-N w/ 852GME)

As stated in comment 289, I have successfully tested this combination on kernel 2.6.37-rc8. Currently I have reduced the chipset flushing to a single wbinvd_on_all_cpus(), and I am still waiting for the first failure to occur.
Comment 295 Daniel Vetter 2011-01-05 00:36:52 UTC
> --- Comment #294 from Thorsten Vollmer <thorsten@thvo.de> 2011-01-05
> 00:29:34 PST ---
> (In reply to comment #291)
>> Chris Wilson's latest patch with my cache coherency checker added.
>
> Tested-by: Thorsten Vollmer <thorsten@thvo.de> (DFI-ACP G5M150-N w/
> 852GME)
>
> As stated in comment 289, I have successfully tested this combination on
> kernel
> 2.6.37-rc8. Currently I have reduced the chipset flushing to a single
> wbinvd_on_all_cpus(), and I am still waiting for the first failure to
> occur.

The wbinvd alone is known to be insufficient on i865G. If this patch works
on i855 we need a guinea-pig to fry his i865G ... Please pass the word ;)
Comment 296 2points 2011-01-05 07:27:07 UTC
Failed flushes/all flushes is 0/540416 as of now. Userspace tool versions are in comment #282.

Tested-by: Moritz Brunner <2points@gmx.org> (Asus M2400N/i855GM)
Comment 297 Alkis Georgopoulos 2011-01-07 14:50:55 UTC
Tested the kernel from Brian's PPA on 855GM/Ubuntu Lucid.
I also had some other fixes installed (libdrm, xserver-xorg-video-intel) from https://launchpad.net/~glasen/+archive/intel-driver.
I saw very good performance (308 FPS at glxgears from 150 previously) but also some screen corruption (http://imagebin.org/131510).

Then I removed those other fixes and the corruption was gone, at a loss of performance (250 FPS).
In both cases no cache flush errors were reported.
But in the second test (with just Brian's kernel and mesa), the gears in glxgears weren't doing full rounds, they were moving like a stick was blocking them (maybe a bug in glxgears, all other 3D apps were fine).

I noticed this error though every time I changed VTs:
Jan  8 00:12:33 sotiria kernel: [  285.819894] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id


Versions:
00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
linux-image-2.6.37-graphics2+12-generic  2.6.37-graphics2+12.26~lucid
xserver-xorg               1:7.5+5ubuntu1
xserver-xorg-video-intel   2:2.9.1-3ubuntu5
mesa-utils                 7.7.1-1ubuntu4~gpufix2
libdrm-intel1              2.4.18-1ubuntu3
libdrm2                    2.4.18-1ubuntu3
Comment 298 Stenten 2011-01-07 21:26:34 UTC
Created attachment 41766 [details]
kern.log from GPU freeze with latest patch

After a day and a half of successfully testing Chris Wilson's latest patch using Brian Rogers's PPA kernel in Ubuntu Maverick, I hit a GPU lockup: frozen monitor and cursor, no caps lock, no vt switch.

I was running Google Chrome (iGoogle and Gmail open), a terminal, Nautilus, and VLC open to a paused .wmv video file. Disk activity spiked for a while, I agreed to force-quit Chrome, then the computer locked up; I tried to switch VTs, pressed Alt+SysRq+k (unsuccessfully: heavy disk activity for a bit but nothing changed), and then REISUB'd (successfully).

Kern.log from the bugged boot is attached. Nothing was captured in i915_error_state and Xorg.0.log from that boot was normal; nothing was logged after the lockup happened though.

This is on a Dell Latitude D505 with an i855GM (Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02))
Comment 299 Indan Zupancic 2011-01-08 04:33:35 UTC
I tried the "Poke the GWB bit." patch and it seems to work fine. No idea if 2.6.37 has some other improvements, or if it's this patch, but everything seems more snappy and quicker in the graphics department, even without xcompmgr -a, at least with xf86-video-intel 2.12.0.

2.13.0 used to give major screen corruption resulting in an unusable X, and also weird graphic update delays. The weird delays still seem to be present after this patch, but much less severe. They only make the system seem less snappy. A good test for this is dmesg in an xterm: With 2.13 the output sometimes halts for a moment somewhere in the middle, as if some flush is forgotten. 2.12 displays all output at once with no delay. However, this is unrelated to this bug, and the GWB poking definitely seems to get rid of all the corruption problems, so a big ack from me.

I didn't test Daniel's combined patch because it didn't apply to 2.6.37 nor to drm-intel/drm-intel-next or drm-intel/drm-intel-fixes. I also didn't try 2.6.37 without the GWB bit poking, but rc7 didn't work for me, so I assume it's because of this patch. I'll test without the patch tomorrow to be sure, but if you hear nothing it's this patch.

I think Stenten may have an unrelated problem, possibly not graphics related, and looking through the dmesg I'd say it's a memory leak in the wireless driver causing the OOM.

What Alkis sees may be related to the delay I see with the newer Intel X driver, but that is unrelated to this issue.

All in all I think this bug should be marked RESOLVED/FIXED when the fix is upstream. People shouldn't reopen this bug for unrelated problems.

Tested-by: Indan Zupancic <indan@nul.nu> (Thinkpad X40/855GM rev 02)
(In reply to comment #281)
> Created an attachment (id=41531) [details]
> Poke the GWB bit.
Comment 300 Daniel Vetter 2011-01-08 07:07:02 UTC
> --- Comment #298 from Stenten <stenten@gmail.com> 2011-01-07 21:26:34 PST ---
> I was running Google Chrome (iGoogle and Gmail open), a terminal, Nautilus, and
> VLC open to a paused .wmv video file. Disk activity spiked for a while, I
> accepted to force quit Chrome, then the computer locked up; tried to switch
> VTs, Alt+SysRq+k'd (unsuccessfully: heavy disk activity for a bit but didn't
> change anything), and then REISUB'd (successfully).

There's an oom killer report in the logs, i.e. you've simply run out of
memory. The hang doesn't look gpu related. Please keep on testing and
report any failed flushes or graphics corruptions.
Comment 301 Rémi Cardona 2011-01-09 01:28:39 UTC
@Daniel, Chris,

What patch(es) should I be trying now? Is one of the many v9 versions still required? Should I be testing just the 'poke the GWB bit' patch?

Thanks :)
Comment 302 Chris Wilson 2011-01-09 03:26:11 UTC
http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=8xx-cache-coherency contains both the HIC poking and cache-coherency checker patches.
Comment 303 Bruno 2011-01-09 13:14:26 UTC
(In reply to comment #281)
> Created an attachment (id=41531) [details]
> Poke the GWB bit.

This seems to behave properly on top of 2.6.37 (and 2.6.37-rc8), running for some time under normal desktop (Enlightenment) use and with a bit of GTK and wine apps running on top.

I've had no GPU hang or (font) corruptions in Firefox as happened before (or even with the patch from attachment #41530 [details] [review])

HW is:
00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM
        Integrated Graphics Device [8086:3582] (rev 02)
00:02.1 Display controller [0380]: Intel Corporation 82852/855GM Integrated
        Graphics Device [8086:3582] (rev 02)

Software (Gentoo):
  linux-2.6.37
  x11-base/xorg-server-1.9.2
  media-libs/mesa-7.9
  x11-libs/libdrm-2.4.22
  x11-drivers/xf86-video-intel-2.13.0

Though I haven't tried with cache coherency checker patched to the kernel yet.
Comment 304 nepo 2011-01-11 08:03:04 UTC
Using a Fujitsu Siemens M7400 with Intel 855, the kernel from
https://launchpad.net/~brian-rogers/+archive/graphics-fixes-testing,
the Glasen patch and the shadow line in xorg.conf, the system works perfectly apart from Google Earth, whose globe is partly black. When I take out the shadow line, the small corruptions in the windows mentioned before by another user appear, but Google Earth works perfectly.
Comment 305 Rémi Cardona 2011-01-11 13:45:53 UTC
As for me, my 855 seems to work a lot better with Chris's kernel branch. The only downside is that rendering is somewhat slower than it used to be (I can get numbers on that if needed), especially with text in my terminal.

Though I do have a hunch that DDX 2.14.0 is actually worse than 2.13.0. I'm currently trying to bisect the issue but it's hard to be 100% certain that one revision is either good or bad...

Cheers
Comment 306 nepo 2011-01-11 15:18:34 UTC
With only Brian's kernel installed, and neither the Glasen patch nor the shadow line, there seem to be only very few artifacts in the windows, though my fan keeps on running :-/
Comment 307 legolas558 2011-01-11 21:36:09 UTC
vanilla kernel 2.6.37-rc8 with 0001-agp-intel-Experiment-with-a-855GM-GWB-bit.patch and everything nominal with my 855GM rev.02

No flush errors or glitches whatsoever, I cannot notice the difference with the old times video driver.

Don't wanna speak too early but maybe you got it, finally?
Comment 308 nepo 2011-01-15 16:17:24 UTC
Unfortunately I just had a total freeze while playing an AVI file with smplayer on my Amilo M7400 (Intel 855) with Brian's 2.6.37-graphics2+12 kernel. Bummer!
Added Xorg.log
Comment 309 nepo 2011-01-15 16:19:29 UTC
Created attachment 42089 [details]
Xorg.log after freeze
Comment 310 Indan Zupancic 2011-01-17 21:40:12 UTC
(In reply to comment #308)
> Unfortunately just had a total freeze whilst playing an avifile with smplayer
> on my Amilo M7400 Intel 855 and Bryan's 2.6.37-graphics2+12 kernel. Bummer!
> Added Xorg.log

Do you get any failed chipset flush messages in dmesg? (Either before or after a freeze.)
Comment 311 nepo 2011-01-18 04:25:41 UTC
Well, the dmesg logfile stays empty, but the syslog tells the following:

Jan 18 13:13:45 frank kernel: [  244.926758] cache flush num 3328
Jan 18 13:13:48 frank kernel: [  247.880907] cache flush num 3584
Jan 18 13:14:09 frank kernel: [  268.375338] cache flush num 3840
Jan 18 13:16:21 frank kernel: [  400.536308] cache flush num 4096
Jan 18 13:16:28 frank kernel: [  407.530350] cache flush num 4352
Jan 18 13:16:36 frank kernel: [  415.580043] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jan 18 13:16:36 frank kernel: [  415.589006] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 63907 at 63906, next 63908)
Jan 18 13:16:36 frank kernel: [  415.592369] [drm:i915_reset] *ERROR* Failed to reset chip.

The video had only been running for 5 sec or so when the GPU hung. By the way, this error is reproducible: whenever I start this video, I get the freeze (and a blue screen within the video player) after 5 - 10 sec.
Comment 312 Daniel Vetter 2011-01-18 07:04:24 UTC
On Tue, 18.01.2011, 13:25, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=27187
>
> --- Comment #311 from nepo <dwistal@yahoo.de> 2011-01-18 04:25:41 PST ---
> Well, the dmesg logfile stays empty, but the syslog tells the following:

[An aside:]
Your dmesg probably filters everything below the error level. Use

dmesg -n 9

to see more on the console (and in the dmesg output).

> The video is just started 5 sec or so before the GPU hung. By the way,
> this
> error is reproduceable, whenever I start this video, I get the freeze (and
> blue
> screen within the video player) after 5 - 10 sec.

Please rehang your gpu and attach the i915_error_state. Might be a known
(but unfortunately undiagnosed) xv hang.
-Daniel
Comment 313 nepo 2011-01-18 07:20:48 UTC
dmesg -n 9
just tells me "klogctl: Die Operation ist nicht erlaubt" (operation not permitted), and so does running it with sudo. Since I'm only a half-proficient user, I don't know how to proceed....
I just checked syslog; it definitely stops after the crash.
Comment 314 Daniel Vetter 2011-01-18 07:37:57 UTC
> --- Comment #313 from nepo <dwistal@yahoo.de> 2011-01-18 07:20:48 PST ---
> dmesg -n 9
> just tells me "klogctl: Die Operation ist nicht erlaubt" (operation
> invalid),
> as well as sudo. Since I'm only a half-proficient user, I dont know how to
> proceed....
> just checked syslog, it definitely stopps after the crash.

http://intellinuxgraphics.org/i915_error_state.html for the error_state.

If you can't use the box locally anymore, that usually means remote log-in
via ssh from a 2nd box. Other possibility is to write a small script that
dumps the error state every 5s (into a different file each time around)
before hanging the box.
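
A minimal sketch of such a script, in case it helps. It assumes debugfs is
mounted at /sys/kernel/debug and that the error state shows up as
dri/0/i915_error_state below it; the function name and the output file
naming are made up for this example. It has to run as root, and it syncs
after each copy so the last dump has a chance to reach the disk before the
box dies.

```shell
#!/bin/sh
# Copy the error state into a new numbered file on each pass.
#   $1: file to dump (assumed: /sys/kernel/debug/dri/0/i915_error_state)
#   $2: output directory
#   $3: seconds between dumps (default 5)
#   $4: number of passes, 0 = keep going until killed (default 0)
dump_error_state() {
    src=$1; out=$2; interval=${3:-5}; passes=${4:-0}
    mkdir -p "$out" || return 1
    i=0
    while [ "$passes" -eq 0 ] || [ "$i" -lt "$passes" ]; do
        i=$((i + 1))
        # A different file each time around, so older dumps are kept.
        cat "$src" > "$out/i915_error_state.$i" 2>/dev/null
        sync
        sleep "$interval"
    done
}
```

Start it with something like
"dump_error_state /sys/kernel/debug/dri/0/i915_error_state /var/tmp/i915 5 &"
and then trigger the hang; after the reboot, the highest-numbered file is the
one to attach.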
Comment 315 nepo 2011-01-18 10:38:06 UTC
Okay got it! Take a look at the attached file! Greezt
Comment 316 nepo 2011-01-18 10:40:11 UTC
Created attachment 42176 [details]
freeze log dmesg
Comment 317 legolas558 2011-01-18 15:30:44 UTC
Created attachment 42187 [details]
FF3.6 persona background thrashed (only top!)

The top part of the background was garbled (see attachment), moving/resizing window did not clear the artifacts

After some time the correct bitmap was eventually reloaded
Comment 318 Daniel Vetter 2011-01-19 01:04:22 UTC
@legolas:
Nice garbage ;) Doesn't look like the usual cache coherency problem, more
like pte lookup/tlb fail. Still on a i855gm, right?
Comment 319 Daniel Vetter 2011-01-19 01:06:02 UTC
@nepo:
I need the i915_error_state as detailed in the prev link. dmesg just tells that the gpu died, i915_error_state can tell (sometimes) why exactly.
btw, #intel-gfx on freenode irc is a nice place to get help in case you're stuck.
Comment 320 legolas558 2011-01-19 02:28:03 UTC
Yes, same hardware, i855GM rev.02

I am seeing it right now while typing, so it must happen on a regular basis. I am no longer getting flush errors but instead a flood of:

[ 4675.815930] [drm:intel_prepare_page_flip] *ERROR* Prepared flip multiple times
Comment 321 2points 2011-01-19 02:52:32 UTC
(In reply to comment #320)
> [ 4675.815930] [drm:intel_prepare_page_flip] *ERROR* Prepared flip multiple
> times

I believe this isn't necessarily related to your problem - I used to see the same message (before I removed it from the kernel sources because it was flooding my syslog with gigabytes of data), but haven't experienced any of the corruptions you describe.
Comment 322 Daniel Vetter 2011-01-19 04:39:01 UTC
> --- Comment #320 from legolas558 <legolas558@email.it> 2011-01-19 02:28:03
> PST ---
> Yes, same hardware, i855GM rev.02
>
> I am seeing it right now while typing, so it must happen on a regular
> basis. I
> am no more getting flush errors but instead a flood of:
>
> [ 4675.815930] [drm:intel_prepare_page_flip] *ERROR* Prepared flip
> multiple
> times

Possibly related: #30654 (also reporting the same error on a i855gm).
Likely caused by a hw oddity and/or some fancy race in our pageflip
interrupt programming. If you don't see any corruptions/hangs, it's a
rather harmless warning.

[btw: I think all current bugs discussed here are not coherency related
anymore. I just try to triage these problems and point people to the
correct bugs.]
Comment 323 nepo 2011-01-19 04:53:21 UTC
okay, had to turn this report service on. please find the report in the extra file! thanks, n.
Comment 324 nepo 2011-01-19 04:55:01 UTC
Created attachment 42198 [details]
i915 gpu freeze report
Comment 325 legolas558 2011-01-19 15:07:17 UTC
(In reply to comment #321)
> (In reply to comment #320)
> > [ 4675.815930] [drm:intel_prepare_page_flip] *ERROR* Prepared flip multiple
> > times
> 
> I believe this isn't necessarily related to your problem - I used to see the
> same message (before I removed it from the kernel sources because it was
> flooding my syslog with gigabytes of data), but haven't experienced any of the
> corruptions you describe.

Yeah, it might not be related - it's just the new fish in the net for now. And while reading your post I was actually looking for it in the kernel to do the same thing!
Comment 326 legolas558 2011-01-19 15:13:51 UTC
(In reply to comment #322)
> > --- Comment #320 from legolas558 <legolas558@email.it> 2011-01-19 02:28:03
> > PST ---
> > Yes, same hardware, i855GM rev.02
> >
> > I am seeing it right now while typing, so it must happen on a regular
> > basis. I
> > am no longer getting flush errors but instead a flood of:
> >
> > [ 4675.815930] [drm:intel_prepare_page_flip] *ERROR* Prepared flip
> > multiple
> > times
> 
> Possibly related: #30654 (also reporting the same error on a i855gm).
> Likely caused by a hw oddity and/or some fancy race in our pageflip
> interrupt programming. If you don't see any corruptions/hangs, it's a
> rather harmless warning.
> 
> [btw: I think all current bugs discussed here are not coherency related
> > anymore. I just try to triage these problems and point people to the
> correct bugs.]

It's surely harmless, but I can't say whether I am also affected by #30654. Anyway, you're right: I'd say this bug is fixed for me because, even though I still get visual corruption, I have not been able to crash it any more, and that's good.
Comment 327 Indan Zupancic 2011-01-19 17:28:36 UTC
There is not one report with failed chipset flushes after Chris' GWB bit poking patch. There are other, unrelated bugs reported here, but the coherency problem itself seems pretty much fixed. So I propose the GWB poking patch goes upstream (if it hasn't yet) and this bug be closed.

For people wanting to post problems here: Apply the GWB poking patch + chipset flush checker (or use "Bryan's" PPA thingy) and only report here if you see chipset flush failures. Otherwise it's an unrelated bug and you have to find a more fitting bugzilla bug. If you can't, open a new bug.

Only reopen this bug if you get failed chipset flushes after applying the patches!
Comment 328 René Gabriëls 2011-01-19 19:15:23 UTC
Created attachment 42216 [details]
Flush failures with Chris's patch

Sorry to spoil the party, but I just got my first cache flush failure with Chris's patch (see attachment for dmesg excerpt).  I've been running it for a number of days now, and until just now I hadn't seen any cache flush failure.

My configuration is:
- kernel: drm-intel-next (23 dec) + Daniel Vetter's patch
- libdrm-2.4.23
- xf86-video-intel-2.13.0
- xorg-server-1.9.3
- mesa-7.10 with gallium enabled
Comment 329 Chris Wilson 2011-01-20 02:29:58 UTC
(In reply to comment #328)
> Sorry to spoil the party, but I just got my first cache flush failure with
> Chris's patch (see attachment for dmesg excerpt).  I've been running it for a
> number of days now, and until just yet I hadn't seen any cache flush failure.

More importantly, which chipset?
Comment 330 René Gabriëls 2011-01-20 05:50:12 UTC
(In reply to comment #329)
> More importantly, which chipset?

855GM rev 02.
Comment 331 Daniel Vetter 2011-01-20 11:15:52 UTC
> --- Comment #323 from nepo <dwistal@yahoo.de> 2011-01-19 04:53:21 PST ---
> okay, had to turn this report service on. please find the report in the extra
> file! thanks, n.

Overlay hang. Please open a new bug report and add all the usual details
(http://intellinuxgraphics.org/how_to_report_bug.html) and add the
i915_error_state you've captured. Also add me to the cc: of the bug.
Comment 332 SMF 2011-01-21 03:01:37 UTC
(In reply to comment #302)
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=8xx-cache-coherency
> contains both the HIC poking and cache-coherency checker patches.
 
IBM Thinkpad R51 LFS+BLFS system ( Intel(R) Pentium(R) M processor 1.70GHz)
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)

Running the kernel referred to above (2.6.37-rc7+patch)
X server 1.9.3
intel driver 2.14.0

Hammering it with my test program (see 37157) plus an fvwm desktop, mplayer (X-VIDEO), and repeatedly compiling the kernel and all of the latest Xorg for the last 24 hours has resulted in:

cache flush num 654592
No failures and no detectable visual issues, and overall performance is good (for this class of system).

I will leave the system soaking and see if anything happens later.

thanks.
Comment 333 SMF 2011-01-22 02:59:34 UTC
(In reply to comment #332)
> (In reply to comment #302)
> > http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=8xx-cache-coherency
> > contains both the HIC poking and cache-coherency checker patches.
> 
> IBM Thinkpad R51 LFS+BLFS system ( Intel(R) Pentium(R) M processor 1.70GHz)
> 00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated
> Graphics Device (rev 02)
> 
> Running the kernel referred to above (2.6.37-rc7+patch)
> X server 1.9.3
> intel driver 2.14.0
> 
> Hammered with my test program (see 37157) plus fvwm desktop plus mplayer
> (X-VIDEO) + compiling kernel and all of latest XORG repeatedly for the last
> 24hours has resulted in; 
> 
> cache flush num 654592
> No failures and no detectable visual issues and overall performance is good(for
> this class of system).
> 
> I will leave the system soaking and see if anything happens later.
> 
> thanks.

Died after ~42 Hours with

Xorg.log:

[138586.394] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[138586.395] 
Backtrace:
[138586.481] 0: /usr/X11R6/bin/X (xorg_backtrace+0x37) [0x8104137]

dmesg:

cache flush num 954624
cache flush num 954880
cache flush num 955136
cache flush num 955392
cache flush num 955648
cache flush num 955904
cache flush num 956160
cache flush num 956416
cache flush num 956672
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 609339698 at 609339697, next 609339699)
[drm:i915_reset] *ERROR* Failed to reset chip.

Is this the same problem or something else? How should I proceed?

Thanks
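
When triaging a log like the one above, the two useful facts are the last flush counter reached and which DRM errors followed it. A rough Python sketch, assuming only the two line shapes visible in this comment ("cache flush num N" and "[drm:function] *ERROR* message"); the function name `summarize_dmesg` is made up for illustration, and the sketch deliberately does not try to match the flush checker's own failure message, since its exact format is not shown here:

```python
import re

# Line shapes taken from the dmesg excerpt above.
FLUSH_RE = re.compile(r"cache flush num (\d+)")
DRM_ERROR_RE = re.compile(r"\[drm:(\w+)\] \*ERROR\* (.*)")


def summarize_dmesg(text):
    """Return (last_flush_counter, [(drm_function, message), ...])."""
    last_flush = None
    errors = []
    for line in text.splitlines():
        match = FLUSH_RE.search(line)
        if match:
            last_flush = int(match.group(1))
            continue
        match = DRM_ERROR_RE.search(line)
        if match:
            errors.append((match.group(1), match.group(2).strip()))
    return last_flush, errors
```

Feeding it the output of `dmesg` gives the last counter value plus the list of DRM errors, which is enough to tell a plain GPU hang (hangcheck error with no flush failure reported) apart from a log that needs a closer look.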
Comment 334 Indan Zupancic 2011-02-11 03:36:38 UTC
(In reply to comment #333)
> Died after ~42 Hours with
> 
> Xorg.log:
> 
> [138586.394] [mi] EQ overflowing. The server is probably stuck in an infinite
> loop.
> [138586.395] 
> Backtrace:
> [138586.481] 0: /usr/X11R6/bin/X (xorg_backtrace+0x37) [0x8104137]
> 
> dmesg:
> 
> cache flush num 954624
> cache flush num 954880
> cache flush num 955136
> cache flush num 955392
> cache flush num 955648
> cache flush num 955904
> cache flush num 956160
> cache flush num 956416
> cache flush num 956672
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting
> 609339698 at 609339697, next 609339699)
> [drm:i915_reset] *ERROR* Failed to reset chip.
> 
> Is this the same problem or something else ? how should I proceed ?

This is probably something else, as failed flushes should be detected by the patch. Best way forward is to capture the error log from debugfs and open a new bugreport. (Perhaps I'm getting jaded, but I'm surprised it lasted 42 hours under any heavy video load.)

It's great that the coherence bug seems to be solved, but could you guys please push it upstream? It's still not in kernel 2.6.38-rc4. Then this bug can be closed for good. There are no downsides to the patch, and it improves things a lot. Pretty please?
Comment 335 Indan Zupancic 2011-02-11 03:43:24 UTC
(In reply to comment #328)
> Created an attachment (id=42216) [details]
> Flush failures with Chris's patch
> 
> Sorry to spoil the party, but I just got my first cache flush failure with
> Chris's patch (see attachment for dmesg excerpt).  I've been running it for a
> number of days now, and until just yet I hadn't seen any cache flush failure.
> 
> My configuration is:
> - kernel: drm-intel-next (23 dec) + Daniel Vetters patch
> - libdrm-2.4.23
> - xf86-video-intel-2.13.0
> - xorg-server-1.9.3
> - mesa-7.10 with gallium enabled

Are you sure that the HIC poking patch was/is in drm-intel-next?
It doesn't seem like it was. You can check your source to make
sure you have I830_HIC defined in drivers/char/agp/intel-agp.h and
that it's poked in intel_i830_setup_flush() in intel-gtt.c.
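
The check described above can be scripted. A coarse Python sketch, assuming intel-gtt.c lives next to intel-agp.h in drivers/char/agp (the comment only names the file, so the directory is an assumption) and that a plain substring match on I830_HIC is good enough; it does not verify that the register is touched inside intel_i830_setup_flush() specifically. The helper names are hypothetical:

```python
from pathlib import Path

# Assumed directory for both files; only intel-agp.h's path is given above.
AGP_DIR = "drivers/char/agp"


def source_mentions(tree, relpath, needle):
    """Return True if the file tree/relpath exists and contains needle."""
    path = Path(tree) / relpath
    if not path.is_file():
        return False
    return needle in path.read_text(errors="replace")


def hic_patch_present(tree):
    """Coarse check: I830_HIC appears in both the header and intel-gtt.c."""
    defined = source_mentions(tree, f"{AGP_DIR}/intel-agp.h", "I830_HIC")
    used = source_mentions(tree, f"{AGP_DIR}/intel-gtt.c", "I830_HIC")
    return defined and used
```

Calling `hic_patch_present("/usr/src/linux")` (path is only an example) returns False if either file lacks the identifier, which is a hint to look at the tree by hand rather than proof that the patch is missing.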
Comment 336 René Gabriëls 2011-02-18 12:56:04 UTC
On 11-02-11 12:43, bugzilla-daemon@freedesktop.org wrote:
> Are you sure that the HIC poking patch was/is in drm-intel-next?
> It doesn't seem like it was. You can check your source to make 
> sure you got I830_HIC defined in drivers/char/agp/intel-agp.h and
> that it's poked in intel_i830_setup_flush() in intel-gtt.c.

It was in Daniel's latest patch.  The kernel I compiled from Chris's tree has no
cache coherency problems as far as I can tell.  The only difference I can find
is that in intel-gtt.c in Chris's tree *i8xx_flush_page and *i8xx_page are not
members of static struct _intel_private.  The code that uses these members is
absent as well.  If you want, I can post the diff.

Anyway, so far, the kernel from Chris's tree appears to work fine, so there must
be a difference somewhere.
Comment 337 Indan Zupancic 2011-02-18 21:23:22 UTC
(In reply to comment #336)
> On 11-02-11 12:43, bugzilla-daemon@freedesktop.org wrote:
> > Are you sure that the HIC poking patch was/is in drm-intel-next?
> > It doesn't seem like it was. You can check your source to make 
> > sure you got I830_HIC defined in drivers/char/agp/intel-agp.h and
> > that it's poked in intel_i830_setup_flush() in intel-gtt.c.
> 
> It was in Daniel's latest patch.  The kernel I compiled from Chris's tree
> has no cache coherency problems as far as I can tell.  The only difference
> I can find is that in intel-gtt.c in Chris's tree *i8xx_flush_page and
> *i8xx_page are not a member of static struct _intel_private.  The code that
> uses these members is absent as well.  If you want, I can post the diff.
> 
> Anyway, so far, the kernel from Chris's tree appears to work fine, so there
> must be a difference somewhere.

Ah, indeed. Bugger. Not sure what's up with the GWB poking then. Should we close
this bug or not? It seems to be mostly solved one way or the other.

2.6.38-rc4 kernel without Chris' patch seems fine too for me, except I do have
a new kind of screen corruption, but it's not related to this coherency bug.
I'll try rc5 and post bisection results to lkml.
Comment 338 legolas558 2011-02-19 04:12:40 UTC
With the latest vanilla kernel (without Chris' patch) I get a working Xorg for a while, and then text disappears (as white on white) along with several other artifacts.
Comment 339 Daniel Vetter 2011-02-25 04:01:40 UTC
> --- Comment #338 from legolas558 <legolas558@email.it> 2011-02-19 04:12:40 PST ---
> With latest vanilla kernel (without Chris' patch) I get a working Xorg for a
> while, and then text disappears (as white on white) with several other
> artifacts

2nd gen chips (including i855gm) had a rather nasty bug due to the new
relaxed tiling code. The fix is in drm-intel-fixes, libdrm and
xf86-video-intel. The kernel simply rejects buggy tiling parameters and
userspace then falls back to untiled operations. So the kernel fix alone is
enough to fix these corruptions; all three are required to avoid a
performance drop.

Can you test whether this is the problem you're experiencing?
-Daniel
Comment 340 legolas558 2011-02-26 00:54:02 UTC
(In reply to comment #339)
> > --- Comment #338 from legolas558 <legolas558@email.it> 2011-02-19 04:12:40 PST ---
> > With latest vanilla kernel (without Chris' patch) I get a working Xorg for a
> > while, and then text disappears (as white on white) with several other
> > artifacts
> 
> 2nd gen chips (including i855gm) had a rather nasty bug due to the new
> relaxed tiling code. Fix is in drm-intel-fixes, libdrm and
> xfree86-video-intel. The kernel simply rejects buggy tiling parameters and
> userspace then falls back to untiled operations. So kernel is enough to
> fix these corruptions, all three required to avoid a performance drop.
> 
> Can you test whether this is the problem you're experiencing?
> -Daniel

In yesterday's git pull I have noticed some interesting patches and I can say that the vanilla kernel (with xorg-server 1.9.4 and xf86-video-intel 2.14.0) is now very stable, no video corruptions and videos playing smoothly.

I will report back as soon as I experience any issues, but my early impression is that it has been fixed. I can confirm that there is a performance drop (glxgears @ 61 FPS), so I think you have described my case correctly.
Comment 341 legolas558 2011-03-03 00:07:46 UTC
Created attachment 44058 [details]
Font thrashing on vanilla 2.6.38-rc6

With the same configuration as in my last comment, I get font artifacts, as shown in the screenshot. This happens after a while, and repainting the screen fixes it. It's the same issue as reported previously regarding the FF3.6 persona background.

I am sure this makes perfect sense to the developers involved. I think this issue might be fixed with more recent Xorg and DRM versions, but I prefer to wait for their release in Arch Linux stable. Basically we have a usable Intel video driver now, and I think it is quite comparable to the one we had before KMS, because with this driver I am also getting the good ol' blue screen when playing videos (which I also got with the pre-KMS driver). It happens at random after some time of use and is only fixed by rebooting; the video does not play (nice BSOD blue), although we can still hear its audio.

Tip: those 3 vertically aligned dots near the center axis look interesting to me; they remind me of pitch offsets.
Comment 342 Daniel Vetter 2011-03-03 04:45:51 UTC
> --- Comment #341 from legolas558 <legolas558@email.it> 2011-03-03 00:07:46
> PST ---
> Created an attachment (id=44058)
>  --> (https://bugs.freedesktop.org/attachment.cgi?id=44058)
> Font thrashing on vanilla 2.6.38-rc6

Please upgrade to -rc7, that one contains my patch to fix relaxed tiling
related corruptions on gen2. I can't tell right away from that screenshot
whether this is the problem with certainty, but it's likely.
Comment 343 legolas558 2011-03-05 05:12:18 UTC
Created attachment 44148 [details]
font garbled upon scrolling

Now using vanilla 2.6.38-rc7, I have found a reproducible way to trigger the garbled font issue (see attached screenshot). It's enough to scroll with the mouse wheel and we suddenly have modern art; furthermore, the video-playing surfaces instantly turn BSOD blue, while music keeps playing (never experienced an Xorg crash since these visualization issues).
Comment 344 Daniel Vetter 2011-03-05 06:58:25 UTC
> https://bugs.freedesktop.org/show_bug.cgi?id=27187
> Now using vanilla 2.6.38-rc7, I have found a reproducible way to trigger the
> garbled font issue (see attached screenshot). It's enough to scroll with the
> mousewheel and we suddenly have modern art; furthermore, the video-playing
> surfaces instantly turn into BSOD blue, while music keeps playing (never
> experienced a Xorg crash since these visualization issues).

Sounds a lot like #34980. Please add a "me too" with this short description
+ screenshot + chipset/hw info.
