Bug 26345 - [845G] CPU/GPU incoherency
Summary: [845G] CPU/GPU incoherency
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: All Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 19068 21826 22771 23032 24137 24825 25091 25552 26200 26229 26580 26723 26746 26803 27245 27578 28187 29006 30064 34868 40181 40960 53065 54093 55934 56933
Depends on:
Blocks:
 
Reported: 2010-01-30 23:06 UTC by Brian Rogers
Modified: 2017-07-14 14:10 UTC (History)
CC List: 51 users

See Also:
i915 platform: ALL
i915 features:


Attachments
intel_gpu_dump output (112.34 KB, application/x-gzip)
2010-01-30 23:06 UTC, Brian Rogers
dmesg with newest stuff (30.12 KB, text/plain)
2010-01-31 06:06 UTC, Brian Rogers
dmesg on 2.6.33-rc8 with drm.debug=0x06 and tests srect10-srect500 (76.20 KB, text/plain)
2010-02-13 06:05 UTC, Brian Rogers
i915_error_state with srect tests (172 bytes, text/plain)
2010-02-13 06:08 UTC, Brian Rogers
i915_error_state with batchbuffer dump (756.31 KB, text/plain)
2010-02-16 06:03 UTC, Brian Rogers
i915_error_state with IPEHR = wtf (756.31 KB, text/plain)
2010-02-16 15:06 UTC, Brian Rogers
Wait after memory barriers for the system memory to update (1.52 KB, patch)
2010-02-26 09:25 UTC, Chris Wilson
i915_error_state from my i855 (756.67 KB, text/plain)
2010-03-04 00:59 UTC, Daniel Vetter
Flush the GTT by disabling/enabling it. (3.08 KB, patch)
2010-03-04 05:10 UTC, Chris Wilson
Xorg 1.7.5 log with patched kernel (26.68 KB, text/plain)
2010-03-04 05:58 UTC, legolas558
i915_error_state with GTT enable/disable patch (758.21 KB, text/plain)
2010-03-04 06:25 UTC, 2points
dri debugfs dumps for i855GM + Xorg.0.log (254.02 KB, application/octet-stream)
2010-03-04 07:58 UTC, legolas558
Debug logs from unpatched drm-intel-next kernel freeze (224.02 KB, application/x-gzip)
2010-03-07 10:15 UTC, Scott Hansen
Hack that prevents freezing (920 bytes, patch)
2010-03-13 02:13 UTC, Brian Rogers
(hopefullyy) fix gtt cache coherency (176.44 KB, patch)
2010-03-18 05:32 UTC, Daniel Vetter
dmesg with gtt cache coherency patch (35.99 KB, text/plain)
2010-03-18 14:58 UTC, 2points
section of dmesg with GTT flush failures (13.74 KB, text/plain)
2010-03-19 04:05 UTC, legolas558
chipset flushing quality script (221 bytes, text/plain)
2010-03-19 05:31 UTC, legolas558
chipset flushing quality script (revised) (296 bytes, text/plain)
2010-03-19 05:35 UTC, legolas558
Chipset flushing quality script for freedesktop bug 26345 (403 bytes, text/plain)
2010-03-25 04:43 UTC, legolas558
Script to check chipset flushing quality (387 bytes, text/plain)
2010-03-25 10:01 UTC, legolas558
/sys/kernel/debug/dri after the GTT flush failures (768 bytes, application/octet-stream)
2010-03-25 10:02 UTC, legolas558
dmesg after 7 GTT flush failures (48.05 KB, text/plain)
2010-03-25 10:12 UTC, legolas558
gttqual script to check GTT flushing quality (844 bytes, text/plain)
2010-03-26 15:32 UTC, legolas558
i915_error_state from drm-intel-next kernel (756.66 KB, text/plain)
2010-04-04 12:44 UTC, Geir Ove Myhr
Graphics-breaking workaround: skip i830_uxa_solid (920 bytes, patch)
2010-04-04 13:15 UTC, Brian Rogers
output of intel gpu dump after a crash. (116.36 KB, application/x-gzip)
2010-04-05 08:50 UTC, theonewiththeevillook
New stab at working around the i845 tlb issues. (5.20 KB, patch)
2011-09-20 06:00 UTC, Daniel Vetter
The relevant part of dmesg with Daniel's patch (798 bytes, application/octet-stream)
2011-11-01 15:04 UTC, Balló György

Description Brian Rogers 2010-01-30 23:06:14 UTC
Created attachment 32942
intel_gpu_dump output

The following command is pretty good at locking up the GPU on this i845 machine:
x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1

It doesn't always freeze on the same test, just somewhere random within this range. However, I've run this several times, and so far it hasn't made it to the end without freezing.

GPU is the following:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

I observed this on Ubuntu Karmic. Jaunty is unaffected.
Comment 1 Chris Wilson 2010-01-31 02:00:55 UTC
Hmm, that test is surviving in a continuous loop on my i845. One substantial difference between our environments (besides my system using the current tip) is that I have a rev 03 compared to your rev 01. Could this be another notorious gen2 h/w bug? Given that the batchbuffer appears perfectly innocent, we don't have enough information to deduce how the h/w got itself into this state.

Thanks for reducing this to a small test case, though it is still baffling. The biggest change between Jaunty and Karmic would be that the i8xx was blacklisted and acceleration disabled for Jaunty due to the severe number of bugs (some extremely nasty cache coherency issues in particular). It would still be useful to know whether the bug is reproducible on the latest stack. In particular, a couple of fence flushing issues have been spotted (though I don't think the tests you performed would have hit those paths) and we now upload images differently (which may stress the h/w differently, for better or for worse). Either way, it will help to narrow the differences between our systems.
Comment 2 Brian Rogers 2010-01-31 06:06:48 UTC
Created attachment 32947
dmesg with newest stuff

Also freezes on a Lucid install with kernel 2.6.33-rc6 and the xorg-edgers PPA, which has the current xserver-xorg-video-intel and a very recent libdrm and mesa.

What do you make of this stuff in dmesg?
[    2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[    2.142832] render error detected, EIR: 0x00000010
[    2.142838] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[    2.142852] render error detected, EIR: 0x00000010
Comment 3 Chris Wilson 2010-01-31 06:12:46 UTC
(In reply to comment #2)
> What do you make of this stuff in dmesg?
> [    2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor
> 0
> [    2.142832] render error detected, EIR: 0x00000010

That's an unrelated and apparently harmless error. The most serious side-effect it has at the moment is that it prevents i915_error_state from capturing the later hang.

I'm now close to 6 hours of runtime with 'x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1', still no hang on this i845. :(
Comment 4 Brian Rogers 2010-01-31 06:20:01 UTC
See also bug 26344, which is a GL-triggered freeze bug on the same hardware. Can you reproduce that one on your i845?
Comment 5 Brian Rogers 2010-01-31 07:47:53 UTC
I've been running a PPA to try to bisect random freezing here:
https://bugs.launchpad.net/bugs/456902

In fact, that's what led me to find this test case. Anyway, I just asked people to report results with x11perf, and someone just reported that it freezes with rev 03 hardware:

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
Comment 6 Brian Rogers 2010-02-13 06:05:40 UTC
Created attachment 33268
dmesg on 2.6.33-rc8 with drm.debug=0x06 and tests srect10-srect500

The stippled rectangle tests can also trigger a freeze. Here's a dmesg from 2.6.33-rc8 with xorg-edgers. It took eight cycles of this, on a freshly started Gnome desktop with nothing else going on:

while ! ( dmesg | grep 'render error' ); do x11perf -range srect10,srect500 -time 1 -repeat 1; done

I set DISPLAY=:0 and ran this test from an ssh login, because I think updating a terminal on the same screen interferes with the test.
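The shell loop in this comment can also be written as a small harness that counts how many cycles it took to reach the error, which makes runs easier to compare. This is only a sketch: it assumes `dmesg` and `x11perf` are on the PATH and that DISPLAY is already set in the environment, and the function names and the `max_cycles` limit are hypothetical.

```python
import subprocess

HANG_MARKER = "render error"  # substring the shell loop greps for in dmesg


def has_render_error(dmesg_text: str) -> bool:
    """Return True if any kernel log line contains the hang marker."""
    return any(HANG_MARKER in line for line in dmesg_text.splitlines())


def read_dmesg() -> str:
    """Capture the kernel log; assumes `dmesg` is readable by the caller."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def run_srect_cycle() -> None:
    """One pass of the stippled-rectangle tests, as in the shell loop."""
    subprocess.run(
        ["x11perf", "-range", "srect10,srect500", "-time", "1", "-repeat", "1"],
        check=False,
    )


def cycles_until_hang(run_cycle=run_srect_cycle, get_log=read_dmesg, max_cycles=100):
    """Repeat the test until the log shows a render error.

    Returns the number of cycles run before the error was seen, or None if
    max_cycles (a hypothetical safety limit) is reached first.
    """
    for cycle in range(max_cycles + 1):
        if has_render_error(get_log()):
            return cycle
        if cycle < max_cycles:
            run_cycle()
    return None
```

As in the original loop, the log is checked before each run, so a render error that is already present in dmesg yields 0 cycles.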
Comment 7 Brian Rogers 2010-02-13 06:08:35 UTC
Created attachment 33269
i915_error_state with srect tests

With the console initialization bug fixed, I can now get an i915_error_state corresponding to this freeze.

Let me know if you need any more debug info.
Comment 8 Brian Rogers 2010-02-14 06:39:50 UTC
I'm finding ASCII characters in the IPEHR register sometimes, as if the wrong data is being sent to the graphics card. For example, one run looked like this:

EIR: 0x00000000
  PGTBL_ER: 0x00000000
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x45494c43
  INSTDONE: 0x00ffffc1
  ACTHD: 0x01a33008

With byte order swapped, the value of IPEHR corresponds to the string 'CLIE'.

I just applied Chris Wilson's latest batch buffer reporting patch. I now have the system running the x11perf test on bootup until it freezes, then saving /sys/kernel/debug/dri/0 and rebooting to do the test again.

So far I see suspicious values of IPEHR in 3 out of 13 runs. I don't see anything strange like that in the batchbuffer dumps so far, though. I'll leave this test running overnight.
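Sorting a pile of saved dumps into "suspicious" and "normal" IPEHR values can be automated with a simple heuristic. The sketch below assumes the registers appear as `NAME: 0xXXXXXXXX` lines, as in the excerpt in this comment; the function names are hypothetical, and the printable-ASCII test is only a heuristic (a legitimate instruction could in principle consist of four printable bytes).

```python
import re

# Matches lines such as "  IPEHR: 0x45494c43" from the dump excerpt.
REGISTER_LINE = re.compile(r"^\s*([A-Z_]+):\s*0x([0-9a-fA-F]{8})\s*$")


def parse_registers(error_state: str) -> dict:
    """Collect 'NAME: 0xXXXXXXXX' lines into a {name: int} mapping."""
    regs = {}
    for line in error_state.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def looks_like_text(value: int) -> bool:
    """Heuristic: all four bytes are printable ASCII or common whitespace."""
    raw = value.to_bytes(4, "little")
    return all(32 <= b < 127 or b in (9, 10, 13) for b in raw)
```

Applied to the excerpt above, IPEHR is flagged while INSTDONE is not.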
Comment 9 Brian Rogers 2010-02-16 06:03:57 UTC
Created attachment 33331
i915_error_state with batchbuffer dump

Kernel is the current linux master branch (v2.6.33-rc8-26-g0813e22) with the batchbuffer dumping patch added.

$ apt-cache policy libdrm2 xserver-xorg-video-intel | grep Installed
  Installed: 2.4.17+git20100210.4f0f8717-0ubuntu0sarvatt
  Installed: 2:2.10.0+git20100211.00e7312d-0ubuntu0sarvatt

This is a dump that does not have recognizable string data in IPEHR. It was made with a couple runs of x11perf while running a program that repeatedly forked and waited on its children to stress the CPU, and no drm debug messages turned on. The x11perf command was this:

x11perf -range srect10,srect500 -time 1 -repeat 1
Comment 10 Brian Rogers 2010-02-16 15:06:36 UTC
Created attachment 33345
i915_error_state with IPEHR = wtf

I figured out exactly where the string data in IPEHR is coming from. In fact, I was able to plant my own data into that register. Here's a dump where

IPEHR = 0x0a667477 = "wtf" (followed by newline)

All it took was "yes wtf > wtf" during the x11perf run, which writes lines of 'wtf' continuously to a file.

I guess the graphics card is getting data being sent to the hard drive. That's why dmesg fragments wound up in the register: that's something that gets logged.
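The byte-order relationship between the register value and the string can be checked with a short sketch. The helper names are hypothetical; the only assumption is that the GPU fetched the dword from little-endian memory.

```python
import struct


def ipehr_to_bytes(value: int) -> bytes:
    """Lay out a 32-bit register value as it would sit in little-endian memory."""
    return struct.pack("<I", value)


def bytes_to_dword(data: bytes) -> int:
    """Interpret the first four bytes of a buffer as one little-endian dword."""
    return struct.unpack_from("<I", data)[0]
```

With these, `ipehr_to_bytes(0x0a667477)` gives `b"wtf\n"` and `ipehr_to_bytes(0x45494c43)` gives `b"CLIE"`. In the other direction, the output of `yes wtf` repeats every four bytes, so any dword-aligned read of that file data yields 0x0a667477.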
Comment 11 Chris Wilson 2010-02-19 04:15:41 UTC
"x11perf -range srect10,srect500 -time 1 -repeat 1" reliably triggers the freeze here.
Comment 12 Chris Wilson 2010-02-24 07:42:59 UTC
Finally found the missing module and tracked down the broken patch that was preventing my branch booting on my i845... And now I cannot reproduce the freeze with "x11perf -range srect10,srect500 -time 1 -repeat 1".
Comment 13 Chris Wilson 2010-02-25 03:25:07 UTC
*cries*

IPEHR: 0x0a667477
...
ACTHD: 0x00b06008
seqno: 0x00000080
Buffers [1]:
  00b06000    16384 00000009 00000000 00000081
batchbuffer at 0x00b06000:
0x00b06000:      0x02000011: MI_FLUSH
0x00b06004:      0x05000000: MI_BATCH_BUFFER_END
0x00b06008: HEAD 0x00000000:    

And that is after applying the big hammer of wbinvd on every batch.
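What makes this dump so frustrating can be spelled out with a little arithmetic on the values it contains. In the sketch below the helper names are hypothetical, and reading the first two columns of the "Buffers" line as GTT offset and size is an assumption.

```python
MI_BATCH_BUFFER_END = 0x05000000  # opcode as decoded in the dump above


def head_position(acthd: int, batch_base: int, batch_size: int):
    """Return ACTHD's dword index within the batch, or None if it is outside."""
    offset = acthd - batch_base
    if 0 <= offset < batch_size:
        return offset // 4
    return None


def dwords_past_end(batch: list, head_index: int) -> int:
    """How many dwords the head sits beyond the first MI_BATCH_BUFFER_END.

    Raises ValueError if the batch contains no MI_BATCH_BUFFER_END.
    """
    return head_index - batch.index(MI_BATCH_BUFFER_END)


# Values taken from the dump in this comment.
batch = [0x02000011, 0x05000000, 0x00000000]
head = head_position(0x00B06008, 0x00B06000, 16384)
```

Here `head` is 2, one dword past MI_BATCH_BUFFER_END, and the IPEHR value 0x0a667477 does not occur anywhere in the dumped batch. The memory the CPU reads back is a trivial flush-and-end batch, yet the GPU evidently fetched something else.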
Comment 14 Chris Wilson 2010-02-26 08:48:22 UTC
First successful workaround:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f1fcc97..0dcf761 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3530,6 +3530,9 @@ i915_dispatch_gem_execbuffer(struct drm_device *dev,
        if (exec->flags & I915_GEM_NO_DISPATCH)
                return 0;
 
+       msleep(10);

---

*screams*

So it appears that the memory barriers are having little effect, and the uncached write to the ringbuffer by the CPU and the subsequent read of the batch buffer by the GPU are occurring before main memory has been flushed.

Extra memory barriers, invalidations, and writes from the GPU to memory were all insufficient.
Comment 15 Chris Wilson 2010-02-26 09:25:04 UTC
Created attachment 33593
Wait after memory barriers for the system memory to update

I'm not going to send this upstream until I have at least a Tested-by!
Comment 16 Brian Rogers 2010-02-28 05:31:29 UTC
That appears to stabilize it; however, I've noticed at least one scenario where it causes severe slowdown. When wine is in charge of drawing its own windows because the "emulate a virtual desktop" option is turned on, the gradient on the window's title bar takes a good couple of seconds to draw. I can see it being filled in left to right.
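A slowdown of this size is what the arithmetic predicts. The sketch below assumes one msleep(10) per batch dispatch, as the hunk in comment 14 suggests, and treats 10 ms as a lower bound since msleep may sleep longer. The function names and the figure of 200 batches are hypothetical.

```python
def max_dispatches_per_second(sleep_ms: float) -> float:
    """Upper bound on batch dispatches per second with a fixed sleep per batch."""
    return 1000.0 / sleep_ms


def min_seconds_for(batches: int, sleep_ms: float) -> float:
    """Lower bound on the wall time needed to dispatch a number of batches."""
    return batches * sleep_ms / 1000.0
```

With a 10 ms sleep the driver can dispatch at most 100 batches per second, so a gradient that happened to be drawn as 200 separate batches would need at least 2 seconds.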

Also, I did manage to cause a freeze in one scenario. In my normal test, I run Xorg, then xclock, then I run x11perf over and over. I can't get that test to freeze with this patch. But if I skip xclock, then Xorg resets the display after every run of x11perf because the last client has disconnected, and in this case I caused a GPU hang once. I'll do some more testing and gather data on this. Maybe it's just a different bug.

I put a kernel with this patch in an Ubuntu PPA, and I'm going to get some feedback from users experiencing random hangs. Let's see if this patch affects that issue...
Comment 17 Chris Wilson 2010-03-01 08:03:40 UTC
So I had the bright idea of using a GTT mapping to avoid the chipset flushing (and associated delay) which restores performance... and the original problem. Conclusion: not even GTT mappings are coherent.
Comment 18 Chris Wilson 2010-03-02 07:57:03 UTC
*** Bug 21826 has been marked as a duplicate of this bug. ***
Comment 19 Chris Wilson 2010-03-02 08:06:20 UTC
*** Bug 24789 has been marked as a duplicate of this bug. ***
Comment 20 Chris Wilson 2010-03-02 08:09:54 UTC
Retitling to better reflect the underlying issue.
Comment 21 Chris Wilson 2010-03-02 08:15:38 UTC
*** Bug 22771 has been marked as a duplicate of this bug. ***
Comment 22 Chris Wilson 2010-03-02 08:21:23 UTC
*** Bug 26200 has been marked as a duplicate of this bug. ***
Comment 23 Chris Wilson 2010-03-02 08:27:25 UTC
*** Bug 26580 has been marked as a duplicate of this bug. ***
Comment 24 Chris Wilson 2010-03-02 08:31:44 UTC
*** Bug 24137 has been marked as a duplicate of this bug. ***
Comment 25 Chris Wilson 2010-03-02 08:41:09 UTC
*** Bug 26746 has been marked as a duplicate of this bug. ***
Comment 26 legolas558 2010-03-02 10:02:54 UTC
This bug does not duplicate bug 24789, as I can't even boot to Xorg for long enough to run the testcase; furthermore, if I use the suggested msleep(10) patch I get crashes almost instantly.

So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for me, and this cannot be a dupe of 24789; it might be the other way round instead.

Please test these patches and say whether the bug is (almost) fixed for you:

- http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
jbarnes
- http://bugzilla.kernel.org/attachment.cgi?id=25084 -
drm-intel-big-hammer.patch from FC13 kernel patches

I said almost because the bug can still be triggered under heavy CPU/GPU load, like when watching a video, but the system is definitely usable (I used it to post these comments).
Comment 27 Chris Wilson 2010-03-02 10:28:37 UTC
(In reply to comment #26)
> this bug does not duplicate bug 24789 as I can't even boot to Xorg in enough
> time to run the testcase; furthermore, if I use the suggest msleep(10) patch I
> get crashes almost instantly.

Sounds like plymouth is doing something just as funky to write to the framebuffer. Looks like a GTT map [ http://cgit.freedesktop.org/plymouth/tree/src/plugins/renderers/drm/ply-renderer-i915-driver.c ], which as pointed out earlier also suffers from exactly the same coherency issues, but is not covered by the AGP chipset flush.

In short, you have the same bug and the wbinvd() just happens to cause sufficient delay on all batchbuffers that it happens to work most of the time.
 
> So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for
> me, and this cannot be a dupe of 24789; it might be the other way round
> instead.
> 
> Please test these patches and say if bug is (almost) fixed for you:
> 
> - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> jbarnes
Upstream, no effect (obviously).

> - http://bugzilla.kernel.org/attachment.cgi?id=25084 -
> drm-intel-big-hammer.patch from FC13 kernel patches

As mentioned much earlier, no effect.
Comment 28 Chris Wilson 2010-03-02 10:29:28 UTC
*** Bug 24789 has been marked as a duplicate of this bug. ***
Comment 29 legolas558 2010-03-02 13:14:57 UTC
Thanks for testing those patches, Chris. So the conclusion is that the bugs are different, not duplicates.

I also believe that a patch for this bug could fix bug 24789 too, but that's mere supposition; we have no evidence until such a patch exists. Until then we have two different bugs triggered in two different ways.
Comment 30 legolas558 2010-03-02 13:20:47 UTC
(In reply to comment #27)
> (In reply to comment #26)
> Upstream, no effect (obviously).
> 
> > - http://bugzilla.kernel.org/attachment.cgi?id=25084 -
> > drm-intel-big-hammer.patch from FC13 kernel patches
> 
> As mentioned much earlier, no effect.
> 
Are you saying that this patch does not fix the bug for you?
So we have your hardware, on which the patch in attachment 33593 works, and my hardware (855GM rev02), on which it doesn't. Correct?

I really don't see why the bugs should be merged considering different triggering conditions and different workaround patches
Comment 31 Chris Wilson 2010-03-02 13:37:06 UTC
(In reply to comment #30)
> Are you saying that this patch does not fix the bug for you?
> So we have your hardware with patch in attachment 33593 which works, while my
> hardware (855GM rev02) which doesn't. Correct?
> 
> I really don't see why the bugs should be merged considering different
> triggering conditions and different workaround patches

The two bugs in question are both due to the GPU executing the command stream prior to GMCH completing its write, thus hanging on illegal instructions that do not match the batch buffer dumped.

The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents and purposes, this simply adds a delay since the caches are flushed later anyway. However as is demonstrated by your own statements, and I confirm, this is insufficient to ensure that all writes are completed prior to the GPU performing its DMA to main memory. The reason why the msleep() hack does not solve everything is that it is limited to the AGP chipset flush which is only performed on invalidating the CPU domain. The truly astonishing thing about this bug is that the GTT domain appears to be similarly affected. Hence why the wbinvd() patch appears to be more successful in some scenarios than the msleep(), but is still fundamentally flawed.

Comment 32 legolas558 2010-03-02 14:08:39 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > Are you saying that this patch does not fix the bug for you?
> > So we have your hardware with patch in attachment 33593 which works, while my
> > hardware (855GM rev02) which doesn't. Correct?
> > 
> > I really don't see why the bugs should be merged considering different
> > triggering conditions and different workaround patches
> 
> The two bugs in question are both due to the GPU executing the command stream
> prior to GMCH completing its write, thus hanging on illegal instructions that
> do not match the batch buffer dumped.
> 
[offtopic]I know that this ends up with something broken being identified in the hardware/firmware, I am just waiting for somebody to clearly say it...but still believing that some magic (read hack) can eventually fix this up[/offtopic]

> The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush
> all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents
> and purposes, this simply adds a delay since the caches are flushed later
> anyway. However as is demonstrated by your own statements, and I confirm, this
> is insufficient to ensure that all writes are completed prior to the GPU
> performing its DMA to main memory. The reason why the msleep() hack does not
> solve everything is that it is limited to the AGP chipset flush which is only
> performed on invalidating the CPU domain. The truly astonishing thing about
> this bug is that the GTT domain appears to be similarly affected. Hence why the
> wbinvd() patch appears to be more successful in some scenarios than the
> msleep(), but is still fundamentally flawed.
> 
I fully agree with your statements, taking as true those I have no knowledge of. In my testing I have found a qualitative difference between the two patches: the msleep() workaround works a lot worse for me and is rarely distinguishable from the vanilla kernel, while the wbinvd() approach "makes it usable", although I have to use it in the certainty that it will crash sooner or later. So right now I have a quick testcase that invalidates the msleep() patch, while the wbinvd() patch works for longer and is not tied to a magic number (10) that possibly depends on the load of my machine.

If you want I can calibrate that magic number for my box, but that would just be experimentation without usable feedback.

Also, a user (M.Nowak) said that FC13 is totally free of this bug. If that were true, FC13 would have some other interesting patch to look at (which I haven't been able to identify so far); otherwise the bug is just harder to trigger with FC13's patched kernel (this is what I believe), which tells us nothing new.
Comment 33 legolas558 2010-03-02 14:11:40 UTC
(In reply to comment #32)
> Also there was an user (M.Nowak) saying that FC13 is totally free of this bug;
> if this fact was true, then FC13 has some other interesting patch to look at
> (which I couldn't identify up to now), otherwise it's just harder for the bug
> to be triggered with FC13's patched kernel (I believe this), bringing no news
> to us.
> 

Forgot to add: for me FC13 is as broken as any other kernel, so I asked M.Nowak to test a vanilla kernel + drm-intel-big-hammer.patch, but he has not yet provided results.

Some other test results from people with this hardware would be very welcome.
Comment 34 Daniel Vetter 2010-03-03 06:13:27 UTC
Chris, I've thought a bit about what failure-mode could possibly explain all these different corruptions. Could it be that the GTT _table_ contains stale entries? Yes I know, this sounds crazy but I haven't yet found another failure mode that could nicely explain what's going on ...

For the gtt corruption case:
- map new bo into gtt
- start writing
- new gtt mappings become effective
- further writes

gtt cpu writes are wc, i.e. the cpu can send out only 4 byte sized writes, agp doesn't cache them. This would explain a single "wtf " in the command buffer (stale data that's been in the page that's just been assigned to this new bo).

For the non-gtt write
- write stuff to mem
- map bo into gtt
- gpu starts using them
- new gtt mappings become effective
- gpu reads crap

Of course, in the case of gtt writes, the write should end up somewhere else in system memory. But we're mapping a dummy page for all empty gtt entries, right? So it's quite likely they end up in there. If this dummy page never gets corrupted, I'm obviously wrong.

<crazymode />

I can't test this theory though, because I can't reproduce the bug on my i855.

btw, the reason I came up with this: Just yesterday I experienced a strangely corrupted pixmap on my i855GM: 8 pixels high (TILE_X height) and about half the screen wide (around 512 pixels, i.e. 4 pages of TILE_X tiles, as wide as the pixmap) with nice colorful garbage. This got me thinking that maybe we're dealing with corruptions in GTT_PAGE_SIZE quantities. This was after about an hour of hitting the box with x11perf and filling the disk with wtfs, and it was also the first time I've ever seen something like this.
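The "non-gtt write" failure mode described in this comment can be illustrated with a toy model. This is not how the hardware works internally; it is a minimal sketch, with hypothetical names, of a page table whose cached translations are not invalidated when an entry is rewritten.

```python
class ToyGtt:
    """Toy model: a page table plus a translation cache that is never invalidated."""

    def __init__(self, pages: dict):
        self.pages = pages   # physical page number -> contents
        self.table = {}      # GTT slot -> physical page number
        self.cached = {}     # translations the "GPU" has already fetched

    def bind(self, slot: int, page: int) -> None:
        """Point a GTT slot at a page; deliberately leaves the cache alone."""
        self.table[slot] = page

    def gpu_read(self, slot: int) -> bytes:
        """Read through the cached translation if one exists, else fetch it."""
        if slot not in self.cached:
            self.cached[slot] = self.table[slot]
        return self.pages[self.cached[slot]]

    def flush(self) -> None:
        """Drop every cached translation so the next read refetches the table."""
        self.cached.clear()


# Page 1 holds file data on its way to disk; page 2 holds a fresh batch (MI_FLUSH).
gtt = ToyGtt({1: b"wtf\n", 2: b"\x11\x00\x00\x02"})
gtt.bind(0, 1)
first = gtt.gpu_read(0)      # translation for slot 0 is now cached
gtt.bind(0, 2)               # slot rebound to the batch page
stale = gtt.gpu_read(0)      # still served through the old translation
gtt.flush()
fresh = gtt.gpu_read(0)      # refetched: the batch contents
```

In this model `stale` is still `b"wtf\n"` even though the table already points at the batch page, which is the kind of mismatch between the dumped batch and IPEHR seen earlier in the thread.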
Comment 35 Chris Wilson 2010-03-03 06:44:52 UTC
Yes, I had thought it possible that this could be a missing flush after updating the PTEs. I added a few more flushes along those paths, just in case. (Though that does not conclusively rule that out.)
Comment 36 Bruno 2010-03-03 14:08:45 UTC
(In reply to comment #34)
> I can't test this theory thoug, because I can't reproduce the bug on my i855.

Daniel, do you have a patch that would enable testing your theory?

I could apply it and see if my system keeps the GPU alive for more than a few hours (it wedges at least once a day, usually more often, and half of the time X exits).

Here it wedges with normal desktop usage (Enlightenment + GTK applications)
Comment 37 Daniel Vetter 2010-03-04 00:39:48 UTC
Chris, I've tried to prove my theory one way or another by massively increasing the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list with the following patch:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 50244fc..197260b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1822,6 +1822,8 @@ i915_gem_retire_request(struct drm_device *dev,
                        drm_gem_object_reference(obj);
                        i915_gem_object_move_to_inactive(obj);
                        spin_unlock(&dev_priv->mm.active_list_lock);
+                       i915_gem_object_unbind(obj);
                        drm_gem_object_unreference(obj);
                        spin_lock(&dev_priv->mm.active_list_lock);
                }

Whereas before I couldn't hang my i855GM with about half an hour of x11perf plus yes wtf > /tmp/wtf, it now hangs within one minute after login. Moving around windows, switching desktops is sufficient.

I'll add the i915_error_state dump shortly.
Comment 38 Daniel Vetter 2010-03-04 00:59:32 UTC
Created attachment 33750
i915_error_state from my i855

I'm not really fluent in reading these dumps, but the block of zeros right before ACTHD looks very fishy: 64 bytes in length and nicely size-aligned, i.e. a cache-line gone wrong.

The hang occurred right after a fresh bootup, i.e. the memory has been cleared by the power reset. That might explain the zeros instead of some other random crap.
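Looking for this pattern by eye in a 750 KB dump is tedious, so a scan for cache-line-sized, cache-line-aligned blocks of zeros may help. The sketch below assumes the batch has already been parsed into a base address and a list of dwords, that the base is 4-byte aligned, and that a line is 64 bytes; the function name is hypothetical.

```python
def aligned_zero_runs(base: int, dwords: list, line_bytes: int = 64) -> list:
    """Return start addresses of aligned, line-sized blocks that are entirely zero."""
    per_line = line_bytes // 4
    # Index of the first dword that sits on a line boundary.
    first = (-base % line_bytes) // 4
    hits = []
    for start in range(first, len(dwords) - per_line + 1, per_line):
        if all(d == 0 for d in dwords[start:start + per_line]):
            hits.append(base + 4 * start)
    return hits
```

A run of sixteen zero dwords that straddles a line boundary is not reported, which helps separate a lost cache line from ordinary zero padding in a batch.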
Comment 39 Chris Wilson 2010-03-04 01:03:40 UTC
(In reply to comment #37)
> Chris, I've tried to prove my theory one way or another by massively increasing
> the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list
> with the following patch:

Sadly that also forces the domain change to CPU, so stresses the CPU flushing paths as well; not quite clear cut.

The code to poke around in is drivers/char/agp/intel-agp.c. In particular, intel_i8xx_tlbflush(). But there do seem to be some other odd discrepancies between gen2 and later.
Comment 40 legolas558 2010-03-04 01:58:29 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > jbarnes
> Upstream, no effect (obviously).
> 
Where is this upstream? I am using linus' tree and patch is not there; I need this patch for a different bug: without it the screen is OFF
Comment 41 Chris Wilson 2010-03-04 02:05:50 UTC
(In reply to comment #40)
> (In reply to comment #27)
> > (In reply to comment #26)
> > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > jbarnes
> > Upstream, no effect (obviously).
> > 
> Where is this upstream? I am using linus' tree and patch is not there; I need
> this patch for a different bug: without it the screen is OFF

Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a couple of weeks now - surprisingly since it is marked for stable.
Comment 42 legolas558 2010-03-04 02:54:22 UTC
(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #27)
> > > (In reply to comment #26)
> > > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > > jbarnes
> > > Upstream, no effect (obviously).
> > > 
> > Where is this upstream? I am using linus' tree and patch is not there; I need
> > this patch for a different bug: without it the screen is OFF
> 
> Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a
> couple of weeks now - surprisingly since it is marked for stable.
> 

Yes because on these laptops it's really necessary, otherwise the display is not even recognized.

Perhaps I should also use drm-intel-next kernel for these tests?
I am also available to try other patches and run other tests; apparently my i855GM (rev 02) is very quick at crashing. Unfortunately I am not following the reasoning of you both very well, but there seems to be some light.

The worst thing is that the bug shows up in different ways and development efforts are seemingly scattered around (all distros' bugtrackers are filled up with reports about i8xx devices).
Comment 43 Daniel Vetter 2010-03-04 03:18:17 UTC
> --- Comment #39 from Chris Wilson <chris@chris-wilson.co.uk>  2010-03-04 01:03:40 PST ---
> (In reply to comment #37)
> > Chris, I've tried to prove my theory one way or another by massively increasing
> > the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list
> > with the following patch:
> 
> Sadly that also forces the domain change to CPU, so stresses the CPU flushing
> paths as well; not quite clear cut.

Of course, you're right, this test only strongly points at a problem in
either object_unbind and/or object_bind. But I think we can rule out cpu
flushing with a high probability:

- We only clflush on moving away from the cpu domain.
- Most access is via gtt, no intermediate cpu access. So the clflush
  should be a no-op (for the hw).

> The code to poke around in is drivers/char/agp/intel-agp.c. In particular,
> intel_i8xx_tlbflush(). But there does some to be some other odd discrepancies
> between gen2 and later.

My next step is to check whether these gtt writes end up someplace else
(i.e. most likely on the agp scratch page). If they do, we have mixed up
gtt entries somewhere; if they don't, the problem is definitely somewhere
else.
Comment 44 Chris Wilson 2010-03-04 05:10:27 UTC
Created attachment 33751
Flush the GTT by disabling/enabling it.

The Broadwater errata notes that PTE entries that have been prefetched are not correctly invalidated when the GTT is updated. It goes on to note that in these situations: (1) don't do that, (2) flush the GTT - but helpfully forgets to mention how to actually enact the flush.

Instead, zap the GTT by disabling it and re-enabling it after every update. Fortunately, this does not appear to impact on throughput too much.
Comment 45 legolas558 2010-03-04 05:58:49 UTC
Created attachment 33752
Xorg 1.7.5 log with patched kernel

(In reply to comment #44)
> Created an attachment (id=33751)
> Flush the GTT by disabling/enabling it.
> 
Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as always, however I noted no font glitches (apparently) and only a minor glitch (a couple of lines missing) on a desktop icon.

Xorg lasted less than 2 minutes, and starting firefox most probably nuked it
Comment 46 Chris Wilson 2010-03-04 06:09:23 UTC
(In reply to comment #45)
> Xorg lasted less than 2 minutes, and starting firefox most probably nuked it


*sigh* it was doing so well here, surviving the x11perf test using both CPU and GTT mappings. Did you manage to grab an i915_error_state, so that we can see what manner of corruption remains?
Comment 47 Geir Ove Myhr 2010-03-04 06:14:59 UTC
(In reply to comment #45)
> Created an attachment (id=33752) [details]
> Xorg 1.7.5 log with patched kernel
>
> Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as
> always, however I noted no font glitches (apparently) and only a minor glitch
> (a couple of lines missing) on a desktop icon.

With the patch on top of the latest intel-drm-next kernel you can grab <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful for Chris and Daniel. http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git
Comment 48 2points 2010-03-04 06:25:00 UTC
Created attachment 33756 [details]
i915_error_state with GTT enable/disable patch

Can confirm said crash appearing reliably a few seconds into X. My chipset is also 82852/855GM (rev 02), so this should probably be relevant to the bug at hand.
Comment 49 legolas558 2010-03-04 06:28:19 UTC
(In reply to comment #47)
> (In reply to comment #45)
> > Created an attachment (id=33752) [details] [details]
> > Xorg 1.7.5 log with patched kernel
> >
> > Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as
> > always, however I noted no font glitches (apparently) and only a minor glitch
> > (a couple of lines missing) on a desktop icon.
> 
> With the patch on top of the latest intel-drm-next kernel you can grab
> <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful
> for Chris and Daniel.
> http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git
> 

It will take about 15 minutes to git-clone it, I had deleted it some days ago out of frustration; next I'll grab i915_error_state after the crash and attach it here.

I have the same hardware as 2points so you can already look at his attachment 33756 [details]; however, we'll check later that both dumps show the same bug, as a double-check.
Comment 50 Chris Wilson 2010-03-04 06:40:45 UTC
As usual, something is strange in that dump. The strange part is that it looks perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The odd part is that the last loaded instruction (IPEHR) corresponds to several instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently grabbing the next QWord from) - presuming what the CPU read back is consistent with what is being read by the GPU. Hmm.
Comment 51 legolas558 2010-03-04 07:10:55 UTC
(In reply to comment #50)
> As usual, something is strange in that dump. The strange part is that it looks
> perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The
> odd part is that the last loaded instruction (IPEHR) corresponds to several
> instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently
> grabbing the next QWord from) - presuming what the CPU read back is consistent
> with what is being read by the GPU. Hmm.
> 

Maybe this is another bug (evil twin bug 24789)? That needs another patch?

My laptop has one of the early Intel Centrino (single-core of course) CPUs (1.6 GHz), with the wicked i8042 controller, but I wouldn't infer that it is on the CPU side either...
Comment 52 legolas558 2010-03-04 07:58:03 UTC
Created attachment 33760 [details]
dri debugfs dumps for i855GM + Xorg.0.log

OK, I built drm-intel with Chris' patch and rebooted; first time I forgot "nomodeset" active and it *detonated* back to boot screen (it must be able to successfully trigger some kernel/CPU/BIOS failsafe reboot, some way), this must be a unique feature of most recent 2.6.33 kernels.

Now back on topic: I successfully started Xorg, and it looked great (as if everything was fixed, but more probably I didn't have time to find any glitch), however when I opened some directory windows (XFCE) it crashed. The mouse cursor was changing when I tried to drag a window, so the underlying system was still breathing (it always happens with bug 24789). I popped up the VT where Xorg was started and hit Ctrl+C, since the crash had already lasted 3-4 seconds and the characteristic I/O error lines had already been printed.

In the attachment you can find /sys/kernel/debugfs/dri/{1,64}; does anybody know if it is normal to have entries 1 and 64? With the old intel driver I have only dri/1
Comment 53 legolas558 2010-03-05 02:48:51 UTC
I have uploaded the compiled drm-intel (kernel, initramfs and modules) with Chris' patch:

http://www.iragan.com/linux/i855GM/

you can find it under the *kernel directory.

This was built with my .config so might not work properly on all laptops, but surely should boot
Comment 54 legolas558 2010-03-05 02:50:39 UTC
@scottandchrystie: please also follow this bug since they are tied (mutually exclusive but probably caused by same hardware glitch). I have compiled and uploaded the drm-intel-next kernel already
Comment 55 Scott Hansen 2010-03-06 13:43:37 UTC
@legolas558 - I booted the kernel you provided, and it booted, but no better than the one I compiled myself with the lid and big hammer patches. Still froze after several minutes of flipping between VT's and running graphics-intensive programs (inkscape, tuxpaint, etc).

My hardware is:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

I'm afraid I can't follow the conversation here at this level, but I'm willing to compile and test kernels on my hardware :) I'm looking up now how to collect the info you need with the intel-gpu-tools. Let me know what I can do to help.

Scott
Comment 56 legolas558 2010-03-07 03:34:46 UTC
(In reply to comment #55)
> @legolas558 - I booted the kernel you provided, and it booted, but no better
> than the one I compiled myself with the lid and big hammer patches. Still froze
> after several minutes of flipping between VT's and running graphics-intensive
> programs (inkscape, tuxpaint, etc).

That was drm-intel-next with Chris' patch; in my case it doesn't even last 1 minute, while the kernel patched with the big hammer lasts for longer.

> My hardware is:
> 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE
> Chipset Integrated Graphics Device (rev 01)
> 

Mine is:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)

So we can infer that Chris' patch works (a bit) with rev01 but not with rev02.

Thanks for taking time to test this.
Comment 57 Scott Hansen 2010-03-07 10:15:22 UTC
Created attachment 33835 [details]
Debug logs from unpatched drm-intel-next kernel freeze

I compiled the drm-intel-next kernel from git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git (I hope that's correct for drm-intel-next). I didn't use any patches initially for testing. Also, I would have been unable to apply Chris Wilson's msleep patch from comment #14, as the code has been changed and I couldn't see where it should go. The kernel booted and ran fine for about 15 minutes under load from running XFCE with movie trailers, switching VT's and using tuxpaint, all of which are typically what will crash it the quickest.

I've enclosed dmesg output as well as the contents of /sys/kernel/debug/dri/0 -- hopefully that's all you need for now. I didn't patch with the big-hammer patch because the last kernels I tried with that patch didn't seem to improve things much, but I can do that and provide the logs if it would help.

Let me know what I can test next.

Thanks!
Scott
Comment 58 Brian Rogers 2010-03-13 02:13:06 UTC
Created attachment 34016 [details] [review]
Hack that prevents freezing

At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is this command being sent out properly?

Obviously, this patch causes fairly garbled graphics.
Comment 59 Brian Rogers 2010-03-13 14:56:45 UTC
Still stable after 9 hours of "x11perf -range dot,comppixwin500 -time 1 -repeat 1".

It seems i830_uxa_solid plays a role in the freezing.
Comment 60 legolas558 2010-03-15 05:29:56 UTC
(In reply to comment #58)
> At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a
> patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is
> this command being sent out properly?
> 
> Obviously, this patch causes fairly garbled graphics.
> 

I am using a freedesktop git development stack, created with this script:

http://bit.ly/b2sJVO

I have applied your patch before building xf86-video-intel and recompiled my drm-intel kernel to be modular (CONFIG_DRM=m,CONFIG_DRM_i915=m and also CONFIG_FB=m) so that the freedesktop compiled modules can be used instead.

However, I can't boot with drm being modular! It simply shows a black screen (turned OFF) and there's no way to put this into a working console... (I can login and send commands but I am blind without a screen)

Does anybody have any hint? I have never been able to boot with KMS enabled and drm modular; the mkinitcpio configuration facilities (for customization of the initramfs) of this Arch Linux box do not seem to do anything.
Comment 61 Daniel Vetter 2010-03-18 05:32:46 UTC
Created attachment 34194 [details] [review]
(hopefullyy) fix gtt cache coherency

This patch seems to fix any gtt related cache coherency problems, at least for my i855GM.

It's quite large, but that's just due to me having first needed to clean up intel-agp.c before seeing clear what's going on. This patch is also not yet polished, so don't look at it and expect beauty ;)

It contains a totally paranoid cache coherency checker with the absolutely minimal set of memory barriers and cache flush before/after the chipset flush. If it detects any inconsistency, it prints a warning + backtrace in the dmesg. Ratelimited to one warning per direction per minute. Don't use this cache coherency checker for unpatched kernels. On my i855GM, cpu->gtt transfers fail with > 1% chance, gtt->cpu transfers fail with > 50% chance on unpatched kernels. So it'll only spam your dmesg.

This patch also includes the unbind-inactive-objects patch to really thrash the gtt stuff. It also trashes performance, so expect a sluggish feel when testing this.

Patch also prints out the number of completed chipset flushes in regular intervals. If you test this, wait until at least 1 million chipset flushes have been done (or a chipset flush failed) before declaring that it works. Even better is 10 million. On my i855 that's about two hours of glxgears & openarena.

Suspend/resume only lightly tested. It might break the cache coherency checker (but should not).

In case this patch doesn't work and you get backtraces about failed flushes, please attach your full dmesg. If it works, please report on what hw (lspci -nn) and how many chipset flushes (more than 10M would be great) have been done.

Patch is against -rc1 but should apply to latest drm-intel, too. Don't use any other patches when testing this.

Thanks, Daniel
Comment 62 Chris Wilson 2010-03-18 13:41:24 UTC
*** Bug 23032 has been marked as a duplicate of this bug. ***
Comment 63 2points 2010-03-18 14:58:51 UTC
Created attachment 34220 [details]
dmesg with gtt cache coherency patch

Nearly got to one million flushes before X froze. Noticed a few warnings about failed flushes before, but apparently to no critical effect (yet).
Comment 64 Daniel Vetter 2010-03-18 15:20:40 UTC
> --- Comment #63 from 2points@gmx.org  2010-03-18 14:58:51 PST ---
> Created an attachment (id=34220)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=34220)
> dmesg with gtt cache coherency patch
> 
> Nearly got to one million flushes before X froze. Noticed a few warnings about
> failed flushes before, but apparently to no critical effect (yet).

Thanks for testing this. Looks like it doesn't work as advertised, given
that you have a i855GM, too. I need to get back to the drawing board.
Comment 65 Geir Ove Myhr 2010-03-18 15:38:34 UTC
> Thanks for testing this. Looks like it doesn't work as advertised, given
> that you have a i855GM, too. I need to get back to the drawing board.

Would you still like the patch to be tested on other hardware, or should we consider it obsolete now?
Comment 66 Daniel Vetter 2010-03-18 16:06:51 UTC
> --- Comment #65 from Geir Ove Myhr <gomyhr@gmail.com>  2010-03-18 15:38:34 PST ---
> > Thanks for testing this. Looks like it doesn't work as advertised, given
> > that you have a i855GM, too. I need to get back to the drawing board.
> 
> Would you still like the patch to be tested on other hardware, or should we
> consider it obsolete now?

Well, Chris tested it on his i845 and it had no effect there. If you have
something else around and some time to waste, testing wouldn't hurt.
Perhaps my cache coherency checker uncovers some other stuff. Anyway, I
have the feeling that the i845 and the i855GM bugs are two different
things, so I've created a new bug to keep track of my crusade to fix the
i855 here:

bug # 27187

So if you test, please report your findings there.
Comment 67 legolas558 2010-03-19 04:05:33 UTC
Created attachment 34233 [details]
section of dmesg with GTT flush failures

(In reply to comment #61)
> Created an attachment (id=34194) [details]
> (hopefullyy) fix gtt cache coherency
> 
> This patch seems to fix any gtt related cache coherency problems, at least for
> my i855GM.
> 
Hi Daniel, I finally took time to test this.

I also have your hardware and with this patch I finally can use Xorg 1.7.5 and the modern intel driver!

So this patch obsoletes the DRM big hammer patch that I was previously using and that gave me about 5 minutes of working Xorg.

With your patch Xorg can be used for long time (more than 1 hour now and not crashed yet), but I have seen GTT flush failures in dmesg, please see attachment.

Did you see that? Flush failures at flush number 16384 and 32768! Am I just lucky or is there a reason for such numbers being powers of two?

So I will be using your patch since even with some failures it doesn't crash Xorg as the linus/drm-intel trees do; I'd even propose it for submission, because it makes the hardware usable!

Thanks
Comment 68 legolas558 2010-03-19 04:38:05 UTC
It finally crashed when playing a video, but I think this is a totally separate bug, perhaps related to the hangcheck timer; by the way, can somebody check if bug 26723 is duplicate of bug 24789 or of bug 26345 (this one)?
I am asking because with this patch I get:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

after the crash triggered by playing a video.

Also I can see a pattern in the failed flushes:

~$ dmesg|grep -F "flush no"
[   30.890838] chipset flush no. 0
[  245.758686] chipset flush no. 16384
[  403.685617] chipset flush no. 32768
[  594.005756] chipset flush no. 49152

Do we have a crazy carry bit somewhere?
Comment 69 Daniel Vetter 2010-03-19 05:14:54 UTC
> --- Comment #68 from legolas558 <legolas558@email.it>  2010-03-19 04:38:05 PST ---
> Also I can see a pattern in the failed flushes:
> 
> ~$ dmesg|grep -F "flush no"
> [   30.890838] chipset flush no. 0
> [  245.758686] chipset flush no. 16384
> [  403.685617] chipset flush no. 32768
> [  594.005756] chipset flush no. 49152
> 
> Do we have a crazy carry bit somewhere?

Nothing crazy is going on. This just prints out the number of chipset
flushes done every 16*1024 flushes. This is just to know how reliably the
thing works. The cache coherency problems caught by my checker print out
"chipset flushed failed". So you have to count these (plus add in the ones
suppressed by the ratelimiting code, watch out for "xx callbacks suppressed"
in your dmesg). Then divide them by the number of flushes (as printed in
your dmesg snippet above) and you have a ballpark figure for how reliable
the chipset flushing is. Obviously anything bigger than zero is
unacceptable.
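
The counting described above can be sketched in a few lines of Python. This is only an illustration and not the script attached later in this bug: the function name flush_failure_ratio is made up, and the message strings ("chipset flushed failed", "callbacks suppressed") are assumptions taken from the wording of this comment, so adjust them to whatever your dmesg really prints.

```python
import re

# Patterns are assumptions based on the wording in this comment;
# adjust them to match what the patched kernel actually prints.
FAILED = re.compile(r"chipset flush(ed)? failed")
SUPPRESSED = re.compile(r"(\d+) callbacks suppressed")
COUNTER = re.compile(r"chipset flush no\. (\d+)")


def flush_failure_ratio(dmesg_text):
    """Return (failures, flushes) for a dmesg dump passed as one string."""
    failures = 0
    flushes = 0
    for line in dmesg_text.splitlines():
        if FAILED.search(line):
            failures += 1
        m = SUPPRESSED.search(line)
        if m:
            # Warnings dropped by the ratelimiter still count as failures.
            # Caveat: other ratelimited messages print this line as well.
            failures += int(m.group(1))
        m = COUNTER.search(line)
        if m:
            # The counter is printed every 16*1024 flushes; keep the latest.
            flushes = max(flushes, int(m.group(1)))
    return failures, flushes
```

Feed it the saved output of dmesg and divide the first number by the second. Since the flush counter is only printed every 16*1024 flushes, the denominator is a lower bound and the ratio is a ballpark figure only.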
Comment 70 legolas558 2010-03-19 05:31:12 UTC
Created attachment 34239 [details]
chipset flushing quality script

(In reply to comment #69)
> > --- Comment #68 from legolas558 <legolas558@email.it>  2010-03-19 04:38:05 PST ---
> > Also I can see a pattern in the failed flushes:
> > 
> > ~$ dmesg|grep -F "flush no"
> > [   30.890838] chipset flush no. 0
> > [  245.758686] chipset flush no. 16384
> > [  403.685617] chipset flush no. 32768
> > [  594.005756] chipset flush no. 49152
> > 
> > Do we have a crazy carry bit somewhere?
> 
> Nothing crazy is going on. This just prints out the number of chipset
> flushes done every 16*1024 flushes. This is just to know how reliably the
> thing works. The cache coherency problems caught by my checker print out
> "chipset flushed failed". So you have to count these (plus add in the ones
> suppressed by the ratelimiting code, watch out for "xx callbacks suppressed"
> in your dmesg). Then divide them by the number of flushes (as printed in
> your dmesg snippet above) and you have a ballpark figure for how reliable
> the chipset flushing is. Obviously anything bigger than zero is
> unacceptable.
> 

Thanks Daniel for explaining this; I was confused by the fact that "chipset flush no." was printed after the crash trace dumps.

By using your formula (attached script) my chipset flushing failure ratio is 38/212992; it seems to grow linearly and does not appear related to the specific software running (possibly dependent on CPU load only).
Comment 71 legolas558 2010-03-19 05:35:31 UTC
Created attachment 34240 [details]
chipset flushing quality script (revised)

Errata: my ratio is 98/229376
Comment 72 Chris Wilson 2010-03-24 12:23:57 UTC
(In reply to comment #58)
> Created an attachment (id=34016) [details]
> Hack that prevents freezing
> 
> At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a
> patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is
> this command being sent out properly?

Gah, missed this. Sorry Brian. Yes, it does seem that we could emit a solid fill that exceeds the surface bounds. Not quite sure what is generating such nonsense, but it will at least be resolved by:

commit 0c47195ca805881e3fbd5b9224be5c930feeeb8c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 24 17:37:39 2010 +0000

    i830: Clip solid fills to surface.
    
    There is a reasonable surfeit of evidence to support this error,
    for instance: http://bugs.freedesktop.org/attachment.cgi?id=34417
Comment 73 legolas558 2010-03-25 04:43:05 UTC
Created attachment 34429 [details]
Chipset flushing quality script for freedesktop bug 26345

fixed script to correctly show failures ratio

now using patch v5, no failures within first 32768 flushes - will report further when a big amount of flushes have been done
Comment 74 legolas558 2010-03-25 10:01:12 UTC
Created attachment 34435 [details]
Script to check chipset flushing quality
Comment 75 legolas558 2010-03-25 10:02:38 UTC
Created attachment 34436 [details]
/sys/kernel/debug/dri after the GTT flush failures
Comment 76 legolas558 2010-03-25 10:12:48 UTC
Created attachment 34437 [details]
dmesg after 7 GTT flush failures

I got 7 failures within the first 300k flushes; I have attached dmesg and debugfs dri dumps.

Failures seem harder to trigger now. I had to open 4 glxgears and one xeyes to trigger them.

I am using latest drm-intel kernel with patch in attachment 34377 [details] [review]
I am using the stock packages from Arch Linux (xorg-server 1.7.5-902, xf86-video-intel 2.10.0, intel-dri 7.7, libdrm-git).

It still crashes with the hangcheck bug when playing videos, but very rarely now.
Comment 77 legolas558 2010-03-26 15:32:38 UTC
Created attachment 34499 [details]
gttqual script to check GTT flushing quality
Comment 78 Scott Hansen 2010-03-28 08:53:40 UTC
Tried Daniel's patch from comment #61 with drm-intel-next. Didn't work on my VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01). In fact it was a little worse than the stock drm-intel-kernel as far as running time until freezing (using dwm -- time running xfce was about the same: mere seconds)

Scott
Comment 79 legolas558 2010-03-28 12:36:04 UTC
@Scott: in my case it lasts some more time, but I have recently experienced sudden crashes...so D.Vetter's patch might be the way to go but Intel has clearly not released a good open source driver for these devices since the beginning (Xorg 1.6 and the old driver (non-KMS) are definitively usable, at least).
Comment 80 Chris Wilson 2010-03-29 03:17:16 UTC
*** Bug 25091 has been marked as a duplicate of this bug. ***
Comment 81 Chris Wilson 2010-03-29 04:03:07 UTC
*** Bug 26229 has been marked as a duplicate of this bug. ***
Comment 82 Chris Wilson 2010-03-29 04:10:01 UTC
*** Bug 26723 has been marked as a duplicate of this bug. ***
Comment 83 stefan 2010-03-30 14:20:49 UTC
hi there,
I'm also bitten by occasional x freezes on a i855GM rev 02 and I am following the respective bug reports.

I think a lot has been tested already, but in my case, I get those freezes *only* in a dual-head setup using LVDS and VGA together. I *never* had a freeze using only the LVDS on my laptop, linux 2.6.33 + libdrm 2.4.18-3 + xserver-xorg-video-intel 2.10.903 (all on debian) seems reasonably stable in this case.

hth, ben
Comment 84 theonewiththeevillook 2010-04-01 03:28:13 UTC
Yesterday (and today again), I got a slightly different error message in 'dmesg' than what I'm used to - only the first three lines are the usual ones (see below). This "new" error does not appear every time.

I'm using a rather new git version of xf86-video-intel and libdrm (I couldn't tell you which commit exactly is installed, but it's no more than 2 weeks old for libdrm, and I think no more than 4 days old for xf86-video-intel).

Maybe this error is just a side-effect of me using incompatible versions of everything (i.e. libdrm and xf86-video-intel are the latest available, while the rest of my X.org install is what is available on Gentoo/Portage), but I hope not.

Here is the error message in dmesg:
------------------------------
kernel: [ 2067.426012] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
kernel: [ 2067.426024] render error detected, EIR: 0x00000000
kernel: [ 2067.426598] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 97808 at 97807)
kernel: [ 2067.426608] [drm:i915_gem_do_execbuffer] *ERROR* Failed to pin buffer 1 of 2, total 4210688 bytes: -5
kernel: [ 2067.426614] [drm:i915_gem_do_execbuffer] *ERROR* 356 objects [4 pinned], 37957632 object bytes [7487488 pinned], 20197376/117571584 gtt bytes
kernel: [ 2067.426741] ------------[ cut here ]------------
kernel: [ 2067.426753] WARNING: at drivers/gpu/drm/i915/i915_gem_tiling.c:490 i915_gem_set_tiling+0x164/0x1b5()
kernel: [ 2067.426756] Hardware name: 830342G
kernel: [ 2067.426758] failed to reset object for tiling switch
kernel: [ 2067.426761] Modules linked in: nls_iso8859_15 nls_cp850 vfat fat nls_utf8 ntfs floppy
kernel: [ 2067.426774] Pid: 4153, comm: X Not tainted 2.6.33-gentoo #4
kernel: [ 2067.426777] Call Trace:
kernel: [ 2067.426787]  [<c101fb13>] warn_slowpath_common+0x60/0x90
kernel: [ 2067.426792]  [<c101fb77>] warn_slowpath_fmt+0x24/0x27
kernel: [ 2067.426796]  [<c118a6c6>] i915_gem_set_tiling+0x164/0x1b5
kernel: [ 2067.426803]  [<c1171e96>] ? drm_ioctl+0x0/0x2b8
kernel: [ 2067.426807]  [<c11720bb>] drm_ioctl+0x225/0x2b8
kernel: [ 2067.426811]  [<c118a562>] ? i915_gem_set_tiling+0x0/0x1b5
kernel: [ 2067.426817]  [<c1049104>] ? generic_file_aio_write+0x7e/0x93
kernel: [ 2067.426823]  [<c10567b3>] ? do_wp_page+0x5a3/0x63e
kernel: [ 2067.426827]  [<c1171e96>] ? drm_ioctl+0x0/0x2b8
kernel: [ 2067.426832]  [<c1071119>] vfs_ioctl+0x19/0x51
kernel: [ 2067.426836]  [<c1071620>] do_vfs_ioctl+0x43a/0x46c
kernel: [ 2067.426841]  [<c10578b1>] ? handle_mm_fault+0x59a/0x619
kernel: [ 2067.426849]  [<c10340d2>] ? ktime_get_ts+0xd0/0xda
kernel: [ 2067.426853]  [<c107167e>] sys_ioctl+0x2c/0x45
kernel: [ 2067.426858]  [<c1002790>] sysenter_do_call+0x12/0x26
kernel: [ 2067.426862] ---[ end trace 445f83ad84043481 ]---
------------------------------
Comment 85 Geir Ove Myhr 2010-04-04 12:44:44 UTC
Created attachment 34663 [details]
i915_error_state from drm-intel-next kernel

I'm not sure what kind of additional information is useful at this point (as opposed to for bug # 27187). Here is another i915_error_state from drm-intel-next. This time it does not hang in XY_COLOR_BLT. IPEHR (0x60) does not match the instruction header before HEAD, so I suppose this is a real CPU/GPU incoherency.

Relevant part of intel_error_decode output:
  IPEHR: 0x00000060
  ACTHD: 0x0567e034
seqno: 0x00094b09
Buffers [7]:
  0567e000    16384 00000009 00000000 00094b0a dirty purgeable
batchbuffer at 0x0567e000:
...
0x0567e020:      0x7d980000: 3DSTATE_DEFAULT_Z
0x0567e024:      0x00000000:    dword 1
0x0567e028:      0x7d890002: 3DSTATE_FOG_MODE
0x0567e02c:      0x89800000:    dword 1
0x0567e030:      0x00000000:    dword 2
0x0567e034: HEAD 0x00000000:    dword 3
0x0567e038:      0x7c281088: 3DSTATE_MAP_TEX_STREAM_I830


This one is from Ivailo Stoyanov at Ubuntu bug report https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/17
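
The comparison made above (does IPEHR match the instruction header that precedes HEAD?) can be automated roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the line layout ("address: [HEAD] value: text", with continuation words labelled "dword N") is inferred only from the intel_error_decode excerpt quoted in this comment.

```python
import re

# Matches lines such as
#   "0x0567e028:      0x7d890002: 3DSTATE_FOG_MODE"
#   "0x0567e034: HEAD 0x00000000:    dword 3"
LINE = re.compile(
    r"^(0x[0-9a-fA-F]+):\s+(?:(HEAD|TAIL)\s+)?(0x[0-9a-fA-F]+):\s*(.*)$"
)


def header_before_head(decoded_lines):
    """Return (value, name) of the last instruction header before HEAD."""
    last_header = None
    for raw in decoded_lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip "..." and other non-instruction lines
        marker, value, text = m.group(2), int(m.group(3), 16), m.group(4)
        if marker == "HEAD":
            return last_header
        if not text.startswith("dword"):
            # Anything not labelled "dword N" is treated as a header.
            last_header = (value, text)
    return None  # no HEAD marker found in this excerpt


def ipehr_matches(ipehr, decoded_lines):
    """True if IPEHR equals the header of the instruction before HEAD."""
    header = header_before_head(decoded_lines)
    return header is not None and header[0] == ipehr
```

For the excerpt above, the header before HEAD is 0x7d890002 (3DSTATE_FOG_MODE) while IPEHR is 0x00000060, so the check reports a mismatch. A mismatch is only a hint: as noted in comment 50, IPEHR may legitimately lag a few instructions behind ACTHD.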
Comment 86 Brian Rogers 2010-04-04 13:15:38 UTC
Created attachment 34664 [details] [review]
Graphics-breaking workaround: skip i830_uxa_solid

At least with x11perf, I could not reproduce the hang with this patch. We should figure out if there are cases where the GPU can still hang if XY_COLOR_BLT is never called.

That most recent dump appears to point to a different operation, but in my testing, not all of my dumps implicated XY_COLOR_BLT and yet eliminating that prevented all hangs as far as I could tell.

If it never hangs without using XY_COLOR_BLT, perhaps we could find a substitute. If it can hang on other operations, we could eliminate them one-by-one to make a list of the problematic opcodes and see what they might have in common.
Comment 87 Geir Ove Myhr 2010-04-04 21:54:27 UTC
(In reply to comment #86)
> Created an attachment (id=34664) [details]
> Graphics-breaking workaround: skip i830_uxa_solid

Brian, are you saying that commenting out most of i830_uxa_solid still works better, even after the clip solids commit from comment # 72?
Comment 88 Brian Rogers 2010-04-04 23:33:13 UTC
I haven't had a chance to test on the affected machine lately, so I don't know if that patch fixed the x11perf hangs. However, I didn't see invalid bounds in any of the dumps, so I don't see how it could. I'll test x11perf again next chance I get, though.
Comment 89 theonewiththeevillook 2010-04-05 08:50:41 UTC
Created attachment 34676 [details]
output of intel gpu dump after a crash.

The attached file is the output of intel_gpu_dump after a crash obtained while using the patch in Comment #86. I'm sorry I have no idea what the relevant part can be, so I just upload it all. See below for a small excerpt, though (containing the lines which mention 'HEAD' and 'TAIL').

Also I must say that the patch has some side effects : some texts (mostly Gnome menus or texts in gnome applets) sometimes don't appear on the screen and eventually appear if I roll over them with the mouse, or highlight them in some way. I guess these were expected... if not, I can take screenshots.

Preview of the attached file:
ACTHD: 0x0686c000
EIR: 0x00000000
EMR: 0xffffff69
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x18000001
IPEIR: 0x00000000
INSTDONE: 0x01ffffc1
(7840 lines not shown)
0x00007a5c:      0x0686c001: MI UNKNOWN
0x00007a60:      0x0686c01c: MI UNKNOWN
0x00007a64: HEAD 0x00000000: MI_NOOP
0x00007a68:      0x02000004: MI_FLUSH
0x00007a6c:      0x00000000: MI_NOOP
0x00007a70:      0x10800001: MI_STORE_DATA_INDEX
0x00007a74:      0x00000080:    dword 1
0x00007a78:      0x000ddabf:    dword 2
0x00007a7c:      0x01000000: MI_USER_INTERRUPT
0x00007a80:      0x02000000: MI_FLUSH
0x00007a84:      0x00000000: MI_NOOP
0x00007a88:      0x10800001: MI_STORE_DATA_INDEX
0x00007a8c:      0x00000080:    dword 1
0x00007a90:      0x000ddac0:    dword 2
0x00007a94:      0x01000000: MI_USER_INTERRUPT
0x00007a98: TAIL 0x02000004: MI_FLUSH
0x00007a9c:      0x00000000: MI_NOOP
0x00007aa0:      0x18000001: MI UNKNOWN
(24921 more lines not shown)
Comment 90 Brian Rogers 2010-04-05 10:50:45 UTC
Thanks for testing. I just wanted to make sure there were other graphics operations that could hang the GPU. Turns out there are. Yeah, that patch messes up graphics, because I'm ignoring all requests to fill a rectangular region with a solid color, which is needed to clear off a pixmap before drawing on it.
Comment 91 Scott Hansen 2010-04-10 20:55:21 UTC
I tested the patches from bug 27187 on my i845, with no success. Debug logs are at https://bugs.freedesktop.org/show_bug.cgi?id=27187#c84 (hardware and software versions in comment #82). Let me know if you need something else tested.

Scott
Comment 92 theonewiththeevillook 2010-04-11 08:50:28 UTC
I tried to save the output of intel_gpu_dump for the last few crashes that happened to me. I don't know how to interpret these, but I thought I would share this with you :

# grep IPEHR *
intelgpudump-2010-04-05_17:23:34:IPEHR: 0x18000001
intelgpudump-2010-04-06_12:25:37:IPEHR: 0x41500000
intelgpudump-2010-04-06_13:24:52:IPEHR: 0x18000001
intelgpudump-2010-04-06_14:16:10:IPEHR: 0x41600000
intelgpudump-2010-04-06_14:32:29:IPEHR: 0x05000000
intelgpudump-2010-04-06_16:54:39:IPEHR: 0x18000001
intelgpudump-2010-04-07_13:29:26:IPEHR: 0x54300004
intelgpudump-2010-04-07_13:48:38:IPEHR: 0x04caf6e4
intelgpudump-2010-04-09_11:07:47:IPEHR: 0x18000001
intelgpudump-2010-04-09_11:58:45:IPEHR: 0x18000001
intelgpudump-2010-04-09_14:46:27:IPEHR: 0x18000001
intelgpudump-2010-04-09_16:34:45:IPEHR: 0x05000000
intelgpudump-2010-04-10_14:16:59:IPEHR: 0x0a103078
intelgpudump-2010-04-10_15:34:04:IPEHR: 0x00000000
(the date/time is when the dump was taken, which obviously is a few seconds/minutes after the crash happens)


From what I could understand from comment #10, the value of IPEHR could be almost anything, depending on (disk) activity. Am I experiencing distinct bugs ? Is the value of IPEHR not linked to the crash ? I have no idea.

Full logs (intel_gpu_dump output, usually also the dmesg output) are at http://dl.free.fr/v9nxyAGHx as a tar.bz2 file, in case they are of any interest.
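
To see at a glance which IPEHR values recur across many dumps, the grep output above can be tallied. A minimal sketch, assuming the "filename:IPEHR: 0x........" layout shown in this comment; the function name tally_ipehr is made up for illustration.

```python
import re
from collections import Counter

# The file names themselves contain colons, so search for the register
# name instead of splitting each line on ":".
IPEHR = re.compile(r"IPEHR:\s*(0x[0-9a-fA-F]{8})")


def tally_ipehr(grep_lines):
    """Count how often each IPEHR value occurs in `grep IPEHR *` output."""
    counts = Counter()
    for line in grep_lines:
        m = IPEHR.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts
```

Calling tally_ipehr(lines).most_common() lists the values by frequency. A value that dominates may point at one recurring hang site, while a wide spread is more in line with the "could be almost anything" reading; the tally alone cannot tell whether these are distinct bugs.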
Comment 93 René Gabriëls 2010-04-29 19:54:32 UTC
I get similar errors to theonewiththeevillook@yahoo.fr when running a youtube video fullscreen.  This is deterministic: it happens always and right after pressing the full screen button.

1. Dmesg:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 473836 at 473833)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000

2. Xorg.log:

[ 44472.430] (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
[ 44472.504] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 44472.721] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 44473.220] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error

3. GPU dump:

ACTHD: 0x0f815b74
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000000
PGTBL_ER: 0x00000000
IPEHR: 0x02000004
IPEIR: 0x00000000
INSTDONE: 0x03311081
  busy: GMBUS
  busy: MPEG
  busy: MECO
  busy: CC
  busy: DG
  busy: DCMP
  busy: IT
  busy: MG
  busy: MEC
  busy: QCC
  busy: TB
  busy: WM
  busy: EF
  busy: Map L2 cache
  busy: Secondary ring 3
  busy: Secondary ring 2
  busy: Secondary ring 1
  busy: Secondary ring 0
  busy: Primary ring 1
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000:      0x10800001: MI_STORE_DATA_INDEX
0x00000004:      0x00000080:    dword 1
0x00000008:      0x0007309c:    dword 2
0x0000000c:      0x01000000: MI_USER_INTERRUPT
0x00000010:      0x02000001: MI_FLUSH
0x00000014:      0x00000000: MI_NOOP
0x00000018:      0x18800080: MI_BATCH_BUFFER_START
0x0000001c:      0x023e7001:    dword 1
0x00000020:      0x02000004: MI_FLUSH
0x00000024:      0x00000000: MI_NOOP
0x00000028:      0x10800001: MI_STORE_DATA_INDEX
0x0000002c:      0x00000080:    dword 1
0x00000030:      0x0007309d:    dword 2
0x00000034:      0x01000000: MI_USER_INTERRUPT
0x00000038:      0x02000005: MI_FLUSH
0x0000003c:      0x00000000: MI_NOOP
...

4. Software:

kernel: 2.6.34-rc3 + Daniel's patches from bug 27187
xserver: 1.8.0
libdrm: 2.4.20
xf86-video-intel: 2.11.0
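
The IPEHR value can be cross-checked against the ring dump by hand. A minimal sketch, assuming the usual MI command layout (client in bits 31:29, opcode in bits 28:23); the opcode table is built only from the command names visible in the dump above, and `decode_mi_header` is a hypothetical helper name, not part of intel_gpu_dump:

```python
# Opcode table limited to the MI commands that appear in the ring dump above.
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x02: "MI_USER_INTERRUPT",
    0x04: "MI_FLUSH",
    0x21: "MI_STORE_DATA_INDEX",
    0x31: "MI_BATCH_BUFFER_START",
}


def decode_mi_header(dword):
    """Return the MI command name for a header dword, or None if unknown.

    Assumes the dword is a command header; operand dwords are not detected.
    """
    client = (dword >> 29) & 0x7  # 0 selects the memory interface (MI) commands
    if client != 0:
        return None
    opcode = (dword >> 23) & 0x3F
    return MI_OPCODES.get(opcode)


for value in (0x02000004, 0x01000000, 0x10800001, 0x18800080):
    print(f"0x{value:08x}: {decode_mi_header(value)}")
```

Under that assumption, IPEHR 0x02000004 decodes to MI_FLUSH, which matches the ring entry at offset 0x20. The helper only makes sense for header dwords: an operand such as 0x00000080 would be misread as MI_NOOP.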
Comment 94 Michael Rickmann 2010-05-21 05:00:21 UTC
I have been testing different kernels under Lucid on a desktop computer equipped with a 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) with KMS and DRI enabled. Analysing the underlying code is beyond my capabilities; nevertheless, I would like to help.
1) The standard Lucid kernel leads to a crash usually within the first 10 min of usual usage (Openoffice, Firefox).
2) drm-intel-next kernels as provided by http://kernel.ubuntu.com/~kernel-ppa/mainline/ or https://launchpad.net/~brian-rogers/+archive/experimental, the latter with Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel usually crashes after 1 to 3h, the later ones do not survive the very first mode switch, even before Xorg is started, regardless of whether V9 is applied. This kind of crash does not allow for ssh or a graceful reboot.
3) The most interesting kernel seems to me to be Daniel Baumann's ( https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with the V8 patch applied. Usually it crashes within the first 3h of usage. But now I have hit a >20h period and cannot kill it by normal usage: running a GL screensaver, switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting to the standard Lucid xserver-xorg-video-intel and libdrm packages with Xorg stopped, GL screensaver again ..., passing the x11perf test several times, .... /sys/kernel/debug/dri/0/i915_error_state, however, shows
Time: 1274367670 s 906969 us
EIR: 0x00000010
  PGTBL_ER: 0x00000049
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x01000000
  INSTDONE: 0x00ffffc0
  ACTHD: 0x00000048
That one seems to happen always when the computer starts.

Given the randomness of the time after which the crashes/hangs happen and the extent to which different i8xx based machines are affected (e.g. my i855 based laptop runs perfectly with a V8 patched kernel), it is difficult to interpret the above findings on the 82845G. What is not random, I think, are the early deaths with later drm-intel-next kernels (2), and the times to crash of (3) do not look Gaussian distributed either. The comparison of (1) and (3) seems to indicate that the V8 patch fixes one kind of mishap for 82845G based hardware as well.
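
For comparing error states across boots, it can help to turn the register lines into numbers and list which bits are set. A minimal sketch; `parse_registers` and `set_bits` are hypothetical helper names, and the sample text is copied from the i915_error_state excerpt in this comment:

```python
import re

# Matches lines such as "  PGTBL_ER: 0x00000049"; the "Time:" line is skipped
# because its value is not a hex literal.
REG_LINE = re.compile(r"^\s*([A-Z_]+):\s*(0x[0-9a-fA-F]+)\s*$")


def parse_registers(text):
    """Map register name -> integer value for every 'NAME: 0x...' line."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def set_bits(value):
    """Return the positions of the bits that are set in a 32-bit value."""
    return [bit for bit in range(32) if value & (1 << bit)]


SAMPLE = """\
Time: 1274367670 s 906969 us
EIR: 0x00000010
  PGTBL_ER: 0x00000049
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x01000000
  INSTDONE: 0x00ffffc0
  ACTHD: 0x00000048
"""

registers = parse_registers(SAMPLE)
for name in ("EIR", "PGTBL_ER"):
    print(name, set_bits(registers[name]))
```

For the values above this reports bit 4 for EIR and bits 0, 3 and 6 for PGTBL_ER. As far as I can tell from the i915 register definitions, EIR bit 4 is the page-table error bit on these chips, which would be consistent with PGTBL_ER being non-zero, but treat that reading as an assumption.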
Comment 95 legolas558 2010-05-21 05:58:41 UTC
(In reply to comment #94)
> 2) drm-intel-next kernels as provided by
> http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> crashes usually after 1 to 3h the later ones do not survive the very first mode
> switch even before Xorg is started, regardless V9 applied or not. This kind of
> crash does not allow for ssh or gracefully rebooting.
Are you sure that V9 is correctly applied? That early crash has always been an indicator that Daniel Vetter's patch is missing (at least up to now).

> 3) The most interesting kernel seems to me Daniel Baumann's (
> https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> patch applied. Usually it crashes within the first 3h of usage. But now I have
> hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> screensaver again ..., passing the x11perf test several times, ....
> /sys/kernel/debug/dri/0/i915_error_state, however shows
> Time: 1274367670 s 906969 us
> EIR: 0x00000010
>   PGTBL_ER: 0x00000049
>   INSTPM: 0x00000000
>   IPEIR: 0x00000000
>   IPEHR: 0x01000000
>   INSTDONE: 0x00ffffc0
>   ACTHD: 0x00000048
> That one seems to happen always when the computer starts.
> 
You mean that no crashes are found in dmesg during these >20h sessions? Do you have the logs to check it out?

> Given the randomness of the time after which the crashes/stucks happen and to
> which extent different i8xx based machines are affected (e.g. my i855 based
> laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> above findings on the 82845G. What is not random, I think, are the early deaths
> with later drm-intel-next kernels (2), and also the times to crash of (3) does
> not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> that the V8 patch fixes one kind of mishap also for 82845G based hardware.
Have you considered variability of the rest of code (drm-intel-next, Xorg, drivers/libraries), if there is any in your tests?

On my 855GM (rev2) I have reached a very stable situation, except that overlays (created by VLC or mplayer) do always crash the machine in a fairly short amount of time. I have also some background corruption but that might be a new bug in libdrm or in something else.

Can you confirm that the crashes you experienced were in some way connected to video playback?
Comment 96 Michael Rickmann 2010-05-21 07:48:12 UTC
(In reply to comment #95)
> (In reply to comment #94)
> > 2) drm-intel-next kernels as provided by
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> > crashes usually after 1 to 3h the later ones do not survive the very first mode
> > switch even before Xorg is started, regardless V9 applied or not. This kind of
> > crash does not allow for ssh or gracefully rebooting.
> Are you sure that V9 is correctly applied? That early crash has always been an
> indicator of missing Daniel Vetter's patch (at least up to now).
It is the kernel which Brian Rogers has built and announced in https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all comment #105, and I somehow trust him. I also tried to patch the provided sources with V9, only to find that it had been applied already.

> 
> > 3) The most interesting kernel seems to me Daniel Baumann's (
> > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> > patch applied. Usually it crashes within the first 3h of usage. But now I have
> > hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> > screensaver again ..., passing the x11perf test several times, ....
> > /sys/kernel/debug/dri/0/i915_error_state, however shows
> > Time: 1274367670 s 906969 us
> > EIR: 0x00000010
> >   PGTBL_ER: 0x00000049
> >   INSTPM: 0x00000000
> >   IPEIR: 0x00000000
> >   IPEHR: 0x01000000
> >   INSTDONE: 0x00ffffc0
> >   ACTHD: 0x00000048
> > That one seems to happen always when the computer starts.
> > 
> You mean that no crashes are found in dmesg during these >20h sessions? Do you
> have the logs to check it out?
First of all I wish to emphasize that this is one lucky strike which I have hit. More frequently this kernel gets stuck as well. Therefore I left the machine running the screensaver and it is still doing well. dmesg contains a single drm-related error during startup:
[   23.288050] render error detected, EIR: 0x00000010
[   23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[   23.288085] render error detected, EIR: 0x00000010
now the last line is:
[54005.830290] svc: failed to register lockdv1 RPC service (errno 97).

> 
> > Given the randomness of the time after which the crashes/stucks happen and to
> > which extent different i8xx based machines are affected (e.g. my i855 based
> > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> > above findings on the 82845G. What is not random, I think, are the early deaths
> > with later drm-intel-next kernels (2), and also the times to crash of (3) does
> > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> > that the V8 patch fixes one kind of mishap also for 82845G based hardware.
> Have you considered variability of the rest of code (drm-intel-next, Xorg,
> drivers/libraries), if there is any in your tests?
> 
> On my 855GM (rev2) I have reached a very stable situation, except that overlays
> (created by VLC or mplayer) do always crash the machine in a fairly short
> amount of time. I have also some background corruption but that might be a new
> bug in libdrm or in something else.
> 
> Can you confirm that the crashes you experienced were someway connected to
> video playing?
On my i855 based laptop I have tested video only occasionally, not with mplayer or VLC and found no preferential crashing.
Comment 97 legolas558 2010-05-21 08:20:01 UTC
(In reply to comment #96)
> (In reply to comment #95)
> > (In reply to comment #94)
> > > 2) drm-intel-next kernels as provided by
> > > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> > > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> > > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> > > crashes usually after 1 to 3h the later ones do not survive the very first mode
> > > switch even before Xorg is started, regardless V9 applied or not. This kind of
> > > crash does not allow for ssh or gracefully rebooting.
> > Are you sure that V9 is correctly applied? That early crash has always been an
> > indicator of missing Daniel Vetter's patch (at least up to now).
> It is the kernel which Brian Rogers has build and announced in
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all
> comment #105, I somehow trust him. I also tried to patch the provided sources
> with V9 to find that it had been applied already.
> 
I see - can you pick the dump generation script (a small daemon) in attachment 34922 [details] to hopefully get snapshots right before and after the total crash? It has worked for me also without keyboard or ssh, and perhaps one of those dumps might contain information about what's happening there.
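
For readers without access to that attachment, the idea can be sketched as a small polling loop. This is not the script from attachment 34922, only a minimal illustration under my own assumptions: `write_snapshot` and `watch` are hypothetical names, and the output directory is a placeholder. The one important detail is the `fsync`, so that a snapshot taken just before a hard lockup actually reaches the disk.

```python
import os
import time


def write_snapshot(data, out_dir, index):
    """Write one snapshot and force it to disk before returning its path."""
    path = os.path.join(out_dir, f"i915_error_state.{index:04d}")
    with open(path, "wb") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())  # survive a hang that never reaches a clean sync
    return path


def watch(src_path, out_dir, interval=5.0, rounds=None):
    """Poll src_path and save a snapshot each time its content changes.

    rounds=None polls forever; a finite value is mainly useful for testing.
    """
    os.makedirs(out_dir, exist_ok=True)
    last, saved, done = None, [], 0
    while rounds is None or done < rounds:
        try:
            with open(src_path, "rb") as src:
                data = src.read()
        except OSError:
            data = None  # debugfs not mounted or file unreadable; retry later
        if data is not None and data != last:
            saved.append(write_snapshot(data, out_dir, len(saved)))
            last = data
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return saved
```

Run as root, something like `watch("/sys/kernel/debug/dri/0/i915_error_state", "/var/tmp/gpu-dumps")` would keep one file per distinct error state; the directory name here is just an example.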

> > 
> > > 3) The most interesting kernel seems to me Daniel Baumann's (
> > > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> > > patch applied. Usually it crashes within the first 3h of usage. But now I have
> > > hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> > > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> > > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> > > screensaver again ..., passing the x11perf test several times, ....
> > > /sys/kernel/debug/dri/0/i915_error_state, however shows
> > > Time: 1274367670 s 906969 us
> > > EIR: 0x00000010
> > >   PGTBL_ER: 0x00000049
> > >   INSTPM: 0x00000000
> > >   IPEIR: 0x00000000
> > >   IPEHR: 0x01000000
> > >   INSTDONE: 0x00ffffc0
> > >   ACTHD: 0x00000048
> > > That one seems to happen always when the computer starts.
> > > 
> > You mean that no crashes are found in dmesg during these >20h sessions? Do you
> > have the logs to check it out?
> First of all I wish to emphsize that it is one lucky strike which I have hit.
> More frequently this kernel gets stuck as well. Therefore I left the machine
> running the screensaver and it is still doing well. In dmesg is a single drm
> related error during startup:
> [   23.288050] render error detected, EIR: 0x00000010
> [   23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
> [   23.288085] render error detected, EIR: 0x00000010
> now the last line is:
> [54005.830290] svc: failed to register lockdv1 RPC service (errno 97).
> 
svc is not related.

In my case (drm-intel-next with patched v9) it never crashes unless I pick some video or some screen-intensive wine application. Perhaps your configuration is not as stable because you are using a compositing window manager?

> > 
> > > Given the randomness of the time after which the crashes/stucks happen and to
> > > which extent different i8xx based machines are affected (e.g. my i855 based
> > > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> > > above findings on the 82845G. What is not random, I think, are the early deaths
> > > with later drm-intel-next kernels (2), and also the times to crash of (3) does
> > > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> > > that the V8 patch fixes one kind of mishap also for 82845G based hardware.
> > Have you considered variability of the rest of code (drm-intel-next, Xorg,
> > drivers/libraries), if there is any in your tests?
> > 
> > On my 855GM (rev2) I have reached a very stable situation, except that overlays
> > (created by VLC or mplayer) do always crash the machine in a fairly short
> > amount of time. I have also some background corruption but that might be a new
> > bug in libdrm or in something else.
> > 
> > Can you confirm that the crashes you experienced were someway connected to
> > video playing?
> On my i855 based laptop I have tested video only occasionally, not with mplayer
> or VLC and found no preferential crashing.
I suspect that some video-intensive application or window manager could make it crash more frequently; in such case crashes like mine, happening only under specific stress, would be masked because the GPU would be under constant stress. Can this be your case e.g. video-intensive desktop environment or applications being used?
Comment 98 Michael Rickmann 2010-05-24 01:00:57 UTC
(In reply to comment #97)
> (In reply to comment #96)
> > (In reply to comment #95)
> > > (In reply to comment #94)

snip

> > 
> I see - can you pick the dump generation script (a small daemon) in attachment
> 34922 [details] to hopefully get snapshots right before and after the total crash? It has
> worked for me also without keyboard or ssh, and perhaps one of those dumps
> might contain information about what's happening there.
> 
Thanks a lot. I will try that soon, once I have access to the 82845G/GL based hardware again. Currently, I have only an 82852/855GM (rev 02) based laptop; I guess it's similar to yours, and it is rather well behaved with a V8 patched (see below) Lucid kernel.

> > > 
snip
> > related error during startup:
> > [   23.288050] render error detected, EIR: 0x00000010
> > [   23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
> > [   23.288085] render error detected, EIR: 0x00000010
> > now the last line is:
> > [54005.830290] svc: failed to register lockdv1 RPC service (errno 97).
> > 
> svc is not related.
> 
> In my case (drm-intel-next with patched v9) it never crashes unless I pick some
> video or some screen-intensive wine application. Perhaps your configuration is
> not as stable because you are using a compositing window manager?
> 
My 82845G/GL has much poorer performance than the 855GM and must be different in other respects as well. I guess that 82845G/GL hardware is affected by an additional shortcoming which is not covered by the V8/V9 patches.

> > > 
snip
> > > Have you considered variability of the rest of code (drm-intel-next, Xorg,
> > > drivers/libraries), if there is any in your tests?
> > > 
> > > On my 855GM (rev2) I have reached a very stable situation, except that overlays
> > > (created by VLC or mplayer) do always crash the machine in a fairly short
> > > amount of time. I have also some background corruption but that might be a new
> > > bug in libdrm or in something else.
> > > 
> > > Can you confirm that the crashes you experienced were someway connected to
> > > video playing?
> > On my i855 based laptop I have tested video only occasionally, not with mplayer
> > or VLC and found no preferential crashing.
> I suspect that some video-intensive application or window manager could make it
> crash more frequently; in such case crashes like mine, happening only under
> specific stress, would be masked because the GPU would be under constant
> stress. Can this be your case e.g. video-intensive desktop environment or
> applications being used?
This is for the 82852/855GM now: I found that Daniel Baumann's kernel ( https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when confronted with Xv-overlay in a very similar way as you describe it. Stefan Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of crash. In his Changelog it says "* i915-kernel module includes patch to get Xv-overlay mode working again." . That must be something in addition to V8/V9 (I really have to study the sources). If I use the latter modules with Lucid's xserver-xorg-video-intel and libdrms video overlay in totem player only appears when I fiddle around resizing the window. If I use xserver-xorg-video-intel 2.11.0 and the libdrms 2.4.20 as provided by https://launchpad.net/~glasen/+archive/intel-driver video overlay works, for me better than ever before on the 855GM. But Compiz gives in when changing from fullscreen to normal view. I guess one will have to recompile Compiz and all its dependencies against the new driver.
So I think for the 855GM the solution is really close, for the 82845G/GL not yet, I am afraid.
Comment 99 legolas558 2010-05-24 14:36:06 UTC
(In reply to comment #98)
> This is for the 82852/855GM now: I found that Daniel Baumann's kernel (
> https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when
> confronted with Xv-overlay in a very similar way as you describe it. Stefan
> Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German
> only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of
> crash. In his Changelog it says "* i915-kernel module includes patch to get
> Xv-overlay mode working again." . That must be something in addition to V8/V9
> (I really have to study the sources). If I use the latter modules with Lucid's
> xserver-xorg-video-intel and libdrms video overlay in totem player only appears
> when I fiddle around resizing the window. If I use xserver-xorg-video-intel

Where is such patch? I am begging anybody to bring it under my nose, because my Xorg crashes after a few seconds of video watching and it's becoming very stressing...and furthermore the system is often not recoverable by using VTs.  

Flash videos are less likely to trigger the crash, while fullscreen or maximized windows do it best.

I can't find the changelog you are mentioning, where is it? Also looks like he is using only the patches from this bug tracker entry, since he only mentions it on his launchpad page.

> 2.11.0 and the libdrms 2.4.20 as provided by
> https://launchpad.net/~glasen/+archive/intel-driver video overlay works, for me
> better than ever before on the 855GM. But Compiz gives in when changing from
> fullscreen to normal view. I guess one will have to recompile Compiz and all
> its dependencies against the new driver.
> So I think for the 855GM the solution is really close, for the 82845G/GL not
> yet, I am afraid.
It probably has more quirks to be worked out - some hackwork is needed.
Comment 100 Darxus 2010-05-25 15:36:01 UTC
"../../intel/intel_bufmgr_gem.c:901: Error setting to CPU domain 3: Input/output error"

Just after "Failed to submit batchbuffer: Input/output error".  Only gets written to the console.  I'm getting this when I kill off X and try to start it with startx, consistently.  

Up to date Lucid (clean, recent reinstall).  

(II) intel(0): Integrated Graphics Chipset: Intel(R) 845G
Comment 101 Darxus 2010-05-25 15:41:05 UTC
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
Comment 102 Michael Rickmann 2010-05-26 06:06:43 UTC
(In reply to comment #99)
> (In reply to comment #98)
> > This is for the 82852/855GM now: I found that Daniel Baumann's kernel (
> > https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when
> > confronted with Xv-overlay in a very similar way as you describe it. Stefan
> > Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German
> > only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of
> > crash. In his Changelog it says "* i915-kernel module includes patch to get
> > Xv-overlay mode working again." . That must be something in addition to V8/V9
> > (I really have to study the sources). If I use the latter modules with Lucid's
> > xserver-xorg-video-intel and libdrms video overlay in totem player only appears
> > when I fiddle around resizing the window. If I use xserver-xorg-video-intel
> 
> Where is such patch? I am begging anybody to bring it under my nose, because my
> Xorg crashes after a few seconds of video watching and it's becoming very
> stressing...and furthermore the system is often not recoverable by using VTs.  
> 
I do not know whether you are using an Ubuntu patched kernel. If so, my best bet would be that you need
http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff
A short account on the background is given in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15. Upstream kernels do not need that patch, e.g. the one from Brian Rogers mentioned up in comment #97, for reverting a previous patch.

> Flash videos are less likely to trigger the crash, while fullscreen or
> maximized windows do it best.
> 
> I can't find the changelog you are mentioning, where is it? Also looks like he
> is using only the patches from this bug tracker entry, since he only mentions
> it on his launchpad page.
Rather hidden in one of his German texts he mentions the xv fix. The changelog is the one in Stefan Glasenhardt's package /usr/share/doc/855gm-fix-dkms/changelog.gz when installed.
Comment 103 legolas558 2010-05-27 04:14:32 UTC
(In reply to comment #102)
> (In reply to comment #99)
> > Where is such patch? I am begging anybody to bring it under my nose, because my
> > Xorg crashes after a few seconds of video watching and it's becoming very
> > stressing...and furthermore the system is often not recoverable by using VTs.  
> > 
> I do not know whether you are using an Ubuntu patched kernel. If so, my best
> bet would be that you need
> http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff
> A short account on the background is given in
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15.
> Upstream kernels do not need that patch, e.g. the one from Brian Rogers
> mentioned up in comment #97, for reverting a previous patch.
> 
Your bet was correct! I can witness that watching even 5 videos altogether does not crash the system anymore!

The green bands glitches on vertical resync are still there (and pretty disturbing), but this is probably a separate bug.

> > Flash videos are less likely to trigger the crash, while fullscreen or
> > maximized windows do it best.
> > 
> > I can't find the changelog you are mentioning, where is it? Also looks like he
> > is using only the patches from this bug tracker entry, since he only mentions
> > it on his launchpad page.
> Rather hidden in one of his German texts he mentions the xv fix. The changelog
> is the one in Stefan Glasenhardt's package
> /usr/share/doc/855gm-fix-dkms/changelog.gz when installed.
Thanks, I was not looking there because I downloaded the non-DKMS version.
Comment 104 legolas558 2010-05-29 11:18:29 UTC
I got one crash while watching a long video; I can't say right now if it is exactly the same kind of crash experienced before, I'll collect debug data next time. Anyway, the xv_overlay_mode_fix.diff seems to drastically reduce the crash occurrences, or possibly eliminate them entirely (if what I experienced was a different bug).
Comment 105 Nick Betcher 2010-06-09 23:04:07 UTC
I can confirm this same bug on a Dell Optiplex GX260. I see there is a hack for this, but this bug (at least this report) alone has been open for 6 months now with no sign of fixing. Being that this is a show-stopper for this chipset I would imagine that someone would have fixed it by now.

Because of this I am offering a $20 reward (payable via Paypal) to the FIRST person that fixes this issue properly, without simply commenting out code, and successfully feeds it into upstream. Once this is done please email me. :)

Thanks,
--Nick Betcher
Comment 106 rainy6144 2010-06-09 23:33:10 UTC
I wonder how much work it would take if, on the broken chipsets, we just pre-allocate all GTT-mapped memory and make a copy in case a buffer is moved between CPU and GTT domains (that is, like the classic memory manager, only with a new API?).  If my understanding is correct, this would eliminate the need for chipset flushes except when the GTT-mapped memory is first allocated (since the memory may have been touched by the CPU), where any failure would likely be detected early.
Comment 107 theonewiththeevillook 2010-06-29 10:01:35 UTC
I'm currently using a recent version of libdrm and xf86-video-intel (not the most up to date, though) and I have to say that, although there are still messages like:
[11876.299009] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[11876.299300] render error detected, EIR: 0x00000000
[11876.299925] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 543130 at 543128)

in dmesg, X is usable and does not crash anymore. This is enough for me (having to reboot three or four times a day, sometimes much more, really was annoying), so thank you for the work on this !
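
When collecting many of these dmesg lines it helps to pull out the numbers. A minimal sketch, assuming the "awaiting X at Y" pair means the driver waited for request X while the GPU had last completed request Y; `parse_wait_failure` is a hypothetical helper name:

```python
import errno
import os
import re

WAIT_RE = re.compile(
    r"i915_do_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+)\)"
)


def parse_wait_failure(line):
    """Extract the error code and request gap from a wait-request error line."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    code, awaited, current = (int(group) for group in match.groups())
    return {
        "errno": errno.errorcode.get(-code, "unknown"),
        "message": os.strerror(-code),
        "requests_behind": awaited - current,
    }


line = ("[11876.299925] [drm:i915_do_wait_request] *ERROR* "
        "i915_do_wait_request returns -5 (awaiting 543130 at 543128)")
print(parse_wait_failure(line))
```

The return value -5 is EIO, the same "Input/output error" that the X driver reports when batch submission fails, and in this line the GPU stopped two requests short of the one being waited for.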
Comment 108 Chris Wilson 2010-07-01 06:02:05 UTC
*** Bug 28187 has been marked as a duplicate of this bug. ***
Comment 109 Chris Wilson 2010-07-02 12:17:53 UTC
*** Bug 26803 has been marked as a duplicate of this bug. ***
Comment 110 Darxus 2010-07-02 13:02:47 UTC
This bug has been open for half a year, and it looks like no progress in a couple months, and my computer is STILL CRASHING EVERY DAY.  

This is seriously messed up.  What is going on?
Comment 111 Chris Wilson 2010-07-04 08:27:50 UTC
*** Bug 25552 has been marked as a duplicate of this bug. ***
Comment 112 Chris Wilson 2010-07-11 07:26:15 UTC
*** Bug 29006 has been marked as a duplicate of this bug. ***
Comment 113 Chris Wilson 2010-08-26 09:53:16 UTC
As a workaround, I've pushed a shadow branch to http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=shadow
which disables GPU acceleration and uses a static shadow buffer and uncached memory accesses. This avoids the dynamic reallocation of the GTT and the i845 errata and the general i8xx incoherency problems.

To enable use of the shadow, add the option to the intel Device section in xorg.conf:
Section "Device"
  Option "Shadow" "True"
EndSection

It is surviving the wtf test on my i845.
Comment 114 theonewiththeevillook 2010-08-27 02:32:44 UTC
I have to say that I have not had a crash for quite a long time :
uptime is 13 days and I have had no "GPU hung" meanwhile. Thanks to those who worked on this, this is much appreciated.

more details:

$ git config --get remote.origin.url
git://anongit.freedesktop.org/xorg/driver/xf86-video-intel

(compiled on 2010-08-13, it was the latest version at that time)

$ uname -r
2.6.35-gentoo-r1

$ sudo lspci -vv | grep -i graphics -A 12
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (prog-if 00 [VGA controller])
	Subsystem: IBM NetVista A30p
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at 88000000 (32-bit, prefetchable) [size=128M]
	Region 1: Memory at 80000000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [d0] Power Management version 1
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: i915

$ uptime
 11:05:55 up 13 days, 18:23,  2 users,  load average: 1.13, 1.28, 1.18

$ sudo intel_gpu_dump | head
ACTHD: 0xae005568
EIR: 0x00000000
EMR: 0xffffff79
ESR: 0x00000010
PGTBL_ER: 0x00000049
IPEHR: 0x01000000
IPEIR: 0x00000000
INSTDONE: 0x00ffffc0

Would you suggest that I try the "shadow" branch anyway?

Btw, I did have two others errors from drm. I mention them here for
information.
---
One I can reproduce easily is the following:
[1189870.304194] [drm:i915_gem_do_execbuffer] *ERROR* Failed to pin buffer 2 of 3, total 83902464 bytes, 0 fences: -28
[1189870.304202] [drm:i915_gem_do_execbuffer] *ERROR* 586 objects [5 pinned], 147406848 object bytes [49430528 pinned], 49430528/117571584 gtt bytes

Reproduced by opening
<http://d2.gamaniak.com/img/0810/gamaniak_facebook-jesus.jpg> in
firefox ; it partly renders ok, then the image becomes black ; but saving
the file on disk and opening it in eog produces no error and the image renders ok.

(is it worth a new bug report here ?)
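
The numbers in that execbuffer error can be checked with a little arithmetic. A rough sketch, assuming that "total" is the aggregate size the execbuffer needs resident at once, that the last pair is used/total GTT bytes, and that pinned objects cannot be evicted; `gtt_headroom` is a hypothetical helper name:

```python
def gtt_headroom(total_gtt, pinned_gtt, request):
    """Return (bytes available after evicting everything unpinned, fits?)."""
    available = total_gtt - pinned_gtt
    return available, request <= available


# Values copied from the i915_gem_do_execbuffer error lines above.
available, fits = gtt_headroom(
    total_gtt=117571584, pinned_gtt=49430528, request=83902464
)
print(available, fits, 83902464 - available)
```

Under those assumptions only 68141056 bytes can be made available, about 15 MB short of the 83902464 requested, which would fit the -28 (ENOSPC) return code: the image is simply too large for the aperture rather than triggering the incoherency discussed in this bug.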
---
The other error is:

/var/log/messages-20100819.gz:
Aug 18 19:06:55 myhostname kernel: [440653.457668] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

but I don't know how, I don't know why.
Comment 115 Brian Rogers 2010-08-27 08:08:10 UTC
Nice work, Chris!

But, I can't get it to build:

intel_uxa.c: In function ‘intel_shadow_create’:
intel_uxa.c:1039: error: ‘size’ undeclared (first use in this function)
intel_uxa.c:1039: error: (Each undeclared identifier is reported only once
intel_uxa.c:1039: error: for each function it appears in.)
intel_uxa.c: In function ‘intel_uxa_create_screen_resources’:
intel_uxa.c:1255: error: ‘intel_screen_private’ has no member named ‘front_stride’

The master branch builds fine, however.

When I get this compiling, I'll put it in an Ubuntu PPA for people to try.
Comment 116 Chris Wilson 2010-08-27 08:15:21 UTC
> --- Comment #115 from Brian Rogers <brian@xyzw.org> 2010-08-27 08:08:10 PDT ---
> But, I can't get it to build:

Oops, sorry. Compilation fix pushed.
Comment 117 Brian Rogers 2010-09-07 19:12:09 UTC
I've had three people try the shadow branch PPA I set up.

One person got a failure to start up correctly:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/205
He didn't get back to me on whether the failure also happened with the shadow option turned off.

One positive report (after sorting out package management issues):
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/217
No follow-up since then so it must still be working well.

One person still gets GPU hangs:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/628556
There's a dump.tar there with an i915_error_state.
Comment 118 Chris Wilson 2010-09-10 07:24:27 UTC
*** Bug 30064 has been marked as a duplicate of this bug. ***
Comment 119 Brian Rogers 2010-09-17 07:05:08 UTC
*** Bug 24825 has been marked as a duplicate of this bug. ***
Comment 120 Chris Wilson 2010-12-30 08:11:43 UTC
After applying

commit 15056d2c06862627ead868e035fcacc59dce1b1a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 21 17:04:23 2010 +0000

    drm/i915: Flush pending writes on i830/i845 after updating GTT
    
    There is an erratum on these two chipsets that causes the wrong PTE
    entries to be invalidate after updating the GTT and when used from the
    BLT engine. The workaround is to flush any pending writes before those
    PTEs are used by the BLT.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

this reduces to the general i8xx incoherency, bug 27187, for which I have a patch that appears to work on my i845: it passes both the wtf test and Daniel's cache-coherency checker!
Comment 121 Chris Wilson 2011-01-04 02:15:20 UTC
Spoke too soon, still hanging.
Comment 122 Robert A. Kelly III 2011-01-14 07:29:03 UTC
I have the notorious i845 rev01 chipset:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
I don't know anything about graphics internals, but if there is anything I can do to help test things, let me know. Thanks.
Comment 123 Lee Matheson 2011-02-02 22:58:34 UTC
I've been testing my Fujitsu-Siemens Amilo 7400M with newer kernel and Intel driver releases on openSUSE, and while progress has been made, this is not yet completely fixed. It did work fine with the 2.6.27 kernel, but not as well since. There is a thread on the subject here: http://forums.opensuse.org/forums/english/get-technical-help-here/pre-release-beta/438965-intel-gpu-8xx-issues-will-11-3-have-them-too.html

I made a video as to how this worked properly under openSUSE-11.1 here with the 2.6.27 kernel: http://www.youtube.com/watch?v=lfnAPDt_bn0

Until openSUSE-11.4 Milestone-6, the behaviour in openSUSE was not very good at all, although milestone-6 has brought 'some' improvements.

I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (KDE liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver: http://www.youtube.com/watch?v=QRRyQn_h03Y

I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (Gnome liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver:  http://www.youtube.com/watch?v=9-X3ZiYUbcc

The newer 2.6.37-20 kernel (w/Intel 2.14.0 driver) on openSUSE-11.4 milestone-6 is significantly better at preventing a total crash/freeze than the 2.6.28 and later kernels, but still not as good as the 2.6.27 kernel on openSUSE-11.1.

I have not (yet) tried the older Intel 2.9.1 video driver with the 2.6.37-20 kernel.
Comment 124 Balló György 2011-03-09 17:47:29 UTC
There is a regression with kernel 2.6.37 (from Arch Linux bugtracker): "when GPU hungs, the display randomly "fracture" (crazy artifacts) and no longer work. I can "repair" the screen by switching to console and back, but the screen doesn't respond to anything except the cursor. Thankfully I can switch back to console and reboot."[1]

So the current state of this bug on Arch Linux:

kernel 2.6.36 without shadow buffer:
- OpenGL apps work fine.
- Random GPU hangs, but the display is still usable (but slower). The error message from everything.log:
kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
And from Xorg.0.log:
[   792.650] (EE) intel(0): Detected a hung GPU, disabling acceleration.

kernel 2.6.36 with shadow buffer enabled:
- Random GPU hangs disappear.
- OpenGL apps are broken. When I launch a GL app (e.g. Quadrapassel), the window's content is not displayed correctly, and when I try to move/resize it, the Xorg server crashes and restarts.

kernel 2.6.37 without shadow buffer:
- OpenGL apps work fine.
- Random GPU hangs; the display randomly "fractures" and freezes (a reboot is required). The error message from everything.log:
kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
kernel: render error detected, EIR: 0x00000010
kernel: [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
kernel: render error detected, EIR: 0x00000010
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[drm:i915_reset] *ERROR* Failed to reset chip.
And from Xorg.0.log:
[   116.454] (EE) intel(0): Detected a hung GPU, disabling acceleration.
And lots of:
[   131.709] (EE) intel(0): failed to set cursor: Input/output error

kernel 2.6.37 with shadow buffer enabled:
- Random GPU hangs disappear, but:
- The GPU usually hangs when I log in to and log out of a GNOME session. So when GDM loads again, the GPU hangs and the display is messed up after a few seconds.
- OpenGL apps are broken. When I launch a GL app (e.g. Quadrapassel), the window's content is not displayed correctly, and when I try to move/resize it, the Xorg server crashes and restarts.

The used package versions:
libdrm 2.4.23-2
xf86-video-intel 2.14.0-2
xorg-server 1.9.4-1

So currently I can't configure my system with Intel 845G to work properly. Is there any solution?

[1] https://bugs.archlinux.org/task/22781
Comment 125 Chris Wilson 2011-04-17 01:01:14 UTC
*** Bug 27245 has been marked as a duplicate of this bug. ***
Comment 126 upiter77 2011-05-22 15:20:40 UTC
(In reply to comment #125)
> *** Bug 27245 has been marked as a duplicate of this bug. ***

Has anyone found a solution for this bug?
Has the patch above solved the freezing problem?

I have an Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and Debian Squeeze with the 2.6.32 kernel.
After 15 minutes of working with an Internet browser, my GNOME desktop freezes completely and I get this error:

May 20 00:35:58 squeeze kernel: [  252.728005] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 20 00:35:58 squeeze kernel: [  252.728016] render error detected, EIR: 0x00000000
May 20 00:35:58 squeeze: [  252.728024] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 2208 at 2207)
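
For anyone triaging logs like the ones above, here is a minimal Python sketch that picks the hangcheck and wait-request lines out of a kernel log. The patterns are written against the exact message text quoted in this report, so other kernel versions may word these messages differently. The function name `summarize_hang_log` is made up for illustration, and reading the two numbers as "awaited" and "current" sequence numbers is an assumption based on the wording of the message.

```python
import re

# Patterns follow the message text quoted in this report; other kernels may differ.
HANG_RE = re.compile(
    r"\[drm:i915_hangcheck_elapsed\] \*ERROR\* Hangcheck timer elapsed"
)
WAIT_RE = re.compile(
    r"i915_do_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+)\)"
)


def summarize_hang_log(text):
    """Return (hangcheck report count, list of (return code, awaited, current))."""
    hangs = 0
    waits = []
    for line in text.splitlines():
        if HANG_RE.search(line):
            hangs += 1
        match = WAIT_RE.search(line)
        if match:
            # Assumption: the numbers are the awaited and current sequence numbers.
            waits.append(tuple(int(group) for group in match.groups()))
    return hangs, waits


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    print(summarize_hang_log(sys.stdin.read()))
```

The return code -5 corresponds to EIO on Linux, which matches the "Input/output error" that Xorg reports in comment 124.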
Comment 127 upiter77 2011-06-07 16:55:40 UTC
(In reply to comment #126)


Good news!
I've installed these debian wheezy packages on my squeeze:

xserver-xorg-video-intel 

libdrm-intel1 

and now this problem is really solved!
Comment 128 upiter77 2011-06-09 16:36:13 UTC
(In reply to comment #127)


BTW Another solution for the 82845G video driver problem(s):

put the following in /etc/X11/xorg.conf

Section "Device"
Identifier "Card0"
Driver "intel"
Option "Shadow" "false"
Option "DRI" "false"
BoardName "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)"
BusID "PCI:0:2:0"
EndSection

The entry above keeps the video driver from doing the things that can result in, for example, the "logout black screen" problem or the "Hangcheck timer elapsed" error message.

Run "lspci | grep VGA" to get the "BusID" and the "BoardName", for example:

# lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

For additional information run the "lspci -v" command.
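
Turning the lspci address into the BusID string used above can be sketched in Python as follows. This is a minimal sketch: the helper name `lspci_to_busid` is hypothetical, and it assumes that lspci prints bus, device and function in hexadecimal while xorg.conf expects decimal "PCI:bus:device:function". For 00:02.0 both notations give the same digits, but for higher bus or device numbers they differ.

```python
import re

# lspci prints "bus:device.function" in hex, optionally prefixed by a PCI domain.
LSPCI_ADDR_RE = re.compile(
    r"^(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\b"
)


def lspci_to_busid(lspci_line):
    """Convert the leading address of an lspci line to an xorg.conf BusID."""
    match = LSPCI_ADDR_RE.match(lspci_line.strip())
    if match is None:
        raise ValueError("no PCI address at start of line: %r" % lspci_line)
    # Assumption: xorg.conf wants decimal numbers, so convert from hex.
    bus, device, function = (int(part, 16) for part in match.groups())
    return "PCI:%d:%d:%d" % (bus, device, function)


if __name__ == "__main__":
    line = ("00:02.0 VGA compatible controller: Intel Corporation "
            "82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)")
    print(lspci_to_busid(line))
```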
Comment 129 upiter77 2011-06-10 03:38:17 UTC
(In reply to comment #128)
Comment 130 upiter77 2011-06-10 03:39:19 UTC
(In reply to comment #129)

Good news!
I've installed these debian wheezy packages on my squeeze:
 
xserver-xorg-video-intel 
 
libdrm-intel1 
 
and now this problem is really solved!
Comment 131 upiter77 2011-06-19 14:38:54 UTC
(In reply to comment #130)

Good news!
I've installed these debian wheezy packages on my squeeze:
 
xserver-xorg-video-intel 
 
libdrm-intel1 
 
and now this problem is really solved!


It's quite interesting, I'm still getting these errors on my squeeze:

Jun 15 17:28:48 squeeze kernel: [   49.124005] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jun 15 17:28:48 squeeze kernel: [   49.124017] render error detected, EIR: 0x00000000
Jun 15 17:28:48 squeeze kernel: [   49.124838] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 727 at 649)

but the computer doesn't freeze anymore.
Comment 132 Bert Haverkamp 2011-06-27 03:37:05 UTC
Can someone explain what is the status of the solution for this bug?

Chris Wilson's last comment is that he still has hangs.
So I take it there is still no upstream solution.

Upiter77 reports success with the latest debian packages. But what patches have caused this?

What patches should I apply and does anyone know the status of these patches wrt Ubuntu?

Regards,

Bert
Comment 133 upiter77 2011-07-14 01:36:35 UTC
(In reply to comment #128)
> BTW Another solution for the 82845G video driver problem(s):
> 
> put the following in /etc/X11/xorg.conf
> 
> Section "Device"
> Identifier "Card0"
> Driver "intel"
> Option "Shadow" "true"
> Option "DRI" "false"
> BoardName "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated
> Graphics Device (rev 01)"
> BusID "PCI:0:2:0"
> EndSection
> 
> The entry above restricts the video driver to do what can result for example in
> the -logout black screen- problem or -Hang check timer elapsed- error message.
> 
> Run "lspci | grep VGA" you will get the "BusID" and the "BoardName", for
> example:
> 
> # lspci | grep VGA
> 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE
> Chipset Integrated Graphics Device (rev 01)
> 
> For additional information run the "lspci -v" command.
Comment 134 upiter77 2011-07-14 01:40:27 UTC
On openSUSE 11.4 I solved this problem by using the following in /etc/X11/xorg.conf:

Section "Device"
        ### Available Driver options are:-
        ### Values: <i>: integer, <f>: float, <bool>: "True"/"False",
        ### <string>: "String", <freq>: "<f> Hz/kHz/MHz",
        ### <percent>: "<f>%"
        ### [arg]: arg optional
        #Option     "AccelMethod"        	# [<str>]
        Option     "DRI"   "false"             	# [<bool>]
        #Option     "ColorKey"           	# <i>
        #Option     "VideoKey"           	# <i>
        #Option     "FallbackDebug"      	# [<bool>]
        #Option     "Tiling"             	# [<bool>]
        Option     "Shadow"   "true"          	# [<bool>]
        #Option     "SwapbuffersWait"    	# [<bool>]
        #Option     "XvMC"               	# [<bool>]
        #Option     "XvPreferOverlay"    	# [<bool>]
        #Option     "DebugFlushBatches"  	# [<bool>]
        #Option     "DebugFlushCaches"   	# [<bool>]
        #Option     "DebugWait"          	# [<bool>]
        #Option     "HotPlug"            	# [<bool>]
	Identifier  "Card0"
	Driver      "intel"
	BusID       "PCI:0:2:0"
        BoardName   "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device [8086:2562] (rev 01)"
EndSection
Comment 135 Felix Miata 2011-07-16 12:56:26 UTC
The following /etc/X11/xorg.conf.d/50-device.conf worked for me in Fedora 15:

Section "Device"
  Identifier "Default Device"
  Option	"DRI"	"false"
  Option	"Shadow"	"true"
EndSection

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
Comment 136 Eugeni Dodonov 2011-09-08 15:55:37 UTC
This issue is affecting a hardware component which is not being actively worked on anymore.

Moving the assignee to the dri-devel list as contact, to give this issue a better coverage.
Comment 137 Chris Wilson 2011-09-09 02:11:19 UTC
*** Bug 34868 has been marked as a duplicate of this bug. ***
Comment 138 Daniel Vetter 2011-09-20 06:00:31 UTC
Created attachment 51401 [details] [review]
New stab at working around the i845 tlb issues.

It'd be great if anyone with a still-booting i845 could test this. Obviously you need to disable Shadow. Also, expect some slowdown, but hopefully not that bad.
Comment 139 Balló György 2011-11-01 15:04:40 UTC
Created attachment 53024 [details]
The relevant part of dmesg with Daniel's patch

Daniel's patch does not work for me. I tested with the 3.1 kernel, and I saw only some dotted lines and a cursor when Xorg loaded. (Attached is the relevant part of dmesg.)

Current state of the driver with the 3.1 kernel:
- I still get random GPU hangs without ShadowFB. It is stable with ShadowFB.
- XVideo: contrast and saturation are misconfigured (Bug 42488)
- OpenGL without ShadowFB: it's fast and stable until a GPU hang. Once a GPU hang has occurred, OpenGL apps no longer work.
- OpenGL with ShadowFB enabled: it works, but very slowly, slower than llvmpipe.
- Trying to run GNOME Shell always causes an immediate GPU hang, even if ShadowFB is enabled.
Comment 140 Abdul Nazar P 2011-11-06 08:44:26 UTC
pls work for i845
Comment 141 Mike Ranweiler 2011-12-08 10:35:22 UTC
Hi, I've hit this too with:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)

I tried the patch in comment #138 against 3.2.0-rc4, and I had similar results as comment #139 - the display wasn't usable.

Is there anything else I can try?  I have the system readily available.
Comment 142 Chris Wilson 2012-04-14 06:37:05 UTC
*** Bug 40960 has been marked as a duplicate of this bug. ***
Comment 143 Chris Wilson 2012-04-14 06:41:39 UTC
*** Bug 40181 has been marked as a duplicate of this bug. ***
Comment 144 Chris Wilson 2012-04-14 07:45:12 UTC
*** Bug 19068 has been marked as a duplicate of this bug. ***
Comment 145 Chris Wilson 2012-04-16 04:27:15 UTC
I've put some recent patches into http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=845g which make my 845g much more stable, though I'm still getting spurious GPU hangs under memory pressure. In that regard SNA is performing better (not only due to being a more complete acceleration architecture) as it is thrashing the GTT far less.
Comment 146 Chris Wilson 2012-07-21 18:09:09 UTC
*** Bug 27578 has been marked as a duplicate of this bug. ***
Comment 147 Chris Wilson 2012-08-03 12:18:44 UTC
*** Bug 53065 has been marked as a duplicate of this bug. ***
Comment 148 Chris Wilson 2012-08-26 19:24:34 UTC
*** Bug 54093 has been marked as a duplicate of this bug. ***
Comment 149 slenkar 2012-09-03 02:14:44 UTC
Thanks Chris, how do I apply the patch?
Comment 150 Chris Wilson 2012-10-12 17:25:49 UTC
*** Bug 55934 has been marked as a duplicate of this bug. ***
Comment 151 Chris Wilson 2012-11-09 20:23:38 UTC
*** Bug 56933 has been marked as a duplicate of this bug. ***
Comment 152 Chris Wilson 2012-12-12 20:54:59 UTC
Woohoo!

commit c7f7dd61fd07dbf938fc6ba711de07986d35ce1f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 12 19:43:19 2012 +0000

    sna: Pin some batches to avoid CS incoherence on 830/845
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=26345
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


I can't promise the incoherence won't show up elsewhere as render corruption, but my 845g is finally surviving wtf stress tests.
Comment 153 Chris Wilson 2012-12-17 15:38:29 UTC
Now in kernel form as well:

commit b75e53bac7f4164e1c53a636352faa3d177b4beb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Dec 16 18:08:07 2012 +0100

    drm/i915: Implement workaround for broken CS tlb on i830/845
    
    Now that Chris Wilson demonstrated that the key for stability on early
    gen 2 is to simple _never_ exchange the physical backing storage of
    batch buffers I've tried a stab at a kernel solution. Doesn't look too
    nefarious imho, now that I don't try to be too clever for my own good
    any more.
    
    v2: After discussing the various techniques, we've decided to always blit
    batches on the suspect devices, but allow userspace to opt out of the
    kernel workaround assume full responsibility for providing coherent
    batches. The principal reason is that avoiding the blit does improve
    performance in a few key microbenchmarks and also in cairo-trace
    replays.
    
    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
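
To illustrate the idea in the commit message (never exchange the physical backing storage of batch buffers), here is a toy Python model. It is not the kernel code and does not describe the real hardware in any detail. The class `ToyCommandStreamer`, both helper functions and all addresses and page numbers are invented for illustration. The only assumption taken from the thread is that the command streamer can keep using a stale translation after the GTT changes.

```python
class ToyCommandStreamer:
    """Toy command streamer whose translation cache is never invalidated."""

    def __init__(self, memory):
        self.memory = memory  # physical page number -> batch contents
        self.tlb = {}         # GTT address -> physical page number (may go stale)

    def execute(self, gtt, address):
        # The first lookup is cached; later GTT updates are not noticed.
        page = self.tlb.setdefault(address, gtt[address])
        return self.memory[page]


def submit_rebinding(cs, gtt, memory, address, page, batch):
    """Place the batch in a new physical page and rebind the address to it."""
    memory[page] = batch
    gtt[address] = page
    return cs.execute(gtt, address)


def submit_via_pinned_copy(cs, gtt, memory, pinned_address, batch):
    """Copy the batch into storage whose physical page never changes."""
    memory[gtt[pinned_address]] = batch
    return cs.execute(gtt, pinned_address)


if __name__ == "__main__":
    memory, gtt = {}, {0x2000: 9}  # 0x2000 is the hypothetical pinned buffer
    cs = ToyCommandStreamer(memory)
    print(submit_rebinding(cs, gtt, memory, 0x1000, 1, "batch A"))
    print(submit_rebinding(cs, gtt, memory, 0x1000, 2, "batch B"))  # stale read
    print(submit_via_pinned_copy(cs, gtt, memory, 0x2000, "batch C"))
    print(submit_via_pinned_copy(cs, gtt, memory, 0x2000, "batch D"))
```

In this model the second rebinding submission still executes the old contents, while the pinned-copy path always executes what was just copied, at the cost of one extra copy per batch. That extra copy is the blit the v2 note refers to.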
Comment 154 Trey Ramsay 2013-02-26 22:51:36 UTC
These fixes are in the latest 3.8 kernel.   Have the GPU hangs been fixed in earlier versions of the kernel?
Comment 155 Chris Wilson 2013-02-27 00:12:39 UTC
(In reply to comment #154)
> These fixes are in the latest 3.8 kernel.   Have the GPU hangs been fixed in
> earlier versions of the kernel?

The sna fixes work with any KMS/GEM (i.e. 2.6.29+) kernel. The kernel w/a is being backported by Julien Cristau for the debian kernel.
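
As a rough aid for answering "which fix can my kernel use", here is a small Python sketch that compares a kernel release string against the two thresholds mentioned in this thread: 2.6.29 for KMS/GEM (needed by the sna fix) and 3.8 for the upstream kernel workaround. The function names are hypothetical, and the check cannot see distribution backports such as the Debian one mentioned above, so treat a negative answer as "not guaranteed" rather than "absent".

```python
import re

SNA_FIX_MIN = (2, 6, 29)          # KMS/GEM kernels, per comment 155
KERNEL_WORKAROUND_MIN = (3, 8, 0)  # upstream kernel workaround, per comment 154


def kernel_version(release):
    """Parse the leading 'X.Y[.Z]' of a release string into a 3-tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) if part else 0 for part in match.groups())


def fix_availability(release):
    """Report which of the two fixes the given upstream version can rely on."""
    version = kernel_version(release)
    return {
        "sna_fix_usable": version >= SNA_FIX_MIN,
        "kernel_workaround_upstream": version >= KERNEL_WORKAROUND_MIN,
    }


if __name__ == "__main__":
    import platform

    print(fix_availability(platform.release()))
```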
Comment 156 Trey Ramsay 2013-02-27 15:46:47 UTC
Thanks... We are using 2.6.32 kernel.  We had problems with the 845G hanging but I don't think we were using SNA acceleration at the time.  What patches do you think we need?
Comment 157 Trey Ramsay 2013-03-05 22:03:01 UTC
In regards to comment 145,  is it recommended to use SNA?  We are not using SNA and have seen GPU hangs on the 845G.   Is it better to use SNA and apply the SNA patch to xorg-x11-drv-intel?
Comment 158 Chris Wilson 2013-03-06 09:08:35 UTC
(In reply to comment #157)
> In regards to comment 145,  is it recommended to use SNA?  We are not using
> SNA and have seen GPU hangs on the 845G.   Is it better to use SNA and apply
> the SNA patch to xorg-x11-drv-intel?

comment 145 is stale, superseded by the genuine fixes in comments 152 and 153. I would recommend using SNA on gen2 as UXA pales in comparison.
Comment 159 Balló György 2013-04-30 01:36:25 UTC
Something is broken in the 3.8 kernel. When I'm using it, the colour depth is low, and my system freezes when I try to suspend the computer. I don't know if it is caused by the applied workaround or not, but the problem is gone if I downgrade to kernel version 3.7.
Comment 160 Jani Nikula 2013-04-30 05:46:10 UTC
(In reply to comment #159)
> Something is broken in the 3.8 kernel. When I'm using it, the colour depth
> is low, and my system freezes when I try to suspend the computer. I don't
> know if it caused by the applied workaround or not, but the problem gone if
> I downgrade to kernel version 3.7.

Please file a new bug with a detailed description of your symptoms instead of cluttering this one.

