Hmm, that test is surviving in a continuous loop on my i845. One substantial difference between our environments (besides my system using the current tip) is that I have a rev 03 compared to your rev 01. Could this be another notorious gen2 h/w bug? Given that the batchbuffer appears perfectly innocent, we don't have enough information to deduce how the h/w got itself into this state. Thanks for reducing this to a small test case, though it is still baffling.
The biggest change between Jaunty and Karmic would be that the i8xx was blacklisted and acceleration disabled for Jaunty due to the severe number of bugs (some extremely nasty cache coherency issues in particular). It would still be useful to know if the bug is reproducible on the latest stack. In particular, there have been a couple of fence flushing issues spotted (though I don't think the tests you performed would have hit those paths) and we now upload images differently (which may stress the h/w differently, for better or for worse). But it will help to narrow the differences between our systems.
Created attachment 32947 [details]
dmesg with newest stuff
Also freezes on a Lucid install with kernel 2.6.33-rc6 and the xorg-edgers PPA, which has the current xserver-xorg-video-intel and a very recent libdrm and mesa.
What do you make of this stuff in dmesg?
[ 2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 2.142832] render error detected, EIR: 0x00000010
[ 2.142838] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 2.142852] render error detected, EIR: 0x00000010
(In reply to comment #2)
> What do you make of this stuff in dmesg?
> [ 2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
> [ 2.142832] render error detected, EIR: 0x00000010
That's an unrelated and apparently harmless error. The most serious side-effect it has at the moment is it prevents i915_error_state from capturing the later hang.
I'm now close to 6 hours of runtime with 'x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1', still no hang on this i845. :(
See also bug 26344, which is a GL-triggered freeze bug on the same hardware. Can you reproduce that one on your i845?
I've been running a PPA to try to bisect random freezing here:
https://bugs.launchpad.net/bugs/456902
In fact, that's what led me to find this test case. Anyway, I just asked people to report results with x11perf, and someone just reported that it freezes with rev 03 hardware:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
Created attachment 33268 [details]
dmesg on 2.6.33-rc8 with drm.debug=0x06 and tests srect10-srect500
The stippled rectangle tests can also trigger a freeze. Here's a dmesg from 2.6.33-rc8 with xorg-edgers. It took eight cycles of this, on a freshly started Gnome desktop with nothing else going on:
while ! ( dmesg | grep 'render error' ); do x11perf -range srect10,srect500 -time 1 -repeat 1; done
I set DISPLAY=:0 and ran this test from an ssh login, because I think updating a terminal on the same screen interferes with the test.
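The shell loop above can also be written as a small script, which makes it easy to count cycles or cap the run. This is only a sketch: the helper names (run_until_error, run_x11perf, read_dmesg) and the max_cycles cap are made up for illustration, while the x11perf arguments, the 'render error' marker and DISPLAY=:0 are taken from the commands above.

```python
import os
import subprocess


def run_until_error(run_test, read_log, marker="render error", max_cycles=1000):
    """Call run_test() repeatedly until marker appears in read_log().

    Returns the number of test cycles that were run before the marker
    showed up, or None if max_cycles cycles ran without it appearing.
    """
    cycles = 0
    while marker not in read_log():
        if cycles >= max_cycles:
            return None
        run_test()
        cycles += 1
    return cycles


def run_x11perf():
    # Same stippled-rectangle range as the shell loop; DISPLAY is set so the
    # test can be started from an ssh login.
    env = dict(os.environ, DISPLAY=":0")
    subprocess.run(
        ["x11perf", "-range", "srect10,srect500", "-time", "1", "-repeat", "1"],
        env=env,
        check=False,
    )


def read_dmesg():
    # Returns the current kernel log as one string.
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout


# Usage (on the affected machine):
#   cycles = run_until_error(run_x11perf, read_dmesg)
```

Passing the two helpers in as arguments keeps the loop itself testable without the hardware.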
Created attachment 33269 [details]
i915_error_state with srect tests
With the console initialization bug fixed, I can now get an i915_error_state corresponding to this freeze.
Let me know if you need any more debug info.
I'm finding ASCII characters in the IPEHR register sometimes, as if the wrong data is being sent to the graphics card. For example, one run looked like this:
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x45494c43
INSTDONE: 0x00ffffc1
ACTHD: 0x01a33008
With byte order swapped, the value of IPEHR corresponds to the string 'CLIE'.
I just applied Chris Wilson's latest batch buffer reporting patch. I now have the system running the x11perf test on bootup until it freezes, then saving /sys/kernel/debug/dri/0 and rebooting to do the test again. So far I see suspicious values of IPEHR in 3 out of 13 runs. I don't see anything strange like that in the batchbuffer dumps so far, though. I'll leave this test running overnight.
Created attachment 33331 [details]
i915_error_state with batchbuffer dump
Kernel is the current linux master branch (v2.6.33-rc8-26-g0813e22) with the batchbuffer dumping patch added.
$ apt-cache policy libdrm2 xserver-xorg-video-intel | grep Installed
Installed: 2.4.17+git20100210.4f0f8717-0ubuntu0sarvatt
Installed: 2:2.10.0+git20100211.00e7312d-0ubuntu0sarvatt
This is a dump that does not have recognizable string data in IPEHR. It was made with a couple runs of x11perf while running a program that repeatedly forked and waited on its children to stress the CPU, and no drm debug messages turned on. The x11perf command was this:
x11perf -range srect10,srect500 -time 1 -repeat 1
Created attachment 33345 [details]
i915_error_state with IPEHR = wtf
I figured out exactly where the string data in IPEHR is coming from. In fact, I was able to plant my own data into that register. Here's a dump where
IPEHR = 0x0a667477 = "wtf" (followed by newline)
All it took was "yes wtf > wtf" during the x11perf run, which writes lines of 'wtf' continuously to a file.
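For anyone who wants to check their own dumps for this, the register value can be turned back into bytes with a few lines of Python. This is just a sketch for illustration; the function names are invented here, and the only assumption is that the 32-bit value is read in little-endian (memory) order, which is what "byte order swapped" amounts to.

```python
import struct


def ipehr_bytes(value):
    """Return the four bytes of a 32-bit register value in little-endian order."""
    return struct.pack("<I", value)


def looks_like_text(raw):
    """True if every byte is printable ASCII or common whitespace."""
    return all(32 <= b < 127 or b in (9, 10, 13) for b in raw)


print(ipehr_bytes(0x0A667477))   # b'wtf\n'
print(ipehr_bytes(0x45494C43))   # b'CLIE'
print(looks_like_text(ipehr_bytes(0x02000011)))  # False: a real MI_FLUSH opcode
```

A value that passes looks_like_text() is not proof of corruption, but it is a cheap way to flag dumps worth a closer look.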
I guess the graphics card is getting data being sent to the hard drive. That's why dmesg fragments wound up in the register: that's something that gets logged.
"x11perf -range srect10,srect500 -time 1 -repeat 1" reliably triggers the freeze here.
Finally found the missing module and tracked down the broken patch that was preventing my branch booting on my i845... And now I cannot reproduce the freeze with "x11perf -range srect10,srect500 -time 1 -repeat 1". *cries*
IPEHR: 0x0a667477
...
ACTHD: 0x00b06008
seqno: 0x00000080
Buffers [1]:
00b06000 16384 00000009 00000000 00000081
batchbuffer at 0x00b06000:
0x00b06000: 0x02000011: MI_FLUSH
0x00b06004: 0x05000000: MI_BATCH_BUFFER_END
0x00b06008: HEAD 0x00000000:
And that is after applying the big hammer of wbinvd on every batch.
First successful workaround:
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f1fcc97..0dcf761 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3530,6 +3530,9 @@ i915_dispatch_gem_execbuffer(struct drm_device *dev,
 	if (exec->flags & I915_GEM_NO_DISPATCH)
 		return 0;
+	msleep(10);
---
*screams*
So it appears that the memory barriers are having little effect and the uncached write to the ringbuffer by the CPU and subsequent read of the batch buffer by the GPU are occurring before main memory has been flushed. Adding extra memory barriers or invalidations, or writes from the GPU to memory, was insufficient.
Created attachment 33593 [details] [review]
Wait after memory barriers for the system memory to update
I'm not going to send this upstream until I have at least a Tested-by!
That appears to stabilize it, however I've noticed at least one scenario where it causes severe slowdown. When wine is in charge of drawing its own windows because the "emulate a virtual desktop" option is turned on, the gradient on the window's title bar takes a good couple of seconds to draw. I can see it being filled in left to right.
Also, I did manage to cause a freeze in one scenario. In my normal test, I run Xorg, then xclock, then I run x11perf over and over.
I can't get that test to freeze with this patch. But if I skip xclock, then Xorg resets the display after every run of x11perf because the last client has disconnected, and in this case I caused a GPU hang once. I'll do some more testing and gather data on this. Maybe it's just a different bug.
I put a kernel with this patch in an Ubuntu PPA, and I'm going to get some feedback from users experiencing random hangs. Let's see if this patch affects that issue...
So I had the bright idea of using a GTT mapping to avoid the chipset flushing (and associated delay) which restores performance... and the original problem. Conclusion: not even GTT mappings are coherent.
*** Bug 21826 has been marked as a duplicate of this bug. ***
*** Bug 24789 has been marked as a duplicate of this bug. ***
Retitling to better reflect the underlying issue.
*** Bug 22771 has been marked as a duplicate of this bug. ***
*** Bug 26200 has been marked as a duplicate of this bug. ***
*** Bug 26580 has been marked as a duplicate of this bug. ***
*** Bug 24137 has been marked as a duplicate of this bug. ***
*** Bug 26746 has been marked as a duplicate of this bug. ***
This bug does not duplicate bug 24789, as I can't even boot to Xorg in enough time to run the testcase; furthermore, if I use the suggested msleep(10) patch I get crashes almost instantly.
So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for me, and this cannot be a dupe of 24789; it might be the other way round instead.
Please test these patches and say if bug is (almost) fixed for you: - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by jbarnes - http://bugzilla.kernel.org/attachment.cgi?id=25084 - drm-intel-big-hammer.patch from FC13 kernel patches I said almost because bug can still be triggered under heavy CPU/GPU load, like when watching a video, but system is definitively usable (I used it to post these comments) (In reply to comment #26) > this bug does not duplicate bug 24789 as I can't even boot to Xorg in enough > time to run the testcase; furthermore, if I use the suggest msleep(10) patch I > get crashes almost instantly. Sounds like plymouth is doing something just as funky to write to the framebuffer. Looks like a GTT map [ http://cgit.freedesktop.org/plymouth/tree/src/plugins/renderers/drm/ply-renderer-i915-driver.c ], which as pointed out earlier also suffers from exactly the same coherency issues, but is not covered by the AGP chipset flush. In short, you have the same bug and the wbinvd() just happens to cause sufficient delay on all batchbuffers that it happens to work most of the time. > So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for > me, and this cannot be a dupe of 24789; it might be the other way round > instead. > > Please test these patches and say if bug is (almost) fixed for you: > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by > jbarnes Upstream, no effect (obviously). > - http://bugzilla.kernel.org/attachment.cgi?id=25084 - > drm-intel-big-hammer.patch from FC13 kernel patches As mentioned much earlier, no effect. *** Bug 24789 has been marked as a duplicate of this bug. *** thanks for testing those patches Chris, so the conclusion is that bug are different, not duplicate I also believe that patch for this bug could fix also bug 24789, but that's a mere supposition we don't have evidence until such patch exists. 
and until that we have two different bugs triggered in two different ways. (In reply to comment #27) > (In reply to comment #26) > Upstream, no effect (obviously). > > > - http://bugzilla.kernel.org/attachment.cgi?id=25084 - > > drm-intel-big-hammer.patch from FC13 kernel patches > > As mentioned much earlier, no effect. > Are you saying that this patch does not fix the bug for you? So we have your hardware with patch in attachment 33593 [details] [review] which works, while my hardware (855GM rev02) which doesn't. Correct? I really don't see why the bugs should be merged considering different triggering conditions and different workaround patches (In reply to comment #30) > Are you saying that this patch does not fix the bug for you? > So we have your hardware with patch in attachment 33593 [details] [review] which works, while my > hardware (855GM rev02) which doesn't. Correct? > > I really don't see why the bugs should be merged considering different > triggering conditions and different workaround patches The two bugs in question are both due to the GPU executing the command stream prior to GMCH completing its write, thus hanging on illegal instructions that do not match the batch buffer dumped. The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents and purposes, this simply adds a delay since the caches are flushed later anyway. However as is demonstrated by your own statements, and I confirm, this is insufficient to ensure that all writes are completed prior to the GPU performing its DMA to main memory. The reason why the msleep() hack does not solve everything is that it is limited to the AGP chipset flush which is only performed on invalidating the CPU domain. The truly astonishing thing about this bug is that the GTT domain appears to be similarly affected. 
Hence why the wbinvd() patch appears to be more successful in some scenarios than the msleep(), but is still fundamentally flawed. (In reply to comment #31) > (In reply to comment #30) > > Are you saying that this patch does not fix the bug for you? > > So we have your hardware with patch in attachment 33593 [details] [review] [details] which works, while my > > hardware (855GM rev02) which doesn't. Correct? > > > > I really don't see why the bugs should be merged considering different > > triggering conditions and different workaround patches > > The two bugs in question are both due to the GPU executing the command stream > prior to GMCH completing its write, thus hanging on illegal instructions that > do not match the batch buffer dumped. > [offtopic]I know that this ends up with something broken being identified in the hardware/firmware, I am just waiting for somebody to clearly say it...but still believing that some magic (read hack) can eventually fix this up[/offtopic] > The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush > all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents > and purposes, this simply adds a delay since the caches are flushed later > anyway. However as is demonstrated by your own statements, and I confirm, this > is insufficient to ensure that all writes are completed prior to the GPU > performing its DMA to main memory. The reason why the msleep() hack does not > solve everything is that it is limited to the AGP chipset flush which is only > performed on invalidating the CPU domain. The truly astonishing thing about > this bug is that the GTT domain appears to be similarly affected. Hence why the > wbinvd() patch appears to be more successful in some scenarios than the > msleep(), but is still fundamentally flawed. 
> I fully agree with your statements, also taking those for which I have no knowledge as true; as per my testing I have found a qualitative difference in the two patches: the msleep() workaround works a lot worse for me, and it is rarely distinguishable from the vanilla kernel's situation, while the wbinvd() approach "makes it usable", although I should use it with the certainness that it will crash - sooner or later. So right now I have a quick testcase for invalidating the msleep() patch while the wbinvd() patch works for longer and is not tied to a magic number (10), possibly dependant on the load of my machine. If you want I can calibrate that magic number for my box, but that would just be experimentation without an usable feedback. Also there was an user (M.Nowak) saying that FC13 is totally free of this bug; if this fact was true, then FC13 has some other interesting patch to look at (which I couldn't identify up to now), otherwise it's just harder for the bug to be triggered with FC13's patched kernel (I believe this), bringing no news to us. (In reply to comment #32) > Also there was an user (M.Nowak) saying that FC13 is totally free of this bug; > if this fact was true, then FC13 has some other interesting patch to look at > (which I couldn't identify up to now), otherwise it's just harder for the bug > to be triggered with FC13's patched kernel (I believe this), bringing no news > to us. > Forgot to add: for me FC13 is broken as any other kernel, so I asked M.Nowak to test vanilla kernel + drm-intel-big-hammer.patch, but he has not yet provided results. Some other test results from people with this hardware would be very welcome. Chris, I've thought a bit about what failure-mode could possibly explain all these different corruptions. Could it be that the GTT _table_ contains stale entries? Yes I know, this sounds crazy but I haven't yet found another failure mode that could nicely explain what's going on ... 
For the gtt corruption case:
- map new bo into gtt
- start writing
- new gtt mappings become effective
- further writes
gtt cpu writes are wc, i.e. the cpu can send out only 4 byte sized writes, agp doesn't cache them. This would explain a single "wtf " in the command buffer (stale data that's been in the page that's just been assigned to this new bo).
For the non-gtt write:
- write stuff to mem
- map bo into gtt
- gpu starts using them
- new gtt mappings become effective
- gpu reads crap
Of course, in the case of gtt writes, the write should end up somewhere else in system memory. But we're mapping a dummy page for all empty gtt entries, right? So it's quite likely they end up in there. If this dummy page never gets corrupted, I'm obviously wrong.
<crazymode />
I can't test this theory though, because I can't reproduce the bug on my i855.
btw, the reason I came up with this: just yesterday I experienced a strangely corrupted pixmap on my i855GM: 8 pixels high (TILE_X height) and about half the screen wide (around 512 pixels, i.e. 4 pages of TILE_X tiles, as wide as the pixmap) with nice colorful garbage. This got me thinking that maybe we're dealing with corruptions in GTT_PAGE_SIZE quantities. This was after about an hour of hitting the box with x11perf and filling the disk with wtfs, and also the first time I've ever seen something like this.
Yes, I had thought it possible that this could be a missing flush after updating the PTEs. I added a few more flushes along those paths, just in case. (Though that does not conclusively rule that out.)
(In reply to comment #34)
> I can't test this theory though, because I can't reproduce the bug on my i855.
Daniel, do you have a patch that would enable testing your theory?
I could apply it and see if my system keeps the GPU alive for more than a few hours (it wedges at least once a day, usually more often, and half of the time X exits).
Here it wedges with normal desktop usage (Enlightenment + GTK applications).
Chris, I've tried to prove my theory one way or another by massively increasing the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list with the following patch:
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 50244fc..197260b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1822,6 +1822,8 @@ i915_gem_retire_request(struct drm_device *dev,
 		drm_gem_object_reference(obj);
 		i915_gem_object_move_to_inactive(obj);
 		spin_unlock(&dev_priv->mm.active_list_lock);
+		i915_gem_object_unbind(obj);
 		drm_gem_object_unreference(obj);
 		spin_lock(&dev_priv->mm.active_list_lock);
 	}
Whereas before I couldn't hang my i855GM with about half an hour of x11perf plus yes wtf > /tmp/wtf, it now hangs within one minute after login. Moving around windows, switching desktops is sufficient. I'll add the i915_error_state dump shortly.
Created attachment 33750 [details]
i915_error_state from my i855
I'm not really fluent in reading these dumps, but the block of zeros right before ACTHD looks very fishy: 64 bytes in length and nicely size-aligned, i.e. a cache-line gone wrong.
The hang occurred right after a fresh bootup, i.e. the memory has been cleared by the power reset. That might explain the zeros instead of some other random crap.
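Spotting such blocks by eye is tedious, so here is a rough sketch of how one could scan a batchbuffer dump for zeroed, size-aligned 64-byte blocks. The input format is an assumption made for this example (a base address plus a list of dword values in address order), not the actual layout of i915_error_state, and the function name is invented.

```python
CACHE_LINE = 64  # bytes; the block size noted in the comment above


def zero_cache_lines(base, dwords, line_size=CACHE_LINE):
    """Return start addresses of aligned line_size blocks that are all zero.

    base   -- address of the first dword (assumed to be 4-byte aligned)
    dwords -- 32-bit values from the dump, in address order
    """
    per_line = line_size // 4
    # Index of the first dword whose address is a multiple of line_size.
    first = ((-base) % line_size) // 4
    hits = []
    for i in range(first, len(dwords) - per_line + 1, per_line):
        if not any(dwords[i:i + per_line]):
            hits.append(base + 4 * i)
    return hits


# One aligned zero block in the middle of otherwise non-zero data.
sample = [1] * 16 + [0] * 16 + [2] * 16
print([hex(a) for a in zero_cache_lines(0x1000, sample)])  # ['0x1040']
```

Zero runs that straddle a 64-byte boundary are deliberately ignored, since only a fully zeroed aligned block matches the cache-line pattern described here.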
(In reply to comment #37)
> Chris, I've tried to prove my theory one way or another by massively increasing
> the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list
> with the following patch:
Sadly that also forces the domain change to CPU, so stresses the CPU flushing paths as well; not quite clear cut.
The code to poke around in is drivers/char/agp/intel-agp.c. In particular, intel_i8xx_tlbflush(). But there do seem to be some other odd discrepancies between gen2 and later.
(In reply to comment #27)
> (In reply to comment #26)
> > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > jbarnes
> Upstream, no effect (obviously).
>
Where is this upstream? I am using linus' tree and patch is not there; I need this patch for a different bug: without it the screen is OFF
(In reply to comment #40)
> (In reply to comment #27)
> > (In reply to comment #26)
> > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > jbarnes
> > Upstream, no effect (obviously).
> >
> Where is this upstream? I am using linus' tree and patch is not there; I need
> this patch for a different bug: without it the screen is OFF
Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a couple of weeks now - surprisingly since it is marked for stable.
(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #27)
> > > (In reply to comment #26)
> > > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > > jbarnes
> > > Upstream, no effect (obviously).
> > >
> > Where is this upstream? I am using linus' tree and patch is not there; I need
> > this patch for a different bug: without it the screen is OFF
>
> Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a
> couple of weeks now - surprisingly since it is marked for stable.
>
Yes because on these laptops it's really necessary, otherwise the display is not even recognized.
Perhaps I should also use drm-intel-next kernel for these tests? I am also available to try other patches and run other tests; apparently my i855GM (rev 02) is very quick at crashing; unfortunately I am not following you both very well in the reasonings, but there seem to be some light. The worst thing is that bug is apparent in different ways and development efforts are seemingly scattered around (all distros' bugtrackers are filled up with reports about i8xx devices) > --- Comment #39 from Chris Wilson <chris@chris-wilson.co.uk> 2010-03-04 01:03:40 PST --- > (In reply to comment #37) > > Chris, I've tried to prove my theory one way or another by massively increasing > > the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list > > with the following patch: > > Sadly that also forces the domain change to CPU, so stresses the CPU flushing > paths as well; not quite clear cut. Of course, you're right, this test only strongly points at a problem in either object_unbind and/or object_bind. But I think we can rule out cpu flushing with a high probability: - We only clflush on moving away from the cpu domain. - Most access is via gtt, no intermediate cpu access. So the clflush should be a no-op (for the hw). > The code to poke around in is drivers/char/agp/intel-agp.c. In particular, > intel_i8xx_tlbflush(). But there does some to be some other odd discrepancies > between gen2 and later. My next step is to check whether this gtt writes end up someplace else (i.e. most likely on the agp scratch page). If they do, we have mixed up gtt entries somewhere, if they don't the problem is definitely somewhere else. Created attachment 33751 [details] [review] Flush the GTT by disabling/enabling it. The Broadwater errata notes that PTE entries that have been prefetched are not correctly invalidated when the GTT is updated. 
It goes on to note that in these situations: (1) don't do that, (2) flush the GTT - but helpfully forgets to mention how to actually enact the flush. Instead, zap the GTT by disabling it and re-enabling it after every update. Fortunately, this does not appear to impact on throughput too much. Created attachment 33752 [details] Xorg 1.7.5 log with patched kernel (In reply to comment #44) > Created an attachment (id=33751) [details] > Flush the GTT by disabling/enabling it. > Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as always, however I noted no font glitches (apparently) and only a minor glitch (a couple of lines missing) on a desktop icon. Xorg lasted less than 2 minutes, and starting firefox most probably nuked it (In reply to comment #45) > Xorg lasted less than 2 minutes, and starting firefox most probably nuked it *sigh* it was doing so well here, surviving the x11perf test using both CPU and GTT mappings. Did you manage to grab an i915_error_state, so that we can see what manner of corruption remains? (In reply to comment #45) > Created an attachment (id=33752) [details] > Xorg 1.7.5 log with patched kernel > > Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as > always, however I noted no font glitches (apparently) and only a minor glitch > (a couple of lines missing) on a desktop icon. With the patch on top of the latest intel-drm-next kernel you can grab <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful for Chris and Daniel. http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git Created attachment 33756 [details]
i915_error_state with GTT enable/disable patch
Can confirm said crash appearing reliably a few seconds into X. My chipset is also 82852/855GM (rev 02), so this should probably be relevant to the bug at hand.
(In reply to comment #47) > (In reply to comment #45) > > Created an attachment (id=33752) [details] [details] > > Xorg 1.7.5 log with patched kernel > > > > Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as > > always, however I noted no font glitches (apparently) and only a minor glitch > > (a couple of lines missing) on a desktop icon. > > With the patch on top of the latest intel-drm-next kernel you can grab > <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful > for Chris and Daniel. > http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git > It will take about 15 minutes to git-clone it, I had deleted it some days ago out of frustration; next I'll grab i915_error_state after the crash and attach it here. I have the same hardware as 2points so you can already look at his attachment 33756 [details]; however we'll later check that they talk about the same bug, as double-check. As usual, something is strange in that dump. The strange part is that it looks perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The odd part is that the last loaded instruction (IPEHR) corresponds to several instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently grabbing the next QWord from) - presuming what the CPU read back is consistent with what is being read by the GPU. Hmm. (In reply to comment #50) > As usual, something is strange in that dump. The strange part is that it looks > perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The > odd part is that the last loaded instruction (IPEHR) corresponds to several > instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently > grabbing the next QWord from) - presuming what the CPU read back is consistent > with what is being read by the GPU. Hmm. > Maybe this is another bug (evil twin bug 24789)? That needs another patch? 
My laptop has one of the early intel centrino (single-core of course) CPU (1.6 Ghz), with the wicked i8042 controller, but I wouldn't infer that it is on the CPU side either... Created attachment 33760 [details] dri debugfs dumps for i855GM + Xorg.0.log OK, I built drm-intel with Chris' patch and rebooted; first time I forgot "nomodeset" active and it *detonated* back to boot screen (it must be able to successfully trigger some kernel/CPU/BIOS failsafe reboot, some way), this must be an unique feature of most recent 2.6.33 kernels. Now back on topic: I successfully started Xorg, and it looked great (like if everything was fixed, but more probably I didn't have time to find any glitch), however when I opened some directory windows (XFCE) it crashed. The mouse cursor was changing when I tried to drag a window, so the underlying system was still breathing (it always happens with bug 24789). I popped up the VT where Xorg was started and hit Ctrl+C it since crash was already lasting for 3-4 seconds, and the characterizing I/O error lines were already printed. In the attachment you can find /sys/kernel/debugfs/dri/{1,64}; does anybody know if it is normal to have entries 1 and 64? With the old intel driver I have only dri/1 I have uploaded the compiled drm-intel (kernel, initramfs and modules) with Chris' patch: http://www.iragan.com/linux/i855GM/ you can find it under the *kernel directory. This was built with my .config so might not work properly on all laptops, but surely should boot @scottandchrystie: please also follow this bug since they are tied (mutually exclusive but probably caused by same hardware glitch). I have compiled and uploaded the drm-intel-next kernel already @legolas558 - I booted the kernel you provided, and it booted, but no better than the one I compiled myself with the lid and big hammer patches. Still froze after several minutes of flipping between VT's and running graphics-intensive programs (inkscape, tuxpaint, etc). 
My hardware is: 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) I'm afraid I'm not at the level of understanding the level of the conversation here, but I'm willing to compile and test kernels on my hardware :) I'm looking up now how to collect the info you need with the intel-gpu-tools. Let me know what I can do to help. Scott (In reply to comment #55) > @legolas558 - I booted the kernel you provided, and it booted, but no better > than the one I compiled myself with the lid and big hammer patches. Still froze > after several minutes of flipping between VT's and running graphics-intensive > programs (inkscape, tuxpaint, etc). That was drm-intel-next with Chris' patch; in my case it doesn't even last 1 minute, while the kernel patched with the big hammer lasts for longer. > My hardware is: > 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE > Chipset Integrated Graphics Device (rev 01) > Mine is: 00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) So we can infer that Chris' patch works (a bit) with rev01 but not with rev02. Thanks for taking time to test this. Created attachment 33835 [details] Debug logs from unpatched drm-intel-next kernel freeze I compiled the drm-intel-next kernel from git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git (I hope that's correct for drm-intel-next). I didn't use any patches initally for testing. Also, I would have been unable to apply Chris Wilson's msleep patch from comment #14, as the code has been changed and I couldn't see where it should go. The kernel booted and ran fine for about 15 minutes under load from running XFCE with movie trailers, switching VT's and using tuxpaint, all of which are typically what will crash it the quickest. I've enclosed dmesg output as well as the contents of /sys/kernel/debug/dri/0 -- hopefully that's all you need for now. 
I didn't patch with the big-hammer patch because the last kernels I tried with that patch didn't seem to improve things much, but I can do that and provide the logs if it would help. Let me know what I can test next. Thanks! Scott Created attachment 34016 [details] [review] Hack that prevents freezing At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is this command being sent out properly? Obviously, this patch causes fairly garbled graphics. Still stable after 9 hours of "x11perf -range dot,comppixwin500 -time 1 -repeat 1". It seems i830_uxa_solid plays a role in the freezing. In reply to comment #58) > At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a > patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is > this command being sent out properly? > > Obviously, this patch causes fairly garbled graphics. > I am using a freedesktop git development stack, created with this script: http://bit.ly/b2sJVO I have applied your patch before building xf86-video-intel and recompiled my drm-intel kernel to be modular (CONFIG_DRM=m,CONFIG_DRM_i915=m and also CONFIG_FB=m) so that the freedesktop compiled modules can be used instead. However, I can't boot with drm being modular! It simply shows a black screen (tuned OFF) and there's no way to put this into a working console... (I can login and send commands but I am blind without a screen) Does anybody have any hint? I have never been able to boot with KMS enabled and drm modular; the mkinitcpio configuration facilities (for customization of the initramfs) of this Arch Linux box do not seem to do anything. Created attachment 34194 [details] [review] (hopefullyy) fix gtt cache coherency This patch seems to fix any gtt related cache coherency problems, at least for my i855GM. 
It's quite large, but that's just because I first needed to clean up intel-agp.c before I could see clearly what's going on. This patch is also not yet polished, so don't look at it and expect beauty ;)
It contains a totally paranoid cache coherency checker with the absolutely minimal set of memory barriers and cache flushes before/after the chipset flush. If it detects any inconsistency, it prints a warning + backtrace in the dmesg, ratelimited to one warning per direction per minute.
Don't use this cache coherency checker on unpatched kernels. On my i855GM, cpu->gtt transfers fail with > 1% chance and gtt->cpu transfers fail with > 50% chance on unpatched kernels, so it'll only spam your dmesg.
This patch also includes the unbind-inactive-objects patch to really thrash the gtt stuff. It also trashes performance, so expect a sluggish feel when testing this.
The patch also prints out the number of completed chipset flushes at regular intervals. If you test this, wait until at least 1 million chipset flushes have been done (or a chipset flush has failed) before declaring that it works. Even better is 10 million. On my i855 that's about two hours of glxgears & openarena.
Suspend/resume is only lightly tested. It might break the cache coherency checker (but should not).
In case this patch doesn't work and you get backtraces about failed flushes, please attach your full dmesg. If it works, please report on what hw (lspci -nn) and how many chipset flushes (more than 10M would be great) have been done.
The patch is against -rc1 but should apply to latest drm-intel, too. Don't use any other patches when testing this.
Thanks, Daniel
*** Bug 23032 has been marked as a duplicate of this bug. ***
Created attachment 34220 [details]
dmesg with gtt cache coherency patch
Nearly got to one million flushes before X froze. Noticed a few warnings about failed flushes before, but apparently to no critical effect (yet).
> --- Comment #63 from 2points@gmx.org 2010-03-18 14:58:51 PST ---
> Created an attachment (id=34220)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34220)
> dmesg with gtt cache coherency patch
>
> Nearly got to one million flushes before X froze. Noticed a few warnings about
> failed flushes before, but apparently to no critical effect (yet).
Thanks for testing this. Looks like it doesn't work as advertised, given
that you have an i855GM, too. I need to get back to the drawing board.
> Thanks for testing this. Looks like it doesn't work as advertised, given
> that you have a i855GM, too. I need to get back to the drawing board.
Would you still like the patch to be tested on other hardware, or should we consider it obsolete now?
> --- Comment #65 from Geir Ove Myhr <gomyhr@gmail.com> 2010-03-18 15:38:34 PST --- > > Thanks for testing this. Looks like it doesn't work as advertised, given > > that you have a i855GM, too. I need to get back to the drawing board. > > Would you still like the patch to be tested on other hardware, or should we > consider it obsolete now? Well, Chris tested it on his i845 and it had no effect there. If you have something else around and some time to waste, testing wouldn't hurt. Perhaps my cache coherency checker uncovers some other stuff. Anyway, I have the feeling that the i845 and the i855GM bugs are two different things, so I've created a new bug to keep track of my crusade to fix the i855 here: bug # 27187 So if you test, please report your findings there. Created attachment 34233 [details] section of dmesg with GTT flush failures (In reply to comment #61) > Created an attachment (id=34194) [details] > (hopefullyy) fix gtt cache coherency > > This patch seems to fix any gtt related cache coherency problems, at least for > my i855GM. > Hi Daniel, I finally took time to test this. I also have your hardware and with this patch I finally can use Xorg 1.7.5 and the modern intel driver! So this patch obsoletes the DRM big hammer patch that I was previously using and that gave me about 5 minutes of working Xorg. With your patch Xorg can be used for long time (more than 1 hour now and not crashed yet), but I have seen GTT flush failures in dmesg, please see attachment. Did you see that? Flush failures at flush number 16384 and 32768! Am I just lucky or is there a reason for such numbers being powers of two? So I will be using your patch since even with some failures it doesn't crash Xorg as the linus/drm-intel trees do; I'd even propose it for submission, because it makes the hardware usable! 
Thanks It finally crashed when playing a video, but I think this is a totally separate bug, perhaps related to the hangcheck timer; by the way, can somebody check if bug 26723 is duplicate of bug 24789 or of bug 26345 (this one)? I am asking because with this patch I get: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung after the crash triggered by playing a video. Also I can see a pattern in the failed flushes: ~$ dmesg|grep -F "flush no" [ 30.890838] chipset flush no. 0 [ 245.758686] chipset flush no. 16384 [ 403.685617] chipset flush no. 32768 [ 594.005756] chipset flush no. 49152 Do we have a crazy carry bit somewhere? > --- Comment #68 from legolas558 <legolas558@email.it> 2010-03-19 04:38:05 PST ---
> Also I can see a pattern in the failed flushes:
>
> ~$ dmesg|grep -F "flush no"
> [ 30.890838] chipset flush no. 0
> [ 245.758686] chipset flush no. 16384
> [ 403.685617] chipset flush no. 32768
> [ 594.005756] chipset flush no. 49152
>
> Do we have a crazy carry bit somewhere?
Nothing crazy is going on. The patch just prints out the running number of
chipset flushes once every 16*1024 flushes, to show how reliably the thing
works. The cache coherency problems caught by my checker print out
"chipset flushed failed". So you have to count these (plus add in the ones
suppressed by the ratelimiting code; watch out for "xx callbacks suppressed"
in your dmesg). Then divide them by the number of flushes (as printed in
your dmesg snippet above) and you have a ballpark figure for how reliable
the chipset flushing is. Obviously anything bigger than zero is
unacceptable.
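The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any script attached to this bug: the function name `flush_failure_ratio` is hypothetical, the failure message is matched loosely as "chipset flush failed" or "chipset flushed failed" because the exact wording printed by the patch is not shown in this thread, and every "N callbacks suppressed" line is attributed to the checker, which overcounts if other ratelimited messages are present.

```python
import re

# Periodic progress line shown in this thread: "chipset flush no. 16384"
FLUSH_COUNT = re.compile(r"chipset flush no\.\s*(\d+)")
# Assumption: exact failure wording is not shown in the thread, so match loosely.
FLUSH_FAILED = re.compile(r"chipset flush(?:ed)? failed")
# Ratelimit notice; accept both spellings of "suppressed".
SUPPRESSED = re.compile(r"(\d+) callbacks supp?ressed")


def flush_failure_ratio(dmesg_text):
    """Return (failures, flushes) estimated from dmesg output.

    failures = failure lines seen + warnings swallowed by ratelimiting
    flushes  = highest "chipset flush no." value printed so far
    """
    failures = 0
    flushes = 0
    for line in dmesg_text.splitlines():
        match = FLUSH_COUNT.search(line)
        if match:
            flushes = max(flushes, int(match.group(1)))
            continue
        if FLUSH_FAILED.search(line):
            failures += 1
            continue
        match = SUPPRESSED.search(line)
        if match:
            # Assumption: the suppressed callbacks all belong to the checker.
            failures += int(match.group(1))
    return failures, flushes
```

Feeding it the saved output of dmesg gives the two numbers to divide. Because the flush count is only printed every 16*1024 flushes, the denominator can lag the true count by up to 16383, so the result is a ballpark figure at best.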
Created attachment 34239 [details]
chipset flushing quality script
(In reply to comment #69)
> > --- Comment #68 from legolas558 <legolas558@email.it> 2010-03-19 04:38:05 PST ---
> > Also I can see a pattern in the failed flushes:
> >
> > ~$ dmesg|grep -F "flush no"
> > [ 30.890838] chipset flush no. 0
> > [ 245.758686] chipset flush no. 16384
> > [ 403.685617] chipset flush no. 32768
> > [ 594.005756] chipset flush no. 49152
> >
> > Do we have a crazy carry bit somewhere?
>
> Nothing crazy is going on. This just prints out the number of chipset
> flushes done every 16*1024 flushes. This is just to know how reliable the
> thing works. The cache coherency problems caught by my checker print out
> "chipset flushed failed". So you have to count these (plus add in the ones
> supressed by the ratelimiting code, watch out for "xx callbacks supressed"
> in your demsg). Then divided them by the number of flushes (as printed in
> your dmesg snippet above) and you have a ballbark figure for how reliable
> the chipset flushing is. Obviously anything bigger than zero is
> unacceptable.
>
Thanks Daniel for explaining this; I was confused by the fact that "chipset flush no." was printed after the crash trace dumps. Using your formula (attached script), my chipset flushing quality ratio is 38/212992; the failure count seems to grow linearly and does not appear related to the specific software running (it possibly depends on CPU load only).
Created attachment 34240 [details]
chipset flushing quality script (revised)
Errata: my ratio is 98/229376
(In reply to comment #58)
> Created an attachment (id=34016) [details]
> Hack that prevents freezing
>
> At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a
> patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is
> this command being sent out properly?
Gah, missed this. Sorry Brian. Yes, it does seem that we could emit a solid fill that exceeds the surface bounds. Not quite sure what is generating such nonsense, but it will at least be resolved by:
commit 0c47195ca805881e3fbd5b9224be5c930feeeb8c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Mar 24 17:37:39 2010 +0000
i830: Clip solid fills to surface.
There is a reasonable surfeit of evidence to support this error, for instance: http://bugs.freedesktop.org/attachment.cgi?id=34417
Created attachment 34429 [details]
Chipset flushing quality script for freedesktop bug 26345
Fixed the script to show the failure ratio correctly. I am now using patch v5: no failures within the first 32768 flushes. I will report further once a large number of flushes have been done.
Created attachment 34435 [details]
Script to check chipset flushing quality
Created attachment 34436 [details]
/sys/kernel/debug/dri after the GTT flush failures
Created attachment 34437 [details] dmesg after 7 GTT flush failures I got 7 failures within the first 300k flushes; I have attached dmesg and debugfs dri dumps. Failures seem harder to trigger now. I had to open 4 glxgears and one xeyes to trigger them. I am using latest drm-intel kernel with patch in attachment 34377 [details] [review] I am using the stock packages from Arch Linux (xorg-server 1.7.5-902, xf86-video-intel 2.10.0, intel-dri 7.7, libdrm-git). It still crashes with the hangcheck bug when playing videos, but very rarely now. Created attachment 34499 [details]
gttqual script to check GTT flushing quality
Tried Daniel's patch from comment #61 with drm-intel-next. Didn't work on my VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01). In fact it was a little worse than the stock drm-intel-kernel as far as running time until freezing (using dwm -- time running xfce was about the same: mere seconds) Scott @Scott: in my case it lasts some more time, but I have recently experienced sudden crashes...so D.Vetter's patch might be the way to go but Intel has clearly not released a good open source driver for these devices since the beginning (Xorg 1.6 and the old driver (non-KMS) are definitively usable, at least). *** Bug 25091 has been marked as a duplicate of this bug. *** *** Bug 26229 has been marked as a duplicate of this bug. *** *** Bug 26723 has been marked as a duplicate of this bug. *** hi there, I'm also bitten by occasional x freezes on a i855GM rev 02 and I am following the respective bug reports. I think a lot has been tested already, but in my case, I get those freezes *only* in a dual-head setup using LVDS and VGA together. I *never* had a freeze using only the LVDS on my laptop, linux 2.6.33 + libdrm 2.4.18-3 + xserver-xorg-video-intel 2.10.903 (all on debian) seems reasonably stable in this case. hth, ben Yesterday (and today again), I got a slightly different error message in 'dmesg' than what I'm used to - only the three first lines are the usual ones (see below). This "new" error does not appear everytime. I'm using a rather new git version of xf86-video-intel and libdrm (I couldn't tell you which commit exactly is installed, but it's no more than 2 weeks old for libdrm, and I think no more than 4 days old for xf86-video-intel). Maybe this error is just a side-effect of me using incompatible versions of everything (i.e. libdrm and xf86-video-intel are the latest available, while the rest of my X.org install is what is available on Gentoo/Portage), but I hope not. 
Here is the error message in dmesg:
------------------------------
kernel: [ 2067.426012] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
kernel: [ 2067.426024] render error detected, EIR: 0x00000000
kernel: [ 2067.426598] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 97808 at 97807)
kernel: [ 2067.426608] [drm:i915_gem_do_execbuffer] *ERROR* Failed to pin buffer 1 of 2, total 4210688 bytes: -5
kernel: [ 2067.426614] [drm:i915_gem_do_execbuffer] *ERROR* 356 objects [4 pinned], 37957632 object bytes [7487488 pinned], 20197376/117571584 gtt bytes
kernel: [ 2067.426741] ------------[ cut here ]------------
kernel: [ 2067.426753] WARNING: at drivers/gpu/drm/i915/i915_gem_tiling.c:490 i915_gem_set_tiling+0x164/0x1b5()
kernel: [ 2067.426756] Hardware name: 830342G
kernel: [ 2067.426758] failed to reset object for tiling switch
kernel: [ 2067.426761] Modules linked in: nls_iso8859_15 nls_cp850 vfat fat nls_utf8 ntfs floppy
kernel: [ 2067.426774] Pid: 4153, comm: X Not tainted 2.6.33-gentoo #4
kernel: [ 2067.426777] Call Trace:
kernel: [ 2067.426787] [<c101fb13>] warn_slowpath_common+0x60/0x90
kernel: [ 2067.426792] [<c101fb77>] warn_slowpath_fmt+0x24/0x27
kernel: [ 2067.426796] [<c118a6c6>] i915_gem_set_tiling+0x164/0x1b5
kernel: [ 2067.426803] [<c1171e96>] ? drm_ioctl+0x0/0x2b8
kernel: [ 2067.426807] [<c11720bb>] drm_ioctl+0x225/0x2b8
kernel: [ 2067.426811] [<c118a562>] ? i915_gem_set_tiling+0x0/0x1b5
kernel: [ 2067.426817] [<c1049104>] ? generic_file_aio_write+0x7e/0x93
kernel: [ 2067.426823] [<c10567b3>] ? do_wp_page+0x5a3/0x63e
kernel: [ 2067.426827] [<c1171e96>] ? drm_ioctl+0x0/0x2b8
kernel: [ 2067.426832] [<c1071119>] vfs_ioctl+0x19/0x51
kernel: [ 2067.426836] [<c1071620>] do_vfs_ioctl+0x43a/0x46c
kernel: [ 2067.426841] [<c10578b1>] ? handle_mm_fault+0x59a/0x619
kernel: [ 2067.426849] [<c10340d2>] ? ktime_get_ts+0xd0/0xda
kernel: [ 2067.426853] [<c107167e>] sys_ioctl+0x2c/0x45
kernel: [ 2067.426858] [<c1002790>] sysenter_do_call+0x12/0x26
kernel: [ 2067.426862] ---[ end trace 445f83ad84043481 ]---
------------------------------
Created attachment 34663 [details]
i915_error_state from drm-intel-next kernel
I'm not sure what kind of additional information is useful at this point (as opposed to for bug # 27187). Here is another i915_error_state from drm-intel-next. This time it does not hang in XY_COLOR_BLT. IPEHR (0x60) does not match the instruction header before HEAD, so I suppose this is a real CPU/GPU incoherency. Relevant part of intel_error_decode output:
IPEHR: 0x00000060
ACTHD: 0x0567e034
seqno: 0x00094b09
Buffers [7]:
0567e000 16384 00000009 00000000 00094b0a dirty purgeable
batchbuffer at 0x0567e000:
...
0x0567e020: 0x7d980000: 3DSTATE_DEFAULT_Z
0x0567e024: 0x00000000: dword 1
0x0567e028: 0x7d890002: 3DSTATE_FOG_MODE
0x0567e02c: 0x89800000: dword 1
0x0567e030: 0x00000000: dword 2
0x0567e034: HEAD 0x00000000: dword 3
0x0567e038: 0x7c281088: 3DSTATE_MAP_TEX_STREAM_I830
This one is from Ivailo Stoyanov at Ubuntu bug report https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/17
Created attachment 34664 [details] [review]
Graphics-breaking workaround: skip i830_uxa_solid
At least with x11perf, I could not reproduce the hang with this patch. We should figure out if there are cases where the GPU can still hang if XY_COLOR_BLT is never called. That most recent dump appears to point to a different operation, but in my testing, not all of my dumps implicated XY_COLOR_BLT and yet eliminating that prevented all hangs as far as I could tell. If it never hangs without using XY_COLOR_BLT, perhaps we could find a substitute. If it can hang on other operations, we could eliminate them one-by-one to make a list of the problematic opcodes and see what they might have in common.
(In reply to comment #86) > Created an attachment (id=34664) [details] > Graphics-breaking workaround: skip i830_uxa_solid Brian, are you saying that commenting out most of i830_uxa_solid still works better, even after the clip solids commit from comment # 72? I haven't had a chance to test on the affected machine lately, so I don't know if that patch fixed the x11perf hangs. However, I didn't see invalid bounds in any of the dumps, so I don't see how it could. I'll test x11perf again next chance I get, though. Created attachment 34676 [details] output of intel gpu dump after a crash. The attached file is the output of intel_gpu_dump after a crash obtained while using the patch in Comment #86. I'm sorry I have no idea what the relevant part can be, so I just upload it all. See below for a small excerpt, though (containing the lines which mention 'HEAD' and 'TAIL'). Also I must say that the patch has some side effects : some texts (mostly Gnome menus or texts in gnome applets) sometimes don't appear on the screen and eventually appear if I roll over them with the mouse, or highlight them in some way. I guess these were expected... if not, I can take screenshots. 
Preview of the attached file:
ACTHD: 0x0686c000
EIR: 0x00000000
EMR: 0xffffff69
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x18000001
IPEIR: 0x00000000
INSTDONE: 0x01ffffc1
(7840 lines not shown)
0x00007a5c: 0x0686c001: MI UNKNOWN
0x00007a60: 0x0686c01c: MI UNKNOWN
0x00007a64: HEAD 0x00000000: MI_NOOP
0x00007a68: 0x02000004: MI_FLUSH
0x00007a6c: 0x00000000: MI_NOOP
0x00007a70: 0x10800001: MI_STORE_DATA_INDEX
0x00007a74: 0x00000080: dword 1
0x00007a78: 0x000ddabf: dword 2
0x00007a7c: 0x01000000: MI_USER_INTERRUPT
0x00007a80: 0x02000000: MI_FLUSH
0x00007a84: 0x00000000: MI_NOOP
0x00007a88: 0x10800001: MI_STORE_DATA_INDEX
0x00007a8c: 0x00000080: dword 1
0x00007a90: 0x000ddac0: dword 2
0x00007a94: 0x01000000: MI_USER_INTERRUPT
0x00007a98: TAIL 0x02000004: MI_FLUSH
0x00007a9c: 0x00000000: MI_NOOP
0x00007aa0: 0x18000001: MI UNKNOWN
(24921 more lines not shown)
Thanks for testing. I just wanted to make sure there were other graphics operations that could hang the GPU. Turns out there are. Yeah, that patch messes up graphics, because I'm ignoring all requests to fill a rectangular region with a solid color, which is needed to clear off a pixmap before drawing on it.
I tested the patches from bug 27187 on my i845, with no success. Debug logs are at https://bugs.freedesktop.org/show_bug.cgi?id=27187#c84 (hardware and software versions in comment #82). Let me know if you need something else tested.
Scott
I tried to save the output of intel_gpu_dump for the last few crashes that happened to me.
I don't know how to interpret these, but I thought I would share this with you:
# grep IPEHR *
intelgpudump-2010-04-05_17:23:34:IPEHR: 0x18000001
intelgpudump-2010-04-06_12:25:37:IPEHR: 0x41500000
intelgpudump-2010-04-06_13:24:52:IPEHR: 0x18000001
intelgpudump-2010-04-06_14:16:10:IPEHR: 0x41600000
intelgpudump-2010-04-06_14:32:29:IPEHR: 0x05000000
intelgpudump-2010-04-06_16:54:39:IPEHR: 0x18000001
intelgpudump-2010-04-07_13:29:26:IPEHR: 0x54300004
intelgpudump-2010-04-07_13:48:38:IPEHR: 0x04caf6e4
intelgpudump-2010-04-09_11:07:47:IPEHR: 0x18000001
intelgpudump-2010-04-09_11:58:45:IPEHR: 0x18000001
intelgpudump-2010-04-09_14:46:27:IPEHR: 0x18000001
intelgpudump-2010-04-09_16:34:45:IPEHR: 0x05000000
intelgpudump-2010-04-10_14:16:59:IPEHR: 0x0a103078
intelgpudump-2010-04-10_15:34:04:IPEHR: 0x00000000
(the date/time is when the dump was taken, which obviously is a few seconds/minutes after the crash happens)
From what I could understand from comment #10, the value of IPEHR could be almost anything, depending on (disk) activity. Am I experiencing distinct bugs? Is the value of IPEHR not linked to the crash? I have no idea.
Full logs (intel_gpu_dump output, usually also the dmesg output) are at http://dl.free.fr/v9nxyAGHx as a tar.bz2 file, in case they are of any interest.
I get similar errors to theonewiththeevillook@yahoo.fr when running a youtube video fullscreen. This is deterministic: it happens always and right after pressing the full screen button.
1. Dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 473836 at 473833)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
2. Xorg.log:
[ 44472.430] (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
[ 44472.504] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 44472.721] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
[ 44473.220] (WW) intel(0): i830_uxa_prepare_access: gtt bo map failed: Input/output error
3. GPU dump:
ACTHD: 0x0f815b74
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000000
PGTBL_ER: 0x00000000
IPEHR: 0x02000004
IPEIR: 0x00000000
INSTDONE: 0x03311081
busy: GMBUS
busy: MPEG
busy: MECO
busy: CC
busy: DG
busy: DCMP
busy: IT
busy: MG
busy: MEC
busy: QCC
busy: TB
busy: WM
busy: EF
busy: Map L2 cache
busy: Secondary ring 3
busy: Secondary ring 2
busy: Secondary ring 1
busy: Secondary ring 0
busy: Primary ring 1
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000: 0x10800001: MI_STORE_DATA_INDEX
0x00000004: 0x00000080: dword 1
0x00000008: 0x0007309c: dword 2
0x0000000c: 0x01000000: MI_USER_INTERRUPT
0x00000010: 0x02000001: MI_FLUSH
0x00000014: 0x00000000: MI_NOOP
0x00000018: 0x18800080: MI_BATCH_BUFFER_START
0x0000001c: 0x023e7001: dword 1
0x00000020: 0x02000004: MI_FLUSH
0x00000024: 0x00000000: MI_NOOP
0x00000028: 0x10800001: MI_STORE_DATA_INDEX
0x0000002c: 0x00000080: dword 1
0x00000030: 0x0007309d: dword 2
0x00000034: 0x01000000: MI_USER_INTERRUPT
0x00000038: 0x02000005: MI_FLUSH
0x0000003c: 0x00000000: MI_NOOP
...
4. Software:
kernel: 2.3.34-rc3 + Daniel's patches from bug 27187
xserver: 1.8.0
libdrm: 2.4.20
xf86-video-intel: 2.11.0
I have been testing different kernels under Lucid on a desktop computer equipped with a 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) with KMS and dri enabled. Analysing the underlying code is beyond my capabilities, nevertheless I would like to help.
1) The standard Lucid kernel leads to a crash usually within the first 10 min of usual usage (Openoffice, Firefox).
2) drm-intel-next kernels as provided by http://kernel.ubuntu.com/~kernel-ppa/mainline/ or https://launchpad.net/~brian-rogers/+archive/experimental, the latter with Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel usually crashes after 1 to 3h, the later ones do not survive the very first mode switch, even before Xorg is started, regardless of whether V9 is applied. This kind of crash does not allow for ssh or a graceful reboot.
3) The most interesting kernel seems to me to be Daniel Baumann's ( https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with the V8 patch applied. Usually it crashes within the first 3h of usage. But now I have hit a >20h period and cannot kill it: not by normal usage, running a GL screensaver, switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting to the standard Lucid xserver-xorg-video-intel and libdrm with Xorg stopped, GL screensaver again ..., passing the x11perf test several times, ....
/sys/kernel/debug/dri/0/i915_error_state, however, shows
Time: 1274367670 s 906969 us
EIR: 0x00000010
PGTBL_ER: 0x00000049
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x01000000
INSTDONE: 0x00ffffc0
ACTHD: 0x00000048
That one seems to happen always when the computer starts.
Given the randomness of the time after which the crashes/hangs happen and the extent to which different i8xx based machines are affected (e.g. my i855 based laptop runs perfectly with a V8 patched kernel), it is difficult to interpret the above findings on the 82845G. What is not random, I think, are the early deaths with later drm-intel-next kernels (2), and also the times to crash of (3) do not look Gaussian distributed. The comparison of (1) and (3) seems to indicate that the V8 patch fixes one kind of mishap also for 82845G based hardware.
(In reply to comment #94) > 2) drm-intel-next kernels as provided by > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel > crashes usually after 1 to 3h the later ones do not survive the very first mode > switch even before Xorg is started, regardless V9 applied or not. This kind of > crash does not allow for ssh or gracefully rebooting. Are you sure that V9 is correctly applied? That early crash has always been an indicator of missing Daniel Vetter's patch (at least up to now). > 3) The most interesting kernel seems to me Daniel Baumann's ( > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8 > patch applied. Usually it crashes within the first 3h of usage. But now I have > hit a >20h period and cannot kill it, by normal usage, running GL screensaver, > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL > screensaver again ..., passing the x11perf test several times, .... > /sys/kernel/debug/dri/0/i915_error_state, however shows > Time: 1274367670 s 906969 us > EIR: 0x00000010 > PGTBL_ER: 0x00000049 > INSTPM: 0x00000000 > IPEIR: 0x00000000 > IPEHR: 0x01000000 > INSTDONE: 0x00ffffc0 > ACTHD: 0x00000048 > That one seems to happen always when the computer starts. > You mean that no crashes are found in dmesg during these >20h sessions? Do you have the logs to check it out? > Given the randomness of the time after which the crashes/stucks happen and to > which extent different i8xx based machines are affected (e.g. my i855 based > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret > above findings on the 82845G. 
What is not random, I think, are the early deaths > with later drm-intel-next kernels (2), and also the times to crash of (3) does > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate > that the V8 patch fixes one kind of mishap also for 82845G based hardware. Have you considered variability of the rest of code (drm-intel-next, Xorg, drivers/libraries), if there is any in your tests? On my 855GM (rev2) I have reached a very stable situation, except that overlays (created by VLC or mplayer) do always crash the machine in a fairly short amount of time. I have also some background corruption but that might be a new bug in libdrm or in something else. Can you confirm that the crashes you experienced were someway connected to video playing? (In reply to comment #95) > (In reply to comment #94) > > 2) drm-intel-next kernels as provided by > > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or > > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with > > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel > > crashes usually after 1 to 3h the later ones do not survive the very first mode > > switch even before Xorg is started, regardless V9 applied or not. This kind of > > crash does not allow for ssh or gracefully rebooting. > Are you sure that V9 is correctly applied? That early crash has always been an > indicator of missing Daniel Vetter's patch (at least up to now). It is the kernel which Brian Rogers has build and announced in https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all comment #105, I somehow trust him. I also tried to patch the provided sources with V9 to find that it had been applied already. > > > 3) The most interesting kernel seems to me Daniel Baumann's ( > > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8 > > patch applied. Usually it crashes within the first 3h of usage. 
But now I have > > hit a >20h period and cannot kill it, by normal usage, running GL screensaver, > > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting > > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL > > screensaver again ..., passing the x11perf test several times, .... > > /sys/kernel/debug/dri/0/i915_error_state, however shows > > Time: 1274367670 s 906969 us > > EIR: 0x00000010 > > PGTBL_ER: 0x00000049 > > INSTPM: 0x00000000 > > IPEIR: 0x00000000 > > IPEHR: 0x01000000 > > INSTDONE: 0x00ffffc0 > > ACTHD: 0x00000048 > > That one seems to happen always when the computer starts. > > > You mean that no crashes are found in dmesg during these >20h sessions? Do you > have the logs to check it out? First of all I wish to emphsize that it is one lucky strike which I have hit. More frequently this kernel gets stuck as well. Therefore I left the machine running the screensaver and it is still doing well. In dmesg is a single drm related error during startup: [ 23.288050] render error detected, EIR: 0x00000010 [ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking [ 23.288085] render error detected, EIR: 0x00000010 now the last line is: [54005.830290] svc: failed to register lockdv1 RPC service (errno 97). > > > Given the randomness of the time after which the crashes/stucks happen and to > > which extent different i8xx based machines are affected (e.g. my i855 based > > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret > > above findings on the 82845G. What is not random, I think, are the early deaths > > with later drm-intel-next kernels (2), and also the times to crash of (3) does > > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate > > that the V8 patch fixes one kind of mishap also for 82845G based hardware. 
> Have you considered variability of the rest of code (drm-intel-next, Xorg, > drivers/libraries), if there is any in your tests? > > On my 855GM (rev2) I have reached a very stable situation, except that overlays > (created by VLC or mplayer) do always crash the machine in a fairly short > amount of time. I have also some background corruption but that might be a new > bug in libdrm or in something else. > > Can you confirm that the crashes you experienced were someway connected to > video playing? On my i855 based laptop I have tested video only occasionally, not with mplayer or VLC and found no preferential crashing. (In reply to comment #96) > (In reply to comment #95) > > (In reply to comment #94) > > > 2) drm-intel-next kernels as provided by > > > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or > > > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with > > > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel > > > crashes usually after 1 to 3h the later ones do not survive the very first mode > > > switch even before Xorg is started, regardless V9 applied or not. This kind of > > > crash does not allow for ssh or gracefully rebooting. > > Are you sure that V9 is correctly applied? That early crash has always been an > > indicator of missing Daniel Vetter's patch (at least up to now). > It is the kernel which Brian Rogers has build and announced in > https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all > comment #105, I somehow trust him. I also tried to patch the provided sources > with V9 to find that it had been applied already. > I see - can you pick the dump generation script (a small daemon) in attachment 34922 [details] to hopefully get snapshots right before and after the total crash? It has worked for me also without keyboard or ssh, and perhaps one of those dumps might contain information about what's happening there. 
> > > > > 3) The most interesting kernel seems to me Daniel Baumann's ( > > > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8 > > > patch applied. Usually it crashes within the first 3h of usage. But now I have > > > hit a >20h period and cannot kill it, by normal usage, running GL screensaver, > > > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting > > > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL > > > screensaver again ..., passing the x11perf test several times, .... > > > /sys/kernel/debug/dri/0/i915_error_state, however shows > > > Time: 1274367670 s 906969 us > > > EIR: 0x00000010 > > > PGTBL_ER: 0x00000049 > > > INSTPM: 0x00000000 > > > IPEIR: 0x00000000 > > > IPEHR: 0x01000000 > > > INSTDONE: 0x00ffffc0 > > > ACTHD: 0x00000048 > > > That one seems to happen always when the computer starts. > > > > > You mean that no crashes are found in dmesg during these >20h sessions? Do you > > have the logs to check it out? > First of all I wish to emphsize that it is one lucky strike which I have hit. > More frequently this kernel gets stuck as well. Therefore I left the machine > running the screensaver and it is still doing well. In dmesg is a single drm > related error during startup: > [ 23.288050] render error detected, EIR: 0x00000010 > [ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking > [ 23.288085] render error detected, EIR: 0x00000010 > now the last line is: > [54005.830290] svc: failed to register lockdv1 RPC service (errno 97). > svc is not related. In my case (drm-intel-next with patched v9) it never crashes unless I pick some video or some screen-intensive wine application. Perhaps your configuration is not as stable because you are using a compositing window manager? > > > > > Given the randomness of the time after which the crashes/stucks happen and to > > > which extent different i8xx based machines are affected (e.g. 
my i855 based > > > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret > > > above findings on the 82845G. What is not random, I think, are the early deaths > > > with later drm-intel-next kernels (2), and also the times to crash of (3) does > > > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate > > > that the V8 patch fixes one kind of mishap also for 82845G based hardware. > > Have you considered variability of the rest of code (drm-intel-next, Xorg, > > drivers/libraries), if there is any in your tests? > > > > On my 855GM (rev2) I have reached a very stable situation, except that overlays > > (created by VLC or mplayer) do always crash the machine in a fairly short > > amount of time. I have also some background corruption but that might be a new > > bug in libdrm or in something else. > > > > Can you confirm that the crashes you experienced were someway connected to > > video playing? > On my i855 based laptop I have tested video only occasionally, not with mplayer > or VLC and found no preferential crashing. I suspect that some video-intensive application or window manager could make it crash more frequently; in such case crashes like mine, happening only under specific stress, would be masked because the GPU would be under constant stress. Can this be your case e.g. video-intensive desktop environment or applications being used? (In reply to comment #97) > (In reply to comment #96) > > (In reply to comment #95) > > > (In reply to comment #94) snip > > > I see - can you pick the dump generation script (a small daemon) in attachment > 34922 [details] to hopefully get snapshots right before and after the total crash? It has > worked for me also without keyboard or ssh, and perhaps one of those dumps > might contain information about what's happening there. > Thanks a lot. I will try that soon, once I have access to the 82845G/GL based hardware again. 
Currently, I have only an 82852/855GM (rev 02) based laptop; I guess it's similar to yours, and it is rather well behaved with a V8 patched (see below) Lucid kernel.

> > > snip > > related error during startup: > > [ 23.288050] render error detected, EIR: 0x00000010 > > [ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking > > [ 23.288085] render error detected, EIR: 0x00000010 > > now the last line is: > > [54005.830290] svc: failed to register lockdv1 RPC service (errno 97). > > > svc is not related. > > In my case (drm-intel-next with patched v9) it never crashes unless I pick some > video or some screen-intensive wine application. Perhaps your configuration is > not as stable because you are using a compositing window manager? >

My 82845G/GL has much poorer performance than the 855GM and must be different also in other respects. I guess that the i845 (82845G/GL) hardware is affected by an additional shortcoming which is not covered by the V8/V9 patches.

> > > snip > > > Have you considered variability of the rest of code (drm-intel-next, Xorg, > > > drivers/libraries), if there is any in your tests? > > > > > > On my 855GM (rev2) I have reached a very stable situation, except that overlays > > > (created by VLC or mplayer) do always crash the machine in a fairly short > > > amount of time. I have also some background corruption but that might be a new > > > bug in libdrm or in something else. > > > > > > Can you confirm that the crashes you experienced were someway connected to > > > video playing? > > On my i855 based laptop I have tested video only occasionally, not with mplayer > > or VLC and found no preferential crashing. > I suspect that some video-intensive application or window manager could make it > crash more frequently; in such case crashes like mine, happening only under > specific stress, would be masked because the GPU would be under constant > stress. Can this be your case e.g.
video-intensive desktop environment or > applications being used? This is for the 82852/855GM now: I found that Daniel Baumann's kernel ( https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when confronted with Xv-overlay in a very similar way as you describe it. Stefan Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of crash. In his Changelog it says "* i915-kernel module includes patch to get Xv-overlay mode working again." . That must be something in addition to V8/V9 (I really have to study the sources). If I use the latter modules with Lucid's xserver-xorg-video-intel and libdrms video overlay in totem player only appears when I fiddle around resizing the window. If I use xserver-xorg-video-intel 2.11.0 and the libdrms 2.4.20 as provided by https://launchpad.net/~glasen/+archive/intel-driver video overlay works, for me better than ever before on the 855GM. But Compiz gives in when changing from fullscreen to normal view. I guess one will have to recompile Compiz and all its dependencies against the new driver. So I think for the 855GM the solution is really close, for the 82845G/GL not yet, I am afraid. (In reply to comment #98) > This is for the 82852/855GM now: I found that Daniel Baumann's kernel ( > https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when > confronted with Xv-overlay in a very similar way as you describe it. Stefan > Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German > only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of > crash. In his Changelog it says "* i915-kernel module includes patch to get > Xv-overlay mode working again." . That must be something in addition to V8/V9 > (I really have to study the sources). 
If I use the latter modules with Lucid's > xserver-xorg-video-intel and libdrms video overlay in totem player only appears > when I fiddle around resizing the window. If I use xserver-xorg-video-intel Where is such patch? I am begging anybody to bring it under my nose, because my Xorg crashes after a few seconds of video watching and it's becoming very stressing...and furthermore the system is often not recoverable by using VTs. Flash videos are less likely to trigger the crash, while fullscreen or maximized windows do it best. I can't find the changelog you are mentioning, where is it? Also looks like he is using only the patches from this bug tracker entry, since he only mentions it on his launchpad page. > 2.11.0 and the libdrms 2.4.20 as provided by > https://launchpad.net/~glasen/+archive/intel-driver video overlay works, for me > better than ever before on the 855GM. But Compiz gives in when changing from > fullscreen to normal view. I guess one will have to recompile Compiz and all > its dependencies against the new driver. > So I think for the 855GM the solution is really close, for the 82845G/GL not > yet, I am afraid. It probably has more quirks to be worked out - some hackwork is needed. "../../intel/intel_bufmgr_gem.c:901: Error setting to CPU domain 3: Input/output error" Just after "Failed to submit batchbuffer: Input/output error". Only gets written to the console. I'm getting this when I kill off X and try to start it with startx, consistently. Up to date Lucid (clean, recent reinstall). (II) intel(0): Integrated Graphics Chipset: Intel(R) 845G 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (In reply to comment #99) > (In reply to comment #98) > > This is for the 82852/855GM now: I found that Daniel Baumann's kernel ( > > https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when > > confronted with Xv-overlay in a very similar way as you describe it. 
Stefan > > Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German > > only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of > > crash. In his Changelog it says "* i915-kernel module includes patch to get > > Xv-overlay mode working again." . That must be something in addition to V8/V9 > > (I really have to study the sources). If I use the latter modules with Lucid's > > xserver-xorg-video-intel and libdrms video overlay in totem player only appears > > when I fiddle around resizing the window. If I use xserver-xorg-video-intel > > Where is such patch? I am begging anybody to bring it under my nose, because my > Xorg crashes after a few seconds of video watching and it's becoming very > stressing...and furthermore the system is often not recoverable by using VTs. > I do not know whether you are using an Ubuntu patched kernel. If so, my best bet would be that you need http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff A short account on the background is given in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15. Upstream kernels do not need that patch, e.g. the one from Brian Rogers mentioned up in comment #97, for reverting a previous patch. > Flash videos are less likely to trigger the crash, while fullscreen or > maximized windows do it best. > > I can't find the changelog you are mentioning, where is it? Also looks like he > is using only the patches from this bug tracker entry, since he only mentions > it on his launchpad page. Rather hidden in one of his German texts he mentions the xv fix. The changelog is the one in Stefan Glasenhardt's package /usr/share/doc/855gm-fix-dkms/changelog.gz when installed. (In reply to comment #102) > (In reply to comment #99) > > Where is such patch? 
I am begging anybody to bring it under my nose, because my > > Xorg crashes after a few seconds of video watching and it's becoming very > > stressing...and furthermore the system is often not recoverable by using VTs. > > > I do not know whether you are using an Ubuntu patched kernel. If so, my best > bet would be that you need > http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff > A short account on the background is given in > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15. > Upstream kernels do not need that patch, e.g. the one from Brian Rogers > mentioned up in comment #97, for reverting a previous patch. > Your bet was correct! I can witness that watching even 5 videos altogether does not crash the system anymore! The green bands glitches on vertical resync are still there (and pretty disturbing), but this is probably a separate bug. > > Flash videos are less likely to trigger the crash, while fullscreen or > > maximized windows do it best. > > > > I can't find the changelog you are mentioning, where is it? Also looks like he > > is using only the patches from this bug tracker entry, since he only mentions > > it on his launchpad page. > Rather hidden in one of his German texts he mentions the xv fix. The changelog > is the one in Stefan Glasenhardt's package > /usr/share/doc/855gm-fix-dkms/changelog.gz when installed. Thanks, I was not looking there because I downloaded the non-DKMS version. I got one crash while watching a long video; I can't say right now if it is exactly the same kind of crash experienced before, I'll collect debug data next time. Anyway the xv_overlay_mode_fix.diff seems to reduce drastically the crash occurrencies, or possibly totally (if I experienced a different bug instead). I can confirm this same bug on a Dell Optiplex GX260. I see there is a hack for this, but this bug (at least this report) alone has been open for 6 months now with no sign of fixing. 
Being that this is a show-stopper for this chipset I would imagine that someone would have fixed it by now. Because of this I am offering a $20 reward (payable via Paypal) to the FIRST person that fixes this issue properly, without simply commenting out code, and successfully feeds it into upstream. Once this is done please email me. :)

Thanks,
--Nick Betcher

I wonder how much work it would take if, on the broken chipsets, we just pre-allocate all GTT-mapped memory and make a copy in case a buffer is moved between CPU and GTT domains (that is, like the classic memory manager, only with a new API?). If my understanding is correct, this would eliminate the need for chipset flushes except when the GTT-mapped memory is first allocated (since the memory may have been touched by the CPU), where any failure would likely be detected early.

I'm currently using a recent version of libdrm and xf86-video-intel (not the most up to date, though) and I have to say that, although there are still messages like:
[11876.299009] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[11876.299300] render error detected, EIR: 0x00000000
[11876.299925] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 543130 at 543128)
in dmesg, X is usable and does not crash anymore. This is enough for me (having to reboot three or four times a day, sometimes much more, really was annoying), so thank you for the work on this!

*** Bug 28187 has been marked as a duplicate of this bug. ***

*** Bug 26803 has been marked as a duplicate of this bug. ***

This bug has been open for half a year, and it looks like no progress in a couple months, and my computer is STILL CRASHING EVERY DAY. This is seriously messed up. What is going on?

*** Bug 25552 has been marked as a duplicate of this bug. ***

*** Bug 29006 has been marked as a duplicate of this bug.
*** As a workaround, I've pushed a shadow branch to http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=shadow which disables GPU acceleration and uses a static shadow buffer and uncached memory accesses. This avoids the dynamic reallocation of the GTT and the i845 errata and the general i8xx incoherency problems. To enable use of the shadow, add Section "Driver" Option "Shadow" "True" EndSection It is surviving the wtf test on my i845. I have to say that I have not had a crash for quite a long time : uptime is 13 days and I have had no "GPU hung" meanwhile. Thanks to those who worked on this, this is much appreciated. more details: $ git config --get remote.origin.url git://anongit.freedesktop.org/xorg/driver/xf86-video-intel (compiled on 2010-08-13, it was the latest version at that time) $ uname -r 2.6.35-gentoo-r1 $ sudo lspci -vv | grep -i graphics -A 12 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (prog-if 00 [VGA controller]) Subsystem: IBM NetVista A30p Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 16 Region 0: Memory at 88000000 (32-bit, prefetchable) [size=128M] Region 1: Memory at 80000000 (32-bit, non-prefetchable) [size=512K] Expansion ROM at <unassigned> [disabled] Capabilities: [d0] Power Management version 1 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: i915 $ uptime 11:05:55 up 13 days, 18:23, 2 users, load average: 1.13, 1.28, 1.18 $ sudo intel_gpu_dump | head ACTHD: 0xae005568 EIR: 0x00000000 EMR: 0xffffff79 ESR: 0x00000010 PGTBL_ER: 0x00000049 IPEHR: 0x01000000 IPEIR: 0x00000000 INSTDONE: 0x00ffffc0 Do you suggest me to try the "shadow" branch anyway ? 
Btw, I did have two others errors from drm. I mention them here for information. --- One I can reproduce easily is the following~: [1189870.304194] [drm:i915_gem_do_execbuffer] *ERROR* Failed to pin buffer 2 of 3, total 83902464 bytes, 0 fences: -28 [1189870.304202] [drm:i915_gem_do_execbuffer] *ERROR* 586 objects [5 pinned], 147406848 object bytes [49430528 pinned], 49430528/117571584 gtt bytes Reproduced by opening <http://d2.gamaniak.com/img/0810/gamaniak_facebook-jesus.jpg> in firefox ; it partly renders ok, then the image becomes black ; but saving the file on disk and opening it in eog produces no error and the image renders ok. (is it worth a new bug report here ?) --- The other error is: /var/log/messages-20100819.gz: Aug 18 19:06:55 myhostname kernel: [440653.457668] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking but I don't know how, I don't know why. Nice work, Chris! But, I can't get it to build: intel_uxa.c: In function ‘intel_shadow_create’: intel_uxa.c:1039: error: ‘size’ undeclared (first use in this function) intel_uxa.c:1039: error: (Each undeclared identifier is reported only once intel_uxa.c:1039: error: for each function it appears in.) intel_uxa.c: In function ‘intel_uxa_create_screen_resources’: intel_uxa.c:1255: error: ‘intel_screen_private’ has no member named ‘front_stride’ The master branch builds fine, however. When I get this compiling, I'll put it in an Ubuntu PPA for people to try. > --- Comment #115 from Brian Rogers <brian@xyzw.org> 2010-08-27 08:08:10 PDT ---
> But, I can't get it to build:
Oops, sorry. Compilation fix pushed.
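A rough reading of the "Failed to pin buffer" messages quoted a few comments up, offered as a back-of-the-envelope check and not a confirmed diagnosis. It assumes that the last pair in the second message is "bytes currently bound in the GTT / total GTT bytes" and that the -28 is -ENOSPC; the helper name `gtt_headroom` is made up for this sketch. Under those assumptions the 83902464-byte request simply does not fit next to what is already pinned:

```python
import errno
import re

# Second log line from the report above (timestamp dropped).
LOG = ("[drm:i915_gem_do_execbuffer] *ERROR* 586 objects [5 pinned], "
       "147406848 object bytes [49430528 pinned], "
       "49430528/117571584 gtt bytes")

# "total 83902464 bytes" from the first log line.
REQUESTED = 83902464


def gtt_headroom(line):
    """Return (used, total, free) parsed from an 'N/M gtt bytes' report.

    Assumption (not confirmed in this thread): N is what is bound in
    the GTT right now and M is the size of the GTT.
    """
    match = re.search(r"(\d+)/(\d+) gtt bytes", line)
    if match is None:
        raise ValueError("no 'gtt bytes' pair in line")
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, total - used


used, total, free = gtt_headroom(LOG)
print("free GTT bytes:", free)              # 117571584 - 49430528
print("request fits:", REQUESTED <= free)   # compares 83902464 against free
print("errno 28 is", errno.errorcode[28])
```

If that reading is right, the black image in Firefox would be an out-of-aperture failure and not one of the hangs tracked in this report, which would argue for a separate bug.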
I've had three people try the shadow branch PPA I set up. One person got a failure to start up correctly: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/205 He didn't get back to me on whether the failure also happened with the shadow option turned off. One positive report (after sorting out package management issues): https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492/comments/217 No follow-up since then so it must still be working well. One person still gets GPU hangs: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/628556 There's a dump.tar there with an i915_error_state. *** Bug 30064 has been marked as a duplicate of this bug. *** *** Bug 24825 has been marked as a duplicate of this bug. *** After applying commit 15056d2c06862627ead868e035fcacc59dce1b1a Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Dec 21 17:04:23 2010 +0000 drm/i915: Flush pending writes on i830/i845 after updating GTT There is an erratum on these two chipsets that causes the wrong PTE entries to be invalidate after updating the GTT and when used from the BLT engine. The workaround is to flush any pending writes before those PTEs are used by the BLT. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> this reduces to the general i8xx incoherency, bug 27187. For which I have a patch which appears to work on my i845; passing both the wtf test and Daniel's cache-coherency checker! Spoke too soon, still hanging. I have the notorious i845 rev01 chipset: 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) I don't know anything about graphics internals, but if there is anything I can do to help test things, let me know. Thanks. I've been testing my Fujitsu-Siemens Amilo 7400M with newer kernel and Intel driver releases with openSUSE and thus far while progress has been made, this is not yet completely fixed. 
It did work fine in the 2.6.27 kernel, but not as well since. There is a thread here on the subject:
http://forums.opensuse.org/forums/english/get-technical-help-here/pre-release-beta/438965-intel-gpu-8xx-issues-will-11-3-have-them-too.html

I made a video as to how this worked properly under openSUSE-11.1 here with the 2.6.27 kernel:
http://www.youtube.com/watch?v=lfnAPDt_bn0

Until openSUSE-11.4 Milestone-6, the behaviour in openSUSE was not very good at all, although with milestone-6 there have been 'some' improvements.

I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (KDE liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver:
http://www.youtube.com/watch?v=QRRyQn_h03Y

I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (Gnome liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver:
http://www.youtube.com/watch?v=9-X3ZiYUbcc

The prevention of a total crash/freeze with the newer 2.6.37-20 kernel (w/Intel 2.14.0 driver) on openSUSE-11.4 milestone-6 is significantly superior to the 2.6.28 and later kernels, but still not as good as the 2.6.27 kernel on openSUSE-11.1. I have not (yet) tried the older Intel 2.9.1 video driver with the 2.6.37-20 kernel.

There is a regression with kernel 2.6.37 (from Arch Linux bugtracker): "when GPU hungs, the display randomly "fracture" (crazy artifacts) and no longer work. I can "repair" the screen by switching to console and back, but the screen doesn't respond to anything except the cursor. Thankfully I can switch back to console and reboot."[1]

So the current state of this bug on Arch Linux:

kernel 2.6.36 without shadow buffer:
- OpenGL apps work fine.
- Random GPU hangs, but the display is still usable (but slower).
The error message from everything.log:
kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed...
GPU hung
And from Xorg.0.log:
[ 792.650] (EE) intel(0): Detected a hung GPU, disabling acceleration.

kernel 2.6.36 with shadow buffer enabled:
- Random GPU hangs disappear.
- OpenGL apps are broken. When I launch a GL app (e.g. Quadrapassel), the window's content is not displayed correctly, and when I try to move/resize it, the Xorg server crashes and restarts.

kernel 2.6.37 without shadow buffer:
- OpenGL apps work fine.
- Random GPU hangs; the display randomly "fractures" and freezes (reboot required).
The error message from everything.log:
kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
kernel: render error detected, EIR: 0x00000010
kernel: [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
kernel: render error detected, EIR: 0x00000010
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[drm:i915_reset] *ERROR* Failed to reset chip.
And from Xorg.0.log:
[ 116.454] (EE) intel(0): Detected a hung GPU, disabling acceleration.
And a lot of:
[ 131.709] (EE) intel(0): failed to set cursor: Input/output error

kernel 2.6.37 with shadow buffer enabled:
- Random GPU hangs disappear, but:
- The GPU usually hangs when I log in and log out from a GNOME session. So when GDM loads again, the GPU hangs and the display is messed up after a few seconds.
- OpenGL apps are broken: when I launch a GL app (e.g. Quadrapassel), the window's content is not displayed correctly, and when I try to move/resize it, the Xorg server crashes and restarts.

The used package versions:
libdrm 2.4.23-2
xf86-video-intel 2.14.0-2
xorg-server 1.9.4-1

So currently I can't configure my system with Intel 845G to work properly. Is there any solution?

[1] https://bugs.archlinux.org/task/22781

*** Bug 27245 has been marked as a duplicate of this bug. ***

(In reply to comment #125)
> *** Bug 27245 has been marked as a duplicate of this bug. ***
Has someone find a solution for this BUG?
Has this patch above solved the freezing problem? I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and Debian Squeeze with 2.6.32 kernel. After 15 min working with an Internet Browser, my Gnome Desktop is freezing completely and I get this error: May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: 0x00000000 May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 2208 at 2207) (In reply to comment #126) > (In reply to comment #125) > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > Has someone find a solution for this BUG? > Has this patch above solved the freezing problem? > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > Debian Squeeze with 2.6.32 kernel. > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > completely and I get this error: > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > *ERROR* Hangcheck timer elapsed... GPU hung > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > 0x00000000 > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > i915_do_wait_request returns -5 (awaiting 2208 at 2207) Good news! I've installed these debian wheezy packages on my squeeze: xserver-xorg-video-intel libdrm-intel1 and now this problem is really solved! (In reply to comment #127) > (In reply to comment #126) > > (In reply to comment #125) > > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > > > Has someone find a solution for this BUG? > > Has this patch above solved the freezing problem? > > > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > > Debian Squeeze with 2.6.32 kernel. 
> > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > > completely and I get this error: > > > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > > *ERROR* Hangcheck timer elapsed... GPU hung > > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > > 0x00000000 > > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > > i915_do_wait_request returns -5 (awaiting 2208 at 2207) > > > Good news! > I've installed these debian wheezy packages on my squeeze: > > xserver-xorg-video-intel > > libdrm-intel1 > > and now this problem is really solved!

BTW Another solution for the 82845G video driver problem(s):

put the following in /etc/X11/xorg.conf

Section "Device"
    Identifier "Card0"
    Driver "intel"
    Option "Shadow" "true"
    Option "DRI" "false"
    BoardName "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)"
    BusID "PCI:0:2:0"
EndSection

The entry above keeps the video driver from doing the things that can lead, for example, to the logout black-screen problem or the "Hangcheck timer elapsed" error message.

Run "lspci | grep VGA" to get the "BusID" and the "BoardName", for example:

# lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

For additional information run the "lspci -v" command.

(In reply to comment #128) > (In reply to comment #127) > > (In reply to comment #126) > > > (In reply to comment #125) > > > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > > > > > Has someone find a solution for this BUG? > > > Has this patch above solved the freezing problem? > > > > > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > > > Debian Squeeze with 2.6.32 kernel.
> > > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > > > completely and I get this error: > > > > > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > > > *ERROR* Hangcheck timer elapsed... GPU hung > > > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > > > 0x00000000 > > > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > > > i915_do_wait_request returns -5 (awaiting 2208 at 2207) > > > > > > Good news! > > I've installed these debian wheezy packages on my squeeze: > > > > xserver-xorg-video-intel > > > > libdrm-intel1 > > > > and now this problem is really solved! > > (In reply to comment #129) > (In reply to comment #128) > > (In reply to comment #127) > > > (In reply to comment #126) > > > > (In reply to comment #125) > > > > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > > > > > > > Has someone find a solution for this BUG? > > > > Has this patch above solved the freezing problem? > > > > > > > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > > > > Debian Squeeze with 2.6.32 kernel. > > > > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > > > > completely and I get this error: > > > > > > > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > > > > *ERROR* Hangcheck timer elapsed... GPU hung > > > > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > > > > 0x00000000 > > > > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > > > > i915_do_wait_request returns -5 (awaiting 2208 at 2207) > > > > > > Good news! I've installed these debian wheezy packages on my squeeze: xserver-xorg-video-intel libdrm-intel1 and now this problem is really solved! 
(In reply to comment #130) > (In reply to comment #129) > > (In reply to comment #128) > > > (In reply to comment #127) > > > > (In reply to comment #126) > > > > > (In reply to comment #125) > > > > > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > > > > > > > > > Has someone find a solution for this BUG? > > > > > Has this patch above solved the freezing problem? > > > > > > > > > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > > > > > Debian Squeeze with 2.6.32 kernel. > > > > > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > > > > > completely and I get this error: > > > > > > > > > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > > > > > *ERROR* Hangcheck timer elapsed... GPU hung > > > > > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > > > > > 0x00000000 > > > > > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > > > > > i915_do_wait_request returns -5 (awaiting 2208 at 2207) > > > > > > > > > Good news! I've installed these debian wheezy packages on my squeeze: xserver-xorg-video-intel libdrm-intel1 and now this problem is really solved! It's quite interesting, I'm still getting these errors on my squeeze: Jun 15 17:28:48 squeeze kernel: [ 49.124005] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung Jun 15 17:28:48 squeeze kernel: [ 49.124017] render error detected, EIR: 0x00000000 Jun 15 17:28:48 squeeze kernel: [ 49.124838] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 727 at 649) but the computer doesn't freeze anymore. Can someone explain what is the status of the solution for this bug? Chris Wilson's last comment is that he still has hangs. So I take there is still no upstream solution. Upiter77 reports success with the latest debian packages. But what patches have caused this? 
What patches should I apply and does anyone know the status of these patches wrt Ubuntu? Regards, Bert (In reply to comment #128) > (In reply to comment #127) > > (In reply to comment #126) > > > (In reply to comment #125) > > > > *** Bug 27245 has been marked as a duplicate of this bug. *** > > > > > > Has someone find a solution for this BUG? > > > Has this patch above solved the freezing problem? > > > > > > I have Intel 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device and > > > Debian Squeeze with 2.6.32 kernel. > > > After 15 min working with an Internet Browser, my Gnome Desktop is freezing > > > completely and I get this error: > > > > > > May 20 00:35:58 squeeze kernel: [ 252.728005] [drm:i915_hangcheck_elapsed] > > > *ERROR* Hangcheck timer elapsed... GPU hung > > > May 20 00:35:58 squeeze kernel: [ 252.728016] render error detected, EIR: > > > 0x00000000 > > > May 20 00:35:58 squeeze: [ 252.728024] [drm:i915_do_wait_request] *ERROR* > > > i915_do_wait_request returns -5 (awaiting 2208 at 2207) > > > > > > Good news! > > I've installed these debian wheezy packages on my squeeze: > > > > xserver-xorg-video-intel > > > > libdrm-intel1 > > > > and now this problem is really solved! > > > BTW Another solution for the 82845G video driver problem(s): > > put the following in /etc/X11/xorg.conf > > Section "Device" > Identifier "Card0" > Driver "intel" > Option "Shadow" "true" > Option "DRI" "false" > BoardName "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated > Graphics Device (rev 01)" > BusID "PCI:0:2:0" > EndSection > > The entry above restricts the video driver to do what can result for example in > the -logout black screen- problem or -Hang check timer elapsed- error message. 
>
> Run "lspci | grep VGA" to get the "BusID" and the "BoardName", for example:
>
> # lspci | grep VGA
> 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
>
> For additional information run the "lspci -v" command.

On openSUSE 11.4 I solved this problem using the following in /etc/X11/xorg.conf:

Section "Device"
    ### Available Driver options are:-
    ### Values: <i>: integer, <f>: float, <bool>: "True"/"False",
    ### <string>: "String", <freq>: "<f> Hz/kHz/MHz",
    ### <percent>: "<f>%"
    ### [arg]: arg optional
    #Option "AccelMethod" # [<str>]
    Option "DRI" "false" # [<bool>]
    #Option "ColorKey" # <i>
    #Option "VideoKey" # <i>
    #Option "FallbackDebug" # [<bool>]
    #Option "Tiling" # [<bool>]
    Option "Shadow" "true" # [<bool>]
    #Option "SwapbuffersWait" # [<bool>]
    #Option "XvMC" # [<bool>]
    #Option "XvPreferOverlay" # [<bool>]
    #Option "DebugFlushBatches" # [<bool>]
    #Option "DebugFlushCaches" # [<bool>]
    #Option "DebugWait" # [<bool>]
    #Option "HotPlug" # [<bool>]
    Identifier "Card0"
    Driver "intel"
    BusID "PCI:0:2:0"
    BoardName "Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device [8086:2562] (rev 01)"
EndSection

The following /etc/X11/xorg.conf.d/50-device.conf worked for me in Fedora 15:

Section "Device"
    Identifier "Default Device"
    Option "DRI" "false"
    Option "Shadow" "true"
EndSection

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

This issue is affecting a hardware component which is not being actively worked on anymore. Moving the assignee to the dri-devel list as contact, to give this issue better coverage.

*** Bug 34868 has been marked as a duplicate of this bug. ***

Created attachment 51401 [details] [review]
New stab at working around the i845 tlb issues.

It'd be great if anyone with a still-booting i845 could test this. Obviously you need to disable Shadow. Also, expect some slowdown, but hopefully not that bad.
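For anyone copying the Shadow/DRI xorg.conf workarounds quoted in this thread: the BusID line is derived from the slot that lspci prints at the start of the line. A minimal sketch of that conversion follows, assuming the usual convention that lspci prints bus, device and function in hexadecimal while xorg.conf expects decimal. The function name lspci_to_busid is made up for illustration and is not part of any tool mentioned here.

```python
import re

# Matches the slot at the start of an lspci line, e.g. "00:02.0" or
# "0000:00:02.0" (with an optional PCI domain prefix).
_SLOT = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\b"
)


def lspci_to_busid(lspci_line):
    """Turn the slot of one lspci output line into an xorg.conf BusID string.

    Assumption: lspci prints the slot in hex, xorg.conf wants decimal.
    The PCI domain, if present, is ignored in this sketch.
    """
    match = _SLOT.match(lspci_line.strip())
    if match is None:
        raise ValueError("no PCI slot found at start of line: %r" % lspci_line)
    _domain, bus, device, function = match.groups()
    return "PCI:%d:%d:%d" % (int(bus, 16), int(device, 16), int(function, 16))


if __name__ == "__main__":
    line = ("00:02.0 VGA compatible controller: Intel Corporation "
            "82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)")
    print(lspci_to_busid(line))
```

For the 845G in this bug the hex and decimal forms happen to coincide (00:02.0 becomes PCI:0:2:0), so the conversion only matters for devices on higher-numbered buses or slots.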
Created attachment 53024 [details]
The relevant part of dmesg with Daniel's patch

Daniel's patch does not work for me. I tested with the 3.1 kernel, and I saw only some dotted lines and a cursor when Xorg loaded. (Attached the relevant part of dmesg.)

Current state of the driver with the 3.1 kernel:
- Still get random GPU hangs without ShadowFB. It is stable with ShadowFB.
- XVideo: contrast and saturation are misconfigured (Bug 42488)
- OpenGL without ShadowFB: it's fast and stable until a GPU hang. Once a GPU hang has occurred, OpenGL apps no longer work.
- OpenGL with ShadowFB enabled: it works, but very slowly, slower than llvmpipe.
- Trying to run GNOME Shell always causes an immediate GPU hang, even if ShadowFB is enabled.

pls work for i845

Hi, I've hit this too with:

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)

I tried the patch in comment #138 against 3.2.0-rc4, and I had similar results to comment #139: the display wasn't usable. Is there anything else I can try? I have the system readily available.

*** Bug 40960 has been marked as a duplicate of this bug. ***

*** Bug 40181 has been marked as a duplicate of this bug. ***

*** Bug 19068 has been marked as a duplicate of this bug. ***

I've put some recent patches into http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=845g which make my 845g much more stable, though I'm still getting spurious GPU hangs under memory pressure. In that regard SNA is performing better (not only due to being a more complete acceleration architecture) as it is thrashing the GTT far less.

*** Bug 27578 has been marked as a duplicate of this bug. ***

*** Bug 53065 has been marked as a duplicate of this bug. ***

*** Bug 54093 has been marked as a duplicate of this bug. ***

thanks chris how do I apply the patch?

*** Bug 55934 has been marked as a duplicate of this bug. ***

*** Bug 56933 has been marked as a duplicate of this bug. ***

Woohoo!
commit c7f7dd61fd07dbf938fc6ba711de07986d35ce1f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Dec 12 19:43:19 2012 +0000

    sna: Pin some batches to avoid CS incoherence on 830/845

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=26345
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I can't promise the incoherence won't show up elsewhere as render corruption, but my 845g is finally surviving wtf stress tests.

Now in kernel form as well:

commit b75e53bac7f4164e1c53a636352faa3d177b4beb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sun Dec 16 18:08:07 2012 +0100

    drm/i915: Implement workaround for broken CS tlb on i830/845

    Now that Chris Wilson demonstrated that the key for stability on early gen 2 is to simple _never_ exchange the physical backing storage of batch buffers I've tried a stab at a kernel solution. Doesn't look too nefarious imho, now that I don't try to be too clever for my own good any more.

    v2: After discussing the various techniques, we've decided to always blit batches on the suspect devices, but allow userspace to opt out of the kernel workaround assume full responsibility for providing coherent batches. The principal reason is that avoiding the blit does improve performance in a few key microbenchmarks and also in cairo-trace replays.

    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

These fixes are in the latest 3.8 kernel. Have the GPU hangs been fixed in earlier versions of the kernel?

(In reply to comment #154)
> These fixes are in the latest 3.8 kernel. Have the GPU hangs been fixed in
> earlier versions of the kernel?

The sna fixes work with any KMS/GEM (i.e. 2.6.29+) kernel. The kernel w/a is being backported by Julien Cristau for the debian kernel.

Thanks... We are using 2.6.32 kernel. We had problems with the 845G hanging but I don't think we were using SNA acceleration at the time. What patches do you think we need?
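To make the version question above concrete, here is a rough sketch that classifies a kernel release string against the two thresholds mentioned in this thread: 2.6.29+ for the userspace sna fix (any KMS/GEM kernel) and 3.8 for the mainline kernel workaround. Both thresholds are taken from the comments here and not from kernel documentation, and the helper names are made up for illustration. A distribution kernel with a backport (such as the Debian one mentioned above) would be reported as lacking the workaround even though it has it.

```python
import re

# Thresholds as stated in this thread (assumptions, not authoritative):
SNA_FIX_MIN = (2, 6, 29)           # "any KMS/GEM (i.e. 2.6.29+) kernel"
KERNEL_WORKAROUND_MIN = (3, 8, 0)  # "These fixes are in the latest 3.8 kernel"


def parse_release(release):
    """Parse strings like '2.6.32-5-686' or '3.8' into a 3-tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def classify(release):
    """Report which of the two fixes a mainline kernel of this version could use.

    Backported distribution kernels are not detected by this sketch.
    """
    version = parse_release(release)
    return {
        "sna_fix_usable": version >= SNA_FIX_MIN,
        "mainline_kernel_workaround": version >= KERNEL_WORKAROUND_MIN,
    }


if __name__ == "__main__":
    import platform
    print(classify(platform.release()))
```

By this reading a 2.6.32 kernel can use the sna commit from the X driver, but would need the kernel workaround backported separately.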
In regards to comment 145, is it recommended to use SNA? We are not using SNA and have seen GPU hangs on the 845G. Is it better to use SNA and apply the SNA patch to xorg-x11-drv-intel?

(In reply to comment #157)
> In regards to comment 145, is it recommended to use SNA? We are not using
> SNA and have seen GPU hangs on the 845G. Is it better to use SNA and apply
> the SNA patch to xorg-x11-drv-intel?

Comment 145 is stale, superseded by the genuine fixes in comments 152 and 153. I would recommend using SNA on gen2 as UXA pales in comparison.

Something is broken in the 3.8 kernel. When I'm using it, the colour depth is low, and my system freezes when I try to suspend the computer. I don't know if it is caused by the applied workaround or not, but the problem is gone if I downgrade to kernel version 3.7.

(In reply to comment #159)
> Something is broken in the 3.8 kernel. When I'm using it, the colour depth
> is low, and my system freezes when I try to suspend the computer. I don't
> know if it is caused by the applied workaround or not, but the problem is
> gone if I downgrade to kernel version 3.7.

Please file a new bug with a detailed description of your symptoms instead of cluttering this one.
Created attachment 32942 [details]
intel_gpu_dump output

The following command is pretty good at locking up the GPU on this i845 machine:

x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1

It doesn't always freeze on the same test, just somewhere random within this range. However, I've run this several times, and so far it hasn't made it to the end without freezing.

GPU is the following:

00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

I observed this on Ubuntu Karmic. Jaunty is unaffected.
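Since the freeze lands on a random test within the range, anyone trying to reproduce may want to loop the command rather than run it once. The sketch below wraps the exact x11perf invocation from this report. The loop count and the timeout are arbitrary assumptions, and the function names are made up for illustration. Note that a real GPU hang may freeze the whole session, in which case the script may never get the chance to report anything.

```python
import subprocess


def build_x11perf_command(first="copywinpix10", last="comppixwin500",
                          seconds=1, repeats=1):
    """Build the x11perf invocation from the report as an argument list."""
    return ["x11perf", "-range", "%s,%s" % (first, last),
            "-time", str(seconds), "-repeat", str(repeats)]


def run_until_stall(max_runs=10, timeout_s=600.0, runner=subprocess.run):
    """Run the test range repeatedly.

    Returns the 1-based number of the first run that timed out or exited
    with a non-zero status, or None if every run completed.
    The timeout is a guess; tune it to how long one pass takes on your box.
    """
    command = build_x11perf_command()
    for run_number in range(1, max_runs + 1):
        try:
            result = runner(command, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return run_number  # x11perf never finished: likely a hang
        if result.returncode != 0:
            return run_number
    return None


if __name__ == "__main__":
    stalled_at = run_until_stall()
    if stalled_at is None:
        print("all runs completed")
    else:
        print("run %d stalled or failed" % stalled_at)
```

This needs a running X session (DISPLAY set) and x11perf installed. The runner parameter exists only so the loop can be exercised without touching real hardware.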