Using the xf86-video-intel driver in SNA mode (using version 2.20.9, but I've seen it back to 2.20.7 as well - I have not tested SNA in earlier versions), chromium tabs exhibit a strange "blotchy" flickering of text/graphics as the mouse is moved over them. Note that this may vary depending on the text (or more probably the length of the text) in the tab. I wonder if the transparent gradient/fade of the text in these tabs plays a part here. I will attach a video screen capture of this happening. One URL that causes it is the URL of a previous bug I reported: https://bugs.freedesktop.org/show_bug.cgi?id=55484 (the tab in the video uses this one). [From my lspci: VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)]
Created attachment 67928 [details] Video screen capture showing the chromium tab flicker
Can you please check that the attached video is the correct one? mplayer doesn't complain, but doesn't show anything either. Also can you please attach your Xorg.0.log so that I know what hardware you have, and tell me which WM you are using (and any compositing options)?
Created attachment 67929 [details] Video screen capture showing the chromium tab flicker Trying this attachment again - this time specifying binary manually... the previous upload seemed to alter the file.
Created attachment 67930 [details] Xorg.0.log file from session showing problem
I am using openbox as a WM. I have not selected any special (non-default) compositing options that I know of.
What it actually looks like is that the gradient is misapplied whilst rendering the glyphs. Can you please test with either downgrading pixman to 0.26 or compiling -intel with -UHAS_PIXMAN_GLYPHS?
And for an extra level of paranoia, can you also check if the error persists if you do a debug build (unoptimized, -O0) of pixman, xserver, and -intel? Having been burnt by bugs uncovered by aggressive compiler optimisations before, it helps to keep me calm to have a sanity check. ;-)
Ok, I downgraded to pixman-0.26 (which then causes -intel to be built without HAS_PIXMAN_GLYPHS). I also compiled xorg-server and -intel with -O0. However, I was unable to compile pixman with -O0, because configure failed (checking for MMX support). The same problem persists even with the above done. Let me know if I should try anything else, or if this is enough to verify that it's not pixman...
Yes, that is enough to rule out the new pixman_glyph_t routines and enough to show that it is not some random miscompilation. Which means we^W I need to look harder.
First of all let's disable acceleration of glyphs:
diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 53494e3..4e510a4 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -69,7 +69,7 @@
 #include <mipict.h>
-#define FALLBACK 0
+#define FALLBACK 1
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0
That will tell us whether the corruption occurs as we render the glyphs using the GPU or as we upload. Similarly working through each of the NO_* options thereafter would be very helpful to identify which path in particular is affected. Secondly,
diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..f901008 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1863,7 +1863,7 @@ gen4_composite_picture(struct sna *sna,
     if (picture->pDrawable == NULL) {
         int ret;
-        if (picture->pSourcePict->type == SourcePictTypeLinear)
+        if (picture->pSourcePict->type == SourcePictTypeLinear && 0)
             return gen4_composite_linear_init(sna, picture, channel,
                                               x, y,
                                               w, h,
@@ -2046,7 +2046,6 @@ check_gradient(PicturePtr picture)
 {
     switch (picture->pSourcePict->type) {
     case SourcePictTypeSolidFill:
-    case SourcePictTypeLinear:
         return false;
     default:
         return true;
will confirm whether it is the gradient that is implicated in this bug.
OK, tried your suggestions - I set the following, one at a time, to 1. Here are the results:
#define FALLBACK 0              clean
#define NO_GLYPH_CACHE 0        clean
#define NO_GLYPHS_TO_DST 0      clean
#define NO_GLYPHS_VIA_MASK 0    bad
#define NO_SMALL_MASK 0         bad
#define NO_GLYPHS_SLOW 0        bad
#define NO_DISCARD_MASK 0       bad
Where it says "clean", changing only this define to 1 (and leaving the rest at 0) fixed the problem. Where it says "bad", the bug still existed with only this define set to 1. Applying the gen4_render.c patches (with all of the above left unchanged at 0) had no effect - the bug still existed. Let me know if this helps.
Ok, we are now into the realms of a missing GPU flush, or rather a missing workaround. Can you please try:
diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..9d298dd 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1265,6 +1265,7 @@ gen4_emit_pipelined_pointers(struct sna *sna,
     if (key == sna->render_state.gen4.last_pipelined_pointers)
         return;
+    OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
     OUT_BATCH(GEN4_3DSTATE_PIPELINED_POINTERS | 5);
     OUT_BATCH(sna->render_state.gen4.vs);
     OUT_BATCH(GEN4_GS_DISABLE); /* passthrough */
with everything else back to normal.
(In reply to comment #12) > + OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH); Nope, bug still occurs.
Hmm, ok. A more drastic patch to confirm that this is the flushing bug I think it is...
diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..5e35ff1 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1376,7 +1376,7 @@ gen4_emit_state(struct sna *sna,
         const struct sna_composite_op *op,
         uint16_t wm_binding_table)
 {
-    if (FLUSH_EVERY_VERTEX)
+    if (1||FLUSH_EVERY_VERTEX)
         OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
     gen4_emit_drawing_rectangle(sna, op);
(In reply to comment #14) > Hmm, ok. A more drastic patch to confirm that this is the flushing bug I > think it is... > ... > + if (1||FLUSH_EVERY_VERTEX) > OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH); No, sorry to say, this did not fix it.
Created attachment 68731 [details] [review] Flush state changes Since this sounds like a flush issue and gen5+ require flushes between certain pipelined ops, presume gen4 also needs them. Can you please test the attached patch as I've yet to reproduce this issue?
Created attachment 68756 [details] [review] Flush state changes
I've uploaded
commit 257abfdabe39629fb458ed65fab11283f7518dc4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Oct 17 23:34:22 2012 +0100
    sna/gen4: Presume we need a flush upon state change similar to gen5+
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55627
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
in the hope that gen4 has the same brokenness as the later generations. Please reopen if this doesn't fix the corruption, thanks.
I tested just now with the latest git, and the problem persists. Just to make sure I did this correctly, I replaced the following three files on my system with freshly-built ones: /usr/lib/xorg/modules/drivers/intel_drv.so /usr/lib/libI810XvMC.so.1.0.0 /usr/lib/libIntelXvMC.so.1.0.0 The other lib symlinks are, of course, there, for the last two.
This bug affects me too, on both of the machines on which I have tried to enable the SNA option. I've tried to enable it under both GNOME and Xfce, with the same results. I attach the Xorg.0.log of one of them. I have tried installing the git version of xf86-video-intel, but the bug persists. I have also noticed that the bug seems to affect only chromium tabs in which the html title cannot fit, as in the video posted by Joe Peterson.
Created attachment 69237 [details] Xorg.0.log of the session affected by the bug
Chris, the bug has disappeared today after upgrading cairo on my Archlinux machine. The upgrade includes this set of patches (half of them are from yourself): https://projects.archlinux.org/svntogit/packages.git/tree/trunk/git_fixes.diff?h=packages/cairo They were submitted on September 17, but they were included into the Archlinux cairo package only yesterday.
Hmm, more relevant to this case is that they dropped the 'cairo-1.10.0-buggy_gradients.patch' which will impact the rendering in the tabs (and notably be about an order of magnitude faster). So it feels like the bug is still lurking.
(In reply to comment #23) > Hmm, more relevant to this case is that they dropped the > 'cairo-1.10.0-buggy_gradients.patch' which will impact the rendering in the > tabs (and notably be about an order of magnitude faster). So it feels like > the bug is still lurking. Also, I have that cairo update from Arch as well, and although the problem looks (perhaps) less obvious, it's definitely still there.
You are both right. At first sight, the flickering seemed to be gone, but even if it's much less noticeable (at least for me) it's still there.
Found a hint with
commit b2245838c15b54d72557de8facb7cc15d59624ae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Nov 6 16:32:32 2012 +0000
    sna/gen4: opacity spans requires the per-rectangle flush w/a
    Note that this is worsened, but not caused, by:
    commit e1a63de8991a6586b83c06bcb3369208871cf43d
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Fri Nov 2 09:10:32 2012 +0000
        sna/gen4+: Prefer GPU spans if the destination is active
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Does this help?
I've just tested the latest git, but unfortunately the problem persists.
Ok, looking closely at this gm45 I can see a similar subtle flicker in the chromium tabs. Is the effect still as severe as shown in the first video? Any improvements from the flushing changes?
Created attachment 69666 [details] Screen capture for chromium tabs I have attached a video screen capture that shows the flickering after installing the latest git version of xf86-video-intel, lib-drm and cairo. As I wrote in a previous comment, the problem is almost not noticeable. This is the same behavior that I have encountered just before the latest patch.
I think the effect is now subtle enough that I'm not going to worry too much - it is undoubtedly a missing flush or incorrect hw state. Since I can reproduce using the chromium tabs, I'll fix it one day (hopefully!). Please do ping occasionally to remind me, or if you have found a particularly nasty example.
(In reply to comment #30) > I think the effect is now subtle enough that I'm not going to worry too much > - it is undoubtably a missing flush or incorrect hw state. Since I can > reproduce using the chromium tabs, I'll fix it one day (hopefully!). Please > do ping occasionally to remind me, or if you have a found a particular nasty > example. Hey Chris, do you think this could be mainly a HW bug only on the older graphics chips? In other words, is the code obeying the spec, but it would take an odd (non-spec) workaround to make the HW behave? If you have put in any extra flushes to try to fix this, perhaps those should now be removed so as not to affect performance adversely, especially if the extra ones affect newer HW as well (or did you only do it in the code for the old HW?). I suspect I'll be upgrading from this old laptop soon for other reasons, so I agree that worrying too much about a hard-to-fix problem that only affects really old HW is probably not worth it (people could just turn off SNA on older HW, as long as UXA is supported for the foreseeable future).
The code paths are specific to this chipset, and this flicker only seems to appear on gm45 for now. (Though I need to look at the other gen4 closely.) And for the order of magnitude performance improvement from switching from UXA to SNA, I think such a minor bug is a small price to pay... And it will be fixed as soon as I find a workaround.
*** Bug 58139 has been marked as a duplicate of this bug. ***
I'm not yet convinced that my Bug 58139 is a duplicate of this BZ. There are noticeable differences - when I use an ffmpeg capture tool to grab the picture from the screen, it does not seem to catch the 'visual' errors I'm observing on the laptop's LCD (and captured with a mobile phone). (On the other hand, I could just have passed the wrong options to the grab tool.) Also, in my case it's rather something 'new' at this 'scale' - I'm not aware of having seen this with an SNA driver several months older. (Eventually I could possibly try a bisect if I found some 'boring' movie to watch :))
(In reply to comment #34) > I'm not yet convinced that my Bug 58139 is duplicate of this BZ. > > There are noticeable differences - when I use some ffmpeg capture tool, to > grab picture from screen - it seems to be not catch those 'visual' errors > I'm observing on laptops LCD (and captured with mobile phone). > (On the other hand I could have just wrong options passed in to the grab > tool) > > Also in my case, it's rather something 'new' in this 'scale' - since I'm not > aware I'd have been seeing this with several months older SNA driver. > > (Eventually I could possibly try bisect if I found some 'boring' movie to > watch :)) I'm using this bug as a catch-all for the issues I'm uncovering with enabling gen4.
Zdenek, can you see if this reduces the majority of your flicker:
commit 2dbe7d91a7f15a3a9ddad696c5088ca98898fca2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Dec 12 09:50:34 2012 +0000
    sna/gen4: Use the single-threaded SF w/a for spans as well
    Fixes the flickering seen in the fishtank demo, for example.
(In reply to comment #36) > Zdenek, can you see if this reduces the majority of your flicker: > > commit 2dbe7d91a7f15a3a9ddad696c5088ca98898fca2 > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Wed Dec 12 09:50:34 2012 +0000 > Hmm, nope, no difference. I went through some Czech magazines and their headlines, and here are the links: http://goo.gl/xvPt1 http://goo.gl/3kBAV which seem to show it most obviously on my laptop. UXA seems to be without problems.
(In reply to comment #37) > UXA seems to be without problems. Because for UXA/gen4, I have a massive hammer to prevent the GPU from trying to execute operations in parallel. The battle is to understand precisely what doesn't work and find alternatives.
It's probably interesting to note that those 2 links above do not always cause a problem - in some cases the firefox tab seems to be rendered without any visible problem. If I open 2 firefox windows, it's much easier to get rendering problems, to the point that I've seen pictures completely replaced with some brownish blurred picture - even for seconds. (I'm using firefox-17.0.1-1.fc19.x86_64)
*** Bug 54357 has been marked as a duplicate of this bug. ***
Created attachment 71490 [details] How the picture should look like Here is firefox grab of original picture from this page: http://goo.gl/KAAgP This is the proper look - at the top are visible 'tabs'
Created attachment 71491 [details] Incorrect image 1 This picture was grabbed with some delay, so as not to move the mouse outside of the window (since in that case it would be properly refreshed). However, if I just scroll the page up/down, I'm able to get a picture like this visible for a short moment. It looks like the coordinates of the rendered picture were squeezed and mirrored?
Created attachment 71492 [details] Incorrect picture 2 And here is another one I've managed to take - again a nicely visible distortion. When I scroll further up and down, the image gets the correct size and look again, and then after a while I can see something like this again.
*** Bug 59685 has been marked as a duplicate of this bug. ***
*** Bug 59351 has been marked as a duplicate of this bug. ***
*** Bug 60284 has been marked as a duplicate of this bug. ***
Created attachment 74207 [details] Sample libreoffice document that almost always shows issues on gen4 I am attaching a test file that almost never looks right when opened with libreoffice on gen4.
So my testcase was "fixed" by:
commit 1565917f10d9fb3c7e2e7e273173c38c364b9861
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Feb 5 10:11:14 2013 +0000
    sna/gen4: Disable non-rectilinear GPU span compositing
    This seems to be the primary victim of the render corruption, so disable until the root cause is fixed.
Can you please check your worst-case / typical behaviours and see if any still remain?
(In reply to comment #47) > Created attachment 74207 [details] > Sample libreoffice document that almost always shows issues on gen4 > > I am attaching a test file that does almost never look right when opened > with libreoffice on gen4 I get different rendering of that test.odg with different versions of lodraw - and on the machine that renders it different, it switches back to the old output at a certain scale factor. That is quite atypical behaviour for this gen4 bug.
It does not fix #59351 https://bugs.freedesktop.org/show_bug.cgi?id=59351 The symptoms remain unchanged. test.odg works fine here on both screens, though. [lodraw 3.6.5.2]
Hmm, right, that would not be along the spans path in the first place. Oh well, I can try one of the other workarounds I had earlier...
How about:
diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index cc1778a..0a59681 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1895,6 +1895,9 @@ gen4_render_composite(struct sna *sna,
     tmp->has_component_alpha = false;
     tmp->need_magic_ca_pass = false;
+    if (!mask)
+        mask = sna->render.white_picture;
+
     if (mask) {
         if (mask->componentAlpha && PICT_FORMAT_RGB(mask->format)) {
             tmp->has_component_alpha = true;
(In reply to comment #50) > It does not fix #59351 > https://bugs.freedesktop.org/show_bug.cgi?id=59351 > > The symptoms remain unchanged. > > test.odg works fine here on both screens, though. [lodraw 3.6.5.2] Please test test.odg zooming in/out and selecting all and moving it around by steps (e.g. with the arrows). Sometimes when opened it initially renders well, but when zooming and moving by little amounts the rendering breaks, both on 3.6.5.2 and on 4.0.0.3 (RC).
(In reply to comment #52)
> How about:
>
> diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
> index cc1778a..0a59681 100644
> --- a/src/sna/gen4_render.c
> +++ b/src/sna/gen4_render.c
> @@ -1895,6 +1895,9 @@ gen4_render_composite(struct sna *sna,
>      tmp->has_component_alpha = false;
>      tmp->need_magic_ca_pass = false;
> +    if (!mask)
> +        mask = sna->render.white_picture;
> +
>      if (mask) {
>          if (mask->componentAlpha && PICT_FORMAT_RGB(mask->format)) {
>              tmp->has_component_alpha = true;
Doesn't fix it, unfortunately.
(In reply to comment #53) > On 05/02/2013 12:34, bugzilla-daemon@freedesktop.org wrote: > > Please test test.odg zooming in/out and selecting all and moving it around > by > steps (e.g. with the arrows). > Sometimes when opened it initially renders well, but when zooming and moving > by > little amounts the rendering breaks, bot on 3.6.5.2 and on 4.0.0.3 (RC). I tried hard to reproduce the issue, but wasn't able to do so. Either it's that hard to trigger or it simply doesn't exist for *my* configuration. Chris, how much influence could the kernel drm and libdrm version have on those issues? I use Linux 3.7.5 and the recent libdrm from git.
The bug that I'm presuming underlies all of these gen4 corruption issues is a misprogramming of GPU state - I have seen the flicker persist for a few kernels now and would not expect it to be a factor. (Except that there may be an eventual w/a required in the kernel, we have not applied one recently.)
(In reply to comment #56) > The bug that I'm presuming underlies all of these gen4 corruption issues is > a misprogramming of GPU state - I have seen the flicker persist for a few > kernels now and would not expect it to be a factor. (Except that there may > be an eventual w/a required in the kernel, we have not applied one recently.) I see. Is there anything we can do to help you further? Enable debug mode, etc.?
(In reply to comment #55) > (In reply to comment #53) > > Please test test.odg zooming in/out and selecting all and moving it around > > by > > steps (e.g. with the arrows). > > Sometimes when opened it initially renders well, but when zooming and moving > > by > > little amounts the rendering breaks, bot on 3.6.5.2 and on 4.0.0.3 (RC). > > I tried hard to reproduce the issue, but wasn't able to do so. > Either it's that hard to trigger or it simply doesn't exist for *my* > configuration. Weirdly enough, I have discovered that the rendering bug with my test file (test.odg) goes away if antialiasing is disabled in libreoffice.
This morning I received the kde 4.10 update and newer debs for the git version of the intel driver (13/02/07 - git 974b6a). I cannot tell whether the rendering issues have worsened with the newer driver (probably not) or whether the newer kde framework and kwin are exposing bugs in a more aggressive way, but things have become a bit problematic with this newer setup. 1) Border of the toolbar disappears 2) Shades/light effects of kwin do not render properly and remain in place (often with distortion) after use 3) Some characters appear underlined in the konsole Moving back to uxa seems to fix all these issues, as well as the libreoffice rendering.
I should mention that Option "AccelMethod" "blt" should eliminate the artifacts and still outperform uxa. Can you please confirm that supposition?
Trying blt right now. No rendering issues. SNA was perceivably faster.
I can't test it, obviously. The rotated zaphod screen, at which the artefacts appear in my case, is only available with "sna" enabled. On a side note: I tried it nevertheless as I forgot about that. The result was a segfault of the xserver.
[ 406.336] Requested Entity already in use!
[ 406.336] (EE) Screen 1 deleted because of no matching config section.
[ 406.336] (EE)
[ 406.336] (EE) Backtrace:
[ 406.336] (EE) 0: /usr/bin/X (xorg_backtrace+0x36) [0x58a416]
[ 406.336] (EE) 1: /usr/bin/X (0x400000+0x18e269) [0x58e269]
[ 406.336] (EE) 2: /usr/lib/libpthread.so.0 (0x7f3cfacee000+0xf1e0) [0x7f3cfacfd1e0]
[ 406.336] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7f3cf84d2000+0x16150) [0x7f3cf84e8150]
[ 406.336] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7f3cf84d2000+0x17435) [0x7f3cf84e9435]
[ 406.336] (EE) 5: /usr/bin/X (xf86DeleteScreen+0x84) [0x480c64]
[ 406.336] (EE) 6: /usr/bin/X (xf86BusConfig+0x216) [0x46c086]
[ 406.336] (EE) 7: /usr/bin/X (InitOutput+0x956) [0x479df6]
[ 406.336] (EE) 8: /usr/bin/X (0x400000+0x26776) [0x426776]
[ 406.336] (EE) 9: /usr/lib/libc.so.6 (__libc_start_main+0xf5) [0x7f3cf9b7aa15]
[ 406.336] (EE) 10: /usr/bin/X (0x400000+0x26c9d) [0x426c9d]
[ 406.336] (EE)
[ 406.336] (EE) Segmentation fault at address 0x0
[ 406.336] Fatal server error:
[ 406.336] Caught signal 11 (Segmentation fault). Server aborting
[ 406.336]
[ 406.336] (EE)
Wouldn't it be possible to gracefully exit and remind users that rotated setups are only available with the "sna" option?
Ah, sorry, I misled you. I forgot that not everybody uses SNA as their default accelmethod. In order to use the BLT trick, you need to build with --disable-uxa or --with-default-accel=sna.
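For anyone following along, the setting under discussion would look roughly like this in xorg.conf. This is a minimal sketch, assuming a driver built with SNA as the default acceleration method (as described above); the Identifier string is an arbitrary placeholder, and only the Option line is taken from this thread.

```
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    Option     "AccelMethod" "blt"
EndSection
```

On a multi-screen (zaphod) setup, each Device section would need the same Option line.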
Hi Chris, thanks for the clarification. I compiled the driver with --with-default-accel=sna and set both the rotated and non-rotated screens to "AccelMethod" "blt". Now, the rotated screen is _all black_ except for the cursor. I somehow managed to start acroread, but do not see anything but the cursor. The other screen works fine, though.
Created attachment 75189 [details] A video showing the issue during a zoom sequence Attaching a video showing the issue while zooming with libreoffice. Hope that the shape of the patterns that appear may be a clue to identify the issue.
Still present (exactly as in the video) with
kernel 3.8.0
libdrm and libkms 2.4.42 + git snapshot 13/02/25 commit 41fc2c...
mesa 9.2 + git snapshot 13/02/25 commit 533dc3...
xserver intel driver 2.21.3 + git snapshot 13/02/25 commit 421910...
I have discovered that I encounter a similar-looking rendering issue, with the image getting decomposed into multiple pieces that are not correctly aligned, on the whole screen whenever I attach an external displayport monitor and call
xrandr --output DP1 --auto --primary --output LVDS1 --off
Curiously, this does not happen if I first switch on the monitor and then switch off the laptop screen with 2 independent xrandr calls.
(In reply to comment #66) > Still present (exactly as in the video) with kernel 3.8.0 > libdrm and libkms 2.4.42 + git shapshot 13/02/25 commit 41fc2c... > mesa 9.2 + git snapshot 13/02/25 commit 533dc3... > xserver intel driver 2.21.3 + git snapshot 13/02/25 commit 421910... > > I have discovered that I encounter a similarly looking rendering issue, with > the image getting decomposed in multiple pieces that are not correctly > aligned, on the whole screen whenever I attach an external displayport > monitor and call Like https://bugs.freedesktop.org/attachment.cgi?id=70128 (bug 57160)?
Looks like the 'decomposition' in small elements that are not aligned correctly has an even finer granularity. See https://bugs.freedesktop.org/attachment.cgi?id=75673
The issue with the libreoffice test file is fixed for me. Thanks!!!
(In reply to comment #69) > The issue with the libreoffice test file is fixed for me. Thanks!!!
Okay, now that was an accident! Can you try
diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index ae6d3c1..5edad51 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -57,7 +57,7 @@
 #define FORCE_INPLACE 0
 #define FORCE_FALLBACK 0
 #define FORCE_FLUSH 0
-#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628
+#define FORCE_FULL_SYNC 0
 #define DEFAULT_TILING I915_TILING_X
and see if the corruption returns?
*** Bug 62302 has been marked as a duplicate of this bug. ***
"and see if the corruption returns?" after this diff, openoffice test file in my config looks the same as in comment 65, and ok without it. It is after recompiling cairo without server_side_gradients.patch, what makes chromium tabs much better.
(In reply to comment #72) > It is after recompiling cairo without > server_side_gradients.patch, what makes chromium tabs much better. Don't bother with that, the bug is manifest inside the GPU. Your chromium tabs is just one instance where the GPU stutters, but it is not the only one and they are all not related to gradients (which is just a different texture after all).
I think I've a sort of positive message here at least on my T61 - now using upstream git commit 8f340f90f4b2f269d6308d0bd31fbc2a5f579608 I'm no longer observing corruptions while scrolling Firefox pages. So some recent commit is probably behind this change. Before I've used 14 days older commit and I've been able to easily see those corruptions. Now it looks like they are gone. Of course I'll make a longer observation here - but so far it looks promising.
(In reply to comment #74) > I think I've a sort of positive message here > at least on my T61 - now using upstream git commit > > 8f340f90f4b2f269d6308d0bd31fbc2a5f579608 > > I'm no longer observing corruptions while scrolling Firefox pages. > So some recent commit is probably behind this change. > Before I've used 14 days older commit and I've been able to easily > see those corruptions. Now it looks like they are gone. > > Of course I'll make a longer observation here - but so far it looks > promising. Ok - it seems that over a longer run the corruptions start to appear again. So it seems to be related to how the memory is being used over time. With my 4 days of uptime, I now easily see images corrupted during scrolling in firefox.
One very effective way to accelerate the appearance of the issues (which may help debugging) seems to be using libreoffice draw/impress. Drawing large shapes (or even better importing large bitmaps) and then selecting them causes a sort of grid to be drawn over the shapes and then to be erased once the object is no longer selected. Doing this a few times is often sufficient to cause incorrect drawing or erasing of the grid or incorrect redrawing of the image under the grid.
Also note that the 'sample libreoffice document that almost always shows issues on gen4' is again quite often showing issues on gen4. This file seems to be a nice regression test.
No hope to solve this and many, many other bugs on gen4 - here and in mesa. I suggest to buy new laptop because it doesn't have any sense. Some time ago I also thought that it has. Now I have Intel Ivybridge and Nvidia through Bumblebee and both runs almost perfect :-) Regards
(In reply to comment #78) > No hope to solve this and many, many other bugs on gen4 - here and in mesa. > I suggest to buy new laptop because it doesn't have any sense. Some time ago > I also thought that it has. Now I have Intel Ivybridge and Nvidia through > Bumblebee and both runs almost perfect :-) > > Regards Sure I could buy and run Windows, and forget about buggy Linux drivers ;) But we are not all the same - I still hope it can be fixed :).... Enjoy your non free proprietary Nvidia drivers....
I think that all Intel developers are extremely willing to help, and I appreciate a lot that they try to do so even if I have hardware that is 3 years old. Let's try to provide good info and replicable demo cases with the bug reports. And let's stick just to them, since these bug reports are probably already long, difficult to read and to interpret as they are.
Actually I didn't say that Nvidia has better drivers. It's just that Ivy- and Sandybridge based graphics cards have MUCH better support. I reported some bugs in the mesa driver, with bisects, logs from sysprof, wine etc. For example: https://bugs.freedesktop.org/show_bug.cgi?id=51471 A bug present since mesa 8.0-rc1, easy to repair. Fortunately it was fixed in February 2013. Many others are not fixed. Bugs in the X.org drivers were usually fixed quickly. Sometimes on the same day ;) Sorry for the offtopic. Regards, Deve
Sergio, I believe the corruption you are seeing in the presentation is from the coherency bug. My apologies for confusing that with the general gen4 rendering issues.
Created attachment 82118 [details] Video capture showing flicker on firefox scroll I've captured at 25 FPS an example of how pictures flicker when e.g. a Firefox window is being scrolled up/down. In most cases the picture is displayed normally when scrolling stops, but in rare cases the picture stays unviewable. In this short video, if you use e.g. mplayer's '.' (frame-by-frame step), you can find 2 places where the picture is broken.
Yes, that is characteristic of this bug. The internal vertex/texture coordinates that are being passed along the GPU pipeline become corrupt (it looks like we overflow a small ring buffer). The only effective approach I've found so far has been to keep the number of rectangles inside the GPU pipeline below a magic value - but that is quite tricky here, and the simplest seems to be the big hammer that I stumbled upon for UXA. (Stumbled as it was an artifact of the original implementation and fixes an entirely different GPU bug - an immediate hang.)
(In reply to comment #84)
> Yes, that is characteristic of this bug. The internal vertex/texture
> coordinates that are being passed along the GPU pipeline become corrupt (it
> looks like we overflow a small ring buffer). The only effective approach
If it were a plain 'overflow', then I'd always be seeing this kind of corruption right after X starts. But it seems that e.g. Firefox must be used for a certain period of time before these corruptions become visible. I'm certainly not an expert on GPU programming - but maybe, when the physical memory gets fragmented enough after some usage, there are some 'cached bo' objects whose usage is not fully synchronized? The effect can also disappear (I've not yet noticed any particular trigger for that) - Firefox then scrolls even the very same pages and there is no problem (e.g. right now I am still running the same session and I do not get any visible problem).
> I've found so far has been to keep the number of rectangles inside the GPU
> pipeline below a magic value - but that is quite tricky here, and the
Well, why would the number sometimes work for hours without a single visible problem, and then suddenly start to show problems again? My impression is that, when this problem is visible, operations are working with 'wrong' parameters (e.g. sometimes I see the image stretched, inverted, zoomed). Maybe the parameters have only some bits mangled - memory not fully synchronized between CPU and GPU??
Whilst more than likely you are using a kernel with a known incoherency bug (as there is yet to be a kernel release with it fixed), the effects are quite different to the ones you captured. The issue which I believe to be behind the distorted flickering can be quite easily triggered by simply changing the number of rectangles submitted in a single draw call. In practice, this then both depends upon the exact details of the rendering and its timing.
Created attachment 82127 [details]
Scroll with intel_gpu_top

I've captured video both when the problem appears and when it doesn't (attaching only a frame).

Using a kernel somewhat past 3.10, b2c311075db578f1433d9b303698491bfa21279a (current vanilla as of now).
Using the current xf86 tree, 5aaab9ea0310d48bb1a1ca20308d1c9721a9de3f (as of now).

This is a Firefox scroll once I had managed to trigger the problem (opening a lot of pages with small icons on them seems to speed this up). As can be seen, intel_gpu_top shows quite high usage there. The machine was otherwise unloaded. During scrolling those percentages stayed about the same.
Created attachment 82128 [details]
Scroll with intel_gpu_top and no problems

Pretty much the same system, except the problem was not visible (restarted X session). As can be seen, the GPU is now basically unloaded during scrolling. Basically nothing other than the Firefox scroll was being done.

I should also mention I'm running Firefox nightly - but the same can be triggered with the stable version.
Hmm - after doing some more experiments - the high load on the GPU seems to be pretty much the thing which makes the problem visible. Another interesting thing is that it's usually enough to just 'reload' the very same page in Firefox, and the load goes down to ~1% and everything is ok. Also, the kernel perf top doesn't seem to be showing anything interesting. And a side note - when intel_gpu_top is running, the laptop actually gives off some whistling noise, not really pleasant for long-term usage ;)....
I have pushed a revised workaround to the best of my understanding:

commit 368c909b29758f996dbbdbec4d471df23f60bc04
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 6 22:27:44 2013 +0100

    sna/gen4: Restore the flush-every-vertex w/a

    This is an abhorrent workaround for some internal GPU brokenness. A
    slight refinement since earlier times is the recognition that 16 is a
    magic number limiting the maximum number of inflight rectangles through
    the GPU.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
From what I'm experiencing after the driver rebuild, I'd say it's now actually much simpler to trigger the flickering/high GPU usage. So unless the patch is missing something, the limitation to 16 doesn't help here on the T61.
Removed an older hack and had to reduce the max further. Bah, humbugs.
(In reply to comment #92)
> Removed an older hack and had to reduce the max further. Bah, humbugs.

OK, this time I'm definitely not able to reproduce the issue so far. Maybe it will take some extra time - hard to say now - but before, it took me just minutes to get the issue visible. intel_gpu_top now shows only a few lines busy at 1% during scroll.

The bad part is - while before I was getting 3.7Mchar/s in x11perf -aa10text - now it's like 1.3Mchar/s, so significantly slower.

So the question here would be - isn't the corruption based on triangle surface size? I.e. the GPU is able to process a lot of small ones - but has a bug with bigger ones?

Maybe a cheap test would be to flush when some longer triangle edge is pushed in?
I'm doing some experiments with MAX_FLUSH_VERTICES. When set to 64, even gnome-terminal starts to show weird pixels in some cases. Now I'm just playing with value 12 (gives ~2.1Mchars/s) and I'm hunting for visual problems.
And with 12 - flickering starts to appear as well.
(In reply to comment #93)
> The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> -aa10text - now it's like 1.3Mchar/s so significantly slower.

That's the sacrifice, we have to stop sending commands to the GPU and wait for it to complete those in flight (quite frequently). Or else new rectangles overwrite vertex entries still being used by later entries.

> So the question here would be - isn't the corruption based on triangle
> surface size ? So i.e. GPU is able to process a lot of small ones - but has
> bug with bigger ones ?
>
> Maybe a cheap test would be to flush when some longer triangle edge is
> pushed in ?

Not really, you have to predict when a VUE being used by the end of the pipeline will be overwritten by a new rectangle at the start of the pipeline. This is completely internal state - the primitive command we want to feed to the GPU can contain thousands of rectangles. Instead of counting rectangles, you want to start counting fragments (actually texel reads since that will be the ratelimiting factor) and flush if we queue up too much work for the GPU. If you also model how fast the gpu is retiring fragments so that you can predict how much work is in flight, you could further reduce flushes...

We still need to stop the gpu and wait for it to complete. No matter how finely you do it, it will sacrifice throughput and latency. It's a workaround. I still live in hope that after all these years we missed some configuration detail required for gen4.
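The fragment-counting idea described above could be sketched roughly as follows. This is only an illustration and not the SNA code: `struct flush_throttle`, `throttle_queue_rect` and the texel budget are hypothetical names, and it assumes (crudely) one texel read per destination pixel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping; none of these names exist in the driver. */
struct flush_throttle {
	uint64_t inflight_texels; /* estimated texel reads queued since the last flush */
	uint64_t texel_budget;    /* assumed amount of work we dare to leave in flight */
	unsigned flushes;         /* how many flushes have been forced so far */
};

/*
 * Account for one rectangle. Returns true if a flush had to be forced
 * before the rectangle could be queued.
 */
static bool throttle_queue_rect(struct flush_throttle *t,
				unsigned width, unsigned height)
{
	/* Assumption: cost is one texel read per destination pixel. */
	uint64_t cost = (uint64_t)width * height;
	bool flushed = false;

	assert(t->texel_budget > 0);

	/* Never flush an empty pipeline, even if one rectangle exceeds the budget. */
	if (t->inflight_texels != 0 &&
	    t->inflight_texels + cost > t->texel_budget) {
		/* A real driver would emit a pipeline flush here and wait for it. */
		t->inflight_texels = 0;
		t->flushes++;
		flushed = true;
	}

	t->inflight_texels += cost;
	return flushed;
}
```

Compared with a fixed rectangle count, many small glyphs would share one flush while a few large rectangles would each force their own. It still stalls the GPU, so it remains a workaround with the same throughput trade-off.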
(In reply to comment #96)
> (In reply to comment #93)
> > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > -aa10text - now it's like 1.3Mchar/s so significantly slower.
>
> That's the sacrifice, we have to stop sending commands to the GPU and wait
> for it complete those in flight (quite frequently). Or else new rectangles
> overwrite vertex entries still being used by later entries
>
> > So the question here would be - isn't the corruption based on triangle
> > surface size ? So i.e. GPU is able to process a lot of small ones - but has
> > bug with bigger ones ?
> >

But as I said before - if that were a plain hw defect, IMHO it would simply always appear. Instead it seems like it's working for a while - then 'something' happens - and flickering starts to appear - with (presumably) the same amount of texels/triangles/vertices - and then something may happen again, and the problem is gone for a while.

> Not really, you have to predict when a VUE being used by the end of the
> pipeline will be overwritten by a new rectangle at the start of the
> pipeline. This is completely internal state - the primitive command we want
> to feed to the GPU can contains thousands of rectangles. Instead of counting

Well, I've tried even 8 max triangles - and the error appeared after a while; so far '6' is magic.

> rectangles, you want to start counting fragments (actually texel reads since
> that will be the ratelimiting factor) and flush if we queue up too much work
> for the GPU. If you also model how fast the gpu is retiring fragments so

But if the same page is rendered with problems as well as without problems, then it doesn't look like texel reads are the problem; it rather looks like some kind of memory mapping/ordering issue.

Also, is there some explanation why intel_gpu_top is showing so much higher GPU usage when the flickering is visible?
Well, I should have waited a while before posting a comment about the magic value 6.

I'm now observing flickering with value 6 as well.

So yeah - it's more or less time related - and it takes more or less time until the problem becomes visible.

Also, is there an explanation why the max value 64 starts to cause problems with text rendering in gnome-terminal?

I.e. I'd have expected that if there were heavy pressure on the GPU - but in this case it just appears that random pixels start to be drawn instead of some letter - maybe some font cache corruption?
(In reply to comment #97)
> (In reply to comment #96)
> > (In reply to comment #93)
> > > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > > -aa10text - now it's like 1.3Mchar/s so significantly slower.
> >
> > That's the sacrifice, we have to stop sending commands to the GPU and wait
> > for it complete those in flight (quite frequently). Or else new rectangles
> > overwrite vertex entries still being used by later entries
> >
> > > So the question here would be - isn't the corruption based on triangle
> > > surface size ? So i.e. GPU is able to process a lot of small ones - but has
> > > bug with bigger ones ?
> > >
>
> But as I said before - if that would be plain hw defect - IMHO it would
> simply always appear - but it seems like it's working for a while - then
> 'something' happens - and flickering starts to appear - with (assumingly)
> same amount
> of texels/triangle/vertices - and than something again may happen,
> and the problem is gone for a while.

It does. You do not have quite as much control over your tests as you presume.

> > Not really, you have to predict when a VUE being used by the end of the
> > pipeline will be overwritten by a new rectangle at the start of the
> > pipeline. This is completely internal state - the primitive command we want
> > to feed to the GPU can contains thousands of rectangles. Instead of counting
>
> Well I've tried even 8 max triangles - and the error appeared after a while,
> so far '6' is magic.
>
> > rectangles, you want to start counting fragments (actually texel reads since
> > that will be the ratelimiting factor) and flush if we queue up too much work
> > for the GPU. If you also model how fast the gpu is retiring fragments so
>
> But in case the same page is rendered with problems as well as without
> problems,
> then it doesn't look like texel read is problem, it rather looks like some
> kind of memory mapping/ordering.

No. I did not say the texel reads were the problem, just an indicator as to how long the EU would execute any particular shader for a fragment. Also there is only a single sampler and many EU running many more threads, so contention will also play a factor in how long each fragment takes to process, and so how long buffers will be active for. Look more closely at what is going on; it is clear that the hardware is not tracking the lifetimes of its URB correctly.

> Also is there some explanation why intel_gpu_top is showing so much higher
> GPU usage when the flickering is visible ?

Other than that the flickering correlates with GPU activity?

(In reply to comment #98)
> Well I should wait a while before posting a comment about magic value 6.
>
> I'm now observing flickering with value 6 as well.
>
> So yeah - it's more or less time related - and it takes more or less time
> until the problem becomes visible.
>
> Also is there explanation with the max value 64 starts to make problems
> with text rendering in gnome terminal ?
>
> i.e. I'd have expected if there would be a large press on GPU - but in this
> case it just appear random pixel start to be drawn instead of some letter -
> maybe some font cache corruption ?

It's still the same bug.
(In reply to comment #99)
> (In reply to comment #97)
> > (In reply to comment #96)
> > > (In reply to comment #93)
> > > > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > > > -aa10text - now it's like 1.3Mchar/s so significantly slower.

> > But as I said before - if that would be plain hw defect - IMHO it would
> > simply always appear - but it seems like it's working for a while - then
> > 'something' happens - and flickering starts to appear - with (assumingly)
> > same amount
> > of texels/triangle/vertices - and than something again may happen,
> > and the problem is gone for a while.
>
> It does. You do not have quite as much control over your tests as you
> presume.

Well, I understand there are some weird things underneath - but when I have a page where scrolling up/down shows heavy intel_gpu_top usage and flickering of pictures is visible, and I then do nothing other than reload the page, my naive assumption is that the redrawn page is just using different memory buffers, while the number of graphical objects pushed to the GPU should be approximately the same. (I'm using plain good old xfce, so no fancy Gnome3 shell composite desktops...)

> No. I did not say the texel reads where the problem, just an indicator as to
> how long the EU would execute any particular shader for a fragment. Also
> there is only a single sampler and many EU running many more threads, so
> contention will also play a factor into how long each fragment takes to
> process, and so how long buffers will be active for. Look more closely at
> what it is going on, it is clearly that the hardware is not tracking
> lifetimes of its URB correctly.
>
> > Also is there some explanation why intel_gpu_top is showing so much higher
> > GPU usage when the flickering is visible ?
>
> Other than the flickering correlates with GPU activity?

Well, the same page can be scrolled either with e.g. 15% 'render busy' and no flickering, or with ~50% 'render busy' and visible problems.

I could make a debug build - but it would probably need some kind of 'signal support' to dump the needed data on demand instead of logging GBs of data all the time.

> > i.e. I'd have expected if there would be a large press on GPU - but in this
> > case it just appear random pixel start to be drawn instead of some letter -
> > maybe some font cache corruption ?
>
> It's still the same bug.

So maybe we could start with this - since MAX 64 exposes the problem very easily - it's enough to run gnome-terminal on my desktop.
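For the 'signal support' idea, a minimal standalone sketch of dumping state on demand could look like the one below. Everything here is an assumption for illustration: the function names are hypothetical, the choice of SIGUSR2 is arbitrary (a process such as the X server may already use the user signals for its own purposes), and the single counter stands in for whatever state would actually be worth dumping.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set from the signal handler, consumed from the normal code path. */
static volatile sig_atomic_t dump_requested;

static void on_dump_signal(int signo)
{
	(void)signo;
	dump_requested = 1; /* only set a flag; do the real work outside the handler */
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int debug_dump_init(void)
{
	return signal(SIGUSR2, on_dump_signal) == SIG_ERR ? -1 : 0;
}

/*
 * Call this from the hot path (e.g. once per batch). It is a cheap flag
 * check unless a dump was requested. Returns 1 if a dump was written.
 */
static int debug_dump_poll(FILE *out, unsigned rects_in_flight)
{
	assert(out != NULL);

	if (!dump_requested)
		return 0;

	dump_requested = 0;
	fprintf(out, "dump: %u rectangles in flight\n", rects_in_flight);
	return 1;
}
```

With something like this in a debug build, the dump would be triggered with `kill -USR2 <pid>` at the moment the flickering is visible, rather than logging continuously.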
It is probably worth mentioning that when I'm flipping between Firefox tabs, I can have tabs that scroll without any issue (and with low GPU usage), while flipping to other tabs and scrolling in them drives GPU usage to high levels and visual problems are present. So at the moment the problem appears, it's rather local to certain tabs. And as I said, it's often enough to reload the tab to temporarily fix the problem.

And while typing this BZ message (using MAX 6), I've noticed that the letter 'a' in the word 'certain' above was for a while drawn only by half - and after I had typed the whole sentence it was refreshed properly. (And I've not seen such behavior for quite a while... and I think I'm quite sensitive to things like this)...
Just received new debs of the intel driver (2.21.11 + snapshot taken 13-07-08, git dbb585) and libdrm (2.4.46 + snapshot 13-07-08, git f8f1f6). Furthermore, I also received a new 3.8 kernel from ubuntu a few days ago that may contain some drm changes (but I am not certain of what goes in since it is a distribution kernel, not mainline). Now, I think that I am experiencing a lot of this too (or maybe it is something else, please advise). Particularly when writing emails with thunderbird, all of a sudden I notice that some character is corrupted. This seems to happen even without explicitly scrolling, just while typing. Very strange.
One last note: it is not flickering. The characters that get corrupted stay corrupted for a long time (typically until scrolled out and rescrolled in).
Weird enough, even if the artifacts on characters are not transient and stay there for arbitrary long periods of time, provided that one does not affect the neighboring text (or temporarily scrolls them away), it is impossible to take screen snapshots of them. As soon as I press PrtScreen, the artifacts get fixed.
Created attachment 82227 [details]
Snapshot on the issue on characters

I have finally succeeded in taking a snapshot of the (very frequent) issue that I have with individual characters while editing text. This is from emacs. The issue may look localized and very minor, but being on text it is in fact extremely distracting and annoying. Just while editing this last sentence, it happened five or six times: entering a character makes another character somewhere else get corrupted (and stay so until another character is pressed, or the cursor is moved, or the text is scrolled out and back in). The weird thing is that entering a character here may cause a character corruption in some completely different place. And this is exactly the reason why it is absolutely a killer to productivity. It takes the eye away from where you are working. Furthermore, the issue seems to be much more frequent if I type normally than if, on purpose, I carefully and slowly enter characters one by one.
(In reply to comment #105)
> Created attachment 82227 [details]
> Snapshot on the issue on characters
>

Yep - that's exactly what I easily observe with gnome-terminal when I increase the max triangles from Chris's recent patches to 64 - it is present almost all the time. And it's only very occasional when the limit is just 6.
The quick answer is that if you ever see it with MAX_FLUSH_VERTICES set to 1, then it is a different issue. Please do be aware that all current kernels since 3.7 do have a coherency issue; the fix will not arrive before 3.10.1 (outside of drm-intel trees).
(In reply to comment #107)
> The quick answer is that if ever see it with MAX_FLUSH_VERTICES set to 1,
> then it is a different issue. Please do be aware that all current kernels
> since 3.7 do have a coherency issue, the fix will not arrive before 3.10.1
> (outside of drm-intel trees).

Could you please post a link here to the needed kernel commit, so I can check whether it fixes the issue for me? It would then also be easier to see which upstream vanilla kernel will have it.

It would also be good to remove the low triangle limitation.

Yes, memory coherency seems like a very good explanation for the problems I'm seeing.
Ubuntu has an experimental 64bit raring kernel (3.8) with the fix for the coherency bug. It works great for me for fixing the artifacts in the large bitmaps. It can be tested at http://kernel.ubuntu.com/~jsalisbury/lp1200126/ It delivers the fix for bug 65665. This is also very good news for this bug, since the fixed kernel makes it possible to test widely (at least on ubuntu) for this bug, decoupled from the effects of the other bug, 65665. Specifically, it is now possible to ignore some of my attachments for this bug, as they showed artifacts caused by the other bug (this is certainly the case for the sample libreoffice document that almost always shows issues).
Unfortunately, the artifacts on the individual characters are not gone with the fix to the coherency bugs.
I've been looking at drm-intel-next-queued/drm-intel-fixes - there are quite a few patches - but also a lot of reverts recently - so I'm pretty much confused what is actually the solution for gma965 in T61.
Created attachment 82369 [details]
Handy phone snapshot of artifacts on chars post drm/i915 fix: Only clear write-domains after a successful wait-seqno

This is an example of the artifacts on chars, still happening when running a kernel with the drm/i915 fix 'Only clear write-domains after a successful wait-seqno'. In this case we have a completely mangled 't' in firefox. The snapshot was taken with a mobile phone, so the monitor pixel grid is visible. In fact, it is now apparently impossible to take a snapshot with PrtScrn: as soon as a print screen is requested, the wrong char is always redrawn correctly.
I'm also affected by this character corruption bug. My hardware is a notebook with Intel 4500M (i915 driver). At first I thought it was caused by a hardware issue with my external screen, but the notebook screen shows the corrupted characters as well.

Sometimes when I scroll down to the end of a website with text and then keep scrolling (when the end is reached), about 2 characters per sentence are corrupted, and about every 1/10 second the position of the corruptions changes. The characters that were corrupted before look fine then, and different characters are corrupted.

What I have tried so far:
- Disable/enable xcompmgr (no effect)
- Use XFCE (no effect)
- Change driver options i915_enable_rc6, i915_enable_fbc, lvds_downclock (no effect)
- Use different fonts (no effect)
- Disable sna (works)
- Test firefox, libreoffice, chromium (all affected)

Version Info:
Kernel 3.10.1-1-ARCH
xf86-video-intel 2.21.12
Created attachment 82787 [details] Character m of the word "parameters" is corrupted
Created attachment 82788 [details] Character e of word "deletion" and n of "Indicate" is corrupted
Created attachment 82789 [details] Character i of word "application" and p of "bpp>=8" is corrupted
Added 'characters' to the bug title. I believe that this is now the major issue related with this bug and having 'character' in the title may help people having issues look here.
Incorrect rendering of some glyphs is still there as of yesterday's git snapshot (24/7/13). I'm reporting it because I read somewhere that newer released versions of the driver implemented a lower limit on MAX_FLUSH_VERTICES in order to reduce the impact of the bug, but I really see no difference. Yet, it may be the case that the reduction is only applied to the released driver and not to the devel version (I haven't checked the actual code).

A couple of notes:
- The way in which the glyph corruption appears is weird. Without any scrolling, just typing in characters at some place causes some random character to get corrupted somewhere else (e.g. maybe 1 line above, maybe 10 cm to the right, etc). Sometimes typing some more characters is enough to make the corruption disappear, sometimes it is not. I have not been able to determine whether the glyph that gets corrupted is the same one that is being typed (i.e., I type in an 'e' and somewhere an 'e' gets corrupted), but I suspect that this is not the case.
- The issue seems to be much less frequent if I type slowly.
- Once a glyph is corrupted, just 'selecting' some random character around it, but not necessarily very close to it (e.g., 15 chars to the left or to the right), seems to be enough to cause a redraw of the glyph that fixes its rendering. This seems somehow similar to how pressing 'PrtScrn' to try to get a snapshot causes a redraw that fixes the wrong glyphs, so that it is difficult to get a screenshot of the issue unless one relies on a camera.
I have been following this bug for a long, long time. So, I built the 3.10.3 plain vanilla kernel, which went stable as of today (Friday, July 26, 2013), and I can say that my crappy GM45 chip works fine with the dreaded "test.odg" resizing test. Seems like things are improving.

And if it is of any interest, here are the versions of the relevant sw stack:
x11-libs/libdrm-2.4.45
media-libs/mesa-9.1.2-r1
x11-base/xorg-server-1.13.4
x11-drivers/xf86-video-intel-2.21.12 (sna is the Accel)

Thanks for the continuous work on devices that were introduced in 2008.

ps: now, only if iwlwifi would get sorted out...
*** Bug 67377 has been marked as a duplicate of this bug. ***
*** Bug 68596 has been marked as a duplicate of this bug. ***
Here's some insight: test/render-copyarea is just a mirth of fail on my gm45.
Well, I could add here the output of some tests, i.e.:

$ ./render-composite-solid
Opened connection to :0 for testing.
Testing setting of single pixels (root): passed [1 iterations x 4096]
Testing area sets (root): passed [1 iterations x 4096]
Testing area fills (root): passed [1 iterations x 4096]
Testing setting of single pixels (child): passed [1 iterations x 4096]
Testing area sets (child): passed [1 iterations x 4096]
Testing area fills (child): passed [1 iterations x 4096]
Testing setting of single pixels (pixmap): passed [1 iterations x 4096]
Testing area sets (pixmap): passed [1 iterations x 4096]
Testing area fills (pixmap): passed [1 iterations x 4096]
Testing setting of single pixels (root): failed to set pixel (1465,296) to 00816152[9c816152], found 00000000 instead

Is it worth it? (git commit b14228fafb654fe7d8f8783475aa0c0ba87e4fea)
My experiments so far indicate that the errors only happen with rendering to the uncached frontbuffer, are not influenced by the number of rectangles in each primitive (though the failure does occur at different frequencies) and do not respond to adding extra MI_FLUSH. I don't think attaching examples of fail will help, as the next step will be trying to dissect the GPU state and work out why fragments fail inside a single shader.
Well, here is just another one:

$ ./basic-copyarea
Opened connection to :0 for testing.
Testing setting of single pixels (root): passed [1 iterations x 4096]
Testing area sets (root): passed [1 iterations x 4096]
Testing area fills (root, using pixmap source): passed [1 iterations x 4096]
Testing area fills (root, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (child): passed [1 iterations x 4096]
Testing area sets (child): passed [1 iterations x 4096]
Testing area fills (child, using pixmap source): passed [1 iterations x 4096]
Testing area fills (child, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (pixmap): passed [1 iterations x 4096]
Testing area sets (pixmap): passed [1 iterations x 4096]
Testing area fills (pixmap, using pixmap source): passed [1 iterations x 4096]
Testing setting of single pixels (root): passed [2 iterations x 2048]
Testing area sets (root): failed to set pixel (0,0) to 0072dc99 [0e72dc99], found 00000000 [00000000] instead
00000000 00000000 00000000 0e72dc99 0e72dc99 0e72dc99 5146daef 5146daef 5146daef 0e72dc99 0e72dc99 0e72dc99 5146daef 5146daef 5146daef 0e72dc99 0e72dc99 0e72dc99

Unsure if it's related to anything, and the test seems to be lengthy - but there seems to be some observable pattern.

Anyway - could I help here by testing some patch - or do you get the same errors yourself on your available hardware?
The tests are intentionally overkill - they are also intended to try and test handling of large batches, as well as generally stress the system. I hadn't noticed the basic-copyarea fail. That does look to be different. So far, the failure pattern had seemed to be a subspan doesn't get written (i.e. a single instance of one thread failed to execute correctly in the GPU, which I think could explain the general bug here.) basic-copyarea looks like a larger scale failure. And, yes, it does fail on my gm45 as well.
SNA has grown worse now. Since about six weeks, I've started to see corruption in the terminal (cursor not disappearing or not showing at all, more text missing until marking with the mouse,...). I've switched to UXA, nothing bad visible there.
With kernel-3.12.2 and current git, the situation now is much better. The corruptions happen less often (still often enough to be noticed though), and are usually confined to single characters, or only parts of single characters. While I'm writing this, I can notice what appears to be some slight corruption in the lines above which I've already written, but they are always gone immediately. Maybe today's just a lucky day, or the situation has improved. What's more, the issue with the cursor not disappearing or not showing that I mentioned in my previous post no longer exists, so SNA on this gen4 is usable again.
As a check that all problems are the same, can people who are still affected by this do a test with

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index a87af39..86c37d6 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -63,7 +63,7 @@
 #define NO_FILL_BOXES 0
 #define NO_VIDEO 0
 
-#define MAX_FLUSH_VERTICES 6
+#define MAX_FLUSH_VERTICES 1
 
 #define GEN4_GRF_BLOCKS(nreg) ((nreg + 15) / 16 - 1)
 

and see if that cures the last of the corruptions.
Just to report, maybe this isn't an intel driver bug at all. At home I have an ATI R270X (radeonsi driver) and Intel HD2000, at work an X4500, and I get hit by this bug on all of them. Small, single-char graphic corruption, easily triggered by scrolling through a window with lots of small text (tailing a log in debug mode in a terminal). But I also see it appear and disappear as I'm typing this comment. I'm attaching the screenshots (/dev/shm is corrupted on the black one).

@Chris, I'll try applying the patches right now and report back.
Created attachment 90707 [details] /dev/shm corruption
I don't want to speak too soon but it seems that the latest patch fixes the problem. First I tried the latest git version and it didn't help; I noticed corrupted fonts as soon as I logged on and started poking things in the terminal. After I applied the patch in comment #129 and restarted, no corruption at all (at least for now; it can usually be reproduced right after logging in).

Chris, any thoughts on why this is happening with the radeon opensource driver also?
(In reply to comment #132)
> Chris, any thoughts on why is this happening with radeon opensource driver
> also ?

I was hoping that it would be a bug in a common component, but in this case it sounds like they have a similar bug in managing internal GPU state.
(In reply to comment #132)
> I don't want to speak too soon but it seems that the latest patch fixes the
> problem.

Let it run for a day or so to be sure. The other thing that is worth checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4 etc. Setting it to 1 has a major impact on performance (we are roughly an order of magnitude slower at rendering than what can be expected).
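To get a feel for why the value matters so much, here is a back-of-the-envelope helper that counts how many flush points a batch needs under a given limit. It is only a sketch: `flushes_for_batch` is a hypothetical name, it treats the limit as "rectangles between flushes" as this thread does, and the real slowdown also depends on how long each stall takes, which this does not model.

```c
#include <assert.h>

/*
 * Number of groups (each followed by a flush and wait) needed to draw
 * nrects rectangles when at most max_inflight may be queued at once.
 * This is simply ceil(nrects / max_inflight).
 */
static unsigned flushes_for_batch(unsigned nrects, unsigned max_inflight)
{
	assert(max_inflight > 0);
	return nrects / max_inflight + (nrects % max_inflight != 0);
}
```

For a run of 1000 glyph rectangles this gives 1000 stalls with a limit of 1, 500 with 2, 167 with 6 and 63 with 16, which is why it is worth finding the largest value that is still stable.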
Maybe this bug could be tracked down more easily when the number of vertices is actually much higher - since in that case it seems to break almost immediately. I understand there is some 'maximum queue' size the GPU can handle - but the engine should be able to track the size of all commands and not outgrow it. So what else could break the ordering?

As it's very easy to trigger the problem with a higher max, maybe it could be used to deduce which part of the code unexpectedly touches the command queue? (But of course that's just my naive assumption about how the SNA driver works.)
The issue is that VUEs (a VUE is a memory slot used by the GPU for a vertex entry) are reused by a second thread before the first thread is complete, causing the first thread to generate invalid texture coordinates and corrupt rendering. That is a hardware read-write hazard bug (or at least I have not found any controls in the EU state to prevent it).
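A toy model of that hazard, for readers who want to see why capping the number of in-flight rectangles hides it. This is not how the hardware works in detail: the slot count `TOY_SLOTS` is an invented number (the thread does not state the real URB size), the function names are hypothetical, and the consumer here pessimistically only makes progress at a flush.

```c
#include <assert.h>

#define TOY_SLOTS 8u /* assumed number of vertex slots; purely illustrative */

/*
 * Consume every pending entry in [*tail, head). An entry is counted as
 * corrupt if its slot has since been rewritten by a newer entry.
 */
static unsigned toy_drain(const unsigned *slot_owner, unsigned *tail,
			  unsigned head)
{
	unsigned corrupt = 0;

	while (*tail < head) {
		if (slot_owner[*tail % TOY_SLOTS] != *tail)
			corrupt++;
		(*tail)++;
	}
	return corrupt;
}

/*
 * Queue nrects rectangles, flushing whenever max_inflight are pending.
 * Returns how many rectangles were consumed with clobbered vertex data.
 */
static unsigned toy_count_corrupt(unsigned nrects, unsigned max_inflight)
{
	unsigned slot_owner[TOY_SLOTS] = { 0 };
	unsigned corrupt = 0;
	unsigned head, tail = 0;

	assert(max_inflight > 0);

	for (head = 0; head < nrects; head++) {
		/* Driver-side workaround: flush and wait before queueing more. */
		if (head - tail >= max_inflight)
			corrupt += toy_drain(slot_owner, &tail, head);

		/* The slot is reused without checking whether it is still pending. */
		slot_owner[head % TOY_SLOTS] = head;
	}

	return corrupt + toy_drain(slot_owner, &tail, head);
}
```

In this model any limit up to the slot count is always clean, while a larger limit corrupts the oldest pending entries. On real hardware consumption overlaps with submission and its rate varies with the workload, so the safe limit is not a fixed number.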
(In reply to comment #136)
> The issue that the VUE (which is a memory slot used by the GPU for a vertex
> entry) are reused by a second thread before the first thread is complete,
> causing the first thread to generate invalid texture coordinates and corrupt
> rendering. That is a hardware read-write hazard bug (or at least I have not
> found any controls in the EU state to prevent it).

I don't think it's that easy to explain.

The typical problem in my case is that when I freshly start the X session, I do not observe any rendering bug. I need to use this machine for a while to start getting those errors. So if there were some 'easy to trigger hardware race', it should be reproducible all the time. But in my case it seems the probability increases heavily over time - maybe with the number of cached BO segments? Or maybe it depends on how the memory order is set - i.e. the problem is triggered when a certain memory read pattern starts to appear? Once the flicker starts, it happens all the time - and then suddenly it disappears for a while again.

So maybe some memory offset/alignment of the buffers with vertices needs to be prohibited?
Nope. The behaviour of this bug is very well characterised by the above analysis.
Well, I'm curious how this explains the following: I've taken current git, changed the value to '44', and I'm typing this text. I can see a lot of errors while typing - but these errors seem to be somehow limited to certain regions of the shown text. It's not destroyed everywhere - only in some particular piece and at a particular time. It's a kind of weird defect to observe, and it also happens only for some specific use-cases: i.e. my text editor 'fte', which is using X drawing code, has absolutely no issue - all characters are always correct.

But in firefox, typing this BZ comment, I can see a lot of changing letters (usually everything after the 10th letter on the line is weirdly changing, as if the font cache were not working correctly). Typing on the keyboard shows rendering bugs - as soon as the right mouse button is hit, everything is instantly redrawn correctly and the popup window is shown.

Another thing I notice: I have about 20 lines of email headers in thunderbird, and exactly only the 4th line is showing problems with letters. Even when I just move the mouse over the TB window, this text is being continuously modified - but everything else in that window is without any problem.

So if you say there is a hw bug - how is it that it can very easily render everything correctly? Why do only certain portions of text have distortions - why is it not randomly spread over the whole screen (which I'd have expected for timing collisions)?
Just to add some more comment on MAX_FLUSH_VERTICES - when set to valu 96 - it gives the highest throughput on x11perf -aa10text 3.7MChar/s - using any higher value doesn't make any different (so the max seems to be somewhere between <64-96] Also when this 3.7MChar is rendered - the parallel move of some other window on the screen becomes very slow - so engine gets overloaded ?? Or the Xserver generates such long queue of events it becomes so slow ?
(In reply to comment #134) > (In reply to comment #132) > > I don't want to speak too soon but it seems that the latest patch fixes the > > problem. > > Let it run for a day or so to be sure. The other thing that is worth > checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4 > etc. Setting it to 1 has a major impact on performance (we are roughly an > order of magnitude slower at rendering than what can be expected). Setting max_flush_vertices to 1 fixes the problem here, with the mentioned noticeable performance loss. IIRC, hibernating/resuming can accelerate the appearance of the bug, but I'm not quite sure about this, it might also be coincidence. I will now try to decrease the value from the default one to find the sweet spot.
*** Bug 71773 has been marked as a duplicate of this bug. ***
All tests with MAX_FLUSH_VERTICES greater than 2 reveal those single garbled glyphs. I'm still testing with a value of 2. There isn't much noticeable difference between 3, 4, 5. 00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller]) Subsystem: Dell Device 0276
Here's another bug report regarding this same issue: https://bugs.launchpad.net/bugs/1227569
Created attachment 91383 [details] gedit in openbox with 2.99.907 GM45 SNA

I am finding that GM45 SNA seems unusable with 2.99.907. git bisect pointed to the bad commit:

9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Dec 16 11:39:20 2013 +0000

    sna/gen4: Sacrifice performance to workaround render corruption

The problems are that lines of text keep disappearing (and reappearing) in gedit, and occasionally the screen becomes unresponsive for a short time and these messages appear in dmesg:

[ 1702.349954] [drm] stuck on render ring
[ 1702.349966] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x32a1000 ctx 0) at 0x32a1110
(In reply to comment #145) > [ 1702.349954] [drm] stuck on render ring > [ 1702.349966] [drm] capturing error event; look for more information in > /sys/class/drm/card0/error > [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside > bo (0x32a1000 ctx 0) at 0x32a1110 Attach the error state.
Created attachment 91388 [details] intel_error_decode output I saved the output from intel_error_decode but I didn't save the raw error data.
Worth trying just:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 637137e..dc80de3 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna,
 	if (rem <= 0) {
 		if (sna->render.vertex_offset) {
 			gen4_vertex_flush(sna);
-			if (gen4_magic_ca_pass(sna, op))
+			if (gen4_magic_ca_pass(sna, op)) {
+				OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
 				gen4_emit_pipelined_pointers(sna, op, op->op,
 							     op->u.gen4.wm_kernel);
+			}
 		}
 		OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
 		rem = MAX_FLUSH_VERTICES;

if you are happy that it reproduces reliably.
(In reply to comment #148) > Worth trying just: > > diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c > index 637137e..dc80de3 100644 > --- a/src/sna/gen4_render.c > +++ b/src/sna/gen4_render.c > @@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna, > if (rem <= 0) { > if (sna->render.vertex_offset) { > gen4_vertex_flush(sna); > - if (gen4_magic_ca_pass(sna, op)) > + if (gen4_magic_ca_pass(sna, op)) { > + OUT_BATCH(MI_FLUSH | > MI_INHIBIT_RENDER_CACHE_FLUSH); > gen4_emit_pipelined_pointers(sna, > op, op->op, > > op->u.gen4.wm_kernel); > + } > } > OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH); > rem = MAX_FLUSH_VERTICES; > > if you are happy that it reproduces reliably. This change did not fully solve the problem. One text file in gedit displayed as blank initially for a bit, although things then seemed fine. But the freeze with "[drm] stuck on render ring" happened during a second run of gtkperf - while the "GtkDrawingArea - Text" test was running. But I didn't spot any corrupted characters while running 2.99.907 with MAX_FLUSH_VERTICES set back to 6 - although I hadn't been running that for very long, and I only see a single garbled char occasionally, and I think they only appear in Firefox.
Created attachment 91389 [details] intel_error_decode output the decoded error that occurred while running gtkperf, 2.99.907 with the change in comment #148
(In reply to comment #149) > > But I didn't spot any corrupted characters while running 2.99.907 with > MAX_FLUSH_VERTICES set back to 6 And now I have spotted the single character corruption with 2.99.907 + MAX_FLUSH_VERTICES set to 6.
The bug occurs with MAX_FLUSH_VERTICES = 2 too, both in firefox and xterm, so unfortunately setting it to 1 seems to be the only solution here.
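For readers following the MAX_FLUSH_VERTICES experiments, here is a minimal standalone sketch of the bookkeeping behind that knob, loosely modelled on the `rem <= 0` check in the gen4_get_rectangles hunk quoted earlier in this thread. It is not driver code: `count_forced_flushes`, its parameters, and the "item" unit are hypothetical names for illustration, and it assumes the budget is simply refilled after every forced flush.

```c
#include <assert.h>

/*
 * Hypothetical model (not driver code): count how many forced flushes a
 * budget of max_per_flush items causes while emitting total_items items.
 */
unsigned count_forced_flushes(unsigned total_items, unsigned max_per_flush)
{
	unsigned flushes = 0;
	unsigned rem = max_per_flush;	/* items still allowed before a flush */
	unsigned i;

	assert(max_per_flush > 0);

	for (i = 0; i < total_items; i++) {
		if (rem == 0) {		/* budget exhausted: flush, then refill */
			flushes++;
			rem = max_per_flush;
		}
		rem--;			/* emit one item */
	}
	return flushes;
}
```

In this model, 12 items with a budget of 6 need one forced flush, while a budget of 1 needs 11. That is at least consistent in spirit with the large slowdown reported at MAX_FLUSH_VERTICES = 1, although the real driver also flushes for other reasons.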
(In reply to comment #145) > Created attachment 91383 [details] > gedit in openbox with 2.99.907 GM45 SNA > > I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed > to the bad commit as: > 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit > commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Mon Dec 16 11:39:20 2013 +0000 > > sna/gen4: Sacrifice performance to workaround render corruption > > The problems are lines of text keep disappearing (and reappearing) in gedit, > and the occasionally the screen becomes unresponsive for a short time and > these messages appear in dmesg: > > [ 1702.349954] [drm] stuck on render ring > [ 1702.349966] [drm] capturing error event; look for more information in > /sys/class/drm/card0/error > [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside > bo (0x32a1000 ctx 0) at 0x32a1110 Ok, I think I have this fixed: commit 9d8473c5d9489db439aca73f470bda29a22ebab6 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jan 7 13:43:35 2014 +0000 sna/gen4: Check for available batch space before restoring state after CA pass Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73348 References: https://bugs.freedesktop.org/show_bug.cgi?id=55500 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
(In reply to comment #153) > (In reply to comment #145) > > Created attachment 91383 [details] > > gedit in openbox with 2.99.907 GM45 SNA > > > > I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed > > to the bad commit as: > > 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit > > commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed > > Author: Chris Wilson <chris@chris-wilson.co.uk> > > Date: Mon Dec 16 11:39:20 2013 +0000 > > > > sna/gen4: Sacrifice performance to workaround render corruption Interestingly, this commit has increased the number of buggy characters I see, although I admit I'm overriding MAX_FLUSH_VERTICES to 9 (since I don't like the slowness with 1), so I've been living with a faster desktop and occasional wrong characters on the screen. With this commit, however, the visual problems seem to have increased to a level that makes reading noticeably difficult, which will probably force me to switch to the very slow '1' for MAX :(... (I'm not sure if it's the direct cause, since before this I had been using a roughly 3-week-old version of the git tree.)
I think a nice illustration is that before, with MAX 9, e.g. gnome-terminal running top did not really render broken characters, whereas it now shows a lot of messed-up characters. On the other hand, MAX 1 now seems smoother than before, so except for some dramatic drops in performance in benchmark tools like 'x11perf -aa10text' it seems to be quite usable.
Bad news: I'm now in fact able to spot badly rendered characters in Firefox with MAX 1 as well. So something went wrong....
Something worth experimenting with is detuning the GPU, e.g.:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index e239c21..bc6af68 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -52,7 +52,7 @@
  */
 #define FORCE_SPANS 0
 #define FORCE_NONRECTILINEAR_SPANS -1
-#define FORCE_FLUSH 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
+#define FORCE_FLUSH 0 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
 
 #define NO_COMPOSITE 0
 #define NO_COMPOSITE_SPANS 0
@@ -74,7 +74,7 @@
 #define URB_CS_ENTRIES 0
 
 #define URB_VS_ENTRY_SIZE 1
-#define URB_VS_ENTRIES 32
+#define URB_VS_ENTRIES 16
 
 #define URB_GS_ENTRY_SIZE 0
 #define URB_GS_ENTRIES 0
@@ -83,7 +83,7 @@
 #define URB_CL_ENTRIES 0
 
 #define URB_SF_ENTRY_SIZE 2
-#define URB_SF_ENTRIES 64
+#define URB_SF_ENTRIES 1
 
 /*
  * this program computes dA/dx and dA/dy for the texture coordinates along
@@ -93,9 +93,9 @@
 #define SF_KERNEL_NUM_GRF 16
 #define PS_KERNEL_NUM_GRF 32
 
-#define GEN4_MAX_SF_THREADS 24
-#define GEN4_MAX_WM_THREADS 32
-#define G4X_MAX_WM_THREADS 50
+#define GEN4_MAX_SF_THREADS 8
+#define GEN4_MAX_WM_THREADS 16
+#define G4X_MAX_WM_THREADS 16
 
 static const uint32_t ps_kernel_packed_static[][4] = {
 #include "exa_wm_xy.g4b"
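To put the URB numbers in the detuning patch into perspective, the following is a rough standalone sketch of how the per-stage URB allocations add up. It assumes the sections are laid out back to back in the order VS, GS, CLIP, SF, CS, as the running `urb_*_end` totals in gen4_emit_urb (quoted elsewhere in this thread) suggest. `struct urb_section` and `urb_total` are hypothetical names, and the CLIP and CS entry sizes are not shown in the thread, so they are set to 0 here (they do not matter while those stages have 0 entries).

```c
#include <assert.h>

/* Hypothetical helper type: one URB section = entries * entry_size. */
struct urb_section {
	unsigned entries;
	unsigned entry_size;
};

/*
 * Sum the sections in order and return the end offset of the last one,
 * mirroring the running urb_*_end totals. Illustration only.
 */
unsigned urb_total(const struct urb_section *s, unsigned n)
{
	unsigned end = 0;
	unsigned i;

	assert(s != 0);

	for (i = 0; i < n; i++)
		end += s[i].entries * s[i].entry_size;
	return end;
}

/* Order: VS, GS, CLIP, SF, CS. Values taken from the patch context above. */
const struct urb_section urb_stock[5] = {
	{ 32, 1 }, { 0, 0 }, { 0, 0 }, { 64, 2 }, { 0, 0 },
};
const struct urb_section urb_detuned[5] = {
	{ 16, 1 }, { 0, 0 }, { 0, 0 }, { 1, 2 }, { 0, 0 },
};
```

Under these assumptions the stock configuration uses 160 URB units and the detuned one 18, both well inside the limit of 256 that the driver asserts on. The detuning is therefore about reducing parallelism, not about fitting into the URB.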
Created attachment 91613 [details] Grabbed snapshot with patch from comment 157 As can be seen, yes, with little effort I'm able to capture broken characters with the given patch compiled in. All that is needed is to start editing the dialog in Firefox and add/remove random characters in various places.
After updating to current git, the situation has become worse and I now see more visible corruptions even with MAX_FLUSH_VERTICES=1. And I did not have to wait some hours for them to appear as before, they are visible right after starting X.
Typical, it appeared stable through my firefox testing, but if you try reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions to disappear again.
(In reply to comment #160) > Typical, it appeared stable through my firefox testing, but if you try > reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions > to disappear again. Yep, correct: reverting this commit makes MAX_FLUSH_VERTICES 1 produce correct rendering again, even though it's again noticeably slower. So now it's clear why I considered MAX 1 usable before (in my comment 155). So yes, revert & MAX 1 works again, but it's quite slow. Does this make it any easier to deduce which operation causes such a bad memory interaction? It seems the 'synchronization' is really needed only at very specific moments, where the GPU is producing memory corruption on the screen, but how do we catch which moment? I still suspect the memory layout of those memory objects, since when I see corruption it is usually in specific places (e.g. when editing this firefox input widget, only certain characters at certain positions are rendered with errors). How can I try increasing the memory alignment (e.g. each object only at a 16KB boundary)?
I have updated to current git and reverted commit b7565a26401e283df94b68019e8093f8104428f4 and left the MAX_FLUSH_VERTICES set to 1, but now instead of glyph corruption I notice some icons are corrupted similar to the glyphs before. Example: In thunderbird, I hover over a toolbar icon and when the mouse pointer leaves the icon, it is corrupted. Or it gets corrupted when I hover over the icon. In both cases, the corruptions disappear a short time after, when something else gets updated on the screen. Strangely, I have not seen glyph corruption yet.
I have been experimenting with various numbers in the code in comment #157 without really discovering anything useful. I think any change that slows things down decreases the chances of observing any corruption, but might not necessarily fix the problem completely. With current git + MAX_FLUSH_VERTICES=6, gnome-terminal suffers from the text corruption, firefox is bad, but KDE4 konqueror and MiniBrowser from webkitgtk3 seem to display the same webpages perfectly fine. The other thing I have noticed is that under this setup, I can make gedit suffer from the text corruption if I set an italic (and not monospaced) font - in that case, almost every line of the text file displayed will show some corruption immediately after changing to the italic font.
Created attachment 92240 [details] Text with errors Recently I've been noticing somewhat more 'weird' behavior, which might be related to my temporary use of nightly builds of Mozilla (since the rawhide version got somewhat broken). What is weird in this image is that the text was badly rendered AND it remained visible even when refreshed. Even a small scroll up/down left the text as is; it only helped to scroll the text out of the window and back. So I now assume the rendering happens to some off-screen memory, which is then transferred back to the screen with errors, and a refresh doesn't help in this case (which is a bit different from what happens when typing text into an input box). Another, probably unrelated, comment: even when 'glxgears' is running in parallel, the visual errors still happen while typing this text. I'd have expected glxgears to 'overfill' the GPU queue (since it's rendering ~1300FPS with the default window); the flushes are probably completely different too. And there is no observable rendering error in the gears window; only Firefox seems to expose them.
I am wondering if some extra flushes are needed in regard to what the G45 PRM PDFs say about the BLT (section 8.6, vol 1b p. 170). git + this gives only a moderate amount of corrupt rendering:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index e239c21..f150e5b 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -63,7 +63,7 @@
 #define NO_FILL_BOXES 0
 #define NO_VIDEO 0
 
-#define MAX_FLUSH_VERTICES 1 /* was 6, https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
+#define MAX_FLUSH_VERTICES 12 /* was 6, https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
 
 #define GEN4_GRF_BLOCKS(nreg) ((nreg + 15) / 16 - 1)
 
@@ -571,26 +571,28 @@ static void gen4_emit_vertex_buffer(struct sna *sna,
 inline static void
 gen4_emit_pipe_flush(struct sna *sna)
 {
-#if 1
+#if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
 	OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
 inline static void
 gen4_emit_pipe_break(struct sna *sna)
 {
-#if 1
+#if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(0);
+	OUT_BATCH(GEN4_PIPE_CONTROL_TC_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
 	OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
@@ -599,11 +601,12 @@ gen4_emit_pipe_invalidate(struct sna *sna)
 {
 #if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH | GEN4_PIPE_CONTROL_IS_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
-	OUT_BATCH(MI_FLUSH);
+	OUT_BATCH(MI_FLUSH); /* | MI_STATE_INSTRUCTION_CACHE_FLUSH */
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
@@ -781,7 +784,10 @@ gen4_emit_urb(struct sna *sna)
 	urb_cl_end = urb_gs_end + URB_CL_ENTRIES * URB_CL_ENTRY_SIZE;
 	urb_sf_end = urb_cl_end + URB_SF_ENTRIES * URB_SF_ENTRY_SIZE;
 	urb_cs_end = urb_sf_end + URB_CS_ENTRIES * URB_CS_ENTRY_SIZE;
-	assert(urb_cs_end <= 256);
+	if (sna->kgem.gen >= 045)
+		assert(urb_cs_end <= 384);
+	else
+		assert(urb_cs_end <= 256);
 
 	while ((sna->kgem.nbatch & 15) > 12)
 		OUT_BATCH(MI_NOOP);
@@ -1623,6 +1629,7 @@ gen4_render_composite_done(struct sna *sna,
 		kgem_bo_destroy(&sna->kgem, op->src.bo);
 
 	sna_render_composite_redirect_done(sna, op);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2154,6 +2161,7 @@ gen4_render_composite_spans_done(struct sna *sna,
 		kgem_bo_destroy(&sna->kgem, op->base.src.bo);
 
 	sna_render_composite_redirect_done(sna, &op->base);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2500,6 +2508,7 @@ fallback_blt:
 	gen4_vertex_flush(sna);
 	sna_render_composite_redirect_done(sna, &tmp);
 	kgem_bo_destroy(&sna->kgem, tmp.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 	return true;
 
 fallback_tiled_dst:
@@ -2535,6 +2544,7 @@ gen4_render_copy_done(struct sna *sna, const struct sna_copy_op *op)
 {
 	if (sna->render.vertex_offset)
 		gen4_vertex_flush(sna);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2736,6 +2746,7 @@ gen4_render_fill_boxes(struct sna *sna,
 
 	gen4_vertex_flush(sna);
 	kgem_bo_destroy(&sna->kgem, tmp.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 	return true;
 }
 
@@ -2776,6 +2787,7 @@ gen4_render_fill_op_done(struct sna *sna, const struct sna_fill_op *op)
 	if (sna->render.vertex_offset)
 		gen4_vertex_flush(sna);
 	kgem_bo_destroy(&sna->kgem, op->base.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool

I've also tried setting "Render Cache Operational Flush Enable" of the Cache_Mode_0 register with intel_reg_write; this made no difference. I was also wondering if firefox is particularly bad because it uses its own old version of cairo, which seems to be version 1.9.8 plus lots of patches.
I've not yet tested the patch from comment 165, but with regard to Firefox and Cairo: I'm also seeing errors in e.g. pidgin, where status icons occasionally look damaged. My rawhide has these related packages:

cairo-1.13.1-0.1.git337ab1f.fc21.x86_64
cairo-gobject-1.13.1-0.1.git337ab1f.fc21.x86_64
pidgin-2.10.7-9.fc21.x86_64
xorg-x11-server-Xorg-1.15.0-1.fc21.x86_64
libX11-1.6.1-1.fc20.x86_64

ldd /bin/pidgin
	linux-vdso.so.1 => (0x00007fff323ad000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007f6f151af000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007f6f14f9c000)
	libXss.so.1 => /lib64/libXss.so.1 (0x00007f6f14d98000)
	libSM.so.6 => /lib64/libSM.so.6 (0x00007f6f14b90000)
	libICE.so.6 => /lib64/libICE.so.6 (0x00007f6f14973000)
	libgtkspell.so.0 => /lib64/libgtkspell.so.0 (0x00007f6f1476c000)
	libgtk-x11-2.0.so.0 => /lib64/libgtk-x11-2.0.so.0 (0x00007f6f140e7000)
	libgdk-x11-2.0.so.0 => /lib64/libgdk-x11-2.0.so.0 (0x00007f6f13e25000)
	libpangocairo-1.0.so.0 => /lib64/libpangocairo-1.0.so.0 (0x00007f6f13c18000)
	libatk-1.0.so.0 => /lib64/libatk-1.0.so.0 (0x00007f6f139f4000)
	libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f6f136ce000)
	libgdk_pixbuf-2.0.so.0 => /lib64/libgdk_pixbuf-2.0.so.0 (0x00007f6f134aa000)
	libgio-2.0.so.0 => /lib64/libgio-2.0.so.0 (0x00007f6f1312b000)
	libpangoft2-1.0.so.0 => /lib64/libpangoft2-1.0.so.0 (0x00007f6f12f15000)
	libpango-1.0.so.0 => /lib64/libpango-1.0.so.0 (0x00007f6f12cca000)
	libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f6f12a8e000)
	libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f6f127e9000)
	libpurple.so.0 => /lib64/libpurple.so.0 (0x00007f6f124af000)
	libdbus-glib-1.so.2 => /lib64/libdbus-glib-1.so.2 (0x00007f6f12287000)
	libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x00007f6f1203e000)
	libfarstream-0.1.so.0 => /lib64/libfarstream-0.1.so.0 (0x00007f6f11e29000)
	libgstbase-0.10.so.0 => /lib64/libgstbase-0.10.so.0 (0x00007f6f11bd5000)
	libgstinterfaces-0.10.so.0 => /lib64/libgstinterfaces-0.10.so.0 (0x00007f6f119c2000)
	libgstreamer-0.10.so.0 => /lib64/libgstreamer-0.10.so.0 (0x00007f6f116d9000)
	libgobject-2.0.so.0 => /lib64/libgobject-2.0.so.0 (0x00007f6f11487000)
	libgmodule-2.0.so.0 => /lib64/libgmodule-2.0.so.0 (0x00007f6f11282000)
	libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x00007f6f11080000)
	libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x00007f6f10d54000)
	libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f6f109ea000)
	libidn.so.11 => /lib64/libidn.so.11 (0x00007f6f107b7000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f6f104b1000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x00007f6f10296000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f6f1007b000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6f0fe5d000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f6f0fa95000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f6f0f874000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f6f0f670000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f6f0f46a000)
	libenchant.so.1 => /lib64/libenchant.so.1 (0x00007f6f0f25e000)
	libXfixes.so.3 => /lib64/libXfixes.so.3 (0x00007f6f0f058000)
	libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f6f0ee4d000)
	libXinerama.so.1 => /lib64/libXinerama.so.1 (0x00007f6f0ec4a000)
	libXi.so.6 => /lib64/libXi.so.6 (0x00007f6f0ea3a000)
	libXrandr.so.2 => /lib64/libXrandr.so.2 (0x00007f6f0e82f000)
	libXcursor.so.1 => /lib64/libXcursor.so.1 (0x00007f6f0e624000)
	libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f6f0e421000)
	libXdamage.so.1 => /lib64/libXdamage.so.1 (0x00007f6f0e21d000)
	libharfbuzz.so.0 => /lib64/libharfbuzz.so.0 (0x00007f6f0dfc8000)
	libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f6f0dd1a000)
	libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f6f0daf7000)
	libpng16.so.16 => /lib64/libpng16.so.16 (0x00007f6f0d8c4000)
	libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f6f0d6c0000)
	libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f6f0d4b6000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f6f0d2a0000)
	libGL.so.1 => /lib64/libGL.so.1 (0x00007f6f0d037000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f6f0ce2f000)
	libffi.so.6 => /lib64/libffi.so.6 (0x00007f6f0cc26000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f6f0ca02000)
	libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f6f0c7d7000)
	liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f6f0c5b2000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6f15521000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007f6f0c3ad000)
	libgraphite2.so.3 => /lib64/libgraphite2.so.3 (0x00007f6f0c191000)
	libX11-xcb.so.1 => /lib64/libX11-xcb.so.1 (0x00007f6f0bf8e000)
	libxcb-dri2.so.0 => /lib64/libxcb-dri2.so.0 (0x00007f6f0bd89000)
	libxcb-xfixes.so.0 => /lib64/libxcb-xfixes.so.0 (0x00007f6f0bb82000)
	libxcb-shape.so.0 => /lib64/libxcb-shape.so.0 (0x00007f6f0b97d000)
	libgbm.so.1 => /lib64/libgbm.so.1 (0x00007f6f0b775000)
	libwayland-client.so.0 => /lib64/libwayland-client.so.0 (0x00007f6f0b567000)
	libwayland-server.so.0 => /lib64/libwayland-server.so.0 (0x00007f6f0b355000)
	libglapi.so.0 => /lib64/libglapi.so.0 (0x00007f6f0b12e000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007f6f0af1c000)
	libdrm.so.2 => /home/kabi/soft/glx-test/lib/libdrm.so.2 (0x00007f6f0ad0f000)
	libxcb-glx.so.0 => /lib64/libxcb-glx.so.0 (0x00007f6f0aaf5000)
	libXxf86vm.so.1 => /lib64/libXxf86vm.so.1 (0x00007f6f0a8ee000)
	libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f6f0a687000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6f0a470000)

So while Firefox is the easiest place to see those errors (it's always enough to play for a while with some input box), it's probably not tied to its built-in version of the Cairo library.
The image corruptions are also visible on the file/folder icons in thunar. I still have no glyph corruptions any more.
Created attachment 92287 [details] [review] Always force a GPU flush between operations Can you please try this patch against git and see if that improves things - except for performance?
OK, after a very quick & light check: at least in the Firefox input window I do not observe any rendering bugs (which were pretty simple to trigger before). (Lenovo T61 + git + patch from comment 168) Though the performance decrease is noticeable, and the setting for MAX_FLUSH... also becomes irrelevant.
Created attachment 92288 [details] bug f23ab963c4f4ada2051588dfc85264aa2798dbf7 + that patch and I'm seeing corruption. Using google-chrome and letters in url bar or window title bar sometimes get corrupted and then get fixed. Also seeing the problem in gimp menus. Some letters get corrupted and fixed.
(In reply to comment #170) > Created attachment 92288 [details] > bug > > f23ab963c4f4ada2051588dfc85264aa2798dbf7 + that patch and I'm seeing > corruption. Using google-chrome and letters in url bar or window title bar > sometimes get corrupted and then get fixed. > > Also seeing the problem in gimp menus. Some letters get corrupted and fixed.

This could be related to my hw:

[219745.896] (II) intel(0): SNA compiled with assertions enabled
[219745.898] (--) intel(0): Integrated Graphics Chipset: Intel(R) 965GM
[219745.898] (--) intel(0): CPU: x86-64, sse2, sse3, ssse3
[219745.898] (**) intel(0): Depth 24, (--) framebuffer bpp 32

And Arkadiusz's hw:

[236338.852] (II) intel(0): SNA compiled with assertions enabled
[236338.853] (--) intel(0): Integrated Graphics Chipset: Intel(R) GM45
[236338.853] (--) intel(0): CPU: x86-64, sse2, sse3, ssse3, sse4.1
[236338.853] (==) intel(0): Depth 24, (--) framebuffer bpp 32

On my 965GM I've not yet seen any error....
(In reply to comment #168) > Created attachment 92287 [details] [review] [review] > Always force a GPU flush between operations > > Can you please try this patch against git and see if that improves things - > except for performance? Current git (2.99.907-23-gf23ab96) without any other changes: still a few corrupt characters in gedit with italic font Current git + patch in attachment 92287 [details] [review]: still the same - has a few corrupt characters in gedit with italic font, displaying files full of text
Update after some new commits: 4c7b183fd21b461f9f18662c3b9d9732b6bef13d + Always patch now gives me broken text lines in the Thunderbird window. It's now enough just to move the mouse over the text: the text keeps changing and some letters never render correctly. Checking back f23ab963c4f4ada2051588dfc85264aa2798dbf7 + Always patch: correct rendering again. This relates to GMA965.
Created attachment 92512 [details] [review] Always force a GPU flush between operations Updated always flush patch that passes Arkadiusz's stress test. * sobs
Created attachment 92770 [details] not quite random corruption example With the workarounds disabled, can anything be deduced from the text or pixmap corruption not seeming to be completely random? Italic text seems to be particularly badly hit, and it seems to vary with the font and size. But in the attached screenshot, some of the lines of text never showed any corruption, while others usually showed some corruption, the corruption changing on switching focus to another window and back. Size 10 italic Cantarell seemed particularly badly hit, with even lines of just repeated c or d characters showing corruption (if longer than 18 letters), but other fonts don't usually show any corruption on a line of text filled with a single repeating character. For pixmap corruption, the printer icon in the gedit toolbar seems to get turned into grey vertical bars more often than any other icons get corrupted.
(In reply to comment #175) > Created attachment 92770 [details] > not quite random corruption example > > With the workarounds disabled, can anything be deduced from the text or > pixmap corruption not seeming to be completely random? It's not entirely random. What I have noticed is that one or more vertices are corrupt. Sometimes you see the correct content but skewed, which is what happens if you just move one of the vertices (its texture coordinates). With two or more distorted coordinates, we can be sampling from anywhere within the texture - which can mean that we see the wrong glyph or a highly distorted composite of several glyphs (since all the active glyphs are stored in a single texture).
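To illustrate why a single bad vertex shows up as "skewed" or "wrong glyph" rather than as random noise, here is a small sketch of affine texture-coordinate interpolation across a rectangle described by three corner vertices. It only models the arithmetic (the dA/dx and dA/dy gradients mentioned in the SF kernel comment quoted in an earlier patch); the choice of corners, `struct vertex`, and `sample_rect` are assumptions made for illustration and are not taken from the driver.

```c
#include <assert.h>

struct vertex {
	double x, y;	/* screen position */
	double u, v;	/* texture (glyph atlas) coordinates */
};

struct texcoord {
	double u, v;
};

/*
 * Interpolate (u, v) at screen point (x, y) from three corners of an
 * axis-aligned rectangle: bottom-right, bottom-left and top-left.
 * Hypothetical helper, assuming a purely affine mapping.
 */
struct texcoord sample_rect(struct vertex br, struct vertex bl,
			    struct vertex tl, double x, double y)
{
	double w = br.x - bl.x;		/* rectangle width */
	double h = bl.y - tl.y;		/* rectangle height */
	struct texcoord t;

	assert(w != 0 && h != 0);

	/* gradients along x come from the bottom edge, along y from the left edge */
	t.u = tl.u + (br.u - bl.u) / w * (x - tl.x) + (bl.u - tl.u) / h * (y - tl.y);
	t.v = tl.v + (br.v - bl.v) / w * (x - tl.x) + (bl.v - tl.v) / h * (y - tl.y);
	return t;
}

/* Convenience wrapper for the worked example in the usage note. */
double example_u(double br_u)
{
	struct vertex br = { 108, 66, br_u, 16 };
	struct vertex bl = { 100, 66, 32, 16 };
	struct vertex tl = { 100, 50, 32, 0 };

	return sample_rect(br, bl, tl, 104, 58).u;
}
```

With an 8x16 glyph stored at u = 32..40 in the atlas and drawn at x = 100..108, the centre pixel samples u = 36 when all three vertices are intact (`example_u(40)`). If only the bottom-right vertex carries a wrong u of 96, the same pixel samples u = 64 (`example_u(96)`), which lies inside a different glyph of the shared texture. One bad coordinate is thus enough to stretch the sampling across neighbouring glyphs.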
Running with all workarounds disabled, this change doesn't fix anything nor seem to make any difference, but anyway: Shouldn't the cache flush bits be in dword 0 for gen4 GEN4_PIPE_CONTROL? Maybe gen5 also?

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 1d164b6..894418b 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -575,8 +575,10 @@ inline static void
 gen4_emit_pipe_flush(struct sna *sna)
 {
 #if 1
-	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL |
+		  GEN4_PIPE_CONTROL_WC_FLUSH |
+		  (4 - 2));
+	OUT_BATCH(0);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
@@ -601,8 +603,10 @@ inline static void
 gen4_emit_pipe_invalidate(struct sna *sna)
 {
 #if 0
-	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL |
+		  GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH |
+		  (4 - 2));
+	OUT_BATCH(0);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
Created attachment 93235 [details] [review] sna/gen4,5: Fix setting pipe control cache flush bits Only the one in gen4_emit_pipe_flush is in an enabled part of the code anyway.
Nevertheless it was a good catch. commit 1cbc59a917e7352fc68aa0e26b1575cbd0ceab0d Author: Edward Sheldrake <ejsheldrake@gmail.com> Date: Mon Feb 3 09:34:33 2014 +0000 sna/gen4,5: Fix setting pipe control cache flush bits Cache flush bits are on dword 0, not 1, on gen4 and gen5. Also texture cache invalidate is only available from Cantiga onwards.
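As a compact restatement of what this fix changes, the sketch below builds the four-dword packet with the flush bits OR-ed into dword 0, and only requests the texture cache flush from Cantiga onwards, using the octal `gen >= 045` comparison that appears in the driver patches in this thread. The `EX_*` constants are placeholder encodings chosen for illustration, not the real hardware values (see the driver headers for those), and `emit_pipe_control` is a hypothetical stand-in for the OUT_BATCH sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder encodings for illustration only, NOT the real hardware values. */
#define EX_PIPE_CONTROL			(0x7a00u << 16)
#define EX_PIPE_CONTROL_WC_FLUSH	(1u << 12)
#define EX_PIPE_CONTROL_TC_FLUSH	(1u << 10)

/*
 * Fill a 4-dword pipe control packet. All flag bits live in dword 0,
 * together with the (4 - 2) length field as written in the thread's
 * patches; the remaining dwords stay zero.
 * 'gen' uses the driver's octal convention, e.g. 040 or 045.
 */
void emit_pipe_control(uint32_t out[4], unsigned gen)
{
	out[0] = EX_PIPE_CONTROL |
		 EX_PIPE_CONTROL_WC_FLUSH |
		 (gen >= 045 ? EX_PIPE_CONTROL_TC_FLUSH : 0) |
		 (4 - 2);
	out[1] = 0;
	out[2] = 0;
	out[3] = 0;
}
```

Before the fix the flags were written into dword 1, which is presumably why they had no effect. Note that `045` is an octal literal (decimal 37), so comparing against a decimal 45 would be a different test.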
Created attachment 93326 [details] icon corruption Latest git (2.99.909-7-g1cbc59a) has icon corruption, but all text is fine.
Sigh. Probably,

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 1580707..ba9c9bc 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -602,6 +602,7 @@ gen4_emit_pipe_break(struct sna *sna)
 inline static void
 gen4_emit_pipe_invalidate(struct sna *sna)
 {
+#if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL |
 		  GEN4_PIPE_CONTROL_WC_FLUSH |
 		  (sna->kgem.gen >= 045 ? GEN4_PIPE_CONTROL_TC_FLUSH : 0) |
@@ -609,6 +610,9 @@ gen4_emit_pipe_invalidate(struct sna *sna)
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
+#else
+	OUT_BATCH(MI_FLUSH);
+#endif
 }
(In reply to comment #181) Without #181 patch I had flickering like: http://ixion.pld-linux.org/~arekm/intel-flicker.mov (best viewed from local fs) With the patch flickering is gone. "GM45" - synonym for all bad words :/
Pushed the flushes once again. Hopefully we are corruption free once more. commit fc001615ff78df4dab6ee0d5dd966b723326c358 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Feb 4 10:36:21 2014 +0000 sna/gen4: Disable use of pipecontrol invalidates again One day, just not today, we may make gen4 work correctly, efficiently and fast. Today, we can barely pick one. References: https://bugs.freedesktop.org/show_bug.cgi?id=55500 Signed-off-by: Chris Wilson <chris@chris-wilson.co.u
Created attachment 96026 [details] GTK+2 fonts corruption As per git, the commit is applied in version 2.99.910, which is the version I have installed. But the problem is still there, see the screenshot. Interestingly, the fonts are heavily corrupted in GTK+2 apps. Qt/EFL apps are fine. GTK+3 apps are mainly OK, but gedit has slightly corrupted icons, though not fonts.
(In reply to comment #184) > Created attachment 96026 [details] > GTK+2 fonts corruption > > as per git, the commit is applied for the version 2.99.910. I have this > version installed. > But the problem is still there, see the screenshot. Interestingly, the fonts > are heavily corrupted in GTK+2 apps. QT/efl apps are fine. GTK+3 apps mainly > ok, but gedit have icons corrupted a bit, but not fonts. Please provide your Xorg.0.log to confirm this is the same bug.
Created attachment 96027 [details] Xorg.log
Ah, would you happen to have an uneven amount of memory installed?
(In reply to comment #187) > Ah, would you happen to have an uneven amount of memory installed? Probably. Is this an unsupported configuration?
(In reply to comment #187) > Ah, would you happen to have an uneven amount of memory installed? I too have an uneven amount of memory installed (7 GiB). It's an old workhorse for office usage only, so dual-channel doesn't matter.
(In reply to comment #188) > (In reply to comment #187) > > Ah, would you happen to have an uneven amount of memory installed? > > Probably. Is this and unsupported configuration? We have a known issue in that we don't detect the swizzling correctly, and so we may end up with corruption if objects are paged out from memory. Typically you see the effects after running for some time (so that memory pressure takes effect) or after resume. See bug 28813 and bug 45092.
true. Thanks a lot, Chris!
*** Bug 76804 has been marked as a duplicate of this bug. ***
Dear all, since the 4.4-rc series I am again seeing this kind of corruption, especially after returning from sleep (suspend to RAM). This happens on a Sony VAIO Pro:

00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Sony Corporation Device 90b6
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at f5c00000 (64-bit, non-prefetchable) [size=4M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915

Switching to 4.3 resolves the issue. Software is up-to-date Debian/sid, that is xorg 7.7 and driver-intel 2.99.917. Is this a known problem on the new kernels? Thanks, Norbert
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-intel/issues/13.