55500 – [sna gen4 w/a] corrupt rendering (including wrong rendering of characters and flickering on redraw)

Status:	RESOLVED MOVED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	Other All

Importance:	low normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

Duplicates (8):	58139 59351 59685 60284 62302 67377 68596 71773

Reported:	2012-10-01 14:44 UTC by Joe Peterson
Modified:	2019-11-27 13:30 UTC
CC List:	22 users

Attachments
Video screen capture showing the chromium tab flicker (354.42 KB, application/x-matroska) 2012-10-01 14:45 UTC, Joe Peterson
Video screen capture showing the chromium tab flicker (354.51 KB, application/octet-stream) 2012-10-01 14:55 UTC, Joe Peterson
Xorg.0.log file from session showing problem (41.54 KB, text/plain) 2012-10-01 14:57 UTC, Joe Peterson
Flush state changes (2.83 KB, patch) 2012-10-17 22:37 UTC, Chris Wilson
Flush state changes (3.48 KB, patch) 2012-10-18 11:45 UTC, Chris Wilson
Xorg.0.log of the session affected by the bug (32.40 KB, text/plain) 2012-10-29 15:21 UTC, Mauro Fruet
Screen capture for chromium tabs (454.06 KB, video/x-msvideo) 2012-11-07 16:32 UTC, Mauro Fruet
How the picture should look like (127.42 KB, image/png) 2012-12-14 08:06 UTC, Zdenek Kabelac
Incorrect image 1 (112.37 KB, image/png) 2012-12-14 08:09 UTC, Zdenek Kabelac
Incorrect picture 2 (181.66 KB, image/png) 2012-12-14 08:12 UTC, Zdenek Kabelac
Sample libreoffice document that almost always shows issues on gen4 (8.40 KB, application/vnd.oasis.opendocument.graphics) 2013-02-04 23:44 UTC, sergio.callegari
A video showing the issue during a zoom sequence (632.26 KB, video/ogg) 2013-02-20 17:03 UTC, sergio.callegari
Video capture showing flicker on firefox scroll (1.30 MB, application/octet-stream) 2013-07-06 08:51 UTC, Zdenek Kabelac
Scroll with intel_gpu_top (138.59 KB, image/jpeg) 2013-07-06 20:45 UTC, Zdenek Kabelac
Scroll with intel_gpu_top and no problems (153.90 KB, image/jpeg) 2013-07-06 20:49 UTC, Zdenek Kabelac
Snapshot on the issue on characters (2.67 KB, image/png) 2013-07-09 11:46 UTC, sergio.callegari
Handy phone snapshot of artifacts on chars post drm/i915 fix: Only clear write-domains after a successful wait-seqno (392.20 KB, image/png) 2013-07-12 16:06 UTC, sergio.callegari
Character m of the word "parameters" is corrupted (705.88 KB, image/jpg) 2013-07-21 21:00 UTC, Alexander Haeussler
Character e of word "deletion" and n of "Indicate" is corrupted (1.11 MB, image/jpeg) 2013-07-21 21:41 UTC, Alexander Haeussler
Character i of word "application" and p of "bpp>=8" is corrupted (959.35 KB, image/jpeg) 2013-07-21 21:45 UTC, Alexander Haeussler
/dev/shm corruption (4.57 KB, image/png) 2013-12-13 10:39 UTC, Ivan Bulatovic
gedit in openbox with 2.99.907 GM45 SNA (199.50 KB, image/png) 2014-01-01 11:06 UTC, Edward Sheldrake
intel_error_decode output (517.89 KB, application/octet-stream) 2014-01-01 13:12 UTC, Edward Sheldrake
intel_error_decode output (649.75 KB, application/octet-stream) 2014-01-01 14:13 UTC, Edward Sheldrake
Grabbed snapshot with patch from comment 157 (56.89 KB, image/png) 2014-01-07 20:14 UTC, Zdenek Kabelac
Text with errors (52.08 KB, image/png) 2014-01-16 19:40 UTC, Zdenek Kabelac
Always force a GPU flush between operations (2.42 KB, patch) 2014-01-17 14:35 UTC, Chris Wilson
bug (77.65 KB, image/jpeg) 2014-01-17 15:11 UTC, Arkadiusz Miskiewicz
Always force a GPU flush between operations (2.64 KB, patch) 2014-01-21 10:29 UTC, Chris Wilson
not quite random corruption example (71.73 KB, image/png) 2014-01-25 15:34 UTC, Edward Sheldrake
sna/gen4,5: Fix setting pipe control cache flush bits (1.66 KB, patch) 2014-02-02 16:53 UTC, Edward Sheldrake
icon corruption (7.90 KB, image/png) 2014-02-03 22:06 UTC, Edward Sheldrake
GTK+2 fonts corruption (198.17 KB, image/jpeg) 2014-03-19 06:43 UTC, Ildar Muyukov
Xorg.log (44.15 KB, text/plain) 2014-03-19 07:02 UTC, Ildar Muyukov

Description Joe Peterson 2012-10-01 14:44:45 UTC

Using the xf86-video-intel driver in SNA mode (using version 2.20.9, but I've seen it back to 2.20.7 as well - I have not tested SNA in earlier versions), chromium tabs exhibit a strange "blotchy" flickering of text/graphics as the mouse is moved over them.  Note that this may vary depending on the text (or more probably the length of the text) in the tab.

I wonder if the transparent gradient/fade of the text in these tabs plays a part here.

I will attach a video screen capture of this happening.  One URL that causes it is the URL of a previous bug I reported: https://bugs.freedesktop.org/show_bug.cgi?id=55484 (the tab in the video uses this one).

[From my lspci: VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)]

Comment 1 Joe Peterson 2012-10-01 14:45:40 UTC

Created attachment 67928
Video screen capture showing the chromium tab flicker

Comment 2 Chris Wilson 2012-10-01 14:50:07 UTC

Can you please check that the attached video is the correct one? mplayer doesn't complain, but doesn't show anything either.

Also can you please attach your Xorg.0.log so that I know what hardware you have, and which WM you are using (and any compositing options)?

Comment 3 Joe Peterson 2012-10-01 14:55:25 UTC

Created attachment 67929
Video screen capture showing the chromium tab flicker

Trying this attachment again - this time specifying binary manually...  the previous upload seemed to alter the file.

Comment 4 Joe Peterson 2012-10-01 14:57:31 UTC

Created attachment 67930
Xorg.0.log file from session showing problem

Comment 5 Joe Peterson 2012-10-01 14:58:38 UTC

I am using openbox as a WM.  I have not selected any special (non-default) compositing options that I know of.

Comment 6 Chris Wilson 2012-10-01 15:12:09 UTC

What it actually looks like is that the gradient is misapplied whilst rendering the glyphs.

Can you please test with either downgrading pixman to 0.26 or compile -intel with -UHAS_PIXMAN_GLYPHS?

Comment 7 Chris Wilson 2012-10-01 15:20:08 UTC

And for an extra level of paranoia, can you also check if the error persists if you do a debug build (unoptimized, -O0) of pixman, xserver, and -intel? Having been burnt by bugs uncovered by aggressive compiler optimisations before, it helps to keep me calm to have a sanity check. ;-)

Comment 8 Joe Peterson 2012-10-01 19:03:34 UTC

Ok, I downgraded to pixman-0.26 (which then causes -intel to be built without HAS_PIXMAN_GLYPHS).  I also compiled xorg-server and -intel with -O0.  However, I was unable to compile pixman with -O0, because configure failed (checking for MMX support).

The same problem persists even with the above done.  Let me know if I should try anything else, or if this is enough to verify that it's not pixman...

Comment 9 Chris Wilson 2012-10-01 19:46:26 UTC

Yes, that is enough to rule out the new pixman_glyph_t routines and enough to show that it is not some random miscompilation. Which means we^W I need to look harder.

Comment 10 Chris Wilson 2012-10-02 13:36:46 UTC

First of all, let's disable acceleration of glyphs:

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 53494e3..4e510a4 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -69,7 +69,7 @@
 
 #include <mipict.h>
 
-#define FALLBACK 0
+#define FALLBACK 1
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0

That will tell us whether the corruption occurs as we render the glyphs using the GPU or as we upload. Similarly working through each of the NO_* options thereafter would be very helpful to identify which path in particular is affected.

Secondly,

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..f901008 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1863,7 +1863,7 @@ gen4_composite_picture(struct sna *sna,
        if (picture->pDrawable == NULL) {
                int ret;
 
-               if (picture->pSourcePict->type == SourcePictTypeLinear)
+               if (picture->pSourcePict->type == SourcePictTypeLinear && 0)
                        return gen4_composite_linear_init(sna, picture, channel,
                                                          x, y,
                                                          w, h,
@@ -2046,7 +2046,6 @@ check_gradient(PicturePtr picture)
 {
        switch (picture->pSourcePict->type) {
        case SourcePictTypeSolidFill:
-       case SourcePictTypeLinear:
                return false;
        default:
                return true;

will confirm whether it is the gradient that is implicated in this bug.

Comment 11 Joe Peterson 2012-10-03 01:13:23 UTC

OK, tried your suggestions - I set the following, one at a time, to 1.  Here are the results:

 #define FALLBACK 0                       clean
 #define NO_GLYPH_CACHE 0                 clean
 #define NO_GLYPHS_TO_DST 0               clean
 #define NO_GLYPHS_VIA_MASK 0             bad
 #define NO_SMALL_MASK 0                  bad
 #define NO_GLYPHS_SLOW 0                 bad
 #define NO_DISCARD_MASK 0                bad

Where it says "clean", changing only this define to 1 (and leaving the rest at 0) fixed the problem.  Where it says "bad", the bug still existed with only this define set to 1.

Applying the gen4_render.c patches (with all of the above left unchanged at 0) had no effect - the bug still existed.

Let me know if this helps.
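
For readers following the bisection above, the general technique of flipping one compile-time switch at a time can be sketched as follows. This is a hypothetical toy dispatcher, not the logic of sna_glyphs.c: the switch names are taken from the thread, but `choose_glyph_path`, the `glyph_path` enum, and the ordering of the checks are assumptions made purely for illustration.

```c
/* Compile-time debug switches, named after those in the thread.
 * Each can be overridden at build time, e.g. -DNO_GLYPHS_TO_DST=1. */
#ifndef FALLBACK
#define FALLBACK 0
#endif
#ifndef NO_GLYPHS_TO_DST
#define NO_GLYPHS_TO_DST 0
#endif
#ifndef NO_GLYPHS_VIA_MASK
#define NO_GLYPHS_VIA_MASK 0
#endif

/* Hypothetical rendering paths for this sketch only. */
enum glyph_path { PATH_SOFTWARE, PATH_TO_DST, PATH_VIA_MASK };

/* Pick a path; setting a switch to 1 removes that path from consideration. */
static enum glyph_path choose_glyph_path(int needs_mask)
{
    if (FALLBACK)
        return PATH_SOFTWARE;          /* all acceleration disabled */
    if (!NO_GLYPHS_TO_DST && !needs_mask)
        return PATH_TO_DST;            /* render glyphs straight to the target */
    if (!NO_GLYPHS_VIA_MASK)
        return PATH_VIA_MASK;          /* render through an intermediate mask */
    return PATH_SOFTWARE;              /* nothing accelerated is left */
}
```

Rebuilding with exactly one switch set to 1 and retesting shows whether the path that switch removes is involved: if the corruption disappears, that path is a suspect.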

Comment 12 Chris Wilson 2012-10-03 08:45:17 UTC

Ok, we are now into the realms of a missing GPU flush, or rather a missing workaround.

Can you please try:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..9d298dd 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1265,6 +1265,7 @@ gen4_emit_pipelined_pointers(struct sna *sna,
        if (key == sna->render_state.gen4.last_pipelined_pointers)
                return;
 
+       OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
        OUT_BATCH(GEN4_3DSTATE_PIPELINED_POINTERS | 5);
        OUT_BATCH(sna->render_state.gen4.vs);
        OUT_BATCH(GEN4_GS_DISABLE); /* passthrough */

with everything else back to normal.
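
The shape of the patch above (skip if the state key is unchanged, otherwise flush and then emit the new state) can be modelled in isolation. This is a minimal sketch under stated assumptions: the `batch` struct, `batch_emit`, `emit_pipelined_state`, and both command words are hypothetical stand-ins, and the numeric values are placeholders rather than real gen4 encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder command words for illustration; NOT the real gen4 encodings. */
#define CMD_FLUSH            0xF0000000u
#define CMD_PIPELINED_STATE  0xA0000000u

#define BATCH_WORDS 64

struct batch {
    uint32_t words[BATCH_WORDS];
    size_t   used;
    uint32_t last_state_key;   /* key of the state most recently emitted */
};

/* Append one command word, silently dropping it if the batch is full. */
static void batch_emit(struct batch *b, uint32_t word)
{
    if (b->used < BATCH_WORDS)
        b->words[b->used++] = word;
}

/* Emit new state, preceded by a flush, only when the key changes. */
static void emit_pipelined_state(struct batch *b, uint32_t key)
{
    if (key == b->last_state_key)
        return;                        /* state unchanged: nothing to emit */

    batch_emit(b, CMD_FLUSH);          /* flush goes in before the state change */
    batch_emit(b, CMD_PIPELINED_STATE | (key & 0xFFFFu));
    b->last_state_key = key;
}
```

In this model a repeated key costs nothing, while every genuine state change pays for one extra command word.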

Comment 13 Joe Peterson 2012-10-03 11:50:42 UTC

(In reply to comment #12)
> +       OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);

Nope, bug still occurs.

Comment 14 Chris Wilson 2012-10-03 12:08:31 UTC

Hmm, ok. A more drastic patch to confirm that this is the flushing bug I think it is...

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..5e35ff1 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1376,7 +1376,7 @@ gen4_emit_state(struct sna *sna,
                const struct sna_composite_op *op,
                uint16_t wm_binding_table)
 {
-       if (FLUSH_EVERY_VERTEX)
+       if (1||FLUSH_EVERY_VERTEX)
                OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
 
        gen4_emit_drawing_rectangle(sna, op);
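
The `1||` in the patch above is a quick way to force a branch without touching the macro's definition. A minimal sketch of the idiom, assuming the macro is 0 (its actual default in the driver is not stated in this thread; the two function names are hypothetical):

```c
/* Stand-in for the driver's compile-time switch; assumed to be 0 here. */
#define FLUSH_EVERY_VERTEX 0

/* Original condition: the result follows the macro. */
static int flush_if_configured(void)
{
    return FLUSH_EVERY_VERTEX ? 1 : 0;
}

/* Patched condition: `1 ||` short-circuits, so the macro is never consulted. */
static int flush_always(void)
{
    return (1 || FLUSH_EVERY_VERTEX) ? 1 : 0;
}
```

The forced form is convenient for a one-off experiment because reverting it is a one-token change.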

Comment 15 Joe Peterson 2012-10-03 14:12:38 UTC

(In reply to comment #14)
> Hmm, ok. A more drastic patch to confirm that this is the flushing bug I
> think it is...
> ...
> +       if (1||FLUSH_EVERY_VERTEX)
>                 OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);

No, sorry to say, this did not fix it.

Comment 16 Chris Wilson 2012-10-17 22:37:38 UTC

Created attachment 68731
Flush state changes

Since this sounds like a flush issue and gen5+ require flushes between certain pipelined ops, presume gen4 also needs them. Can you please test the attached patch as I've yet to reproduce this issue?

Comment 17 Chris Wilson 2012-10-18 11:45:08 UTC

Created attachment 68756
Flush state changes

Comment 18 Chris Wilson 2012-10-19 14:57:21 UTC

I've uploaded commit 257abfdabe39629fb458ed65fab11283f7518dc4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 17 23:34:22 2012 +0100

    sna/gen4: Presume we need a flush upon state change similar to gen5+
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55627
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

in the hope that gen4 has the same brokenness as the later generations. Please reopen if this doesn't fix the corruption, thanks.

Comment 19 Joe Peterson 2012-10-23 20:06:00 UTC

I tested just now with the latest git, and problem persists.  Just to make sure I did this correctly, I replaced the following three files on my system with freshly-built ones:

/usr/lib/xorg/modules/drivers/intel_drv.so
/usr/lib/libI810XvMC.so.1.0.0
/usr/lib/libIntelXvMC.so.1.0.0

The other lib symlinks are, of course, there, for the last two.

Comment 20 Mauro Fruet 2012-10-29 15:20:32 UTC

This bug affects me too, on both the machines in which I have tried to enable the SNA option. I've tried to enable it both on GNOME and Xfce, with the same results. I attach the Xorg.0.log of one of them.
I have tried to install the git version of xf86-video-intel, but the bug persists.
I have also noticed that the bug seems to affect only chromium tabs in which the html title can not fit, as in the video posted by Joe Peterson.

Comment 21 Mauro Fruet 2012-10-29 15:21:56 UTC

Created attachment 69237
Xorg.0.log of the session affected by the bug

Comment 22 Mauro Fruet 2012-11-01 15:56:01 UTC

Chris, the bug has disappeared today after upgrading cairo on my Archlinux machine. The upgrade includes this set of patches (half of them are from yourself):
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/git_fixes.diff?h=packages/cairo
They were submitted on September 17, but they were included into the Archlinux cairo package only yesterday.

Comment 23 Chris Wilson 2012-11-01 16:34:42 UTC

Hmm, more relevant to this case is that they dropped the 'cairo-1.10.0-buggy_gradients.patch' which will impact the rendering in the tabs (and notably be about an order of magnitude faster). So it feels like the bug is still lurking.

Comment 24 Joe Peterson 2012-11-01 17:36:20 UTC

(In reply to comment #23)
> Hmm, more relevant to this case is that they dropped the
> 'cairo-1.10.0-buggy_gradients.patch' which will impact the rendering in the
> tabs (and notably be about an order of magnitude faster). So it feels like
> the bug is still lurking.

Also, I have that cairo update from Arch as well, and although the problem looks (perhaps) less obvious, it's definitely still there.

Comment 25 Mauro Fruet 2012-11-02 08:06:04 UTC

You are both right. At first sight, the flickering seemed to be gone, but even if it's much less noticeable (at least for me), it's still there.

Comment 26 Chris Wilson 2012-11-06 16:41:55 UTC

Found a hint with

commit b2245838c15b54d72557de8facb7cc15d59624ae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 6 16:32:32 2012 +0000

    sna/gen4: opacity spans requires the per-rectangle flush w/a
    
    Note that this is worsened, but not caused, by:
    
    commit e1a63de8991a6586b83c06bcb3369208871cf43d
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Nov 2 09:10:32 2012 +0000
    
        sna/gen4+: Prefer GPU spans if the destination is active
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Does this help?

Comment 27 Mauro Fruet 2012-11-06 17:48:36 UTC

I've just tested the latest git, but unfortunately the problem persists.

Comment 28 Chris Wilson 2012-11-07 09:25:50 UTC

Ok, looking closely at this gm45 I can see a similar subtle flicker in the chromium tabs. Is the effect still as severe as shown in the first video? Any improvements from the flushing changes?

Comment 29 Mauro Fruet 2012-11-07 16:32:41 UTC

Created attachment 69666
Screen capture for chromium tabs

I have attached a video screen capture that shows the flickering after installing the latest git version of xf86-video-intel, lib-drm and cairo. As I wrote in a previous comment, the problem is almost not noticeable. This is the same behavior that I have encountered just before the latest patch.

Comment 30 Chris Wilson 2012-11-09 10:39:58 UTC

I think the effect is now subtle enough that I'm not going to worry too much - it is undoubtedly a missing flush or incorrect hw state. Since I can reproduce using the chromium tabs, I'll fix it one day (hopefully!). Please do ping occasionally to remind me, or if you have found a particularly nasty example.

Comment 31 Joe Peterson 2012-11-09 13:21:15 UTC

(In reply to comment #30)
> I think the effect is now subtle enough that I'm not going to worry too much
> - it is undoubtedly a missing flush or incorrect hw state. Since I can
> reproduce using the chromium tabs, I'll fix it one day (hopefully!). Please
> do ping occasionally to remind me, or if you have found a particularly nasty
> example.

Hey Chris, do you think this could be mainly a HW bug only on the older graphics chips?  In other words, is the code obeying the spec, but it would take an odd (non-spec) workaround to make the HW behave?

If you have put in any extra flushes to try to fix this, perhaps those should now be removed so as not to affect performance adversely, especially if the extra ones affect newer HW as well (or did you only do it in the code for the old HW?).

I suspect I'll be upgrading from this old laptop soon for other reasons, so I agree that worrying too much about a hard-to-fix problem that only affects really old HW is probably not worth it (people could just turn off SNA on older HW, as long as UXA is supported for the foreseeable future).

Comment 32 Chris Wilson 2012-11-09 13:26:29 UTC

The code paths are specific for this chipset, and this flicker only seems to appear on gm45 for now. (Though I need to look at the other gen4 closely.)

And for the order of magnitude performance improvement switching from UXA to SNA I think such a minor bug is a small price to pay... And it will be fixed as soon as I find a workaround.

Comment 33 Chris Wilson 2012-12-11 13:06:55 UTC

*** Bug 58139 has been marked as a duplicate of this bug. ***

Comment 34 Zdenek Kabelac 2012-12-11 13:47:22 UTC

I'm not yet convinced that my Bug 58139 is a duplicate of this BZ.

There are noticeable differences - when I use an ffmpeg capture tool to grab a picture from the screen, it does not seem to catch the 'visual' errors I'm observing on the laptop's LCD (and captured with a mobile phone).
(On the other hand, I could just have passed the wrong options to the grab tool.)

Also, in my case it's rather something 'new' at this 'scale' - I don't recall seeing this with an SNA driver from several months ago.

(Eventually I could possibly try to bisect, if I find some 'boring' movie to watch :))

Comment 35 Chris Wilson 2012-12-11 14:01:43 UTC

(In reply to comment #34)
> I'm not yet convinced that my Bug 58139 is a duplicate of this BZ.
> 
> There are noticeable differences - when I use an ffmpeg capture tool to
> grab a picture from the screen, it does not seem to catch the 'visual' errors
> I'm observing on the laptop's LCD (and captured with a mobile phone).
> (On the other hand, I could just have passed the wrong options to the grab
> tool.)
> 
> Also, in my case it's rather something 'new' at this 'scale' - I don't
> recall seeing this with an SNA driver from several months ago.
> 
> (Eventually I could possibly try to bisect, if I find some 'boring' movie to
> watch :))

I'm using this bug as a catch-all for the issues I'm uncovering with enabling gen4.

Comment 36 Chris Wilson 2012-12-12 09:54:03 UTC

Zdenek, can you see if this reduces the majority of your flicker:

commit 2dbe7d91a7f15a3a9ddad696c5088ca98898fca2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 12 09:50:34 2012 +0000

    sna/gen4: Use the single-threaded SF w/a for spans as well
    
    Fixes the flickering seen in the fishtank demo, for example.

Comment 37 Zdenek Kabelac 2012-12-12 10:20:06 UTC

(In reply to comment #36)
> Zdenek, can you see if this reduces the majority of your flicker:
> 
> commit 2dbe7d91a7f15a3a9ddad696c5088ca98898fca2
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Dec 12 09:50:34 2012 +0000
> 


Hmm, nope, no difference.

Went through some Czech magazines and their headlines, and here are the links:

http://goo.gl/xvPt1

http://goo.gl/3kBAV

which seem to show the problem most obviously on my laptop.

UXA seems to be without problems.

Comment 38 Chris Wilson 2012-12-12 10:33:16 UTC

(In reply to comment #37)
> UXA seems to be without problems.

Because for UXA/gen4, I have a massive hammer to prevent the GPU from trying to execute operations in parallel. The battle is to understand precisely what doesn't work and find alternatives.

Comment 39 Zdenek Kabelac 2012-12-12 10:48:51 UTC

It's probably interesting to note that those 2 links above do not always cause the problem - in some cases the firefox tab seems to be rendered without any visible problem.

If I open 2 firefox windows, it's much easier to get rendering problems; I've even seen pictures completely replaced with a brownish blurred picture, sometimes for seconds.

(I'm using firefox-17.0.1-1.fc19.x86_64)

Comment 40 Chris Wilson 2012-12-12 16:16:24 UTC

*** Bug 54357 has been marked as a duplicate of this bug. ***

Comment 41 Zdenek Kabelac 2012-12-14 08:06:15 UTC

Created attachment 71490
How the picture should look like

Here is firefox grab of original picture from this page:

http://goo.gl/KAAgP

This is the proper look - at the top are visible 'tabs'

Comment 42 Zdenek Kabelac 2012-12-14 08:09:20 UTC

Created attachment 71491
Incorrect image 1

This picture is grabbed with some delay - to not move mouse outside of the window (since in that case it would be properly refreshed).

However, if I just scroll the page up/down, I'm able to get a picture like this visible there for a short moment.

Looks like coordinates of the rendered picture were squeezed and mirrored ?

Comment 43 Zdenek Kabelac 2012-12-14 08:12:08 UTC

Created attachment 71492
Incorrect picture 2

And here is another one I've managed to take - again a nicely visible distortion.

When I scroll further up and down, the image gets the correct size and look again, and then after a while I can see something like this again.

Comment 44 Chris Wilson 2013-01-22 00:56:56 UTC

*** Bug 59685 has been marked as a duplicate of this bug. ***

Comment 45 Chris Wilson 2013-02-04 10:58:33 UTC

*** Bug 59351 has been marked as a duplicate of this bug. ***

Comment 46 Chris Wilson 2013-02-04 16:27:01 UTC

*** Bug 60284 has been marked as a duplicate of this bug. ***

Comment 47 sergio.callegari 2013-02-04 23:44:04 UTC

Created attachment 74207
Sample libreoffice document that almost always shows issues on gen4

I am attaching a test file that almost never looks right when opened with libreoffice on gen4

Comment 48 Chris Wilson 2013-02-05 10:22:39 UTC

So my testcase was "fixed" by:

commit 1565917f10d9fb3c7e2e7e273173c38c364b9861
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 5 10:11:14 2013 +0000

    sna/gen4: Disable non-rectilinear GPU span compositing
    
    This seems to be the primary victim of the render corruption, so disable
    until the root cause is fixed.

Can you please check your worst-case / typical behaviours and see if any still remain?

Comment 49 Chris Wilson 2013-02-05 10:30:25 UTC

(In reply to comment #47)
> Created attachment 74207
> Sample libreoffice document that almost always shows issues on gen4
> 
> I am attaching a test file that almost never looks right when opened
> with libreoffice on gen4

I get different rendering of that test.odg with different versions of lodraw - and on the machine that renders it differently, it switches back to the old output at a certain scale factor. That is quite atypical behaviour for this gen4 bug.

Comment 50 Till Matthiesen 2013-02-05 11:34:08 UTC

It does not fix #59351
https://bugs.freedesktop.org/show_bug.cgi?id=59351

The symptoms remain unchanged.

test.odg works fine here on both screens, though. [lodraw 3.6.5.2]

Comment 51 Chris Wilson 2013-02-05 11:42:47 UTC

Hmm, right, that would not be along the spans path in the first place. Oh well, I can try one of the other workarounds I had earlier...

Comment 52 Chris Wilson 2013-02-05 11:44:05 UTC

How about: 

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index cc1778a..0a59681 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1895,6 +1895,9 @@ gen4_render_composite(struct sna *sna,
        tmp->has_component_alpha = false;
        tmp->need_magic_ca_pass = false;
 
+       if (!mask)
+               mask = sna->render.white_picture;
+
        if (mask) {
                if (mask->componentAlpha && PICT_FORMAT_RGB(mask->format)) {
                        tmp->has_component_alpha = true;

Comment 53 sergio.callegari 2013-02-05 11:55:36 UTC

(In reply to comment #50)
> It does not fix #59351
> https://bugs.freedesktop.org/show_bug.cgi?id=59351
>
> The symptoms remain unchanged.
>
> test.odg works fine here on both screens, though. [lodraw 3.6.5.2]

Please test test.odg by zooming in/out, selecting all, and moving it around by steps (e.g. with the arrows).
Sometimes when opened it initially renders well, but when zooming and moving by little amounts the rendering breaks, both on 3.6.5.2 and on 4.0.0.3 (RC).

Comment 54 Till Matthiesen 2013-02-05 12:09:59 UTC

(In reply to comment #52)
> How about: 
> 
> diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
> index cc1778a..0a59681 100644
> --- a/src/sna/gen4_render.c
> +++ b/src/sna/gen4_render.c
> @@ -1895,6 +1895,9 @@ gen4_render_composite(struct sna *sna,
>         tmp->has_component_alpha = false;
>         tmp->need_magic_ca_pass = false;
>  
> +       if (!mask)
> +               mask = sna->render.white_picture;
> +
>         if (mask) {
>                 if (mask->componentAlpha && PICT_FORMAT_RGB(mask->format)) {
>                         tmp->has_component_alpha = true;

Doesn't fix it, unfortunately.

Comment 55 Till Matthiesen 2013-02-05 12:13:48 UTC

(In reply to comment #53)
> Please test test.odg zooming in/out and selecting all and moving it around
> by steps (e.g. with the arrows).
> Sometimes when opened it initially renders well, but when zooming and moving
> by little amounts the rendering breaks, both on 3.6.5.2 and on 4.0.0.3 (RC).

I tried hard to reproduce the issue, but wasn't able to do so.
Either it's that hard to trigger or it simply doesn't exist for *my* configuration.

Chris, how much influence could the kernel drm and libdrm version have on those issues? I use Linux 3.7.5 and the recent libdrm from git.

Comment 56 Chris Wilson 2013-02-05 12:20:31 UTC

The bug that I'm presuming underlies all of these gen4 corruption issues is a misprogramming of GPU state - I have seen the flicker persist for a few kernels now and would not expect it to be a factor. (Except that there may be an eventual w/a required in the kernel, we have not applied one recently.)

Comment 57 Till Matthiesen 2013-02-06 12:05:53 UTC

(In reply to comment #56)
> The bug that I'm presuming underlies all of these gen4 corruption issues is
> a misprogramming of GPU state - I have seen the flicker persist for a few
> kernels now and would not expect it to be a factor. (Except that there may
> be an eventual w/a required in the kernel, we have not applied one recently.)

I see. 

Is there anything we can do to help you further?
Enable debug mode, etc.?

Comment 58 sergio.callegari 2013-02-07 08:37:55 UTC

(In reply to comment #55)
> (In reply to comment #53)
> > Please test test.odg zooming in/out and selecting all and moving it around
> > by steps (e.g. with the arrows).
> > Sometimes when opened it initially renders well, but when zooming and moving
> > by little amounts the rendering breaks, both on 3.6.5.2 and on 4.0.0.3 (RC).
>
> I tried hard to reproduce the issue, but wasn't able to do so.
> Either it's that hard to trigger or it simply doesn't exist for *my*
> configuration.

Weirdly enough, I have discovered that the rendering bug with my test file (test.odg) goes away if antialiasing is disabled in libreoffice.

Comment 59 sergio.callegari 2013-02-07 10:50:13 UTC

This morning I've received the kde 4.10 update and newer debs for git version of the intel driver (13/02/07 - git 974b6a).

I cannot tell whether the rendering issues have worsened with the newer driver (probably not) or whether the newer kde framework and kwin are exposing bugs more aggressively, but things have become a bit problematic with this newer setup.

1) Border of the toolbar disappears
2) Shades/light effects of kwin do not render properly and remain in place (often with distortion) after use
3) Some characters appear underlined in the konsole

Moving back to uxa seems to fix all these issues, as well as the libreoffice rendering.

Comment 60 Chris Wilson 2013-02-07 13:28:05 UTC

I should mention that Option "AccelMethod" "blt" should eliminate the artifacts and still outperform uxa. Can you please confirm that supposition?

Comment 61 sergio.callegari 2013-02-07 14:44:54 UTC

Trying blt right now. No rendering issues. SNA was perceivably faster.

Comment 62 Till Matthiesen 2013-02-07 16:31:18 UTC

I can't test it, obviously.
The rotated zaphod screen, at which the artefacts appear in my case, is only available with "sna" enabled.

On a side note:
I tried it nevertheless as I forgot about that.
The result was a segfault of the xserver.

[   406.336] Requested Entity already in use!
[   406.336] (EE) Screen 1 deleted because of no matching config section.
[   406.336] (EE) 
[   406.336] (EE) Backtrace:
[   406.336] (EE) 0: /usr/bin/X (xorg_backtrace+0x36) [0x58a416]
[   406.336] (EE) 1: /usr/bin/X (0x400000+0x18e269) [0x58e269]
[   406.336] (EE) 2: /usr/lib/libpthread.so.0 (0x7f3cfacee000+0xf1e0) [0x7f3cfacfd1e0]
[   406.336] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7f3cf84d2000+0x16150) [0x7f3cf84e8150]
[   406.336] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7f3cf84d2000+0x17435) [0x7f3cf84e9435]
[   406.336] (EE) 5: /usr/bin/X (xf86DeleteScreen+0x84) [0x480c64]
[   406.336] (EE) 6: /usr/bin/X (xf86BusConfig+0x216) [0x46c086]
[   406.336] (EE) 7: /usr/bin/X (InitOutput+0x956) [0x479df6]
[   406.336] (EE) 8: /usr/bin/X (0x400000+0x26776) [0x426776]
[   406.336] (EE) 9: /usr/lib/libc.so.6 (__libc_start_main+0xf5) [0x7f3cf9b7aa15]
[   406.336] (EE) 10: /usr/bin/X (0x400000+0x26c9d) [0x426c9d]
[   406.336] (EE) 
[   406.336] (EE) Segmentation fault at address 0x0
[   406.336] 
Fatal server error:
[   406.336] Caught signal 11 (Segmentation fault). Server aborting
[   406.336] 
[   406.336] (EE) 

Wouldn't it be possible to gracefully exit and remind users that rotated setups are only available with the "sna" option?

Comment 63 Chris Wilson 2013-02-07 21:02:54 UTC

Ah, sorry, I misled you. I forgot that not everybody uses SNA as their default accelmethod. In order to use the BLT trick, you need to build the driver with --disable-uxa or --with-default-accel=sna.

Comment 64 Till Matthiesen 2013-02-08 10:47:55 UTC

Hi Chris,

thanks for the clarification.

I compiled the driver with --with-default-accel=sna and set
both the rotated and non-rotated screens to "AccelMethod" "blt".

Now, the rotated screen is _all black_ but the cursor.
So I, somehow, managed to start acroread but do not see anything but the cursor.
The other screen works fine, though.

Comment 65 sergio.callegari 2013-02-20 17:03:45 UTC

Created attachment 75189 [details]
A video showing the issue during a zoom sequence

Attaching a video showing the issue while zooming with libreoffice. Hope that the shape of the patterns that appear may be a clue to identify the issue.

Comment 66 sergio.callegari 2013-02-27 15:36:42 UTC

Still present (exactly as in the video) with kernel 3.8.0
libdrm and libkms 2.4.42 + git snapshot 13/02/25 commit 41fc2c...
mesa 9.2 + git snapshot 13/02/25 commit 533dc3...
xserver intel driver 2.21.3 + git snapshot 13/02/25 commit 421910...

I have discovered that I encounter a similarly looking rendering issue, with the image getting decomposed in multiple pieces that are not correctly aligned, on the whole screen whenever I attach an external displayport monitor and call

xrandr --output DP1 --auto --primary --output LVDS1 --off

Curiously, this does not happen if I first switch on the monitor and then switch
off the laptop screen with 2 independent xrandr calls.

Comment 67 Chris Wilson 2013-02-27 16:25:03 UTC

(In reply to comment #66)
> Still present (exactly as in the video) with kernel 3.8.0
> libdrm and libkms 2.4.42 + git snapshot 13/02/25 commit 41fc2c...
> mesa 9.2 + git snapshot 13/02/25 commit 533dc3...
> xserver intel driver 2.21.3 + git snapshot 13/02/25 commit 421910...
> 
> I have discovered that I encounter a similarly looking rendering issue, with
> the image getting decomposed in multiple pieces that are not correctly
> aligned, on the whole screen whenever I attach an external displayport
> monitor and call

Like https://bugs.freedesktop.org/attachment.cgi?id=70128 (bug 57160)?

Comment 68 sergio.callegari 2013-02-28 09:31:52 UTC

Looks like the 'decomposition' in small elements that are not aligned correctly has an even finer granularity. See https://bugs.freedesktop.org/attachment.cgi?id=75673

Comment 69 sergio.callegari 2013-03-08 14:48:36 UTC

The issue with the libreoffice test file is fixed for me. Thanks!!!

Comment 70 Chris Wilson 2013-03-08 14:58:52 UTC

(In reply to comment #69)
> The issue with the libreoffice test file is fixed for me. Thanks!!!

Okay, now that was an accident!

Can you try

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index ae6d3c1..5edad51 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -57,7 +57,7 @@
 #define FORCE_INPLACE 0
 #define FORCE_FALLBACK 0
 #define FORCE_FLUSH 0
-#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628
+#define FORCE_FULL_SYNC 0
 
 #define DEFAULT_TILING I915_TILING_X

and see if the corruption returns?

Comment 71 Chris Wilson 2013-03-13 16:46:26 UTC

*** Bug 62302 has been marked as a duplicate of this bug. ***

Comment 72 Roman Elshin 2013-03-15 18:24:14 UTC

"and see if the corruption returns?"
With this diff applied, the openoffice test file in my config looks the same as in comment 65; without it, it looks ok. This is after recompiling cairo without server_side_gradients.patch, which makes chromium tabs much better.

Comment 73 Chris Wilson 2013-03-15 21:41:36 UTC

(In reply to comment #72)
> It is after recompiling cairo without
> server_side_gradients.patch, what makes chromium tabs much better.

Don't bother with that, the bug is manifest inside the GPU. Your chromium tabs are just one instance where the GPU stutters, but it is not the only one and they are all not related to gradients (which is just a different texture after all).

Comment 74 Zdenek Kabelac 2013-06-14 07:16:58 UTC

I think I've a sort of positive message here
at least on my T61 - now using upstream git commit 

8f340f90f4b2f269d6308d0bd31fbc2a5f579608

I'm no longer observing corruptions while scrolling Firefox pages.
So some recent commit is probably behind this change.
Before I've used 14 days older commit and I've been able to easily
see those corruptions. Now it looks like they are gone.

Of course I'll make a longer observation here - but so far it looks promising.

Comment 75 Zdenek Kabelac 2013-06-18 08:39:03 UTC

   (In reply to comment #74)
> I think I've a sort of positive message here
> at least on my T61 - now using upstream git commit 
> 
> 8f340f90f4b2f269d6308d0bd31fbc2a5f579608
> 
> I'm no longer observing corruptions while scrolling Firefox pages.
> So some recent commit is probably behind this change.
> Before I've used 14 days older commit and I've been able to easily
> see those corruptions. Now it looks like they are gone.
> 
> Of course I'll make a longer observation here - but so far it looks
> promising.


Ok - it seems that over a long-term run the corruptions start to appear again.
So it seems to be related to how the memory is being used over time.

But with my 4 days uptime - now I easily see images corrupted during the scroll in firefox.

Comment 76 sergio.callegari 2013-06-18 08:43:04 UTC

One very effective way to accelerate the appearance of the issues (which may help debugging) seems to be using libreoffice draw/impress. Drawing large shapes (or even better importing large bitmaps) and then selecting them causes a sort of grid to be drawn over the shapes and then to be erased once the object is no longer selected. Doing this a few times is often sufficient to cause incorrect drawing or erasing of the grid or incorrect redrawing of the image under the grid.

Comment 77 sergio.callegari 2013-06-18 08:45:09 UTC

Also note that the 'sample libreoffice document that almost always shows issues on gen4' is again quite often showing issues on gen4. This file seems to be a nice regression test.

Comment 78 Deve 2013-06-18 11:12:24 UTC

No hope of solving this and many, many other bugs on gen4 - here and in mesa. I suggest buying a new laptop because it doesn't make any sense. Some time ago I also thought that it did. Now I have Intel Ivybridge and Nvidia through Bumblebee and both run almost perfectly :-)

Regards

Comment 79 Zdenek Kabelac 2013-06-18 11:42:42 UTC

(In reply to comment #78)
> No hope to solve this and many, many other bugs on gen4 - here and in mesa.
> I suggest to buy new laptop because it doesn't have any sense. Some time ago
> I also thought that it has. Now I have Intel Ivybridge and Nvidia through
> Bumblebee and both runs almost perfect :-)
> 
> Regards

Sure I could buy and run Windows, and forget about buggy Linux drivers ;)
But we are not all the same - I still hope it can be fixed :)....
Enjoy your non free proprietary Nvidia drivers....

Comment 80 sergio.callegari 2013-06-18 11:50:17 UTC

I think that all Intel developers are extremely willing to help, and I appreciate a lot that they try to do so even if I have hardware that is 3 years old. Let's try to provide good info and replicable demo cases with the bug reports. And let's stick just to them, since these bug reports are probably already long, difficult to read and to interpret as they are.

Comment 81 Deve 2013-06-18 14:16:05 UTC

Actually I didn't say that Nvidia has better drivers. Just Ivy- and Sandybridge based graphics cards have MUCH better support.

I reported some bugs in mesa driver. Bisected, logs from sysprof, wine etc. For example:
https://bugs.freedesktop.org/show_bug.cgi?id=51471
Bug since mesa 8.0-rc1, easy to repair. Fixed fortunately in February 2013. And many others not fixed.

Bugs in X.org drivers were usually fixed quickly. Sometimes in the same day ;)

Sorry for offtopic.

Regards,
Deve

Comment 82 Chris Wilson 2013-07-02 10:03:02 UTC

Sergio, I believe the corruption you are seeing in the presentation is from the coherency bug. My apologies for confusing that with the general gen4 rendering issues.

Comment 83 Zdenek Kabelac 2013-07-06 08:51:40 UTC

Created attachment 82118 [details]
Video capture showing flicker on firefox scroll

I've captured at 25 FPS an example of how the pictures flicker when e.g. a Firefox window is being scrolled up/down. In most cases the picture is displayed normally when scrolling stops, but in rare cases the picture stays unviewable. In this short video, if you step frame by frame (e.g. with mplayer's '.' key), you can find 2 places where the picture is broken.

Comment 84 Chris Wilson 2013-07-06 15:10:07 UTC

Yes, that is characteristic of this bug. The internal vertex/texture coordinates that are being passed along the GPU pipeline become corrupt (it looks like we overflow a small ring buffer). The only effective approach I've found so far has been to keep the number of rectangles inside the GPU pipeline below a magic value - but that is quite tricky here, and the simplest seems to be the big hammer that I stumbled upon for UXA. (Stumbled as it was an artifact of the original implementation and fixes an entirely different GPU bug - an immediate hang.)
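
To make the suspected failure mode concrete, here is a minimal toy model in C, not driver code. It assumes a fixed-size ring whose slots are reused without checking whether the consumer has finished with them; the names (toy_ring, RING_SLOTS and so on) and the slot count are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not taken from the driver or the hardware docs. */
#define RING_SLOTS 16u

struct toy_ring {
    unsigned head; /* total entries written by the producer */
    unsigned tail; /* total entries retired by the consumer */
};

/* Entries written but not yet retired. */
static unsigned toy_ring_in_flight(const struct toy_ring *r)
{
    assert(r != NULL);
    return r->head - r->tail;
}

/* Write one entry without checking for free space.
 * Returns 1 if the slot being reused was still in flight, else 0. */
static int toy_ring_write_unchecked(struct toy_ring *r)
{
    int clobbered = toy_ring_in_flight(r) >= RING_SLOTS;
    r->head++;
    return clobbered;
}

/* Retire up to n in-flight entries. */
static void toy_ring_retire(struct toy_ring *r, unsigned n)
{
    unsigned in_flight = toy_ring_in_flight(r);
    r->tail += n < in_flight ? n : in_flight;
}

/* Submit `count` entries, retiring `retire_per_write` after each one,
 * and count how many writes landed on a slot that was still in use. */
static unsigned toy_ring_count_clobbers(unsigned count,
                                        unsigned retire_per_write)
{
    struct toy_ring r = { 0, 0 };
    unsigned clobbers = 0;

    for (unsigned i = 0; i < count; i++) {
        clobbers += (unsigned)toy_ring_write_unchecked(&r);
        toy_ring_retire(&r, retire_per_write);
    }
    return clobbers;
}
```

In this model nothing is clobbered as long as the consumer keeps up, and every write past the ring size clobbers a live slot once it falls behind. That is consistent with the corruption depending on timing and load rather than appearing on every draw.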

Comment 85 Zdenek Kabelac 2013-07-06 18:31:14 UTC

(In reply to comment #84)
> Yes, that is characteristic of this bug. The internal vertex/texture
> coordinates that are being passed along the GPU pipeline become corrupt (it
> looks like we overflow a small ring buffer). The only effective approach

If it were a plain 'overflow' - then I'd be seeing this kind of corruption
always right after X starts.

But it seems that e.g. Firefox must be used for a certain period of time
to make these corruptions visible.

Surely I'm not an expert on GPU programming - but maybe when the physical memory gets fragmented enough after some usage there are some 'cached bo' objects - maybe their usage is not fully synchronized?

Also the effect can disappear again (I've not yet noticed any particular way to cause that) - so then Firefox scrolls even the very same pages and there
is no problem (i.e. I am now still running the same session - and I do not get any visible problem)

> I've found so far has been to keep the number of rectangles inside the GPU
> pipeline below a magic value - but that is quite tricky here, and the

Well, why would the number sometimes work for hours without a single visible problem, and then suddenly start to show them again?

My impression here is that when this problem is visible,
it looks like the operations are working with 'wrong' parameters
(i.e. sometimes I see the image stretched, inverted, zoomed)

Maybe the parameters have only some bits mangled - memory not fully synchronized
from CPU to GPU??

Comment 86 Chris Wilson 2013-07-06 18:41:51 UTC

Whilst more than likely you are using a kernel with a known incoherency bug (as there is yet to be a kernel release with it fixed), the effects are quite different to the ones you captured. The issue which I believe to be behind the distorted flickering can be quite easily triggered by simply changing the number of rectangles submitted in a single draw call. In practice, this then both depends upon the exact details of the rendering and its timing.

Comment 87 Zdenek Kabelac 2013-07-06 20:45:58 UTC

Created attachment 82127 [details]
Scroll with intel_gpu_top

I've captured video of when the problem appears and when it doesn't
(Attaching only a frame)

Using some past kernel 3.10 b2c311075db578f1433d9b303698491bfa21279a
(as of now - current vanilla)

Using current xf86 tree 5aaab9ea0310d48bb1a1ca20308d1c9721a9de3f
(as of now)

Running Firefox scroll when I've managed to trigger the problem
(it seems opening a lot of pages with small icons on them helps to speed up this process)

As can be seen - intel_gpu_top shows quite high usage there.
The machine was otherwise unloaded.
During scrolling those percentages stayed about the same.

Comment 88 Zdenek Kabelac 2013-07-06 20:49:20 UTC

Created attachment 82128 [details]
Scroll with intel_gpu_top and no problems

Quite the same system - except the problem was not visible (restarted X session).
As can be seen - now during scrolling the GPU has been basically unloaded.

Basically nothing other than the Firefox scroll was being done.

I should also mention I'm running Firefox nightly - but the same can be triggered with the stable version.

Comment 89 Zdenek Kabelac 2013-07-06 21:31:09 UTC

Hmm - after doing some more experiments - the high load on the GPU seems to be pretty much the thing which makes the problem visible.

Another interesting thing is - it's usually enough to just 'reload' the very same page in Firefox and the load goes down to ~1% and everything is ok.

Also the kernel perf top doesn't seem to be showing anything interesting.

And a side note - when  intel_gpu_top is running - the laptop is actually giving some whistling noise not really pleasant for longterm usage ;)....

Comment 90 Chris Wilson 2013-07-06 21:55:23 UTC

I have pushed a revised workaround to the best of my understanding to

commit 368c909b29758f996dbbdbec4d471df23f60bc04
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 6 22:27:44 2013 +0100

    sna/gen4: Restore the flush-every-vertex w/a
    
    This is an abhorrent workaround for some internal GPU brokenness. A
    slight refinement since earlier times is the recognition that 16 is a
    magic number limiting the maximum number of inflight rectangles through
    the GPU.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
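
As an illustration only, the idea of capping the number of in-flight rectangles can be sketched in C as below. This is not the code from the commit; the struct, the function names and MAX_INFLIGHT_RECTS are hypothetical, and the value 16 is simply the magic number quoted in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the limit quoted in the commit message. */
#define MAX_INFLIGHT_RECTS 16u

struct batch {
    unsigned queued;  /* rectangles emitted since the last flush */
    unsigned flushes; /* flushes issued so far */
};

/* Drain everything queued; a no-op if nothing is pending. */
static void batch_flush(struct batch *b)
{
    assert(b != NULL);
    if (b->queued == 0)
        return;
    b->queued = 0;
    b->flushes++;
}

/* Queue one rectangle, flushing first if the cap has been reached. */
static void batch_emit_rect(struct batch *b)
{
    assert(b != NULL);
    if (b->queued == MAX_INFLIGHT_RECTS)
        batch_flush(b);
    b->queued++;
    assert(b->queued <= MAX_INFLIGHT_RECTS);
}

/* Number of flushes needed to draw `nrects` rectangles under the cap. */
static unsigned count_flushes(unsigned nrects)
{
    struct batch b = { 0, 0 };

    for (unsigned i = 0; i < nrects; i++)
        batch_emit_rect(&b);
    batch_flush(&b); /* final flush for the partial tail */
    return b.flushes;
}
```

The flush count grows linearly with the number of rectangles, so a smaller cap means proportionally more stalls. That is the throughput cost this kind of workaround trades for correctness.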

Comment 91 Zdenek Kabelac 2013-07-06 22:27:05 UTC

From what I'm experiencing after the driver rebuild - I'd say it's now actually much simpler to trigger that flickering/high GPU usage.

So unless the patch is missing something - then the limitation to 16 doesn't help here on T61.

Comment 92 Chris Wilson 2013-07-07 08:31:12 UTC

Removed an older hack and had to reduce the max further. Bah, humbugs.

Comment 93 Zdenek Kabelac 2013-07-07 09:01:54 UTC

(In reply to comment #92)
> Removed an older hack and had to reduce the max further. Bah, humbugs.

OK, this time I'm definitely not able to reproduce the issue so far.
Maybe it will take some extra time - hard to say now - but before, it took me just minutes to get the issue visible.

intel_gpu_top now shows only few lines busy to 1% during scroll.


The bad part is - while before I've been getting 3.7Mchar/s in x11perf -aa10text - now it's like 1.3Mchar/s  so significantly slower.

So the question here would be - isn't the corruption based on  triangle surface size ? So i.e. GPU is able to  process a lot of small ones - but has bug with bigger ones ?

Maybe a cheap test would be to flush when some longer triangle edge is pushed in ?

Comment 94 Zdenek Kabelac 2013-07-07 09:13:07 UTC

I'm doing some experiments with MAX_FLUSH_VERTICES.

When set to 64 - even gnome-terminal starts to show weird pixels in some cases.

Now I'm just playing with value 12  (gives ~2.1Mchars/s) and I'm hunting for visual problems.

Comment 95 Zdenek Kabelac 2013-07-07 09:19:38 UTC

And with 12 - flickering starts to appear as well.

Comment 96 Chris Wilson 2013-07-07 09:21:08 UTC

(In reply to comment #93)
> The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> -aa10text - now it's like 1.3Mchar/s  so significantly slower.

That's the sacrifice, we have to stop sending commands to the GPU and wait for it to complete those in flight (quite frequently). Or else new rectangles overwrite vertex entries still being used by later entries.

> So the question here would be - isn't the corruption based on  triangle
> surface size ? So i.e. GPU is able to  process a lot of small ones - but has
> bug with bigger ones ?
> 
> Maybe a cheap test would be to flush when some longer triangle edge is
> pushed in ?

Not really, you have to predict when a VUE being used by the end of the pipeline will be overwritten by a new rectangle at the start of the pipeline. This is completely internal state - the primitive command we want to feed to the GPU can contain thousands of rectangles. Instead of counting rectangles, you want to start counting fragments (actually texel reads, since that will be the rate-limiting factor) and flush if we queue up too much work for the GPU. If you also model how fast the GPU is retiring fragments so that you can predict how much work is in flight, you could further reduce flushes...

We still need to stop the gpu and wait for it to complete. No matter how finely you do it, it will sacrifice throughput and latency.

It's a workaround. I still live in hope that after all these years we missed some configuration detail required for gen4.
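
A rough sketch in C of the "count fragments instead of rectangles" idea follows, under stated assumptions. The cost of a rectangle is approximated here by its area in pixels, which is only a crude proxy for texel reads, and FRAGMENT_BUDGET is an arbitrary hypothetical value. Nothing below is taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget; the thread does not give a real number. */
#define FRAGMENT_BUDGET 4096ul

struct rect {
    unsigned w, h;
};

struct frag_batch {
    unsigned long queued_fragments; /* estimated work since last flush */
    unsigned flushes;               /* flushes issued so far */
};

/* Drain queued work; a no-op if nothing is pending. */
static void frag_flush(struct frag_batch *b)
{
    assert(b != NULL);
    if (b->queued_fragments == 0)
        return;
    b->queued_fragments = 0;
    b->flushes++;
}

/* Queue one rectangle, flushing first if it would exceed the budget.
 * A single rectangle larger than the budget is still queued on its own. */
static void frag_emit(struct frag_batch *b, struct rect r)
{
    unsigned long cost = (unsigned long)r.w * r.h; /* area as cost proxy */

    assert(b != NULL);
    if (b->queued_fragments != 0 &&
        b->queued_fragments + cost > FRAGMENT_BUDGET)
        frag_flush(b);
    b->queued_fragments += cost;
}

/* Number of flushes needed to draw `n` rectangles under the budget. */
static unsigned frag_count_flushes(const struct rect *rects, size_t n)
{
    struct frag_batch b = { 0, 0 };

    assert(rects != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        frag_emit(&b, rects[i]);
    frag_flush(&b); /* final flush for the partial tail */
    return b.flushes;
}
```

With a budget like this, many small glyph-sized rectangles share one flush while a few large ones each get their own. This is the kind of reduction in flushes the comment hints at, and it still stalls the GPU at every flush.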

Comment 97 Zdenek Kabelac 2013-07-07 10:16:46 UTC

(In reply to comment #96)
> (In reply to comment #93)
> > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > -aa10text - now it's like 1.3Mchar/s  so significantly slower.
> 
> That's the sacrifice, we have to stop sending commands to the GPU and wait
> for it complete those in flight (quite frequently). Or else new rectangles
> overwrite vertex entries still being used by later entries 
> 
> > So the question here would be - isn't the corruption based on  triangle
> > surface size ? So i.e. GPU is able to  process a lot of small ones - but has
> > bug with bigger ones ?
> > 

But as I said before - if that were a plain hw defect - IMHO it would simply always appear - but it seems like it's working for a while - then 'something' happens - and flickering starts to appear - with (presumably) the same amount
of texels/triangles/vertices - and then something may happen again,
and the problem is gone for a while.

> Not really, you have to predict when a VUE being used by the end of the
> pipeline will be overwritten by a new rectangle at the start of the
> pipeline. This is completely internal state - the primitive command we want
> to feed to the GPU can contains thousands of rectangles. Instead of counting

Well I've tried even 8 max triangles - and the error appeared after a while,
so far '6' is magic.

> rectangles, you want to start counting fragments (actually texel reads since
> that will be the ratelimiting factor) and flush if we queue up too much work
> for the GPU. If you also model how fast the gpu is retiring fragments so

But in case the same page is rendered with problems as well as without problems,
then it doesn't look like texel read is problem, it rather looks like some
kind of memory mapping/ordering.

Also is there some explanation why  intel_gpu_top is showing so much higher GPU usage when the flickering is visible ?

Comment 98 Zdenek Kabelac 2013-07-07 10:26:42 UTC

Well I should wait a while before posting a comment about magic value 6.

I'm now observing flickering with value 6 as well.

So yeah - it's more or less time related - and it takes more or less time until the problem becomes visible.

Also is there an explanation why the max value 64 starts to cause problems with text rendering in gnome terminal ?

i.e. I'd have expected that if there were a large load on the GPU - but in this case it just appears that random pixels start to be drawn instead of some letter - maybe some font cache corruption ?

Comment 99 Chris Wilson 2013-07-07 10:36:13 UTC

(In reply to comment #97)
> (In reply to comment #96)
> > (In reply to comment #93)
> > > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > > -aa10text - now it's like 1.3Mchar/s  so significantly slower.
> > 
> > That's the sacrifice, we have to stop sending commands to the GPU and wait
> > for it complete those in flight (quite frequently). Or else new rectangles
> > overwrite vertex entries still being used by later entries 
> > 
> > > So the question here would be - isn't the corruption based on  triangle
> > > surface size ? So i.e. GPU is able to  process a lot of small ones - but has
> > > bug with bigger ones ?
> > > 
> 
> But as I said before - if that would be plain hw defect - IMHO it would
> simply always appear - but it seems like it's working for a while - then
> 'something' happens - and flickering starts to appear - with (assumingly)
> same amount
> of texels/triangle/vertices - and than something again may happen,
> and the problem is gone for a while.

It does. You do not have quite as much control over your tests as you presume.

> > Not really, you have to predict when a VUE being used by the end of the
> > pipeline will be overwritten by a new rectangle at the start of the
> > pipeline. This is completely internal state - the primitive command we want
> > to feed to the GPU can contains thousands of rectangles. Instead of counting
> 
> Well I've tried even 8 max triangles - and the error appeared after a while,
> so far '6' is magic.
> 
> > rectangles, you want to start counting fragments (actually texel reads since
> > that will be the ratelimiting factor) and flush if we queue up too much work
> > for the GPU. If you also model how fast the gpu is retiring fragments so
> 
> But in case the same page is rendered with problems as well as without
> problems,
> then it doesn't look like texel read is problem, it rather looks like some
> kind of memory mapping/ordering.

No. I did not say the texel reads were the problem, just an indicator as to how long the EU would execute any particular shader for a fragment. Also there is only a single sampler and many EU running many more threads, so contention will also play a factor in how long each fragment takes to process, and so how long buffers will be active for. Look more closely at what is going on, it is clear that the hardware is not tracking lifetimes of its URB correctly.

> Also is there some explanation why  intel_gpu_top is showing so much higher
> GPU usage when the flickering is visible ?

Other than the flickering correlates with GPU activity?


(In reply to comment #98)
> Well I should wait a while before posting a comment about magic value 6.
> 
> I'm now observing flickering with value 6 as well.
> 
> So yeah - it's more or less time related - and it takes more or less time
> until the problem becomes visible.
> 
> Also is there explanation with the max value 64  starts to make problems
> with text rendering in gnome terminal ?
> 
> i.e. I'd have expected if there would be a large press on GPU - but in this
> case it just appear random pixel start to be drawn instead of some letter -
> maybe some font cache corruption ?

It's still the same bug.

Comment 100 Zdenek Kabelac 2013-07-07 10:52:55 UTC

(In reply to comment #99)
> (In reply to comment #97)
> > (In reply to comment #96)
> > > (In reply to comment #93)
> > > > The bad part is - while before I've been getting 3.7Mchar/s in x11perf
> > > > -aa10text - now it's like 1.3Mchar/s  so significantly slower. 
> > But as I said before - if that would be plain hw defect - IMHO it would
> > simply always appear - but it seems like it's working for a while - then
> > 'something' happens - and flickering starts to appear - with (assumingly)
> > same amount
> > of texels/triangle/vertices - and than something again may happen,
> > and the problem is gone for a while.
> 
> It does. You do not have quite as much control over your tests as you
> presume.

Well I understand there are some weird things underneath - but when I have
a page where scrolling up/down shows heavy intel_gpu_top usage
and flickering of pictures is visible - and then I do nothing else
than a page reload - I have just the naive assumption that this redrawn page
is just using different memory buffers - but otherwise the number of graphical objects pushed to the GPU should be approximately the same.

(I'm using plain good old xfce, so no fancy Gnome3 shell composite desktops...)
 
> No. I did not say the texel reads where the problem, just an indicator as to
> how long the EU would execute any particular shader for a fragment. Also
> there is only a single sampler and many EU running many more threads, so
> contention will also play a factor into how long each fragment takes to
> process, and so how long buffers will be active for. Look more closely at
> what it is going on, it is clearly that the hardware is not tracking
> lifetimes of its URB correctly.
> 
> > Also is there some explanation why  intel_gpu_top is showing so much higher
> > GPU usage when the flickering is visible ?
> 
> Other than the flickering correlates with GPU activity?

Well the same page could be scrolled with either i.e. 15% 'render busy' - and no flickering -  or  with ~50% 'render busy' and visible problems.

I could make a debug build - but it would be probably needed to have some
kind of 'signal support' to dump needed data when necessary instead of logging
GB of data all the time.

> > i.e. I'd have expected if there would be a large press on GPU - but in this
> > case it just appear random pixel start to be drawn instead of some letter -
> > maybe some font cache corruption ?
> 
> It's still the same bug.

So maybe we could start at this - since  MAX 64 is giving the problem exposed very easily - it's enough to run gnome-terminal on my desktop.

Comment 101 Zdenek Kabelac 2013-07-07 21:38:01 UTC

It could be probably worth to mention - that when I'm flipping between Firefox tabs I could have scrolling tabs without any issue (and low GPU usage), while flipping to other tabs and scrolling in them increases GPU to high levels and visual problems are present.

So in the moment the problem appears - it's rather local to certain tabs.
And as I said - it's often enough to reload tab to temporarily fix the problem.

And as of typing current BZ message - I've noticed (using MAX 6) that letter 'a' in the word 'certain' above has been for a while drawn only from one half - and after I've typed whole sentence it's been just refreshed properly.
(And I've not seen such behavior for quite a while... and I think I'm quite sensitive for things like this)...

Comment 102 sergio.callegari 2013-07-08 16:59:47 UTC

Just received new debs of the intel driver (2.21.11 + snapshot taken 13-07-08, git dbb585), libdrm (2.4.46 + snapshot 13-07-08, git f8f1f6). Furthermore, I have also received a new 3.8 kernel from ubuntu a few days ago that may contain some drm changes (but I am not certain of what goes in since it is a distribution kernel, not mainline).

Now, I think that I am experiencing a lot of this too (or maybe it is something else, please advise). Particularly when writing emails with thunderbird, all of a sudden I notice that some character is corrupted. This seems to happen even without explicitly scrolling, but just typing. Very strange.

Comment 103 sergio.callegari 2013-07-08 17:08:31 UTC

One last note: it is not flickering. The characters that get corrupted stay corrupted for a long time (typically until scrolled out and rescrolled in).

Comment 104 sergio.callegari 2013-07-09 10:39:19 UTC

Weirdly enough, even though the artifacts on characters are not transient and stay there for arbitrarily long periods of time, provided that one does not affect the neighboring text (or temporarily scroll them away), it is impossible to take screen snapshots of them. As soon as I press PrtScreen, the artifacts get fixed.

Comment 105 sergio.callegari 2013-07-09 11:46:29 UTC

Created attachment 82227 [details]
Snapshot on the issue on characters

I have finally succeeded in taking a snapshot of the (very frequent) issue that I have with individual characters while editing text. This is from emacs.

The issue may look localized and very minor, but being on text it is in fact extremely distracting and annoying.

Just while editing this last sentence, it happened five or six times: entering a character makes another character somewhere else get corrupted (and stay so until another character is pressed, or the cursor is moved, or the text is scrolled out and back in). The weird thing is that entering a character here may cause a character corruption in some completely different place. And this is exactly the reason why it is absolutely a killer to productivity. It takes the eye away from where you are working.

Furthermore, the issue seems to be much more frequent if I type normally, than if, on purpose, I carefully and slowly enter characters one by one.

Comment 106 Zdenek Kabelac 2013-07-09 11:58:30 UTC

(In reply to comment #105)
> Created attachment 82227 [details]
> Snapshot on the issue on characters
> 


Yep - that's exactly what I easily observe with gnome-terminal when I increase max triangles from recent Chris patches to 64 - this is present almost all the time. And it's very occasional when just 6 is limit.

Comment 107 Chris Wilson 2013-07-09 12:19:54 UTC

The quick answer is that if you ever see it with MAX_FLUSH_VERTICES set to 1, then it is a different issue. Please do be aware that all current kernels since 3.7 do have a coherency issue, the fix will not arrive before 3.10.1 (outside of drm-intel trees).

Comment 108 Zdenek Kabelac 2013-07-09 14:03:02 UTC

(In reply to comment #107)
> The quick answer is that if ever see it with MAX_FLUSH_VERTICES set to 1,
> then it is a different issue. Please do be aware that all current kernels
> since 3.7 do have a coherency issue, the fix will not arrive before 3.10.1
> (outside of drm-intel trees).

Could you please commit here the link to the needed kernel commit so I could check if it's fixing issue for me?

Also it's then easier to see which upstream vanilla kernel will have it.

And also it will be good to remove the low triangle limitation.

Yes memory coherency seems like very good explanation for problems I'm seeing.

Comment 109 sergio.callegari 2013-07-12 06:24:03 UTC

Ubuntu has an experimental 64bit raring kernel (3.8) with the fix for the coherency bug. Works great for me for fixing the artifacts in the large bitmaps.

Can be tested at http://kernel.ubuntu.com/~jsalisbury/lp1200126/

It delivers the fix for bug 65665.

This is also very good news for this bug, since the fixed kernel makes it possible to test widely (at least on ubuntu) for this bug while separating out the effects of the other bug, 65665.

Specifically, it is now possible to ignore some of my attachments for this bug as they showed artifacts caused by the other bug (this is certainly the case for the 'sample libreoffice document that almost always shows issues').

Comment 110 sergio.callegari 2013-07-12 06:32:55 UTC

Unfortunately, the artifacts on the individual characters are not gone with the fix to the coherency bugs.

Comment 111 Zdenek Kabelac 2013-07-12 11:43:14 UTC

I've been looking at drm-intel-next-queued/drm-intel-fixes  - there are quite a few patches - but also a lot of reverts recently -  so I'm pretty much confused what is actually the solution for gma965  in T61.

Comment 112 sergio.callegari 2013-07-12 16:06:57 UTC

Created attachment 82369 [details]
Handy phone snapshot of artifacts on chars post drm/i915 fix: Only clear write-domains after a successful wait-seqno

This is an example of the artifacts on chars, still happening running a kernel with the drm/i915 fix 'Only clear write-domains after a successful wait-seqno'.
In this case we have a completely mangled 't' in firefox.

The snapshot was taken with a mobile phone, so the monitor's pixel grid is visible. In fact, it now appears impossible to capture this with PrtScrn: as soon as a print screen is requested, the wrong char is always redrawn correctly.

Comment 113 Alexander Haeussler 2013-07-21 19:50:44 UTC

I'm also affected by this character corruption bug. My hardware is a notebook with Intel 4500M (i915 driver). At first I thought it was caused by a hardware issue with my external screen, but the notebook screen shows the corrupted characters as well. Sometimes when I scroll down to the end of a website with text and then keep scrolling (when the end is reached), about 2 characters per sentence are corrupted, and about every 1/10 second the positions of the corruptions change. The characters that were corrupted before then look fine, and different characters are corrupted.

What I have tried so far:

- Disable/enable xcompmgr (no effect)
- Use XFCE (no effect)
- Change driver options i915_enable_rc6, i915_enable_fbc, lvds_downclock (no effect)
- Use different fonts (no effect)
- Disable sna (works)
- Test firefox, libreoffice, chromium (all affected)

Version Info:

Kernel 3.10.1-1-ARCH
xf86-video-intel 2.21.12

Comment 114 Alexander Haeussler 2013-07-21 21:00:50 UTC

Created attachment 82787 [details]
Character m of the word "parameters" is corrupted

Comment 115 Alexander Haeussler 2013-07-21 21:41:43 UTC

Created attachment 82788 [details]
Character e of word "deletion" and n of "Indicate" is corrupted

Comment 116 Alexander Haeussler 2013-07-21 21:45:42 UTC

Created attachment 82789 [details]
Character i of word "application" and p of "bpp>=8" is corrupted

Comment 117 sergio.callegari 2013-07-22 09:21:34 UTC

Added 'characters' to the bug title.

I believe that this is now the major issue related with this bug and having 'character' in the title may help people having issues look here.

Comment 118 sergio.callegari 2013-07-25 09:41:17 UTC

Incorrect rendering of some glyphs is still there as of yesterday's git snapshot (24/7/13). I'm reporting it as I read somewhere that newer released versions of the driver implemented a lower limit on MAX_FLUSH_VERTICES in order to reduce the impact of the bug, but I really see no difference. Yet, it may be the case that the reduction is only applied to the released driver and not in the devel version (haven't checked the actual code).

A couple of notes:

- The way in which the glyph corruption appears is weird. Without any scrolling, just typing in characters at some place causes some random character to get corrupted somewhere else (e.g. maybe 1 line above, maybe 10 cm to the right, etc). Sometimes typing a few more characters is enough to make the corruption disappear, sometimes it is not. I have not been able to determine if the glyph that gets corrupted is the same one that is being typed (i.e., I type in an 'e' and somewhere an 'e' gets corrupted), but I suspect that this is not the case.

- The issue seems to be much less frequent if I type slowly.

- Once a glyph is corrupted, just 'selecting' some random character around it, but not necessarily very close to it (e.g., 15 chars to the left or to the right), seems to be enough to cause a redraw of the glyph that fixes its rendering. This seems somehow similar to how pressing 'PrtScrn' to try to get a snapshot causes a redraw that fixes the wrong glyphs, so that it is difficult to get a screenshot of the issue, unless one relies on a camera.

Comment 119 Aidar 2013-07-26 10:53:47 UTC

It has been a long long time I am following this bug.

So, I built the 3.10.3 plain vanilla kernel, which went stable as of today (Friday, July 26, 2013) and I can say that my crappy GM45 chip works fine with the dreaded "test.odg" resizing test. Seems like things are improving.


and if it is of any interest here are the versions of relevant sw stack:

x11-libs/libdrm-2.4.45
media-libs/mesa-9.1.2-r1
x11-base/xorg-server-1.13.4
x11-drivers/xf86-video-intel-2.21.12 (sna is the Accel)


thanks for the continuous work on devices that were introduced in 2008.


ps: now, only if iwlwifi would get sorted out...

Comment 120 Chris Wilson 2013-07-26 22:25:18 UTC

*** Bug 67377 has been marked as a duplicate of this bug. ***

Comment 121 Chris Wilson 2013-08-27 09:29:54 UTC

*** Bug 68596 has been marked as a duplicate of this bug. ***

Comment 122 Chris Wilson 2013-11-20 14:25:45 UTC

Here's some insight: test/render-copyarea is just a mirth of fail on my gm45.

Comment 123 Zdenek Kabelac 2013-11-20 16:17:33 UTC

Well I could add here output of some tests:

i.e.:

$ ./render-composite-solid
Opened connection to :0 for testing.
Testing setting of single pixels (root): passed [1 iterations x 4096]
Testing area sets (root): passed [1 iterations x 4096]
Testing area fills (root): passed [1 iterations x 4096]
Testing setting of single pixels (child): passed [1 iterations x 4096]
Testing area sets (child): passed [1 iterations x 4096]
Testing area fills (child): passed [1 iterations x 4096]
Testing setting of single pixels (pixmap): passed [1 iterations x 4096]
Testing area sets (pixmap): passed [1 iterations x 4096]
Testing area fills (pixmap): passed [1 iterations x 4096]
Testing setting of single pixels (root): failed to set pixel (1465,296) to 00816152[9c816152], found 00000000 instead


Is this worth posting?

(git commit  b14228fafb654fe7d8f8783475aa0c0ba87e4fea)

Comment 124 Chris Wilson 2013-11-20 16:28:24 UTC

My experiments so far indicate that the errors only happen with rendering to the uncached frontbuffer, are not influenced by the number of rectangles in each primitive (though the failure does occur at different frequencies) and do not respond to adding extra MI_FLUSH. I don't think attaching examples of fail will help, as the next step will be trying to dissect the GPU state and work out why fragments fail inside a single shader.

Comment 125 Zdenek Kabelac 2013-11-20 16:33:47 UTC

Well here is just another one -

$ ./basic-copyarea
Opened connection to :0 for testing.
Testing setting of single pixels (root): passed [1 iterations x 4096]
Testing area sets (root): passed [1 iterations x 4096]
Testing area fills (root, using pixmap source): passed [1 iterations x 4096]
Testing area fills (root, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (child): passed [1 iterations x 4096]
Testing area sets (child): passed [1 iterations x 4096]
Testing area fills (child, using pixmap source): passed [1 iterations x 4096]
Testing area fills (child, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (pixmap): passed [1 iterations x 4096]
Testing area sets (pixmap): passed [1 iterations x 4096]
Testing area fills (pixmap, using pixmap source): passed [1 iterations x 4096]
Testing setting of single pixels (root): passed [2 iterations x 2048]
Testing area sets (root): failed to set pixel (0,0) to 0072dc99 [0e72dc99], found 00000000 [00000000] instead
00000000 00000000 00000000 	0e72dc99 0e72dc99 0e72dc99 
5146daef 5146daef 5146daef 	0e72dc99 0e72dc99 0e72dc99 
5146daef 5146daef 5146daef 	0e72dc99 0e72dc99 0e72dc99 


Unsure if it's related to anything, and the test seems to be lengthy -
but there seems to be some observable pattern.

Anyway - could I help here with testing some patch - or do you get same errors yourself on your available hardware ?

Comment 126 Chris Wilson 2013-11-20 16:51:22 UTC

The tests are intentionally overkill - they are also intended to try and test handling of large batches, as well as generally stress the system.

I hadn't noticed the basic-copyarea fail. That does look to be different. So far, the failure pattern had seemed to be a subspan doesn't get written (i.e. a single instance of one thread failed to execute correctly in the GPU, which I think could explain the general bug here.) basic-copyarea looks like a larger scale failure. And, yes, it does fail on my gm45 as well.

Comment 127 Harald Judt 2013-11-20 17:03:00 UTC

SNA has grown worse now. For about six weeks, I've been seeing corruption in the terminal (cursor not disappearing or not showing at all, more text missing until it is marked with the mouse, ...). I've switched to UXA; nothing bad is visible there.

Comment 128 Harald Judt 2013-12-03 09:24:05 UTC

With kernel-3.12.2 and current git, the situation now is much better. The corruptions happen less often (still often enough to be noticed though), and are usually confined to single characters, or only parts of single characters. While I'm writing this, I can notice what appears to be some slight corruption in the lines above which I've already written, but they are always gone immediately. Maybe today's just a lucky day, or the situation has improved.

What's more, the issue with the cursor not disappearing or not showing that I mentioned in my previous post no longer exists, so SNA on this gen4 is usable again.

Comment 129 Chris Wilson 2013-12-13 10:32:21 UTC

As a check that all problems are the same, can people who are still affected by this do a test with

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index a87af39..86c37d6 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -63,7 +63,7 @@
 #define NO_FILL_BOXES 0
 #define NO_VIDEO 0
 
-#define MAX_FLUSH_VERTICES 6
+#define MAX_FLUSH_VERTICES 1
 
 #define GEN4_GRF_BLOCKS(nreg)    ((nreg + 15) / 16 - 1)

and see if that cures the last of the corruptions.

Comment 130 Ivan Bulatovic 2013-12-13 10:38:14 UTC

Just to report, maybe this isn't an Intel driver bug at all. At home I have an ATI R270X (radeonsi driver) and an Intel HD2000, at work an X4500, and I get hit by this bug on all of them.

Small, single-char graphic corruption, easily triggered by scrolling through a window with lots of small text (tailing a log in debug mode in a terminal). But I also see it appear and disappear as I'm typing this comment.

I'm attaching the screenshots (/dev/shm is corrupted on black one).

@Chris, I'll try applying patches right now and report back.

Comment 131 Ivan Bulatovic 2013-12-13 10:39:25 UTC

Created attachment 90707 [details]
/dev/shm corruption

Comment 132 Ivan Bulatovic 2013-12-13 11:30:04 UTC

I don't want to speak too soon but it seems that the latest patch fixes the problem.

First I tried the latest git version and it didn't help; I noticed corrupted fonts as soon as I logged on and started poking at things in the terminal.

After I've applied the patch in comment #129 and restarted, no corruption at all (at least for now, it usually can be reproduced right after logging in).

Chris, any thoughts on why this is also happening with the radeon opensource driver?

Comment 133 Chris Wilson 2013-12-13 11:37:44 UTC

(In reply to comment #132)
> Chris, any thoughts on why is this happening with radeon opensource driver
> also ?

I was hoping that it would be a bug in a common component, but in this case it sounds like they have a similar bug in managing internal GPU state.

Comment 134 Chris Wilson 2013-12-13 11:39:43 UTC

(In reply to comment #132)
> I don't want to speak too soon but it seems that the latest patch fixes the
> problem.

Let it run for a day or so to be sure. The other thing that is worth checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4 etc. Setting it to 1 has a major impact on performance (we are roughly an order of magnitude slower at rendering than what can be expected).

Comment 135 Zdenek Kabelac 2013-12-13 11:42:33 UTC

Maybe this bug could be tracked down more easily when the number of vertices is actually much higher - since in that case it seems to fail almost immediately.

I understand there is some 'maximum queue' size the GPU can handle - but the engine should be able to track the size of all commands and not outgrow it.

So what else could break ordering ?

As it's very easy to trigger the problem with higher max - maybe it could be used to deduce which part of code unexpectedly touches the command queue ?

(but of course it's just my naive assumption about how the SNA driver works).

Comment 136 Chris Wilson 2013-12-13 11:51:11 UTC

The issue is that the VUEs (a VUE is a memory slot used by the GPU for a vertex entry) are reused by a second thread before the first thread is complete, causing the first thread to generate invalid texture coordinates and corrupt rendering. That is a hardware read-write hazard bug (or at least I have not found any controls in the EU state to prevent it).

Comment 137 Zdenek Kabelac 2013-12-13 11:58:36 UTC

(In reply to comment #136)
> The issue that the VUE (which is a memory slot used by the GPU for a vertex
> entry) are reused by a second thread before the first thread is complete,
> causing the first thread to generate invalid texture coordinates and corrupt
> rendering. That is a hardware read-write hazard bug (or at least I have not
> found any controls in the EU state to prevent it).

I don't think it's that easy to explain -

The typical problem in my case is -  when I freshly start the Xsession -
I do not observe any rendering bug.  I need to use this machine for a while,
to start to get those errors.

So if there were some 'easy to trigger hardware race' - it should be reproducible all the time.

But in my case it seems the probability increases heavily over time - maybe with the number of cached BO segments?

Or maybe it depends on how the memory order is set - i.e. the problem is triggered when a certain memory read pattern starts to appear?

Once the flicker starts - it then happens all the time - and then suddenly it disappears for a while again.

So maybe some memory offset/alignment of the buffers with vertices needs to be prohibited?

Comment 138 Chris Wilson 2013-12-13 12:04:52 UTC

Nope. The behaviour of this bug is very well characterised by the above analysis.

Comment 139 Zdenek Kabelac 2013-12-13 12:21:02 UTC

Well I'm curious how this explains the following - I've taken current git,
changed the value to '44', and I'm typing this text. I can see a lot of errors while typing - but these errors seem to be limited to certain regions of the shown text.

It's not destroyed everywhere - only in some particular place and at a particular time - it's a kind of weird defect to observe - and it also happens only for some specific use-cases.

E.g. my text editor 'fte', which uses X drawing code, has absolutely no issue - all characters are always correct.
But in firefox - typing this BZ comment I can see a lot of changing letters (usually everything after the 10th letter on the line is weirdly changing - as if the font cache were not working correctly).

Typing on the keyboard shows rendering bugs - as soon as the right mouse button is hit - everything is instantly redrawn correctly and the popup window is shown.

Another thing I notice is - I have about 20 lines of email headers in thunderbird, and exactly only the 4th line shows problems with letters - even when I just move the mouse over the TB window, this text is being continuously modified - but everything else in that window is without any problem.

So if you say there is a hw bug - how is it that it can very easily render everything else correctly?

Why do only certain portions of text have distortions - why is it not randomly spread over the whole screen (which I'd have expected for timing collisions)?

Comment 140 Zdenek Kabelac 2013-12-13 12:49:24 UTC

Just to add some more comment on MAX_FLUSH_VERTICES - when set to 96 it gives the highest throughput on x11perf -aa10text, 3.7MChar/s - using any higher value doesn't make any difference (so the max seems to be somewhere in the range (64, 96]).

Also, while these 3.7MChar/s are being rendered, moving some other window on the screen in parallel becomes very slow - so is the engine getting overloaded?

Or does the Xserver generate such a long queue of events that it becomes this slow?

Comment 141 Harald Judt 2013-12-16 13:37:39 UTC

(In reply to comment #134)
> (In reply to comment #132)
> > I don't want to speak too soon but it seems that the latest patch fixes the
> > problem.
> 
> Let it run for a day or so to be sure. The other thing that is worth
> checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4
> etc. Setting it to 1 has a major impact on performance (we are roughly an
> order of magnitude slower at rendering than what can be expected).

Setting MAX_FLUSH_VERTICES to 1 fixes the problem here, with the mentioned noticeable performance loss. IIRC, hibernating/resuming can accelerate the appearance of the bug, but I'm not quite sure about this; it might also be a coincidence.

I will now try to decrease the value from the default one to find the sweet spot.

Comment 142 Chris Wilson 2013-12-22 09:48:59 UTC

*** Bug 71773 has been marked as a duplicate of this bug. ***

Comment 143 Harald Judt 2013-12-22 11:28:10 UTC

All tests with MAX_FLUSH_VERTICES greater than 2 reveal those single garbled glyphs. I'm still testing with a value of 2. There isn't much noticeable difference between 3, 4, 5.

00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 0276

Comment 144 Lonnie Lee Best 2013-12-24 16:09:38 UTC

Here's another bug report regarding this same issue:
https://bugs.launchpad.net/bugs/1227569

Comment 145 Edward Sheldrake 2014-01-01 11:06:28 UTC

Created attachment 91383 [details]
gedit in openbox with 2.99.907 GM45 SNA

I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed to the bad commit as:
9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Dec 16 11:39:20 2013 +0000

    sna/gen4: Sacrifice performance to workaround render corruption

The problems are that lines of text keep disappearing (and reappearing) in gedit, and occasionally the screen becomes unresponsive for a short time and these messages appear in dmesg:

[ 1702.349954] [drm] stuck on render ring
[ 1702.349966] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x32a1000 ctx 0) at 0x32a1110

Comment 146 Chris Wilson 2014-01-01 11:22:31 UTC

(In reply to comment #145)
> [ 1702.349954] [drm] stuck on render ring
> [ 1702.349966] [drm] capturing error event; look for more information in
> /sys/class/drm/card0/error
> [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x32a1000 ctx 0) at 0x32a1110

Attach the error state.

Comment 147 Edward Sheldrake 2014-01-01 13:12:17 UTC

Created attachment 91388 [details]
intel_error_decode output

I saved the output from intel_error_decode but I didn't save the raw error data.

Comment 148 Chris Wilson 2014-01-01 13:35:39 UTC

Worth trying just:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 637137e..dc80de3 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna,
                if (rem <= 0) {
                        if (sna->render.vertex_offset) {
                                gen4_vertex_flush(sna);
-                               if (gen4_magic_ca_pass(sna, op))
+                               if (gen4_magic_ca_pass(sna, op)) {
+                                       OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
                                        gen4_emit_pipelined_pointers(sna, op, op->op,
                                                                     op->u.gen4.wm_kernel);
+                               }
                        }
                        OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
                        rem = MAX_FLUSH_VERTICES;

if you are happy that it reproduces reliably.

Comment 149 Edward Sheldrake 2014-01-01 14:12:00 UTC

(In reply to comment #148)
> Worth trying just:
> 
> diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
> index 637137e..dc80de3 100644
> --- a/src/sna/gen4_render.c
> +++ b/src/sna/gen4_render.c
> @@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna,
>                 if (rem <= 0) {
>                         if (sna->render.vertex_offset) {
>                                 gen4_vertex_flush(sna);
> -                               if (gen4_magic_ca_pass(sna, op))
> +                               if (gen4_magic_ca_pass(sna, op)) {
> +                                       OUT_BATCH(MI_FLUSH |
> MI_INHIBIT_RENDER_CACHE_FLUSH);
>                                         gen4_emit_pipelined_pointers(sna,
> op, op->op,
>                                                                     
> op->u.gen4.wm_kernel);
> +                               }
>                         }
>                         OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
>                         rem = MAX_FLUSH_VERTICES;
> 
> if you are happy that it reproduces reliably.

This change did not fully solve the problem. One text file in gedit displayed as blank initially for a bit, although things then seemed fine. But the freeze with "[drm] stuck on render ring" happened during a second run of gtkperf - while the "GtkDrawingArea - Text" test was running.

But I didn't spot any corrupted characters while running 2.99.907 with MAX_FLUSH_VERTICES set back to 6 - although I hadn't been running that for very long, and I only see a single garbled char occasionally, and I think they only appear in Firefox.

Comment 150 Edward Sheldrake 2014-01-01 14:13:47 UTC

Created attachment 91389 [details]
intel_error_decode output

the decoded error that occurred while running gtkperf, 2.99.907 with the change in comment #148

Comment 151 Edward Sheldrake 2014-01-01 14:26:19 UTC

(In reply to comment #149)
> 
> But I didn't spot any corrupted characters while running 2.99.907 with
> MAX_FLUSH_VERTICES set back to 6

And now I have spotted the single character corruption with 2.99.907 + MAX_FLUSH_VERTICES set to 6.

Comment 152 Harald Judt 2014-01-07 11:20:45 UTC

The bug occurs with MAX_FLUSH_VERTICES = 2 too, both in firefox and xterm, so unfortunately setting it to 1 seems to be the only solution here.

Comment 153 Chris Wilson 2014-01-07 15:02:08 UTC

(In reply to comment #145)
> Created attachment 91383 [details]
> gedit in openbox with 2.99.907 GM45 SNA
> 
> I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed
> to the bad commit as:
> 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
> commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Dec 16 11:39:20 2013 +0000
> 
>     sna/gen4: Sacrifice performance to workaround render corruption
> 
> The problems are lines of text keep disappearing (and reappearing) in gedit,
> and the occasionally the screen becomes unresponsive for a short time and
> these messages appear in dmesg:
> 
> [ 1702.349954] [drm] stuck on render ring
> [ 1702.349966] [drm] capturing error event; look for more information in
> /sys/class/drm/card0/error
> [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x32a1000 ctx 0) at 0x32a1110

Ok, I think I have this fixed:

commit 9d8473c5d9489db439aca73f470bda29a22ebab6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 7 13:43:35 2014 +0000

    sna/gen4: Check for available batch space before restoring state after CA pass
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73348
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 154 Zdenek Kabelac 2014-01-07 15:52:19 UTC

(In reply to comment #153)
> (In reply to comment #145)
> > Created attachment 91383 [details]
> > gedit in openbox with 2.99.907 GM45 SNA
> > 
> > I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed
> > to the bad commit as:
> > 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
> > commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Mon Dec 16 11:39:20 2013 +0000
> > 
> >     sna/gen4: Sacrifice performance to workaround render corruption


Interestingly, this commit has increased the number of buggy characters - although I admit I'm overriding MAX_FLUSH_VERTICES to 9 (since I don't like the slowness with 1), so I've been living with a faster desktop and occasional wrong characters on the screen. With this commit, however, the visual problems seem to have increased to a level that makes reading noticeably difficult, which will probably force me to switch to the very slow '1' for MAX :(...

(I'm not sure if it's the direct cause - since before this I was using an about 3 week old version of the git tree)

Comment 155 Zdenek Kabelac 2014-01-07 16:26:20 UTC

I think a nice illustration is that, while before with MAX 9 e.g. gnome-terminal running top was not really rendering broken characters, it now shows a lot of messed-up characters.

On the other hand - MAX 1 now seems to be smoother than before - so except for some dramatic drops in performance in benchmark tools like 'x11perf -aa10text' it seems to be quite usable.

Comment 156 Zdenek Kabelac 2014-01-07 16:37:29 UTC

Bad news - I'm now in fact able to spot badly rendered characters in Firefox with MAX 1 as well.

So something went wrong....

Comment 157 Chris Wilson 2014-01-07 17:46:18 UTC

Something worth experimenting with is detuning the GPU, e.g.:


diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index e239c21..bc6af68 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -52,7 +52,7 @@
  */
 #define FORCE_SPANS 0
 #define FORCE_NONRECTILINEAR_SPANS -1
-#define FORCE_FLUSH 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
+#define FORCE_FLUSH 0 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
 
 #define NO_COMPOSITE 0
 #define NO_COMPOSITE_SPANS 0
@@ -74,7 +74,7 @@
 #define URB_CS_ENTRIES	      0
 
 #define URB_VS_ENTRY_SIZE     1
-#define URB_VS_ENTRIES	      32
+#define URB_VS_ENTRIES	      16
 
 #define URB_GS_ENTRY_SIZE     0
 #define URB_GS_ENTRIES	      0
@@ -83,7 +83,7 @@
 #define URB_CL_ENTRIES      0
 
 #define URB_SF_ENTRY_SIZE     2
-#define URB_SF_ENTRIES	      64
+#define URB_SF_ENTRIES	      1
 
 /*
  * this program computes dA/dx and dA/dy for the texture coordinates along
@@ -93,9 +93,9 @@
 #define SF_KERNEL_NUM_GRF 16
 #define PS_KERNEL_NUM_GRF 32
 
-#define GEN4_MAX_SF_THREADS 24
-#define GEN4_MAX_WM_THREADS 32
-#define G4X_MAX_WM_THREADS 50
+#define GEN4_MAX_SF_THREADS 8
+#define GEN4_MAX_WM_THREADS 16
+#define G4X_MAX_WM_THREADS 16
 
 static const uint32_t ps_kernel_packed_static[][4] = {
 #include "exa_wm_xy.g4b"

Comment 158 Zdenek Kabelac 2014-01-07 20:14:43 UTC

Created attachment 91613 [details]
Grabbed snapshot with patch from comment 157

As can be seen - yes - with little effort I'm able to capture broken characters with the given patch compiled in.

The only thing needed is to start editing the dialog in Firefox and add/remove random characters in various places.

Comment 159 Harald Judt 2014-01-09 10:25:08 UTC

After updating to current git, the situation has become worse and I now see more visible corruptions even with MAX_FLUSH_VERTICES=1. And I did not have to wait some hours for them to appear as before, they are visible right after starting X.

Comment 160 Chris Wilson 2014-01-09 10:28:31 UTC

Typical, it appeared stable through my firefox testing, but if you try reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions to disappear again.

Comment 161 Zdenek Kabelac 2014-01-09 14:23:06 UTC

(In reply to comment #160)
> Typical, it appeared stable through my firefox testing, but if you try
> reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions
> to disappear again.

Yep - correct, reverting this commit makes MAX_FLUSH_VERTICES 1 produce correct rendering again - even though it's again noticeably slower - so now it's clear why I considered it usable with MAX 1 before (in my comment 155).

So yep - revert & MAX 1 works again - but it's quite slow.

Is it now any easier to deduce which operation causes such bad memory interaction?

It seems like the 'synchronization' is really needed only at very specific moments - where the GPU is producing memory corruption errors on the screen - but how to catch at which moment?

I still suspect the memory layout of those memory objects - since when I see corruption, it is usually in specific places (i.e. while editing this firefox input widget, only certain characters at certain positions are rendered with errors).

How can I try increasing the memory alignment?
(i.e. each object only at a 16KB boundary?)

Comment 162 Harald Judt 2014-01-10 13:37:41 UTC

I have updated to current git and reverted commit b7565a26401e283df94b68019e8093f8104428f4 and left the MAX_FLUSH_VERTICES set to 1, but now instead of glyph corruption I notice some icons are corrupted similar to the glyphs before.

Example: In thunderbird, I hover over a toolbar icon and when the mouse pointer leaves the icon, it is corrupted. Or it gets corrupted when I hover over the icon. In both cases, the corruptions disappear a short time after, when something else gets updated on the screen.

Strangely, I have not seen glyph corruption yet.

Comment 163 Edward Sheldrake 2014-01-12 16:39:13 UTC

I have been experimenting with various numbers in the code in comment #157 without really discovering anything useful. I think any change that slows things down decreases the chances of observing any corruption, but might not necessarily fix the problem completely.

With current git + MAX_FLUSH_VERTICES=6, gnome-terminal suffers from the text corruption, firefox is bad, but KDE4 konqueror and MiniBrowser from webkitgtk3 seem to display the same webpages perfectly fine. The other thing I have noticed is that under this setup, I can make gedit suffer from the text corruption if I set an italic (and not monospaced) font - in that case, almost every line of the text file displayed will show some corruption immediately after changing to the italic font.

Comment 164 Zdenek Kabelac 2014-01-16 19:40:16 UTC

Created attachment 92240 [details]
Text with errors

Recently I've been noticing somewhat more 'weird' behavior - it might be related to my temporary use of nightly builds of Mozilla (since the rawhide version got somewhat broken).

What is weird in this image is - the text was badly rendered AND it remained visible even when refreshed; even a small scroll up/down left the text as is - only scrolling the text out of the window and back helped.

So I assume the rendering now happens to some off-screen memory - which is then transferred back to the screen with errors - and a refresh doesn't help in this case.
(and it's a bit different from what happens while typing text into an input box).

Another probably unrelated comment - even when 'glxgears' is running in parallel, the visual errors still happen while typing this text. I'd have expected glxgears to manage to 'overfill' the GPU queue (since it's rendering ~1300FPS with the default window); also the flushes are probably completely different. And there is no observable rendering error in the gears window - only Firefox seems to be exposing them.

Comment 165 Edward Sheldrake 2014-01-17 07:06:55 UTC

I am wondering if some extra flushes are needed in regard to what the G45 PRM PDFs say about the BLT (section 8.6, vol 1b p. 170)

git + this gives only a moderate amount of corrupt rendering:
diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index e239c21..f150e5b 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -63,7 +63,7 @@
 #define NO_FILL_BOXES 0
 #define NO_VIDEO 0
 
-#define MAX_FLUSH_VERTICES 1 /* was 6, https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
+#define MAX_FLUSH_VERTICES 12 /* was 6, https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
 
 #define GEN4_GRF_BLOCKS(nreg)    ((nreg + 15) / 16 - 1)
 
@@ -571,26 +571,28 @@ static void gen4_emit_vertex_buffer(struct sna *sna,
 inline static void
 gen4_emit_pipe_flush(struct sna *sna)
 {
-#if 1
+#if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
 	OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
 inline static void
 gen4_emit_pipe_break(struct sna *sna)
 {
-#if 1
+#if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(0);
+	OUT_BATCH(GEN4_PIPE_CONTROL_TC_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
 	OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
@@ -599,11 +601,12 @@ gen4_emit_pipe_invalidate(struct sna *sna)
 {
 #if 0
 	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH | GEN4_PIPE_CONTROL_IS_FLUSH);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
-	OUT_BATCH(MI_FLUSH);
+	OUT_BATCH(MI_FLUSH); /* | MI_STATE_INSTRUCTION_CACHE_FLUSH */
+	/* OUT_BATCH(MI_NOOP); */
 #endif
 }
 
@@ -781,7 +784,10 @@ gen4_emit_urb(struct sna *sna)
 	urb_cl_end = urb_gs_end + URB_CL_ENTRIES * URB_CL_ENTRY_SIZE;
 	urb_sf_end = urb_cl_end + URB_SF_ENTRIES * URB_SF_ENTRY_SIZE;
 	urb_cs_end = urb_sf_end + URB_CS_ENTRIES * URB_CS_ENTRY_SIZE;
-	assert(urb_cs_end <= 256);
+	if (sna->kgem.gen >= 045)
+		assert(urb_cs_end <= 384);
+	else
+		assert(urb_cs_end <= 256);
 
 	while ((sna->kgem.nbatch & 15) > 12)
 		OUT_BATCH(MI_NOOP);
@@ -1623,6 +1629,7 @@ gen4_render_composite_done(struct sna *sna,
 		kgem_bo_destroy(&sna->kgem, op->src.bo);
 
 	sna_render_composite_redirect_done(sna, op);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2154,6 +2161,7 @@ gen4_render_composite_spans_done(struct sna *sna,
 
 	kgem_bo_destroy(&sna->kgem, op->base.src.bo);
 	sna_render_composite_redirect_done(sna, &op->base);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2500,6 +2508,7 @@ fallback_blt:
 	gen4_vertex_flush(sna);
 	sna_render_composite_redirect_done(sna, &tmp);
 	kgem_bo_destroy(&sna->kgem, tmp.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 	return true;
 
 fallback_tiled_dst:
@@ -2535,6 +2544,7 @@ gen4_render_copy_done(struct sna *sna, const struct sna_copy_op *op)
 {
 	if (sna->render.vertex_offset)
 		gen4_vertex_flush(sna);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool
@@ -2736,6 +2746,7 @@ gen4_render_fill_boxes(struct sna *sna,
 
 	gen4_vertex_flush(sna);
 	kgem_bo_destroy(&sna->kgem, tmp.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 	return true;
 }
 
@@ -2776,6 +2787,7 @@ gen4_render_fill_op_done(struct sna *sna, const struct sna_fill_op *op)
 	if (sna->render.vertex_offset)
 		gen4_vertex_flush(sna);
 	kgem_bo_destroy(&sna->kgem, op->base.src.bo);
+	gen4_emit_pipe_invalidate(sna);
 }
 
 static bool

I've also tried setting "Render Cache Operational Flush Enable" of the Cache_Mode_0 register with intel_reg_write, this made no difference.

I was also wondering if firefox is particularly bad because it uses its own old version of cairo, which seems to be version 1.9.8 plus lots of patches.

Comment 166 Zdenek Kabelac 2014-01-17 09:41:02 UTC

I've not yet tested the patch from comment 165 - but with regard to Firefox and Cairo - I'm also seeing errors in e.g. pidgin, where status icons occasionally look damaged.

And my rawhide has these related packages:

cairo-1.13.1-0.1.git337ab1f.fc21.x86_64
cairo-gobject-1.13.1-0.1.git337ab1f.fc21.x86_64
pidgin-2.10.7-9.fc21.x86_64
xorg-x11-server-Xorg-1.15.0-1.fc21.x86_64
libX11-1.6.1-1.fc20.x86_64


 ldd /bin/pidgin
	linux-vdso.so.1 =>  (0x00007fff323ad000)
	libX11.so.6 => /lib64/libX11.so.6 (0x00007f6f151af000)
	libXext.so.6 => /lib64/libXext.so.6 (0x00007f6f14f9c000)
	libXss.so.1 => /lib64/libXss.so.1 (0x00007f6f14d98000)
	libSM.so.6 => /lib64/libSM.so.6 (0x00007f6f14b90000)
	libICE.so.6 => /lib64/libICE.so.6 (0x00007f6f14973000)
	libgtkspell.so.0 => /lib64/libgtkspell.so.0 (0x00007f6f1476c000)
	libgtk-x11-2.0.so.0 => /lib64/libgtk-x11-2.0.so.0 (0x00007f6f140e7000)
	libgdk-x11-2.0.so.0 => /lib64/libgdk-x11-2.0.so.0 (0x00007f6f13e25000)
	libpangocairo-1.0.so.0 => /lib64/libpangocairo-1.0.so.0 (0x00007f6f13c18000)
	libatk-1.0.so.0 => /lib64/libatk-1.0.so.0 (0x00007f6f139f4000)
	libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f6f136ce000)
	libgdk_pixbuf-2.0.so.0 => /lib64/libgdk_pixbuf-2.0.so.0 (0x00007f6f134aa000)
	libgio-2.0.so.0 => /lib64/libgio-2.0.so.0 (0x00007f6f1312b000)
	libpangoft2-1.0.so.0 => /lib64/libpangoft2-1.0.so.0 (0x00007f6f12f15000)
	libpango-1.0.so.0 => /lib64/libpango-1.0.so.0 (0x00007f6f12cca000)
	libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f6f12a8e000)
	libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f6f127e9000)
	libpurple.so.0 => /lib64/libpurple.so.0 (0x00007f6f124af000)
	libdbus-glib-1.so.2 => /lib64/libdbus-glib-1.so.2 (0x00007f6f12287000)
	libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x00007f6f1203e000)
	libfarstream-0.1.so.0 => /lib64/libfarstream-0.1.so.0 (0x00007f6f11e29000)
	libgstbase-0.10.so.0 => /lib64/libgstbase-0.10.so.0 (0x00007f6f11bd5000)
	libgstinterfaces-0.10.so.0 => /lib64/libgstinterfaces-0.10.so.0 (0x00007f6f119c2000)
	libgstreamer-0.10.so.0 => /lib64/libgstreamer-0.10.so.0 (0x00007f6f116d9000)
	libgobject-2.0.so.0 => /lib64/libgobject-2.0.so.0 (0x00007f6f11487000)
	libgmodule-2.0.so.0 => /lib64/libgmodule-2.0.so.0 (0x00007f6f11282000)
	libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x00007f6f11080000)
	libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x00007f6f10d54000)
	libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f6f109ea000)
	libidn.so.11 => /lib64/libidn.so.11 (0x00007f6f107b7000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f6f104b1000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x00007f6f10296000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f6f1007b000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6f0fe5d000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f6f0fa95000)
	libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f6f0f874000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f6f0f670000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f6f0f46a000)
	libenchant.so.1 => /lib64/libenchant.so.1 (0x00007f6f0f25e000)
	libXfixes.so.3 => /lib64/libXfixes.so.3 (0x00007f6f0f058000)
	libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f6f0ee4d000)
	libXinerama.so.1 => /lib64/libXinerama.so.1 (0x00007f6f0ec4a000)
	libXi.so.6 => /lib64/libXi.so.6 (0x00007f6f0ea3a000)
	libXrandr.so.2 => /lib64/libXrandr.so.2 (0x00007f6f0e82f000)
	libXcursor.so.1 => /lib64/libXcursor.so.1 (0x00007f6f0e624000)
	libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f6f0e421000)
	libXdamage.so.1 => /lib64/libXdamage.so.1 (0x00007f6f0e21d000)
	libharfbuzz.so.0 => /lib64/libharfbuzz.so.0 (0x00007f6f0dfc8000)
	libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f6f0dd1a000)
	libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f6f0daf7000)
	libpng16.so.16 => /lib64/libpng16.so.16 (0x00007f6f0d8c4000)
	libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f6f0d6c0000)
	libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f6f0d4b6000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f6f0d2a0000)
	libGL.so.1 => /lib64/libGL.so.1 (0x00007f6f0d037000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f6f0ce2f000)
	libffi.so.6 => /lib64/libffi.so.6 (0x00007f6f0cc26000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f6f0ca02000)
	libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f6f0c7d7000)
	liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f6f0c5b2000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6f15521000)
	libXau.so.6 => /lib64/libXau.so.6 (0x00007f6f0c3ad000)
	libgraphite2.so.3 => /lib64/libgraphite2.so.3 (0x00007f6f0c191000)
	libX11-xcb.so.1 => /lib64/libX11-xcb.so.1 (0x00007f6f0bf8e000)
	libxcb-dri2.so.0 => /lib64/libxcb-dri2.so.0 (0x00007f6f0bd89000)
	libxcb-xfixes.so.0 => /lib64/libxcb-xfixes.so.0 (0x00007f6f0bb82000)
	libxcb-shape.so.0 => /lib64/libxcb-shape.so.0 (0x00007f6f0b97d000)
	libgbm.so.1 => /lib64/libgbm.so.1 (0x00007f6f0b775000)
	libwayland-client.so.0 => /lib64/libwayland-client.so.0 (0x00007f6f0b567000)
	libwayland-server.so.0 => /lib64/libwayland-server.so.0 (0x00007f6f0b355000)
	libglapi.so.0 => /lib64/libglapi.so.0 (0x00007f6f0b12e000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007f6f0af1c000)
	libdrm.so.2 => /home/kabi/soft/glx-test/lib/libdrm.so.2 (0x00007f6f0ad0f000)
	libxcb-glx.so.0 => /lib64/libxcb-glx.so.0 (0x00007f6f0aaf5000)
	libXxf86vm.so.1 => /lib64/libXxf86vm.so.1 (0x00007f6f0a8ee000)
	libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f6f0a687000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6f0a470000)


So while Firefox is the easiest place to see those errors (it's always enough to play for a while with some input box), the problem is probably not tied to its built-in version of the Cairo library.

Comment 167 Harald Judt 2014-01-17 10:41:55 UTC

The image corruption is also visible on the file/folder icons in thunar. I still have no glyph corruption, though.

Comment 168 Chris Wilson 2014-01-17 14:35:39 UTC

Created attachment 92287 [details] [review]
Always force a GPU flush between operations

Can you please try this patch against git and see if that improves things - except for performance?

Comment 169 Zdenek Kabelac 2014-01-17 14:59:20 UTC

ok - while doing a very quick & light check - at least on Firefox input window I do not observe any rendering bugs (which have been pretty simple to reach before).

(Lenovo T61 +  git + patch from comment 168)

Though the performance decrease is noticeable, and the setting for MAX_FLUSH... also becomes irrelevant.

Comment 170 Arkadiusz Miskiewicz 2014-01-17 15:11:38 UTC

Created attachment 92288 [details]
bug

f23ab963c4f4ada2051588dfc85264aa2798dbf7 + that patch and I'm seeing corruption. Using google-chrome and letters in url bar or window title bar sometimes get corrupted and then get fixed.

Also seeing the problem in gimp menus. Some letters get corrupted and fixed.

Comment 171 Zdenek Kabelac 2014-01-17 16:24:32 UTC

(In reply to comment #170)
> Created attachment 92288 [details]
> bug
> 
> f23ab963c4f4ada2051588dfc85264aa2798dbf7 + that patch and I'm seeing
> corruption. Using google-chrome and letters in url bar or window title bar
> sometimes get corrupted and then get fixed.
> 
> Also seeing the problem in gimp menus. Some letters get corrupted and fixed.

This could be related to my hw:

[219745.896] (II) intel(0): SNA compiled with assertions enabled
[219745.898] (--) intel(0): Integrated Graphics Chipset: Intel(R) 965GM
[219745.898] (--) intel(0): CPU: x86-64, sse2, sse3, ssse3
[219745.898] (**) intel(0): Depth 24, (--) framebuffer bpp 32


And Arkadiusz hw:

[236338.852] (II) intel(0): SNA compiled with assertions enabled
[236338.853] (--) intel(0): Integrated Graphics Chipset: Intel(R) GM45
[236338.853] (--) intel(0): CPU: x86-64, sse2, sse3, ssse3, sse4.1
[236338.853] (==) intel(0): Depth 24, (--) framebuffer bpp 32


On my 965GM I've not yet seen any error....

Comment 172 Edward Sheldrake 2014-01-18 09:01:42 UTC

(In reply to comment #168)
> Created attachment 92287 [details] [review]
> Always force a GPU flush between operations
> 
> Can you please try this patch against git and see if that improves things -
> except for performance?

Current git (2.99.907-23-gf23ab96) without any other changes: still a few corrupt characters in gedit with italic font

Current git + patch in attachment 92287 [details] [review]: still the same - has a few corrupt characters in gedit with italic font, displaying files full of text

Comment 173 Zdenek Kabelac 2014-01-20 20:31:59 UTC

Update after some new commits:

4c7b183fd21b461f9f18662c3b9d9732b6bef13d + the "Always" patch - now gives me broken text lines in the Thunderbird window.

And it's now enough just to move the mouse over the text: the text keeps changing and some letters never render correctly.

Checking back f23ab963c4f4ada2051588dfc85264aa2798dbf7 + Always patch - again correct rendering.

This relates to GMA965.

Comment 174 Chris Wilson 2014-01-21 10:29:53 UTC

Created attachment 92512 [details] [review]
Always force a GPU flush between operations

Updated always flush patch that passes Arkadiusz's stress test.

* sobs

Comment 175 Edward Sheldrake 2014-01-25 15:34:30 UTC

Created attachment 92770 [details]
not quite random corruption example

With the workarounds disabled, can anything be deduced from the text or pixmap corruption not seeming to be completely random?

Italic text seems to be particularly badly hit, and it seems to vary with the font and size. But in the attached screenshot, some of the lines of text never showed any corruption, while others usually showed some corruption, the corruption changing on switching focus to another window and back. Size 10 italic Cantarell seemed particularly badly hit, with even lines of just repeated c or d characters showing corruption (if longer than 18 letters), but other fonts don't usually show any corruption on a line of text filled with a single repeating character.

For pixmap corruption, the printer icon in the gedit toolbar seems to get turned into grey vertical bars more often than any other icons get corrupted.

Comment 176 Chris Wilson 2014-01-25 15:40:07 UTC

(In reply to comment #175)
> Created attachment 92770 [details]
> not quite random corruption example
> 
> With the workarounds disabled, can anything be deduced from the text or
> pixmap corruption not seeming to be completely random?

It's not entirely random. What I have noticed is that one or more vertices are corrupt. Sometimes you see the correct content but skewed, which is what happens if you just move one of the vertices (its texture coordinates). With two or more distorted coordinates, we can be sampling from anywhere within the texture - which can mean that we see the wrong glyph or a highly distorted composite of several glyphs (since all the active glyphs are stored in a single texture).

Comment 177 Edward Sheldrake 2014-02-02 13:29:14 UTC

Running with all workarounds disabled, this change doesn't fix anything nor seem to make any difference, but anyway:
Shouldn't the cache flush bits be in dword 0 for gen4 GEN4_PIPE_CONTROL? Maybe gen5 also?

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 1d164b6..894418b 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -575,8 +575,10 @@ inline static void
 gen4_emit_pipe_flush(struct sna *sna)
 {
 #if 1
-	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL |
+		  GEN4_PIPE_CONTROL_WC_FLUSH |
+		  (4 - 2));
+	OUT_BATCH(0);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else
@@ -601,8 +603,10 @@ inline static void
 gen4_emit_pipe_invalidate(struct sna *sna)
 {
 #if 0
-	OUT_BATCH(GEN4_PIPE_CONTROL | (4 - 2));
-	OUT_BATCH(GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH);
+	OUT_BATCH(GEN4_PIPE_CONTROL |
+		  GEN4_PIPE_CONTROL_WC_FLUSH | GEN4_PIPE_CONTROL_TC_FLUSH |
+		  (4 - 2));
+	OUT_BATCH(0);
 	OUT_BATCH(0);
 	OUT_BATCH(0);
 #else

Comment 178 Edward Sheldrake 2014-02-02 16:53:39 UTC

Created attachment 93235 [details] [review]
sna/gen4,5: Fix setting pipe control cache flush bits

Only the one in gen4_emit_pipe_flush is in an enabled part of the code anyway.

Comment 179 Chris Wilson 2014-02-03 10:04:56 UTC

Nevertheless it was a good catch.

commit 1cbc59a917e7352fc68aa0e26b1575cbd0ceab0d
Author: Edward Sheldrake <ejsheldrake@gmail.com>
Date:   Mon Feb 3 09:34:33 2014 +0000

    sna/gen4,5: Fix setting pipe control cache flush bits
    
    Cache flush bits are on dword 0, not 1, on gen4 and gen5. Also texture
    cache invalidate is only available from Cantiga onwards.

Comment 180 Edward Sheldrake 2014-02-03 22:06:01 UTC

Created attachment 93326 [details]
icon corruption

Latest git (2.99.909-7-g1cbc59a) has icon corruption, but all text is fine.

Comment 181 Chris Wilson 2014-02-04 09:54:24 UTC

Sigh.

Probably,


diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 1580707..ba9c9bc 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -602,6 +602,7 @@ gen4_emit_pipe_break(struct sna *sna)
 inline static void
 gen4_emit_pipe_invalidate(struct sna *sna)
 {
+#if 0
        OUT_BATCH(GEN4_PIPE_CONTROL |
                  GEN4_PIPE_CONTROL_WC_FLUSH |
                  (sna->kgem.gen >= 045 ? GEN4_PIPE_CONTROL_TC_FLUSH : 0) |
@@ -609,6 +610,9 @@ gen4_emit_pipe_invalidate(struct sna *sna)
        OUT_BATCH(0);
        OUT_BATCH(0);
        OUT_BATCH(0);
+#else
+       OUT_BATCH(MI_FLUSH);
+#endif
 }

Comment 182 Arkadiusz Miskiewicz 2014-02-04 10:32:28 UTC

(In reply to comment #181)


Without the patch from comment #181 I had flickering like:
http://ixion.pld-linux.org/~arekm/intel-flicker.mov
(best viewed from local fs)

With the patch flickering is gone.

"GM45" - synonym for all bad words :/

Comment 183 Chris Wilson 2014-02-04 10:38:59 UTC

Pushed the flushes once again. Hopefully we are corruption free once more.

commit fc001615ff78df4dab6ee0d5dd966b723326c358
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 4 10:36:21 2014 +0000

    sna/gen4: Disable use of pipecontrol invalidates again
    
    One day, just not today, we may make gen4 work correctly, efficiently and
    fast. Today, we can barely pick one.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.u

Comment 184 Ildar Muyukov 2014-03-19 06:43:54 UTC

Created attachment 96026 [details]
GTK+2 fonts corruption

As per git, the commit is included in version 2.99.910, which is the version I have installed.
But the problem is still there; see the screenshot. Interestingly, the fonts are heavily corrupted in GTK+2 apps. Qt/EFL apps are fine. GTK+3 apps are mostly OK, though gedit has slightly corrupted icons, but not fonts.

Comment 185 Chris Wilson 2014-03-19 06:59:57 UTC

(In reply to comment #184)
> Created attachment 96026 [details]
> GTK+2 fonts corruption
> 
> as per git, the commit is applied for the version 2.99.910. I have this
> version installed.
> But the problem is still there, see the screenshot. Interestingly, the fonts
> are heavily corrupted in GTK+2 apps. QT/efl apps are fine. GTK+3 apps mainly
> ok, but gedit have icons corrupted a bit, but not fonts.

Please provide your Xorg.0.log to confirm this is the same bug.

Comment 186 Ildar Muyukov 2014-03-19 07:02:30 UTC

Created attachment 96027 [details]
Xorg.log

Comment 187 Chris Wilson 2014-03-19 07:16:21 UTC

Ah, would you happen to have an uneven amount of memory installed?

Comment 188 Ildar Muyukov 2014-03-19 10:38:25 UTC

(In reply to comment #187)
> Ah, would you happen to have an uneven amount of memory installed?

Probably. Is this an unsupported configuration?

Comment 189 Harald Judt 2014-03-19 10:41:02 UTC

(In reply to comment #187)
> Ah, would you happen to have an uneven amount of memory installed?

I too have an uneven amount of memory installed (7 GiB). It's an old workhorse for office usage only, so dual-channel doesn't matter.

Comment 190 Chris Wilson 2014-03-19 10:45:08 UTC

(In reply to comment #188)
> (In reply to comment #187)
> > Ah, would you happen to have an uneven amount of memory installed?
> 
> Probably. Is this an unsupported configuration?

We have a known issue in that we don't detect the swizzling correctly, and so we may end up with corruption if objects are paged out from memory. Typically you see the effects after running for some time (so that memory pressure takes effect) or after resume. See bug 28813 and bug 45092.

Comment 191 Ildar Muyukov 2014-03-19 11:15:34 UTC

true. Thanks a lot, Chris!

Comment 192 Chris Wilson 2014-03-30 21:41:39 UTC

*** Bug 76804 has been marked as a duplicate of this bug. ***

Comment 193 Norbert Preining 2015-12-12 13:33:41 UTC

Dear all,

Since the 4.4-rc series I am again seeing this kind of corruption, especially after returning from sleep (suspend to RAM).

This happens on a Sony VAIO Pro, 

00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Sony Corporation Device 90b6
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at f5c00000 (64-bit, non-prefetchable) [size=4M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915

Switching to 4.3 resolves the issue.

Software is Debian/sid, up to date, that is xorg 7.7, driver-intel 2.99.917

Is this a known problem on the new kernels?

Thanks

Norbert

Comment 194 Martin Peres 2019-11-27 13:30:47 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-intel/issues/13.
