Bug 51422

Summary: [ilk] Glyph rendering corruption in firefox (single character strings?) [SNA]
Product: xorg Reporter: Clemens Eisserer <linuxhippy>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
screenshot of the first few characters of a facebook friend list
none
screenshot of glyph corruption on geizhals.at
none
screenshot of a few entries of my facebook friend list
none
background pixmap corruption
none
screenshot after slowly moving terminal window on top of FF
none
glyph corruptions when GLYPH_MIN_SIZE=12 none

Description Clemens Eisserer 2012-06-25 13:43:13 UTC
I've spotted glyph corruptions which seem to occur from time-to-time when using Firefox. 

On Facebook I've seen every 2-4th entry in the friend's list to have a corrupted first character - on geizhals.at I've seen corrupted links in the short time between having a link clicked and rendering the new page.

I've been using Fedora16+updates and  SNA compiled from 2.19.0-334-gfa10005.

I've retired my i945GM based gen3 machine, this happens with an i5-540m.
Comment 1 Clemens Eisserer 2012-06-25 13:44:19 UTC
Created attachment 63465 [details]
screenshot of the first few characters of a facebook friend list
Comment 2 Clemens Eisserer 2012-06-25 13:44:59 UTC
Created attachment 63466 [details]
screenshot of glyph corruption on geizhals.at
Comment 3 Chris Wilson 2012-06-25 13:55:33 UTC
No! You can't consign your 945gm to the scrap heap! That means I need to make Arrandale fast now. :(

How long have you been using the i5? Do you have any feel whether this is a recent regression? I use an i7-640m as my primary workstation, so I am slightly surprised...

Which kernel is f16 current shipping?
Comment 4 Clemens Eisserer 2012-06-26 14:28:21 UTC
The i540 is "brand new", I started using it last week ago - I could test older drivers versions if that helps.
Fedora started shipping 3.4.2, but the bug has been there with 3.3 too if I remember correctly.

The i945GM will go to my fathers office (backlight burnt out after 5.5 years) - so probably will generate a few bug-reports still - besides, Arrandale seems to be quite fast already :)
Comment 5 Chris Wilson 2012-06-27 04:06:52 UTC
For the time being, can I ask you to simply keep an eye for the pattern and see if we can nail the reproduction steps? There are certain steps that we can then take to find whereabouts the cause is likely to be, but unless we can reproduce the issue reliably we will be wasting our time.
Comment 6 Chris Wilson 2012-07-04 10:45:47 UTC
Marking as needinfo until a pattern emerges for us to dig into.
Comment 7 Clemens Eisserer 2012-07-06 07:51:44 UTC
Hmm, for me this is 100% reproduceable when browsing on Facebook with Firefox.
Every 2nd-3rd entry in the Friend's list on the right has the first 1-2 glyphs garbled.

I've also observed this problem a single time in LibreOffice when selecting test, however I wasn't able rto reproduce it reliable.
Comment 8 Clemens Eisserer 2012-07-06 07:52:14 UTC
Created attachment 63894 [details]
screenshot of a few entries of my facebook friend list
Comment 9 Chris Wilson 2012-07-06 08:53:52 UTC
Clements, there are a few defines in src/sna/sna_glyphs.c to control whether certain accelerated paths are used. Can you play about with those to identify which path is causing the trouble?
Comment 10 Clemens Eisserer 2012-07-06 10:43:18 UTC
Don't know if this is useful, but NO_SMALL_MASK seems to solve the problem.
Comment 11 Chris Wilson 2012-07-06 10:53:24 UTC
That confirms what I expected. Now need to find out why the corruption occurs.
Comment 12 Clemens Eisserer 2012-07-07 00:03:45 UTC
any further steps I could do to narrow down the problem?

btw I also experienced a corruption of the desktop background pixmap (screenshot added) - should I file a seperate bug-report about it?
Comment 13 Clemens Eisserer 2012-07-07 00:05:01 UTC
Created attachment 63921 [details]
background pixmap corruption
Comment 14 Chris Wilson 2012-07-07 02:11:01 UTC
(In reply to comment #12)
> any further steps I could do to narrow down the problem?

The next step along the trail is investigating the upload buffers.
In src/sna/kgem.c, there are the debugging switches (2 just added):
#define DBG_NO_UPLOAD_CACHE 0
#define DBG_NO_UPLOAD_ACTIVE 0
#define DBG_NO_MAP_UPLOAD 0

Can you please play around with those?
 
> btw I also experienced a corruption of the desktop background pixmap
> (screenshot added) - should I file a seperate bug-report about it?

For the time being, that looks like it could be related to this upload buffer issue.
Comment 15 Clemens Eisserer 2012-07-07 13:27:56 UTC
Unfoutunatly, those three settings didn't have any effect :/
Comment 16 Chris Wilson 2012-07-07 13:43:14 UTC
Hmm, can you try bumping kgem->min_alignment to 64?

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 2578ff9..e75cd5f 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -709,7 +709,7 @@ void kgem_init(struct kgem *kgem, int fd, struct pci_device 
                kgem->batch_size = 16*1024;
 
        kgem->min_alignment = 4;
-       if (gen < 40)
+       if (1||gen < 40)
                kgem->min_alignment = 64;
 
        kgem->half_cpu_cache_pages = cpu_cache_size() >> 13;
Comment 17 Clemens Eisserer 2012-07-07 13:54:45 UTC
min_alignment=64 didn't help either
Comment 18 Chris Wilson 2012-07-07 14:09:18 UTC
So the usual suspects draw a blank. :(

At this point, I tend to start poking randomly introducing deliberate errors to try and sanity check my assumptions.

I wonder if you made all the text go through the SMALL_MASK code how often does the error occur? Can we identify what the peculiarity is for the start of the row? Can you try and produce a minimal debug=full logfile, perhaps convincing firefox to create a window only large enough to trigger the error without too much excess rendering? I doubt we'll spot the actual error, but there might be something that stands out from the trace.
Comment 19 Chris Wilson 2012-07-08 07:53:31 UTC
Just for you, I logged back into facebook. Rendered fine for the few minutes I spent on the site. :!
Comment 20 Clemens Eisserer 2012-07-11 22:05:47 UTC
Another new finding - when moving a window on top of firefox, the desktop background is obscured.

I'll try to come up with a debug=full log of the facebook issue soon.
Comment 21 Clemens Eisserer 2012-07-11 22:06:33 UTC
Created attachment 64127 [details]
screenshot after slowly moving terminal window on top of FF
Comment 22 Chris Wilson 2012-07-15 14:53:13 UTC
All I can hope is the glyph mask reduction code makes this less likely to be hit, because I still do not know what the root cause is. :|
Comment 23 Chris Wilson 2012-07-21 18:14:10 UTC
Clemens, any news on whether the frequency has changed? I'm still at a loss to explain it, my best guess at the moment is a some cache-incoherency but I would have seen such myself...
Comment 24 Clemens Eisserer 2012-08-07 08:44:58 UTC
Hi Chris,

I've uploaded a debug=full log to: http://93.83.133.214/Xorg.1.log.7za

Testing started 2-3s after the VT switch.
Basically I was just moving the mouse over the facebook friend's list, I hope firefox is smart enough not to repaint too much.
Comment 25 Chris Wilson 2012-08-07 10:56:13 UTC
Thanks for the debug=full, perusing it right now.
Comment 26 Chris Wilson 2012-08-07 16:23:26 UTC
Hmm, can you check whether it is still the glyph upload mask that is causing the issue (#define NO_SMALL_MASK 1 in src/sna/sna_glyphs.c) again? As far as I can see the only instances where we use the small mask code in that log are instances where the result is clipped out anyway. Odd.

I've pushed a patch to avoid computing the glyph mask in that case:

commit 1a0590d133ea6991e0939d1f170f9c10df6856a0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 7 17:16:35 2012 +0100

    sna: Check the composite extents against the clip
    
    When computing the composite extents (as opposed to the composite
    region) also check if the resultant box overlaps the destination clip
    region (we know it already fits into the extents). This helps in cases
    with small roi against clipped drawables, such as drawing text onto
    expose events.

I expect it to have no effect, so if it does let me know!
Comment 27 Clemens Eisserer 2012-08-07 22:24:27 UTC
The patch (applied against 2.20.3) didn't help unfourtunatly.
Setting NO_SMALL_MASK didn't help either, maybe I overlooked the issue when I first tested it ... sorry about that.

However, I found the issue is dependent on the width of firefox' window:
 
I can reproduce the problem with the following window widths:
Width: 1730
Width: 1748
Width: 1767
Width: 1782
Width: 1802
Width: 1814

However rendering is perfect for those window-sizes:
Width: 1725
Width: 1740
Width: 1756
Width: 1772
Width: 1788
Width: 1806

It seems to toogle every ~8px between good & broken.
Comment 28 Chris Wilson 2012-08-07 22:37:35 UTC
Hmm, so which of the styles of corruption am I looking for in the debug=full log?
Comment 29 Clemens Eisserer 2012-08-08 06:23:37 UTC
Sorry for articulating not well.
There is only one style of corruption (that one shown in the firefox+facebook screenshot). However the corruption only happens, when the FF window has a certain size.

I'll try to create a screencast.
Comment 30 Chris Wilson 2012-08-08 11:28:40 UTC
Pushed a few more patches for trivial things I've spotted reading your debug log. No insight as to the root cause yet.
Comment 31 Clemens Eisserer 2012-08-08 18:55:21 UTC
As suspected, the recent patches didn't solve the issue.

I played a bit with the settings in sna_glyphs.c, the only settings that helped were:
- FALLBACK
- NO_GLYPHS_TO_DST and NO_GLYPHS_VIA_MASK when both set, just setting a single define didn't solve the corruption.

Hope this is helpful ...
Comment 32 Chris Wilson 2012-08-08 19:02:14 UTC
Ok, getting scary. Now we are implying that the corruption is on either render path. Can you also try the new NO_DISCARD_MASK and see how that affects the corruption with the other debug options?
Comment 33 Clemens Eisserer 2012-08-08 19:18:25 UTC
NO_DISCARD_MASK itself didn't help, is there any paricular combination that would be interesting?
Comment 34 Chris Wilson 2012-08-08 19:44:55 UTC
My guess is that with NO_DISCARD_MASK, the NO_GLYPHS_TO_DST will be completely ineffective. Probably can't distinguish that those. :|

Don't worry about, but if you could grab another debug=full, that will at least give a few suggestions as what DBG is missing. :(
Comment 35 Clemens Eisserer 2012-08-10 06:20:47 UTC
I uploaded a few debug=full logs with different options:

default options (corrupted): http://93.83.133.214/Xorg.log.default.7za


with defines below (corrupted): http://93.83.133.214/Xorg.log2.7za
#define NO_GLYPH_CACHE 1
#define NO_GLYPHS_VIA_MASK 1 
#define NO_SMALL_MASK 1 
#define NO_GLYPHS_SLOW 1  

with defines below (no corruption, renders fine): http://93.83.133.214/Xorg_allgood.7za
#define NO_GLYPHS_VIA_MASK 1 
#define NO_DISCARD_MASK 1 


A bit off-topic, I noticed I get quite a lot corruptions all over the place when setting GLYPH_MIN_SIZE to 12 - screenshot attached). (However setting it to 2 didn't solve the facebook-issue).
Comment 36 Clemens Eisserer 2012-08-10 06:21:14 UTC
Created attachment 65368 [details]
glyph corruptions when GLYPH_MIN_SIZE=12
Comment 37 Clemens Eisserer 2012-08-10 06:33:23 UTC
Forgot to mention, in the above cases I started testing about 2-3s after the VT switch.


Did another test with default settings: http://93.83.133.214/Xorg.14.log.7za
1. VT switch
2. did testing with a window-size which triggers the corruption
3. FF window resize
4. VT switch
5. testing with a window-size where the corruption does not occur

Hopefully there is some visible difference in the logs between the two different window sizes...
Comment 38 Chris Wilson 2012-08-10 07:14:03 UTC
(In reply to comment #36)
> Created attachment 65368 [details]
> glyph corruptions when GLYPH_MIN_SIZE=12

Hmm. A non-power-of-two is going to upset the code iirc, not sure what the repercussions would be.

Thanks for the debug logs.
Comment 39 Chris Wilson 2012-08-27 19:54:55 UTC
\o/

commit 26c731efc2048663b6a19a7ed7db0e94243ab30f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Aug 27 20:50:08 2012 +0100

    sna: Ensure that we create a GTT mapping for the inplace upload buffer
    
    As the code will optimistically convert a request for a GTT mapping into
    a CPU mapping if the object is still in the CPU domain, we need to
    overrule that in this case where we explicitly want to write directly
    into the GTT and furthermore keep the buffer around in an upload cache.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    References: https://bugs.freedesktop.org/show_bug.cgi?id=52299
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That's definitely the original bug, however the recent symptoms started to digress. Can you please retest?
Comment 40 Clemens Eisserer 2012-08-27 20:09:20 UTC
Unfourtunatly I can still reproduce the facebook-glyph-corruption on Gen5 :/
Comment 41 Chris Wilson 2012-08-28 21:28:31 UTC
So I don't think this will help the current incarnation of the bug, but who knows...

commit deaa1cac269be03f4ec44092f70349ff466d59de
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 28 22:23:22 2012 +0100

    sna: Align active upload buffers to the next page for reuse
    
    If we write to the same page as it already active on the GPU then
    despite the invalidation performed at the beginning of each batch, we do
    not seem to correctly sample the new data.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    References: https://bugs.freedesktop.org/show_bug.cgi?id=52299
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 42 Chris Wilson 2012-10-08 08:15:27 UTC
Stumpled upon this today. Broken hardware reigns supreme. :(

commit fb5205a86da09b344dbc20598655e917c263125c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Oct 8 09:12:33 2012 +0100

    sna/gen5: Flush pipelined ops when changing state
    
    When is a pipelined operation, not pipelined? That is the mystery posed
    by our hardware!
    
    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 43 Clemens Eisserer 2012-10-08 08:44:14 UTC
Thansk! :)
Comment 44 Chris Wilson 2012-12-01 10:22:58 UTC
I have taken a risk and tried another attempt at solving this:

commit 37eb7343be1aeeb90a860096756603a577df1a77
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 1 09:40:11 2012 +0000

    sna/gen5: Inspired by gen4, reorder the flushing
    
    This may not be totally safe, but it is a nicer explanation for random
    single character corruption.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Keep your eyes peeled for those bad glyphs!
Comment 45 Clemens Eisserer 2013-06-07 14:50:36 UTC
I've just cleaned the dust off my i540M based notebook and unfortunately I can still reproduce this bug to some degree.

Right after logging into facebook, I still see the same corrupted glyphs.
However as soon as those strings are re-drawn (e.g. mouse-over, expose) the corruptions vanish.

Sorry for being so vague...
Comment 46 Chris Wilson 2013-06-12 12:30:02 UTC
Similar to bug 64767, but less likely to apply in this case. I am pretty sure we identified this as being on the GPU side rather than CPU coherency, but would be worthwhile to check with -intel.git, just in case it gets magically fixed.
Comment 47 Clemens Eisserer 2013-06-14 09:57:48 UTC
unfortunately the coherence fixes didn't work out for this issue.
Comment 48 Chris Wilson 2013-06-23 08:03:15 UTC
commit 16fb786bd0791986fcac8b11a7e182090c5b6249
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jun 23 09:00:17 2013 +0100

    sna/gen5: Force a write flush when changing blend modes
    
    Otherwise it appears that the hardware will readback from memory
    bypassing its render cache after a change of modes. There is probably a
    lot more subtly to it than this, but this appears to be a good first
    approximation.
    
    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I wonder if this will help with gen4?
Comment 49 Chris Wilson 2013-07-23 09:05:48 UTC
Clements, I've played about with the ilk flushing to fix a severe performance regression associated with the fix 16fb786bd. However, due to fan failure, I don't have the ilk anymore for testing. Can you please check that I haven't reintroduced the render error? Thanks.
Comment 50 Clemens Eisserer 2013-07-23 09:09:47 UTC
unfortunately I don't have access to that machine for about ~2 weeks or so.
Comment 51 Clemens Eisserer 2013-08-05 07:41:37 UTC
I can reproduce it again with 2536ad044b259eb3ce3e7b1ccf2c59ab3fe37d06, although it isn't as bad as before.

When the friends list is loaded the first glyph of each string is corrupted, but after a few seconds the list is re-drawn and is displayed fine and the artifacts do not return until the page is re-loaded.
Comment 52 Chris Wilson 2013-08-05 12:44:28 UTC
Darn it.
Comment 53 Chris Wilson 2013-08-05 16:23:25 UTC
Test one is to see if we need a full pipleline in there:

diff --git a/src/sna/gen5_render.c b/src/sna/gen5_render.c
index 9b469f3..7b21c1d 100644
--- a/src/sna/gen5_render.c
+++ b/src/sna/gen5_render.c
@@ -1015,10 +1015,14 @@ gen5_emit_vertex_elements(struct sna *sna,
 inline static void
 gen5_emit_pipe_flush(struct sna *sna)
 {
+#if 0
        OUT_BATCH(GEN5_PIPE_CONTROL | (4 - 2));
        OUT_BATCH(GEN5_PIPE_CONTROL_WC_FLUSH);
        OUT_BATCH(0);
        OUT_BATCH(0);
+#else
+       OUT_BATCH(MI_FLUSH);
+#endif
 }
 
 static void


if that fails, then we need to work out which bit of state requires a flush (but is missing one currently).
Comment 54 Chris Wilson 2013-09-30 13:58:58 UTC
I rescued a laptop I had forgotten about with an i3-330m (Ironlake), so far it seems to be rendering correctly.
Comment 55 Clemens Eisserer 2013-10-03 16:49:44 UTC
still reproduceable with 3.11.2 + todays git.

The flush-patch of comment 53 however does the trick and I can not reproduce the corruption with it.
Comment 56 Chris Wilson 2013-10-03 16:51:47 UTC
(In reply to comment #55)
> still reproduceable with 3.11.2 + todays git.
> 
> The flush-patch of comment 53 however does the trick and I can not reproduce
> the corruption with it.

Can you please compare x11perf -aa10text -rgb10text with and without that patch?
Comment 57 Clemens Eisserer 2013-10-03 17:02:22 UTC
baseline:

40000000 reps @   0.0001 msec (12000000.0/sec): Char in 80-char aa line (Charter 10)
40000000 reps @   0.0001 msec (11300000.0/sec): Char in 80-char aa line (Charter 10)
40000000 reps @   0.0001 msec (12200000.0/sec): Char in 80-char aa line (Charter 10)
40000000 reps @   0.0001 msec (13400000.0/sec): Char in 80-char aa line (Charter 10)
40000000 reps @   0.0001 msec (13200000.0/sec): Char in 80-char aa line (Charter 10)
200000000 trep @   0.0001 msec (12400000.0/sec): Char in 80-char aa line (Charter 10)

32000000 reps @   0.0002 msec (5060000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 reps @   0.0002 msec (5230000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 reps @   0.0002 msec (5310000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 reps @   0.0002 msec (5400000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 reps @   0.0002 msec (5470000.0/sec): Char in 80-char rgb line (Charter 10)
160000000 trep @   0.0002 msec (5290000.0/sec): Char in 80-char rgb line (Charter 10)



patched:

72000000 reps @   0.0001 msec (14200000.0/sec): Char in 80-char aa line (Charter 10)
72000000 reps @   0.0001 msec (13200000.0/sec): Char in 80-char aa line (Charter 10)
72000000 reps @   0.0001 msec (13800000.0/sec): Char in 80-char aa line (Charter 10)
72000000 reps @   0.0001 msec (13000000.0/sec): Char in 80-char aa line (Charter 10)
72000000 reps @   0.0001 msec (12800000.0/sec): Char in 80-char aa line (Charter 10)
360000000 trep @   0.0001 msec (13400000.0/sec): Char in 80-char aa line (Charter 10)

24000000 reps @   0.0002 msec (4790000.0/sec): Char in 80-char rgb line (Charter 10)
24000000 reps @   0.0002 msec (5000000.0/sec): Char in 80-char rgb line (Charter 10)
24000000 reps @   0.0002 msec (5180000.0/sec): Char in 80-char rgb line (Charter 10)
24000000 reps @   0.0002 msec (5340000.0/sec): Char in 80-char rgb line (Charter 10)
24000000 reps @   0.0002 msec (5500000.0/sec): Char in 80-char rgb line (Charter 10)
120000000 trep @   0.0002 msec (5150000.0/sec): Char in 80-char rgb line (Charter 10)
Comment 58 Chris Wilson 2013-10-03 19:56:12 UTC
Hmm, can you please repeat the x11perf about 4 times, then report the 5 set of results. As you can see IPS on Ironlake is quite slow to respond and takes about 2 minutes to settle on the right boost frequencies for the GPU - until it settles the results are unstable (you can observe a difference of up to 2x).
Comment 59 Chris Wilson 2013-10-04 08:40:00 UTC
I took a risk and pushed a slightly cheaper MI_FLUSH - thinking that all we need here is the top-of-pipe synchronisation, and not the render cache flush. Please do test, thanks.

commit dbf98d8963b53a3c68deb1e2624d1269c8b0d97a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 09:36:49 2013 +0100

    sna/gen5: Replace pipe-control with full MI_FLUSH for required CS stall
    
    It appears that we need top-of-pipe synchronisation for changing of
    certain state, and that the gen5 pipecontrol instruction is
    insufficient. So we have to fall back on the good old MI_FLUSH in order
    to make sure that the GPU invalidates its state correctly.
    
    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.