Summary: | [ilk] Glyph rendering corruption in firefox (single character strings?) [SNA] | ||
---|---|---|---|
Product: | xorg | Reporter: | Clemens Eisserer <linuxhippy> |
Component: | Driver/intel | Assignee: | Chris Wilson <chris> |
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | normal | ||
Priority: | medium | ||
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Clemens Eisserer
2012-06-25 13:43:13 UTC
Created attachment 63465 [details]
screenshot of the first few characters of a facebook friend list
Created attachment 63466 [details]
screenshot of glyph corruption on geizhals.at
No! You can't consign your 945gm to the scrap heap! That means I need to make Arrandale fast now. :( How long have you been using the i5? Do you have any feel whether this is a recent regression? I use an i7-640m as my primary workstation, so I am slightly surprised... Which kernel is f16 current shipping? The i540 is "brand new", I started using it last week ago - I could test older drivers versions if that helps. Fedora started shipping 3.4.2, but the bug has been there with 3.3 too if I remember correctly. The i945GM will go to my fathers office (backlight burnt out after 5.5 years) - so probably will generate a few bug-reports still - besides, Arrandale seems to be quite fast already :) For the time being, can I ask you to simply keep an eye for the pattern and see if we can nail the reproduction steps? There are certain steps that we can then take to find whereabouts the cause is likely to be, but unless we can reproduce the issue reliably we will be wasting our time. Marking as needinfo until a pattern emerges for us to dig into. Hmm, for me this is 100% reproduceable when browsing on Facebook with Firefox. Every 2nd-3rd entry in the Friend's list on the right has the first 1-2 glyphs garbled. I've also observed this problem a single time in LibreOffice when selecting test, however I wasn't able rto reproduce it reliable. Created attachment 63894 [details]
screenshot of a few entries of my facebook friend list
Clements, there are a few defines in src/sna/sna_glyphs.c to control whether certain accelerated paths are used. Can you play about with those to identify which path is causing the trouble? Don't know if this is useful, but NO_SMALL_MASK seems to solve the problem. That confirms what I expected. Now need to find out why the corruption occurs. any further steps I could do to narrow down the problem? btw I also experienced a corruption of the desktop background pixmap (screenshot added) - should I file a seperate bug-report about it? Created attachment 63921 [details]
background pixmap corruption
(In reply to comment #12) > any further steps I could do to narrow down the problem? The next step along the trail is investigating the upload buffers. In src/sna/kgem.c, there are the debugging switches (2 just added): #define DBG_NO_UPLOAD_CACHE 0 #define DBG_NO_UPLOAD_ACTIVE 0 #define DBG_NO_MAP_UPLOAD 0 Can you please play around with those? > btw I also experienced a corruption of the desktop background pixmap > (screenshot added) - should I file a seperate bug-report about it? For the time being, that looks like it could be related to this upload buffer issue. Unfoutunatly, those three settings didn't have any effect :/ Hmm, can you try bumping kgem->min_alignment to 64? diff --git a/src/sna/kgem.c b/src/sna/kgem.c index 2578ff9..e75cd5f 100644 --- a/src/sna/kgem.c +++ b/src/sna/kgem.c @@ -709,7 +709,7 @@ void kgem_init(struct kgem *kgem, int fd, struct pci_device kgem->batch_size = 16*1024; kgem->min_alignment = 4; - if (gen < 40) + if (1||gen < 40) kgem->min_alignment = 64; kgem->half_cpu_cache_pages = cpu_cache_size() >> 13; min_alignment=64 didn't help either So the usual suspects draw a blank. :( At this point, I tend to start poking randomly introducing deliberate errors to try and sanity check my assumptions. I wonder if you made all the text go through the SMALL_MASK code how often does the error occur? Can we identify what the peculiarity is for the start of the row? Can you try and produce a minimal debug=full logfile, perhaps convincing firefox to create a window only large enough to trigger the error without too much excess rendering? I doubt we'll spot the actual error, but there might be something that stands out from the trace. Just for you, I logged back into facebook. Rendered fine for the few minutes I spent on the site. :! Another new finding - when moving a window on top of firefox, the desktop background is obscured. I'll try to come up with a debug=full log of the facebook issue soon. Created attachment 64127 [details]
screenshot after slowly moving terminal window on top of FF
All I can hope is the glyph mask reduction code makes this less likely to be hit, because I still do not know what the root cause is. :| Clemens, any news on whether the frequency has changed? I'm still at a loss to explain it, my best guess at the moment is a some cache-incoherency but I would have seen such myself... Hi Chris, I've uploaded a debug=full log to: http://93.83.133.214/Xorg.1.log.7za Testing started 2-3s after the VT switch. Basically I was just moving the mouse over the facebook friend's list, I hope firefox is smart enough not to repaint too much. Thanks for the debug=full, perusing it right now. Hmm, can you check whether it is still the glyph upload mask that is causing the issue (#define NO_SMALL_MASK 1 in src/sna/sna_glyphs.c) again? As far as I can see the only instances where we use the small mask code in that log are instances where the result is clipped out anyway. Odd. I've pushed a patch to avoid computing the glyph mask in that case: commit 1a0590d133ea6991e0939d1f170f9c10df6856a0 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Aug 7 17:16:35 2012 +0100 sna: Check the composite extents against the clip When computing the composite extents (as opposed to the composite region) also check if the resultant box overlaps the destination clip region (we know it already fits into the extents). This helps in cases with small roi against clipped drawables, such as drawing text onto expose events. I expect it to have no effect, so if it does let me know! The patch (applied against 2.20.3) didn't help unfourtunatly. Setting NO_SMALL_MASK didn't help either, maybe I overlooked the issue when I first tested it ... sorry about that. However, I found the issue is dependent on the width of firefox' window: I can reproduce the problem with the following window widths: Width: 1730 Width: 1748 Width: 1767 Width: 1782 Width: 1802 Width: 1814 However rendering is perfect for those window-sizes: Width: 1725 Width: 1740 Width: 1756 Width: 1772 Width: 1788 Width: 1806 It seems to toogle every ~8px between good & broken. Hmm, so which of the styles of corruption am I looking for in the debug=full log? Sorry for articulating not well. There is only one style of corruption (that one shown in the firefox+facebook screenshot). However the corruption only happens, when the FF window has a certain size. I'll try to create a screencast. Pushed a few more patches for trivial things I've spotted reading your debug log. No insight as to the root cause yet. As suspected, the recent patches didn't solve the issue. I played a bit with the settings in sna_glyphs.c, the only settings that helped were: - FALLBACK - NO_GLYPHS_TO_DST and NO_GLYPHS_VIA_MASK when both set, just setting a single define didn't solve the corruption. Hope this is helpful ... Ok, getting scary. Now we are implying that the corruption is on either render path. Can you also try the new NO_DISCARD_MASK and see how that affects the corruption with the other debug options? NO_DISCARD_MASK itself didn't help, is there any paricular combination that would be interesting? My guess is that with NO_DISCARD_MASK, the NO_GLYPHS_TO_DST will be completely ineffective. Probably can't distinguish that those. :| Don't worry about, but if you could grab another debug=full, that will at least give a few suggestions as what DBG is missing. :( I uploaded a few debug=full logs with different options: default options (corrupted): http://93.83.133.214/Xorg.log.default.7za with defines below (corrupted): http://93.83.133.214/Xorg.log2.7za #define NO_GLYPH_CACHE 1 #define NO_GLYPHS_VIA_MASK 1 #define NO_SMALL_MASK 1 #define NO_GLYPHS_SLOW 1 with defines below (no corruption, renders fine): http://93.83.133.214/Xorg_allgood.7za #define NO_GLYPHS_VIA_MASK 1 #define NO_DISCARD_MASK 1 A bit off-topic, I noticed I get quite a lot corruptions all over the place when setting GLYPH_MIN_SIZE to 12 - screenshot attached). (However setting it to 2 didn't solve the facebook-issue). Created attachment 65368 [details]
glyph corruptions when GLYPH_MIN_SIZE=12
Forgot to mention, in the above cases I started testing about 2-3s after the VT switch. Did another test with default settings: http://93.83.133.214/Xorg.14.log.7za 1. VT switch 2. did testing with a window-size which triggers the corruption 3. FF window resize 4. VT switch 5. testing with a window-size where the corruption does not occur Hopefully there is some visible difference in the logs between the two different window sizes... (In reply to comment #36) > Created attachment 65368 [details] > glyph corruptions when GLYPH_MIN_SIZE=12 Hmm. A non-power-of-two is going to upset the code iirc, not sure what the repercussions would be. Thanks for the debug logs. \o/ commit 26c731efc2048663b6a19a7ed7db0e94243ab30f Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Aug 27 20:50:08 2012 +0100 sna: Ensure that we create a GTT mapping for the inplace upload buffer As the code will optimistically convert a request for a GTT mapping into a CPU mapping if the object is still in the CPU domain, we need to overrule that in this case where we explicitly want to write directly into the GTT and furthermore keep the buffer around in an upload cache. References: https://bugs.freedesktop.org/show_bug.cgi?id=51422 References: https://bugs.freedesktop.org/show_bug.cgi?id=52299 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> That's definitely the original bug, however the recent symptoms started to digress. Can you please retest? Unfourtunatly I can still reproduce the facebook-glyph-corruption on Gen5 :/ So I don't think this will help the current incarnation of the bug, but who knows... commit deaa1cac269be03f4ec44092f70349ff466d59de Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Aug 28 22:23:22 2012 +0100 sna: Align active upload buffers to the next page for reuse If we write to the same page as it already active on the GPU then despite the invalidation performed at the beginning of each batch, we do not seem to correctly sample the new data. References: https://bugs.freedesktop.org/show_bug.cgi?id=51422 References: https://bugs.freedesktop.org/show_bug.cgi?id=52299 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Stumpled upon this today. Broken hardware reigns supreme. :( commit fb5205a86da09b344dbc20598655e917c263125c Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Oct 8 09:12:33 2012 +0100 sna/gen5: Flush pipelined ops when changing state When is a pipelined operation, not pipelined? That is the mystery posed by our hardware! Reported-by: Clemens Eisserer <linuxhippy@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Thansk! :) I have taken a risk and tried another attempt at solving this: commit 37eb7343be1aeeb90a860096756603a577df1a77 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sat Dec 1 09:40:11 2012 +0000 sna/gen5: Inspired by gen4, reorder the flushing This may not be totally safe, but it is a nicer explanation for random single character corruption. References: https://bugs.freedesktop.org/show_bug.cgi?id=51422 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Keep your eyes peeled for those bad glyphs! I've just cleaned the dust off my i540M based notebook and unfortunately I can still reproduce this bug to some degree. Right after logging into facebook, I still see the same corrupted glyphs. However as soon as those strings are re-drawn (e.g. mouse-over, expose) the corruptions vanish. Sorry for being so vague... Similar to bug 64767, but less likely to apply in this case. I am pretty sure we identified this as being on the GPU side rather than CPU coherency, but would be worthwhile to check with -intel.git, just in case it gets magically fixed. unfortunately the coherence fixes didn't work out for this issue. commit 16fb786bd0791986fcac8b11a7e182090c5b6249 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sun Jun 23 09:00:17 2013 +0100 sna/gen5: Force a write flush when changing blend modes Otherwise it appears that the hardware will readback from memory bypassing its render cache after a change of modes. There is probably a lot more subtly to it than this, but this appears to be a good first approximation. Reported-by: Clemens Eisserer <linuxhippy@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> I wonder if this will help with gen4? Clements, I've played about with the ilk flushing to fix a severe performance regression associated with the fix 16fb786bd. However, due to fan failure, I don't have the ilk anymore for testing. Can you please check that I haven't reintroduced the render error? Thanks. unfortunately I don't have access to that machine for about ~2 weeks or so. I can reproduce it again with 2536ad044b259eb3ce3e7b1ccf2c59ab3fe37d06, although it isn't as bad as before. When the friends list is loaded the first glyph of each string is corrupted, but after a few seconds the list is re-drawn and is displayed fine and the artifacts do not return until the page is re-loaded. Darn it. Test one is to see if we need a full pipleline in there: diff --git a/src/sna/gen5_render.c b/src/sna/gen5_render.c index 9b469f3..7b21c1d 100644 --- a/src/sna/gen5_render.c +++ b/src/sna/gen5_render.c @@ -1015,10 +1015,14 @@ gen5_emit_vertex_elements(struct sna *sna, inline static void gen5_emit_pipe_flush(struct sna *sna) { +#if 0 OUT_BATCH(GEN5_PIPE_CONTROL | (4 - 2)); OUT_BATCH(GEN5_PIPE_CONTROL_WC_FLUSH); OUT_BATCH(0); OUT_BATCH(0); +#else + OUT_BATCH(MI_FLUSH); +#endif } static void if that fails, then we need to work out which bit of state requires a flush (but is missing one currently). I rescued a laptop I had forgotten about with an i3-330m (Ironlake), so far it seems to be rendering correctly. still reproduceable with 3.11.2 + todays git. The flush-patch of comment 53 however does the trick and I can not reproduce the corruption with it. (In reply to comment #55) > still reproduceable with 3.11.2 + todays git. > > The flush-patch of comment 53 however does the trick and I can not reproduce > the corruption with it. Can you please compare x11perf -aa10text -rgb10text with and without that patch? baseline: 40000000 reps @ 0.0001 msec (12000000.0/sec): Char in 80-char aa line (Charter 10) 40000000 reps @ 0.0001 msec (11300000.0/sec): Char in 80-char aa line (Charter 10) 40000000 reps @ 0.0001 msec (12200000.0/sec): Char in 80-char aa line (Charter 10) 40000000 reps @ 0.0001 msec (13400000.0/sec): Char in 80-char aa line (Charter 10) 40000000 reps @ 0.0001 msec (13200000.0/sec): Char in 80-char aa line (Charter 10) 200000000 trep @ 0.0001 msec (12400000.0/sec): Char in 80-char aa line (Charter 10) 32000000 reps @ 0.0002 msec (5060000.0/sec): Char in 80-char rgb line (Charter 10) 32000000 reps @ 0.0002 msec (5230000.0/sec): Char in 80-char rgb line (Charter 10) 32000000 reps @ 0.0002 msec (5310000.0/sec): Char in 80-char rgb line (Charter 10) 32000000 reps @ 0.0002 msec (5400000.0/sec): Char in 80-char rgb line (Charter 10) 32000000 reps @ 0.0002 msec (5470000.0/sec): Char in 80-char rgb line (Charter 10) 160000000 trep @ 0.0002 msec (5290000.0/sec): Char in 80-char rgb line (Charter 10) patched: 72000000 reps @ 0.0001 msec (14200000.0/sec): Char in 80-char aa line (Charter 10) 72000000 reps @ 0.0001 msec (13200000.0/sec): Char in 80-char aa line (Charter 10) 72000000 reps @ 0.0001 msec (13800000.0/sec): Char in 80-char aa line (Charter 10) 72000000 reps @ 0.0001 msec (13000000.0/sec): Char in 80-char aa line (Charter 10) 72000000 reps @ 0.0001 msec (12800000.0/sec): Char in 80-char aa line (Charter 10) 360000000 trep @ 0.0001 msec (13400000.0/sec): Char in 80-char aa line (Charter 10) 24000000 reps @ 0.0002 msec (4790000.0/sec): Char in 80-char rgb line (Charter 10) 24000000 reps @ 0.0002 msec (5000000.0/sec): Char in 80-char rgb line (Charter 10) 24000000 reps @ 0.0002 msec (5180000.0/sec): Char in 80-char rgb line (Charter 10) 24000000 reps @ 0.0002 msec (5340000.0/sec): Char in 80-char rgb line (Charter 10) 24000000 reps @ 0.0002 msec (5500000.0/sec): Char in 80-char rgb line (Charter 10) 120000000 trep @ 0.0002 msec (5150000.0/sec): Char in 80-char rgb line (Charter 10) Hmm, can you please repeat the x11perf about 4 times, then report the 5 set of results. As you can see IPS on Ironlake is quite slow to respond and takes about 2 minutes to settle on the right boost frequencies for the GPU - until it settles the results are unstable (you can observe a difference of up to 2x). I took a risk and pushed a slightly cheaper MI_FLUSH - thinking that all we need here is the top-of-pipe synchronisation, and not the render cache flush. Please do test, thanks. commit dbf98d8963b53a3c68deb1e2624d1269c8b0d97a Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Oct 4 09:36:49 2013 +0100 sna/gen5: Replace pipe-control with full MI_FLUSH for required CS stall It appears that we need top-of-pipe synchronisation for changing of certain state, and that the gen5 pipecontrol instruction is insufficient. So we have to fall back on the good old MI_FLUSH in order to make sure that the GPU invalidates its state correctly. Reported-by: Clemens Eisserer <linuxhippy@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51422 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.