Bug 61628

Summary: [ilk] Corrupted rendering of page previews in Firefox with >xf86-video-intel-2.20.18
Product: xorg
Reporter: Coacher <itumaykin+freedesktop>
Component: Driver/intel
Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: ccr
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
See Also: https://launchpad.net/bugs/1189850
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description | Flags
lspci -vvv | none
glxinfo -l -t | none
Screenshot with example of corrupted rendering | none
xf86-video-intel-2.21.3-revert-dc643ef753bcfb69685f1eb10828d0c8f830c30e.patch | none
Force CPU synchronisation after writes | none
kgem_bo_sync__cpu_full-revert-bad.patch | none
kgem_bo_sync__cpu_full-revert-good.patch | none
Disable read-read optimisations | none

Description Coacher 2013-02-28 19:24:03 UTC
Created attachment 75706 [details]
lspci -vvv

Hello.

Since I upgraded the intel driver from version 2.20.18, page previews in Firefox have been rendered improperly (see attached screenshot). The intel driver versions tested are 2.20.{18,19} and 2.21.{0,2,3}; the Firefox versions are 17.0-19.0. I don't think it is a Firefox issue, as it is completely gone when downgrading back to 2.20.18.

My system is Gentoo amd64, currently with latest Firefox and intel driver. My current kernel version is 3.8 and it is vanilla. I am using SNA acceleration.
If there is any additional info that would be helpful I am ready to provide it.
Comment 1 Coacher 2013-02-28 19:24:42 UTC
Created attachment 75707 [details]
glxinfo -l -t
Comment 2 Coacher 2013-02-28 19:27:11 UTC
Created attachment 75708 [details]
Screenshot with example of corrupted rendering
Comment 3 Chris Wilson 2013-03-03 19:22:06 UTC
I still haven't been able to reproduce this one yet. Do you have a foolproof (and remember just how big a fool I am!) recipe?
Comment 4 Coacher 2013-03-03 20:18:29 UTC
This issue happens occasionally, but I don't have a 100% reproducible way to show it. One of the most successful attempts to reproduce it is:

1. make all `speed dial` buttons (previews on about:newtab) in Firefox filled with something reasonably heavy, not plain-text pages (on my machine it is a couple of youtube pages, web interface to SAGE, couple of redmines, etc.)
2. close all tabs except one and this last one tab should be about:newtab page
3. middle-click all the previews as fast as you can one by one, so the pages begin to load in background
4. now hit Ctrl+W till you close everything including that about:newtab page where you've started. You shouldn't wait until all pages you've opened on step 3 are loaded.
5. now open about:newtab again and with a good chance some of the previews will be corrupted. Sometimes there is no corruption, but some preview is displayed in the wrong position, for example two different sites share the same preview image.

Another way to reproduce:

1. make at least one `speed dial` button (preview on about:newtab) in Firefox filled with any kind of preview, just any site you want
2. close all tabs except one and this last one tab should be about:newtab page
3. go to http://www.dreamworksanimation.com/ and add it to bookmarks, then close tab (sorry, bookmarking is the only way I know to make a specific site show up in previews)
4. open about:newtab again and remove any preview image from it by pressing [X]
5. open bookmarks and drag the dreamworksanimation bookmark you've made on step 3 into the place freed on step 4
6. now visit http://www.dreamworksanimation.com/ so Firefox will generate preview
7. close tab and open again about:newtab. The preview for dreamworksanimation should be corrupted

Sorry if the descriptions are a bit messy. Also I don't have any other issues with firefox sites rendering, just issues with rendering previews. I wish there was an easier way to reproduce it.
Comment 5 Coacher 2013-03-03 21:22:19 UTC
I did git bisecting between 2.20.18 and 2.20.19 and the result is this commit:

dc643ef753bcfb69685f1eb10828d0c8f830c30e is the first bad commit
commit dc643ef753bcfb69685f1eb10828d0c8f830c30e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 17 12:27:55 2013 +0000

    sna: Apply read-only synchronization hints for move-to-cpu

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 0f53950ba9a9756a39722f12c322c2d629c1a2a4 d5ff0a7307cc718ee94c78ee2fb1c9bf6158ed91 M      src

As this bug is not 100% reproducible it could have slipped out of my sight during some bisect runs, however it is something to start with. What do you think? Could this commit lead to the rendering problems I have?
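
Because the bug only shows up some of the time, a bisect step can wrongly mark a bad commit as good if too few reproduction attempts are made. As a rough illustration (a sketch under assumptions, not something measured in this report), suppose each attempt on a bad build triggers the corruption independently with probability `p_hit`. The helper below, whose name and parameters are hypothetical, counts how many clean attempts are needed before the chance of a false "good" verdict drops to `max_miss` or below.

```c
#include <assert.h>

/*
 * Smallest number of consecutive clean attempts after which a build can be
 * called "good" with at most max_miss probability of being wrong.
 * Assumption: on a bad build each attempt independently triggers the bug
 * with probability p_hit. Both values are illustrative, not measured.
 */
static int attempts_needed(double p_hit, double max_miss)
{
    assert(p_hit > 0.0 && p_hit < 1.0);
    assert(max_miss > 0.0 && max_miss < 1.0);

    double miss = 1.0;  /* probability that every attempt so far missed */
    int n = 0;

    while (miss > max_miss) {
        miss *= 1.0 - p_hit;  /* one more attempt that fails to reproduce */
        n++;
    }
    return n;
}
```

For example, if the recipe reproduces the corruption about half of the time, `attempts_needed(0.5, 0.05)` asks for five clean runs before a bisect step is marked good.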
Comment 6 Chris Wilson 2013-03-03 21:34:14 UTC
There was a related bug, fixed with

commit 19bd005056a2083de64753681b96716996e4237d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 22 12:05:04 2013 +0000

    sna: Avoid migrating and making the GPU bo busy prior to mmapping it
    
    References: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1131134
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

that I thought was already in 2.21.3 and so you had tested it. It is actually in master, so can you try compiling from git and checking if that fixes the issue?
Comment 7 Chris Wilson 2013-03-03 21:36:08 UTC
I'll admit to not fully explaining how that prevented the corruption, as the damage should have been migrated and then the kernel should have stalled upon the read... But it did have an effect and prevented a similar issue that bisected to the same commit.
Comment 8 Coacher 2013-03-03 22:54:47 UTC
(In reply to comment #6)
> It is
> actually in master, so can you try compiling from git and checking if that
> fixes the issue?

I've just tested master and the issue is still there.
Comment 9 Coacher 2013-03-03 22:59:40 UTC
Created attachment 75871 [details] [review]
xf86-video-intel-2.21.3-revert-dc643ef753bcfb69685f1eb10828d0c8f830c30e.patch

With this patch applied on top of xf86-video-intel-2.21.3 the problem is gone (at least I tried hard to reproduce it, but failed). This patch is simply reverting dc643ef753bcfb69685f1eb10828d0c8f830c30e commit mentioned above.
Comment 10 Chris Wilson 2013-03-04 09:47:53 UTC
Can you try converting each of those kgem_bo_sync__cpu_full() back to kgem_bo_sync__cpu() individually and see if we can narrow it down to one particular path?
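
For readers following along, the distinction being tested can be pictured with a toy model. This is only a sketch of the general read-read idea, not the driver's code: `struct toy_bo` and `toy_must_stall()` are hypothetical names, and the real kgem functions do considerably more. The assumption is that a CPU read only has to wait for outstanding GPU writes, while a CPU write has to wait for any outstanding GPU access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a buffer object shared between the CPU and the GPU. */
struct toy_bo {
    bool gpu_reading;  /* GPU still has outstanding reads of the buffer */
    bool gpu_writing;  /* GPU still has outstanding writes to the buffer */
};

/*
 * Must the CPU stall before touching the buffer?
 *  - CPU read:  only outstanding GPU writes are a hazard; read-read is
 *               safe, which is what the optimisation relies on.
 *  - CPU write: any outstanding GPU access is a hazard.
 */
static bool toy_must_stall(const struct toy_bo *bo, bool cpu_write)
{
    assert(bo != NULL);

    if (bo->gpu_writing)
        return true;
    return cpu_write && bo->gpu_reading;
}
```

In this model, a caller that writes to the buffer but asks only for read synchronisation would skip the stall while the GPU is still reading, which is the kind of missed hint the per-call conversion is meant to expose.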
Comment 11 Chris Wilson 2013-03-04 11:13:26 UTC
Created attachment 75892 [details] [review]
Force CPU synchronisation after writes

Another test to try.
Comment 12 Coacher 2013-03-04 21:42:45 UTC
(In reply to comment #11)
> Created attachment 75892 [details] [review]
> Force CPU synchronisation after writes
> 
> Another test to try.

With this patch applied on top of 2.21.3 the problem seems to be fixed.
Comment 13 Coacher 2013-03-04 22:03:28 UTC
Created attachment 75920 [details] [review]
kgem_bo_sync__cpu_full-revert-bad.patch

(In reply to comment #10)
> Can you try converting each of those kgem_bo_sync__cpu_full() back to
> kgem_bo_sync__cpu() individually and see if we can narrow it down to one
> particular path?

With this patch on top of 2.21.3 I've hit the bug almost immediately. In this case I've left first kgem_bo_sync__cpu_full() as is and converted only second one.
Comment 14 Coacher 2013-03-04 22:06:42 UTC
Created attachment 75921 [details] [review]
kgem_bo_sync__cpu_full-revert-good.patch

(In reply to comment #10)
> Can you try converting each of those kgem_bo_sync__cpu_full() back to
> kgem_bo_sync__cpu() individually and see if we can narrow it down to one
> particular path?

With this patch on top of 2.21.3 I was unable to reproduce the bug anymore. In this case I've converted first kgem_bo_sync__cpu_full() and left second one as is.
Comment 15 Chris Wilson 2013-03-05 11:17:45 UTC
I've looked through all callers to see if I can find one that missed the MOVE_WRITE to no avail. I've double checked the kernel to see if there is a loop hole, again to no avail. So I'm a little bit lost to see where the missed synchronisation is coming from, and I haven't yet thought of a good test to force/catch an error.

In the meantime, I've applied one minor tweak to xf86-video-intel.git,

commit 60ec35b8d25ecfabf1744ea7bc81109d7f2a90e2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 5 11:14:37 2013 +0000

    sna: Be explicit when checking for an idle bo after CPU synchronisation

Do you mind giving that a quick test?
Comment 16 Chris Wilson 2013-03-05 11:28:30 UTC
Also one other test is to try with the drm-intel-next kernel.
Comment 17 Coacher 2013-03-06 09:27:29 UTC
(In reply to comment #15)
> I've looked through all callers to see if I can find one that missed the
> MOVE_WRITE to no avail. I've double checked the kernel to see if there is a
> loop hole, again to no avail. So I'm a little bit lost to see where the
> missed synchronisation is coming from, and I haven't yet thought of a good
> test to force/catch an error.
> 
> In the meantime, I've applied one minor tweak to xf86-video-intel.git,
> 
> commit 60ec35b8d25ecfabf1744ea7bc81109d7f2a90e2
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Mar 5 11:14:37 2013 +0000
> 
>     sna: Be explicit when checking for an idle bo after CPU synchronisation
> 
> Do you mind giving that a quick test?

OK, I'll test it later today
Comment 18 Coacher 2013-03-06 09:28:57 UTC
(In reply to comment #16)
> Also one other test is to try with the drm-intel-next kernel.

Could you please give me a quick link to their git repo?
Would 3.9-rc1 be enough?
Comment 19 Chris Wilson 2013-03-06 09:30:56 UTC
Our upstream is http://cgit.freedesktop.org/~danvet/drm-intel

If you are using ubuntu, you can find pre-packaged kernels here http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/current/
Comment 20 Chris Wilson 2013-03-06 19:04:08 UTC
Created attachment 76040 [details] [review]
Disable read-read optimisations

And one last request, can you please test this patch as a temporary solution?
Comment 21 Coacher 2013-03-06 21:27:45 UTC
(In reply to comment #20)
> Created attachment 76040 [details] [review]
> Disable read-read optimisations
> 
> And one last request, can you please test this patch as a temporary
> solution?

This patch also fixes the issue. It was tested on 3.7.10 kernel as well as all previous patches. Now gonna try with drm-intel-next.
Comment 22 Chris Wilson 2013-03-06 21:35:19 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > Created attachment 76040 [details] [review]
> > Disable read-read optimisations
> > 
> > And one last request, can you please test this patch as a temporary
> > solution?
> 
> This patch also fixes the issue. It was tested on 3.7.10 kernel as well as
> all previous patches. Now gonna try with drm-intel-next.

Thanks. In the meantime, I'm going to push the temporary workaround - obviously I still hope to find the real bug.
Comment 23 Coacher 2013-03-06 21:56:15 UTC
(In reply to comment #16)
> Also one other test is to try with the drm-intel-next kernel.

Ok, just tried out today's drm-intel-next kernel and was unable to reproduce this bug anymore. This sounds like good news.
Comment 24 Coacher 2013-03-06 22:00:55 UTC
(In reply to comment #23)
> (In reply to comment #16)
> > Also one other test is to try with the drm-intel-next kernel.
> 
> Ok, just tried out today's drm-intel-next kernel and was unable to reproduce
> this bug anymore. This sounds like good news.

Oh, wait, I forgot to rebuild xf86-video-intel without patch. Sorry. Will try vanilla now
Comment 25 Chris Wilson 2013-03-06 22:04:05 UTC
/o\ Can you confirm that result with vanilla xf86-video-intel?
Comment 26 Coacher 2013-03-06 22:12:26 UTC
(In reply to comment #25)
> /o\ Can you confirm that result with vanilla xf86-video-intel?

Sorry to disappoint you, but the issue is reproducible with vanilla xf86-video-intel and drm-intel-next.
Comment 27 Coacher 2013-03-07 17:46:35 UTC
(In reply to comment #22)
> Thanks. In the meantime, I'm going to push the temporary workaround -
> obviously I still hope to find the real bug.

Is there a way I can help? Attach some debug info or test something?
Comment 28 Chris Wilson 2013-03-07 20:54:50 UTC
(In reply to comment #27)
> (In reply to comment #22)
> > Thanks. In the meantime, I'm going to push the temporary workaround -
> > obviously I still hope to find the real bug.
> 
> Is there a way I can help? Attach some debug info or test something?

If you change the define in src/sna/sna_accel.c:

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index ae6d3c1..5edad51 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -57,7 +57,7 @@
 #define FORCE_INPLACE 0
 #define FORCE_FALLBACK 0
 #define FORCE_FLUSH 0
-#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
+#define FORCE_FULL_SYNC 0
 
 #define DEFAULT_TILING I915_TILING_X

that restores the buggy behaviour. If you can keep running with that patch and with --enable-debug to check if any assertions are triggered and see how things progress.
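
As a rough picture of how a compile-time switch like this can gate the optimisation, here is a minimal sketch. It is an assumption about the general pattern, not a copy of sna_accel.c: `TOY_MOVE_WRITE` and `toy_sync_as_write()` are hypothetical, and only the `FORCE_FULL_SYNC` name comes from the diff above.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef FORCE_FULL_SYNC
#define FORCE_FULL_SYNC 1  /* 1 = workaround active, 0 = optimised path */
#endif

#define TOY_MOVE_WRITE 0x1u  /* hypothetical flag bit, not the driver's value */

/*
 * Should this CPU access be synchronised as if it were a write?
 * With FORCE_FULL_SYNC set, every access takes the conservative path;
 * otherwise only accesses that carry the write hint do.
 */
static bool toy_sync_as_write(unsigned flags)
{
    assert((flags & ~TOY_MOVE_WRITE) == 0);  /* only the write bit is modelled */

    return FORCE_FULL_SYNC || (flags & TOY_MOVE_WRITE) != 0;
}
```

Building the sketch with `-DFORCE_FULL_SYNC=0` makes `toy_sync_as_write(0)` return false, which corresponds to restoring the optimised read-only path.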
Comment 29 Coacher 2013-03-07 22:09:27 UTC
(In reply to comment #28)
> If you can keep running with that patch
> and with --enable-debug to check if any assertions are triggered and see how
> things progress.

OK, I did what you said, powered on and started to watch Xorg.0.log.

The first thing I did was to open Firefox and trigger this issue several times - no output.
Then I've tried to simulate some typical workflow i.e. opened programs I use on a daily basis and do some things inside them like checking mail, browsing a couple of webpages - still no output.
Then I've decided to close them and return to Firefox and again triggered this issue several times and opened a couple of heavy tabs with flash and suddenly caught this:

(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x34) [0x5969b4]
(EE) 1: /usr/bin/X (mieqEnqueue+0x263) [0x5776c3]
(EE) 2: /usr/bin/X (0x400000+0x4fcd4) [0x44fcd4]
(EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f236e1d0000+0x6208) [0x7f236e1d6208]
(EE) 4: /usr/bin/X (0x400000+0x7a477) [0x47a477]
(EE) 5: /usr/bin/X (0x400000+0xa5527) [0x4a5527]
(EE) 6: /lib64/libpthread.so.0 (0x3a9c400000+0x10bf0) [0x3a9c410bf0]
(EE) 7: /lib64/libc.so.6 (ioctl+0x7) [0x3a9bce3437]
(EE) 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x3fd3c040d8]
(EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x1c1a0) [0x7f236fdb81a0]
(EE) 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x1d9f7) [0x7f236fdb99f7]
(EE) 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x4fe3a) [0x7f236fdebe3a]
(EE) 12: /usr/bin/X (BlockHandler+0x44) [0x43f224]
(EE) 13: /usr/bin/X (WaitForSomething+0x11d) [0x593e7d]
(EE) 14: /usr/bin/X (0x400000+0x3ade2) [0x43ade2]
(EE) 15: /usr/bin/X (0x400000+0x29b5a) [0x429b5a]
(EE) 16: /lib64/libc.so.6 (__libc_start_main+0xed) [0x3a9bc2460d]
(EE) 17: /usr/bin/X (0x400000+0x29eb1) [0x429eb1]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
[  8739.251] [mi] Increasing EQ size to 512 to prevent dropped events.
[  8739.251] [mi] EQ processing has resumed after 64 dropped events.
[  8739.251] [mi] This may be caused my a misbehaving driver monopolizing the server's resources.

After that I've tried to reproduce this trace again opening same tabs and triggering issue again and again, but without any luck. Is this stack trace useful in any way?
Comment 30 Chris Wilson 2013-03-07 22:34:37 UTC
Hmm, I expect dmesg to contain a GPU hang and /sys/kernel/debug/dri/0/i915_error_state to be populated, mind attaching it?
Comment 31 Coacher 2013-03-08 14:12:15 UTC
(In reply to comment #30)
> Hmm, I expect dmesg to contain a GPU hang and
> /sys/kernel/debug/dri/0/i915_error_state to be populated, mind attaching it?

Too bad I turned off my machine later after I've caught that stack trace, so I can't give you the dump of i915_error_state, but I was checking both dmesg and xsession-errors and there was nothing unusual and no signs of error output from i915.

I'll try to catch it again and if I do I'll attach dmesg and dump of i915_error_state here.
Comment 32 Chris Wilson 2013-03-09 19:42:31 UTC
*** Bug 61610 has been marked as a duplicate of this bug. ***
Comment 33 Coacher 2013-03-12 16:11:02 UTC
(In reply to comment #28)
> If you change the define in src/sna/sna_accel.c:
> 
> diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
> index ae6d3c1..5edad51 100644
> --- a/src/sna/sna_accel.c
> +++ b/src/sna/sna_accel.c
> @@ -57,7 +57,7 @@
>  #define FORCE_INPLACE 0
>  #define FORCE_FALLBACK 0
>  #define FORCE_FLUSH 0
> -#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
> +#define FORCE_FULL_SYNC 0
>  
>  #define DEFAULT_TILING I915_TILING_X
> 
> that restores the buggy behaviour. If you can keep running with that patch
> and with --enable-debug to check if any assertions are triggered and see how
> things progress.

I've been running this way ever since you've asked me, but that stack trace was the only one I was able to trigger, though improper rendering happened a lot.
I am positive that when I caught that trace there were no errors in dmesg.

Now, 2.21.4 is out and I will continue trying to catch something, though
since it happens only in firefox maybe there is issue somewhere else?
What versions of firefox, cairo and gtk do you have?

Also I've noticed this message in .xsession-errors whenever I move previews in Firefox:

(firefox:3574): GdkPixbuf-CRITICAL **: gdk_pixbuf_new: assertion `width > 0' failed

This happens both with FORCE_FULL_SYNC 0 and 1.
Comment 34 Chris Wilson 2013-03-12 17:49:01 UTC
I've been primarily using iceweasel (based on ff10) with the system cairo as that is many times faster for gfx. But I've also been using the bloated ff from ubuntu and fedora on different systems (and they use the ancient cairo embedded into firefox). There are a lot of differences in cairo between those versions, so it would not surprise me if it was a bug specific to an older cairo. But I've hoped to have seen it by now as well. :|
Comment 35 Coacher 2013-03-12 19:53:02 UTC
I've just tested binary Firefox's versions from their site. I've tried latest versions of 16,17,18 and 19 branches and I was able to trigger the issue in all of them.

Will play with cairo versions now, my current cairo is 1.10.2 with some distro patches on top.
Comment 36 Chris Wilson 2013-03-12 20:15:40 UTC
Just note well that all firefox post version-10 use their builtin version of cairo. In order to use system cairo, firefox needs a patch to remove its reliance upon non-upstreamed API.
Comment 37 Coacher 2013-03-12 20:26:54 UTC
Tested firefox-19.0.2 with all available versions of cairo from repos: 1.10.2, 1.12.8, 1.12.10, 1.12.12. Issue is reproducible with all versions.

(In reply to comment #36)
> Just note well that all firefox post version-10 use their builtin version of
> cairo. In order to use system cairo, firefox needs a patch to remove its
> reliance upon non-upstreamed API.

Thanks for info, though I am using Gentoo and use Firefox built from sources on my machine and it is distro-patched to link against system-wide cairo so it's fine.
Comment 38 Chris Wilson 2013-03-12 20:35:10 UTC
Hmmm, that's news to me. Do you have a link to the patches they apply against firefox?

Or a simple test is something like: http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/ which should be CPU bound in Xorg and not firefox.
Comment 39 Coacher 2013-03-12 20:37:25 UTC
Also I've noticed that the "disable read-read optimisations" patch practically does the same as converting kgem_bo_sync__cpu_full back to kgem_bo_sync__cpu (I may be wrong here, though it looks this way to me). I will not question this as you are the developer and know best, though as the tests have shown, only one particular branch of kgem_bo_sync__cpu_full triggers this issue, see kgem_bo_sync__cpu_full-revert-bad.patch. Maybe you could add some asserts in that branch, so that I can apply them and give you some more info?
Comment 40 Coacher 2013-03-12 20:41:30 UTC
(In reply to comment #38)
> Hmmm, that's news to me. Do you have a link to the patches they apply
> against firefox?

http://mirror.yandex.ru/gentoo-distfiles/distfiles/firefox-19.0-patches-0.3.tar.xz

> Or a simple test is something like:
> http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/ which
> should be CPU bound in Xorg and not firefox.

Well, I've visited this link and see some spherical thingy made of particles. What should I check?
Comment 41 Chris Wilson 2013-03-12 20:52:15 UTC
(In reply to comment #40)
> Well, I've visited this link and see some spherical thingy made of
> particles. What should I check?

Just look at top; For this particular benchmark, it should be ratelimited by the Xorg process not firefox - or better look at sudo perf top, if firefox is hitting pixman functions, it is a bad firefox.
Comment 42 Chris Wilson 2013-03-12 20:54:37 UTC
Seems like gentoo has the right patch though, it should be fine. Now if only the other distros also used that patch :(
Comment 43 Coacher 2013-03-12 20:58:40 UTC
(In reply to comment #42)
> Seems like gentoo has the right patch though, it should be fine. Now if only
> the other distros also used that patch :(

So, should I check top or not? Because I am a bit confused about what exactly
"ratelimited by the Xorg process not firefox" means. I am building perf right now though.
Comment 44 Coacher 2013-03-12 21:07:06 UTC
(In reply to comment #41)
> (In reply to comment #40)
> > Well, I've visited this link and see some spherical thingy made of
> > particles. What should I check?
> 
> Just look at top; For this particular benchmark, it should be ratelimited by
> the Xorg process not firefox - or better look at sudo perf top, if firefox
> is hitting pixman functions, it is a bad firefox.

When running this demo in firefox `# perf top` says "42% libpixman-1.so.0.29.2" and this line sits on top of the list. Does that mean bad firefox? :(
Comment 45 Chris Wilson 2013-03-12 21:17:21 UTC
Only if that pixman time is inside firefox and not Xorg... Have gentoo also disabled server-side gradients in cairo?
Comment 46 Coacher 2013-03-12 21:24:29 UTC
(In reply to comment #45)
> Only if that pixman time is inside firefox and not Xorg...

I am not familiar with this tool. How do I check this?

> Have gentoo also
> disabled server-side gradients in cairo?

Yes, part of changelog:

10 Sep 2010; Samuli Suominen <ssuominen@gentoo.org>
+cairo-1.10.0-r2.ebuild, +files/cairo-1.10.0-buggy_gradients.patch:
Do not use server-side gradients. It hurts performance, and causes bad
rendering on at least nvidia. Bug 336696.

And this patch is still applied on top of the cairo version I am running now. Though the maintainers added an option to disable it in the latest version in the tree. It is enabled by default though, so I tested this version also with disabled gradients. Should I check without it?
Comment 48 Chris Wilson 2013-03-12 21:34:32 UTC
Yeah, that gradient patch dramatically hurts performance on Nvidia and Intel systems, whilst having little impact on EXA systems. Kill that patch with fire.
Comment 49 Coacher 2013-03-12 21:37:09 UTC
(In reply to comment #48)
> Yeah, that gradient patch dramatically hurts performance on Nvidia and Intel
> systems, whilst having little impact on EXA systems. Kill that patch with
> fire.

Tested without this patch, but the issue is still present.
Comment 50 Coacher 2013-03-14 18:33:36 UTC
What do you think about comment #39? And how can I check if pixman time shown in `perf top` belongs to Xorg or Firefox? (see comment #46)
Comment 51 Chris Wilson 2013-03-14 22:15:33 UTC
If you have the ncurses gui, the second column shows you the "comm" i.e. the process name. Similarly in the perf report.

I'm trying to install gentoo to see if that helps (the prospect of a modern ff using system cairo is very appealing).
Comment 52 Coacher 2013-03-15 00:56:43 UTC
(In reply to comment #51)
> If you have the ncurses gui, the second column shows you the "comm" i.e. the
> process name. Similarly in the perf report.

Oh, finally, I was able to get it. Yes, that pixman rendering belongs to Firefox process, not Xorg. Though there is somehow no "comm" column in my perf-top, ncurses gui allows to zoom into threads and that's the solution.

> I'm trying to install gentoo to see if that helps (the prospect of a modern
> ff using system cairo is very appealing).

That's nice to hear :) We have a handbook which covers most of the aspects of installation, but if you get stuck somewhere feel free to send me an e-mail, I'll be glad to help you.
Comment 53 Chris Wilson 2013-03-15 11:13:15 UTC
Reading http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/firefox-19.0.2.ebuild?view=markup it seems that the use of system-cairo has been dropped. Which is a shame.

On the positive news though the latest unstable cairo has dropped the buggy gradients patch (unless legacy-drivers is set).
Comment 54 Coacher 2013-03-15 11:27:47 UTC
(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/firefox-19.0.2.ebuild?view=markup
> it seems that the use of system-cairo has been dropped. Which is a shame.

Well, you've seen the patches applied on top of firefox and support for system cairo is there. Out of curiosity I've run some initial steps of firefox build and here's a bit filtered result:

grep cairo /var/tmp/portage/www-client/firefox-19.0.2/temp/build.log
 *   6009_fix_system_cairo_support.patch ...
    --enable-system-cairo           system_libs
    --enable-default-toolkit=cairo-gtk2  mozilla.org default
  --enable-system-cairo
  --enable-default-toolkit=cairo-gtk2
checking for cairo >= 1.10... yes
checking CAIRO_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_LIBS... -lcairo
checking for cairo-tee >= 1.10... yes
checking CAIRO_TEE_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_TEE_LIBS... -lcairo
checking for cairo-xlib-xrender >= 1.10... yes
checking CAIRO_XRENDER_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_XRENDER_LIBS... -lcairo -lXrender -lX11

and this is output from already built firefox I am running now:

ldd /usr/lib/firefox/libxul.so | grep cairo
        libcairo.so.2 => /usr/lib64/libcairo.so.2 (0x00007f205d497000)
        libpangocairo-1.0.so.0 => /usr/lib64/libpangocairo-1.0.so.0 (0x00007f2059cc2000)

So, system-wide cairo is enabled at build time and it is really there as shown by ldd.
Comment 55 Coacher 2013-03-15 11:34:34 UTC
(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/firefox-19.0.2.ebuild?view=markup
> it seems that the use of system-cairo has been dropped. Which is a shame.

You are not seeing things like "we're enabling system cairo here ..." directly in the ebuild because that is done inside mozcoreconf-2.eclass, which is inherited by mozconfig-3.eclass, which is inherited by the firefox ebuild. Inheriting an eclass can be thought of as a pretty close equivalent of using the #include directive in C.
Comment 56 Coacher 2013-03-15 11:36:46 UTC
(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/firefox-19.0.2.ebuild?view=markup
> it seems that the use of system-cairo has been dropped. Which is a shame.

And the last one, you can find sources of eclasses in your $PORTDIR/eclass dir which is most probably /usr/portage/eclass.

P.S. sorry for a burst of comments.
Comment 57 Chris Wilson 2013-03-15 22:09:28 UTC
Ok, I have ff-19 built at last using gentoo ~amd64 on a lowly ilk. It seems to be doing the right thing regarding using system-cairo and server-side gradients. Next step is to piece together enough components to see if I can reproduce the bug.
Comment 58 Coacher 2013-03-16 00:59:52 UTC
(In reply to comment #57)
> Ok, I have ff-19 built at last using gentoo ~amd64 on a lowly ilk. It seems
> to be doing the right thing regarding using system-cairo and server-side
> gradients. Next step is to piece together enough components to see if I can
> reproduce the bug.

Ok, tell me what info I should provide and I'll post it.

As a first step, my firefox and xf86-video-intel USE-flags are:
x11-drivers/xf86-video-intel-2.21.4 was built with the following:
USE="dri sna udev xvmc -glamor -uxa"

www-client/firefox-19.0.2 was built with the following:
USE="alsa dbus gstreamer jit libnotify minimal (multilib) pgo system-jpeg wifi -bindist -custom-cflags -custom-optimization -debug (-selinux) -startup-notification -system-sqlite" ABI_X86="64" LINGUAS="ru -af -ak -ar -as -ast -be -bg -bn_BD -bn_IN -br -bs -ca -cs -csb -cy -da -de -el -en_GB -en_ZA -eo -es_AR -es_CL -es_ES -es_MX -et -eu -fa -fi -fr -fy_NL -ga_IE -gd -gl -gu_IN -he -hi_IN -hr -hu -hy_AM -id -is -it -ja -kk -km -kn -ko -ku -lg -lt -lv -mai -mk -ml -mr -nb_NO -nl -nn_NO -nso -or -pa_IN -pl -pt_BR -pt_PT -rm -ro -si -sk -sl -son -sq -sr -sv_SE -ta -ta_LK -te -th -tr -uk -vi -zh_CN -zh_TW -zu"
CFLAGS="-march=core2 -mtune=generic -pipe -mno-avx"
CXXFLAGS="-march=core2 -mtune=generic -pipe -mno-avx"
Comment 59 Coacher 2013-03-17 09:41:53 UTC
I was able to reproduce that stack trace from Xorg log and intel driver is not an issue here at all.

I found out that the cause of this is the fast spinning mouse wheel. I have a mouse with a wheel which can be scrolled in a 'free roam' mode, without those 'clicks', you know. And if I scroll too fast that stack appears. As before, dmesg is clean of any i915 errors and no error state was caught. So, that stack is not related to the bug at all.
Comment 60 Chris Wilson 2013-04-06 09:55:40 UTC
I'm still using the optimized flushes on all of my machines and have yet to encounter corruption. :|
Comment 61 Coacher 2013-04-06 17:02:04 UTC
Well, I am still experiencing this issue even with latest intel driver :(

Are you running Gentoo now? What is your setup? Could you please give me the output of `emerge --info firefox` and `emerge --info xf86-video-intel`?

I haven't tried Firefox 20 yet though. Could it be the issue in Firefox itself?
Comment 62 Coacher 2013-04-09 18:22:08 UTC
Same issue with firefox 20 and xf86-video-intel 2.21.5
Comment 63 Coacher 2013-04-22 21:28:05 UTC
Hello.

At last, there is some positive dynamic! Though from time to time I still see corrupted rendering of certain elements on some pages, at least I haven't seen any completely corrupted previews for a while, like it was before. Portions of previews can be corrupted, but only those parts which are rendered corrupted while browsing. So now there are no previews consisting of complete garbage.
(Both previews and pages are rendered via same drawWindow function in firefox as far as I can tell from sources)

Updates that introduced(?) these changes:

libdrm 2.4.43 -> 2.4.44
xorg-server 1.13.1 -> 1.13.4
GTK+ 2.24.16 -> 2.24.17
agg 2.5 -> 2.5-r2 (nothing big, maintainer changed couple of build options; added in the list because I use gnash in Firefox which uses agg, so maybe somehow connected)

There were other updates, but these are the only changes that are possibly related to the effects I see. I was (and currently am) running xf86-video-intel-2.21.6 with FORCE_FULL_SYNC disabled all the time.
Comment 64 Chris Wilson 2013-04-23 08:16:03 UTC
That's unexpected - those updates should have had no impact upon this issue. :|
Comment 65 Coacher 2013-05-05 18:39:24 UTC
(In reply to comment #64)
> That's unexpected - those updates should have had no impact upon this issue.
> :|

Nevertheless, the overall look and feel in Firefox has improved somehow. I have now updated mesa to 9.1.2 and the kernel to 3.9.0, and the positive effects persist.

The situation is much better now than it was when I opened this bug: I no longer get random huge screen corruption in Firefox, either in thumbnails or during normal browsing. I can still trigger the issue and get a corrupted page preview, but it doesn't interfere with browsing. All other applications are unaffected.

Since things are quite good now, maybe it is a good idea to re-enable those optimizations? What do you think? It looks like I am the only one who has this issue :(
Comment 66 Chris Wilson 2013-05-21 10:22:13 UTC
Ok, having made a new release, it is time to see if anyone else is seeing this bug:

commit 8e42637050275945200797538a34c13c90b295cc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 21 11:13:03 2013 +0100

    sna: Re-enable read-read optimisations
Comment 67 Coacher 2013-05-21 16:17:05 UTC
(In reply to comment #66)
> Ok, having made a new release, it is time to see if anyone else is seeing
> this bug:
> 
> commit 8e42637050275945200797538a34c13c90b295cc
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue May 21 11:13:03 2013 +0100
> 
>     sna: Re-enable read-read optimisations

Thank you. I'll update this bug with any new info if I notice any changes, bad or good.
Comment 69 Coacher 2013-06-12 12:16:39 UTC
(In reply to comment #68)
> It's back:
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1189850

Thanks for the link. I've tried today's xf86-video-intel git, which includes the commit marked as the fix in the bug you linked. I was unable to reproduce the issue, but I cannot say for sure that it is fixed, since with recent changes this bug appears much more rarely on my machine than before. It may reappear later, but I hope it won't. I'll post any new info here.
Comment 70 Chris Wilson 2013-06-28 16:50:58 UTC
commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno
    
    In the introduction of the non-blocking wait, I cut'n'pasted the wait
    completion code from normal locked path. Unfortunately, this neglected
    that the normal path returned early if the wait returned early. The
    result is that read-only waits may return whilst the GPU is still
    writing to the bo.
    
    Fixes regression from
    commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Aug 24 09:35:09 2012 +0100
    
        drm/i915: Use a non-blocking wait for set-to-domain ioctl
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=66163
    Cc: stable@vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
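
The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the i915 code: `struct toy_bo`, `toy_wait`, `wait_buggy`, `wait_fixed` and `bookkeeping_is_stale` are hypothetical names invented for illustration. It shows how clearing the write-domain bookkeeping after a wait that returned early leaves the driver believing the GPU has finished writing when it has not.

```c
#include <stdbool.h>

/* Hypothetical toy model of a buffer object; not taken from the kernel. */
struct toy_bo {
    bool gpu_writing;   /* the GPU still has outstanding writes */
    bool write_domain;  /* bookkeeping: a GPU write is believed pending */
};

/* Simulated wait: 0 when the GPU finished, -1 when the wait returned early. */
static int toy_wait(struct toy_bo *bo, bool interrupted)
{
    if (interrupted)
        return -1;          /* GPU is still writing */
    bo->gpu_writing = false;
    return 0;
}

/* Buggy variant: clears the write domain even if the wait returned early. */
static int wait_buggy(struct toy_bo *bo, bool interrupted)
{
    int ret = toy_wait(bo, interrupted);

    bo->write_domain = false;
    return ret;
}

/* Fixed variant: only clears the write domain after a successful wait. */
static int wait_fixed(struct toy_bo *bo, bool interrupted)
{
    int ret = toy_wait(bo, interrupted);

    if (ret)
        return ret;         /* leave the bookkeeping untouched */
    bo->write_domain = false;
    return 0;
}

/* True when the bookkeeping claims no pending write but the GPU is writing. */
static bool bookkeeping_is_stale(const struct toy_bo *bo)
{
    return bo->gpu_writing && !bo->write_domain;
}
```

In this model, an interrupted `wait_buggy` leaves the object in the stale state, so a later read-only wait would see no pending write and return immediately; `wait_fixed` keeps the write domain set until the wait actually completes.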
Comment 71 Coacher 2013-06-28 21:37:06 UTC
This bug just reappeared with xf86-video-intel-2.21.10. The next thing I am going to try is the commit you posted above.
Comment 72 Coacher 2013-07-03 18:25:54 UTC
(In reply to comment #70)
> commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Jun 28 16:54:08 2013 +0100
> 
>     drm/i915: Only clear write-domains after a successful wait-seqno
>     
>     In the introduction of the non-blocking wait, I cut'n'pasted the wait
>     completion code from normal locked path. Unfortunately, this neglected
>     that the normal path returned early if the wait returned early. The
>     result is that read-only waits may return whilst the GPU is still
>     writing to the bo.
>     
>     Fixes regression from
>     commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
>     Author: Chris Wilson <chris@chris-wilson.co.uk>
>     Date:   Fri Aug 24 09:35:09 2012 +0100
>     
>         drm/i915: Use a non-blocking wait for set-to-domain ioctl
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=66163
>     Cc: stable@vger.kernel.org
>     Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Yes, this commit fixes the issue for me (on a 3.10 kernel with only this patch applied).

Thanks a lot for your help!
