68987 – intel-virtual-overlay: damage tracking goes awry

Bug 68987 - intel-virtual-overlay: damage tracking goes awry

Summary: intel-virtual-overlay: damage tracking goes awry

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel (show other bugs)
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-09-05 16:00 UTC by Severin Strobl
Modified:	2013-10-31 21:06 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
xrandr output (822 bytes, text/plain) 2013-09-05 16:00 UTC, Severin Strobl	no flags	Details
intel-virtual-output debug output (2.07 MB, text/plain) 2013-09-05 16:37 UTC, Severin Strobl	no flags	Details
output 12de799598ffbbe056bd4455efefa62d4b081d98 (305.54 KB, text/plain) 2013-09-05 19:39 UTC, Severin Strobl	no flags	Details
x11trace output (523.49 KB, text/plain) 2013-09-05 20:28 UTC, Severin Strobl	no flags	Details
w/a? (1.16 KB, patch) 2013-09-05 20:56 UTC, Chris Wilson	no flags	Details \| Splinter Review
pictures of the output using i-v-o (2.78 MB, application/x-tar) 2013-09-07 16:12 UTC, Severin Strobl	no flags	Details
View All

Description Severin Strobl 2013-09-05 16:00:18 UTC

Created attachment 85267 [details]
xrandr output

When using the new virtual output support in version 2.99.901 of the driver, the external screen only shows the background image of the desktop and the mouse cursor. Any window partially or fully moved to the virtual screen is not displayed.

Hardware:
Intel HD 3000
NVIDIA Quadro NVS 4200M

Software:
X.Org Intel driver 2.99.901
NVIDIA driver 325.15
X.Org X server 1.13.4

Please let me know if any additional information would be helpful in tracking down this issue.

Comment 1 Chris Wilson 2013-09-05 16:12:49 UTC

My suspicion is that I've forgotten my X and missed a flag somewhere...

If you apply

diff --git a/tools/virtual.c b/tools/virtual.c
index 8e241e5..ee8b3c6 100644
--- a/tools/virtual.c
+++ b/tools/virtual.c
@@ -56,7 +56,7 @@
 #include <fcntl.h>
 #include <assert.h>
 
-#if 0
+#if 1
 #define DBG(x) printf x
 #else
 #define DBG(x)

you will get lots of debug output if you then use:

intel-virtual-output -f > i-v-o.txt

and then repeat the test of dragging the window onto the virtual screen. What I am looking for in the i-v-o.txt is whether we are detecting and then painting the damaged regions on the virtual screen.

Comment 2 Severin Strobl 2013-09-05 16:37:59 UTC

Created attachment 85273 [details]
intel-virtual-output debug output

Right, I recompiled intel-virtual-output with debugging support. It seems some damaged regions are detected, however to me it seems like the positions are invalid. Some kind of overflow? Unfortunately the log file grew quite large rather quickly, nevertheless I attached it.

During testing I encountered a problem with intel-virtual-output itself: Once I killed the process and tried to start intel-virtual-output again, it fails:

display_open((null))
:0.0: singleton registered? 1
:0.0: pinging singleton
:0.0: send command '4567P'
*** buffer overflow detected ***: intel-virtual-output terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fd3457b68c7]
/lib64/libc.so.6(+0xff6c0)[0x7fd3457b46c0]
/lib64/libc.so.6(+0xfea39)[0x7fd3457b3a39]
/lib64/libc.so.6(_IO_default_xsputn+0x85)[0x7fd34572e015]
/lib64/libc.so.6(_IO_vfprintf+0x1ae1)[0x7fd3456fc4c1]
/lib64/libc.so.6(__vsprintf_chk+0x9d)[0x7fd3457b3add]
/lib64/libc.so.6(__sprintf_chk+0x83)[0x7fd3457b3a23]
intel-virtual-output[0x406ab3]
intel-virtual-output[0x4031ee]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fd3456d74bd]
intel-virtual-output[0x405ce9]

Not sure whether restarting intel-virtual-output should work at all. In this case I could try to recompile it with debugging symbols.

Comment 3 Chris Wilson 2013-09-05 16:55:19 UTC

Whoops, that assert failure was a silly use of sprintf.

commit f01c5eae2a3d6efebe9332a8654303356e622bec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 5 17:48:20 2013 +0100

    intel-virtual-output: Reserve space for the '\0' in the sprintf
    
    Even though we don't use the trailing NUL byte for our comparisons,
    sprintf will write it, so we need make space for it.
    
    Reported-by: Severin Strobl <fd@severin-strobl.de>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68987
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

The idea is that only one intel-virtual-output exists (otherwise we have fun with multiple processes conflicting over the allocation of VIRTUAL), so if you run it a second time it sends the request for new screens to the first process. (There's a timeout for it detection the old process expired.) 

The problem appears to lie in damage tracking. We get nothing reported, so we don't draw any updates.

Comment 4 Chris Wilson 2013-09-05 17:45:36 UTC

I double checked that I see damage for both composited and uncomposited desktops (thinking the the GL compositor might be upsetting how I receive damage), but everything looks ok. Can you describe your DE so that I can see if I can replicate your setup?

Comment 5 Severin Strobl 2013-09-05 17:57:43 UTC

Currently I am still on GNOME 2 (version 2.32.1) using Metacity, so nothing fancy. Not sure what else I can report about my setup. Are there any other components that might interfere?

Comment 6 Chris Wilson 2013-09-05 18:16:17 UTC

The repaint algorithm is extremely simple:

1. Register for Damage tracking on the root window of the Intel X server
2. Wait for Damage events
3. Mark the outputs as damaged, and activate a timer
4. Wait for timer to expire
5. Paint damaged regions on remote outputs (essentially GetImage(local root window), PutImage(remote root window)

Now, that I mention it... It might not be the damage tracking but the timer that's failing. But to answer your question, the only thing that could perturb this is if something prevented the Damage from being sent by X or managed to hide itself from the GetImage. (Neither of which are likely, and would be X or DDX bugs.)

Comment 7 Chris Wilson 2013-09-05 18:19:42 UTC

Nope, it is definitely a lack of damage events.

Comment 8 Chris Wilson 2013-09-05 18:48:41 UTC

I believe I found it. The clue was that you did get a damage event, exactly one.

Can you please update and test with

commit 12de799598ffbbe056bd4455efefa62d4b081d98
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 5 19:36:13 2013 +0100

    intel-virtual-output: Flush the damage received message back to the local display
    
    After processing the Damage notification, we need to send a message back
    to the Xserver to clear the pending damage before we will be sent more
    events. To make sure that message is sent we need to flush the output,
    as we may never flush the output queue otherwise.
    
    Reported-by: Severin Strobl <fd@severin-strobl.de>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68987
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 9 Severin Strobl 2013-09-05 19:39:53 UTC

Created attachment 85289 [details]
output 12de799598ffbbe056bd4455efefa62d4b081d98

Unfortunately this does not seem to fix the problem. I get the same behaviour as before. Using the version 12de799598ffbbe056bd4455efefa62d4b081d98 I performed another debug run, the output is attached.

Comment 10 Chris Wilson 2013-09-05 19:56:59 UTC

Drat the issue still appears to be the same, the solitary damage event. Hmm, the reply will be flushed by the XShmGetImage in any case. So I need to find another reason for the lack of damage.

Can you please comment out the call to XSetErrorHandler() (and cross your fingers that you don't die too early with races with multiple RandR clients) and see if that reports an issue? Grabbing xtrace and running:


xtrace [-n] -f i-v-o.xtrace intel-virtual-output -f

would be useful as well. (The -n may or may not be necessary depending on your setup)

Comment 11 Severin Strobl 2013-09-05 20:21:41 UTC

(In reply to comment #10)
> Can you please comment out the call to XSetErrorHandler() (and cross your
> fingers that you don't die too early with races with multiple RandR clients)
> and see if that reports an issue?

Not sure this generated any useful information:

display_open((null))
:0.0: singleton registered? 1
:0.0: pinging singleton
:0.0: send command '4567P'
:0.0: wait for act '4567R'
:0.0: reading event type 33
:0.0: client message '4567P'
No reply from singleton; assuming control
:0.0: reading event type 28
display_init_damage(:0.0)
record_mouse(:0.0)
record_callback
display_open(:0.1)
bumblebee_open query result 'Value: :8
'
display_open(:8)
last_display_add_clones__randr(:8)
display_init_randr_hpd(:8)
:8 - noutputs=8
claim_virtual(1)
claim_virtual(VIRTUAL1): rr_output=68
clone_output_init(:0.0, VIRTUAL1)
:0.0-VIRTUAL1 use shm? 1 (use shm pixmap? 1)
clone_output_init(:8, VGA-0)
:8-VGA-0 use shm? 1 (use shm pixmap? 0)
:8-VGA-0 wants depth 24
:0.0 is depth 24, want 24
:8 is depth 24, want 24
:8-VGA-0 using depth 24, requires xrender for src? 0, for dst? 0
X Error of failed request:  BadAccess (attempt to access private resource denied)
  Major opcode of failed request:  139 (RANDR)
  Minor opcode of failed request:  17 (RRDestroyMode)
  Serial number of failed request:  51
  Current serial number in output stream:  52


Will try xtrace next...

Comment 12 Severin Strobl 2013-09-05 20:28:31 UTC

Created attachment 85290 [details]
x11trace output

Here the trace generated by x11trace.

Comment 13 Chris Wilson 2013-09-05 20:30:33 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > Can you please comment out the call to XSetErrorHandler() (and cross your
> > fingers that you don't die too early with races with multiple RandR clients)
> > and see if that reports an issue?
> 
> Not sure this generated any useful information:
> 
> X Error of failed request:  BadAccess (attempt to access private resource
> denied)
>   Major opcode of failed request:  139 (RANDR)
>   Minor opcode of failed request:  17 (RRDestroyMode)
>   Serial number of failed request:  51
>   Current serial number in output stream:  52

Yes, that's why errors are being ignored. It seems to be impossible to atomically adjust the outputs and modes (especially we require round trips with the server) - so we end up with other RandR clients using our modes that we want to clean up.

Comment 14 Chris Wilson 2013-09-05 20:55:54 UTC

Looks ok, no errors, just the damage stops coming. What may be significant is that it occurs when we reconfigure the screen for the virtual outputs. There have been a number of Xserver bugs in the past where Damage breaks across a modeset, so I wonder if this is another...

Maybe this will workaround the issue?

diff --git a/tools/virtual.c b/tools/virtual.c
index 257f37f..4da2f3a 100644
--- a/tools/virtual.c
+++ b/tools/virtual.c
@@ -1760,9 +1760,22 @@ static int display_init_damage(struct display *display)
        if (display->damage == 0)
                return EACCES;
 
+       display->flush = 1;
        return 0;
 }
 
+static void display_reset_damage(struct display *display)
+{
+       Damage damage;
+
+       damage = XDamageCreate(display->dpy, display->root, XDamageReportRawRectangles);
+       if (damage) {
+               XDamageDestroy(display->dpy, display->damage);
+               display->damage = damage;
+               display->flush = 1;
+       }
+}
+
 static void display_init_randr_hpd(struct display *display)
 {
        int major, minor;
@@ -2632,8 +2645,10 @@ int main(int argc, char **argv)
                        ret--;
                }
 
-               if (reconfigure)
+               if (reconfigure) {
                        context_update(&ctx);
+                       display_reset_damage(ctx.display);
+               }
 
                if (rr_update) {
                        for (i = 0; i < ctx.nclone; i++)

Comment 15 Chris Wilson 2013-09-05 20:56:35 UTC

Created attachment 85292 [details] [review]
w/a?

Comment 16 Severin Strobl 2013-09-05 21:24:33 UTC

Yes, this did it! Nice work! Thanks for addressing this issue so quickly. Now it works exactly as expected.
One tiny problem I noticed: When dragging around windows on the external screen, every now and then some parts of the screen are not correctly redrawn, so parts of the window are still visible at the old position until a redraw of the area in question becomes necessary by dragging the window across it again.

Comment 17 Chris Wilson 2013-09-05 21:34:49 UTC

commit 2263f8f26fb3c4fabceac2e192e95c5d26a4d173
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 5 21:55:49 2013 +0100

    intel-virtual-overlay: Reset damage across modesets
    
    Some versions of the Xserver lose Damage tracking across the modeset,
    causing a loss of damage notifications and repainting to cease on the
    virtual outputs. We can workaround this by reattaching the damage every
    time we receive notification that the local Screen configuration
    changes.
    
    Reported-and-tested-by: Severin Strobl <fd@severin-strobl.de>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68987
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

But it sounds like we have other Damage races to fix as well.

Comment 18 Chris Wilson 2013-09-06 08:59:12 UTC

I've tweaked the damage tracking (gone from RawRectangles to BoundingBoxes, and tweaked the ordering of clearing the damage), this seems to be behaving as I would expect here. Can you please give the updated i-v-o a twirl?

Comment 19 Severin Strobl 2013-09-06 10:18:50 UTC

It seems better now, so it is harder to get it to produce the artifacts. However vary rarely I still have sections of the screen which are not updated correctly. However this does require some serious trying.

Comment 20 Chris Wilson 2013-09-06 23:07:56 UTC

And it is definitely a missing paint as opposed it to painting the wrong image in the wrong location, or painting garbage instead? The former would indicate that the damage is still iffy, whereas the latter would be a driver bug in not managing the coherency of the shared memory transport.

Comment 21 Severin Strobl 2013-09-07 16:12:12 UTC

Created attachment 85399 [details]
pictures of the output using i-v-o

Sorry for the delay. So to me it looks like a missing paint of the area the window left, as the window itself is not damaged in any way and the remaining parts can clearly be identified as parts of the window at the old position. I did some tests again and took some pictures (sorry for the crappy quality), just to make sure I'm not mistaken.

Comment 22 Chris Wilson 2013-09-08 12:53:40 UTC

Hmm, looks like damage.

I've another little test for you to try, can you update to 

commit 0ceba648211009a464e46855a52737219ab9b534
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Sep 8 13:51:14 2013 +0100

    intel-virtual-output: Debug option for forcing full redraws
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=68987
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

and then set FORCE_FULL_REDRAW to 1:

diff --git a/tools/virtual.c b/tools/virtual.c
index c363a3e..1f3113e 100644
--- a/tools/virtual.c
+++ b/tools/virtual.c
@@ -62,7 +62,7 @@
 #define DBG(x)
 #endif
 
-#define FORCE_FULL_REDRAW 0
+#define FORCE_FULL_REDRAW 1
 
 struct display {
        Display *dpy;

and see what happens?

Comment 23 Severin Strobl 2013-09-09 11:52:39 UTC

Forcing the complete redraw eliminates the problem, of course at the cost of increased CPU load. For some strange reason, with deactivated full redraw I encountered the image corruption only extremely rarely today. I was able to reproduce it twice after trying for several minutes. It was with a different external screen (resolution), so I'll try again this evening with the setup I used on the weekend.

Comment 24 Chris Wilson 2013-09-25 22:44:03 UTC

I haven't spotted anything else of significance in intel-virtual-overlay, so I think we may have a driver bug. In particular, we may be suffering from an old bug in the kernel (thinking that you may be forced to use an older kernel for compatability with nvidia). If using 3.11 is possible that will rule out the known bugs that I think could result in similar corruption.

Comment 25 Chris Wilson 2013-10-02 11:42:29 UTC

Hmm, something else you can try (as well as confirming the issue on the latest stable versions) is whether the same issue occurs if you use the nouveau driver. (And then using the nouveau driver, we can also test the latest unstable versions!)

Comment 26 Chris Wilson 2013-10-21 12:57:52 UTC

I think i-v-o is correct as can be, and this is an old Xserver bug...

Comment 27 Severin Strobl 2013-10-31 17:49:24 UTC

Due to some travelling, I was not able to test the newer versions of i-v-o for quite a while. Using the 2.99.905 snapshot and a 3.11 kernel, I did indeed not observe any problems in the damage tracking any more, it works perfectly!
One minor issue I noticed (and did not encounter before): Something weird is happening to the mouse pointer the moment the external display becomes active. Outside of the windows open at this moment, the mouse pointer is not drawn any more. Starting another grahical application fixes this and brings back the mouse pointer for all screens. Unfortunatelly I have no idea whether this is actually related to i-v-o, or just some unrelated bug.

Comment 28 Chris Wilson 2013-10-31 21:06:43 UTC

The cursor disappearing sounds like an artifact of how the cursor image is captured. At one point I thought about adding an explicit query to record the cursor image, sounds like I need to check it again.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.