Bug 103269 - Regression in WaitForSomething() causes indirect OpenGL applications to lock up Xvnc
Summary: Regression in WaitForSomething() causes indirect OpenGL applications to lock up Xvnc
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Server/General
Version: unspecified
Hardware: All
OS: All
Importance: medium normal
Assignee: Xorg Project Team
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-13 22:23 UTC by DRC
Modified: 2018-08-17 19:14 UTC
CC List: 1 user

Description DRC 2017-10-13 22:23:15 UTC
I've discussed this offline with Keith P., but I wanted to get additional input, particularly from anyone who might have a deep knowledge of how the GLX extension interacts with Mesa.

I develop a high-performance X proxy (TurboVNC) based on X.org.  As with all Xvnc implementations, mine interfaces with X.org through the DDX interface, with the VNC-specific code contained in hw/vnc.

In the process of updating my source tree from xorg-server 1.17.4 to 1.19.5, I noticed that indirect OpenGL stopped working properly.  The symptom is that, while an application is using indirect OpenGL, any timer callbacks I created with SetTimer() in the VNC server code are not invoked until the OpenGL application exits or is killed.  Thus, the rendered images from the OpenGL application are not sent to the VNC viewer in real time, since the VNC server relies on an X.org timer to determine when to send those images.  This doesn't seem to be related to the rate at which the application is rendering; the problem occurs both with non-interactive applications (e.g. GLXgears or GLXspheres) and with interactive applications that render only in response to mouse movement.

This may also not be specific to indirect OpenGL.  It may simply be that Mesa is grabbing the X server, or doing something else that triggers the bug, when indirect rendering is enabled.  It may also be that other code that relies on SetTimer() will experience similar problems under certain circumstances.  I'm not entirely sure, since I don't fully understand why the problem is occurring.
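For illustration, here is a minimal sketch of the kind of periodic timer I mean, using the server's OsTimer API (TimerSet() in os.h).  The rfb* names and the interval are hypothetical stand-ins, not my actual code:

#include "os.h"

#define UPDATE_INTERVAL_MS 40           /* hypothetical ~25 Hz update rate */

extern void rfbSendUpdates(void);       /* hypothetical: push dirty regions */

static OsTimerPtr updateTimer;

static CARD32
updateTimerCallback(OsTimerPtr timer, CARD32 now, void *arg)
{
    rfbSendUpdates();
    return UPDATE_INTERVAL_MS;          /* non-zero return re-arms the timer */
}

void
rfbStartUpdateTimer(void)
{
    updateTimer = TimerSet(updateTimer, 0, UPDATE_INTERVAL_MS,
                           updateTimerCallback, NULL);
}

These are the callbacks that stop firing while an indirect OpenGL application is running.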

I did, however, bisect the X.org source tree, which led me to this commit:

https://cgit.freedesktop.org/xorg/xserver/commit/?id=0b2f30834b1a9f4a03542e25c5f54ae800df57e2

and I confirmed that reverting the commit, while leaving the rest of the 1.19.5 source tree intact, fixes the problem.  I examined the logical differences introduced by the commit and discovered that it left out a timer check in one conditional branch.  I was able to restore the logic with this patch:

diff --git a/os/WaitFor.c b/os/WaitFor.c
index 613608f..45b2aa6 100644
--- a/os/WaitFor.c
+++ b/os/WaitFor.c
@@ -238,8 +238,18 @@ WaitForSomething(Bool are_ready)
         } else
             are_ready = clients_are_ready();
 
-        if (InputCheckPending())
-            return FALSE;
+        if (i > 0) {
+            if (InputCheckPending())
+                return FALSE;
+
+            if ((timer = first_timer()) != NULL) {
+                now = GetTimeInMillis();
+                if ((int) (timer->expires - now) <= 0) {
+                    DoTimers(now);
+                    return FALSE;
+                }
+            }
+        }
 
         if (are_ready) {
             were_ready = TRUE;

On the surface, this eliminated the issue, as did calling TimerCheck() within a block handler, but both approaches created performance problems in other areas.  Basically, they messed with the smart scheduler in such a way that it became necessary to revert to the old scheduling parameters from prior X.org releases (SMART_SCHEDULE_DEFAULT_INTERVAL=20, SMART_SCHEDULE_MAX_SLICE=200) in order to get decent image drawing performance in general.  However, doing this also somehow affected direct rendering, which became significantly slower than indirect rendering.
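For reference, the block-handler variant of the workaround looks roughly like this.  It assumes the 1.19-era handler signatures (which changed from earlier servers), and the vnc* names are hypothetical:

#include "dix.h"
#include "os.h"

static void
vncBlockHandler(void *blockData, void *timeout)
{
    /* Run any expired timers before the server blocks waiting for input.
     * This unsticks the SetTimer() callbacks, but as noted above, it
     * interacts badly with the smart scheduler. */
    TimerCheck();
}

static void
vncWakeupHandler(void *blockData, int result)
{
}

void
vncInstallBlockHandler(void)
{
    RegisterBlockAndWakeupHandlers(vncBlockHandler, vncWakeupHandler, NULL);
}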

I'm hoping for some insight into how to fix this.  So far, the configurations that seem to partially work are:

1. Default xorg-server-1.19.5 code
   - Except that indirect rendering locks up the server in such a way that it stops responding to any of my timers until the OpenGL application exits (i.e. the bug described in this report).

2. xorg-server-1.19.5 with the above patch and SMART_SCHEDULE_DEFAULT_INTERVAL=20, SMART_SCHEDULE_MAX_SLICE=200
   - Indirect rendering works properly (and at full speed), but direct rendering is now 2x slower than it should be.

Ideally, I want to support both direct and indirect rendering at full speed.  Since VNC servers have to use a virtual framebuffer in main memory, DRI2 and DRI3 are non-starters, so the only way to use Mesa with direct rendering is swrast.  That doesn't work when the nVidia drivers are installed unless the system has a glvnd-enabled build of Mesa, so unfortunately I need to continue supporting indirect rendering until glvnd becomes the norm in Linux distributions.
Comment 1 DRC 2017-10-13 22:29:41 UTC
* "full speed" = "the maximum speed possible with software rendering"

I should also note that I develop VirtualGL as well, so the purpose of supporting software OpenGL in TurboVNC is mainly to facilitate running certain window managers.  I don't expect swrast to ever be a screamin' demon, but the more performance I can squeeze out of it, the better.  Supporting direct rendering is desirable for compatibility (indirect rendering is limited to OpenGL 1.4).
Comment 2 DRC 2017-10-18 21:54:23 UTC
NOTE: This same problem happens with TigerVNC if it is built against xorg-server 1.19.x and started with +iglx, so it isn't specific to my code.  TigerVNC uses a different internal timer mechanism for the VNC server portion of its code (not based on SetTimer()).
Comment 3 DRC 2017-10-26 04:36:30 UTC
I was able to work around this by changing the behavior of WaitForSomething() if any indirect OpenGL context is active on the X server (refer to #ifdef TURBOVNC sections in https://github.com/TurboVNC/turbovnc/blob/dev/unix/Xvnc/programs/Xserver/glx/glxext.c and https://github.com/TurboVNC/turbovnc/blob/dev/unix/Xvnc/programs/Xserver/os/WaitFor.c).  That allows indirect OpenGL to work properly without affecting performance when it isn't in use.
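Boiled down, the workaround amounts to something like the following.  The names here are illustrative; the actual code is in the #ifdef TURBOVNC sections linked above:

#include "misc.h"   /* Bool */

static int indirectContextCount;

/* Called from the GLX context bind/unbind paths (glx/glxext.c) */
void
vncIndirectContextBound(void)
{
    indirectContextCount++;
}

void
vncIndirectContextUnbound(void)
{
    indirectContextCount--;
}

/* Queried from WaitForSomething() (os/WaitFor.c) to decide whether to take
 * the pre-1.19 path that also checks timers when client fds are ready --
 * essentially the patch quoted in the description, but applied only while
 * an indirect context is active. */
Bool
vncIndirectContextActive(void)
{
    return indirectContextCount > 0;
}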

Whether this should be incorporated into X.org probably depends a lot on whether the issue is reproducible outside of Xvnc (maybe in Xvfb?  I'm not sure how that could be tested).  So far, I haven't been able to reproduce it in a non-virtual X server environment, so it may simply be something that has to be fixed in my product.  Feel free to close this issue if it's not something you want to pursue; for my purposes, an acceptable fix has been discovered and implemented downstream.
Comment 4 Adam Jackson 2018-06-12 16:07:35 UTC
Pretty sure this would have been fixed by:

commit ac7a4bf44c68c5f323375974b208d4530fb5b60f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Apr 15 15:40:03 2018 +0100

    os/WaitFor: Check timers on every iteration
    
    Currently we only check timer expiry if there are no client fd (or
    other input) waiting to be serviced. This makes it very easy to starve
    the timers with long request queues, and so miss critical timestamps.
    
    The timer subsystem is just another input waiting to be serviced, so
    evaluate it on every loop like all the others, at the cost of calling
    GetTimeInMillis() slightly more frequently. (A more invasive and likely
    OS specific alternative would be to move the timer wheel to the local
    equivalent of timerfd, and treat it as an input fd to the event loop
    exactly equivalent to all the others, and so also serviced on every
    pass. The trade-off being that the kernel timer wheel is likely more
    efficiently integrated with epoll, but individual updates to each timer
    would then require syscalls.)
    
    Reviewed-by: Peter Harris <pharris@opentext.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
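
For illustration, the timerfd alternative floated in that commit message would look something like this minimal, Linux-only sketch (a standalone example, not actual xserver code; error checking omitted):

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <unistd.h>

/* Create a periodic timer the event loop can poll like any other fd, so
 * timer expiry is serviced on every pass without extra clock reads. */
int
add_timer_fd(int epfd, long interval_ms)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
    struct itimerspec its = {
        .it_interval = { interval_ms / 1000, (interval_ms % 1000) * 1000000L },
        .it_value    = { interval_ms / 1000, (interval_ms % 1000) * 1000000L },
    };
    timerfd_settime(tfd, 0, &its, NULL);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);
    return tfd;
}

/* When epoll reports EPOLLIN on the timer fd, read the 8-byte expiration
 * count to acknowledge the tick and re-arm the readiness notification. */
void
drain_timer_fd(int tfd)
{
    uint64_t expirations;
    (void) read(tfd, &expirations, sizeof expirations);
}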
Comment 5 DRC 2018-08-17 19:14:57 UTC
ac7a4bf44c68c5f323375974b208d4530fb5b60f does indeed fix the problem, but it seems to generate a lot of overhead with indirect rendering, and thus the overall performance is reduced (bearing in mind that I'm developing a VNC server, so the same thread has to be shared between the VNC and X11 portions of the code).  The approach I'm using (reverting to the Xorg 1.18 timing behavior only while an indirect OpenGL context is active) performs optimally with both direct and indirect rendering.

