Bug 54274 - SNA is much slower than UXA in Blobby Volley game
Summary: SNA is much slower than UXA in Blobby Volley game
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: 7.7 (2012.06)
Hardware: x86 (IA32) Linux (All)
: medium normal
Assignee: Chris Wilson
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-30 16:20 UTC by Deve
Modified: 2012-09-01 11:31 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
glxinfo (12.11 KB, text/plain)
2012-08-30 17:41 UTC, Deve
no flags Details
xorg log (226.17 KB, text/plain)
2012-08-30 17:41 UTC, Deve
no flags Details
glxgears - SNA and UXA (2.26 KB, text/plain)
2012-08-30 19:32 UTC, Deve
no flags Details
Immediate vsync swapbuffers (905 bytes, patch)
2012-08-30 21:01 UTC, Chris Wilson
no flags Details | Splinter Review
Xorg.0.log - debug with patch (633.71 KB, application/x-bzip)
2012-08-30 21:23 UTC, Deve
no flags Details
trace-cmd report (316.70 KB, application/x-bzip)
2012-08-30 22:18 UTC, Deve
no flags Details
Xorg.0.log - debug log with patch from comment22 (639.97 KB, application/x-bzip)
2012-08-31 10:14 UTC, Deve
no flags Details
Xorg.0.log - debug with patch from comment 27 (711.30 KB, application/x-bzip)
2012-09-01 11:17 UTC, Deve
no flags Details

Description Deve 2012-08-30 16:20:59 UTC
Using SNA acceleration, Blobby Volley game is not playable. Game runs much slower than with UXA.

Tested on:
blobby 0.9c with OpenGL
intel 965GM, driver version 2.20.5
mesa 7.12.2 and 8.0.4
kernel 3.2.23 and 3.5.2
Comment 1 Chris Wilson 2012-08-30 16:28:27 UTC
Since this is a GL game, it suggests that something seriously broke when you tried to enable SNA. Please attach your Xorg.0.log, glxinfo and for my convenience a link to a deb ;-)
Comment 2 Chris Wilson 2012-08-30 17:00:56 UTC
Even in the non-GL mode it takes <10% cpu on an 965gm @ 800x600.
Comment 3 Deve 2012-08-30 17:41:27 UTC
Created attachment 66353 [details]
glxinfo
Comment 4 Deve 2012-08-30 17:41:58 UTC
Created attachment 66354 [details]
xorg log
Comment 5 Deve 2012-08-30 17:47:08 UTC
Version 0.9 is no longer in the repo :( But since 0.8 there was few changes and probably will behave the same.

http://ftp.debian.org/debian/pool/main/b/blobby/blobby-data_0.8-dfsg-2_all.deb
http://ftp.debian.org/debian/pool/main/b/blobby/blobby_0.8-dfsg-2_i386.deb
Comment 6 Chris Wilson 2012-08-30 17:53:21 UTC
Ok, they look sane enough. What is top like whilst running blobby? And perf top?
Comment 7 Deve 2012-08-30 18:14:16 UTC
I'm not sure what you are asking. 

I'm running blobby on fullscreen, in preferences I set opengl. Btw. window mode works faster. Game has about 5% cpu usage.
Comment 8 Chris Wilson 2012-08-30 18:36:42 UTC
I'm trying to get some metrics to diagnose where your problem lies. Let's start with basic glxgears to check everything is in order.

Can you run glxgears with and without export vblank_mode=0, windowed and fullscreen (glxgears -fullscreen) under UXA and SNA?
Comment 9 Deve 2012-08-30 19:32:48 UTC
Created attachment 66370 [details]
glxgears - SNA and UXA
Comment 10 Deve 2012-08-30 19:35:58 UTC
Added glxears logs on attachment.

Generally, SNA works better on my laptop. I have problems only with this game. Other games runs the same or faster.
Comment 11 Chris Wilson 2012-08-30 19:39:57 UTC
Ok, the GL setup itself is functioning as expected. What environment are you running blobby under (kwin, gnome-shell etc)?
Comment 12 Chris Wilson 2012-08-30 19:46:54 UTC
As far as I can tell, blobbly is equally responsive under UXA and SNA on an thinkpad t61 (with 965gm). Other than something interfering with the presentation of frames (like a compositor), the only other suggestion is that for some reason blobby is not being woken up after a swapbuffers in a timely manner.

Probably time to compile xf86-video-intel manually. Check that the issue still occurs and then add --enable-debug=full as a configure option and attach the Xorg.0.log for a game session.
Comment 13 Deve 2012-08-30 20:24:12 UTC
Log is large, it has about 150 MB. So i compressed it and uploaded there:
http://www.speedyshare.com/a3pUR/Xorg.0.log-debug.tar.bz2

I use openbox 3.5. But if I remember correctly, on gnome 3.4 was the same.

Previous logs were from my own compilation of driver too.
Comment 14 Chris Wilson 2012-08-30 21:00:39 UTC
What is immediately apparent from the debug file is that it often takes a couple of vblanks for the swapbuffers to complete.

To test that the scheduling is snafu can you try:

diff --git a/src/sna/sna_dri.c b/src/sna/sna_dri.c
index 5fa17d7..b380ab4 100644
--- a/src/sna/sna_dri.c
+++ b/src/sna/sna_dri.c
@@ -1708,6 +1708,25 @@ sna_dri_immediate_blit(struct sna *sna,
 	DBG(("%s: emitting immediate blit, throttling client\n", __FUNCTION__));
 	VG_CLEAR(vbl);
 
+	{
+		info->type = DRI2_SWAP;
+		info->bo = sna_dri_copy_to_front(sna, draw, NULL,
+						 get_private(info->front)->bo,
+						 get_private(info->back)->bo,
+						 true);
+
+		vbl.request.type =
+			DRM_VBLANK_RELATIVE |
+			DRM_VBLANK_NEXTONMISS |
+			DRM_VBLANK_EVENT |
+			pipe_select(info->pipe);
+		vbl.request.sequence = 0;
+		vbl.request.signal = (unsigned long)info;
+		if (sna_wait_vblank(sna, &vbl))
+			sna_dri_frame_event_info_free(sna, draw, info);
+		return;
+	}
+
 	if ((sna->flags & SNA_NO_WAIT) == 0) {
 		info->type = DRI2_SWAP_THROTTLE;
 		if (sna_dri_window_get_chain((WindowPtr)draw) == info) {
Comment 15 Chris Wilson 2012-08-30 21:01:48 UTC
Created attachment 66374 [details] [review]
Immediate vsync swapbuffers
Comment 16 Deve 2012-08-30 21:23:20 UTC
Created attachment 66376 [details]
Xorg.0.log - debug with patch
Comment 17 Deve 2012-08-30 21:26:25 UTC
With this patch (Immediate vsync swapbuffers) runs slower than without it.
Comment 18 Chris Wilson 2012-08-30 21:40:59 UTC
Still taking multiple vblanks from the time the blit for the swap is queue to the time it takes effect. At this point the indication would be that it is simply taking too long for blobby to render its frames (the key difference between UXA and SNA here is that SNA checks that the swap occurs before signalling the client, UXA just signals the client regardless after queueing the blit.)

Can you build trace-cmd (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) and record what the driver sees with 'sudo trace-cmd record -e i915' then attach the output of 'trace-cmd report'.
Comment 19 Deve 2012-08-30 22:17:42 UTC
I don't know if I use it correctly:

root@deve:/home/Dokumenty/trace-cmd# ./trace-cmd record -e i915
/sys/kernel/debug/tracing/events/i915/filter
/sys/kernel/debug/tracing/events/*/i915/filter
Hit Ctrl^C to stop recording
^CKernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 1084
oldest event ts:     0.348270
now ts:   112.072546

CPU0 data recorded at offset=0x12e000
    2482176 bytes in size


Before hit Ctrl^C I run blobby.

On attachment trace-cmd report.
Comment 20 Deve 2012-08-30 22:18:12 UTC
Created attachment 66378 [details]
trace-cmd report
Comment 21 Chris Wilson 2012-08-31 07:03:08 UTC
(In reply to comment #19)
> I don't know if I use it correctly:

You did just fine. The report contains the interactions between blobby, X and fluxbox. Now I just need to piece it together to work out why the frame latency is so high. Just for my information, which -intel did you use whilst generating the trace?
Comment 22 Chris Wilson 2012-08-31 07:28:05 UTC
Ok, I have a theory based on that trace, and the fault lies in -intel submitting the next vsync'ed blit immediately upon receiving the vblank - and so blocks the client rendering for a whole frame. Then we queue up two frames the client and the cycle continues.

That should be demonstrable with

diff --git a/src/sna/sna_dri.c b/src/sna/sna_dri.c
index 0210d73..87881ba 100644
--- a/src/sna/sna_dri.c
+++ b/src/sna/sna_dri.c
@@ -1182,7 +1182,7 @@ static void chain_swap(struct sna *sna,
                chain->bo = sna_dri_copy_to_front(sna, draw, NULL,
                                                  get_private(chain->front)->bo,
                                                  get_private(chain->back)->bo,
-                                                 true);
+                                                 false);
 
                type = DRI2_BLIT_COMPLETE;
        }
Comment 23 Chris Wilson 2012-08-31 07:36:02 UTC
Perhaps the bigger problem is the 10ms latency from the vblank firing to X waking up... Hmm.
Comment 24 Deve 2012-08-31 09:39:26 UTC
All logs were from intel driver 2.20.5. To generate trace-cmd report I use debug version with SNA.

Trace-cmd was on fluxbox because debug version of drivers often crashes my X environment. I thought it causes by openbox. But it probably doesn't matter.
Comment 25 Deve 2012-08-31 10:03:42 UTC
With lastest change (comment 22), on first sight, blobby runs perfect :)
Comment 26 Deve 2012-08-31 10:14:21 UTC
Created attachment 66399 [details]
Xorg.0.log - debug log with patch from comment22
Comment 27 Chris Wilson 2012-09-01 07:32:35 UTC
Ok, how about this little line:

diff --git a/src/sna/sna_dri.c b/src/sna/sna_dri.c
index 0210d73..a04427d 100644
--- a/src/sna/sna_dri.c
+++ b/src/sna/sna_dri.c
@@ -1268,7 +1268,7 @@ void sna_dri_vblank_handler(struct sna *sna, struct drm_ev
                break;
 
        case DRI2_SWAP_THROTTLE:
-               if (!sna_dri_blit_complete(sna, info))
+               if (0&&!sna_dri_blit_complete(sna, info))
                        return;
 
                DBG(("%s: %d complete, frame=%d tv=%d.%06d\n",
Comment 28 Deve 2012-09-01 11:16:12 UTC
With patch from comment 27 runs very similar to comment 22. I don't know if faster, maybe yes. Logs in attachment.
Comment 29 Deve 2012-09-01 11:17:04 UTC
Created attachment 66438 [details]
Xorg.0.log - debug with patch from comment 27
Comment 30 Chris Wilson 2012-09-01 11:31:26 UTC
Most important is that we don't see any ill-effects, especially starvation of other clients...

commit f837807cc257fbedd35fa6101dc0b4a57e5ad78d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 1 10:36:09 2012 +0100

    sna/dri: Remove busy-wait spin from vblank chaining of swapbuffers
    
    The issue being that, due to the delay, the chained swap would miss its
    intended vblank and so cause an unwanted reduction in frame throughput
    and increase output latency even further. Since both client and server
    have other rate-limiting processes in place, we can forgo the stall here
    and still keep the clients in check.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=54274
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.