Bug 38800 - glXSwapBuffersMscOML is slow on AMD Fusion but not on Intel 945 w/Atom
glXSwapBuffersMscOML is slow on AMD Fusion but not on Intel 945 w/Atom
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
unspecified
x86 (IA32) Linux (All)
: medium normal
Assigned To: Default DRI bug account
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-06-30 03:48 UTC by Simon Farnsworth
Modified: 2011-07-14 14:19 UTC (History)
1 user (show)

See Also:


Attachments
My test program to demonstrate the performance issues (10.04 KB, text/x-csrc)
2011-06-30 03:48 UTC, Simon Farnsworth
Details
dmesg from AMD system (59.18 KB, text/plain)
2011-06-30 03:53 UTC, Simon Farnsworth
Details
Xorg.log from AMD system (37.51 KB, text/plain)
2011-06-30 03:53 UTC, Simon Farnsworth
Details
An adapted test program, showing that it's not the wait for the swap events that slows me down (8.60 KB, text/x-csrc)
2011-07-01 09:54 UTC, Simon Farnsworth
Details
drm.debug=0xf dmesg snippet, showing timings of page flips (11.07 KB, text/plain)
2011-07-05 05:24 UTC, Simon Farnsworth
Details
Use new frontbuffer fence, if any (2.82 KB, patch)
2011-07-08 03:05 UTC, Michel Dänzer
Details | Splinter Review
dmesg snippet showing which bit of the IRQ handler is slow (4.12 KB, patch)
2011-07-08 04:32 UTC, Simon Farnsworth
Details | Splinter Review
Add missing register access guards (9.48 KB, patch)
2011-07-08 05:18 UTC, Michel Dänzer
Details | Splinter Review
thoroughly clean up multiple crtc guards (16.38 KB, patch)
2011-07-08 07:12 UTC, Alex Deucher
Details | Splinter Review
The register access guards patch as applied by hand (9.60 KB, patch)
2011-07-08 07:17 UTC, Simon Farnsworth
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Simon Farnsworth 2011-06-30 03:48:09 UTC
Created attachment 48588 [details]
My test program to demonstrate the performance issues

I'm trying to get an in-house OpenGL compositor up to the same performance on AMD Fusion as it currently achieves on Intel graphics, and I've found that glXSwapBuffersMscOML is slower (with a trivial test program) on AMD than it is on Intel.

I've attached my test program (swapbufferstest.c), which I've been compiling with:

gcc -O2 -ggdb -Wall -Werror -o swapbufferstest swapbufferstest.c -lX11 -lGL -lGLEW -lGLU

I'm running identical Fedora 15 based software stacks on an Intel Atom D510 (integrated GMA 950 class graphics) and on an AMD Fusion board (GPU PCI ID [1002:9802]).

On Intel, I get output like:
# ./swapbufferstest 
Screen is 1920x1200
Longest frame time 17.308969 ms
Biggest delta 3 frames, averaging 1.034188 frames per swap
Average swap interval 17.118282 ms
121 frames, 117 swaps

On AMD, I get:
# ./swapbufferstest 
Screen is 1920x1200
Longest frame time 16.751312 ms
Biggest delta 3 frames, averaging 3.000000 frames per swap
Average swap interval 49.326561 ms
123 frames, 41 swaps

I've confirmed that I've got a direct rendering context in both cases; on Intel, I've been able to confirm that I'm doing page-flipping, but I can't yet work out how to do that on the AMD driver; however, only hitting one frame in 3 doesn't seem right to me, especially since it's not reaching 10% CPU load. I can't however see whether I'm actually hitting the pageflip codepath, or whether I'm in a slow path.

Versions of interesting components are:

kernel-PAE-2.6.38.8-91.fc15.i686
mesa-dri-drivers-7.11-0.11.20110525.0.fc15.i686
libdrm-2.4.26-1.fc15.i686
xorg-x11-server-Xorg-1.10.2-1.fc15.i686
xorg-x11-drv-ati-6.14.1-1.20110504gita6d2dba6.fc15.i686
xorg-x11-drv-intel-2.15.0-3.fc15.i686

I'm able to rebuild RPMs with extra patches if needed for testing, and I have both systems to hand for testing as needed.

I'll attach dmesg and Xorg.log from the AMD system to this bug.
Comment 1 Simon Farnsworth 2011-06-30 03:53:11 UTC
Created attachment 48589 [details]
dmesg from AMD system
Comment 2 Simon Farnsworth 2011-06-30 03:53:31 UTC
Created attachment 48590 [details]
Xorg.log from AMD system
Comment 3 Simon Farnsworth 2011-06-30 05:20:41 UTC
I turned on verbose logging by adding -logverbose 65535 to my X startup line, and I can see the following in the log:


[  3640.369] (II) RADEON(0): radeon_dri2_schedule_flip:619 fevent[0x99fa868]
[  3640.384] (II) RADEON(0): radeon_dri2_flip_event_handler:981 fevent[0x99fa868] width 1920 pitch 7680 (/4 1920)

If I'm understanding the code in radeon_dri2.c properly, this suggests that I'm not getting a page flip happening, as radeon_dri2_schedule_flip does a DRI2_SWAP type buffer exchange, not a DRI2_FLIP. It appears that DRI2_SWAP is "exchange front buffer with back buffer, blit contents to screen".
Comment 4 Simon Farnsworth 2011-06-30 08:09:03 UTC
Added more logging - I'm clearly wrong, because I see that I enter the following codepath around line 1100 of radeon_dri2.c in radeon_dri2_schedule_swap, which converts DRI2_SWAPs to DRI2_FLIPs.

    /* Flips need to be submitted one frame before */
    if (info->allowPageFlip &&
	DRI2CanFlip(draw) &&
	can_exchange(scrn, front, back)) {
	swap_type = DRI2_FLIP;
	flip = 1;
    }
Comment 5 Simon Farnsworth 2011-06-30 10:10:36 UTC
More digging - I changed drm.debug to 0xf, and I can see [22545.268466] "drm:evergreen_page_flip], Update pending now high. Unlocking vupdate_lock." in the dmesg output - I take it this means I'm trying to pageflip, but I'm simply failing to manage 60 frames per second?
Comment 6 Simon Farnsworth 2011-07-01 09:54:25 UTC
Created attachment 48657 [details]
An adapted test program, showing that it's not the wait for the swap events that slows me down

This is a simplified version of the test program, which doesn't wait for GLX_INTEL_swap_event responses, but queues buffer swaps as fast as possible - output is substantially the same as with the original test program:


# ./swapbuffersnowaittest 
Screen is 1920x1200
Averaging 2.975610 frames per swap
Average swap interval 49.120903 ms
122 frames, 41 swaps
Comment 7 Simon Farnsworth 2011-07-01 09:57:33 UTC
Also, just in case there's oddities lurking with multihead, I've confirmed via XRandR that only VGA-0 is enabled, and that the screen and CRTC are both at 1920x1200.
Comment 8 Simon Farnsworth 2011-07-04 04:22:55 UTC
Moving up a kernel revision, to airlied's drm-fixes 6002525170df5f72c92ab946b6ebf1656aaec74d hasn't helped. I'm going to try and get a full X11 stack, including new Mesa, running from git.
Comment 9 Simon Farnsworth 2011-07-04 06:27:19 UTC
Updating to git master (on top of airlied's drm-fixes) has improved things, but I'm still not there. I have gone to the following commits in each interesting tree:

dri2proto: commit 7fd18b15646a62bd82a4eb0eca60a34c1731813d
drm: commit 6dd804c5a92104042b8be0a67d107946394a0b7a
glproto: commit 6d7e398150dde1c15b7f62f485cb33bad8f4d5c6
libpciaccess: commit f9159b97834ba4b4e42a07953a33866e7ac90dbd
mesa: commit 424b1210d951c206e7c2fb8f2778acbd384eb247
pixman: commit eff7c8efabe2da33edbf0bdc06e101352981286b
randrproto: commit b5c977ad70b845c278ef52c004d9e5fc3ec97cbf
xf86-video-ati: commit 9bb31158466e6168116d841d12c8b4303f11c4a6
xorg-font-util: commit 9f807dfb4e7bea7128b7538e41de78bb5994ca80
xorg-macros: commit 08dd5d30c137abe43d5b1673edc39eb569881d66
xproto: commit 06ebd5b88bee8c54222bde3178dc773776036802
xserver: commit 73c05be80a1d682ff623e66f47e87403d94caef5

Both test programs still get the same results, but now they look like:

# ~/swapbuffersnowaittest 
Screen is 1920x1200
Averaging 2.000000 frames per swap
Average swap interval 33.317115 ms
122 frames, 61 swaps

# ~/swapbufferstest 
Screen is 1920x1200
Averaging 1.936508 frames per swap
Average swap interval 32.274285 ms
122 frames, 63 swaps

xrandr shows:

# xrandr 
Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 8192 x 8192
LVDS connected (normal left inverted right x axis y axis)
   1024x768       60.0 +
   800x600        59.9  
   848x480        59.7  
   720x480        59.7  
   640x480        59.4  
HDMI-0 disconnected (normal left inverted right x axis y axis)
VGA-0 connected 1920x1200+0+0 (normal left inverted right x axis y axis) 519mm x 324mm
   1920x1200      60.0*+
   1600x1200      60.0  
   1280x1024      75.0     60.0  
   1152x864       75.0  
   1024x768       75.1     60.0  
   800x600        75.0     60.3  
   640x480        75.0     60.0  
   720x400        70.1

Note that the LVDS is a fake - there isn't anything actually connected there.

What's the best way for me to proceed from here? Is there a perf invocation that would help me find bottlenecks, for example?
Comment 10 Jerome Glisse 2011-07-04 09:19:39 UTC
Profilling is likely the solution to get meaningfull data. However it won't work on x86-64 unless you build everything with frame pointer.
Comment 11 Michel Dänzer 2011-07-04 09:30:56 UTC
Note that as you say it's not hitting even 10% CPU load, it's probably not a simple CPU bottleneck but some kind of timing / timeout issue. The tracepoint events listed by

 perf list|grep -e drm -e radeon

might be useful, but I haven't played with those myself, so I can't give specific instructions.

Or, just attach gdb to the app and/or X server, interrupt them every now and them and see if they're in a particular place most of the time. That might at least give an idea where to continue digging.
Comment 12 Simon Farnsworth 2011-07-05 05:18:12 UTC
I can't persuade the tracepoints Michel recommended to do anything - no data gets recorded by perf when I run my test programs. I've attempted to record the data with:

perf script record -a -e  radeon:radeon_bo_create -e radeon:radeon_fence_emit -e radeon:radeon_fence_retire -e radeon:radeon_fence_wait_begin -e radeon:radeon_fence_wait_end -e drm:drm_vblank_event -e drm:drm_vblank_event_queued -e drm:drm_vblank_event_delivered

I looked at perf record/perf report output, to find the biggest non-idle CPU user, and got to r600_fence_finish being 50% of the overall CPU usage, while still not using more than 10% CPU total at any time; despite having frame pointers, I can't get perf to give me a call trace telling me where it's being called from (I did remember the -g option to perf).

Checking dmesg when drm.debug is set to 0xf shows me that I'm calling the radeon_crtc_page_flip function through ioctl every 30 milliseconds or so. Looking in detail (I'll attach the snippet I'm analyzing), I appear to call the flip ioctl, then wait until after the flip has completed before I continue on to waiting for the next vblank.
Comment 13 Simon Farnsworth 2011-07-05 05:24:19 UTC
Created attachment 48768 [details]
drm.debug=0xf dmesg snippet, showing timings of page flips

PID 955 is Xorg, PID 1360 is my test program. Notice that everything blocks after Xorg calls the pageflip ioctl until over 16ms have passed - as if we are blocking in the pageflip unexpectedly.
Comment 14 Simon Farnsworth 2011-07-05 09:01:34 UTC
We've got some form of unexpected resolution dependent behaviour here, that Intel doesn't show.

My target is always 60fps, to match the screen refresh rate of 60Hz.

On Fedora 15's versions of packages, 1024x768 is 60fps, 1920x1080 gets me 30fps, 1920x1200 gets me 20fps.

When I move to self-compiled git versions, 1920x1080 is now 60fps, but 1920x1200 is 30fps. This doesn't match my expectation from page-flipping; I would expect there to be two framebuffers in memory, and to be swapping between them, drawing as I go - resolution-dependent flip rates suggest that there's some form of copying going on that isn't needed in Intel land.
Comment 15 Simon Farnsworth 2011-07-05 09:56:41 UTC
Redoing my benchmarking with the resolution knowledge in mind:

Whatever resolution I choose, perf top shows me that 50% of my used CPU time is spent in r600_fence_finish from the r600g DRI driver.

At 1920x1200, I see my test program using approximately 27% CPU, and I have about 80% idle time across both cores of the Fusion APU.

These numbers are unchanged at 1024x768.

I also see about 5% of my used CPU time in radeon_fp_native_mode from the kernel driver; again, this is resolution independent.

I'm forced to conclude that it's not CPU-side slowdown - is there any way (like Intel's intel_gpu_top and intel_gpu_dump) to see roughly what the GPU is up to? Or is my best bet at this point to instrument something so that I can see the GPU command stream (where?), and use http://developer.amd.com/gpu_assets/r600isa.pdf to decode what's going on?
Comment 16 Michel Dänzer 2011-07-05 10:37:21 UTC
(In reply to comment #15)
> Whatever resolution I choose, perf top shows me that 50% of my used CPU time is
> spent in r600_fence_finish from the r600g DRI driver.

That makes some sense, as r600_fence_finish() is currently more or less a polling loop. This CPU usage is probably glFinish() waiting for the frame to finish rendering.

(BTW, IIRC the intel drivers used to have a broken glFinish() implementation which didn't always wait for all rendering to finish. I wonder if your tests might have been affected by that)


> I also see about 5% of my used CPU time in radeon_fp_native_mode from the
> kernel driver; again, this is resolution independent.

Hmm, this might indicate it's constantly trying to get a native mode for the fake LVDS. Have you tried force-disabling it, to see if that makes any difference?


> I'm forced to conclude that it's not CPU-side slowdown - is there any way (like
> Intel's intel_gpu_top and intel_gpu_dump) to see roughly what the GPU is up to?

Not yet, unfortunately.

However, I think we first need to narrow down further where exactly the CPU is blocking, and why.
Comment 17 Simon Farnsworth 2011-07-05 10:57:45 UTC
I added "video=LVDS-1:d" to my kernel command line, which changes xrandr output to:

# xrandr 
Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 8192 x 8192
LVDS disconnected (normal left inverted right x axis y axis)
HDMI-0 connected 1920x1200+0+0 (normal left inverted right x axis y axis) 519mm x 324mm
   1920x1200      60.0*+
   1920x1080      60.0  
   1600x1200      60.0  
   1280x1024      75.0     60.0  
   1152x864       75.0  
   1024x768       75.1     60.0  
   800x600        75.0     60.3  
   640x480        75.0     60.0  
   720x400        70.1  
VGA-0 disconnected (normal left inverted right x axis y axis)

It also stops the HDMI interface working; I have no picture once KMS kicks in.

Frame rate does not improve; r600_fence_finish is now around 52% of my CPU load, while delay_tsc from the kernel has moved up to be around 5%, replacing radeon_fp_native_mode.

There has been no change in overall CPU usage - I'm still seeing the same 27% used by my test app, 80% overall idle time.

I'm going home now (it's nearly 7pm here) - if there are any tests I should try for you tomorrow, please let me know; getting the Fusion up to speed compared to Intel is my current work priority. I'm happy to try random git trees, arbitrary patches, or just editing code as described by someone else.

I have confirmed that the Intel drivers I'm using have a working glFinish() implementation; rendering always completes before swapbuffers; in as far as possible, I'm comparing like-with-like, and the baseline for both systems is Fedora 15.
Comment 18 Michel Dänzer 2011-07-06 03:15:51 UTC
AFAICT one problem is:

[drm:drm_calc_vbltimestamp_from_scanoutpos], crtc 0 : v 5 p(1154,2)@ 1309862042.323318 -> 1309862042.323284 [e 3 us, 0 rep]

The value 5 does not include DRM_SCANOUTPOS_INVBL (2), so apparently scanout is outside of vertical blank when the flip is processed, so the flip only happens on the next vertical blank.

p(1154,2) looks like it missed the vertical blank period by 2 scanlines. On an RS880 laptop here, this is -10 or lower, and the test program runs at full speed.

Simon, are you using reduced blanking modes?

Alex / Mario, what was the reason for waiting for a vblank interrupt before emitting the flip to the hardware? Wouldn't it be possible to emit it immediately, and only if scanout is outside of vertical blank, wait for the next vblank interrupt before signalling flip completion to userspace?
Comment 19 Simon Farnsworth 2011-07-06 03:19:46 UTC
I'm using the preferred mode from the panel's EDID, which xrandr says is:

  1920x1200 (0x54)  154.0MHz +HSync -VSync *current +preferred
        h: width  1920 start 1968 end 2000 total 2080 skew    0 clock   74.0KHz
        v: height 1200 start 1203 end 1209 total 1235           clock   60.0Hz

The panel is a Dell 2408WFP. This looks identical to the mode generated by cvt -r 1920 1200 60:

# cvt -r 1920 1200 60
# 1920x1200 59.95 Hz (CVT 2.30MA-R) hsync: 74.04 kHz; pclk: 154.00 MHz
Modeline "1920x1200R"  154.00  1920 1968 2000 2080  1200 1203 1209 1235 +hsync -vsync

so I'd guess that it is reduced blanking.
Comment 20 Simon Farnsworth 2011-07-06 07:24:10 UTC
So, this is a gross hack to demonstrate that it's the delaying of the pageflip that's hurting me:

[sfarnsworth@f12simon airlied-drm-fixes]$ git diff
diff --git a/drivers/gpu/drm/radeon/radeon_display.c b/drivers/gpu/drm/radeon/radeon_display.c
index 292f73f..b3c5630 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -475,6 +475,13 @@ static int radeon_crtc_page_flip(struct drm_crtc *crtc,
        /* fire the ring */
        radeon_ring_unlock_commit(rdev);
 
+       spin_lock_irqsave(&dev->event_lock, flags);
+       if (!radeon_crtc->deferred_flip_completion) {
+               radeon_page_flip(rdev, radeon_crtc->crtc_id, work->new_crtc_base);
+               radeon_crtc->deferred_flip_completion = 1;
+       }
+       spin_unlock_irqrestore(&dev->event_lock, flags);
+
        return 0;
 
 pflip_cleanup2:

With this bit of evil in place, I get 60fps at 1920x1200.
Comment 21 Mario Kleiner 2011-07-06 13:54:14 UTC
@Simon: Michel and your testcode is right, the pageflip is only programmed about 2 scanlines after the end of vblank, so the crtc waits for another full refresh cycle before flipping at the start of the next vblank.

Looking at the drm.debug=0xf log, there is a huge delay between entering the radeon irq handler and reaching the vblank handling code:

[68774.608689] [drm:evergreen_irq_process], r600_irq_process start: rptr 8528, wptr 8544
[68774.609052] [drm:drm_calc_vbltimestamp_from_scanoutpos], crtc 0 : v 5 p(1154,2)@ 1309862042.323318 -> 1309862042.323284 [e 3 us, 0 rep]
[68774.609069] [drm:evergreen_page_flip], Update pending now high. Unlocking vupdate_lock.
[68774.609077] [drm:evergreen_irq_process], IH: D1 vblank

-> 68774.609052 - 68774.608689 -> 363 microseconds! That's much larger than anything i've ever seen during testing of that code path even on rather ancient hardware. The whole vblank interrupt handler takes almost 400 usecs to execute.

It probably explains why the flip is scheduled so late and misses the deadline with reduced blanking. I think it would be good to find out where so much time is spent. If you set the drm.timestamp_precision_usec module parameter to zero, it will skip the high precision timestamping and do a do_gettimeofday() call instead, just to see how much time is spent there.

@Michel:

We wait with programming the pageflip until vblank irq, because it was the most simple way to get scheduling, pageflip completion and timestamping for pageflip events reliably done. We thought through various other methods, but each turned out to have some races:

Alex told me that older Radeon's (R100-R500) don't have pageflip completion irq's, so we couldn't use those to detect and timestamp when the pageflip was really done. We first thought about scheduling the pageflip via a packet in the command stream and only checking for pageflip completion in the vblank irq handler. But that has a couple of funny race conditions, especially if the pageflip is scheduled close to or inside the target vblank interval, which can make the pageflip completion and timestamping unreliable and can also lead to large latency for scheduling a flip with the dri2 vblank event method of scheduling flips (glXSwapBuffersMscOML() etc.) if multiple clients are rendering, e.g., to multiple displays. Not good.

The current implementation is relatively simple and robust, and common to all Radeons - at least as long as the vblank irq doesn't consistently miss the whole vblank interval due to large irq execution times like here ;-)  -- My desire is to get very reliable and sub-millisecond precise timestamps for my applications (neuro-science stuff), but problems there would also cause screen corruption and other problems for the average user.

Looking at the current code, R600 and later seem to have pageflip completion irq's. They are currently acknowledged by the irq handler, but not used at the moment.

For >= R600 maybe we could do the pageflip completion/timestamping in a dedicated pageflip irq handler and do the programming of the pageflip in the fence interrupt handler? That would be as early as possible - as soon as the backbuffer is swap ready. One could move the calls to the "flip programming part" of radeon_crtc_handle_flip() from the vblank irq handler to the fence irq handler, basically unmodified with minimal overhead. And then put the "flip completion part " of radeon_crtc_handle_flip() in a separate function, called from the pageflip irq handler.

There's a little issue then with the ordering of vblank irq's wrt. pageflip completion irq's and we'd need to take care of this for correct timestamping, but we can solve that in the same way as we do for the intel-kms driver, which also uses pageflip completion interrupts and has to take care of this ordering as well to get correct timestamps.

For a future pageflip ioctl() v2 it would make sense to allow synchronous flipping of multiple crtc's with one ioctl() invocation to allow for tear-free multi-display swaps and for implementation of things like swap group extensions. This could be easier if pageflips are scheduled from within the fence irq handler or even earlier. With the current implementation and six display heads like on evergreen/eyefinity hardware it can get difficult to implement reliable swaps across multiple displays.

-mario
Comment 22 Alex Deucher 2011-07-06 14:13:27 UTC
FWIW, the pageflip interrupts weren't too reliable at least on 6xx, but they seemed to work ok on 7xx+.  Also, I can see problems where you might not get interrupts for some flips if a later base update write comes in via the CP before the irq handler has acked the interrupt from the frame before it.  FWIW, I asked internally and the closed driver uses vblanks. 

Also for reference, Simon's new patch:
http://lists.freedesktop.org/archives/dri-devel/2011-July/012807.html
Comment 23 Jerome Glisse 2011-07-06 14:25:54 UTC
So what exactly the timestamp should correspond to ? For me vblank is good
enough and looking at specification its says that it's when the hw register are
using the new value. So either you consider that's when vblank occur or that's
at the end of vblank when the new buffer start being scanout.

Still i think this is achievable with Simon patch which simplify this whole
process to:

So :
pflip_ioctl
  if (invblank)
    wait;
  update crtc reg
  enable irq and ask for vblank or pageflip completed irq

irq handler:
  report proper timestamp

For older hw we could either accept the error margin due to interrupt happening
on vblank or we could compute how much time is needed for vblank to end and add
it to timestamp.

This change would remove complexity or least confine it to the timestamp
computation. I don't think we can count too much on fast irq handling.
Comment 24 Jerome Glisse 2011-07-06 14:35:49 UTC
So rereading the spec it's all about computing vblank lenght and adding it to timestamp at vblank interrupt.

Seems like the most reliable solution
Comment 25 Alex Deucher 2011-07-06 14:52:46 UTC
You can select where you want the double buffered update to take place by adjusting the D*MODE_MASTER_UPDATE_MODE regs (pre-avivo has similar bits).
Comment 26 Mario Kleiner 2011-07-06 18:02:26 UTC
Ok, i looked at the patch. The way it is now it won't work reliably and is vulnerable to the races we tried to remove in the current implementation.

First, as far as i understood, the pageflip ioctl() is supposed to be asynchronous, non-blocking and fast:
<http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg39857.html>

With the patch, the ioctl() doesn't block until pageflip completion, but it does add approx. 0.7 msecs for waiting until outside of vblank whenever a flip is scheduled more than 1 frame into the future, because then the ioctl() is usually called within vblank. It blocks in radeon_bo_wait() until all pending rendering has completed. If you have a large amount of rendering to do, or multiple clients rendering to multiple displays, i'd assume the blocking time could be substantial. It would block the x-server's main loop for that time, not just the actual client app which submits the rendering work. I don't know how much real world sluggishness this adds to the GUI, compared to other current sources of lag, but it doesn't sound like such a good idea to me to block the x-server there to avoid a little bit of complexity if we don't need to block and will have more displays in the future, esp. on 6-display radeon gpu's. If you think about having swap group support for synchronized swaps across different drawables or similar features in the future, then it would also be better to stay away from blocking in the ioctl() itself.

If we move the pageflip programming from the vblank irq handler to the fence irq handler we could avoid blocking in the ioctl, with basically the same complexity as now.

The bigger problem for me is that the current patch doesn't wait reliably until the flip has completed before it sends the pageflip completion event. Getting accurate vblank timestamps is not the problem. That is already solved by the drm_vblank_count_and_time() function. It returns a timestamp which corresponds to start of scanout at the end of the most recent/current vblank, just as the OML_sync_control spec requires. I tested r500, r600 and a HD-5000 eyefinity 4-display setup with special measurement equipment and the timestamps are almost microsecond accurate, compensating for any irq delays.

The problem is making sure that we emit the pageflip completion event during the video frame in which the pageflip completed, not one or multiple refresh cycles later or earlier. Otherwise the timestamps are meaningless, even dangerous for applications like mine that really must rely on them, and the whole synchronization of rendering to bufferswaps and throttling is broken and we'll have beautiful graphics glitches.

The proposed patch schedules radeon_crtc->unpin_work for the vblank irq handler, then blocks on radeon_bo_wait() for an unknown amount of time, then programs the flip. During this time a vblank irq could already do a too early unpin and emission of a wrong pageflip completion event and timestamp.

The "wait until outside vblank" is not reliable. E.g., assume the ioctl() is called a few microseconds before the vblank, detects it is not in vblank, therefore goes ahead, gets delayed for some reason (interrupt, preemption by higher prio thread, whatever) while the gpu enters the vblank, and then programs the flip just inside the vblank -> boom.

As i had to learn, vbank irq's on radeon's can also happen a few scanlines before the start of vblank, so it can happen that the check for pageflip completion would run before the flip happens, detect that no flip happened, and then the flip would happen a few microseconds later and go unnoticed for a refresh cycle and we'd be back to 30 fps on a 60 hz display.

I'm not saying the proposed patch couldn't be fixed to handle this correctly, but it would require a bit of extra locking and communication between the ioctl() in process context and the vblank irq handler, and some protection of the ioctl() against preemption, to avoid stalling the irq handler on some shared locks etc. I'm not sure if the correctly behaving end result wouldn't be as "complex" as the current implementation. The current method is simple in the sense that all timing sensitive stuff happens sequentially in a single "irq thread" - the gpu irq handler can't race against itself. There can be races between the gpu and the cpu, which i think we took care of, but it avoids possible additional races between scheduling code in process context and irq context. It also tries to make sure that it emits the pageflip completion as soon as possible and doesn't delay it by a frame just because of an unlucky order of execution.

My proposal and Simons patch are very similar in spirit, just that i'd program the flip from within the fence interrupt handler to avoid potential races between ioctl and irq and to avoid blocking in the ioctl().
Comment 27 Mario Kleiner 2011-07-06 18:43:42 UTC
(In reply to comment #22)
> FWIW, the pageflip interrupts weren't too reliable at least on 6xx, but they
> seemed to work ok on 7xx+. 

Thanks for the info Alex. Can you define "work ok"? Is this "no apparent problems, but only lightly tested" or more like "works nine out of ten times"?

> Also, I can see problems where you might not get
> interrupts for some flips if a later base update write comes in via the CP
> before the irq handler has acked the interrupt from the frame before it. 

When would this happen? During or close to a mode-set? Not during normal page flipping, no? I thought that so far all crtc updates go directly via mmio instead of the CP? Or does Atombios use the CP?

>FWIW, I asked internally and the closed driver uses vblanks. 

Ok, that's a bit scary. Do you know if it is because it is supposed to be "good enough" or due to some real problems? I do know that the Catalyst drivers were far from perfect wrt. vsync'ed swapping in the past. My own toolkit implements a couple of crazy workarounds for some oddities i saw at least on Windows and OS/X, can't remember if it was on Linux as well. So from my experience i'm not sure if i personally would count that as a pro or contra argument against using something else than vblank irq's ;-)

At least evergreen seems to also have vline interrupts? One could (ab)use them as well to fake a home-grown pageflip completion interrupt, e.g., by triggering one at scanline 1, immediately after end of vblank and then just checking in the vline irq if the update_pending bit is clear -> pageflip completed.

-mario
Comment 28 Alex Deucher 2011-07-06 20:07:58 UTC
(In reply to comment #27)
> (In reply to comment #22)
> > FWIW, the pageflip interrupts weren't too reliable at least on 6xx, but they
> > seemed to work ok on 7xx+. 
> 
> Thanks for the info Alex. Can you define "work ok"? Is this "no apparent
> problems, but only lightly tested" or more like "works nine out of ten times"?
> 

No apparent problems, but only lightly tested.

> > Also, I can see problems where you might not get
> > interrupts for some flips if a later base update write comes in via the CP
> > before the irq handler has acked the interrupt from the frame before it. 
> 
> When would this happen? During or close to a mode-set? Not during normal page
> flipping, no? I thought that so far all crtc updates go directly via mmio
> instead of the CP? Or does Atombios use the CP?
> 

All current pageflipping and modesetting happens via MMIO.  I'm not even sure it's a problem, just thinking out loud.

> >FWIW, I asked internally and the closed driver uses vblanks. 
> 
> Ok, that's a bit scary. Do you know if it is because it is supposed to be "good
> enough" or due to some real problems? I do know that the Catalyst drivers were
> far from perfect wrt. vsync'ed swapping in the past. My own toolkit implements
> a couple of crazy workarounds for some oddities i saw at least on Windows and
> OS/X, can't remember if it was on Linux as well. So from my experience i'm not
> sure if i personally would count that as a pro or contra argument against using
> something else than vblank irq's ;-)

I think it was mostly due to using the same well tested code across generations since pre-r6xx asics didn't have the flip interrupts.  Things might have changed since I last asked months ago.  I can check again.  I'm assuming the flip interrupts were added for a reason...

> 
> At least evergreen seems to also have vline interrupts? One could (ab)use them
> as well to fake a home-grown pageflip completion interrupt, e.g., by triggering
> one at scanline 1, immediately after end of vblank and then just checking in
> the vline irq if the update_pending bit is clear -> pageflip completed.

All asics support vline interrupts.
Comment 29 Michel Dänzer 2011-07-07 01:47:49 UTC
Thanks for explaining the problems with doing everything directly in the ioctl, Mario. Guess it was too good to be true...

I agree it should be a good idea to defer to an interrupt triggered by completion of rendering to the new front buffer. But I don't think a newly created fence is suitable for that, as more, unrelated rendering might have been queued since the last rendering to the new front buffer. The existing fence bo->tbo.sync_obj should be used. However, there might be no bo->tbo.sync_obj if rendering has already completed.

Also, I'm afraid we do need to wait for out of vblank first, otherwise the flip might happen one frame earlier than expected by userspace, if it manages to call the flip ioctl during the previous vblank.

In summary, how about something like this for the flip ioctl:

1. If we're in vblank, defer step 2 until after vblank. Maybe via a vsync or vline interrupt, or if that's not reliable, maybe it's acceptable for the ioctl to block for up to ~1ms.

2. If the new front buffer has a bo->tbo.sync_obj fence, defer step 3 until it has signalled.

3. Emit flip. If the hardware doesn't take it immediately, defer step 4 until the next vertical blank.

4. Signal flip completion to userspace.
Comment 30 Jerome Glisse 2011-07-07 06:36:38 UTC
This flip stuff is hairy. I think best is :

flip ioctl:
- pin new front
- delay work to bo fence (if no fence use the oldest alive fence)
- return

fence handler:
- on flip work wait for outside vblank
- write reg

irq handler:
- report timestamp + cleanup (unpin old front)

So wait for bo to be idle before waiting to be out of vblank

Note that as long as we don't introduce new ioctl we can't properly handle the case where reminder != 0 in the oml extension

"
 if the current MSC is greater than or equal to <target_msc>, the
 buffer swap will occur the next time the MSC value is incremented
 to a value such that MSC % <divisor> = <remainder>. If <divisor> = 0,
 the swap will occur when MSC becomes greater than or equal to
 <target_msc>.
"
Comment 31 Michel Dänzer 2011-07-07 06:49:55 UTC
(In reply to comment #30)
> So wait for bo to be idle before waiting to be out of vblank

Bad idea I'm afraid: Consider the case where the ioctl is called outside of vblank, but by the time the BO is idle we're in vblank. We'd need to (probably needlessly) wait for a full frame before the flip takes effect.
Comment 32 Jerome Glisse 2011-07-07 07:11:43 UTC
Well i can think of similar bad case in the other way.

-wait for vblank
-wait for bo put us in next vblank
-program reg flip happen on this frame
-irq reporting report on next vblank

So we are off by one frame on the timestamp.

To me it seems more logic to first make sure that the potentialy longer wait (bo_wait is unbound while vblank wait is bound for any human video mode) is done first and then do the smaller one. Yes can we miss the next vblank but so could we with bo wait after and with bo wait after we might do bad reporting on the timestamp.
Comment 33 Simon Farnsworth 2011-07-07 10:46:36 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > So wait for bo to be idle before waiting to be out of vblank
> 
> Bad idea I'm afraid: Consider the case where the ioctl is called outside of
> vblank, but by the time the BO is idle we're in vblank. We'd need to (probably
> needlessly) wait for a full frame before the flip takes effect.

Against that, and arguing in favour of Jerome's method: if userspace is submitting enough rendering that the BO goes idle just as vblank starts, it's already on a performance knife-edge. We're just arguing about which way to push you if you're already on that knife-edge.

As Jerome's method is guaranteed to at least accurately report the timestamps of page flips, it's probably the right way to push you; I've been caught up in other issues today, but I'll try to code up an RFC patch for Jerome's idea tomorrow.
Comment 34 Mario Kleiner 2011-07-07 12:06:40 UTC
On Jul 7, 2011, at 5:18 PM, Alex Deucher wrote:

> 2011/7/7 Michel Dänzer <michel@daenzer.net>:
>> On Don, 2011-07-07 at 11:00 -0400, Alex Deucher wrote:
>>> On Thu, Jul 7, 2011 at 10:45 AM, Mario Kleiner
>>> <mario.kleiner@tuebingen.mpg.de> wrote:
>>>> Jerome, can you please forward this to the bugzilla discussion? My
>>>> web-access to bugzilla is dead at the moment [Internet problems]:
>>>>
>>>> Need to run to a meeting for two hours or so, will be right back  
>>>> after the
>>>> break ;-)
>>>>
>>>> What about this? It seems to address our issues:
>>>>
>>>> We could use the 'vline' irq's if Alex thinks they are reliable and
>>>> supported on all asic's. If we program vline irq's to trigger  
>>>> immediately
>>>> after end of vblank, e.g., scanline 1, we could do deferred  
>>>> pageflip
>>>> programming there, and wait for flip completion/timestamping/etc.
>>>
>>> The only tricky thing with vlines is that the ddx uses them for tear
>>> free rendering so a command stream may change the vline at any time.
>>> However, we could adjust the CS checker to override the vline from
>>> userspace with whatever the kernel happened to need.
>>
>> What about the vsync interrupts? Are those still available as of  
>> AVIVO?
>
> Here's a list of the available display int srcs on r6xx+:
> VBLANK
> VLINE
> SNAPSHOT
> FORCE_COUNT
> FORCE_VSYNC
> TRIG_A
> TRIG_B
> V_UPDATE
> GRPH_PFLIP
>
> 7xx+ adds:
> VSYNC_NOM
> SET_V_TOTAL_MIN
>
> V_UPDATE looks like the most useful.  It generates an interrupt when
> the double buffered display regs switch (falling edge).
>

That sounds good. V_UPDATE irq would mark the end of the portion of  
vblank that we care about, so we could return a finished pageflip  
event earlier and don't need to mess with the vline and ddx.

Btw. i've added Simon, and forwarded the last e-mails, we "lost" him.  
Webinterface works again for me, should we move back to bugzilla?
-mario
Comment 35 Mario Kleiner 2011-07-07 13:39:39 UTC
Hmm, but V_UPDATE irq is not supported on R500, if i look at the correct manual, which is still in widespread use for many serious applications (and on my main laptop for development and testing).
Comment 36 Mario Kleiner 2011-07-07 14:06:51 UTC
(In reply to comment #35)
> Hmm, but V_UPDATE irq is not supported on R500, if i look at the correct
> manual, which is still in widespread use for many serious applications (and on
> my main laptop for development and testing).

Hmm, and we can only use it to detect pageflip completion, not for programming a pageflip outside of vblank, because at a regular vblank the double-buffered display registers won't switch. Therefore we would need the vline irq anyway for programming the flip, so maybe it is less implementation effort to only use the vline irq? It is also supported on all asics, which is good.

At least on R500 and evergreen the vline irq is per crtc, and a crtc can either display a fullscreen drawable, which would use page-flipping for bufferswaps, or display a regular desktop where drawables use vline sync'ed ddx CopyRegion blits for swapping. Offscreen blits (e.g., inside desktop compositor?) are not vsync'ed, so maybe it can't happen that the ddx and pageflip code would make simultaneous conflicting use of the vline irq's?

> Michel Daenzer wrote:
> Looks good, but note that we only need to avoid emitting the flip from
> the same vblank we were already in when the ioctl was called, to avoid
> flipping too early. After that, we can and probably want to emit the
> flip even from vblank, to avoid the flip being delayed by another frame.

We could use the vblank timestamping to solve this:

In ioctl() use drm_vblank_count_and_time() to get the timestamp of the (predicted) end of the current/most recent vblank interval. If we are already outside the "bad" vblank at ioctl() call time, this will be a timestamp in the past. If we're inside the "bad" vblank, it will be the predicted time of the end of the forbidden vblank. In any case, programming a flip at a do_gettimeofday() time after that timestamp (plus maybe one or two scanline durations for extra paranoia) would be save. We could just submit that this deadline to the crtc->work item and check system time against it in the fence irq handler to decide if it is safe to program the flip immediately.
Comment 37 Alex Deucher 2011-07-07 14:25:29 UTC
Maybe we should try and sort out why there is so much latency in the interrupt handler for Simon rather than rewriting the entire thing.  Or we could just use the pflip interrupts where they are working and use the existing code on older asics.
Comment 38 Mario Kleiner 2011-07-07 14:51:16 UTC
(In reply to comment #37)
> Maybe we should try and sort out why there is so much latency in the interrupt
> handler for Simon rather than rewriting the entire thing.  Or we could just use
> the pflip interrupts where they are working and use the existing code on older
> asics.

Yes. I think that delay needs to be sorted out anyway - it would also kill dynamic clock switching which needs to happen within vblank afaik and any kind of realtime performance, e.g., with preempt_rt kernels, if not fixed. Then the immediate problem of Simon would go away. iirc correctly, the numbers should be closer to 10-30 usecs, instead of close to 400 usecs, tested on Core2Duo 2.2 Ghz rv530, and ancient Athlon 1.2 Ghz with r600.

But the proposed new design could make sense for a future pageflip ioctl v2 implementation to add robustness and some headroom for things like synchronized swaps across crtc's, swap groups, frame-sequential stereo, or lower latency flips. I will probably toy around with this in my spare time if we choose not to rewrite the entire thing now - which i would sort of prefer, given that the current implementation is tested to be pretty reliable on r500+ and i have actual users now starting to use/rely on the radeon timestamping being precise and reliable.
Comment 39 Simon Farnsworth 2011-07-08 02:41:15 UTC
(In reply to comment #37)
> Maybe we should try and sort out why there is so much latency in the interrupt
> handler for Simon rather than rewriting the entire thing.  Or we could just use
> the pflip interrupts where they are working and use the existing code on older
> asics.

I'm going to sprinkle DRM_DEBUGs in the IRQ handling path, see if I can find a smoking gun for latency, or if it's just lots of little things building up.
Comment 40 Michel Dänzer 2011-07-08 03:05:20 UTC
Created attachment 48883 [details] [review]
Use new frontbuffer fence, if any

(In reply to comment #38)
> I will probably toy around with this in my spare time if we choose not
> to rewrite the entire thing now - which i would sort of prefer, given that the
> current implementation is tested to be pretty reliable on r500+ and i have
> actual users now starting to use/rely on the radeon timestamping being precise
> and reliable.

Yeah, sounds like a good plan.

However, I think we should at least fix it to use the new frontbuffer fence. Something like the attached patch (only compile tested).


(In reply to comment #33)
> As Jerome's method is guaranteed to at least accurately report the timestamps
> of page flips, [...]

I don't understand what would prevent accurate timestamp reporting under any circumstances.
Comment 41 Simon Farnsworth 2011-07-08 04:32:57 UTC
Created attachment 48888 [details] [review]
dmesg snippet showing which bit of the IRQ handler is slow

I've come up to airlied drm-fixes d61a06862ba8c14466e1dd718cac460da0465ddd and applied the attached patch to help track my IRQ latencies down.


[   83.799765] [drm:evergreen_irq_process], r600_irq_process start: rptr 15728, wptr 15744
[   83.799785] [drm:evergreen_irq_process], About to evergreen_irq_ack
[   83.799794] [drm:evergreen_irq_ack], Before register reads
[   83.799804] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS
[   83.799813] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS_CONTINUE
[   83.799823] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS_CONTINUE2
[   83.799833] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS_CONTINUE3
[   83.799842] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS_CONTINUE4
[   83.799852] [drm:evergreen_irq_ack], Read DISP_INTERRUPT_STATUS_CONTINUE5
[   83.799861] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC0_REGISTER_OFFSET
[   83.799868] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC1_REGISTER_OFFSET
[   83.799955] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC2_REGISTER_OFFSET
[   83.800042] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC3_REGISTER_OFFSET
[   83.800130] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC4_REGISTER_OFFSET
[   83.800217] [drm:evergreen_irq_ack], Read GRPH_INT_STATUS + EVERGREEN_CRTC5_REGISTER_OFFSET
[   83.800222] [drm:evergreen_irq_ack], After register reads
[   83.800226] [drm:evergreen_irq_ack], Before pageflip occurred
[   83.800230] [drm:evergreen_irq_ack], After pageflip occurred
[   83.800234] [drm:evergreen_irq_ack], Before vblank/vline
[   83.800238] [drm:evergreen_irq_ack], After vblank/vline
[   83.800242] [drm:evergreen_irq_ack], Before HPD
[   83.800246] [drm:evergreen_irq_ack], After HPD
[   83.800250] [drm:evergreen_irq_process], Done evergreen_irq_ack
[   83.800261] [drm:drm_calc_vbltimestamp_from_scanoutpos], crtc 1 : v 5 p(736,12)@ 1310124309.668537 -> 1310124309.668370 [e 2 us, 0 rep]

is the resulting block in dmesg when I'm trying to pageflip. You can see that the slow bit is reading GRPH_INT_STATUS + EVERGREEN_CRTC2_REGISTER_OFFSET to GRPH_INT_STATUS + EVERGREEN_CRTC5_REGISTER_OFFSET, at about 80usec each; if I understand the differences between discrete Evergreen and Fusion properly, it's being very slow to read interrupt statuses for CRTCs that don't exist in silicon.

What's the best way to handle this? A "sumo" specific irq_process function and helpers in evergreen.c, and appropriate changes to radeon_asic.c?
Comment 42 Michel Dänzer 2011-07-08 05:18:32 UTC
Created attachment 48892 [details] [review]
Add missing register access guards

(In reply to comment #41)
> What's the best way to handle this? A "sumo" specific irq_process function and
> helpers in evergreen.c, and appropriate changes to radeon_asic.c?

I think something like this patch should be enough.
Comment 43 Alex Deucher 2011-07-08 07:12:49 UTC
Created attachment 48897 [details] [review]
thoroughly clean up multiple crtc guards

This patch cleans up the crtc guards throughout the evergreen+ code and also protects cards with 4 crtcs.
Comment 44 Simon Farnsworth 2011-07-08 07:17:59 UTC
Created attachment 48898 [details] [review]
The register access guards patch as applied by hand

Michel's register guards patch didn't apply cleanly to the drm-fixes kernel I'm using; the attached is what I ended up running with.

I can confirm that it fixes this bug.

Michel, there's one change I've made that's beyond what you did: in the old pageflip function, the ring lock and unlock pair were needed to emit the fence to hardware. In your version, where the fence isn't emitted, you don't need the ring lock/unlock pair.
Comment 45 Michel Dänzer 2011-07-08 07:20:44 UTC
Review of attachment 48897 [details] [review]:

Yeah, that's better than my patch, thanks. :)

Reviewed-by: Michel Dänzer <michel@daenzer.net>
Comment 46 Simon Farnsworth 2011-07-08 07:25:15 UTC
(In reply to comment #43)
> Created an attachment (id=48897) [details]
> thoroughly clean up multiple crtc guards
> 
> This patch cleans up the crtc guards throughout the evergreen+ code and also
> protects cards with 4 crtcs.

I can confirm that this patch fixes the bug for me.

Tested-by: Simon Farnsworth <simon.farnsworth@onelan.co.uk>
Comment 47 Michel Dänzer 2011-07-08 07:51:04 UTC
Comment on attachment 48883 [details] [review]
Use new frontbuffer fence, if any

(In reply to comment #44)
> Michel, there's one change I've made that's beyond what you did: in the old
> pageflip function, the ring lock and unlock pair were needed to emit the fence
> to hardware. In your version, where the fence isn't emitted, you don't need the
> ring lock/unlock pair.

Ah right, thanks. I'll post a new version of the fence patch to the dri-devel list once I get around to testing it.
Comment 48 Mario Kleiner 2011-07-08 11:16:38 UTC
Review of attachment 48897 [details] [review]:

Other than that typo
Reviewed-by: mario.kleiner@tuebingen.mpg.de

::: drivers/gpu/drm/radeon/evergreen.c
@@ +1129,3 @@
 		WREG32(EVERGREEN_CRTC_UPDATE_LOCK + EVERGREEN_CRTC3_REGISTER_OFFSET, 1);
+	}
+	if (rdev->num_crtc >= 4) {

I think this is a typo and should be >=6 ?
Comment 49 Alex Deucher 2011-07-08 11:21:51 UTC
(In reply to comment #48)
> Review of attachment 48897 [details] [review]:
> 
> Other than that typo
> Reviewed-by: mario.kleiner@tuebingen.mpg.de

I just sent out a fixed version, forgot to update the bug:
http://lists.freedesktop.org/archives/dri-devel/2011-July/012864.html