I wrote the attached patch that adds mediaLib support to cairo and makes a big difference speeding up evince. The following is some performance stats showing percentage of time in various functions while starting up evince with a big complicated pdf document and scrolling down about 50 pages. before 15.19% fbCombineInU 10.30% fbFetchTransformed 9.15% GfxImageColorMap::getRGBLine 6.51% fbCombineOverU 5.12% GfxDeviceGrayColorSpace::getRGBLine 3.26% fbFetch_a8 2.64% fbFetchPixel_x8r8g8b8 2.62% fbCombineMaskU 1.35% fbStore_a8 after also now doing inline 16.82 fbFetchTransformed 9.90 GfxImageColorMap 6.97 GfxDeviceGrayColorSpace 6.85 mlib_ImageBlend_ZERO_DA 3.07 fbFetchPixel_x8r8g8b8 2.23 fbFetch_a8 1.67 mlib_VideoColorRGBint_to_BGRAint 1.26 mlib_ImageBlend_ONE_OMDA 0.06 mlib_ImageSetStruct 0.05 mlib_ImageChannelExtract 0.02 mlib_ImageBlend_SA_ZERO I suspect that before the maintainers would accept this patch that they would want the mediaLib functions to be separated more like the fbmmx.[ch] files. But before I do that extra work, I wanted to see if you would accept this patch. Note that I am also working on a GTK+ mediaLib integration patch which the GTK maintainers say they will accept sometime soon. http://bugzilla.gnome.org/show_bug.cgi?id=344813 I reference this only because some of the discussion in that bug explains answers to questions you may have about what mediaLib is and that it is appropriate for integration.
Created attachment 8938 [details] [review] mediaLib patch for cairo
Created attachment 8943 [details] updated patch This patch properly surrounds the #pragma with "#ifdef __SUNPRO_C" since this minor performance tuning (not related to mediaLib) works with other compilers. Caught by James Cheng (from the mediaLib team) in review.
Created attachment 8945 [details] updated patch This patch properly surrounds the #pragma with "#ifdef __SUNPRO_C" since this minor performance tuning (not related to mediaLib) works with other compilers. Caught by James Cheng (from the mediaLib team) in review.
Created attachment 8946 [details] [review] updated patch This patch properly surrounds the #pragma with "#ifdef __SUNPRO_C" since this minor performance tuning (not related to mediaLib) works with other compilers. Caught by James Cheng (from the mediaLib team) in review.
Oh sorry for adding the same patch 3 times - my browser went a bit wonky
Sounds like good stuff. What did the built-in cairo performance test suite say about any speedups/slowdowns as a result of this change?
Created attachment 8980 [details] performance output with mediaLib performance output with mediaLib patch
Created attachment 8981 [details] [review] output without mediaLib I have attached the performance output with and without mediaLib. You can see that mediaLib does improve some functions. I do notice that the functions that have MEDIALIB tunings aren't exercised a great deal by this performance suite: fbCombineOverU called 3.67% of the time, fbCombineMaskU 1.67% of the time, fbStore_a8 0.03% of the time, and fbCombineInU not at all. In fact, the performance test looks like this: pixman_gradient_color 10.79% fbFetchSourcePict 10.49% fbCombineOverU 3.67% gradientPixel 2.64% sqrt 2.56% _active_edges_to_traps 2.15% memcpy 2.05% fbCombineMaskU 1.67% fbRasterizeEdges8 1.58% _cairo_bo_sweep_line_validate 1.36% _cairo_bo_edge_start_or_continue_trap 1.17% _cairo_pixman_render_edge_init 1.17% fbCombineOutReverseU 1.13% UDiv 1.07% pixman_fill_rect_32bpp 0.95% qsort 0.93% fbCombineAddU 0.91% So, almost 50% of the total time is in just the above functions. I wonder if the performance test coverage is a bit poor? For reference, using mediaLib the percentages change as follows: fbFetchSourcePict 16.26% pixman_gradient_color 15.28% gradientPixel 3.94% sqrt 3.80% _active_edges_to_traps 1.98% memcpy 1.92% fbRasterizeEdges8 1.48% _cairo_bo_sweep_line_validate 1.12% _cairo_bo_edge_start_or_continue_trap 1.11% _cairo_pixman_render_edge_init 1.05% mlib_ImageBlend_ONE_OMDA 1.05% _cairo_uint64x64_128_mul 1.01% UDiv 0.91% mlib_ImageBlend_ZERO_DA 0.91% fbCombineAddU 0.87% Total runtime 86.531 seconds
(In reply to comment #8) > I do notice that the functions that have MEDIALIB tunings aren't exercised a > great deal by this performance suite: fbCombineOverU called 3.67% of the time, > fbCombineMaskU 1.67% of the time, fbStore_a8 0.03% of the time, and > fbCombineInU not at all. ... > So, almost 50% of the total time is in just the above functions. I wonder if > the performance test coverage is a bit poor? The general strategy we've been taking is to allow the performance suite to grow according to specific performance issues that people encounter or work to improve. Perctange of time spent in any given function across the entire performance test suite is not an interesting metric at all. We have not done any work to balance that, nor do we report any metrics across the entire test suite. Instead, as you work to improve specific aspects of cairo's performance, you should be writing new performance test cases, (see cairo/perf/README), to exercise the aspects of interest. After adding the test cases, then you can add your fixes and cairo-perf-diff will then demonstrate the improvement. And, with this approach, the existence of the test case will help us to avoid any future performance regression in this area. Thanks for your efforts to improve cairo. -Carl
Created attachment 9471 [details] [review] Updated patch In testing we found that the previous patch didn't work on Sparc because it was using "#ifdef __BIG_ENDIAN__" instead of the correct "#ifdef _BIG_ENDIAN". This updated patch fixes the problem and works properly on x86 and Sparc. No comments on this bug report for a while. Can this mediaLib support go into cairo?
Created attachment 9845 [details] [review] updated patch for cairo 1.4.6 The last version of the patch does not apply against cairo 1.4.6 due to a change in pixman/src/Makefile.am. This patch fixes this minor problem so it applies cleanly against the latest cairo.
Note that the patch to add mediaLib support to GTK+ was recently accepted and commited, so it would be nice to get this patch upstream as well so mediaLib is better supported out-of-the-box by cairo as well. Refer to: http://bugzilla.gnome.org/show_bug.cgi?id=344813
A couple things: 1) pixman is currently being extracted from cairo into its own separate library that will be shared with the xserver. So expect to move your efforts in that direction and update your patches accordingly. 2) As Carl said above, it's important to use cairo-perf to report any speedups you've gained. That way, the advantages are clear. So far, the only numbers reported here are relative percentages, which aren't nearly as convincing. In addition, you're not leveraging the statistical methods that Carl has spent a good amount of time on to make cairo-perf numbers as meaningful as possible. That said, I don't doubt that you'll see some nice speedups since we currently have no SIMD optimized code for Sun architectures. Can't wait until Sun open sources the "fast" version of mediaLib :)
How do you run cairo-perf? I've never used it before. The test case that I notice significant performance improvement is with using many documents in evince. The tuned functions seem to be hit a lot when viewing PDF documents. Can you run cairo-perf against evince? Also, FYI, the SSE2/VIS/etc. mediaLib code is planned to be released under an open license in the near future. The mediaLib team is working through a few final issues that are necessary before this can happen, but they are working on it. They aren't just releasing the C code just to be rude. :)
(In reply to comment #14) > How do you run cairo-perf? I've never used it before. The cairo-perf program is included with the cairo source distribution in the cairo/perf directory. You can build and run it by simply typing "make perf" at the top level of the cairo source distribution. Or you can read cairo/perf/README for details on other ways of running it. > The test case that I > notice significant performance improvement is with using many documents in > evince. The tuned functions seem to be hit a lot when viewing PDF documents. > Can you run cairo-perf against evince? No, there's no way to run cairo-perf against evince. It's not that kind of tool. Instead, cairo-perf is a set of micro-benchmarks that serve as the performance regression test suite for cairo. We use is to attempt to ensure that no slowdowns are introduced from one release to the next. And we also use it to quantify the benefits of any proposed performance optimizations. So, if you've got a proposed performance optimization, please run cairo-perf to quantify its benefit, (and see the discussion of cairo-perf-diff in the README for more hints on how to do this). And if you run into a case where you're certain that your change has some benefit that cairo-perf isn't showing, then it's quite easy to augment cairo-perf, (and again the README has pointers on how to do this). So, if you end up needing to do that, you should be able to use evince to figure out what unique thing it might be doing that might not be currently exercised by the cairo-perf suite. And if you google for libcairowrap you might also find it helpful in extracting a sequence of cairo calls from a complete program. I hope that's helpful. -Carl
The mediaLib team recently released it under CDDL, including all VIS/VIS2/MMX/SSE2 code. So now, people on Linux might consider seeing how well these mediaLib tunings improve things on Linux. I have heard that mediaLib builds and runs well on many Linux distros. The mediaLib 2.5 source code is now available for download from: http://www.opensolaris.org/os/downloads/devpro/ The tarball devpro-mlib-src-20071106.tar.bz2 consists of source code for building all versions of libmlib.so.2 and libmlib_sse2.so.2 under /usr/lib. In other words, all VIS/ VIS2 and MMX/SSE2 code for mediaLib is now open sourced. It has over 4000 files, or 2.4 million lines of code. More information about mediaLib can be found at: http://www.sun.com/processors/vis/mlib.html Please send your questions and comments on mediaLib to:
Created attachment 12795 [details] cairo-perf report Attached please find a diff report generated by running "make check" in the perf directory with the following two test cases: 1) cairo without USE_MEDIALIB defined 2) cairo with USE_MEDIALIB defined Then I ran cairo-perf-diff-files to generate the attachment. I think it shows that many (about 30) functions have 5-14% speedup, and many other functions have smaller speedups. It seems two functions are slightly slower (paint_similar_rgb_source-512 is 6.27x slower and stroke_solid_rgb_source-128 is 4.02x slower though going from 9.39 to 9.47 doesn't really seem that significant to me). There are a few other slower functions, all below 1.57x slower, which is probably not significant. I think this shows that the patch, overall, improves performance significantly. Can this now go upstream?
(In reply to comment #17) > I think it shows that many (about 30) functions have 5-14% speedup, and many > other functions have smaller speedups. Actually, it says 5-14 _times_ faster, not 5-14 percent faster. Don't shortchange yourself :) > It seems two functions are slightly slower (paint_similar_rgb_source-512 is > 6.27x slower and stroke_solid_rgb_source-128 is 4.02x slower though going from > 9.39 to 9.47 doesn't really seem that significant to me). There are a few > other slower functions, all below 1.57x slower, which is probably not > significant. Yea, what the heck is going on with cairo-perf calling a 9.39 -> 9.47 change a "4.02x slowdown"? Carl? And notice that the top 4 slowdowns are all on the xlib surface, while the majority of the speedups are on the image surface. The 6x slowdown is pretty serious, Maybe it has to do with a inefficient xlib fallback-to-image-surface occuring (especially since the modifications only change pixman, and thus should only show up when an image surface comes into play). Might this have something to do with the fact that mlib images are now being created and destroyed very frequently? So that's something you probably shouldn't ignore. Along those lines, remember what Carl said here: https://bugs.freedesktop.org/show_bug.cgi?id=11529#c15 > Can this now go upstream? FYI, Soeren is the pixman maintainer, so he makes the final call. Oh, and you have run the regression tests, right?
Created attachment 12820 [details] comparing two runs Thanks for the comments. Note that I am using Solaris x86, which has RENDER support. This bug is not related to the issues with running Solaris on Sparc, where Xrender is not available (due to using the Xsun Xserver on Sparc). So Xrender shouldn't be a factor with any differences discussed on this report. I think the differences in the slowdown's are red herrings, and perhaps due to something odd in how the Xserver behaves on Solaris. I ran two tests, both with mediaLib enabled and am attaching this. --- The attached comparison is checking cairo compiled without mediaLib in both cases. xlib-rgba fill_solid_rgb_source-256 19.94 0.50% -> 19.96 64.96%: 7.68x speedup xlib-rgba paint-with-alpha_solid_rgb_source-256 42.20 0.36% -> 42.24 64.99%: 7.66x speedup xlib-rgb paint_similar_rgba_source-256 0.28 84.96% -> 1.50 1.12%: 6.98x slowdown xlib-rgba fill_solid_rgb_source-128 5.86 60.16% -> 5.95 1.64%: 5.92x slowdown --- The last report I shared with you shows the following: xlib-rgba paint-with-alpha_solid_rgb_source-512 167.66 0.22% -> 167.70 65.14%: 7.66x speedup xlib-rgba paint_similar_rgb_source-256 1.55 3.52% -> 0.37 79.66%: 5.95x speedup xlib-rgba paint_similar_rgb_source-512 0.80 84.14% -> 4.93 1.46%: 6.27x slowdown xlib-rgba stroke_solid_rgb_source-128 9.39 52.67% -> 9.47 1.88%: 4.02x slowdown --- I also ran a 3rd comparison and got these results: xlib-rgba paint-with-alpha_solid_rgb_source-256 42.20 0.36% -> 42.18 64.69%: 7.60x speedup xlib-rgba paint_similar_rgb_source-256 1.50 2.79% -> 0.30 80.80%: 6.07x speedup xlib-rgb paint_similar_rgba_source-256 0.28 84.96% -> 1.49 1.68%: 6.93x slowdown xlib-rgba fill_solid_rgb_source-128 5.86 60.16% -> 5.78 1.62%: 5.79x slowdown --- Note the following - In all 3 cases there are two functions that are faster and two functions that are slower by roughly the same amount. - Note that if the test is "xlib-rgb" then the functions are "*rgba*" while if the test is "xlib-rgba", then the functions are "*rgb*" (not "*rgba*") - Note that while the functions in the 3 examples tend to be the same subset of functions, they aren't the same for each test. --- So, I think the changes in these xlib-rgb/xlib-rgba functions are note related to the patch. My guess is there is some Xserver behavior that perhaps causes this? Like perhaps the Xserver skips frames in some situations when you test it hard, causing some tests to vary like this? Also the fact that the functions that are patched to use MEDIALIB are not even used in the xlib-rgb/xlib-rgba case. So, I think we can ignore the xlib-rgb/xlib-rgba functions as being an artifact of something not related to this patch. The reports do show, however, that the functions that do use MEDIALIB show consistant speed-up, some of them fairly significant. So I think it is okay to accept this patch.
(In reply to comment #19) > I think the differences in the slowdown's are red herrings, and perhaps due to > something odd in how the Xserver behaves on Solaris. That would be my guess, too. You could try running your non-medialib version multiple times to see if you get the strange behavior, in which case you'd know more certainly that your patch doesn't introduce this behavior. > Also the fact that the functions that are patched to use MEDIALIB are > not even used in the xlib-rgb/xlib-rgba case. Probably not, unless you hit an image fallback, which AFAIK only happens with the xlib surface when RENDER isn't available. But you've already pointed out that it is in this case. > The reports do show, however, that the functions that do use MEDIALIB > show consistant speed-up, some of them fairly significant. So I think > it is okay to accept this patch. Sure, although it is a little strange that no one has mentioned having run the regression tests with this patch. You have, haven't you?
Created attachment 12840 [details] [review] updated patch Sorry, I didn't realize that there are regression tests. I went ahead and ran them. I found problems with 2 of the mediaLib tunings. This updated patch removes those tunings for now, leaving two others. I notice that the regression tests generate the same output running without the patch and with this patch. Though some tests do fail in both cases. I'll attach the output of the regression tests to this bug report. Perhaps you can let me know if this may indicate Solaris specific bug(s) we should look into? I still notice significant performance improvement even though this updated patch has two tunings removed. I'll attach the performance improvement report next.
Created attachment 12841 [details] output of running "make check" Here is the output from "make check" without mediaLib. The output when the patch is applied is the same. I do notice that the ft-text-vertical-layout-type1 and ft-text-vertical-layout-type-3 tests always fail. Is this expected, or do you have any ideas why these tests might be not working? I notice that to get "make check" in the src directory to pass, I need to change check-def.sh and check-plt.sh to have "#!/bin/bash" at the top rather than having "#!/bin/sh". This is because on Linux /bin/sh is a symlink to /bin/bash. Can this change be made to these two files to make them more portable. Otherwise I get these errors: ./check-def.sh: !: not found make[2]: Entering directory `/export/home/packages/BUILD/SUNWgnome-base-libs-2.21.2/i386/cairo-1.4.12/src' make[2]: `cairo.def' is up to date. make[2]: Leaving directory `/export/home/packages/BUILD/SUNWgnome-base-libs-2.21.2/i386/cairo-1.4.12/src' ./check-def.sh: bad substitution FAIL: check-def.sh ./check-plt.sh: !: not found I also notice that perf/cairo-perf-diff has the same issue and needs to have "#!/bin/sh" changed to "#!/bin/bash" to work on Solrais.
Created attachment 12842 [details] performance report for updated patch Here is the performance report for the updated patch. You'll notice that many functions report (about 29) report over 6x speedup and 12 more report over 4x speedup. The slowdown information also looks better than in the last report.
Created attachment 12851 [details] report for just function #1 (fbStore_a8) Here is the performance report for just tuning the fbStore_a8 function. Ignore the speedup/slowdown for xlib-rgb* functions. It doesn't look like this really makes much of a difference. Perhaps I should resubmit the patch without this? The next report shows that pretty much all of the performance improvement comes from the other change to fbCombineOverU
Created attachment 12852 [details] report for just function #2 (fbCombineOverU) Here is the performance report for fbCombineOverU. Note that this is where the significant changes are.
Okay, thanks for being patient with me, and helping me get going with the regression tests and verifying the performance benfits with cairo-perf. I think based on these tests, I recommend that we add this patch upstream, but only tune the function fbCombineOverU. This adds no regressions according to the regression tests and improves performance significantly according to cairo-perf. Does this make sense? I will touch base with the mediaLib team. The problems I am seeing with the two functions that break the regression tests (fbCombineMaskU and fbCombineInU) may be due to bugs in the mediaLib code. If so, perhaps I can work with them to fix those bugs, and propose these changes again later. Or perhaps we can figure out a way to make the code conditionally use the mediaLib functions for the cases where it doesn't cause regressions. It seems that the mediaLib implementations only cause regressions in the argb case.
(In reply to comment #21) > Created an attachment (id=12840) [details] > updated patch Notice what I said 5 months ago (1st bullet): http://bugs.freedesktop.org/show_bug.cgi?id=10151#c13
With some help from James Cheng from the mediaLib team I found out why the two places where my mediaLib patch was breaking the regression tests (fbCombineMaskU and fbCombineInU). The problem is that both of these functions use the macro FbByteMul, which is defined as follows: /* x_c = (x_c * a) / 255 + y */ #define FbByteMulAdd(x, a, y) do { \ CARD32 t = ((x & 0xff00ff) * a) + 0x800080; \ t = (t + ((t >> 8) & 0xff00ff)) >> 8; \ t &= 0xff00ff; \ t += y & 0xff00ff; \ t |= 0x1000100 - ((t >> 8) & 0xff00ff); \ t &= 0xff00ff; \ \ x = (((x >> 8) & 0xff00ff) * a) + 0x800080; \ x = (x + ((x >> 8) & 0xff00ff)) >> 8; \ x &= 0xff00ff; \ x += (y >> 8) & 0xff00ff; \ x |= 0x1000100 - ((t >> 8) & 0xff00ff); \ x &= 0xff00ff; \ x <<= 8; \ x += t; \ } while (0) As far as I can see, it actually has +0.5 for rounding, i.e., d1 = (x * a)/255.0 + 0.5; Currently, the mediaLib function does this: d2 = (x * a)/256.0; This is the result max12 = 2, max13 = 0, max23 = 2 i.e., the maximum difference is 2 between d1 and d2. BTW, if d1 doesn't have +0.5, then max12 = max13 = 1. Looking into why the mediaLib integration for fbCombineOverU does work without causing any reguression tests, here is some information: As far as I can see, it is actually equivalent to the following: x_c = (x_c * a) / 255.0 + y + 0.5; if (x_c > 255) x_c = 255; In other words, fbCombineOverU does something like this (simplified for one channel): d1 = (d * (255 - a))/255.0 + s + 0.5; if (d1 > 0xff) d1 = 0xff; while the mediaLib function mlib_ImageBlend_OMSA_ONE_Inp does this: d2 = (d * (256 - a))/256.0 + s; /* mediaLib */ if (d2 > 0xff) d2 = 0xff; Here is the result of test2: max12 = 1, max13 = 0, max23 = 1 i.e., the maximum difference between d1 and d2 is 1, better than the other case. There is a performance benefit for doing /256 instead of /255, especially for VIS implementation. So is not doing +0.5. FbByteMul seems to be a relatively expensive macro for doing /255, although it might be faster than doing /255 directly on most platforms, which explains why a mediaLib function call could beat a macro on performance in some cases. As you know, mediaLib is mostly opt to performance when there is a trade-off between precision and performance. So it seems that the regression tests are failing due to reasonable off-by-one issues. Would it make sense to fix the regression tests to allow a bit more fuzz in these cases where these mediaLib tests cause failure? Or might it make sense for cairo to adopt the "/256" method since it should work faster in general, according to James? Any thoughs appreciated.
That last comment was crafted by me from an email thread discussion with a lot of editing. I asked James to review and he had some additional comments: Mostly correct, except that the last sentence might cause some misunderstanding. "Or might it make sense for cairo to adopt the "/256" method since it should work faster in general, according to James?" I didn't suggest that cairo should adopt the "/256" method. It is faster, probably obvious to many programmers, but it's kind of approximation. The author of the pixman appeared to have spent quite some effort to make it more accurate, i.e., using 255 as 1.0 of alpha. The MMX/SSE code in pixman uses similar algorithms so it can keep the same precision. We probably can change our SSE2 mediaLib code to do something like that to get better performance, but it might be too expensive for the VIS version to use similar algorithms. You don't have to modify your question there. But I guess the answer would be "No" or even "Absolutely NOT!"
Not sure what the current status is, but this belongs to pixman.
This bug is obsolete by now.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.