fb exports 330 symbols but has an effective public API of only 8. exporting just those 8 symbols reduces the object's size by about 20k (roughly 10%) on x86. calls to the other 322 symbols no longer need the full dynamic-linkage call sequence, so those function calls get significantly cheaper.

i measured about a 6% speedup in Render with render_bench using an early version of the attached patch. fb will show the most performance improvement from this sort of cleanup, since most other modules aren't CPU-intensive. the footprint reduction, however, should be similar across all modules: excluding GLcore (which has its own set of issues), this would drop runtime code footprint by about 160k on x86, assuming 10% is typical.

note that this is potentially an ABI-breaking change. when i have solid performance numbers i'll post them here.
Created attachment 2128 [details] [review]
visibility-for-fb-1.patch

gcc-only, dirty, contains bits of other changes, but should work. to make it take effect, tweak fb's {i,}makefile to include -fvisibility=hidden and compile with gcc 3.4 or later.
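for anyone unfamiliar with the mechanism, here's a minimal sketch of what the patch does (the symbol names below are hypothetical, not the actual fb functions). with -fvisibility=hidden on the command line, every symbol defaults to hidden and only those explicitly marked "default" end up in the dynamic symbol table, so intra-module calls skip the PLT/GOT indirection:

```c
/* Hypothetical illustration of per-symbol visibility with gcc 3.4+.
 * Build with: gcc -fvisibility=hidden -shared -fPIC ...
 * Hidden symbols stay out of the dynamic symbol table, so calls to
 * them within the module use a direct, shorter call sequence. */

/* module-internal: one of the ~322 symbols that need not be exported */
__attribute__((visibility("hidden")))
int fbInternalHelper(int x)
{
    return x * 2;
}

/* one of the ~8 symbols that form the real public API */
__attribute__((visibility("default")))
int fbPublicEntry(int x)
{
    /* direct call, no PLT indirection */
    return fbInternalHelper(x) + 1;
}
```

you can verify the effect with `objdump -T` on the resulting .so: only fbPublicEntry shows up in the dynamic symbol table.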
seongbae/alanc: does Sun Workshop/Forte have a similar flag to hide symbols?
I don't know if there's a compiler flag - we usually use linker mapfiles to control symbol visibility in our Solaris builds. Of course, for things only referenced from a single file, static works well on any compiler I know of, so those changes are no problem.
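For reference, a sketch of what such a linker mapfile looks like (the symbol name is hypothetical; pass the file to the link with -M mapfile). Everything not listed under "global" gets scoped local at link time:

```
# example Solaris linker mapfile: export only the public API,
# reduce everything else to local scope
{
    global:
        fbPublicEntry;
    local:
        *;
};
```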
# cc -flags | grep scope
-xldscope=<a>    Indicates the appropriate linker scoping within the source
                 program; <a>={global|symbolic|hidden}

-xldscope=hidden will do what -fvisibility=hidden does in gcc: it sets the default linker scoping for symbols. You can also set linker scoping for each symbol in the source by using the __global, __symbolic, or __hidden attribute on the symbol declaration in C/C++. Of course, as Alan pointed out, you can use the map file.
very cool, didn't know about -xldscope. it and the tags don't seem to be in the Forte 7 C developer's manual: http://docs.sun.com/source/816-2454/index.html

the only problem with using the map file is that it's applied after code generation has been done. the symbol won't show up in the dynamic symbol table, but the call sequence for it will still be the same as if it were default visibility, so you still pay the PLT/GOT indirection overhead. better than nothing, though.

i'd probably pursue a hybrid approach of adding both static and EXPORTED tags, with EXPORTED #defined to nothing for compilers with no visibility control and to __global or equivalent for good compilers. this provides an easy transition path: once we detect a good compiler, we can just add the option to CFLAGS and win. this is the approach i used in Mesa and it works well.
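the hybrid approach could look something like this (EXPORTED is the macro name from the comment above; the compiler-detection conditions and the function are my assumptions, sketched for illustration):

```c
/* Sketch of the hybrid EXPORTED-tag approach.  Compilers without
 * visibility control get an empty definition, so the code builds
 * everywhere; for capable compilers, also add -fvisibility=hidden
 * (gcc) or -xldscope=hidden (Sun Studio) to CFLAGS. */
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 304
# define EXPORTED __attribute__((visibility("default")))
#elif defined(__SUNPRO_C) && __SUNPRO_C >= 0x550  /* Studio 8 or later */
# define EXPORTED __global
#else
# define EXPORTED /* no visibility control available */
#endif

/* a hypothetical public API entry point */
EXPORTED int fbPublicEntry(int x)
{
    return x + 1;
}
```

anything not tagged EXPORTED (or marked static) then drops out of the dynamic symbol table when the hidden-by-default flag is in effect.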
-xldscope was first introduced in the Studio 8 compiler. I don't think using mapfiles incurs PLT overhead (at least on SPARC); the only disadvantages of mapfiles are a bit of link-time overhead and missed compiler optimizations such as inlining (or any optimization that requires knowing the exact caller/callee relationship), but no more than that.
(In reply to comment #6)
> -xldscope was first introduced in Studio 8 compiler.

that would explain it.

> I don't think using mapfiles would incur PLT overhead (at least on SPARC) -
> the only disadvantage of using mapfiles is a bit of link time overhead,
> and missing compiler optimizations such as inlining (or any optimization that
> requires the exact caller/callee relationship) but no more than that.

my worldview is heavily x86 biased (though i'm trying to change that), so you're probably right. thanks for the hints. when i do this for real i'll be sure to add the bits for the Sun compiler.
quick performance numbers using render_bench and Xvfb, with imlib2 numbers culled (times in seconds, lower is better):

                                                             before      after
  *** ROUND 1 ***
  Test Xrender doing non-scaled Over blends                   8.254      7.622
  Test Xrender (offscreen) doing non-scaled Over blends       3.650      3.493
  *** ROUND 2 ***
  Test Xrender doing 1/2 scaled Over blends                   7.601      7.520
  Test Xrender (offscreen) doing 1/2 scaled Over blends       7.626      7.291
  *** ROUND 3 ***
  Test Xrender doing 2* smooth scaled Over blends           170.836    168.337
  Test Xrender (offscreen) doing 2* smooth scaled Over blends 171.429   169.048

the Render software path is extremely function-call intensive in the unscaled case, so the Round 1 numbers are what's interesting here.
nearly 8% faster, not bad. the speedup won't be as big on hardware-backed servers, since framebuffer reads are slow there and Render does a lot of them.
this bug is too confused to be "fixed".