Summary: | shared-mime-info update time regression, like for eg. 1.1 (less than 1 sec) vs. 1.2 (30 secs) | ||
---|---|---|---|
Product: | shared-mime-info | Reporter: | Samuli Suominen <ssuominen> |
Component: | general | Assignee: | Shared Mime Info group <shared_mime_info> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | andyrtr, deno, freedesktop, karol.blazewicz, l.jirkovsky, rdieter, tetromino, walters |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
strace from `update-mime-database /usr/share/mime` with 1.2
Add an environment variable to disable fdatasync() minimal patch to support PKGSYSTEM_ENABLE_FSYNC per comment #19 minimal patch to support PKGSYSTEM_ENABLE_FSYNC per comment #19 PKGSYSTEM_ENABLE_FSYNC env var opt-in to fdatasync PKGSYSTEM_ENABLE_FSYNC env var opt-in to fdatasync 0001-Skip-mime-database-update-if-packages-are-older-than.patch |
Description
Samuli Suominen
2013-10-11 07:39:25 UTC
Created attachment 87427 [details]
strace from `update-mime-database /usr/share/mime` with 1.2
GLib used is 2.36.4, and just for the heck if it I tried the g_type_init() patch from git, which made no difference to the performance. Reverting this patch makes it work like 1.1: http://cgit.freedesktop.org/xdg/shared-mime-info/patch/?id=bc7658182f1922d49f33acf614f408a9d3f1f9f2 running `export ac_cv_func_fdatasync=no` before ./configure "workarounds" the bug too for easier testing The correct way to fix this is for your system not to run update-mime-database for each package, put for the whole set instead ("triggers"). (In reply to comment #0) > As reported by Nikos Chantziaras at downstream bug report > https://bugs.gentoo.org/show_bug.cgi?id=487504 I've noticed myself also that > updating the mime database takes about 30x longer with 1.2 than with 1.1 and > since the command is called at the postinstallation phase of multiple > packages, it takes considerable amount of time now when it was almost > innoticable before Well, yes. fdatasync() can take quite a while if the system is heavily loaded. But at least for the icon cache version of this patch: https://git.gnome.org/browse/gtk+/commit/?id=6c6b49392629a8ee2facafb66c8867a49a3e9036 I'd seen a number of real-world issues with corrupted icon caches historically, and haven't seen any more recently than that. Enough people lose power or have their kernel crash that it makes sense to default to safety. Does Gentoo's portage fdatasync() actual package content to disk? If it doesn't, then there's no point in having gtk+ or shared-mime-info's triggers do the same, and I'd certainly be fine with a patch to add a configure option to disable the fsync for you guys. I would also tell anyone I meet they are crazy to trust their data to Gentoo though =) (In reply to comment #5) > The correct way to fix this is for your system not to run > update-mime-database for each package, put for the whole set instead > ("triggers"). Such "whole set" triggers perhaps make sense for binary-package distros. However, when a source-based distro user is updating 100 packages, the last of which might be a beast to compile (think of webkit-gtk or firefox or chromium or libreoffice), we want to update-mime-database as soon as possible (immediately after each package gets installed), and not force the user to wait for the chromium build to finish before he can run any of the other applications which were already updated. (In reply to comment #6) > Does Gentoo's portage fdatasync() actual package content to disk? Yes by default, but there is a documented option to allow users with slow machines to disable sync. (In reply to comment #8) > (In reply to comment #6) > > Does Gentoo's portage fdatasync() actual package content to disk? > > Yes by default, but there is a documented option to allow users with slow > machines to disable sync. Ok; and Samuli, do you use this option? Basically I think it makes sense for portage to chain its fsync() setting to triggers/postinsts. An environment variable perhaps? PKGSYSTEM_DISABLE_FSYNC? There *are* valid use cases for disabling fdatasync, like the case where you're constructing a new chroot/container. There it can be a lot faster to just call e.g. sync() at the end and rename the entire directory into place, rather than doing individual fdatasync() on files. Any patch for this should also change gtk-update-icon-cache too. BTW, while I wrote this patch, I actually no longer need it for gnome-continuous, because we now run "triggers" on the build server side. But it does make sense for dpkg/rpm type systems that are operating on live roots. Data loss issues exist everywhere in the system. It's weird to see individual tools try to improve safety though. I can only imagine the chaos and pain that would be caused if every single program on my system would do a sync after every output operation, be it my compression utility, word processor or system logger. It doesn't seem like the correct solution to me. I would imagine that it would be better to recreate the database *if* an error occurs rather than trying to prevent errors at the cost of extreme sync delays. Especially since the MIME database doesn't actually contain critical information or original information that can't be simply recreated later on. I can understand a financial transaction or critical system log database doing this, but not a MIME db. (In reply to comment #5) > The correct way to fix this is for your system not to run > update-mime-database for each package, put for the whole set instead > ("triggers"). We do this on Alpine Linux and the speed regression is still valid. Any package that pulls the trigger will take 30 sec to install rather than ~2-3 even on an i7. I would love to see this regression fixed. Would it be an option to add a --sync option to update-mime-database so an per-file fdatasync can be optional? Also, as I understand the commit message, OStree expects that the data generated by update-mime-database is on durable storage, would it be an alternative to run sync(2) before exit? Or even better, could OStree run sync(2) after executing its triggers? (In reply to comment #12) > Would it be an option to add a --sync option to update-mime-database so an > per-file fdatasync can be optional? See comment 9. Created attachment 90371 [details] [review] Add an environment variable to disable fdatasync() # time ./update-mime-database /usr/share/mime Unknown media type in type 'all/all' Unknown media type in type 'all/allfiles' Unknown media type in type 'uri/mms' Unknown media type in type 'uri/mmst' Unknown media type in type 'uri/mmsu' Unknown media type in type 'uri/pnm' Unknown media type in type 'uri/rtspt' Unknown media type in type 'uri/rtspu' real 0m54.347s user 0m0.411s sys 0m0.287s # time PKGSYSTEM_DISABLE_FSYNC=1 ./update-mime-database /usr/share/mime Unknown media type in type 'all/all' Unknown media type in type 'all/allfiles' Unknown media type in type 'uri/mms' Unknown media type in type 'uri/mmst' Unknown media type in type 'uri/mmsu' Unknown media type in type 'uri/pnm' Unknown media type in type 'uri/rtspt' Unknown media type in type 'uri/rtspu' real 0m0.362s user 0m0.294s sys 0m0.068s Comment on attachment 90371 [details] [review] Add an environment variable to disable fdatasync() Review of attachment 90371 [details] [review]: ----------------------------------------------------------------- Looks fine to me. I am in the camp who thinks that tools like update-mime-database should not call 'sync'. There are other *nix mechanisms that provide data storage reliability e.g. filesystem journaling. Forcing everyone to use 'sync' causes severe performance degradation. As of patch above - personally I do not like idea of using envars. This functionality should be configurable at compile time instead. And ideally it should be turned off by default - people/distros who want 'sync' will enable it at the compile time. Comment on attachment 90371 [details] [review] Add an environment variable to disable fdatasync() Review of attachment 90371 [details] [review]: ----------------------------------------------------------------- Could this 'fsync' functionality be configured at the compile time instead? packagers of shared-mime-info 1.2 should use the workaround from Comment #4 for now until this is resolved, it's easy, just put it in your .ebuild, PKGBUILD, .spec, or whatever your distribution is using, before the ./configure step export ac_cv_func_fdatasync=no ./configure make i hope the patch gets applied soon for the environment variable. the build system kludge is already there, as how ac_cv_ is supposed to work. It could be argued that fsync should be opt in, given that traditional package systems (at least rpm) don't actually fsync their content either. (It does fsync the rpmdb). So we could add a PKGSYSTEM_ENABLE_FSYNC environment variable or something. Any thoughts on that? Created attachment 99516 [details] [review] minimal patch to support PKGSYSTEM_ENABLE_FSYNC per comment #19 Created attachment 99523 [details] [review] minimal patch to support PKGSYSTEM_ENABLE_FSYNC per comment #19 rebased for shared-mime-info-1.3 Comment on attachment 99523 [details] [review] minimal patch to support PKGSYSTEM_ENABLE_FSYNC per comment #19 Review of attachment 99523 [details] [review]: ----------------------------------------------------------------- I'm ok with this patch semantically. Bastien, what do you think? (Rex, while I understand the rationale for the current format, you might want to submit a version which does reindent and is based on "git format-patch" with a commit message) Created attachment 99538 [details] [review] PKGSYSTEM_ENABLE_FSYNC env var opt-in to fdatasync Recent fdatasync() support in update-mime-database results in large slowdown (x40 or more slower in some cases), and it's unclear the safety benefits outweigh this cost. In the meantime, make this feature opt-in via environment variable PKGSYSTEM_ENABLE_FSYNC Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70366 Comment on attachment 99538 [details] [review] PKGSYSTEM_ENABLE_FSYNC env var opt-in to fdatasync Review of attachment 99538 [details] [review]: ----------------------------------------------------------------- I would also have expected the patch to be reversed, with the package manager opting out of us running fdatasync instead. ::: update-mime-database.c @@ +951,4 @@ > new_name = g_strndup(pathname, len - 4); > > #ifdef HAVE_FDATASYNC > + if (g_getenv ("PKGSYSTEM_ENABLE_FSYNC")) { This needs to check for the value of the variable at least somewhat. Expectations are that PKGSYSTEM_ENABLE_FSYNC=0 would disable it. (In reply to comment #24) > I would also have expected the patch to be reversed, with the package > manager opting out of us running fdatasync instead. that doesn't sound sane, that would force each package manager to implement shared-mime-info specific skip for fdatasync should definately be opt-in (In reply to comment #25) > (In reply to comment #24) > > I would also have expected the patch to be reversed, with the package > > manager opting out of us running fdatasync instead. > > that doesn't sound sane, that would force each package manager to implement > shared-mime-info specific skip for fdatasync > > should definately be opt-in Yes, so that update-mime-database is correct, rather than fast by default. That seems to be the right way to do it. I'm of a strong mind that the cost-benefit of fsync isn't worth it by default currently, so making this opt-in is really the way to go: * the slowdown is *significant*, x40 (or more) times slower * it's easy to recover (e.g. run update-mime-info again). Contrast to the case of rpm's use of fsync for it's database (if that gets corrupted, you're largely SOL). * (afaik), there's no mechanism to check if generated data is current/valid. For example, gtk-update-icon-cache and icon-spec requires top-devel icon dir timestamp to be updated when content changes, so its easy to check if cache is current. Having something similar here would help mitigate matters (both for validation and optimization). (new proposed patch on the way soon...) Created attachment 99631 [details] [review] PKGSYSTEM_ENABLE_FSYNC env var opt-in to fdatasync In response to feedback, check that PKGSYSTEM_ENABLE_FSYNC env var is set and is not equal to 0 monthly ping, still awaiting feedback. commit 1f7683bfcbecbeffa802a1c361e1842db2fff4f8 Author: Bastien Nocera <hadess@hadess.net> Date: Wed Jun 25 17:25:50 2014 +0200 Disable fdatasync() usage if PKGSYSTEM_ENABLE_FSYNC is set If the PKGSYSTEM_ENABLE_FSYNC envvar is set to a non-zero value, the fdatasync() call will be skipped, at the expense of data integrity. https://bugs.freedesktop.org/show_bug.cgi?id=70366 commit fd48920cf82402a95f658cab93db0cf3786c4d6e Author: Bastien Nocera <hadess@hadess.net> Date: Wed Jun 25 17:23:50 2014 +0200 Split out fdatasync() usage What you committed works differently than my proposed patches. fsync is still used by default, it is only disabled with PKGSYSTEM_ENABLE_FSYNC= or PKGSYSTEM_ENABLE_FSYNC=0 (see testing results at the end) was that intentional? ------------------------ (testing with shared-mime-info-1.3 + those 2 last commits) # time update-mime-database /usr/share/mime real 0m45.768s user 0m0.358s sys 0m0.230s # time PKGSYSTEM_ENABLE_FSYNC=0 update-mime-database /usr/share/mime real 0m0.354s user 0m0.260s sys 0m0.093s # time PKGSYSTEM_ENABLE_FSYNC= update-mime-database /usr/share/mim real 0m0.327s user 0m0.256s sys 0m0.069s Fwiw, to flip it the other way, index 894ac97..8626702 100644 --- a/update-mime-database.c +++ b/update-mime-database.c @@ -943,7 +943,7 @@ sync_enabled(void) env = g_getenv("PKGSYSTEM_ENABLE_FSYNC"); if (!env) - return TRUE; + return FALSE; return atoi(env); } (In reply to comment #31) > What you committed works differently than my proposed patches. fsync is > still used by default, Yes, see comment 26. > it is only disabled with PKGSYSTEM_ENABLE_FSYNC= or > PKGSYSTEM_ENABLE_FSYNC=0 (see testing results at the end) > > > was that intentional? Yep. package manager bug was just fixed in a wrong place, with the option left enabled by default. that's the best description to the situation. Per comment #5, if there were some way to determine if the update-mime-database output were current, then it would know to run only once per transaction, which would mitigate matters a bit. For example, gtk-update-icon-cache compares the timestamp of its output against the timestamp of the toplevel icon dir, and only runs if toplevel timestamp is newer. Could update-mime-database do something similar? Somehow store the timestamp of when last run, and compare against that of MIME_DIR/packages, and only re-generate data if MIME_DIR_packages is newer. (In reply to comment #35) > Per comment #5, if there were some way to determine if the > update-mime-database output were current, then it would know to run only > once per transaction, which would mitigate matters a bit. > > For example, gtk-update-icon-cache compares the timestamp of its output > against the timestamp of the toplevel icon dir, and only runs if toplevel > timestamp is newer. You can't check the toplevel directory's timestamp, changing a file inside the directory (eg. running touch on it, or editing it) doesn't change the mtime of the directory containing it. Even less one level down. In any case, I implemented that, except that it would make absolutely no difference to us. We (should) only run it in packages that do add new packages XML files. You can verify this with: for i in `rpm -qa` ; do rpm -q --scripts $i | grep -q update-mime-database && echo $i >> foo ; done $ for i in `cat foo` ; do rpm -ql $i | grep -q /usr/share/mime/packages || echo $i is broken ; done Looks like those packages might be broken though: control-center gnome-boxes gnome-packagekit gnome-color-manager seahorse They should only run update-mime-database in postuninstall, not postinstall, as they don't have any XML packages installed. > Could update-mime-database do something similar? Somehow store the > timestamp of when last run, and compare against that of MIME_DIR/packages, > and only re-generate data if MIME_DIR_packages is newer. I'll attach a patch I made to do that, but it won't help your use case, which is "we're running update-mime-database too often" or "fdatasync() syncs everything instead of just the file we told it to", both being outside shared-mime-info's remit. Created attachment 101875 [details] [review] 0001-Skip-mime-database-update-if-packages-are-older-than.patch That's a start, but still insufficient, true. rpm payloads of old mime xml files (older than the version file) would fail to trigger an update. I suggested checking the dir timestamp, because the icon-spec explicitly mentions updating it's timestamp on installs, so can reliably use it as a way to check for currentness of cache. We would likely need something like that here too. (In reply to comment #38) > That's a start, but still insufficient, true. > > rpm payloads of old mime xml files (older than the version file) would fail > to trigger an update. > > I suggested checking the dir timestamp, because the icon-spec explicitly > mentions updating it's timestamp on installs, so can reliably use it as a > way to check for currentness of cache. We would likely need something like > that here too. Even if we figured out how to detect whether to update or not, there would be 6 packages on my system running update-mime-database after install. Unless we go to 1, there's not much more shared-mime-info can do. Yes, once this is implemented here, then, and only then, can packaging guidelines be updated to optimize things to run only once per transaction. Sorry if that wasn't clear. (In reply to comment #40) > Yes, once this is implemented here, then, and only then, can packaging > guidelines be updated to optimize things to run only once per transaction. > Sorry if that wasn't clear. Huh, no. The only difference it would make is whether you run it once or not at all (at the end of the transaction). Not whether it runs for each package's post scripts. Maybe we're not talking about the same thing here. What I am suggesting is a way to make it so that shared-mime-info scriptlet only (effectively) runs once per rpm transaction (not once per package). This is what is currently implemented in fedora's packaging guidelines for gtk-update-icon-cache, see https://fedoraproject.org/wiki/Packaging:ScriptletSnippets?rd=Packaging/ScriptletSnippets#Icon_Cache I'd like to see a similar optimization for shared-mime-info if possible. (In reply to comment #42) > Maybe we're not talking about the same thing here. > > What I am suggesting is a way to make it so that shared-mime-info scriptlet > only (effectively) runs once per rpm transaction (not once per package). > This is what is currently implemented in fedora's packaging guidelines for > gtk-update-icon-cache, see > https://fedoraproject.org/wiki/Packaging:ScriptletSnippets?rd=Packaging/ > ScriptletSnippets#Icon_Cache > > I'd like to see a similar optimization for shared-mime-info if possible. Then you can use something like my patch and touch /usr/share/mime/version in your package when you install a file. Or make rpm do this at the end of the whole transaction using triggers instead of carrying on those scriptlets hacks. I'm fine with finishing the patch if somebody updates all the packages and modifies the guidelines. Great! I'll take care of the updating guidelines and packages part. I don't think triggers are a viable option yet. I don't recall the details, but do remember Panu shot that down last time we (the fedora packaging committee) asked him about it Wait, I'm not sure that will work. I thought your patch checks to see if mime .xml files are newer than version, then updates cache (and version). How does updating timestamp of version help there? or my scenario mentioned in comment #38 ? (In reply to comment #44) > Great! I'll take care of the updating guidelines and packages part. Note that there will be a command-line option to use in your snippets, as we can't break existing users of update-mime-database (which expect update-mime-database to update the database, even if the mtime of the version file isn't newer than files in packages/). I'd be interested to know how you will handle the option being missing in update-mime-database during upgrades... (In reply to comment #45) > Wait, I'm not sure that will work. > > I thought your patch checks to see if mime .xml files are newer than > version, then updates cache (and version). How does updating timestamp of > version help there? or my scenario mentioned in comment #38 ? Sorry, you'd touch /usr/share/mime/packages/ (In reply to comment #47) > (In reply to comment #45) > > Wait, I'm not sure that will work. > > > > I thought your patch checks to see if mime .xml files are newer than > > version, then updates cache (and version). How does updating timestamp of > > version help there? or my scenario mentioned in comment #38 ? > > Sorry, you'd touch /usr/share/mime/packages/ Or the file that you just installed in it. ah! As for compatibility, the guidelines changes would likely only affect f21 and newer (where the new shared-mime-info that supports it lives). For added safety, we could make older fedora versions accept (but ignore) the new option too commit 29a04be6c9cbaf0865c8b57428b7b7c37fbda4c3 Author: Bastien Nocera <hadess@hadess.net> Date: Fri Jun 27 18:25:57 2014 +0200 Add "-n" option to update-mime-database When "-n" is passed, the cache will only be updated if $MIME_DIR/packages or one of the files in that directory is newer than $MIME_DIR/version. This is useful for package pre- and post-installation scripts. commit 4b3f9f774da8859d4f1f7e991b12832d6c09b63e Author: Bastien Nocera <hadess@hadess.net> Date: Fri Jun 27 16:57:08 2014 +0200 Skip mime database update if packages are older than cache Check for the mtime of the version file, the last one to be created when updating the database to see against the mtime of the newest packages file. If one of the files inside the packages directory is newer than the version file, really update the database. (In reply to comment #49) > ah! As for compatibility, the guidelines changes would likely only affect > f21 and newer (where the new shared-mime-info that supports it lives). > > For added safety, we could make older fedora versions accept (but ignore) > the new option too Feel free to patch that downstream. thanks guys, I really appericiate it |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.