Summary: | Memory leak in drm_atomic.c eventually (few days) consuming all RAM (on at least one system configuration) | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Felix Monninger <felix.monninger> | ||||||
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Severity: | major | ||||||||
Priority: | medium | CC: | intel-gfx-bugs | ||||||
Version: | unspecified | ||||||||
Hardware: | x86-64 (AMD64) | ||||||||
OS: | Linux (All) | ||||||||
Whiteboard: | |||||||||
i915 platform: | i915 features: | ||||||||
Attachments: |
|
Description
Felix Monninger
2016-10-24 22:14:17 UTC
(In reply to Felix Monninger from comment #0) > Created attachment 127524 [details] [review] [review] > Patch to drm_atomic.c calling the required drm_property_unreference_blob's > > Problem: > I noticed that RAM irretrievably gets lost over the course of a few days. > "cat /proc/meminfo |grep SUnreclaim" frequently (after 2 or 3 days of > uptime) grew to high amounts (>2GB) of unreclaimable SLABs until > out-of-memory failure. > > System: > Linux 4.7.4+; 4.8.* (problem has been in Kernel since at least Feb 16) > Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09) > i915 > > Investigation: > 1. /proc/slabinfo showed that lot of 4K memory blocks "kmalloc-4096" have > been allocated (visible after modifying slab.c to include those of size>4096 > into the output). Example line from /proc/slabinfo: > kmalloc-4096 11574 11578 4432 7 8 : tunables 0 0 0 : > slabdata 1654 1654 0 > 2. /proc/slab_allocators showed (again after modifying slab.c): > kmalloc-4096: 1954 drm_property_create_blob.part.19+0x27/0xe0 [drm] > (the numbers are from different times, in any case growing unboundedly until > reboot) > 3. using ftrace revealed the following callstack being processed precisely > every .5 seconds: > drm_property_create_blob <- drm_atomic_helper_legacy_gamma_set > <-drm_mode_gamma_set_ioctl <- drm_ioctl > 4. Reference counts on the "blob" allocated in > drm_atomic_helper.c:drm_atomic_helper_legacy_gamma_set increased by two is > then passed to after the line "ret = drm_atomic_crtc_set_property(crtc, > crtc_state, config->gamma_lut_property, blob->base.id);". (This should only > be incremented by 1 as ownership of this blob is passed to the crtc_state > which then keeps the blob as an updated property value.) > 5. As the blob is passed by id ("..., blob->base.id)"), in > drm_atomic.c:drm_atomic_replace_property_blob_from_id the function > drm_property_lookup_blob(dev, blob_id); is called. Note that the function > manual says "If successful, this takes an additional reference to the blob > property. callers need to make sure to eventually unreference the returned > property again, using @drm_property_unreference_blob.", which is not being > done in this case. > 6. This leads to the old state->degamma_lut that is replaced by the updated > blob never being freed (even after its refcount being properly decremented > by 1 at the remaining places), since its refcount has been incremented once > too much. Thus every half second an 4K block is wasted. > > Fix: > We call drm_property_unreference_blob(new_blob) at the appropriate spots in > drm_atomic.c:drm_atomic_replace_property_blob_from_id . Please see the > attached patch. Looks like a solid fix. Can you redo the patch with your full name and signed-off-by line as per Documentation/SubmittingPatches, and either send directly to dri-devel@lists.freedesktop.org with git send-email or attach here and someone can forward it along. > > I wonder, which hardware is affected by this "legacy" code (i. e. > drm_atomic_helper_legacy_gamma_set)? Only older intel HD graphics (<4000?) > devices, or is it actually more widely used despite the legacy naming? It's "legacy" from the the point of view of our userspace API. Which actually means it's used everywhere pretty much since we have no real non-legacy userspace. And drm_atomic_replace_property_blob_from_id() would get used for the non-legacy path as well, so it wouldn't really change anything as far as this leak is concerned. Ah, I thought I sent this to dri-devel this morning: https://lists.freedesktop.org/archives/dri-devel/2016-October/122066.html If you want to send a "I made this!" in reply (with sob), quite happy to adjust authorship. (In reply to Chris Wilson from comment #2) > Ah, I thought I sent this to dri-devel this morning: > > https://lists.freedesktop.org/archives/dri-devel/2016-October/122066.html > > If you want to send a "I made this!" in reply (with sob), quite happy to > adjust authorship. I made this! :) Thank you both Ville and Chris for your advice and the refactor. I will add the modified patch including signed-off-by to this report here, as I'm having trouble replying to the dri-devel thread as I was not subscribed to it at the time of posting. Please look over it to see if I did it correctly according to Documentation/SubmittingPatches (I left your signed-off-by in the line below mine and added a [<name>: <foo>] comment informing about your refactoring of the patch in between (I hope that's ok/allowed; if not, feel free to change it. I see the kernel is quite strict on traceability of ownership)). Two minor comments on the modifications you did: - The NULL check "if (new_blob) drm_property_unreference_blob(new_blob);" is theoretically unneccessary as it is done inside "...unreference_blob()", however I would leave it in, too, for clarity and ease of understanding what is being done. There has been a commit removing these checks (which has been generated automatically by a script) though, see [1]. - Because this check is for (new_blob != NULL) and not for (blob_id != 0), the code now assumes that a new_blob that originally is equal to NULL (in the case blob_id == 0) is never set to something non-NULL inside drm_atomic_replace_property_blob() and thus afterwards erroneously unreferenced. Maybe we should change it to "if (blob_id != 0) drm_property_unreference_blob(new_blob);", then it behaves correctly in any case, even though we the control flow with twice the same if looks a little bit silly. (I am not familiar with the specifics of the Kernel style. If by convention only the first pointer argument to a function is ever reassigned, this is no problem.) Thank you again for your help, I had a lot of fun in investigating this! Felix [1] https://github.com/torvalds/linux/commit/5f911905054a64cf8c7871fddd33f4af74f07a17 Created attachment 127543 [details] [review] Release reference from blob lookup after replacing property (signed-off-by included) Yeah, whether or not an unref() function is NULL safe depends on whereabouts in the kernel you are. In kernel/ it is considered bad practice to err on the side of safety (as it encourages sloppy use, if the caller can't decide on whether NULL is an error or not, why should the core code be nice). On the other hand, in drivers/gpu/drm we tend to favour NULL-safe teardown functions as then we avoid proliferation of if (!NULL) tests. You win some, you lose some. Reposted to dri-devel@ giving you the kudos for finding and fixing the bug. Oh I see. Interesting to learn about the different approaches to these matters. As a final note on when this bug occurs and why it hasn't been noticed yet (this might be useful for people finding this until the time at which the patched code is distributed): Programs like "xflux" or "redshift" that change screen color temperature to a more reddish type depending on daytime will trigger a call of drm_atomic_helper_legacy_gamma_set(). During transition phase (from normal->red and vice versa) they make many (>100) calls, and then in regular intervals (xflux: every 0.5s; redshift: every 5s). Thus at each of these intervals memory is leaked, with redshift at 1/10 the rate of xflux. With these tools turned off there don't seem to be many changes of the gamma value. So this bug will probably not affect those who are not using these tools. Also, this has been introduced only recently, > kernel 4.5.1 (If everything is wrapped up, this issue can be closed, but I don't know the correct procedure.) commit cac5fcedaabdadf150c8a9be9fee76defc8ba444 Author: Felix Monninger <felix.monninger@gmail.com> Date: Tue Oct 25 22:28:08 2016 +0100 drm: Release reference from blob lookup after replacing property |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.