Bug 98420

Summary: Memory leak in drm_atomic.c eventually (few days) consuming all RAM (on at least one system configuration)
Product: DRI Reporter: Felix Monninger <felix.monninger>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: medium CC: intel-gfx-bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Patch to drm_atomic.c calling the required drm_property_unreference_blob's
none
Release reference from blob lookup after replacing property (signed-off-by included) none

Description Felix Monninger 2016-10-24 22:14:17 UTC
Created attachment 127524 [details] [review]
Patch to drm_atomic.c calling the required drm_property_unreference_blob's

Problem:
I noticed that RAM irretrievably gets lost over the course of a few days. "cat /proc/meminfo |grep SUnreclaim" frequently (after 2 or 3 days of uptime) grew to high amounts (>2GB) of unreclaimable SLABs until out-of-memory failure.

System:
Linux 4.7.4+; 4.8.* (problem has been in Kernel since at least Feb 16)
Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
i915

Investigation:
1. /proc/slabinfo showed that lot of 4K memory blocks "kmalloc-4096" have been allocated (visible after modifying slab.c to include those of size>4096 into the output). Example line from /proc/slabinfo:
kmalloc-4096       11574  11578   4432    7    8 : tunables    0    0    0 : slabdata   1654   1654      0
2. /proc/slab_allocators showed (again after modifying slab.c):
kmalloc-4096: 1954 drm_property_create_blob.part.19+0x27/0xe0 [drm]
(the numbers are from different times, in any case growing unboundedly until reboot)
3. using ftrace revealed the following callstack being processed precisely every .5 seconds:
drm_property_create_blob <- drm_atomic_helper_legacy_gamma_set <-drm_mode_gamma_set_ioctl <- drm_ioctl
4. Reference counts on the "blob" allocated in drm_atomic_helper.c:drm_atomic_helper_legacy_gamma_set increased by two is then passed to after the line "ret = drm_atomic_crtc_set_property(crtc, crtc_state, config->gamma_lut_property, blob->base.id);". (This should only be incremented by 1 as ownership of this blob is passed to the crtc_state which then keeps the blob as an updated property value.)
5. As the blob is passed by id ("..., blob->base.id)"), in drm_atomic.c:drm_atomic_replace_property_blob_from_id the function drm_property_lookup_blob(dev, blob_id); is called. Note that the function manual says "If successful, this takes an additional reference to the blob property. callers need to make sure to eventually unreference the returned property again, using @drm_property_unreference_blob.", which is not being done in this case.
6. This leads to the old state->degamma_lut that is replaced by the updated blob never being freed (even after its refcount being properly decremented by 1 at the remaining places), since its refcount has been incremented once too much. Thus every half second an 4K block is wasted.

Fix:
We call drm_property_unreference_blob(new_blob) at the appropriate spots in drm_atomic.c:drm_atomic_replace_property_blob_from_id . Please see the attached patch.

I wonder, which hardware is affected by this "legacy" code (i. e. drm_atomic_helper_legacy_gamma_set)? Only older intel HD graphics (<4000?) devices, or is it actually more widely used despite the legacy naming?
Comment 1 Ville Syrjala 2016-10-25 17:06:01 UTC
(In reply to Felix Monninger from comment #0)
> Created attachment 127524 [details] [review] [review]
> Patch to drm_atomic.c calling the required drm_property_unreference_blob's
> 
> Problem:
> I noticed that RAM irretrievably gets lost over the course of a few days.
> "cat /proc/meminfo |grep SUnreclaim" frequently (after 2 or 3 days of
> uptime) grew to high amounts (>2GB) of unreclaimable SLABs until
> out-of-memory failure.
> 
> System:
> Linux 4.7.4+; 4.8.* (problem has been in Kernel since at least Feb 16)
> Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
> i915
> 
> Investigation:
> 1. /proc/slabinfo showed that lot of 4K memory blocks "kmalloc-4096" have
> been allocated (visible after modifying slab.c to include those of size>4096
> into the output). Example line from /proc/slabinfo:
> kmalloc-4096       11574  11578   4432    7    8 : tunables    0    0    0 :
> slabdata   1654   1654      0
> 2. /proc/slab_allocators showed (again after modifying slab.c):
> kmalloc-4096: 1954 drm_property_create_blob.part.19+0x27/0xe0 [drm]
> (the numbers are from different times, in any case growing unboundedly until
> reboot)
> 3. using ftrace revealed the following callstack being processed precisely
> every .5 seconds:
> drm_property_create_blob <- drm_atomic_helper_legacy_gamma_set
> <-drm_mode_gamma_set_ioctl <- drm_ioctl
> 4. Reference counts on the "blob" allocated in
> drm_atomic_helper.c:drm_atomic_helper_legacy_gamma_set increased by two is
> then passed to after the line "ret = drm_atomic_crtc_set_property(crtc,
> crtc_state, config->gamma_lut_property, blob->base.id);". (This should only
> be incremented by 1 as ownership of this blob is passed to the crtc_state
> which then keeps the blob as an updated property value.)
> 5. As the blob is passed by id ("..., blob->base.id)"), in
> drm_atomic.c:drm_atomic_replace_property_blob_from_id the function
> drm_property_lookup_blob(dev, blob_id); is called. Note that the function
> manual says "If successful, this takes an additional reference to the blob
> property. callers need to make sure to eventually unreference the returned
> property again, using @drm_property_unreference_blob.", which is not being
> done in this case.
> 6. This leads to the old state->degamma_lut that is replaced by the updated
> blob never being freed (even after its refcount being properly decremented
> by 1 at the remaining places), since its refcount has been incremented once
> too much. Thus every half second an 4K block is wasted.
> 
> Fix:
> We call drm_property_unreference_blob(new_blob) at the appropriate spots in
> drm_atomic.c:drm_atomic_replace_property_blob_from_id . Please see the
> attached patch.

Looks like a solid fix. Can you redo the patch with your full name and signed-off-by line as per Documentation/SubmittingPatches, and either send directly to dri-devel@lists.freedesktop.org with git send-email or attach here and someone can forward it along.

> 
> I wonder, which hardware is affected by this "legacy" code (i. e.
> drm_atomic_helper_legacy_gamma_set)? Only older intel HD graphics (<4000?)
> devices, or is it actually more widely used despite the legacy naming?

It's "legacy" from the the point of view of our userspace API. Which actually means it's used everywhere pretty much since we have no real non-legacy userspace. And drm_atomic_replace_property_blob_from_id() would get used for the non-legacy path as well, so it wouldn't really change anything as far as this leak is concerned.
Comment 2 Chris Wilson 2016-10-25 19:47:58 UTC
Ah, I thought I sent this to dri-devel this morning:

https://lists.freedesktop.org/archives/dri-devel/2016-October/122066.html

If you want to send a "I made this!" in reply (with sob), quite happy to adjust authorship.
Comment 3 Felix Monninger 2016-10-25 21:13:21 UTC
(In reply to Chris Wilson from comment #2)
> Ah, I thought I sent this to dri-devel this morning:
> 
> https://lists.freedesktop.org/archives/dri-devel/2016-October/122066.html
> 
> If you want to send a "I made this!" in reply (with sob), quite happy to
> adjust authorship.

I made this! :)

Thank you both Ville and Chris for your advice and the refactor. I will add the modified patch including signed-off-by to this report here, as I'm having trouble replying to the dri-devel thread as I was not subscribed to it at the time of posting. Please look over it to see if I did it correctly according to Documentation/SubmittingPatches (I left your signed-off-by in the line below mine and added a [<name>: <foo>] comment informing about your refactoring of the patch in between (I hope that's ok/allowed; if not, feel free to change it. I see the kernel is quite strict on traceability of ownership)).

Two minor comments on the modifications you did:
- The NULL check "if (new_blob) drm_property_unreference_blob(new_blob);" is theoretically unneccessary as it is done inside "...unreference_blob()", however I would leave it in, too, for clarity and ease of understanding what is being done. There has been a commit removing these checks (which has been generated automatically by a script) though, see [1].
- Because this check is for (new_blob != NULL) and not for (blob_id != 0), the code now assumes that a new_blob that originally is equal to NULL (in the case blob_id == 0) is never set to something non-NULL inside drm_atomic_replace_property_blob() and thus afterwards erroneously unreferenced. Maybe we should change it to "if (blob_id != 0) drm_property_unreference_blob(new_blob);", then it behaves correctly in any case, even though we the control flow with twice the same if looks a little bit silly. (I am not familiar with the specifics of the Kernel style. If by convention only the first pointer argument to a function is ever reassigned, this is no problem.)

Thank you again for your help, I had a lot of fun in investigating this!
Felix


[1] https://github.com/torvalds/linux/commit/5f911905054a64cf8c7871fddd33f4af74f07a17
Comment 4 Felix Monninger 2016-10-25 21:15:02 UTC
Created attachment 127543 [details] [review]
Release reference from blob lookup after replacing property (signed-off-by included)
Comment 5 Chris Wilson 2016-10-25 21:32:39 UTC
Yeah, whether or not an unref() function is NULL safe depends on whereabouts in the kernel you are. In kernel/ it is considered bad practice to err on the side of safety (as it encourages sloppy use, if the caller can't decide on whether NULL is an error or not, why should the core code be nice). On the other hand, in drivers/gpu/drm we tend to favour NULL-safe teardown functions as then we avoid proliferation of if (!NULL) tests. You win some, you lose some.

Reposted to dri-devel@ giving you the kudos for finding and fixing the bug.
Comment 6 Felix Monninger 2016-10-25 23:12:07 UTC
Oh I see. Interesting to learn about the different approaches to these matters.

As a final note on when this bug occurs and why it hasn't been noticed yet (this might be useful for people finding this until the time at which the patched code is distributed):

Programs like "xflux" or "redshift" that change screen color temperature to a more reddish type depending on daytime will trigger a call of drm_atomic_helper_legacy_gamma_set(). During transition phase (from normal->red and vice versa) they make many (>100) calls, and then in regular intervals (xflux: every 0.5s; redshift: every 5s). Thus at each of these intervals memory is leaked, with redshift at 1/10 the rate of xflux.

With these tools turned off there don't seem to be many changes of the gamma value. So this bug will probably not affect those who are not using these tools. Also, this has been introduced only recently, > kernel 4.5.1

(If everything is wrapped up, this issue can be closed, but I don't know the correct procedure.)
Comment 7 Chris Wilson 2016-10-27 10:09:07 UTC
commit cac5fcedaabdadf150c8a9be9fee76defc8ba444
Author: Felix Monninger <felix.monninger@gmail.com>
Date:   Tue Oct 25 22:28:08 2016 +0100

    drm: Release reference from blob lookup after replacing property

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.