Bug 111527

Summary: obs-studio + latest mesa on amdgpu/vega64 leaks kernel memory rapidly
Product: Mesa
Reporter: John Schoenick <john>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact: Default DRI bug account <dri-devel>
Severity: not set
Priority: not set
CC: alexander, tele42k3
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
Bug Depends on:    
Bug Blocks: 111444    

Description John Schoenick 2019-08-31 20:26:10 UTC
As of at least mesa 19.3/bfac462d929 on a Vega 64:

Running obs-studio, even without starting a broadcast, triggers a seemingly exponential memory leak.  It will be fine for a few minutes, until it rapidly begins consuming what appears to be kernel memory (nothing is attributed to the app, but total usage skyrockets).  With 32 GB of RAM I exhaust system memory after about three minutes, but the OOM killer doesn't know what to take down because OBS itself remains low in the list.  This can then murder the whole system.

However, killing OBS causes most of the memory to be freed.  I say most because after reproducing on a fresh boot, there were apparently a few gigabytes of unaccounted for memory that never returned.  Subsequent repros of the bug on that same boot returned to the same baseline, however.  Some caching mechanism gone wrong?

I've noticed this going back at least a few weeks, but haven't done a proper bisect.  It should be very easy to reproduce, and happens on both Vega 64 systems I have available.

Steps to reproduce (not all may be necessary, but I confirmed this sequence triggers it from a fresh state):
- Launch obs-studio
- Enable Studio Mode by clicking the button on the right
- Add two sources: "desktop capture" (select any monitor) and a single "Image" source (any image)
- Press Fade/Cut up top to make that state live.  No need to actually start recording/broadcasting.
- Wait a few minutes or until your system hangs.  Memory usage will appear stable for at least a full minute before taking off unprompted.  It will not be attributed to the app, however, being apparently kernel memory.
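
One rough way to check that the growth is not attributed to the OBS process is to compare system-wide usage from /proc/meminfo with the process's resident set size. The sketch below is illustrative only and was not part of the original report: the function names are hypothetical, and "MemTotal minus MemAvailable minus VmRSS" is an assumed heuristic, not an exact accounting of kernel memory.

```python
import re


def parse_kb_fields(text):
    """Parse 'Name:   1234 kB' style lines (as found in /proc/meminfo and
    /proc/<pid>/status) into a dict mapping field name to integer value."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\w[\w()]*):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def used_kb(meminfo):
    """Memory the kernel does not consider available, in kB."""
    return meminfo["MemTotal"] - meminfo["MemAvailable"]


def unattributed_kb(meminfo, process_rss_kb):
    """Assumed heuristic: system-wide usage not explained by one process's RSS.
    Other processes also contribute, so watch the trend, not the absolute value."""
    return used_kb(meminfo) - process_rss_kb


def sample(pid):
    """Take one sample on a Linux system for the given PID (hypothetical helper)."""
    with open("/proc/meminfo") as f:
        meminfo = parse_kb_fields(f.read())
    with open(f"/proc/{pid}/status") as f:
        rss_kb = parse_kb_fields(f.read())["VmRSS"]
    return used_kb(meminfo), rss_kb, unattributed_kb(meminfo, rss_kb)
```

Calling sample() once per second with the OBS PID and printing the three values should show the first and third numbers climbing while the second stays roughly flat, if the behavior matches the description above.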

Reproduces with 19.3 - bfac462d929
Does not reproduce with 19.1.4

Kernel versions 5.2.8 and 5.2.11 show the same behavior.
Comment 1 Pierre-Eric Pelloux-Prayer 2019-09-02 08:15:11 UTC
> Reproduces with 19.3 - bfac462d929
> Does not reproduce with 19.1.4
> 

Could you bisect to find when the issue was introduced?
Comment 2 tele42k3 2019-09-07 13:20:20 UTC
Thanks for the clear steps to reproduce this issue. I managed to reproduce this on my RX 480 and it bisected to:

commit 11a3679e3aba3524cf987f1f808d92c25f16e080
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Fri Jun 28 18:35:56 2019 +0200

    winsys/amdgpu: Make KMS handles valid for original DRM file descriptor
    
    Getting a DMA-buf fd and converting that to a handle using our duplicate
    of that file descriptor (getting at which requires passing a
    radeon_winsys pointer to the buffer_get_handle hook) makes sure of this,
    since duplicated file descriptors reference the same file description
    and therefore the same GEM handle namespace.
    
    This is necessary because libdrm_amdgpu may use a different DRM file
    descriptor with a separate handle namespace internally, e.g. because it
    always reuses any existing amdgpu_device_handle for the same device.
    amdgpu_bo_export returns a handle which is valid for that internal
    file descriptor.
    
    Bugzilla: https://bugs.freedesktop.org/110903
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
    Tested-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
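
The commit message relies on the POSIX rule that a duplicated file descriptor refers to the same open file description as the original, whereas a second open() creates a new one. GEM handle namespaces cannot be shown without a DRM device, so the sketch below uses the file offset, another piece of per-description state, as an analogy. It is an illustration under that assumption and is not code from Mesa or libdrm; the helper name is hypothetical.

```python
import os
import tempfile


def offset_seen_by_second_fd(make_second_fd):
    """Read 4 bytes through one descriptor, then report the file offset
    observed through a second descriptor created by make_second_fd."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()
        first = os.open(tmp.name, os.O_RDONLY)
        second = make_second_fd(first, tmp.name)
        try:
            os.read(first, 4)  # advances the offset of first's file description
            return os.lseek(second, 0, os.SEEK_CUR)
        finally:
            os.close(first)
            os.close(second)


# dup(): both descriptors share one file description, so the offset is shared.
shared = offset_seen_by_second_fd(lambda fd, path: os.dup(fd))

# open(): a separate file description with its own, independent offset.
separate = offset_seen_by_second_fd(lambda fd, path: os.open(path, os.O_RDONLY))

print(shared, separate)  # prints "4 0" on a POSIX system
```

By the same reasoning, a handle obtained through a dup() of the original DRM descriptor lives in the same namespace as that descriptor, while a handle from a separately opened descriptor does not.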

While testing I saw a slow leak of 0.8 to 1 MB/s, which appeared immediately on opening OBS with the test scene. It seemed to consistently reach some threshold, around 64 MB, before the major memory leak started, which helped bisect the issue.

I reverted the commit on top of f8887909c6683986990474b61afd6d4335a69e41 with good results.
Comment 3 Michel Dänzer 2019-09-09 14:32:22 UTC
Does https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1907 help by any chance?
Comment 4 tele42k3 2019-09-09 19:59:52 UTC
I reproduced the issue with 7d28e9ddd62eeccf6c528beee6b1a58fdfb7f5a0 + merge request 1907. No visible effect.
Comment 5 GitLab Migration User 2019-09-25 18:50:39 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1426.
