Bug 106106 - GEM object leaks with fullscreen programs -> swap fills up + OOM kills within few hours
Status: VERIFIED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/modesetting
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: Louis-Francis Ratté-Boulianne
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords: bisected, patch, regression
Depends on:
Blocks:
 
Reported: 2018-04-17 15:29 UTC by Eero Tamminen
Modified: 2019-02-12 11:36 UTC

See Also:
i915 platform:
i915 features:


Attachments
i915_gem_framebuffer content (535 bytes, text/plain)
2018-04-18 08:12 UTC, Eero Tamminen

Description Eero Tamminen 2018-04-17 15:29:21 UTC
Setup:
* Intel HW
* drm-tip kernel
* Mesa git
* Latest X server and related dependencies built from git
* DRI3 v1.2 & modifier support (needs git versions of above) enabled on X server side:
------------------
Section "ServerFlags"
    Option "Debug" "dmabuf_capable"
EndSection
------------------
* X using modesetting / glamor
* Up to date Ubuntu 16.04 for rest (I assume other distros would also be OK)
* Unity or Gnome desktop
* For example "glmark2" Open Source GL benchmark installed:
  sudo apt install glmark2

Test-case:
* Monitor GEM object usage in the system:
  sudo watch cat /sys/kernel/debug/dri/0/i915_gem_objects
* While running fullscreen GL program in a loop:
----------------
#!/bin/sh
for i in $(seq 10); do
    glmark2 --fullscreen &
    sleep 4
    killall glmark2
    sleep 1
done
----------------

Expected outcome:
* After doing the above a few times, the GEM object count for X no longer increases

Actual outcome:
* There are two X contexts, GEM object count for one of them increases every time by tens of objects
* Global counts are increasing for unbound, bound and huge page types
* If you continue this long enough, 3D programs using e.g. large textures start failing, your swap fills up and the kernel starts OOM killing processes.  On a machine with 2GB of RAM, I start getting OOM kills after running benchmarks for ~3 hours

Notes:
* When X is restarted, GEM objects go away and SwapFree changes from 0% to 100%, so leak is not on the kernel side
* XRestop doesn't show anything suspicious, none of the clients has huge or increasing X item counts, so the issue doesn't come from (e.g. compositor) client, it's on X server side
* Leak isn't visible with XFCE compositor, only with Unity (compiz) and Gnome -> Leak doesn't happen with XRender, only with compositors using GL/ES
* Leak happens also with other fullscreen GL programs, but not if they're run windowed -> maybe leak is related to fullscreen window redirected<->unredirected transitions?


Btw. This leak is real nasty to track down.

Kernel GEM objects don't show up in process memory usage / mappings, nor in kernel slab or vmalloc infos.  Until you know what it is, you can only see it from your swap filling up and processes getting OOM killed, or crashing on a 3D allocation failure.

If your device has enough RAM that the unused GEM objects attached to (context) handle don't get pushed to swap and fill it, you don't notice the leakage from anything.

After you know what's being leaked, you can track it (on Intel) with:
  sudo watch cat /sys/kernel/debug/dri/0/i915_gem_objects


I don't know when this leakage started. I noticed it only recently, when it became bad enough that things started being OOM killed (on a 2GB machine), but it has been there before that.  If it started within the past few months, I can narrow down the range of X commits at least to a single day (and whether this is an X or Mesa issue).
Comment 1 Chris Wilson 2018-04-17 16:31:16 UTC
Also look at i915_gem_framebuffer that will help to indicate which buffers are being leaked.
Comment 2 Eero Tamminen 2018-04-18 08:12:51 UTC
Created attachment 138901
i915_gem_framebuffer content

Framebuffer data stays stable, so that's not the problem, although huge page objects increase.

Btw. I forgot to mention that also restarting compiz doesn't help, i.e. it's on the server side.
Comment 3 Eero Tamminen 2018-04-18 13:29:06 UTC
Bisecting is progressing slowly because, to be sure of the leak, I'm running the tests for a few hours (naturally always the same set of tests in the same order), in case Mesa bug 105906 would cause fluctuations in the results.

What's worse, there's no clear point when leak happened, it has grown during few weeks...

2018-03-03 git versions of Mesa, X and drm-tip kernel:
--------------------------------------------------------------------
348 objects, 197382144 bytes
109 unbound objects, 37384192 bytes
237 bound objects, 159473664 bytes
13 purgeable objects, 204800 bytes
25 mapped objects, 839680 bytes
50 huge-paged objects (2M, 64K, 4K) 184258560 bytes
29 display objects (globally pinned), 17522688 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K

[k]contexts: 14 objects, 385024 bytes (0 active, 385024 inactive, 385024 global, 0 shared, 0 unbound)
X: 84 objects, 100442112 bytes (0 active, 133922816 inactive, 25534464 global, 42409984 shared, 618496 unbound)
X: 253 objects, 113287168 bytes (0 active, 116875264 inactive, 16834560 global, 25632768 shared, 37146624 unbound)
--------------------------------------------------------------------

-> 0.1 GB X context(s)


2018-03-07 git versions:
--------------------------------------------------------------------
458 objects, 1153622016 bytes
153 unbound objects, 440012800 bytes
303 bound objects, 713084928 bytes
12 purgeable objects, 200704 bytes
21 mapped objects, 798720 bytes
78 huge-paged objects (2M, 64K, 4K) 411471872 bytes
25 display objects (globally pinned), 17489920 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K

[k]contexts: 12 objects, 368640 bytes (0 active, 368640 inactive, 368640 global, 0 shared, 0 unbound)
X: 202 objects, 1056759808 bytes (0 active, 713052160 inactive, 59387904 global, 998711296 shared, 403271680 unbound)
X: 251 objects, 113258496 bytes (0 active, 114884608 inactive, 16834560 global, 25632768 shared, 37134336 unbound)
--------------------------------------------------------------------

-> 1.0 GB X context (the larger one)


2018-03-12 git versions:
--------------------------------------------------------------------
612 objects, 2487406592 bytes
103 unbound objects, 37261312 bytes
295 bound objects, 838496256 bytes
13 purgeable objects, 204800 bytes
22 mapped objects, 802816 bytes
77 huge-paged objects (2M, 64K, 4K) 474124288 bytes
29 display objects (globally pinned), 17522688 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K

[k]contexts: 14 objects, 385024 bytes (0 active, 385024 inactive, 385024 global, 0 shared, 0 unbound)
X: 358 objects, 2390556672 bytes (0 active, 845119488 inactive, 76201984 global, 2324111360 shared, 1611612160 unbound)
X: 243 objects, 113197056 bytes (0 active, 97050624 inactive, 8417280 global, 25632768 shared, 37154816 unbound)
--------------------------------------------------------------------

-> 2.2 GB X context


2018-03-15 git versions:
--------------------------------------------------------------------
894 objects, 4744323072 bytes
129 unbound objects, 272134144 bytes
465 bound objects, 2155888640 bytes
13 purgeable objects, 204800 bytes
22 mapped objects, 802816 bytes
190 huge-paged objects (2M, 64K, 4K) 1436098560 bytes
29 display objects (globally pinned), 17522688 bytes
4294967296 [0x0000000010000000] gtt total
Supported page sizes: 2M, 64K, 4K

[k]contexts: 14 objects, 385024 bytes (0 active, 385024 inactive, 385024 global, 0 shared, 0 unbound)
X: 632 objects, 4647391232 bytes (0 active, 2187685888 inactive, 93077504 global, 4589318144 shared, 2551144448 unbound)
X: 254 objects, 105172992 bytes (0 active, 88899584 inactive, 8417280 global, 17526784 shared, 37146624 unbound)
--------------------------------------------------------------------

-> 4.3 GB X context

(Note: results aren't all from same machine, I'm using couple of them to speed up bisecting.)

The last few nights' git versions result in a 4.4 GB X context, so the leakage hasn't gotten worse since mid-March.

(I would have noticed the leak earlier if X modesetting format mismatch bug hadn't prevented X from starting on older GENs on which we have less RAM installed, between March 7th & April 3rd.)
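
As a cross-check of the "-> N GB X context" figures above, here is a small sketch that converts the byte counts of the larger X context in each dump. It assumes the GB figures are binary gigabytes (2^30 bytes), which is what matches the quoted values; the names GIB, larger_x_context_bytes and to_gib are made up for this illustration.

```python
GIB = 1024 ** 3

# Byte counts of the larger "X:" context in each dump above, keyed by the
# date of the git versions that were tested.
larger_x_context_bytes = {
    "2018-03-03": 113287168,
    "2018-03-07": 1056759808,
    "2018-03-12": 2390556672,
    "2018-03-15": 4647391232,
}


def to_gib(byte_count):
    """Convert a byte count to binary gigabytes, rounded to one decimal."""
    return round(byte_count / GIB, 1)


for date, size in larger_x_context_bytes.items():
    print(f"{date}: {to_gib(size)} GB")
```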
Comment 4 Eero Tamminen 2018-04-19 07:43:26 UTC
Ok, the leakage started after March 3rd, that version doesn't leak at all.  After that the leakage has increased, but the indicated use-case (fullscreen GL) shows the leak right from the beginning.

As expected, there's no leakage with Intel DDX, only with modesetting/glamor. Leakage is due just to X server, other components (kernel, mesa, X libs) don't affect it.

Leak started between these X server commits:
2018-03-02 17:05:49 UTC: 43ffd572592d26bb78decfdf55e643bdfb011d3f meson: Make SHM extension optional
2018-03-06 15:53:39 UTC: 43576b901151a1f32209f476249a4de6980b654f glamor: Restore glamor_fd_from_pixmap and glamor_pixmap_from_fd

Unfortunately that's the day when the initial atomic modesetting, modifiers and DRI3 v1.2 support went in, and the few final commits in that range (including DRI3 v1.2 support) until the last glamor fix, are broken.

(Same range of commits that broke X for a month on older GENs due to modesetting format mismatch.)

I'll try to narrow it down further.
Comment 5 Eero Tamminen 2018-04-19 12:57:58 UTC
Leakage size depends on the rest of the 3D stack.  I did the rest of the X bisecting using last night's Git versions of everything other than the X server, as the leak size was largest with the latest versions.

Last good commit:
e375f2966 modesetting: Create scanout buffers using supported modifiers

First bad commit:
----------------------------------------------------------------
commit 9d147305b4048dcec7ea4eda3eeea83f843f7788
Author:     Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
AuthorDate: Wed Feb 28 01:19:42 2018 +0000
Commit:     Adam Jackson <ajax@redhat.com>
CommitDate: Mon Mar 5 13:27:47 2018 -0500

    modesetting: Check if buffer format is supported when flipping
    
    Add support for 'check_flip2' so that the present core can know
    why it is impossible to flip in that scenario. The core can then
    let know the client that the buffer format/modifier is suboptimal.
    
    v2: No longer need to implement 'check_flip'
    
    Signed-off-by: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
    Reviewed-by: Daniel Stone <daniels@collabora.com>
    Acked-by: Keith Packard <keithp@keithp.com>
    Reviewed-by: Adam Jackson <ajax@redhat.com>
----------------------------------------------------------------

Looking at that commit:
https://cgit.freedesktop.org/xorg/xserver/commit/?id=9d147305b4048dcec7ea4eda3eeea83f843f7788

It seems that on every call of ms_present_check_flip(), the following gbm bo is leaked when modifier support is enabled:
     gbm = glamor_gbm_bo_from_pixmap(screen, pixmap)


Running the example script that opens & closes fullscreen GL program 10x in a row, increases X context size by *200MB* (or more) each time:
1.  100MB
2.  250MB
3.  450MB
4.  610MB
5.  810MB
6. 1000MB
7. 1200MB
8. 1420MB
...

I.e. 20MB leak per application window open/close.
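
The per-run figure can be checked from the sizes listed above with a short sketch. It assumes each listed value is the X context size after one full run of the example script (10 open/close cycles); the names sizes_mb, RUNS_PER_SCRIPT, deltas, mean_per_script and mean_per_run are made up for this illustration.

```python
# X context size in MB after each run of the example script (from the list above).
sizes_mb = [100, 250, 450, 610, 810, 1000, 1200, 1420]

# The example script opens and closes the fullscreen program 10 times.
RUNS_PER_SCRIPT = 10

# Growth between consecutive script runs.
deltas = [after - before for before, after in zip(sizes_mb, sizes_mb[1:])]

mean_per_script = sum(deltas) / len(deltas)
mean_per_run = mean_per_script / RUNS_PER_SCRIPT

print("growth per script run (MB):", deltas)
print(f"mean growth per script run: {mean_per_script:.0f} MB")
print(f"mean leak per open/close: {mean_per_run:.1f} MB")
```

The measured growth averages a little under 200MB per script run, i.e. roughly 20MB per window open/close.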
Comment 6 Eero Tamminen 2018-04-26 07:48:06 UTC
> I.e. 20MB leak per application window open/close.

IMHO this regression is a blocker for the X release.  Is there some convention on how to mark bugs as release blockers?

(Should be simple to fix as I located what is leaked / where.)
Comment 7 Louis-Francis Ratté-Boulianne 2018-04-26 15:43:56 UTC
Thanks a lot Eero for hunting that leak! I've sent a patch to the mailing list that should (at least partly) fix it.
Comment 8 Eero Tamminen 2018-04-27 14:51:09 UTC
Tested the patch on KBL GT2: https://patchwork.freedesktop.org/patch/218934/

It gets rid of *all* the context leakage.

Everything else works also fine (except for Mesa bug 105906).

Tested-by: Eero Tamminen <eero.t.tamminen@intel.com>
Comment 9 Adam Jackson 2018-04-30 21:35:48 UTC
commit 6cace4990abc2386b6ea68536b321994d264c295
Author: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
Date:   Thu Apr 26 11:04:15 2018 -0400

    modesetting: Fix GBM objects leak when checking for flip
    
    GBM objects were never destroyed after looking for format and
    modifier compatibility when deciding whether flipping or copying
    a presented pixmap.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106106
    Signed-off-by: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
Comment 10 Eero Tamminen 2018-05-02 08:28:14 UTC
Verified.

(After 3 hours of testing on several devices, X again has 2x 0.1GB GEM contexts, instead of one of them being 4GB.)

