104884 – memory leak with intel i965 mesa when running android container in Ubuntu

Bug 104884 - memory leak with intel i965 mesa when running android container in Ubuntu

Summary: memory leak with intel i965 mesa when running android container in Ubuntu

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965 (show other bugs)
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	highest blocker
Assignee:	Tapani Pälli
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Duplicates (1):	104883 (view as bug list)
Depends on:
Blocks:

Reported:	2018-01-31 12:02 UTC by yanhua
Modified:	2018-03-12 08:05 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:
i915 features:

Attachments
fix attempt (6.10 KB, patch) 2018-02-12 11:57 UTC, Tapani Pälli	Details \| Splinter Review
fix attempt (8.37 KB, patch) 2018-02-12 12:22 UTC, Tapani Pälli	Details \| Splinter Review
Simple patch that changes glsl_type to manage its own memory (10.02 KB, patch) 2018-02-14 12:01 UTC, Simon Hausmann	Details \| Splinter Review
Patched with previous fix, and rerun the valgrind memcheck. (1.65 MB, text/plain) 2018-02-22 04:43 UTC, Jiancong	Details
The massif out by 4 fours testing case (512.83 KB, text/plain) 2018-03-01 15:01 UTC, Jiancong	Details
let's kill some programs (1.32 KB, patch) 2018-03-07 12:49 UTC, Tapani Pälli	Details \| Splinter Review
Show Obsolete (1) View All

Description yanhua 2018-01-31 12:02:30 UTC

When run a android unity3D game in a android container (fork from anbox, http://anbox.io). The memory leak very fast. It seems mesa alloc many memory, but not free.

If run tested app in android phone, there is no leak. 

I tested mesa 17.2.4, and 18.0-devel

Comment 1 yanhua 2018-01-31 12:06:48 UTC

*** Bug 104883 has been marked as a duplicate of this bug. ***

Comment 2 Tapani Pälli 2018-02-12 09:37:23 UTC

Could you please attach here some traces of what kind of leaks are you experiencing?

Comment 3 yanhua 2018-02-12 09:47:47 UTC

==3734== 173,149 bytes in 2,731 blocks are possibly lost in loss record 26,105 of 26,112
==3734==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3734==    by 0x11BF7C3C: ralloc_size (ralloc.c:121)
==3734==    by 0x11BF80B4: ralloc_array_size (ralloc.c:208)
==3734==    by 0x11BF80B4: ralloc_strdup (ralloc.c:353)
==3734==    by 0x11BDA987: glsl_type::glsl_type(glsl_struct_field const*, unsigned int, glsl_interface_packing, bool, char const*) (glsl_types.cpp:152)
==3734==    by 0x11BDC7E1: glsl_type::get_interface_instance(glsl_struct_field const*, unsigned int, glsl_interface_packing, bool, char const*) (glsl_types.cpp:1145)
==3734==    by 0x11B9C91E: construct_interface_instance (builtin_variables.cpp:352)
==3734==    by 0x11B9C91E: generate_varyings (builtin_variables.cpp:1406)
==3734==    by 0x11B9C91E: _mesa_glsl_initialize_variables(exec_list*, _mesa_glsl_parse_state*) (builtin_variables.cpp:1436)
==3734==    by 0x11B65C98: _mesa_ast_to_hir(exec_list*, _mesa_glsl_parse_state*) (ast_to_hir.cpp:127)
==3734==    by 0x11BCC9E2: _mesa_glsl_compile_shader (glsl_parser_extras.cpp:2103)
==3734==    by 0x11A35F37: _mesa_compile_shader (shaderapi.c:1128)
==3734==    by 0x99E97F: gles2_decoder_context_t::decode(void*, unsigned long, IOStream*) (gles2_dec.cpp:321)
==3734==    by 0x95C118: RenderThread::main() (RenderThread.cpp:74)
==3734==    by 0x97C8A4: emugl::Thread::thread_main(void*) (thread_pthread.cpp:124)

Comment 4 yanhua 2018-02-12 09:56:10 UTC

I found _mesa_glsl_compile_shader is called very frequently. which will eventually call into  glsl_type::get_interface_instance() and then glsl_type::glsl_type().

The glsl_type::get_interface_instance() declare a stack variable gls_type()

in glsl_type(), has a free ralloc_array, ralloc_strdup call which will call malloc.

If the program exit. this memory will released _mesa_destroy_shader_compiler we call _mesa_glsl_release_types().

Since this memory allocated is useless(just be cached). at least for the stacked object. It is better to release it very soon.

My program needs to run a long time. Those memory get no chance to be released.
And the system memory is exhausted.

Comment 5 Tapani Pälli 2018-02-12 11:57:51 UTC

Created attachment 137292 [details] [review]
fix attempt

Does this attached patch fix this for you?

Comment 6 Tapani Pälli 2018-02-12 12:22:32 UTC

Created attachment 137294 [details] [review]
fix attempt

Modified the fix slightly. I'll be trying alternative approach but this can be used for testing if there are other leaks still reported.

Comment 7 Tapani Pälli 2018-02-13 12:11:29 UTC

FYI I've tried alternative approach getting rid of stack allocated types (moving to ralloc and using glsl_type itself as context) but there are issues with builtin types declared with DECL_TYPE macro, it seems much bigger change as these types are not allocated using ralloc.

Comment 8 yanhua 2018-02-13 23:36:06 UTC

So the previous fix is not correct? Should this be fixed in later mesa release?

Comment 9 Tapani Pälli 2018-02-14 05:23:09 UTC

(In reply to yanhua from comment #8)
> So the previous fix is not correct? Should this be fixed in later mesa
> release?

It is correct but not very elegant. Ideally the class itself would manage the memory it allocates. You can use this patch for testing.

Comment 10 Simon Hausmann 2018-02-14 12:01:11 UTC

Created attachment 137353 [details] [review]
Simple patch that changes glsl_type to manage its own memory

I've made a patch that attempts to simplify the glsl_type memory management and as a bonus also gets rid of the global mutex. Is this patch worth submitting to the list?

Comment 11 Tapani Pälli 2018-02-14 13:11:39 UTC

(In reply to Simon Hausmann from comment #10)
> Created attachment 137353 [details] [review] [review]
> Simple patch that changes glsl_type to manage its own memory
> 
> I've made a patch that attempts to simplify the glsl_type memory management
> and as a bonus also gets rid of the global mutex. Is this patch worth
> submitting to the list?

Cool, yeah it looks like it would work, I guess strdup needs to be util_strdup. I thought about this earlier but I wonder if the reason for having the global context is that there are a *lot* of types so now we will have loads of ralloc contexts .. not sure. Please send to mesa-dev for discussion?

Comment 12 Tapani Pälli 2018-02-14 13:24:25 UTC

(In reply to Tapani Pälli from comment #11)
> (In reply to Simon Hausmann from comment #10)
> > Created attachment 137353 [details] [review] [review] [review]
> > Simple patch that changes glsl_type to manage its own memory
> > 
> > I've made a patch that attempts to simplify the glsl_type memory management
> > and as a bonus also gets rid of the global mutex. Is this patch worth
> > submitting to the list?
> 
> Cool, yeah it looks like it would work, I guess strdup needs to be
> util_strdup. I thought about this earlier but I wonder if the reason for
> having the global context is that there are a *lot* of types so now we will
> have loads of ralloc contexts .. not sure. Please send to mesa-dev for
> discussion?

The reason for global context is that we want these glsl_types to live during the lifetime of context until compiler gets destroyed and only destroy them at end. This way all shaders will enjoy using the types and not needing to create the same types all the time. So whatever the fix is, that functionality likely needs to stay.

Comment 13 Tapani Pälli 2018-02-15 09:13:58 UTC

Simon, I've sent your patch to mesa-dev with some small changes:
https://lists.freedesktop.org/archives/mesa-dev/2018-February/185676.html

Comment 14 Jiancong 2018-02-20 05:44:25 UTC

Testing with Simon's patch,removing ralloc_strdup, it still be found following possible leaking points.

==13823== 40,176 bytes in 1 blocks are possibly lost in loss record 907 of 915
==13823==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13823==    by 0x11BA013C: ralloc_size (ralloc.c:121)
==13823==    by 0x11BA021D: rzalloc_size (ralloc.c:153)
==13823==    by 0x11BA1331: ra_alloc_reg_set (register_allocate.c:197)
==13823==    by 0x11EF1EEC: brw_vec4_alloc_reg_set (brw_vec4_reg_allocate.cpp:116)
==13823==    by 0x11E2D128: brw_compiler_create (brw_compiler.c:113)
==13823==    by 0x11CDD337: intelInitScreen2 (intel_screen.c:2667)
==13823==    by 0x11C6D1B6: driCreateNewScreen2 (dri_util.c:151)
==13823==    by 0xFFDDED6: dri3_create_screen (dri3_glx.c:869)
==13823==    by 0xFFB1CD0: AllocAndFetchScreenConfigs (glxext.c:820)
==13823==    by 0xFFB1CD0: __glXInitialize (glxext.c:946)
==13823==    by 0xFFAD811: glXGetFBConfigs (glxcmds.c:1655)
==13823==    by 0x1CA398E4: (anonymous namespace)::GlxDisplay::queryConfigs(int, void (*)(void*, EglOS::ConfigInfo const*)

Comment 15 Kenneth Graunke 2018-02-20 17:30:54 UTC

(In reply to Jiancong from comment #14)
> Testing with Simon's patch,removing ralloc_strdup, it still be found
> following possible leaking points.
> 
> ==13823== 40,176 bytes in 1 blocks are possibly lost in loss record 907 of
> 915
> ==13823==    at 0x4C2DB8F: malloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==13823==    by 0x11BA013C: ralloc_size (ralloc.c:121)
> ==13823==    by 0x11BA021D: rzalloc_size (ralloc.c:153)
> ==13823==    by 0x11BA1331: ra_alloc_reg_set (register_allocate.c:197)
> ==13823==    by 0x11EF1EEC: brw_vec4_alloc_reg_set
> (brw_vec4_reg_allocate.cpp:116)
> ==13823==    by 0x11E2D128: brw_compiler_create (brw_compiler.c:113)
> ==13823==    by 0x11CDD337: intelInitScreen2 (intel_screen.c:2667)
> ==13823==    by 0x11C6D1B6: driCreateNewScreen2 (dri_util.c:151)
> ==13823==    by 0xFFDDED6: dri3_create_screen (dri3_glx.c:869)
> ==13823==    by 0xFFB1CD0: AllocAndFetchScreenConfigs (glxext.c:820)
> ==13823==    by 0xFFB1CD0: __glXInitialize (glxext.c:946)
> ==13823==    by 0xFFAD811: glXGetFBConfigs (glxcmds.c:1655)
> ==13823==    by 0x1CA398E4: (anonymous
> namespace)::GlxDisplay::queryConfigs(int, void (*)(void*, EglOS::ConfigInfo
> const*)

That happens when you forget to glXDestroyContext and glXMakeCurrent to NULL to unbind it so it can actually get freed.  It's not a driver bug.

Comment 16 Jiancong 2018-02-22 04:43:56 UTC

Created attachment 137522 [details]
Patched with previous fix, and rerun the valgrind memcheck.

Comment 17 Eero Tamminen 2018-02-27 09:11:22 UTC

(In reply to Kenneth Graunke from comment #15)
> (In reply to Jiancong from comment #14)
> > Testing with Simon's patch,removing ralloc_strdup, it still be found
> > following possible leaking points.
...
> That happens when you forget to glXDestroyContext and glXMakeCurrent to NULL
> to unbind it so it can actually get freed.  It's not a driver bug.


(In reply to Jiancong from comment #16)
> Created attachment 137522 [details]
> Patched with previous fix, and rerun the valgrind memcheck.

Please fix the issue pointed out by Kenneth first, and some of the other application bugs visible in the valgrind output first.  Otherwise it's hard to say which of those numerous items in your log are relevant.

Alternative, you could kill the valgrind process with a signal that the application doesn't catch, but Valgrind can catch (e.g. SIGHUP or SIGXCPU could be such).  That way application cannot mess up the leakage info with bugs it has in its exit path (which are irrelevant for tracking down run-time leakage).

Note that for tracking down larger leaks, Valgrind Massif output is often more useful than Valgrind Memcheck output.

Comment 18 Jiancong 2018-03-01 15:01:49 UTC

Created attachment 137720 [details]
The massif out by 4 fours testing case

Comment 19 Jiancong 2018-03-01 15:02:57 UTC

(In reply to Eero Tamminen from comment #17)
> (In reply to Kenneth Graunke from comment #15)
> > (In reply to Jiancong from comment #14)
> > > Testing with Simon's patch,removing ralloc_strdup, it still be found
> > > following possible leaking points.
> ...
> > That happens when you forget to glXDestroyContext and glXMakeCurrent to NULL
> > to unbind it so it can actually get freed.  It's not a driver bug.
> 
> 
> (In reply to Jiancong from comment #16)
> > Created attachment 137522 [details]
> > Patched with previous fix, and rerun the valgrind memcheck.
> 
> Please fix the issue pointed out by Kenneth first, and some of the other
> application bugs visible in the valgrind output first.  Otherwise it's hard
> to say which of those numerous items in your log are relevant.
> 
> Alternative, you could kill the valgrind process with a signal that the
> application doesn't catch, but Valgrind can catch (e.g. SIGHUP or SIGXCPU
> could be such).  That way application cannot mess up the leakage info with
> bugs it has in its exit path (which are irrelevant for tracking down
> run-time leakage).
> 
> Note that for tracking down larger leaks, Valgrind Massif output is often
> more useful than Valgrind Memcheck output.


Thanks for your advice. 

I've tested it again with massif and killed valgrind process with SIGHUP. After 4 hours testing, And I attached the massif.out file in attachment named . From the file, the #1 memory occupation is ralloc_size. As android app open and exit, the ralloc_size increase dramtically.

The #2 memory occupation is eglCreateContext. But the weird is the callstack shows mesa called _swrast_CreateContext finally. In the configuration process, I've disabled swrast compile with "./configure  --enable-glx=dri --enable-dri --with-dri-drivers=i915,i965 --with-gallium-drivers=i915 --disable-gallium-llvm". Is there anything wrong in my configuration?

Comment 20 Kenneth Graunke 2018-03-01 22:55:36 UTC

i965 uses swrast functionality internally for legacy OpenGL features - so if you're using OpenGL 1.x-3.0 or GLES 1.x...then you'll see _swrast_CreateContext.  It should be cleaned up if you destroy contexts properly.  It's weird that you would see that on Android, though, because GLES2-3 don't hit that.

Comment 21 Jiancong 2018-03-02 03:07:43 UTC

(In reply to Kenneth Graunke from comment #20)
> i965 uses swrast functionality internally for legacy OpenGL features - so if
> you're using OpenGL 1.x-3.0 or GLES 1.x...then you'll see
> _swrast_CreateContext.  It should be cleaned up if you destroy contexts
> properly.  It's weird that you would see that on Android, though, because
> GLES2-3 don't hit that.

The application use a Google translator to map GLES to GL command first, that maybe the reason.

Comment 22 Tapani Pälli 2018-03-07 12:49:35 UTC

Created attachment 137858 [details] [review]
let's kill some programs

If I read the Massif output correctly, does this patch help? This should release shader programs more aggressively and not only at context destroy.

Comment 23 Tapani Pälli 2018-03-08 07:12:03 UTC

Actually I think my patch in comment #22 is incorrect, we can't be sure there if rendering is finished and we can free the shaders. Correct way to fix these leaks is for application to:

--- 8< ---
eglMakeCurrent(info.egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT);
eglTerminate(info.egl_dpy);
--- 8< ---

and similar on GLX. Simon's patch landed master, I will close this one as FIXED. Please reopen if there are still leaks that happen when cleanup of the context has been added to the application.

Comment 24 yanhua 2018-03-12 08:05:19 UTC

That's OK. With Simon's patch the memory leaks decreased dramastically.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.