Bug 109732 - DiRT Rally in a loop leaks memory until there is none left
Summary: DiRT Rally in a loop leaks memory until there is none left
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/Iris
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Kenneth Graunke
QA Contact: Intel 3D Bugs Mailing List
Reported: 2019-02-21 21:23 UTC by leozinho29_eu
Modified: 2019-07-02 20:07 UTC

Description leozinho29_eu 2019-02-21 21:23:01 UTC
Extreme graphical corruption was observed in DiRT Rally when using Iris. On-screen elements fail to disappear, leaving more and more artifacts as the game is played. A loss of 40 FPS was also observed.

Below are the images showing it using Iris and using i965:

Launcher:

Iris: https://i.imgur.com/kTOfjV3.png
i965: https://i.imgur.com/2YaX2AR.png

Game: 

Iris: https://i.imgur.com/7qYPSzD.png
i965: https://i.imgur.com/3HpnUva.png

I understand Iris was just added to the main Mesa repository, thank you for the effort. I'm reporting this to make you aware of its behavior on DiRT Rally.

System specifications:

Processor: Intel Core i3-6100U;
Video: Intel HD Graphics 520;
Architecture: amd64;
RAM: 8 GB;
Mesa: 19.1.0-devel (git-cd0ced49e7);
Kernel version: 4.18.0-15-lowlatency;
Distribution: Xubuntu 18.04.1 amd64.
Comment 1 Kenneth Graunke 2019-02-22 07:07:18 UTC
Sorry, I let a late bug slip in just before merging.  That corruption ought to be fixed by:

commit b21de090d64167ac7436bdec1d76e76797f3b5be
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Thu Feb 21 15:50:14 2019 -0800

    Revert "iris: Enable auxiliary buffer support"

Can you retry with the latest master?  Thanks!
Comment 2 leozinho29_eu 2019-02-22 20:54:17 UTC
After upgrading Mesa to 19.1.0-devel (git-ae2cb72804), nearly all graphical bugs disappeared. Text still has graphical bugs, with a different background showing through, which makes some text unreadable. The following images show the difference on the main menu:

Iris: https://i.imgur.com/fJMjinD.jpg
i965: https://i.imgur.com/fZxytxX.jpg

Regarding performance, the difference is still present. DiRT Rally benchmark had the following results:

Iris: https://i.imgur.com/fx6mNNr.jpg
i965 with always_flush_batch=true to work around bug 109212: https://i.imgur.com/hJMtpqK.jpg
i965: https://i.imgur.com/TEy8JoJ.jpg
Comment 3 Kenneth Graunke 2019-02-25 09:43:43 UTC
(In reply to leozinho29_eu from comment #2)
> Regarding performance, the difference is still present. DiRT Rally benchmark

Yeah, a performance difference is expected at this point.  We are stalling hard on trivial things like vertex buffer upload.  Plus, HiZ/CCS/fast clears are off, and so on.  All fixable, just not landed yet.

The iris-copytrans branch in my tree improves the FPS in DiRT Rally by somewhere between 20% and 30%.

Will have to look into the rendering issues.  On medium settings, the skybox is pretty hosed as well.  Possibly cubemap blitting bugs...
Comment 4 Kenneth Graunke 2019-03-11 23:56:18 UTC
Narrowed this down a lot, finally!  st is doing more aggressive peephole select than i965, causing the NIR to be a fmov.sat of a b32csel.  The backend compiler is creating a csel instruction for this, and dropping the saturate on the floor.  This result is used to determine whether to discard pixels, resulting in a lot of garbage pixels being drawn.  Disabling opt_peephole_csel in the backend fixes it.

The shader containing 02fd04ad_088a284a_e477969e_48f676f2 in a comment (2626 in shader-db) renders the text and exhibits the problem nicely.

Now to figure out *why* the compiler is doing that...
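
To illustrate the failure mode described above, here is a minimal sketch in plain C rather than NIR or backend IR. All function names and the discard test are hypothetical and only model the idea: the shader asks for a saturate applied to the result of a conditional select, and dropping the saturate changes the value that a later discard decision sees.

```c
#include <stdbool.h>

/* Clamp to [0, 1], which is what a saturate modifier does. */
static float saturate(float x)
{
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

/* Conditional select: pick a when cond is true, otherwise b. */
static float csel(bool cond, float a, float b)
{
    return cond ? a : b;
}

/* What the shader asks for: saturate applied to the selected value. */
static float select_with_sat(bool cond, float a, float b)
{
    return saturate(csel(cond, a, b));
}

/* What the miscompiled code effectively computes: the saturate is lost. */
static float select_without_sat(bool cond, float a, float b)
{
    return csel(cond, a, b);
}

/* Hypothetical discard test (an assumption for illustration only):
 * kill the pixel when the value is exactly zero. */
static bool hypothetical_discard(float v)
{
    return v == 0.0f;
}
```

With a selected value of -0.5, the saturated result is 0 and the pixel would be discarded under this made-up test, while the unsaturated result stays at -0.5 and the pixel would be kept and drawn.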
Comment 5 Kenneth Graunke 2019-03-12 03:20:54 UTC
Now renders correctly:
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/431

Performance is still not up to snuff, though.
Comment 6 Kenneth Graunke 2019-03-12 03:21:24 UTC
Sorry.  By correctly I mean...this rendering glitch is fixed.  It still appears to have the blinking artifacts from the other bug, so it's not "correct" yet.
Comment 7 leozinho29_eu 2019-03-12 14:46:14 UTC
I applied the patch and the problems with text rendering no longer exist.

Considering that:

-i965 without workaround (always_flush_batch=true) has constant blinking elements;
-i965 with workaround still has very rare blinking elements;
-i965 with workaround takes a strong performance hit;
-i965 with workaround performs the same as iris;
-iris (with https://gitlab.freedesktop.org/mesa/mesa/merge_requests/431) has what seems to be flawless rendering (Ultralow).

It's preferable to play DiRT Rally using iris than i965.
Comment 8 leozinho29_eu 2019-03-12 15:20:39 UTC
I left DiRT Rally running the benchmark in a loop and it just crashed with:

DirtRally: src/intel/genxml/gen9_pack.h:106: __gen_offset: Assertion `(v & ~mask) == 0' failed.
Aborted (core dumped)

I have seen it crash a few times before, but this is the first time it printed any information about the crash. Unfortunately, this is all there is.
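
For reference, a minimal sketch of what a check of the form (v & ~mask) == 0 tests, assuming the mask covers the bit range of the field the value is packed into. The helper names and the 6..31 bit range used in the note below are illustrative assumptions, not copied from gen9_pack.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a mask with bits start..end (inclusive) set.
 * Requires 0 <= start <= end <= 63. */
static uint64_t field_mask(uint32_t start, uint32_t end)
{
    return (~0ull >> (64 - (end - start + 1))) << start;
}

/* True when every set bit of v lies inside the field,
 * i.e. when (v & ~mask) == 0 holds. */
static bool offset_fits_field(uint64_t v, uint32_t start, uint32_t end)
{
    return (v & ~field_mask(start, end)) == 0;
}
```

For a hypothetical field spanning bits 6..31, an offset of 0x1000 passes, while 0x1020 (a low bit set, so not aligned to the field) and 0x100000000 (too large for the field) both fail. A failing assertion of this kind therefore suggests an offset that is either misaligned or out of range for its field.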
Comment 9 Kenneth Graunke 2019-03-12 18:29:02 UTC
Hmm.  I haven't seen that yet, but running in a loop I did eventually get a NULL resource from BLORP's surface stream uploader, which led to a crash.  Definitely shouldn't be getting that...
Comment 10 leozinho29_eu 2019-03-23 03:58:42 UTC
Using Mesa 19.1.0-devel (git-dacb11a585), iris performance is nearly the same compared to i965: average FPS is 4 lower (80 using i965 versus 76 using iris), minimum FPS is 16 lower (60 vs. 44) and maximum FPS is 3 lower (111 vs 108). The game performance with iris increased 80% in one month. This is impressive.

Iris seems either to not free memory or to free much less than it should, and it crashed with a segmentation fault twice. This seems to be a different problem from the BLORP one.

Messages when it crashed:

Thread 56 "OGL_Dispatch_33" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f6e557fa700 (LWP 4873)]
iris_upload_dirty_render_state (ice=ice@entry=0x6812170, batch=batch@entry=0x6812668, draw=draw@entry=0x7f6e557f9230) at ../src/gallium/drivers/iris/iris_state.c:4350
4350	../src/gallium/drivers/iris/iris_state.c: No such file or directory.

The backtrace:

#0  0x00007f6ecdcd9f07 in iris_upload_dirty_render_state (ice=ice@entry=0x6812170, batch=batch@entry=0x6812668, draw=draw@entry=0x7f6e557f9230) at ../src/gallium/drivers/iris/iris_state.c:4350
#1  0x00007f6ecdcdaa44 in iris_upload_render_state (ice=0x6812170, batch=0x6812668, draw=0x7f6e557f9230) at ../src/gallium/drivers/iris/iris_state.c:5021
#2  0x00007f6ecde7dff5 in iris_draw_vbo (ctx=0x6812170, info=0x7f6e557f9230) at ../src/gallium/drivers/iris/iris_draw.c:149
#3  0x00007f6ecd732457 in cso_draw_arrays (cso=cso@entry=0x5475040, mode=mode@entry=5, start=start@entry=0, count=count@entry=4) at ../src/gallium/auxiliary/cso_cache/cso_context.c:1725
#4  0x00007f6ecd84b8c4 in st_pbo_draw (st=st@entry=0x549ed90, addr=addr@entry=0x7f6e557f94e0, surface_width=<optimized out>, surface_height=32) at ../src/mesa/state_tracker/st_pbo.c:282
#5  0x00007f6ecd838dda in try_pbo_upload_common (ctx=ctx@entry=0x6f52920, surface=surface@entry=0x7f6d7ef80710, addr=addr@entry=0x7f6e557f94e0, src_format=<optimized out>) at ../src/mesa/state_tracker/st_cb_texture.c:1283
#6  0x00007f6ecd83e0af in try_pbo_upload (unpack=0x6f5c330, pixels=0x0, depth=1, height=32, width=32, zoffset=<optimized out>, yoffset=<optimized out>, xoffset=0, dst_format=<optimized out>, type=33639, format=32993, texImage=0x7f6d7d59de60, dims=2, ctx=0x6f52920) at ../src/mesa/state_tracker/st_cb_texture.c:1401
#7  0x00007f6ecd83e0af in st_TexSubImage (ctx=0x6f52920, dims=2, texImage=0x7f6d7d59de60, xoffset=0, yoffset=0, zoffset=0, width=32, height=32, depth=1, format=32993, type=33639, pixels=0x0, unpack=0x6f5c330) at ../src/mesa/state_tracker/st_cb_texture.c:1522
#8  0x00007f6ecd7f7124 in texture_sub_image (ctx=ctx@entry=0x6f52920, dims=dims@entry=2, texObj=texObj@entry=0x7f6d7d799710, texImage=0x7f6d7d59de60, target=target@entry=3553, level=level@entry=0, xoffset=<optimized out>, yoffset=0, zoffset=0, width=32, height=32, depth=1, format=32993, type=33639, pixels=0x0) at ../src/mesa/main/teximage.c:3333
#9  0x00007f6ecd7f9e10 in texsubimage_err (ctx=0x6f52920, dims=2, target=3553, level=0, xoffset=0, yoffset=0, zoffset=0, width=32, height=32, depth=1, format=32993, type=33639, pixels=0x0, callerName=0x7f6ecdebeb1e "glTexSubImage2D") at ../src/mesa/main/teximage.c:3391
#10 0x00007f6ecd7fdd58 in _mesa_TexSubImage2D (target=<optimized out>, level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>, width=<optimized out>, height=<optimized out>, format=32993, type=33639, pixels=0x0) at ../src/mesa/main/teximage.c:3609
#11 0x0000000001cf7c34 in  ()
#12 0x0000000001d0de30 in  ()
#13 0x00000000020f8e8e in  ()
#14 0x0000000002173bfe in  ()
#15 0x00007f6ee61d36db in start_thread (arg=0x7f6e557fa700) at pthread_create.c:463
#16 0x00007f6edf62788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 11 Kenneth Graunke 2019-04-24 00:07:07 UTC
FWIW, I improved the performance by around 25% yesterday, and fixed a bug leading to the crowds being missing a lot of the time.
Comment 12 leozinho29_eu 2019-04-24 01:48:59 UTC
Just rebuilt Mesa to test. The stages have much larger crowds than before. The minimum FPS in the benchmark was 61 and the average FPS was 84, both significantly higher than before. As for the maximum FPS, the CPU is probably the limiting factor, as it is always around 111 FPS.

This is as good as i965 without workaround for the blinking bug and has no blinking bug.

I got a new crash after leaving the benchmark running in a loop (around 20 minutes):

Thread 56 "OGL_Dispatch_33" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb627fff700 (LWP 2155)]
0x00007fb6b460a479 in alloc_surface_states (mgr=0x4898e20, ref=0x4bde2e0, aux_usages=<optimized out>) at ../src/gallium/drivers/iris/iris_resource.h:239
239	../src/gallium/drivers/iris/iris_resource.h: No such file or directory.
(gdb) bt
#0  0x00007fb6b460a479 in alloc_surface_states (mgr=0x4898e20, ref=0x4bde2e0, aux_usages=<optimized out>) at ../src/gallium/drivers/iris/iris_resource.h:239
#1  0x00007fb6b460d011 in iris_set_shader_images (ctx=0x4bd7ee0, p_stage=<optimized out>, start_slot=0, count=3, p_images=0x7fb627ffe370) at ../src/gallium/drivers/iris/iris_state.c:1954
#2  0x00007fb6b4266a92 in st_bind_images (st=0x4e21580, prog=0x7fb6918865d0, shader_type=PIPE_SHADER_COMPUTE) at ../src/mesa/state_tracker/st_atom_image.c:177
#3  0x00007fb6b431d784 in st_validate_state (st=st@entry=0x4e21580, pipeline=pipeline@entry=ST_PIPELINE_COMPUTE) at ../src/util/bitscan.h:104
#4  0x00007fb6b432301d in st_dispatch_compute_common (ctx=0x63a9940, num_groups=0x7fb627ffe7cc, group_size=0x0, indirect=0x0, indirect_offset=0) at ../src/mesa/state_tracker/st_cb_compute.c:58
#5  0x00007fb6b4308c5f in dispatch_compute (no_error=false, num_groups_z=1, num_groups_y=45, num_groups_x=80) at ../src/mesa/main/compute.c:265
#6  0x00007fb6b4308c5f in _mesa_DispatchCompute (num_groups_x=80, num_groups_y=45, num_groups_z=1) at ../src/mesa/main/compute.c:280
#7  0x0000000001d0de30 in  ()
#8  0x00000000020f8e8e in  ()
#9  0x0000000002173bfe in  ()
#10 0x00007fb6ccb1e6db in start_thread (arg=0x7fb627fff700) at pthread_create.c:463
#11 0x00007fb6c5f7288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 13 Kenneth Graunke 2019-04-26 00:18:20 UTC
(In reply to leozinho29_eu from comment #0)
> Kernel version: 4.18.0-15-lowlatency;

Ah!  This is likely the source of your memory problems.  The kernel drm code had a memory leak for sync points, which both anv and iris hit, but not i965.

Please try upgrading to 4.19 (or at least 4.18.11), my guess is this will fix it.
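
A rough way to compare a kernel release string against a minimum version such as 4.18.11 is sketched below. The struct and function names are hypothetical. This only looks at the leading upstream version numbers, and distribution kernels may carry backported fixes, so it is a first check at best.

```c
#include <stdbool.h>
#include <stdio.h>

struct kver { int major, minor, patch; };

/* Parse the leading "major.minor[.patch]" of a release string such as
 * "4.18.0-15-lowlatency" or "5.1-rc6". A missing patch level is treated
 * as 0. Returns false if fewer than two numbers are found. */
static bool parse_kver(const char *release, struct kver *out)
{
    int n = sscanf(release, "%d.%d.%d",
                   &out->major, &out->minor, &out->patch);
    if (n < 2)
        return false;
    if (n == 2)
        out->patch = 0;
    return true;
}

/* True when a is strictly older than b. */
static bool kver_older(struct kver a, struct kver b)
{
    if (a.major != b.major) return a.major < b.major;
    if (a.minor != b.minor) return a.minor < b.minor;
    return a.patch < b.patch;
}
```

The release string would typically come from the release field filled in by uname(2), or from the output of uname -r.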
Comment 14 leozinho29_eu 2019-04-26 00:34:22 UTC
If the bug in question is bug 107899, it is fixed in the current Ubuntu kernel (4.18.0-18-lowlatency). I reported it as soon as it was fixed in linux-git. The report on Ubuntu's Launchpad: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165

The 5.0 kernel has a real leak, as bug 110276 is still present.

I believe it's pretty unlikely that the problem is a kernel bug, but I will install 5.1-rc6 to test. I will reply here after testing; give me a moment.
Comment 15 leozinho29_eu 2019-04-26 01:19:18 UTC
I have tested with 5.1-rc6, and got the following crash. It seems to be different, however:

Thread 56 "OGL_Dispatch_33" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f4a3e69f700 (LWP 7991)]
0x00007f4a6df021cb in use_ubo_ssbo (batch=0x5acc390, ice=<optimized out>, buf=<optimized out>, 
    surf_state=0x5accff0, writable=<optimized out>) at ../src/gallium/drivers/iris/iris_resource.h:239
239	../src/gallium/drivers/iris/iris_resource.h: No such file or directory.
(gdb) bt
#0  0x00007f4a6df021cb in use_ubo_ssbo (batch=0x5acc390, ice=<optimized out>, buf=<optimized out>, surf_state=0x5accff0, writable=<optimized out>) at ../src/gallium/drivers/iris/iris_resource.h:239
#1  0x00007f4a6df041d4 in iris_populate_binding_table (ice=ice@entry=0x5acbe90, batch=batch@entry=0x5acc390, stage=stage@entry=MESA_SHADER_VERTEX, pin_only=pin_only@entry=false)
    at ../src/gallium/drivers/iris/iris_state.c:4055
#2  0x00007f4a6df0b466 in iris_upload_dirty_render_state (ice=ice@entry=0x5acbe90, batch=batch@entry=0x5acc390, draw=draw@entry=0x7f4a3e69e620) at ../src/gallium/drivers/iris/iris_state.c:4577
#3  0x00007f4a6df0d574 in iris_upload_render_state (ice=0x5acbe90, batch=0x5acc390, draw=0x7f4a3e69e620)
    at ../src/gallium/drivers/iris/iris_state.c:5135
#4  0x00007f4a6e0ba9c5 in iris_draw_vbo (ctx=0x5acbe90, info=0x7f4a3e69e620)
    at ../src/gallium/drivers/iris/iris_draw.c:149
#5  0x00007f4a6db6b5bf in st_draw_vbo (ctx=<optimized out>, prims=0x7f4a3e69e700, nr_prims=<optimized out>, ib=0x7f4a3e69e6e0, index_bounds_valid=<optimized out>, min_index=<optimized out>, max_index=<optimized out>, tfb_vertcount=0x0, stream=0, indirect=0x0) at ../src/mesa/state_tracker/st_draw.c:271
#6  0x00007f4a6dc03d44 in _mesa_validated_drawrangeelements (ctx=0x75dae60, mode=4, index_bounds_valid=<optimized out>, start=0, end=4294967295, count=6, type=5123, indices=0x0, basevertex=0, numInstances=1, baseInstance=149) at ../src/mesa/main/draw.c:816
#7  0x00007f4a6dc0436b in _mesa_exec_DrawElementsInstancedBaseVertexBaseInstance (mode=4, count=6, type=5123, indices=0x0, numInstances=1, basevertex=0, baseInstance=149) at ../src/mesa/main/draw.c:1150
#8  0x0000000001cfcfac in  ()
#9  0x0000000001d0de30 in  ()
#10 0x00000000020f8e8e in  ()
#11 0x0000000002173bfe in  ()
#12 0x00007f4a864176db in start_thread (arg=0x7f4a3e69f700) at pthread_create.c:463
#13 0x00007f4a7f86b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 16 Kenneth Graunke 2019-07-02 20:07:49 UTC
I ran the benchmark on loop for a few hours and it's working great now.  Doesn't appear to be leaking memory or buffer maps.  Performance seems roughly on par with Windows, too.
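
For anyone who wants to watch for this kind of leak during a long run, a minimal sketch follows. It assumes that "buffer maps" corresponds to memory mappings visible in /proc/<pid>/maps, which is a guess, and the function names are hypothetical. Both helpers work on text that has already been read from the procfs files, so reading the files and sampling over time are left out.

```c
#include <stdio.h>
#include <string.h>

/* Return the VmRSS value in kB from the text of /proc/<pid>/status,
 * or -1 if no VmRSS line is present. */
static long parse_vmrss_kb(const char *status_text)
{
    const char *p = status_text;
    while (p && *p) {
        long kb;
        /* Try to match a VmRSS line at the start of the current line. */
        if (sscanf(p, "VmRSS: %ld kB", &kb) == 1)
            return kb;
        /* Advance to the start of the next line, if any. */
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return -1;
}

/* Count newline-terminated lines, e.g. one per mapping in the text
 * of /proc/<pid>/maps. */
static long count_lines(const char *text)
{
    long n = 0;
    for (const char *p = text; *p; p++)
        if (*p == '\n')
            n++;
    return n;
}
```

Sampling both numbers every few minutes while the benchmark loops, and checking whether either keeps growing without levelling off, gives a rough indication of a leak.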