Bug 103304

Summary: multi-threaded usage of Gallium RadeonSI leads to NULL pointer exception in pb_cache_reclaim_buffer
Product: Mesa Reporter: Luc <lper.home>
Component: Drivers/Gallium/radeonsi Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED NOTOURBUG QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium    
Version: 17.0   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: simple sanity check patch

Description Luc 2017-10-17 06:48:27 UTC
The issue is not present in Mesa 11.X. It is, however, present in Mesa 13.0.X and 17.0.X and, as far as I can see from the code, probably also in the latest Mesa 17.2.X.
Our code is very similar to the second example in https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading : we have two contexts which are shared. Rendering is done in one context/thread and texture uploading in the other. In this setup we hit a race that causes a crash (on average it takes about an hour to hit the issue).

The crash has the following backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  pb_cache_reclaim_buffer (mgr=mgr@entry=0x1e8dd30, size=size@entry=2088960, alignment=alignment@entry=4096, usage=usage@entry=20,
    bucket_index=bucket_index@entry=3) at pipebuffer/pb_cache.c:183
#1  0x00007fe2671c50e7 in amdgpu_bo_create (rws=0x1e8dbf0, size=<optimized out>, alignment=4096, domain=RADEON_DOMAIN_VRAM_GTT, flags=RADEON_FLAG_GTT_WC)
    at amdgpu_bo.c:754
#2  0x00007fe2671db666 in r600_alloc_resource (rscreen=rscreen@entry=0x1e8f0c0, res=res@entry=0x7fe24c2d3100) at r600_buffer_common.c:197
#3  0x00007fe2671e6eff in r600_texture_invalidate_storage (rctx=rctx@entry=0x1f9e900, rtex=rtex@entry=0x7fe24c2d3100) at r600_texture.c:1414
#4  0x00007fe2671eb474 in r600_texture_transfer_map (ctx=0x1f9e900, texture=0x7fe24c2d3100, level=0, usage=258, box=0x7fe265bca970,
    ptransfer=0x7fe265bca898) at r600_texture.c:1483
#5  0x00007fe267041807 in u_transfer_map_vtbl (context=<optimized out>, resource=<optimized out>, level=<optimized out>, usage=<optimized out>,
    box=<optimized out>, transfer=<optimized out>) at util/u_transfer.c:138
#6  0x00007fe267041732 in u_default_texture_subdata (pipe=0x1f9e900, resource=0x7fe24c2d3100, level=<optimized out>, usage=<optimized out>,
    box=0x7fe265bca970, data=0x7fe218ac05e0, stride=1920, layer_stride=2088960) at util/u_transfer.c:59
#7  0x00007fe266e51137 in st_TexSubImage (ctx=<optimized out>, dims=2, texImage=<optimized out>, xoffset=0, yoffset=0, zoffset=0, width=1920,
    height=1088, depth=1, format=6403, type=5121, pixels=0x7fe218ac05e0, unpack=0x2000fc0) at state_tracker/st_cb_texture.c:1412
#8  0x00007fe266dd75bf in _mesa_texture_sub_image (ctx=ctx@entry=0x1fe5d50, dims=dims@entry=2, texObj=texObj@entry=0x7fe24c2d2ca0,
    texImage=0x7fe24c2cda20, target=target@entry=3553, level=level@entry=0, xoffset=xoffset@entry=0, yoffset=yoffset@entry=0, zoffset=zoffset@entry=0,
    width=width@entry=1920, height=height@entry=1088, depth=depth@entry=1, format=format@entry=6403, type=type@entry=5121,
    pixels=pixels@entry=0x7fe218ac05e0, dsa=dsa@entry=false) at main/teximage.c:3239
#9  0x00007fe266dd7787 in texsubimage (ctx=0x1fe5d50, dims=dims@entry=2, target=3553, level=0, xoffset=0, yoffset=0, zoffset=zoffset@entry=0,
    width=1920, height=1088, depth=depth@entry=1, format=format@entry=6403, type=type@entry=5121, pixels=pixels@entry=0x7fe218ac05e0,
    callerName=callerName@entry=0x7fe26723c036 "glTexSubImage2D") at main/teximage.c:3297
#10 0x00007fe266dd7b49 in _mesa_TexSubImage2D (target=<optimized out>, level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>,
    width=<optimized out>, height=<optimized out>, format=6403, type=5121, pixels=0x7fe218ac05e0) at main/teximage.c:3438


If we enable assert() handling in the mesa3d library, this crash does not occur, because an assert is triggered first:

#0  0x00007fd388fed124 in raise () from /lib64/libc.so.6
#1  0x00007fd388fee58a in abort () from /lib64/libc.so.6
#2  0x00007fd388fe5e47 in ?? () from /lib64/libc.so.6
#3  0x00007fd388fe5ef2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fd373986091 in pipe_reference_described (get_desc=<optimized out>, reference=0x7fd35801b100, ptr=0x0)
    at gallium/auxiliary/util/u_inlines.h:82
#5  pipe_reference (reference=0x7fd35801b100, ptr=0x0) at gallium/auxiliary/util/u_inlines.h:102
#6  pb_reference (src=0x7fd35801b100, dst=0x2a260d0) at gallium/auxiliary/pipebuffer/pb_buffer.h:241
#7  amdgpu_winsys_bo_reference (src=0x7fd35801b100, dst=0x2a260d0) at amdgpu_bo.h:116
#8  amdgpu_lookup_or_add_real_buffer (acs=0x3fea9d0, bo=0x7fd35801b100) at amdgpu_cs.c:358
#9  0x00007fd3739863ac in amdgpu_cs_add_buffer (rcs=<optimized out>, buf=<optimized out>, usage=10, domains=<optimized out>,
    priority=RADEON_PRIO_SAMPLER_TEXTURE) at amdgpu_cs.c:450
#10 0x00007fd3738d79fd in radeon_add_to_buffer_list (priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ, rbo=0x7fd358019cd0, ring=0x1eedeb8,
    rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:77
#11 radeon_add_to_buffer_list_check_mem (check_mem=false, priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ, rbo=0x7fd358019cd0,
    ring=0x1eedeb8, rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:114
#12 si_sampler_view_add_buffer (sctx=sctx@entry=0x1eedb60, resource=0x7fd358019cd0, usage=usage@entry=RADEON_USAGE_READ,
    is_stencil_sampler=<optimized out>, check_mem=check_mem@entry=false) at si_descriptors.c:316
#13 0x00007fd3738d7cb2 in si_sampler_views_begin_new_cs (sctx=sctx@entry=0x1eedb60, views=views@entry=0x1eef360) at si_descriptors.c:350
#14 0x00007fd3738dfd5a in si_all_descriptors_begin_new_cs (sctx=sctx@entry=0x1eedb60) at si_descriptors.c:2019
#15 0x00007fd3738e0983 in si_begin_new_cs (ctx=ctx@entry=0x1eedb60) at si_hw_context.c:227
#16 0x00007fd3738e14d3 in si_context_gfx_flush (context=0x1eedb60, flags=0, fence=0x0) at si_hw_context.c:162
#17 0x00007fd37399c2a7 in r600_flush_from_st (ctx=0x1eedb60, fence=0x0, flags=<optimized out>) at r600_pipe_common.c:381
#18 0x00007fd3735587ff in st_flush (st=st@entry=0x3e33870, fence=fence@entry=0x0, flags=flags@entry=0) at state_tracker/st_cb_flush.c:87
#19 0x00007fd37355881e in st_glFlush (ctx=<optimized out>) at state_tracker/st_cb_flush.c:121
#20 0x00007fd3733f7d71 in _mesa_flush (ctx=0x42cb4d0) at main/context.c:1838
#21 0x00007fd3733f8436 in _mesa_Flush () at main/context.c:1870

What happens is a race between the texture upload thread calling r600_texture_invalidate_storage() and the glFlush call in the rendering thread calling radeon_add_to_buffer_list().
In radeon_add_to_buffer_list the following code is executed:

  return rctx->ws->cs_add_buffer(
                  ring->cs, rbo->buf,
                  (enum radeon_bo_usage)(usage | RADEON_USAGE_SYNCHRONIZED),
                  rbo->domains, priority) * 4;

Meanwhile, r600_alloc_resource executes the following code:

	/* Replace the pointer such that if res->buf wasn't NULL, it won't be
	 * NULL. This should prevent crashes with multiple contexts using
	 * the same buffer where one of the contexts invalidates it while
	 * the others are using it. */
	old_buf = res->buf;
	res->buf = new_buf; /* should be atomic */

The rbo variable in radeon_add_to_buffer_list and the res variable in r600_alloc_resource refer to the same object. During the further processing in cs_add_buffer, the buffer is no longer linked to the rbo, because it has been swapped out by the other thread. r600_alloc_resource then drops its reference to the old buffer, so the reference count reaches zero, which triggers the assert in the other thread (the assert checks the reference count).
Without the assert enabled, the buf object is cleaned up, which sets its prev/next pointers to NULL and causes a crash in pb_cache_reclaim_buffer when it walks its bucket/cache list of buffers.
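
The interleaving described above can be replayed deterministically in a small single-threaded C model. This is only a sketch under assumptions: toy_buf, toy_resource, toy_ref, toy_unref, invalidate and replay are hypothetical names, not Mesa code, and the zero-count check merely stands in for the assert mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the winsys buffer and the shared resource. */
struct toy_buf { int refcount; int destroyed; };
struct toy_resource { struct toy_buf *buf; };

/* Drop one reference; the buffer is destroyed when the count hits zero. */
static void toy_unref(struct toy_buf *b)
{
    if (--b->refcount == 0)
        b->destroyed = 1;
}

/* Take a reference. Returns 0 when the count is already zero, which is
 * the situation the assert in the report catches. */
static int toy_ref(struct toy_buf *b)
{
    if (b->refcount == 0)
        return 0;
    b->refcount++;
    return 1;
}

/* Upload thread: swap in a new buffer and release the old one. */
static void invalidate(struct toy_resource *res, struct toy_buf *new_buf)
{
    struct toy_buf *old_buf = res->buf;
    res->buf = new_buf;
    toy_unref(old_buf);
}

/* Replay one interleaving. Returns 1 if the render thread could safely
 * take its reference, 0 if it touched a dead buffer. */
static int replay(int racy)
{
    struct toy_buf old_buf = { 1, 0 }, new_buf = { 1, 0 };
    struct toy_resource res = { &old_buf };
    int ok;

    /* Render thread: reads the buffer pointer from the shared resource. */
    struct toy_buf *seen = res.buf;

    if (racy) {
        invalidate(&res, &new_buf); /* upload thread runs in between */
        ok = toy_ref(seen);         /* render thread resumes too late */
    } else {
        ok = toy_ref(seen);         /* render thread finishes first */
        invalidate(&res, &new_buf);
    }
    return ok;
}
```

In this model replay(1) returns 0 because the old buffer has already been released when the render thread tries to reference it, while replay(0) returns 1. An atomic pointer store alone does not close this window; the read of the pointer and the reference increment would have to happen as one unit.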

We performed a couple of tests:
- Performing the texture upload in the render thread (via a dirty hack in our code): the stability issue is gone.
- Making r600_can_invalidate_texture() always return false, so that the reallocation is never done: the stability issue is gone.

These two tests prove that the race condition comes from the multi-threading aspect and from the texture invalidation during texture upload.

I suspect the problem lies in this check in r600_texture_transfer_map():

			if (r600_can_invalidate_texture(rctx->screen, rtex,
							usage, box))
				r600_texture_invalidate_storage(rctx, rtex);
			else
				use_staging_texture = true;

Here r600_can_invalidate_texture() returns true, although it should not, because shortly afterwards the texture is used in another thread by the glFlush command.
Comment 1 Nicolai Hähnle 2017-10-17 20:34:27 UTC
Do you have a simple test application you can share that reproduces this reliably?
Comment 2 Nicolai Hähnle 2017-10-17 21:18:36 UTC
Oh wow, now that I've actually looked at the issue in more detail, I'm pretty amazed that you actually managed to hit this issue! Congratulations! :)

The true analysis is a bit different, I would say. The flush ends up accessing the texture because it does an automatic re-add of all resources when starting a new CS. This should not affect the ability of the other thread to do a texture invalidation (you'd just kill performance by introducing an unnecessary stall).

The real solution is certainly different. I'm currently looking at other texture-related races as well, this is just one additional one to take care of. Thank you for the report!
Comment 3 Nicolai Hähnle 2017-10-18 10:43:37 UTC
After thinking about it some more, I think it's very likely that your application also has a bug, a write-after-read bug to be precise.

What I'm suspecting is that you're doing this:

  Thread 1              Thread 2
  --------              --------
  glBindTexture(tex);
  glDraw*(...);
  glFlush();
                        glTextureSubImage(tex, ...);

Unless you use glFinish() or glFenceSync() / glWaitSync() synchronization, there is no guarantee that thread 1's draw has completed before thread 2's texture change. In other words, the implementation is allowed to execute the texture modification *before* the draw. Especially with Gallium threading, this is quite likely to happen.
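
The ordering argument can be illustrated with a toy model in C. It assumes a simplified "GPU" that may drain each context's command queue in any order; CMD_SIGNAL and CMD_WAIT play the roles of glFenceSync() and glWaitSync(). All names below (struct cmd, struct queue, step, run_upload_first and the two drawn_* helpers) are hypothetical and say nothing about how Mesa actually schedules work.

```c
#include <assert.h>
#include <stddef.h>

enum cmd_kind { CMD_DRAW, CMD_UPLOAD, CMD_SIGNAL, CMD_WAIT };

struct cmd { enum cmd_kind kind; int value; };

struct queue { const struct cmd *cmds; size_t count, next; };

struct toy_gpu {
    int texture;        /* current texture contents */
    int fence_signaled; /* set by CMD_SIGNAL */
    int drawn;          /* texture contents sampled by the draw */
};

/* Execute the next command of q unless it is blocked.
 * Returns 1 if a command was executed, 0 otherwise. */
static int step(struct toy_gpu *g, struct queue *q)
{
    if (q->next >= q->count)
        return 0;
    const struct cmd *c = &q->cmds[q->next];
    switch (c->kind) {
    case CMD_WAIT:
        if (!g->fence_signaled)
            return 0; /* blocked until the other queue signals */
        break;
    case CMD_SIGNAL: g->fence_signaled = 1; break;
    case CMD_UPLOAD: g->texture = c->value; break;
    case CMD_DRAW:   g->drawn = g->texture; break;
    }
    q->next++;
    return 1;
}

/* Worst case for the application: the upload queue runs whenever it can
 * make progress. Returns the texture contents the draw sampled. */
static int run_upload_first(struct queue render, struct queue upload)
{
    struct toy_gpu g = { 1, 0, 0 }; /* texture starts with contents 1 */
    while (step(&g, &upload) || step(&g, &render))
        ;
    return g.drawn;
}

/* Thread 1 draws, thread 2 uploads, no synchronization. */
static int drawn_without_fence(void)
{
    static const struct cmd render[] = { { CMD_DRAW, 0 } };
    static const struct cmd upload[] = { { CMD_UPLOAD, 2 } };
    struct queue r = { render, 1, 0 }, u = { upload, 1, 0 };
    return run_upload_first(r, u);
}

/* Thread 1 signals a fence after the draw, thread 2 waits on it. */
static int drawn_with_fence(void)
{
    static const struct cmd render[] = { { CMD_DRAW, 0 }, { CMD_SIGNAL, 0 } };
    static const struct cmd upload[] = { { CMD_WAIT, 0 }, { CMD_UPLOAD, 2 } };
    struct queue r = { render, 2, 0 }, u = { upload, 2, 0 };
    return run_upload_first(r, u);
}
```

In this model drawn_without_fence() returns 2, meaning the draw sampled the new texture contents (the write-after-read hazard), whereas drawn_with_fence() returns 1 because the upload cannot start until the fence placed after the draw has signaled.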

(We still also have a bug in the driver, but until I can actually double-check your code, I'd say it's quite likely that you have a write-after-read hazard like the one explained above.)
Comment 4 Luc 2017-10-18 11:29:34 UTC
Yes, we use the glFenceSync() / glWaitSync() mechanism.
We have multiple buffers in circulation, and after each vsync we check which ones can be recycled, using the non-blocking glWaitSync.
However, I will check whether this is done correctly everywhere in our code.
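
The recycling scheme described above could look roughly like the following C sketch. It is a sketch under assumptions, not the reporter's actual code: struct tex_slot and both functions are hypothetical, and the fence_signaled flag stands in for polling a real sync object after each vsync (for example with glClientWaitSync() and a zero timeout).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-texture bookkeeping. */
struct tex_slot {
    unsigned tex_id;
    int in_flight;      /* a draw using this texture has been submitted */
    int fence_signaled; /* the fence placed after that draw has signaled */
};

/* Called after each vsync: release every slot whose fence has signaled.
 * Returns the number of slots recycled. */
static int recycle_slots(struct tex_slot *slots, size_t n)
{
    int recycled = 0;
    for (size_t i = 0; i < n; i++) {
        if (slots[i].in_flight && slots[i].fence_signaled) {
            slots[i].in_flight = 0;
            slots[i].fence_signaled = 0;
            recycled++;
        }
    }
    return recycled;
}

/* Returns a slot that is safe to upload into, or NULL if every texture
 * may still be read by a pending draw. */
static struct tex_slot *acquire_upload_slot(struct tex_slot *slots, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!slots[i].in_flight)
            return &slots[i];
    }
    return NULL;
}
```

The point of the design is that the upload thread only ever receives a texture from acquire_upload_slot(), so a texture ID that bypasses this bookkeeping can still be written while a draw is pending.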

The reason for the multi-threading was the format conversion done during texture upload, which took a lot of CPU power. We now do this conversion in a worker thread with optimized code before the texture upload, so that the format is already compatible with the GPU when the upload is requested.

I therefore adapted the code so that both texture upload and rendering/flush are now done in one thread as a workaround.
Comment 5 Nicolai Hähnle 2017-10-18 12:20:26 UTC
Interesting. It's possible that there's a gap in the glWaitSync implementation. I'm still looking into these things.
Comment 6 Nicolai Hähnle 2017-10-18 12:23:59 UTC
Created attachment 134908
simple sanity check patch

Does the attached patch help?
Comment 7 Nicolai Hähnle 2017-10-18 12:30:19 UTC
Though on second thought, that patch should have no effect, assuming that you glFlush() properly after the glFenceSync().
Comment 8 Luc 2017-10-19 16:18:43 UTC
I did some further analysis of our code and found that some textures follow another path that does not use the fence/sync mechanism.
In this path the same texture ID can indeed be reused and uploaded to in the texture upload thread, which of course can happen while the rendering thread performs the glFlush.
Comment 9 Nicolai Hähnle 2017-10-23 11:33:57 UTC
I assume it's safe to close this bug report then? Please re-open if you still run into issues.
Comment 10 Luc 2017-10-24 06:48:52 UTC
Well, we have solved it in our software now.

Whether this case can be closed boils down to the following question:
Should performing a texture upload on a texture ID while it is being rendered (drawn) potentially lead to a crash?
In such a case I would expect tearing, yes, but in my opinion it should not crash.

I will let the mesa3d team decide, as we solved it by adapting our code.
