Summary: | [EGL, i965] dEQP-EGL.functional.sharing.gles2.multithread.simple_egl_server_sync.textures.copyteximage2d_texsubimage2d_render | ||
---|---|---|---|
Product: | Mesa | Reporter: | Chad Versace <chadversary> |
Component: | Drivers/DRI/i965 | Assignee: | Topi Pohjolainen <topi.pohjolainen> |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | ben, chadversary, harish.krupo.kps, jason, kenneth, lemody, mark.a.janes, pallavi.g, topi.pohjolainen |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
See Also: |
https://bugs.freedesktop.org/show_bug.cgi?id=99085 https://bugs.freedesktop.org/show_bug.cgi?id=98602 |
||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | |||
Bug Blocks: | 98315 | ||
Attachments: |
hack
patch to remove meta usage |
Description
Chad Versace
2016-12-27 22:59:50 UTC
I spent 2 hours on this one today, and I'm lost. This is sometimes passing for me. It seems image comparison fails, unfortunately this test does not dump the image data within TestResults.qpa, maybe need to start with that to see what's the diff. +ken +jason Helgrind complains about potential read/write races on drm_intel_gem_bo::offset64. According to helgrind, the conflict occurs when intel_batchbuffer_flush() updates the offset in thread1 and brw_update_texture_surfaces() reads the offset during batchbuffer construction in thread2. Below is the relevant Helgrind snippet. Helgrind claims to detect this error exactly once during the testrun. Possible data race during write of size 8 at 0xCD29890 by thread #2 Locks held: 3, at addresses 0x7E49FE8 0x7E6E3B8 0xA382CB0 at 0xB0E1DB7: drm_intel_update_buffer_offsets2 (intel_bufmgr_gem.c:2292) by 0xB0E2316: do_exec2 (intel_bufmgr_gem.c:2450) by 0xB0E24C0: drm_intel_gem_bo_context_exec (intel_bufmgr_gem.c:2493) by 0xA92EE5E: do_flush_locked (intel_batchbuffer.c:359) by 0xA92F0F1: _intel_batchbuffer_flush (intel_batchbuffer.c:422) by 0xA905724: brw_fence_insert (brw_sync.c:86) by 0xA905B7D: brw_dri_create_fence (brw_sync.c:255) by 0x811F886: dri2_create_sync (egl_dri2.c:2563) by 0x8112249: _eglCreateSync (eglapi.c:1659) by 0x811237B: eglCreateSyncKHR (eglapi.c:1684) by 0x79D789: eglw::FuncPtrLibrary::createSyncKHR(void*, unsigned int, int const*) const (eglwFuncPtrLibraryImpl.inl:94) by 0x60E798: deqp::egl::GLES2ThreadTest::FenceSync::init(deqp::egl::GLES2ThreadTest::EGLThread&, bool) (teglGLES2SharingThreadedTests.cpp:301) This conflicts with a previous read of size 8 by thread #3 Locks held: 2, at addresses 0xA386C40 0xCD29130 at 0xB0E14B9: do_bo_emit_reloc (intel_bufmgr_gem.c:2042) by 0xB0E1714: drm_intel_gem_bo_emit_reloc (intel_bufmgr_gem.c:2104) by 0xB0D651D: drm_intel_bo_emit_reloc (intel_bufmgr.c:205) by 0xA90F926: brw_emit_surface_state (brw_wm_surface_state.c:173) by 0xA9108A8: brw_update_texture_surface (brw_wm_surface_state.c:646) by 0xA911C18: update_stage_texture_surfaces (brw_wm_surface_state.c:1271) by 0xA911D42: brw_update_texture_surfaces (brw_wm_surface_state.c:1302) by 0xA904480: check_and_emit_atom (brw_state_upload.c:771) Address 0xcd29890 is 48 bytes inside a block of size 360 alloc'd at 0x4C2DE60: calloc (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so) by 0xB0DACBF: drm_intel_gem_bo_alloc_internal (intel_bufmgr_gem.c:811) by 0xB0DBADB: drm_intel_gem_bo_alloc_tiled (intel_bufmgr_gem.c:945) by 0xB0D61A7: drm_intel_bo_alloc_tiled (intel_bufmgr.c:83) by 0xA93A0E4: miptree_create (intel_mipmap_tree.c:664) by 0xA93A154: intel_miptree_create (intel_mipmap_tree.c:688) by 0xA948CAB: intel_miptree_create_for_teximage (intel_tex_image.c:109) by 0xA9476F3: intel_alloc_texture_image_buffer (intel_tex.c:95) by 0xA948E01: intelTexImage (intel_tex_image.c:140) by 0xA5C8BAC: _mesa_get_fallback_texture (texobj.c:1049) by 0xA5D2385: update_single_program_texture (texstate.c:563) by 0xA5D242A: update_program_texture_state (texstate.c:588) Block was alloc'd by thread #2 (In reply to Chad Versace from comment #3) > +ken +jason > > Helgrind complains about potential read/write races on > drm_intel_gem_bo::offset64. According to helgrind, the conflict occurs when > intel_batchbuffer_flush() updates the offset in thread1 and > brw_update_texture_surfaces() reads the offset during batchbuffer > construction in thread2. I think that should be harmless. offset64 is the presumed location of the buffer - i.e. our guess where the kernel relocated it to on the last execbuf. If we guess correctly, the kernel sees that everything's where we think it is and skips relocating things. If we guess wrong, it goes ahead and does relocations anyway. At least, that's how I understand it. It's good to know about this though. I did notice that it passes with INTEL_DEBUG=sync (which I suppose is not surprising). Created attachment 128823 [details] [review] hack I spent some time with this today. I noticed that this test passes for me consistently if I disable _mesa_meta_pbo_TexSubImage usage in intelTexSubImage. In case I let it use meta and when the test fails, variable 'create_pbo' is true. I'm attaching a hack that makes this test pass in case this gives some additional hints what's happening. Created attachment 128824 [details] [review] patch to remove meta usage Here's a patch to remove meta usage, this fixes the issue for me. I don't spot any regression in CI, only bunch of tests hat start to pass when expecting a failure. Could we do such move or is meta version preferred? (In reply to Tapani Pälli from comment #6) > Created attachment 128824 [details] [review] [review] > patch to remove meta usage > > Here's a patch to remove meta usage, this fixes the issue for me. I don't > spot any regression in CI, only bunch of tests hat start to pass when > expecting a failure. Could we do such move or is meta version preferred? Removing meta shouldn't fix anything here since meta isn't really doing anything the user couldn't already do. If removing meta fixes it, then we're just papering over some other bug. (In reply to Kenneth Graunke from comment #4) > (In reply to Chad Versace from comment #3) > > +ken +jason > > > > Helgrind complains about potential read/write races on > > drm_intel_gem_bo::offset64. According to helgrind, the conflict occurs when > > intel_batchbuffer_flush() updates the offset in thread1 and > > brw_update_texture_surfaces() reads the offset during batchbuffer > > construction in thread2. This has me concerned... > I think that should be harmless. offset64 is the presumed location of the > buffer - i.e. our guess where the kernel relocated it to on the last > execbuf. If we guess correctly, the kernel sees that everything's where we > think it is and skips relocating things. If we guess wrong, it goes ahead > and does relocations anyway. Not so much. The race is actually way more subtle. When we add a relocation to the BO, we write some value into the BO based on a read of offset64. We then pass the list of BOs off to the kernel with their offsets. If the offsets in the list match the kernel's view of addresses, then it skips doing any relocations. Suppose we have the following: 1) We create a texture. By default it has offset64 == 0 2) Thread A emits a relocation for some reason 3) Thread B does stuff, flushes the batch, and writes the new value into offset64 4) Thread A finishes what it was doing and flushes the batch Given that the kernel doesn't move things around on us like crazy, there's a decent chance that offset64 hasn't changed between (3) and (4) so the offsets will match the kernel's view when A submits in (4). However, because we emitted a reloaction in (2) and the offset changed in (3) before we submit the batch in (4), the offset we provide to the kernel in the BO list is different from the offset with which the relocation was emitted. The kernel doesn't relocate and things explode. In other words, the problem isn't so much a race between threads A and B as it is that all relocations need to happen with the same offset as the one we send off to the kernel. This is fine in a single-context environment because we only have one batch in flight at any given time and so the value of offset64 is always exactly what was returned by the kernel last time and will remain that way until the current batch is submitted and we update it. In a multi-context environment we're just toast. The solution here is that each context needs to track its own set of offsets. There's no real utility in sharing them since each hardware context has it's own GTT on modern hardware. On older hardware, we'll do the relocation for the first batch we submit with the BO but then the kernel will tell us the offset and future batches won't relocate. In either case, tracking it globally in libdrm isn't gaining us anything and is, in fact, the problem. (In reply to Jason Ekstrand from comment #7) > (In reply to Tapani Pälli from comment #6) > > Created attachment 128824 [details] [review] [review] [review] > > patch to remove meta usage > > > > Here's a patch to remove meta usage, this fixes the issue for me. I don't > > spot any regression in CI, only bunch of tests hat start to pass when > > expecting a failure. Could we do such move or is meta version preferred? > > Removing meta shouldn't fix anything here since meta isn't really doing > anything the user couldn't already do. If removing meta fixes it, then > we're just papering over some other bug. > OK, I figured it might have some bug considering the busy bo (test fails only when the bo is busy), I haven't really gone through that code though. (In reply to Jason Ekstrand from comment #7) [...] > The solution here is that each context needs to track its own set of > offsets. There's no real utility in sharing them since each hardware > context has it's own GTT on modern hardware. On older hardware, we'll do > the relocation for the first batch we submit with the BO but then the kernel > will tell us the offset and future batches won't relocate. In either case, > tracking it globally in libdrm isn't gaining us anything and is, in fact, > the problem. I infer that a correct fix (not a hack) would require i965 to stop using, at a minimum, the libdrm exec functions (drm_intel_*_exec()). We could, to mitigate the pain of fixing this, and to write a backportable patch, continue using drm_intel_bo throughout i965; we just need to, at least initially, stop using drm_intel_*_exec. Do you agree? Am I misunderstanding the problem? (In reply to Chad Versace from comment #9) > (In reply to Jason Ekstrand from comment #7) [...] > I infer that a correct fix (not a hack) would require i965 to stop using, at > a minimum, the libdrm exec functions (drm_intel_*_exec()). We could, to > mitigate the pain of fixing this, and to write a backportable patch, > continue using drm_intel_bo throughout i965; we just need to, at least > initially, stop using drm_intel_*_exec. Do you agree? Am I misunderstanding > the problem? If that's true, then... Oh Cthulhu! Why did I assign this bug to myself!?!? Jason was thinking we use I915_EXEC_NO_RELOC, but we don't. I don't think that's a problem. An easy way to sanity check that would be to remove offset64 from Mesa's relocations (effectively always programming an assumed offset of 0). I think Tapani is on to something here. If the texture is busy and we can't immediately write to it, _mesa_meta_pbo_TexSubImage decides to create a PBO, copy the user supplied data into the PBO, then essentially blit (texture/render) from the PBO to the destination texture. It does this to try and avoid stalling. So...it should totally be legal to do: bool tex_busy = true; namely, always take this "create a PBO and blit" path. It should always be viable, and this should remove some unpredictability. However, doing so uncovers a bit problem: deqp-egl: main/texobj.c:2030: _mesa_unlock_context_textures: Assertion `ctx->Shared->TextureStateStamp == ctx->TextureStateTimestamp' failed. I think the Meta path is exposing Mesa's bogus texture locking. (typo..."big problem" not "bit problem".) if you want to burn some cycles looking into making Mesa's texture locking less incompetent, that would probably be worthwhile. I'm also a bit concerned that I had to make the texture lock a recursive mutex a while back, when Kristian implemented meta fast clears...it may be that meta altering texturing state in this case isn't really valid - similar to our old recursive-meta hell. It may be that switching things to BLORP is more of an actual solution than the paper-over we feared it might be. But, again, feel free to look into it - I'm mostly speculating at this point. (In reply to Kenneth Graunke from comment #11) > Jason was thinking we use I915_EXEC_NO_RELOC, but we don't. I don't think > that's a problem. An easy way to sanity check that would be to remove > offset64 from Mesa's relocations (effectively always programming an assumed > offset of 0). I think Ken is right and it's not as dire as I'd feared. Relocations should be safe so long as we continue to not use NO_RELOC. (In reply to Kenneth Graunke from comment #14) > if you want to burn some cycles looking into making Mesa's texture locking > less incompetent, that would probably be worthwhile. I'm also a bit > concerned that I had to make the texture lock a recursive mutex a while > back, when Kristian implemented meta fast clears...it may be that meta > altering texturing state in this case isn't really valid - similar to our > old recursive-meta hell. > > It may be that switching things to BLORP is more of an actual solution than > the paper-over we feared it might be. Short-term, yes, that may fix this bug. It should be easy enough to grab Topi's patches, apply them, and see if that fixes it. Long-term, we still have a lot of meta paths and getting rid of them will take time. Not sure if it's better to get rid of them or to fix them. Topi's patches are merged now, right, is this still an issue? (In reply to Ben Widawsky from comment #16) > Topi's patches are merged now, right, is this still an issue? Which patches are these? OK found those, I've tested that 'blorp_tex_upload_2' fixes the bug but those patches are not in upstream yet. We face the similar issue in Android-IA on O. Please let me know those fixes and whether it merged in upstream? Not yet, sorry. Using BLORP for texture upload uncovered a nasty bug in our state upload code that required me to do a fairly invasive rework. That's landed. We also uncovered bugs with renderbuffer compression. The fix for that landed. With those in place, I finally was able to re-send the BLORP texture upload patches for review. Ideally they will be able to land late this week or early next week. In the meantime you can grab the latest version from my branch here: https://cgit.freedesktop.org/~kwg/mesa/log/?h=tex-upload Thank you we will check (In reply to Kenneth Graunke from comment #20) > Not yet, sorry. Using BLORP for texture upload uncovered a nasty bug in our > state upload code that required me to do a fairly invasive rework. That's > landed. We also uncovered bugs with renderbuffer compression. The fix for > that landed. With those in place, I finally was able to re-send the BLORP > texture upload patches for review. Ideally they will be able to land late > this week or early next week. > > In the meantime you can grab the latest version from my branch here: > https://cgit.freedesktop.org/~kwg/mesa/log/?h=tex-upload Fixed by: commit dffda6cbbbc1a8c2f7125db1b9cf15fbcdc8eb11 Author: Jason Ekstrand <jason.ekstrand@intel.com> Date: Tue Jun 6 09:58:07 2017 -0700 i965: Use blorp instead of meta for PBO texture uploads Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.