Bug 101571

Summary: [BXT/BSW/GLK] Vulkan tests fail on 4.12 drm-tip
Product: DRI Reporter: Mark Janes <mark.a.janes>
Component: DRM/IntelAssignee: Chris Wilson <chris>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: eero.t.tamminen, intel-gfx-bugs, jason
Version: unspecified   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=100932
Whiteboard: PatchMerged
i915 platform: BSW/CHT, BXT, BYT, GLK i915 features: GEM/Other
Bug Depends on:    
Bug Blocks: 100932    
Attachments:
Description Flags
Conversation with Chris none

Description Mark Janes 2017-06-23 17:58:12 UTC
I updated APL/BXT NUCs to drm-tip to test new GPU performance counter support:

drm-tip drm-tip/drm-tip 8f66eff8aa12ae767fd53f6b0e3badc2258d3d5c
Author:     Sean Paul <seanpaul@chromium.org>
AuthorDate: Mon Jun 19 13:58:05 2017 -0400
Commit:     Sean Paul <seanpaul@chromium.org>
CommitDate: Mon Jun 19 13:58:05 2017 -0400

Parent:     2a9433f12d1f Merge remote-tracking branch 'danvet/topic/e1000e-fix' into drm-tip
Merged:     drm-tip
Containing: drm-tip
Follows:    v4.12-rc5 (1599)

drm-tip: 2017y-06m-19d-17h-57m-19s UTC integration manifest

I found that performance counters worked, but the system failed 90% of vulkan CTS tests.

It may be easiest to observe this with the crucible test suite, which fails 90% of tests with errors like:
crucible [master]: error  : func.miptree.r8g8b8a8-unorm.aspect-color.view-3d.levels01.extent-32x32x32.upload-copy-from-buffer.download-copy-to-linear-image: cru_image_compare_rect: diff found in row 0 of rect

Reverting to the stock 4.9 kernel fixed the regressions.

I used mesa c2f82fc1d3c for testing.

I didn't check other platforms, so it may be BXT specific.

GL/GLES tests did not regress.
Comment 1 Chris Wilson 2017-06-23 22:56:10 UTC
Vulkan is accessing the incoherent buffers without alerting the kernel ever to the change of cache domain. Vulkan tries to take over cache management, but left the kernel thinking it still had to manage it as well..
Comment 2 Chris Wilson 2017-06-23 23:16:07 UTC
diff --git a/src/intel/vulkan/anv_gem.c b/src/intel/vulkan/anv_gem.c
index 401580cdf9..24355a23e4 100644
--- a/src/intel/vulkan/anv_gem.c
+++ b/src/intel/vulkan/anv_gem.c
@@ -50,6 +50,7 @@ anv_gem_create(struct anv_device *device, uint64_t size)
       return 0;
    }
 
+   anv_gem_set_domain(device, gem_create.handle, I915_GEM_DOMAIN_GTT, 0);
    return gem_create.handle;
 }
 

aligns the cache domain in the kernel with the usage bxt and makes crucible happy. It is not CPU relocs. Will take some time to figure out what disagreement is causing the confusion, e.g. one of the changes is that the kernel does its clflushing asynchronously and given that vulkan doesn't want/need the kernel to clflush at all might lead to some data conflict. But that's a long shot.
Comment 3 Chris Wilson 2017-06-28 19:57:30 UTC
It's the use of ASYNC which tells the kernel to skip its implict sync of the CPU cache, which vulkan is relying upon by accident to flush the CPU cache of render targets.
Comment 4 Jason Ekstrand 2017-07-06 00:01:31 UTC
(In reply to Chris Wilson from comment #3)
> It's the use of ASYNC which tells the kernel to skip its implict sync of the
> CPU cache, which vulkan is relying upon by accident to flush the CPU cache
> of render targets.

That makes some sense.  It could also explain the hang issue that someone reported with the Sascha multithreading demo.

That said, what do you mean by "flushing the CPU cache of render targets"?  Why does the CPU cache need flushing?
Comment 5 Chris Wilson 2017-07-06 07:18:21 UTC
Any new bo returned to userspace is zeroed. That data may entirely reside in the CPU cache, and may remain there across the GPU write into main memory. The clflush performed before the read to invalidate the cache may in fact just cause the writeback of the cache to main memory, overwriting the results.
Comment 6 Jason Ekstrand 2017-07-06 17:29:56 UTC
In that case, by telling the kernel we want it in the GTT domain, it will do the clflush and then always treat it as uncached.  (In particular, any relocations will be either done through a GTT map or clflushed.)  That seems like a perfectly reasonable thing for anv to do.  I'm not entirely sure where we want to put the SET_DOMAIN call but gem_create is probably ok but maybe anv_bo_init_new would be better.
Comment 7 Jason Ekstrand 2017-07-07 18:45:08 UTC
Created attachment 132507 [details]
Conversation with Chris

After a lengthy IRC conversation with Chris, we determined that this is actually a kernel bug with a fairly simple fix.
Comment 8 Eero Tamminen 2017-07-10 07:45:26 UTC
Jason mentioned that is relate to bug 100932, so listing all GEN8+ SOCs as affected and raising importance (issue related to GPU hangs in many tests and rendering errors in all Vulkan programs cannot be medium/normal).
Comment 9 Annie 2017-07-13 21:34:58 UTC
Can we get a kernel owner for this bug?
Comment 10 Jason Ekstrand 2017-07-13 22:08:04 UTC
Chris has a patch:

https://patchwork.freedesktop.org/patch/166536/
Comment 11 Eero Tamminen 2017-07-14 13:56:39 UTC
Tested Chris' patch with drm-tip and Mesa git. It fixes rendering on all SoC platforms: BXT, BSW, BYT.

I didn't see any GPU hangs anymore either (see bug 100932), but those may have been fixed already yesterday by something else.

(On BYT, Vulkan tests can still randomly get device lost errors, but at least they  render now correctly.)
Comment 12 Chris Wilson 2017-07-26 10:48:28 UTC
commit 686348c782d3247e3bf39fce98d4ac44e147a746
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 21 15:50:37 2017 +0100

    drm/i915: Force CPU synchronisation even if userspace requests ASYNC
    
    The goal here was to minimise doing any thing or any check inside the
    kernel that was not strictly required. For a userspace that assumes
    complete control over the cache domains, the kernel is usually using
    outdated information and may trigger clflushes where none were
    required.
    
    However, swapping is a situation where userspace has no knowledge of the
    domain transfer, and will leave the object in the CPU cache. The kernel
    must flush this out to the backing storage prior to use with the GPU. As
    we use an asynchronous task tracked by an implicit fence for this, we
    also need to cancel the ASYNC flag on the object so that the object will
    wait for the clflush to complete before being executed. This also absolves
    userspace of the responsibility imposed by commit 77ae9957897d ("drm/i915:
    Enable userspace to opt-out of implicit fencing") that its needed to ensure
    that the object was out of the CPU cache prior to use on the GPU.
    
    Fixes: 77ae9957897d ("drm/i915: Enable userspace to opt-out of implicit fencing")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101571
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170721145037.25105-5-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.