101571 – [BXT/BSW/GLK] Vulkan tests fail on 4.12 drm-tip

Bug 101571 - [BXT/BSW/GLK] Vulkan tests fail on 4.12 drm-tip

Summary: [BXT/BSW/GLK] Vulkan tests fail on 4.12 drm-tip

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	high major
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	PatchMerged
Keywords:

Depends on:
Blocks:	100932
	Show dependency tree / graph

Reported:	2017-06-23 17:58 UTC by Mark Janes
Modified:	2017-07-27 16:57 UTC (History)
CC List:	3 users (show)

See Also:	100932
i915 platform:	BSW/CHT, BXT, BYT, GLK
i915 features:	GEM/Other

Attachments
Conversation with Chris (11.72 KB, text/plain) 2017-07-07 18:45 UTC, Jason Ekstrand	no flags	Details
View All

Description Mark Janes 2017-06-23 17:58:12 UTC

I updated APL/BXT NUCs to drm-tip to test new GPU performance counter support:

drm-tip drm-tip/drm-tip 8f66eff8aa12ae767fd53f6b0e3badc2258d3d5c
Author:     Sean Paul <seanpaul@chromium.org>
AuthorDate: Mon Jun 19 13:58:05 2017 -0400
Commit:     Sean Paul <seanpaul@chromium.org>
CommitDate: Mon Jun 19 13:58:05 2017 -0400

Parent:     2a9433f12d1f Merge remote-tracking branch 'danvet/topic/e1000e-fix' into drm-tip
Merged:     drm-tip
Containing: drm-tip
Follows:    v4.12-rc5 (1599)

drm-tip: 2017y-06m-19d-17h-57m-19s UTC integration manifest

I found that performance counters worked, but the system failed 90% of vulkan CTS tests.

It may be easiest to observe this with the crucible test suite, which fails 90% of tests with errors like:
crucible [master]: error  : func.miptree.r8g8b8a8-unorm.aspect-color.view-3d.levels01.extent-32x32x32.upload-copy-from-buffer.download-copy-to-linear-image: cru_image_compare_rect: diff found in row 0 of rect

Reverting to the stock 4.9 kernel fixed the regressions.

I used mesa c2f82fc1d3c for testing.

I didn't check other platforms, so it may be BXT specific.

GL/GLES tests did not regress.

Comment 1 Chris Wilson 2017-06-23 22:56:10 UTC

Vulkan is accessing the incoherent buffers without alerting the kernel ever to the change of cache domain. Vulkan tries to take over cache management, but left the kernel thinking it still had to manage it as well..

Comment 2 Chris Wilson 2017-06-23 23:16:07 UTC

diff --git a/src/intel/vulkan/anv_gem.c b/src/intel/vulkan/anv_gem.c
index 401580cdf9..24355a23e4 100644
--- a/src/intel/vulkan/anv_gem.c
+++ b/src/intel/vulkan/anv_gem.c
@@ -50,6 +50,7 @@ anv_gem_create(struct anv_device *device, uint64_t size)
       return 0;
    }
 
+   anv_gem_set_domain(device, gem_create.handle, I915_GEM_DOMAIN_GTT, 0);
    return gem_create.handle;
 }
 

aligns the cache domain in the kernel with the usage bxt and makes crucible happy. It is not CPU relocs. Will take some time to figure out what disagreement is causing the confusion, e.g. one of the changes is that the kernel does its clflushing asynchronously and given that vulkan doesn't want/need the kernel to clflush at all might lead to some data conflict. But that's a long shot.

Comment 3 Chris Wilson 2017-06-28 19:57:30 UTC

It's the use of ASYNC which tells the kernel to skip its implict sync of the CPU cache, which vulkan is relying upon by accident to flush the CPU cache of render targets.

Comment 4 Jason Ekstrand 2017-07-06 00:01:31 UTC

(In reply to Chris Wilson from comment #3)
> It's the use of ASYNC which tells the kernel to skip its implict sync of the
> CPU cache, which vulkan is relying upon by accident to flush the CPU cache
> of render targets.

That makes some sense.  It could also explain the hang issue that someone reported with the Sascha multithreading demo.

That said, what do you mean by "flushing the CPU cache of render targets"?  Why does the CPU cache need flushing?

Comment 5 Chris Wilson 2017-07-06 07:18:21 UTC

Any new bo returned to userspace is zeroed. That data may entirely reside in the CPU cache, and may remain there across the GPU write into main memory. The clflush performed before the read to invalidate the cache may in fact just cause the writeback of the cache to main memory, overwriting the results.

Comment 6 Jason Ekstrand 2017-07-06 17:29:56 UTC

In that case, by telling the kernel we want it in the GTT domain, it will do the clflush and then always treat it as uncached.  (In particular, any relocations will be either done through a GTT map or clflushed.)  That seems like a perfectly reasonable thing for anv to do.  I'm not entirely sure where we want to put the SET_DOMAIN call but gem_create is probably ok but maybe anv_bo_init_new would be better.

Comment 7 Jason Ekstrand 2017-07-07 18:45:08 UTC

Created attachment 132507 [details]
Conversation with Chris

After a lengthy IRC conversation with Chris, we determined that this is actually a kernel bug with a fairly simple fix.

Comment 8 Eero Tamminen 2017-07-10 07:45:26 UTC

Jason mentioned that is relate to bug 100932, so listing all GEN8+ SOCs as affected and raising importance (issue related to GPU hangs in many tests and rendering errors in all Vulkan programs cannot be medium/normal).

Comment 9 Annie 2017-07-13 21:34:58 UTC

Can we get a kernel owner for this bug?

Comment 10 Jason Ekstrand 2017-07-13 22:08:04 UTC

Chris has a patch:

https://patchwork.freedesktop.org/patch/166536/

Comment 11 Eero Tamminen 2017-07-14 13:56:39 UTC

Tested Chris' patch with drm-tip and Mesa git. It fixes rendering on all SoC platforms: BXT, BSW, BYT.

I didn't see any GPU hangs anymore either (see bug 100932), but those may have been fixed already yesterday by something else.

(On BYT, Vulkan tests can still randomly get device lost errors, but at least they  render now correctly.)

Comment 12 Chris Wilson 2017-07-26 10:48:28 UTC

commit 686348c782d3247e3bf39fce98d4ac44e147a746
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 21 15:50:37 2017 +0100

    drm/i915: Force CPU synchronisation even if userspace requests ASYNC
    
    The goal here was to minimise doing any thing or any check inside the
    kernel that was not strictly required. For a userspace that assumes
    complete control over the cache domains, the kernel is usually using
    outdated information and may trigger clflushes where none were
    required.
    
    However, swapping is a situation where userspace has no knowledge of the
    domain transfer, and will leave the object in the CPU cache. The kernel
    must flush this out to the backing storage prior to use with the GPU. As
    we use an asynchronous task tracked by an implicit fence for this, we
    also need to cancel the ASYNC flag on the object so that the object will
    wait for the clflush to complete before being executed. This also absolves
    userspace of the responsibility imposed by commit 77ae9957897d ("drm/i915:
    Enable userspace to opt-out of implicit fencing") that its needed to ensure
    that the object was out of the CPU cache prior to use on the GPU.
    
    Fixes: 77ae9957897d ("drm/i915: Enable userspace to opt-out of implicit fencing")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101571
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170721145037.25105-5-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.