I updated APL/BXT NUCs to drm-tip to test new GPU performance counter support:
drm-tip drm-tip/drm-tip 8f66eff8aa12ae767fd53f6b0e3badc2258d3d5c
Author: Sean Paul <email@example.com>
AuthorDate: Mon Jun 19 13:58:05 2017 -0400
Commit: Sean Paul <firstname.lastname@example.org>
CommitDate: Mon Jun 19 13:58:05 2017 -0400
Parent: 2a9433f12d1f Merge remote-tracking branch 'danvet/topic/e1000e-fix' into drm-tip
Follows: v4.12-rc5 (1599)
drm-tip: 2017y-06m-19d-17h-57m-19s UTC integration manifest
I found that performance counters worked, but the system failed 90% of Vulkan CTS tests.
It may be easiest to observe this with the crucible test suite, which fails 90% of tests with errors like:
crucible [master]: error : func.miptree.r8g8b8a8-unorm.aspect-color.view-3d.levels01.extent-32x32x32.upload-copy-from-buffer.download-copy-to-linear-image: cru_image_compare_rect: diff found in row 0 of rect
Reverting to the stock 4.9 kernel fixed the regressions.
I used mesa c2f82fc1d3c for testing.
I didn't check other platforms, so it may be BXT specific.
GL/GLES tests did not regress.
Vulkan is accessing the incoherent buffers without ever alerting the kernel to the change of cache domain. Vulkan tries to take over cache management, but leaves the kernel thinking it still has to manage it as well.
diff --git a/src/intel/vulkan/anv_gem.c b/src/intel/vulkan/anv_gem.c
index 401580cdf9..24355a23e4 100644
@@ -50,6 +50,7 @@ anv_gem_create(struct anv_device *device, uint64_t size)
+ anv_gem_set_domain(device, gem_create.handle, I915_GEM_DOMAIN_GTT, 0);
aligns the cache domain in the kernel with the usage on bxt and makes crucible happy. It is not CPU relocs. It will take some time to figure out what disagreement is causing the confusion. For example, one of the changes is that the kernel does its clflushing asynchronously, and given that vulkan doesn't want or need the kernel to clflush at all, that might lead to some data conflict. But that's a long shot.
It's the use of ASYNC which tells the kernel to skip its implicit sync of the CPU cache, which vulkan is relying upon by accident to flush the CPU cache of render targets.
(In reply to Chris Wilson from comment #3)
> It's the use of ASYNC which tells the kernel to skip its implicit sync of the
> CPU cache, which vulkan is relying upon by accident to flush the CPU cache
> of render targets.
That makes some sense. It could also explain the hang issue that someone reported with the Sascha multithreading demo.
That said, what do you mean by "flushing the CPU cache of render targets"? Why does the CPU cache need flushing?
Any new bo returned to userspace is zeroed. That data may entirely reside in the CPU cache, and may remain there across the GPU write into main memory. The clflush performed before the read to invalidate the cache may in fact just cause the writeback of the cache to main memory, overwriting the results.
In that case, by telling the kernel we want it in the GTT domain, it will do the clflush and then always treat it as uncached. (In particular, any relocations will be either done through a GTT map or clflushed.) That seems like a perfectly reasonable thing for anv to do. I'm not entirely sure where we want to put the SET_DOMAIN call; gem_create is probably ok, but maybe anv_bo_init_new would be better.
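The failure mode described in the two comments above can be illustrated with a toy model of a single cache line. This is only a sketch under stated assumptions: the `toy_*` names are hypothetical, none of this is kernel or Mesa code, and it assumes a GPU write that bypasses the CPU cache entirely, as on the non-coherent SoC parts discussed here. Passing `flush_before_gpu = true` stands in for the effect of moving the new bo to the GTT domain at creation time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one cache line of a buffer object (hypothetical, not i915). */
struct toy_line {
    uint32_t memory; /* what main memory holds */
    uint32_t cache;  /* what the CPU cache line holds */
    bool cached;     /* line is present in the CPU cache */
    bool dirty;      /* cache holds data not yet written back */
};

/* A new bo is zeroed through the CPU, so the zeros may exist only in cache. */
static void toy_bo_create(struct toy_line *l)
{
    l->memory = 0xdeadbeefu; /* stale contents of the backing page */
    l->cache = 0;
    l->cached = true;
    l->dirty = true;
}

/* clflush: write the line back if it is dirty, then invalidate it. */
static void toy_clflush(struct toy_line *l)
{
    if (l->cached && l->dirty)
        l->memory = l->cache;
    l->cached = false;
    l->dirty = false;
}

/* Assumption of the model: the GPU writes main memory without snooping. */
static void toy_gpu_write(struct toy_line *l, uint32_t value)
{
    l->memory = value;
}

/* CPU read: hit the cache if the line is present, else fill from memory. */
static uint32_t toy_cpu_read(struct toy_line *l)
{
    if (!l->cached) {
        l->cache = l->memory;
        l->cached = true;
        l->dirty = false;
    }
    return l->cache;
}

/* Render one pixel into a fresh bo and read it back on the CPU. */
static uint32_t toy_render_and_read(bool flush_before_gpu, uint32_t pixel)
{
    struct toy_line l;

    toy_bo_create(&l);
    assert(l.cached && l.dirty); /* the zeros are still sitting in the cache */

    if (flush_before_gpu)
        toy_clflush(&l); /* zeros reach memory before the GPU writes */

    toy_gpu_write(&l, pixel);
    toy_clflush(&l); /* the "invalidate" that userspace does before reading */
    return toy_cpu_read(&l);
}
```

In this model, `toy_render_and_read(false, pixel)` returns 0 because the late clflush writes the stale zeros back over the GPU result, while `toy_render_and_read(true, pixel)` returns `pixel`.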
Created attachment 132507
Conversation with Chris
After a lengthy IRC conversation with Chris, we determined that this is actually a kernel bug with a fairly simple fix.
Jason mentioned that this is related to bug 100932, so I am listing all GEN8+ SoCs as affected and raising the importance (an issue related to GPU hangs in many tests and rendering errors in all Vulkan programs cannot be medium/normal).
Can we get a kernel owner for this bug?
Chris has a patch:
Tested Chris' patch with drm-tip and Mesa git. It fixes rendering on all SoC platforms: BXT, BSW, BYT.
I didn't see any GPU hangs anymore either (see bug 100932), but those may have been fixed already yesterday by something else.
(On BYT, Vulkan tests can still randomly get device lost errors, but at least they render now correctly.)
Author: Chris Wilson <email@example.com>
Date: Fri Jul 21 15:50:37 2017 +0100
drm/i915: Force CPU synchronisation even if userspace requests ASYNC
The goal here was to minimise doing any thing or any check inside the
kernel that was not strictly required. For a userspace that assumes
complete control over the cache domains, the kernel is usually using
outdated information and may trigger clflushes where none were required.
However, swapping is a situation where userspace has no knowledge of the
domain transfer, and will leave the object in the CPU cache. The kernel
must flush this out to the backing storage prior to use with the GPU. As
we use an asynchronous task tracked by an implicit fence for this, we
also need to cancel the ASYNC flag on the object so that the object will
wait for the clflush to complete before being executed. This also absolves
userspace of the responsibility imposed by commit 77ae9957897d ("drm/i915:
Enable userspace to opt-out of implicit fencing") that its needed to ensure
that the object was out of the CPU cache prior to use on the GPU.
Fixes: 77ae9957897d ("drm/i915: Enable userspace to opt-out of implicit fencing")
Signed-off-by: Chris Wilson <firstname.lastname@example.org>
Cc: Joonas Lahtinen <email@example.com>
Cc: Jason Ekstrand <firstname.lastname@example.org>
Reviewed-by: Jason Ekstrand <email@example.com>
Reviewed-by: Joonas Lahtinen <firstname.lastname@example.org>
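The rule the commit message describes can be sketched as a small decision function. This is a toy model only: `toy_object`, `TOY_OBJECT_ASYNC` and `toy_gpu_waits_for_flush` are hypothetical stand-ins and not the i915 data structures or the actual patch. The `with_fix` parameter selects between the behaviour before and after the change, as read from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-object "skip implicit sync" flag. */
#define TOY_OBJECT_ASYNC (1u << 0)

struct toy_object {
    bool cpu_dirty;     /* object still has data in the CPU cache */
    unsigned int flags; /* flags requested by userspace for this object */
};

/*
 * Returns true if GPU execution waits for the kernel's clflush of obj.
 * In this model the kernel queues an asynchronous clflush whenever the
 * object is dirty in the CPU cache, for example after being swapped in.
 */
static bool toy_gpu_waits_for_flush(struct toy_object *obj, bool with_fix)
{
    assert(obj != NULL);

    if (!obj->cpu_dirty)
        return false; /* nothing to flush, nothing to wait for */

    obj->cpu_dirty = false; /* flush has been queued */

    /* With the fix, a queued flush cancels the userspace ASYNC request. */
    if (with_fix)
        obj->flags &= ~TOY_OBJECT_ASYNC;

    /* The GPU waits on the flush only if ASYNC is not set. */
    return !(obj->flags & TOY_OBJECT_ASYNC);
}
```

Without the fix, a dirty object submitted with the ASYNC flag can be executed before the flush completes; with the fix, the flag is dropped for that object and execution waits. Clean objects keep their ASYNC behaviour in both cases.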