Bug 100932 - [regression] GPU hangs and misrendering in Vulkan programs on GEN8+ SoCs
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: high major
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 101571
Blocks:
Reported: 2017-05-04 15:35 UTC by Eero Tamminen
Modified: 2017-08-14 16:30 UTC
Description Eero Tamminen 2017-05-04 15:35:37 UTC
The last time Sascha Willems' Multithreading Vulkan demo worked fine was with:
-----------------------------
b295a52836 at 2017-04-27 16:52:25 UTC
clover: Fix build since clang r301442
-----------------------------

A day later, at:
-----------------------------
85ca563b58 at 2017-04-28 15:54:45 UTC
anv: Drop 'x11' prefix from non-X11 WSI funcs
-----------------------------

And after that, it seems to be GPU hanging on all GEN8+ SoC platforms: BSW, BXT, GLK...

On BYT, GPU hangs have started to appear in the ComputeNBody Vulkan demo (not in Multithreading), but I don't know whether they are related; there's a large gap in our data for BYT around that time.
Comment 1 Eero Tamminen 2017-05-05 07:55:09 UTC
The IndirectDraw demo is also failing on BSW/BXT/GLK.

Occasionally the compute demos (Raytracing and N-Body, the ones that are actually compute bound) will also fail on BXT:
Fatal : VkResult is "ERROR_DEVICE_LOST" in /home/testrunner/work/SachaWillemsVulkan/base/VulkanTextOverlay.hpp at line 288

(Too many things cause GPU hangs on BXT to know whether these are GPU hangs too, but at least on other platforms GPU hangs cause the above error message.)
Comment 2 Jason Ekstrand 2017-05-05 22:34:02 UTC
This bisects to the following commit:

commit 35e626bd0e59e7ce9fd97ccef66b2468c09206a4
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Thu Apr 13 16:30:19 2017 -0700

    anv: Set EXEC_OBJECT_ASYNC when available
    
    Reviewed-by: Chad Versace <chadversary@chromium.org>

I have no idea why it's broken yet.  This seems like it should be safe but maybe the kernel doesn't work quite the way I think it does.
Comment 3 Jason Ekstrand 2017-05-05 23:02:35 UTC
Ok, I managed to track down what I think the problem is.  When a BO has the EXEC_OBJECT_ASYNC flag set, the kernel is not properly flushing write-combine mappings.  I'm not sure if this is a kernel bug or if that behavior is intended.
Comment 4 Chris Wilson 2017-05-08 12:54:09 UTC
Write-combined mappings? They are always flushed. Write-back mappings on byt/bsw/bxt are left to you since you opt out of the kernel synchronisation.
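
As an illustration of what "left to you" can mean in practice (an editorial sketch, not code taken from Mesa or the kernel): on platforms without a shared last-level cache, data written through a write-back CPU mapping has to be flushed from the CPU caches by userspace before the GPU reads it. The helper name flush_range and the 64-byte cache line size are assumptions made for this example. On x86 builds with SSE2 it issues clflush per cache line between memory fences; elsewhere it only counts the lines it would have flushed.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */
#endif

/* Assumption for this sketch: 64-byte cache lines. */
#define CACHELINE_SIZE 64

/*
 * Hypothetical helper: flush every cache line that overlaps
 * [start, start + size) and return how many lines were covered.
 * Without SSE2 no flush instruction is issued and only the count
 * is computed.
 */
static size_t flush_range(const void *start, size_t size)
{
    if (size == 0)
        return 0;

    /* Round the start address down to a cache line boundary. */
    uintptr_t p = (uintptr_t)start & ~(uintptr_t)(CACHELINE_SIZE - 1);
    uintptr_t end = (uintptr_t)start + size;
    size_t lines = 0;

#if defined(__SSE2__)
    _mm_mfence(); /* order earlier CPU stores before the flushes */
#endif
    for (; p < end; p += CACHELINE_SIZE) {
#if defined(__SSE2__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }
#if defined(__SSE2__)
    _mm_mfence(); /* make sure the flushes complete before submitting */
#endif
    return lines;
}
```

A caller would run something like flush_range(map, bytes_written) on each CPU-written write-back mapping before submitting the batch that reads it. Whether a given buffer needs this depends on the platform and on how the buffer was mapped.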
Comment 5 Eero Tamminen 2017-05-17 14:21:34 UTC
(In reply to Eero Tamminen from comment #0)
> Last time Sascha Willems' Multithreading Vulkan demo worked fine was with:
> -----------------------------
> b295a52836 at 2017-04-27 16:52:25 UTC
> clover: Fix build since clang r301442
> -----------------------------
> 
> Day later, at:
> -----------------------------
> 85ca563b58 at 2017-04-28 15:54:45 UTC
> anv: Drop 'x11' prefix from non-X11 WSI funcs
> -----------------------------
> 
> And after that, it seems to be GPU hanging on all GEN8+ SOC platforms: BSW,
> BXT, GLK...

Btw, the demos that still work got much faster during that time frame, and that could have exposed the issue.  E.g. compute raytracing performance doubled.

(I don't see what in the Mesa commits between these two could explain such a performance improvement, and I don't yet have a bisecting setup for Vulkan.)
Comment 6 Eero Tamminen 2017-07-05 10:41:44 UTC
(In reply to Jason Ekstrand from comment #3)
> Ok, I managed to track down what I think the problem is.  When a BO has the
> EXEC_OBJECT_ASYNC flag set, the kernel is not properly flushing
> write-combine mappings.  I'm not sure if this is a kernel bug or if that
> behavior is intended.

(In reply to Chris Wilson from comment #4)
> Write-combined mappings? They are always flushed. Write-back mappings on
> byt/bsw/bxt are left to you since you opt out of the kernel synchronisation.

Hangs are still happening.  Rendering in all Vulkan demos has also been broken on all SoCs since this change.

Is there a plan to address the issue (in some other way than reverting the breaking commit)?
Comment 7 Eero Tamminen 2017-07-10 07:42:00 UTC
Jason mentioned that there's a kernel bug related to this:
https://bugs.freedesktop.org/show_bug.cgi?id=101571

(My testing is done with a kernel from around the same time as Mesa.)
Comment 8 Eero Tamminen 2017-07-14 13:58:16 UTC
Didn't see hangs anymore last night, and Chris' patch in bug 101571 fixes the rendering issues.
Comment 9 Eero Tamminen 2017-08-14 16:30:39 UTC
This was fixed with bug 101571.  Vulkan tests now work fine on BXT & BSW.  Vulkan compute tests still have issues on BYT, but that's not a kernel issue.