Bug 108531 - [skl DXVK] GPU hang with Megadimension Neptunia VIIR
Summary: [skl DXVK] GPU hang with Megadimension Neptunia VIIR
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 107763
Reported: 2018-10-23 16:12 UTC by leozinho29_eu
Modified: 2019-03-15 02:31 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg, card0/error and save (803.31 KB, application/gzip)
2018-10-23 16:12 UTC, leozinho29_eu
one frame from the main menu with culprit shader (15.35 MB, application/gzip)
2018-11-02 14:53 UTC, Danylo
Patch to disable the stencil PMA optimization (744 bytes, patch)
2018-11-05 21:52 UTC, Jason Ekstrand
card0/error (539.65 KB, text/plain)
2018-11-05 23:02 UTC, leozinho29_eu

Description leozinho29_eu 2018-10-23 16:12:25 UTC
Created attachment 142152 [details]
dmesg, card0/error and save

In the game Megadimension Neptunia VIIR, a GPU hang happens when the player tries to enter the map Jingu Sakura Park - Inner. Because the game saves at that point, loading the save triggers the GPU hang again, so nothing can be done with the save and it is basically useless.

Steps to reproduce:

1) Install Steam;
2) Enable Steam Beta;
3) Enable Proton, minimum version 3.7-8;
4) Install Megadimension Neptunia VIIR;
5) Play until reaching Jingu Sakura Park;
6) Try to access Jingu Sakura Park - Inner;
7) Notice GPU hang.

Since reaching that point can take up to 6 hours of gameplay (performance is REALLY BAD, be aware), my saves are attached too. The remote directory should be copied to $WHERESTEAMISINSTALLED/Steam/userdata/$USERNUMBER/774511/. Saves 1 and 3 are the relevant ones.
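
The copy step can be sketched in Python as follows. This is only an illustration: the helper name `install_saves` and its arguments are hypothetical, and only the `userdata/$USERNUMBER/774511/` layout is taken from the text above.

```python
import shutil
from pathlib import Path

APP_ID = "774511"  # directory name given in the report


def install_saves(remote_dir, steam_root, user_number):
    """Copy the attached 'remote' directory into the game's userdata folder.

    steam_root stands for $WHERESTEAMISINSTALLED/Steam and user_number for
    $USERNUMBER; both are placeholders the caller must fill in.
    """
    dest = Path(steam_root) / "userdata" / str(user_number) / APP_ID / "remote"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok needs Python 3.8+; files with the same name are overwritten.
    shutil.copytree(remote_dir, dest, dirs_exist_ok=True)
    return dest


# Example call (hypothetical paths, not executed here):
# install_saves("remote", Path.home() / ".steam" / "Steam", 12345678)
```

Back up any existing saves first, since files with the same name in the destination are replaced.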

This happens 100% of the time. Everything I tried (older Mesa, stable Mesa, DXVK from git, kernel 4.17.19, drm-tip) still hit the hang.

The attached file has dmesg, card0/error and the saves. dmesg is corrupted because of https://bugs.freedesktop.org/show_bug.cgi?id=107945 but is still readable.

System specifications:

Processor: Intel Core i3-6100U;
GPU: Intel HD Graphics 520;
Architecture: amd64;
Mesa: 18.3.0-devel (git-fdd926d5b2);
Kernel version: drm-tip (9510f8e44127260f92b5b6c3127aafa22b15f741);
Distribution: Xubuntu 18.04.1 amd64.
Comment 1 Denis 2018-10-24 12:02:14 UTC
Hi, any ideas why the game couldn't be launched on SKL?
The game window appears (a fast blink of white, then only a black window) and then closes. Launching from the command line doesn't provide any additional information.

Also, on KBL the game can be launched, but it is really slow and laggy. I see the pink logo screen, but it is drawn so slowly that I couldn't even start the first level.

SKL - HD Graphics 520
Ubuntu 16.04
kernel 4.18

Steam beta (proton 3.7-8)

KBL = HD Graphics 620
Ubuntu 16.04
kernel 4.18
Steam beta (proton 3.7-8)

On both PCs I built the latest Mesa from git with Vulkan and used it for the game.
Comment 2 leozinho29_eu 2018-10-26 13:44:13 UTC
To build Mesa I used glslang, Vulkan-Headers and Vulkan-Loader from git. Maybe the Vulkan from Ubuntu 16.04 is not new enough?

The game is crashing on Windows 10 too! It crashes with an error in the event viewer:

 Fault bucket LKD_0x141_Tdr:6_IMAGE_igdkmd64.sys_GEN9_DX10_DISPLAY, type 0

I don't know how to debug there, and on Windows 10 the game runs at 5 to 8 FPS. Reducing the resolution has no effect, which seems a bit absurd: at 640x360 I expected it to at least be fast. I can't understand what is happening with this game. Are the minimum specifications in the Steam store wrong?
Comment 3 Denis 2018-10-26 14:50:45 UTC
Hm, and what about Linux? Does it have the same FPS?
He launched the game on Debian (KBL), but it is also too laggy and slow.
Comment 4 leozinho29_eu 2018-10-26 15:13:21 UTC
On Linux, FPS is 1. One, really. intel-gpu-overlay shows Xorg waiting for 4000 ms, and when the game is focused everything on the screen visibly freezes for many seconds.

intel-gpu-overlay shows GPU rcs0 at 100% nearly all the time after the auto save warning.

The GPU requirement in the specifications says: Graphics card with 1GB VRAM or more and compatibility with Direct X 11.0 or higher

My system specifications:

DxDiag:

      Display Memory: 4176 MB
    Dedicated Memory: 128 MB
       Shared Memory: 4048 MB

glxinfo -B:

    Video memory: 3072MB

This is way higher than 1 GB. It shouldn't perform so badly.
Comment 5 Denis 2018-10-26 16:49:03 UTC
Thank you for the reply. My question was about comparing stability/performance on the Windows and Linux platforms.
As I understand it, Linux gets 1 FPS and Windows 5-8?

On our side we also found that the game runs very laggy, and one CPU core is at 100% load.

So this may be a game bug... We plan to check it on a Radeon GPU (on Linux), but after the weekend.
Comment 6 Danylo 2018-11-02 14:52:18 UTC
I've tried to look into the issue, but the performance made that extremely problematic.

So I first tried to find out what's wrong with the performance and why the main menu is so laggy. The main culprit was one draw call that does some kind of post-processing (I don't really know its intention).

This is its compilation result:

SIMD8 shader: 3619 instructions. 2 loops. 230252 cycles. 197:368 spills:fills. Promoted 9 constants. Compacted 57904 to 36320 bytes (37%)

That's the only such shader I found while trying to get past the main menu.
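
For reference, the numbers in the stats line above can be pulled apart with a few lines of Python. This is a minimal sketch: the regular expression is written only against the one line quoted here, and `parse_stats` is a hypothetical helper, not part of Mesa.

```python
import re

# The stats line exactly as quoted above.
STATS = ("SIMD8 shader: 3619 instructions. 2 loops. 230252 cycles. "
         "197:368 spills:fills. Promoted 9 constants. "
         "Compacted 57904 to 36320 bytes (37%)")

PATTERN = re.compile(
    r"SIMD(?P<width>\d+) shader: (?P<instructions>\d+) instructions\. "
    r"(?P<loops>\d+) loops\. (?P<cycles>\d+) cycles\. "
    r"(?P<spills>\d+):(?P<fills>\d+) spills:fills\. "
    r"Promoted (?P<promoted>\d+) constants\. "
    r"Compacted (?P<before>\d+) to (?P<after>\d+) bytes"
)


def parse_stats(line):
    """Return the numeric fields of one shader stats line as integers."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("unrecognized stats line")
    return {key: int(value) for key, value in m.groupdict().items()}


stats = parse_stats(STATS)
# Share of bytes removed by compaction, matching the "(37%)" in the line.
saved = 100 * (stats["before"] - stats["after"]) / stats["before"]
print(stats["spills"], stats["fills"], round(saved))  # prints: 197 368 37
```

The 197 spills and 368 fills are the part that stands out for a single fragment shader.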

I'll attach an archive with a RenderDoc trace of a frame (see call 159) and the shader in question.

My hardware is i7-7500U, HD Graphics 620 + latest Mesa git

I could not see any issues (obvious to me) in how the shader is processed that would lead to such an outcome; even the initial shader (DXBC) doesn't look good to me.

Someone more knowledgeable than me should look at it.
Comment 7 Danylo 2018-11-02 14:53:40 UTC
Created attachment 142345 [details]
one frame from the main menu with culprit shader
Comment 8 Jason Ekstrand 2018-11-05 21:52:29 UTC
Created attachment 142375 [details] [review]
Patch to disable the stencil PMA optimization

Could you please try the attached patch?  It may or may not be the issue but it's certainly something that looks like it could be going wrong.
Comment 9 leozinho29_eu 2018-11-05 23:02:38 UTC
Created attachment 142376 [details]
card0/error

This patch did not work for me; the FPS was still 1 and the GPU hang happened in the same part.
Comment 10 Denis 2018-11-06 12:27:26 UTC
Hi. Just some additional info for the bug report:

Reproduced the issue on KBL (with your save, thanks). In my case the level couldn't be loaded; the hang occurred while trying to load it.


[687462.196645] [drm] GPU HANG: ecode 9:0:0x85dfdfff, in v2r.exe [8658], reason: Hang on rcs0, action: reset
[687462.196652] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687470.132498] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687481.204237] asynchronous wait on fence i915:compiz[2005]/1:4e46bd timed out
[687481.204242] asynchronous wait on fence i915:compiz[2005]/1:4e46bd timed out
[687482.164264] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687492.148114] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687500.148024] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687508.147874] i915 0000:00:02.0: Resetting rcs0 after gpu hang
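
A log like the one above can be summarized with a short Python sketch. The patterns are written only against these quoted lines and are an assumption about the message layout, not an official i915 log format definition; `summarize` is a hypothetical helper.

```python
import re

# Log lines copied from the comment above.
LOG = """\
[687462.196645] [drm] GPU HANG: ecode 9:0:0x85dfdfff, in v2r.exe [8658], reason: Hang on rcs0, action: reset
[687462.196652] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687470.132498] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687481.204237] asynchronous wait on fence i915:compiz[2005]/1:4e46bd timed out
[687481.204242] asynchronous wait on fence i915:compiz[2005]/1:4e46bd timed out
[687482.164264] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687492.148114] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687500.148024] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[687508.147874] i915 0000:00:02.0: Resetting rcs0 after gpu hang
"""

HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\] \[drm\] GPU HANG: ecode (?P<ecode>[^,\s]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], reason: (?P<reason>.+?), "
    r"action: (?P<action>\w+)"
)
RESET_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\] i915 \S+ Resetting (?P<engine>\w+) after gpu hang"
)


def summarize(log):
    """Return (hang records, list of (timestamp, engine) resets)."""
    hangs = [m.groupdict() for m in HANG_RE.finditer(log)]
    resets = [(float(m["time"]), m["engine"]) for m in RESET_RE.finditer(log)]
    return hangs, resets


hangs, resets = summarize(LOG)
span = resets[-1][0] - resets[0][0]  # seconds between first and last reset
print(hangs[0]["process"], hangs[0]["pid"], len(resets), f"{span:.1f}")
# prints: v2r.exe 8658 6 46.0
```

So the render engine was reset six times in about 46 seconds after the first reported hang in v2r.exe.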
Comment 11 Jason Ekstrand 2018-11-06 16:36:33 UTC
Please give this branch a try:

https://gitlab.freedesktop.org/jekstrand/mesa/tree/wip/bug-108531

In RenderDoc, it improves the performance of the captured frame by about 20x.
Comment 12 Denis 2018-11-06 18:22:01 UTC
Hmm, it really provides a huge performance improvement.
I can see this on the first splash screen (pink). It was extremely laggy before, now it looks quite fast.

Also I didn't get a GPU hang, the level loaded successfully. The only thing I noticed is a big delay between starting the game and the first splash screen appearing. I even thought the game froze.

Tomorrow I will test more. Tested on my KBL.
Comment 13 leozinho29_eu 2018-11-06 21:29:41 UTC
With Mesa from that branch the menu has good performance (20 FPS, about the same as Windows 10) and there is no GPU hang anymore when loading the map; it takes some time but works. The pause menu has good performance too.

In gameplay after loading the map, FPS is still 1, and the Player Room is at 1 FPS too. The curious thing is that the game's CPU usage is around 4% when FPS is 1, so the GPU is limiting performance pretty hard.

This Mesa branch is a big improvement.
Comment 14 Jason Ekstrand 2018-11-06 22:04:02 UTC
Am I reading correctly that there are no GPU hangs with that branch?  If so, could you try the branch without the last patch?  I'd like to know if it's moving discards or handling transitions to VK_IMAGE_LAYOUT_UNDEFINED that fixes it.
Comment 15 leozinho29_eu 2018-11-06 23:39:17 UTC
Without the last patch (so HEAD is at 9ad747f1a17bde0ed3802e919f464c9fdaf64a1c, intel/nir: Enable nir_opt_move_discards_to_top) there are no GPU hangs and performance in the main menu is good for me.

I can say that when entering Jingu Sakura Park - Inner, where it was having a GPU hang, there is no longer a GPU hang.

It's not possible to say that every part of the game is 100% free of GPU hangs. With that awful performance and no sound (this game had sound working everywhere, but Idea Factory pushed two bad updates and the game no longer has sound), it's not a good experience to play it and it's hard to continue.
Comment 16 Danylo 2018-11-08 11:01:06 UTC
The FPS issue is solved in the menu; however, the same issue is indeed still present in gameplay.

Here is a trace https://drive.google.com/file/d/1j0fXWPsGWy3WM8hStV9qViEA-sWnD7rJ/view

The same shader is the cause of the low FPS during gameplay. This shader decides whether to continue execution by comparing a value from the depth buffer to one: if it is one, discard; otherwise continue.

In the main menu the depth buffer is clean (filled with 1.0). In gameplay, rendering the scene fills the depth buffer (set the range of the 't7' input to 0.99 - 1.0), so there is no early discard. Thus the main part of the shader is invoked for the whole screen, which looks like correct behavior since it clearly combines its inputs into a correctly lit scene.
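
The effect described above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: it counts fragments one by one (real hardware runs them in SIMD groups), it assumes that a discard left at the end of the shader still pays for the whole body, and the function name and buffer sizes are made up for illustration.

```python
def fragments_running_body(depth_buffer, discard_at_top):
    """Count fragments that execute the expensive part of the shader."""
    ran_body = 0
    for depth in depth_buffer:
        discarded = depth == 1.0  # the shader's test: depth equal to one means discard
        if discarded and discard_at_top:
            continue              # early exit, the expensive body never runs
        ran_body += 1             # body runs (and is wasted if discarded late)
    return ran_body


menu = [1.0] * 1000        # cleared depth buffer, as in the main menu
gameplay = [0.995] * 1000  # scene-filled depth buffer, nothing equals 1.0

print(fragments_running_body(menu, discard_at_top=False))      # prints: 1000
print(fragments_running_body(menu, discard_at_top=True))       # prints: 0
print(fragments_running_body(gameplay, discard_at_top=False))  # prints: 1000
print(fragments_running_body(gameplay, discard_at_top=True))   # prints: 1000
```

In this model, moving the discard to the top removes all the work in the menu but none of it in gameplay, which matches the FPS results reported in this thread.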
Comment 17 Denis 2018-11-08 16:31:39 UTC
>While in the menu fps issue is solved however the same issue is indeed present
>in the gameplay.

>Here is a trace https://drive.google.com/file/d/1j0fXWPsGWy3WM8hStV9qViEA-sWnD7rJ/view

Here is an updated link which should have the correct access permissions:
https://drive.google.com/open?id=14RkKwuJRZuUgDGzSUH0qsB2hzpU-i7zW
Comment 18 Jason Ekstrand 2018-11-09 17:19:11 UTC
Given the debug messages I'm seeing the game spew, I think the hang was fixed by the following commit:

commit 00fc56a68d21d7aa91b95f0eaacba59a96c466f5
Author: Danylo Piliaiev <danylo.piliaiev@gmail.com>
Date:   Fri Jul 20 12:54:42 2018 +0300

    anv: Disable dual source blending when shader doesn't support it on gen8+
    
    Dual source blending behaviour is undefined when shader doesn't
    have second color output.
    
     "If SRC1 is included in a src/dst blend factor and
      a DualSource RT Write message is not used, results
      are UNDEFINED. (This reflects the same restriction in DX APIs,
      where undefined results are produced if “o1” is not written
      by a PS – there are no default values defined)."
    
    Dismissing fragment in such situation leads to a hang on gen8+
    if depth test in enabled.
    
    Since blending cannot be gracefully fixed in such case and the result
    is undefined - blending is simply disabled.
    
    v2 (Jason Ekstrand):
     - Apply the workaround to each individual entry
     - Emit a warning through debug_report
    
    Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
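
The rule described in that commit message can be modeled roughly as follows. This is a Python illustration of the logic only, not the actual anv code (which is C): the `AttachmentBlend` class and `apply_workaround` function are hypothetical, and only the four `VK_BLEND_FACTOR_*SRC1*` names are real Vulkan enumerants.

```python
from dataclasses import dataclass, replace

# Vulkan blend factors that read the second (dual-source) color output.
SRC1_FACTORS = {
    "VK_BLEND_FACTOR_SRC1_COLOR",
    "VK_BLEND_FACTOR_ONE_MINUS_SRC1_COLOR",
    "VK_BLEND_FACTOR_SRC1_ALPHA",
    "VK_BLEND_FACTOR_ONE_MINUS_SRC1_ALPHA",
}


@dataclass(frozen=True)
class AttachmentBlend:
    """Hypothetical stand-in for one color attachment's blend state."""
    blend_enable: bool
    src_color: str
    dst_color: str
    src_alpha: str
    dst_alpha: str


def uses_src1(att):
    factors = (att.src_color, att.dst_color, att.src_alpha, att.dst_alpha)
    return any(f in SRC1_FACTORS for f in factors)


def apply_workaround(attachments, shader_writes_second_output):
    """Disable blending per attachment when SRC1 is used without a second output."""
    fixed, warnings = [], []
    for index, att in enumerate(attachments):
        if att.blend_enable and uses_src1(att) and not shader_writes_second_output:
            warnings.append(
                f"attachment {index}: dual-source blending without a second "
                "color output, blending disabled"
            )
            att = replace(att, blend_enable=False)
        fixed.append(att)
    return fixed, warnings
```

Each attachment is checked individually and a warning is collected for each one changed, mirroring the two v2 notes in the commit message.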


With that, I'm closing this bug because it's a bug about a GPU hang which seems to have been fixed.

The performance problem is a different issue.  I'm working with the main DXVK developer to solve the discard issue properly.  For the in-game performance, we have some compiler changes in the pipeline which will hopefully help reduce spilling.  It won't make that bump-map shader good because it's one of the worst written and most wasteful shaders I've ever seen but it may improve it enough to reach the 5 FPS that Windows is getting.  However, those changes are a ways out yet.  If you want to open a new bug about this game's performance, feel free to do so and we can track it there.

