Bug 110228 - [cfl] GPU hang when running UE4Editor
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
Duplicates: 110377
Reported: 2019-03-23 19:10 UTC by Andrzej Broński
Modified: 2019-10-15 12:50 UTC

Attachments
/sys/class/drm/card0/error (29.70 KB, text/plain)
2019-03-23 19:10 UTC, Andrzej Broński
dmesg output (1.40 MB, text/x-log)
2019-03-23 19:11 UTC, Andrzej Broński
vulkaninfo output (16.31 KB, text/plain)
2019-03-26 16:44 UTC, Andrzej Broński
UE4_debug.log (76.00 KB, text/x-log)
2019-04-02 11:41 UTC, Denis
journal_log_full (3.80 MB, application/zip)
2019-04-02 11:44 UTC, Denis
top_command_during_running_UE4 (44.00 KB, text/plain)
2019-04-02 11:44 UTC, Denis
standalone hang reproducer (13.65 KB, application/gzip)
2019-04-25 14:44 UTC, Danylo
reproducer v2 (13.79 KB, application/gzip)
2019-04-26 11:55 UTC, Danylo
Patch to use SURFTYPE_1D for texel buffers (876 bytes, patch)
2019-06-14 22:54 UTC, Jason Ekstrand
RenderDoc's hellblade replay log (121.88 KB, text/x-log)
2019-10-06 05:55 UTC, Andrzej Broński

Description Andrzej Broński 2019-03-23 19:10:23 UTC
Created attachment 143761 [details]
/sys/class/drm/card0/error

# Bug description

The kernel freezes completely whenever I open an "Unreal Engine 4" project; even the SysRq keys don't work. However, with netconsole I caught some error logs related to the GPU:

~~~
[  381.112949] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_MMAP
[  381.113119] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2_WR
[  381.113146] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_THROTTLE
...
[  387.161861] [drm] GPU HANG: ecode 9:0:0x8ed9fff2, in UE4Editor [3127], reason: hang on rcs0, action: reset
[  387.161866] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  387.161868] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  387.161870] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  387.161871] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  387.161873] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  387.162881] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
...
[  387.164405] i915 0000:00:02.0: Resetting chip for hang on rcs0
~~~

Because of the freeze I was able to obtain only part of /sys/class/drm/card0/error, which I am attaching along with the dmesg output.
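
For anyone sifting through a large dmesg or journal dump for these messages, here is a minimal Python sketch that extracts the "GPU HANG" lines. The regular expression is only fitted to the two message variants quoted in this report; other kernel versions may format the line differently, and the function and field names are made up for illustration.

```python
import re

# Fitted to the two variants quoted in this report, e.g.
#   "[drm] GPU HANG: ecode 9:0:0x8ed9fff2, in UE4Editor [3127], reason: hang on rcs0, action: reset"
#   "i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in UE4Editor [4360], hang on rcs0"
GPU_HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-F]+:[0-9a-fA-F]+:0x[0-9a-fA-F]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"(?:reason: )?hang on (?P<engine>\w+)"
)


def find_gpu_hangs(log_text):
    """Return one dict per 'GPU HANG' line found in kernel log text."""
    hangs = []
    for line in log_text.splitlines():
        match = GPU_HANG_RE.search(line)
        if match:
            hangs.append({
                "ecode": match.group("ecode"),      # kept as an opaque string
                "process": match.group("process"),  # process blamed by the kernel
                "pid": int(match.group("pid")),
                "engine": match.group("engine"),    # e.g. "rcs0"
            })
    return hangs
```

Feeding it the contents of the attached dmesg output should yield one entry per hang, which makes it easier to see which process and engine were involved each time.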


# System environment

-- chipset: Z370
-- system architecture: x86_64
-- xf86-video-intel: 1:2.99.917+863+g6afed33b-1
-- xorg-server: 1.20.4-1
-- mesa: 19.0.0-1
-- libdrm: 2.4.97-1
-- kernel: 5.0.3.arch1-1
-- Linux distribution: Arch Linux
-- Machine or mobo model: Z370 SLI PLUS
-- Display connector: DisplayPort
Comment 1 Andrzej Broński 2019-03-23 19:11:15 UTC
Created attachment 143762 [details]
dmesg output
Comment 2 Andrzej Broński 2019-03-23 19:29:42 UTC
There are two problems associated with this issue:
1) GPU hang -> a userspace program shouldn't be able to hang the GPU
2) kernel freeze -> although the log shows 'Resetting rcs0/chip', the OS just hangs; I think it should either continue working or show a kernel panic.
Comment 3 Chris Wilson 2019-03-23 20:33:36 UTC
(In reply to Andrzej Broński from comment #2)
> There are two problems associated with this issue:
> 1) GPU hang -> userspace program shouldn't be able to crash GPU
> 2) kernel freeze -> despite there is 'resetting rcs0/chip' OS just hangs; I
> think it should continue working, otherwise it should show kernel panic.
Comment 4 Andrzej Broński 2019-03-26 16:44:22 UTC
Created attachment 143780 [details]
vulkaninfo output

The issue may be related to Vulkan, so I am attaching the output of 'vulkaninfo' - it segfaults, btw.
My version: vulkan-intel 19.0.0-1.
Comment 5 Tapani Pälli 2019-03-27 11:07:07 UTC
(In reply to Andrzej Broński from comment #4)
> Created attachment 143780 [details]
> vulkaninfo output
> 
> Issue can be related to Vulkan, so I provide output of 'vulkaninfo' - it
> segfaults btw.
> My version: vulkan-intel 19.0.0-1.

Huh, can you provide the backtrace for vulkaninfo segfault?
Comment 6 Andrzej Broński 2019-03-27 13:10:07 UTC
(In reply to Tapani Pälli from comment #5)
> (In reply to Andrzej Broński from comment #4)
> > Created attachment 143780 [details]
> > vulkaninfo output
> > 
> > Issue can be related to Vulkan, so I provide output of 'vulkaninfo' - it
> > segfaults btw.
> > My version: vulkan-intel 19.0.0-1.
> 
> Huh, can you provide the backtrace for vulkaninfo segfault?


Sure. I built mesa with debugging symbols, then ran vulkaninfo under gdb:
~~~
...
Thread 1 "vulkaninfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff7f650ce in xcb_send_request_with_fds64 () from /usr/lib/libxcb.so.1
(gdb) bt
#0  0x00007ffff7f650ce in xcb_send_request_with_fds64 () from /usr/lib/libxcb.so.1
#1  0x00007ffff7f6566a in xcb_send_request () from /usr/lib/libxcb.so.1
#2  0x00007ffff7f74405 in xcb_query_extension () from /usr/lib/libxcb.so.1
#3  0x00007ffff54d2c3d in wsi_x11_connection_create (conn=0x48000401b9358b48, wsi_dev=0x5555555b8d80) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:135
#4  wsi_x11_get_connection (wsi_dev=wsi_dev@entry=0x5555555b8d80, conn=conn@entry=0x48000401b9358b48) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:242
#5  0x00007ffff54d341a in x11_surface_get_support (icd_surface=<optimized out>, wsi_device=0x5555555b8d80, queueFamilyIndex=<optimized out>, pSupported=0x7fffffffd824) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:427
#6  0x0000555555562693 in ?? ()
#7  0x0000555555557f72 in ?? ()
#8  0x00007ffff7c10223 in __libc_start_main () from /usr/lib/libc.so.6
#9  0x00005555555587be in ?? ()
~~~

You can grab the core file here: https://drive.google.com/open?id=1c7lX82Y2xCkVFVv84SnaqD0X3lCRphAJ. However, it seems to be a problem with the XCB library.
Comment 7 Tapani Pälli 2019-03-28 06:13:04 UTC
(In reply to Andrzej Broński from comment #6)

Thanks, that is weird. FYI, there is another trace that points to an xcb failure in bug #110261.
Comment 8 Denis 2019-03-29 16:25:07 UTC
Hi, it looks like I reproduced the issue; the kernel freezes completely here as well. UE4 can be launched if you select the "flight" project, but then it simply crashes because of a Vulkan issue.

I am continuing my investigation.

Test configuration:
Manjaro
Kernel 5.0
Mesa 19.1.0 git-master
UHD 630 gpu (CFL)
Comment 9 Andrzej Broński 2019-03-29 19:40:04 UTC
(In reply to Denis from comment #8)
> hi, looks like I reproduced the issue, kernel is freezing completely also.
> Also UE4 engine can be launched if you select "flight" project, but - simply
> crashes because of vulkan issue.
> 
> Continue my investigations.
> 
> Test configuration:
> Manjaro
> Kernel 5.0
> Mesa 19.1.0 git-master
> UHD 630 gpu (CFL)

You can use the "-opengl" switch as a workaround.
Comment 10 Denis 2019-04-02 11:41:03 UTC
Thanks for the suggestion, but in my case I think it's unnecessary.

So, what I did:
1. Built and booted into a drm-tip kernel (with debug flags enabled)
2. Captured "top" output while running UE4 (it doesn't look like OOM...)
3. Captured a UE4Editor log while running it

Strangely, no core dump was generated when I got stuck with the hang. I will attach these logs to the bug report. Any ideas and suggestions about what to check or try are appreciated.

BTW, the PC hangs in both cases: when I build an "empty" project with samples, and when I build it without them (I thought the problem might be in some default preset that compiles some shaders, for example).
Comment 11 Denis 2019-04-02 11:41:40 UTC
Created attachment 143842 [details]
UE4_debug.log
Comment 12 Denis 2019-04-02 11:44:27 UTC
Created attachment 143843 [details]
journal_log_full

This log contains several "hangs" from testing. To navigate, search for the word "Reboot"; it marks every place where I had to hard-reboot my PC.
Comment 13 Denis 2019-04-02 11:44:50 UTC
Created attachment 143844 [details]
top_command_during_running_UE4
Comment 14 Andrzej Broński 2019-04-03 09:57:02 UTC
(In reply to Denis from comment #10)
> thanks for suggestion, but in my case, I think, its unnecessary.
> 
> So, what I did:
> 1. build and boot into drm-tip kernel (with debug flags enabled)
> 2. Got "top" results during running UE4 (it doesn't look like OOM...)
> 3. Got a UE4editor log during running it
> 
> Strange that core dump wasn't generated, when I stacked with hang. I will
> attach these logs to the bug report. Any ideas and suggestions, what to
> check or try - appreciated.
> 
> BTW - PC hangs in both cases, when I am building "empty" project but with
> samples, and if I build it without them (I thought that problem might be in
> some default pre-set, which can compile some shaders, for example)

I doubt the UE4 logs are useful, because it's a kernel-level problem. I looked into your kernel logs and there is only one interesting part:
~~~
Apr 02 13:12:23 manjaro-pc kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in UE4Editor [4360], hang on rcs0
Apr 02 13:12:23 manjaro-pc kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
-- Reboot --
~~~
However, capturing the full crash log along with /sys/class/drm/card0/error is not an easy task; I used netconsole for it. You can contact me at andrzej1_1@o2.pl and I will give you details on how to do it.
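
As a rough idea of the receiving side: netconsole sends kernel messages as UDP datagrams to a second machine, so any UDP listener can store them. Below is a minimal Python sketch of such a listener; it is not the setup used for this report. Port 6666 is an assumption and must match whatever target port the hanging machine's netconsole is configured with, and the function names are made up for illustration.

```python
import socket


def open_receiver(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole datagrams can be received.

    The port is an assumption; use the target port configured on the sender.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_message(sock, max_size=65535):
    """Block until one datagram arrives and return it as text."""
    data, _sender = sock.recvfrom(max_size)
    return data.decode("utf-8", errors="replace")


def capture(sock, path):
    """Append every received message to `path`, flushing after each one."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            log.write(read_message(sock))
            log.flush()
```

Running `capture(open_receiver(), "hang.log")` on the second machine before starting UE4 would keep whatever the kernel managed to send before it froze.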
Comment 15 Denis 2019-04-03 11:03:42 UTC
Hi Andrzej, I agree that the UE4 logs might be useless, but as I said, my idea was that something UE4 did could have caused the GPU hang, and that action could show up in the logs. That's why I captured them as well.

>However if you want to capture full crash log along with /sys/class/drm/card0/error it's not easy task - I used netconsole for it. You can contact me at andrzej1_1@o2.pl and I will give you details how to do it.

That's an interesting point for me for future investigations, so I would be happy to know how to do this.

Right now I don't see any value in getting the error log, because you already attached it and, as I found out, nothing extra can add to it (to make it clearer, for example by using a debug mesa or kernel, etc.).

So, as you said, there are two issues here: a GPU hang, which leads to a kernel hang.
Comment 16 Danylo 2019-04-10 11:32:07 UTC
The hang happens due to compute workload in ANV and indeed fully hangs the system. I haven't been able to identify the cause yet.
Comment 17 Denis 2019-04-10 11:58:47 UTC
I moved this ticket to the Vulkan component, because when I tested this:
>You can use "-opengl" switch as workaround.
I found that there are no hangs at all. I tested all of my (about 5) projects; result: UE4 is 100% workable.

I also captured a vktrace of a project launch and tested it on another machine. Sometimes (about 1 in 5 tries) it leads to a machine hang.
To replay it you need vktrace (part of vulkan-tools).

https://drive.google.com/file/d/1tCqzmYVx7zKTFauA3YdaO7DgPxJqYizh/view?usp=sharing
Comment 18 Andrzej Broński 2019-04-10 13:47:33 UTC
(In reply to Denis from comment #17)
> I moved this ticket to the vulkan section, because when I tested this:
> >You can use "-opengl" switch as workaround.
> I found out that no any hangs at all. I tested on all my (about 5) projects,
> result - 100% workable UE4.
> 
> Also I could get an vktrace of project launching and tested it on another
> machine. And can say that sometimes (like, 1 from 5 tries) it leads to
> machine hang.
> To replay it you need vktrace (part of vulkan-tools).
> 
> https://drive.google.com/file/d/1tCqzmYVx7zKTFauA3YdaO7DgPxJqYizh/
> view?usp=sharing

You attached a trace of Quake - is that a mistake?
Comment 19 Denis 2019-04-10 13:59:41 UTC
Oh... I am sorry, yes, that was a mistake.

https://drive.google.com/open?id=1_6fFT_tuIb6mdZK6cQlWZPbFSeh3Bb2d
Here is the correct link.
Comment 20 Andrzej Broński 2019-04-10 19:32:51 UTC
(In reply to Denis from comment #19)
> oh... I am sorry, yes, that's by mistake.
> 
> https://drive.google.com/open?id=1_6fFT_tuIb6mdZK6cQlWZPbFSeh3Bb2d
> Here it is correct link

This trace file is very useful! I ran the following command 5 times:
~~~
$ vkreplay -o vkquake-flicker-3.vktrace
~~~
and it also froze my PC.
Comment 21 Danylo 2019-04-10 19:43:45 UTC
That's strange - it definitely shouldn't hang, and it doesn't hang on our machines; maybe the trace is incompatible...  Could you try the actual vkQuake?
Comment 22 Andrzej Broński 2019-04-11 09:03:00 UTC
(In reply to Danylo from comment #21)
> That's strange - it definitely shouldn't hang, it doesn't hang on our
> machines, maybe trace is incompatible...  Could you try actual vkQuake?

Sorry for the confusion, but I copied the wrong command from my history (a previous test with vkquake). I meant that the following command makes my OS hang:
~~~
$ vkreplay -o ue4_trace3.vktrace
~~~
Comment 23 Danylo 2019-04-25 14:44:49 UTC
Created attachment 144093 [details]
standalone hang reproducer

I found a compute shader which hangs the GPU, reduced it and made a standalone reproducer.

Warning: It will hang the system, with a hard reset being the only way out (I think this could be classified as a bug, because I don't think it should hang so badly).

The shader itself is small and simple:

#version 430
layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
layout(binding = 0) uniform usamplerBuffer StartOffsetGrid; // length 12160
layout(binding = 1, r32ui) uniform uimageBuffer RWNextCulledLightData; 
void main()
{
    if (all(lessThan(gl_GlobalInvocationID, uvec3(19, 10, 32)))) {
        uint u0 = (((gl_GlobalInvocationID.z * uint(10)) + gl_GlobalInvocationID.y) * uint(19)) + 32;
        uint u7 = 6000 + u0; // change to 5000 and the hang will go away
        uint u11 = texelFetch(StartOffsetGrid, int(u7)).x;
        uint _225 = imageAtomicAdd(RWNextCulledLightData, 0, u11);
    }
} 

StartOffsetGrid and RWNextCulledLightData point into the same buffer, with RWNextCulledLightData starting right after StartOffsetGrid.

However, I'm not able to identify the cause. Change "6000 + u0" to "5000 + u0" and there will be no hang; in both cases the index stays inside the buffer.

So I need additional help here.
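
The claim that the index stays inside the buffer in both cases can be checked with a few lines of arithmetic. The Python sketch below mirrors the index expression from the shader above; the helper name is made up for illustration, and the buffer length of 12160 is taken from the shader comment.

```python
def fetch_index_range(constant, bounds=(19, 10, 32)):
    """Return (min, max) of u7 over all invocations that pass the bounds check."""
    _x_bound, y_bound, z_bound = bounds  # x does not enter the index formula
    indices = [
        constant + ((z * 10 + y) * 19) + 32  # u7 = constant + u0
        for z in range(z_bound)
        for y in range(y_bound)
    ]
    return min(indices), max(indices)


BUFFER_LENGTH = 12160  # length of StartOffsetGrid given in the shader comment

for constant in (6000, 5000):
    low, high = fetch_index_range(constant)
    print(constant, low, high, high < BUFFER_LENGTH)
```

With the constant 6000 the fetched indices span 6032 to 12093, and with 5000 they span 5032 to 11093, so both variants stay below 12160 and the hang is not explained by a plain out-of-bounds fetch.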
Comment 24 Jason Ekstrand 2019-04-25 23:02:42 UTC
I've seen mystery hangs before with UE4 and atomics but was never able to really pin it down.  Thanks for making a simple reproducer!

Mind trying a couple things?

 1. Place the buffer views in the other order in the buffer so the atomic comes first.  Does that fix it?
 2. Leaving the buffer views in the current order, try spacing them out so that there is at least 16K between the two views.  If that works, what's the minimum distance?

I've got a theory about what might be going wrong here and I don't like it...
Comment 25 Danylo 2019-04-26 11:55:56 UTC
Created attachment 144097 [details]
reproducer v2

- Changing the order of the buffers doesn't help.
- Spacing them helps; the necessary spacing depends on the constant in the "uint u7 = 6080 + u0;" expression. Results:
  Constant Spacing
    6120     512
    6080     512
    6016     512
    6001     512
    6000     576
    5998     576
    5997     512
    5760     384
Comment 26 Danylo 2019-06-13 10:00:56 UTC
Hellblade: Senua’s Sacrifice most likely suffers from this issue: it fully hangs the system on a compute shader which uses buffers allocated close together, in a similar fashion to this bug.

Vulkan rdc made on HD 620: https://mega.nz/#!9ENTCIAa!yitgxJYXcxMqAkC_hnzRIadUy-9oocclGLSSurofh18

Also the reproducer and the game don't hang on HD 5500 (gen8)
Comment 27 Jason Ekstrand 2019-06-14 15:53:47 UTC
First off, thanks for all your digging and I'm sorry I've not gotten back to this yet.

I have a feeling I know what's going wrong here and I really don't like it.  When I've played with this before, I've observed that the hangs go away when I disable L3 caching on those buffers via MOCS settings or changing the L3$ programming to disable the data cache.  What I suspect is happening is that the same cache line is getting pulled into the sampler portion of the L3$ and into the data cache portion of the L3$ at the same time and something about the atomic is causing the cache to blow up and the chip to hang.

Your first reaction to this might be to say, "aren't cache lines usually 64B?"  Yes, that's the usual size or at least in the right area.  However, our HW docs expressly say that the sampler sometimes fetches more cache lines than you'd naively think it needs.  Also (and I am speculating a bit here), the cache might be doing some speculative pre-fetching that reads outside of the normal boundaries.  I'm not sure if the two of those is enough to read 576B outside but the evidence suggests maybe it does.
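
To put the measured spacings from comment 25 next to the 64B figure, here is a small Python sketch that converts them to cache lines. It assumes the spacings in that table are in bytes (consistent with the "576B" above) and that a cache line is 64 bytes; both are assumptions, not something confirmed by hardware documentation in this thread.

```python
CACHE_LINE_BYTES = 64  # the "usual" cache-line size mentioned above (assumption)

# (constant, spacing) pairs as reported in comment 25; the spacing is
# assumed to be in bytes.
MEASURED = [(6120, 512), (6080, 512), (6016, 512), (6001, 512),
            (6000, 576), (5998, 576), (5997, 512), (5760, 384)]


def spacing_in_cache_lines(spacing_bytes, line_bytes=CACHE_LINE_BYTES):
    """Convert a spacing in bytes to whole cache lines, rounding up."""
    return -(-spacing_bytes // line_bytes)


for constant, spacing in MEASURED:
    print(constant, spacing, spacing_in_cache_lines(spacing))
```

Under those assumptions the required gap works out to between 6 and 9 cache lines, which is far more than a single line and fits the idea that something fetches well beyond the texel actually requested.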

So how do we work around the issue?  Good question.  Unfortunately, because of the way that buffer views work in Vulkan, we can't just force them to space their buffers out.  It's possible that we may be able to work around this in the shader by doing some sort of barrier between the texture operation and the atomic to ensure that, at the very least, the two aren't in-flight at the same time.
Comment 28 Danylo 2019-06-14 16:16:35 UTC
Wow, that makes some sense. And if it's true, the workaround doesn't look great, since it would be problematic to apply it only in cases when such buffers are nearby...

I hope you can confirm this - it's out of my league =)
Comment 29 Jason Ekstrand 2019-06-14 21:14:44 UTC
Update:  I tried inserting a barrier() and it did nothing. :-(
Comment 30 Jason Ekstrand 2019-06-14 22:41:25 UTC
Put a bit more time into it.  Here's what I have so far:

 1. Inserting a barrier() does nothing

 2. Setting MOCS = 0 on the buffer used for the atomic makes the hang go away.

 3. If I hack up the driver to use an untyped atomic instead of a typed atomic, no change.  It still hangs so typed vs. untyped doesn't seem to matter.

 4. If I hack the driver up to use SURFTYPE_1D instead of SURFTYPE_BUFFER for the uniform texel buffer, the hang goes away.

 5. If I use SURFTYPE_1D for the VkBufferView with the atomic and SURFTYPE_BUFFER for the uniform texel buffer, it hangs.

It's very much starting to look like it's just a weird interaction between atomics, the L3$, and the sampler with SURFTYPE_BUFFER.  Unfortunately, that means our options for mitigating it are somewhat limited.  We can't just use SURFTYPE_1D for all buffer textures because that has a limited width of 16K and I really don't want to start disabling L3 caching on all atomics.
Comment 31 Jason Ekstrand 2019-06-14 22:54:51 UTC
Created attachment 144550 [details] [review]
Patch to use SURFTYPE_1D for texel buffers

One more thing before I call it a day.  I've attached a hack patch which attempts to use SURFTYPE_1D for texel buffers.  It's not a solution for the reasons I've already listed.  However, could you take a bit of time and experiment with it on some of the apps that are known to hang and see if it helps them?  If so, that at least tells us that we're headed in the right direction.
Comment 32 Andrzej Broński 2019-06-16 11:40:28 UTC
(In reply to Jason Ekstrand from comment #31)
> Created attachment 144550 [details] [review] [review]
> Patch to use SURFTYPE_1D for texel buffers
> 
> One more thing before I call it a day.  I've attached a hack patch which
> attempts to use SURFTYPE_1D for texel buffers.  It's not a solution for the
> reasons I've already listed.  However, could you take a bit of time and
> experiment with it on some of the apps that are known to hang and see if it
> helps them?  If so, that at least tells us that we're headed in the right
> direction.

I installed mesa with your patch and there is no more GPU hang after launching _Unreal Engine_. For me it's satisfactory workaround, however I hope you will figure out root cause.
Comment 33 Danylo 2019-06-20 12:25:15 UTC
> It's not a solution for the reasons I've already listed.  
> However, could you take a bit of time and experiment with it on some of the
> apps that are known to hang and see if it helps them?

Unfortunately Hellblade crashes with this patch on a buffer with:
 stride_b: 4, num_elements: 65280, pitch: 3

Which seems to be the width limitation you said about:
> We can't just use SURFTYPE_1D for all buffer textures because that has a 
> limited width of 16K

MOCS = 0 indeed helps.
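
The width limitation can be illustrated with a trivial check. This Python sketch reads "16K" as 16384 elements, which is an assumption, and the helper name is hypothetical; it is not code from the patch.

```python
SURFTYPE_1D_MAX_WIDTH = 16 * 1024  # "limited width of 16K", read here as 16384 elements


def fits_surftype_1d(num_elements, max_width=SURFTYPE_1D_MAX_WIDTH):
    """True if a texel buffer with this many elements fits the 1D width limit."""
    return 0 < num_elements <= max_width
```

By this check the Hellblade buffer with num_elements: 65280 does not fit, while the 12160-element StartOffsetGrid buffer from the reproducer does, which matches the patch helping the reproducer but not this game.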
Comment 34 Jason Ekstrand 2019-06-25 22:37:03 UTC
I've done a bunch of poking at this including running it on Windows (where it doesn't hang) and getting an aub capture.  There's nothing obvious that we're really doing differently. :-(  I'm guessing there's some L3$ chicken bit we're missing somewhere.

I also experimented with reworking the test to use a fragment shader instead of a compute shader and was able to reproduce the hang with FS as well.  This eliminates the possibility that it's some GPGPU workaround we're missing.
Comment 35 Denis 2019-06-27 10:15:43 UTC
*** Bug 110377 has been marked as a duplicate of this bug. ***
Comment 36 Jason Ekstrand 2019-07-20 13:00:16 UTC
From Chris Wilson:

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 704ace01e7f5..890a3bcfacea 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
                            MMCD_PCLA | MMCD_HOTSPOT_EN);
        }
 
+       wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+       wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+       wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
+
        /* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
        wa_write_or(wal,
                    GAM_ECOCHK,

Assuming none of those registers are protected, we should be able to do that from userspace and disable the L3$ for just atomics.
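
For readers unfamiliar with the helper in the diff, the three added lines each clear a single bit in a register. The Python sketch below models that, assuming wa_write_masked_or(wal, reg, mask, val) behaves as a read-modify-write that clears the bits in mask and then ORs in val; that reading of the kernel helper is an assumption, and the Python names are made up for illustration.

```python
def bit(n):
    """Equivalent of the kernel's BIT(n) macro for a 32-bit register."""
    return 1 << n


def masked_or(old_value, clear_mask, set_bits):
    """Model of a read-modify-write: clear `clear_mask`, then OR in `set_bits`."""
    return ((old_value & ~clear_mask) | set_bits) & 0xFFFFFFFF


# (register offset, bit cleared) for the three writes in the diff above.
WRITES = [(0xB008, bit(0)), (0xB118, bit(22)), (0xB11C, bit(8))]
```

For example, `masked_or(0xFFFFFFFF, bit(22), 0)` gives `0xFFBFFFFF`: only bit 22 changes, and every other bit of the register keeps its previous value.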
Comment 37 Jason Ekstrand 2019-07-20 14:08:13 UTC
Leaving myself more notes:

Looking around at Windows driver code and batches, the Windows driver is definitely leaving "Non-IA coherent atomics enable" enabled in L3SQCREG4 and they don't have this hang so the workaround must be somewhere else.
Comment 38 Jason Ekstrand 2019-08-09 04:32:01 UTC
Chris posted a kernel patch on 110998 which fixes the reproducer case. Could you please try it with UE4 and any other apps we've suspected as having this same hang and see what all it fixes?
Comment 39 Denis 2019-08-09 09:54:04 UTC
Hi, unfortunately I deleted UE4, but I tested with the vktrace which reproduced the issue 100% of the time. On mesa 17.3.9 it doesn't hang (on later mesa versions vktrace crashed; I asked Danylo to check it).
I also tested an apitrace of the DX11 game Hellblade, which is a UE4 game and also had the hang. The apitrace replayed successfully.

So, based on this, I would say that the kernel patch helped.
Tested on drm-tip from 15 July + the mentioned patch.
Comment 40 Andrzej Broński 2019-08-09 16:58:14 UTC
(In reply to Jason Ekstrand from comment #38)
> Chris posted a kernel patch on 110998 which fixes the reproducer case. Could
> you please try it with UE4 and any other apps we've suspected as having this
> same hang and see what all it fixes?

Can the patch be tested without recompiling the whole kernel?
Comment 41 Sergey Alirzaev 2019-08-14 09:51:37 UTC
I have frequent crashes in mpv that look related:

#0  0x00007fbd8eaa2b45 in xcb_send_request_with_fds64 () from /usr/lib64/libxcb.so.1
#1  0x00007fbd8eaa2ce9 in xcb_send_request () from /usr/lib64/libxcb.so.1
#2  0x00007fbd8eaa9724 in xcb_intern_atom () from /usr/lib64/libxcb.so.1
#3  0x00007fbd8f153ef3 in set_adaptive_sync_property (conn=conn@entry=0x7fbd6c3c8e40, drawable=58720258, state=<optimized out>, state@entry=0) at ../mesa-19.1.4/src/loader/loader_dri3_helper.c:114
#4  0x00007fbd8f15516c in loader_dri3_drawable_init (conn=0x7fbd6c3c8e40, drawable=drawable@entry=58720258, dri_screen=0x7fbd6c841ed0, is_different_gpu=<optimized out>, multiplanes_available=<optimized out>, dri_config=0x7fbd6c2b7920, 
    ext=0x7fbd6c6727f8, vtable=0x7fbd8f165ae0 <egl_dri3_vtable>, draw=0x7fbd6c7778c0) at ../mesa-19.1.4/src/loader/loader_dri3_helper.c:382
#5  0x00007fbd8f14e258 in dri3_create_surface (drv=drv@entry=0x7fbd6c2b8290, disp=disp@entry=0x7fbd6c6c7540, type=type@entry=4, conf=0x7fbd6c2a88b0, native_surface=<optimized out>, attrib_list=0x0)
    at ../mesa-19.1.4/src/egl/drivers/dri2/platform_x11_dri3.c:179
#6  0x00007fbd8f14e377 in dri3_create_window_surface (drv=0x7fbd6c2b8290, disp=0x7fbd6c6c7540, conf=<optimized out>, native_window=<optimized out>, attrib_list=<optimized out>)
    at ../mesa-19.1.4/src/egl/drivers/dri2/platform_x11_dri3.c:232
#7  0x00007fbd8f147695 in dri2_create_window_surface (drv=<optimized out>, disp=<optimized out>, conf=<optimized out>, native_window=<optimized out>, attrib_list=<optimized out>) at ../mesa-19.1.4/src/egl/drivers/dri2/egl_dri2.c:1591
#8  0x00007fbd8f13c011 in _eglCreateWindowSurfaceCommon (disp=disp@entry=0x7fbd6c6c7540, config=config@entry=0x7fbd6c2a88b0, native_window=native_window@entry=0x3800002, attrib_list=attrib_list@entry=0x0)
    at ../mesa-19.1.4/src/egl/main/eglapi.c:929
#9  0x00007fbd8f13c205 in eglCreateWindowSurface (dpy=<optimized out>, config=0x7fbd6c2a88b0, window=58720258, attrib_list=0x0) at ../mesa-19.1.4/src/egl/main/eglapi.c:945
#10 0x000056220fda0571 in mpegl_init ()
#11 0x000056220fd882ff in ra_ctx_create ()
#12 0x000056220fda8b44 in preinit ()
#13 0x000056220fda7f01 in vo_thread ()
#14 0x00007fbd8ecb1458 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fbd8ebdf3cf in clone () from /lib64/libc.so.6

The execution path before xcb_intern_atom varies.
Comment 42 Jason Ekstrand 2019-08-14 18:17:08 UTC
(In reply to Sergey Alirzaev from comment #41)
> I have frequent crashes in mpv that look related:

That's not at all related.  It's not even using the same userspace driver.
Comment 43 Jason Ekstrand 2019-09-12 02:47:43 UTC
Fixed in the kernel:

commit 9d7b01e93526efe79dbf75b69cc5972b5a4f7b37 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 4 11:07:07 2019 +0100

    drm/i915: Restore relaxed padding (OCL_OOB_SUPPRES_ENABLE) for skl+
    
    This bit was fliped on for "syncing dependencies between camera and
    graphics". BSpec has no recollection why, and it is causing
    unrecoverable GPU hangs with Vulkan compute workloads.
    
    From BSpec, setting bit5 to 0 enables relaxed padding requirements for
    buffers, 1D and 2D non-array, non-MSAA, non-mip-mapped linear surfaces;
    and *must* be set to 0h on skl+ to ensure "Out of Bounds" case is
    suppressed.
    
    Reported-by: Jason Ekstrand <jason@jlekstrand.net>
    Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
    Fixes: 8424171e135c ("drm/i915/gen9: h/w w/a: syncing dependencies between camera and graphics")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: denys.kostin@globallogic.com
    Cc: Jason Ekstrand <jason@jlekstrand.net>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v4.1+
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190904100707.7377-1-chris@chris-wilson.co.uk

Solves the immediate test case.
Comment 44 Andrzej Broński 2019-09-20 13:25:15 UTC
The patch is merged into Linux 5.3.0; however, I still get a GPU hang. Can someone confirm that the issue is solved?
Comment 45 Denis 2019-09-20 14:59:07 UTC
Hmm, weird. I also installed the 5.3.0 kernel (from Manjaro) and was able to reproduce the hang.

Manjaro OS
Linux den-pc 5.3.0-1-MANJARO #1
Mesa 19.1.5
KBL (HD620)
________________________________________________________
But on Ubuntu with the 5.3.0 kernel it works great:

Mesa 19.3.0-devel (git-1e483a87bc)
VERSION="18.04.3 LTS (Bionic Beaver)"
5.3.0-050300-generic
KBL (HD620)

(To be sure that I am not crazy, I also tested Ubuntu with the 4.15 kernel and got a hang.)

So it looks like the fix wasn't included in ALL 5.3.0 kernels... We need to wait a bit 8-/
Comment 46 Andrzej Broński 2019-09-20 15:49:51 UTC
But it's already included in Arch: https://git.archlinux.org/linux.git/commit/?h=v5.3-arch1&id=592b8d8759ceb7086e1683e1796c7110e6c2ae8f
Comment 47 Denis 2019-09-26 16:17:47 UTC
Holy ...
This bug drove me crazy, even though the answer was near.

Here is patch date, on your link:

author	Linus Torvalds <torvalds@linux-foundation.org>	2019-09-14 11:54:57 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2019-09-14 11:54:57 -0700


So the commit date in the kernel is 14.09.

My current 5.3.0 kernel was built on => Sep 2 18:26:38

So yes, the fix simply wasn't included! And one last thing: the Ubuntu kernel (which worked fine) has a 15.09 release date.
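
The date reasoning above can be written down as a small check. This Python sketch uses the 14 September commit date quoted in this comment; the function name is made up for illustration, and a later build date is only a necessary condition, not proof that a distribution actually picked up the fix.

```python
from datetime import date

FIX_COMMIT_DATE = date(2019, 9, 14)  # commit date quoted in this comment


def may_contain_fix(kernel_build_date, commit_date=FIX_COMMIT_DATE):
    """A kernel built before the commit date cannot contain the fix.

    A later build date is necessary but not sufficient: the packager may
    still have built from an older tree.
    """
    return kernel_build_date >= commit_date
```

For the kernels mentioned here, a build from 2 September returns False, while the 15 September Ubuntu build returns True.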
Comment 48 Denis 2019-09-27 13:04:05 UTC
upd

Linux den-pc 5.3.1-arch1-1-ARCH #1 SMP PREEMPT Sat Sep 21 11:33:49 UTC 2019 x86_64 GNU/Linux

tested and worked fine (without hangs)
You may also install it from here https://www.archlinux.org/packages/core/x86_64/linux/
Comment 49 Andrzej Broński 2019-09-28 09:45:55 UTC
@Denis
Unfortunately, the patch (14 Sep) was included in Linux 5.3.0 (15 Sep) for Arch. Today I tested 5.3.1 and cpu is still hanging.

I will provide output logs in a few days.
Comment 50 Denis 2019-09-30 08:15:18 UTC
Okay, very interesting. May I ask you to run a test with the RenderDoc capture from here? => https://mega.nz/#!9ENTCIAa!yitgxJYXcxMqAkC_hnzRIadUy-9oocclGLSSurofh18

To run it you need RenderDoc => https://renderdoc.org/ (use the nightly version)

>renderdoccmd replay <path_to_trace.rdc>

Since we assumed that the cause of the hang was the same for all the cases found, I re-checked them with this game (I have an apitrace and a RenderDoc capture for Hellblade; I also tested Darksiders 3 and the attached reproducer from Danylo). Weirdly, the original trace (the vktrace for UE4) stopped working for me (no hangs... it just didn't trace).

And the second thing: please create a new issue in GitLab, as all tickets and discussions were migrated there => https://gitlab.freedesktop.org/mesa/mesa/issues
Comment 51 Andrzej Broński 2019-10-06 05:55:57 UTC
Created attachment 145662 [details]
RenderDoc's hellblade replay log
Comment 52 Andrzej Broński 2019-10-06 06:02:56 UTC
@Denis 

1) I replayed the Hellblade trace without any hang. I stopped it after a few minutes, because it showed only the game menu and no new events happened; I attached its log. However, it does not mean the bug was fixed correctly, because the RenderDoc docs state that "portability of captures between hardware is not guaranteed".

2) I am also unable to replay the old ue4_trace3.vktrace because of a "Segmentation fault".

3) Okay, I will capture all logs (as I did when I filed this bug), then create an issue on GitLab.
Comment 53 Denis 2019-10-11 10:36:18 UTC
The same thing happened to me (something broke vktrace). About RenderDoc: even though it shows only the main menu, that was enough to cause a GPU hang (you can run it on an older kernel and see it :) ).

If you get a hang every time, you could try to make your own vktrace, as I did before. It might succeed and also lead to a hang (so we can reproduce it on our side).
I also forgot to ask you to check the reproducer at https://bugs.freedesktop.org/show_bug.cgi?id=110228#c25 (it was based on the vktrace and reproduced that hang 100% of the time).

So, to summarize: since RenderDoc doesn't hang your system (and if the reproducer also doesn't hang), we are facing some new issue. Looking forward to your feedback.
Comment 54 Andrzej Broński 2019-10-12 22:05:51 UTC
@Denis Recently my UE4 started crashing, so I installed the latest version and there is no hang... The standalone reproducer works fine as well.

I am unable to reproduce the issue now, so I am happy and I acknowledge the issue as fixed. If I ever encounter this problem in the future, I will let you know.
Comment 55 Denis 2019-10-15 12:50:02 UTC
(In reply to Andrzej Broński from comment #54)
> @Denis Recently my UE4 started crashing, so installed latest version and
> there is no hang... Standalone reproducer works fine as well.
> 
> I am unable to reproduce issue now, so I am happy and I acknowledge issue as
> fixed. If I ever encounter this problem in future, I will let you know.

Great!

