Created attachment 143761 [details]
/sys/class/drm/card0/error

# Bug description
The kernel freezes completely whenever I open an "Unreal Engine 4" project - even the SysRq keys don't work. However, with netconsole I caught some error logs related to the GPU:
~~~
[ 381.112949] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_MMAP
[ 381.113119] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2_WR
[ 381.113146] [drm:drm_ioctl [drm]] pid=775, dev=0xe200, auth=1, I915_GEM_THROTTLE
...
[ 387.161861] [drm] GPU HANG: ecode 9:0:0x8ed9fff2, in UE4Editor [3127], reason: hang on rcs0, action: reset
[ 387.161866] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 387.161868] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 387.161870] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 387.161871] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 387.161873] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 387.162881] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
...
[ 387.164405] i915 0000:00:02.0: Resetting chip for hang on rcs0
~~~
Because of the freeze I was only able to obtain part of /sys/class/drm/card0/error, which I am attaching along with the dmesg output.

# System environment
-- chipset: Z370
-- system architecture: x86_64
-- xf86-video-intel: 1:2.99.917+863+g6afed33b-1
-- xorg-server: 1.20.4-1
-- mesa: 19.0.0-1
-- libdrm: 2.4.97-1
-- kernel: 5.0.3.arch1-1
-- Linux distribution: Arch Linux
-- Machine or mobo model: Z370 SLI PLUS
-- Display connector: DisplayPort
Created attachment 143762 [details] dmesg output
There are two problems associated with this issue:
1) GPU hang -> a userspace program shouldn't be able to crash the GPU
2) kernel freeze -> despite the 'resetting rcs0/chip' messages, the OS just hangs; I think it should either continue working or show a kernel panic.
Created attachment 143780 [details] vulkaninfo output Issue can be related to Vulkan, so I provide output of 'vulkaninfo' - it segfaults btw. My version: vulkan-intel 19.0.0-1.
(In reply to Andrzej Broński from comment #4) > Created attachment 143780 [details] > vulkaninfo output > > Issue can be related to Vulkan, so I provide output of 'vulkaninfo' - it > segfaults btw. > My version: vulkan-intel 19.0.0-1. Huh, can you provide the backtrace for vulkaninfo segfault?
(In reply to Tapani Pälli from comment #5)
> (In reply to Andrzej Broński from comment #4)
> > Created attachment 143780 [details]
> > vulkaninfo output
> >
> > Issue can be related to Vulkan, so I provide output of 'vulkaninfo' - it
> > segfaults btw.
> > My version: vulkan-intel 19.0.0-1.
>
> Huh, can you provide the backtrace for vulkaninfo segfault?

Sure. I built mesa with debugging symbols, then ran vulkaninfo under gdb:
~~~
...
Thread 1 "vulkaninfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff7f650ce in xcb_send_request_with_fds64 () from /usr/lib/libxcb.so.1
(gdb) bt
#0 0x00007ffff7f650ce in xcb_send_request_with_fds64 () from /usr/lib/libxcb.so.1
#1 0x00007ffff7f6566a in xcb_send_request () from /usr/lib/libxcb.so.1
#2 0x00007ffff7f74405 in xcb_query_extension () from /usr/lib/libxcb.so.1
#3 0x00007ffff54d2c3d in wsi_x11_connection_create (conn=0x48000401b9358b48, wsi_dev=0x5555555b8d80) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:135
#4 wsi_x11_get_connection (wsi_dev=wsi_dev@entry=0x5555555b8d80, conn=conn@entry=0x48000401b9358b48) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:242
#5 0x00007ffff54d341a in x11_surface_get_support (icd_surface=<optimized out>, wsi_device=0x5555555b8d80, queueFamilyIndex=<optimized out>, pSupported=0x7fffffffd824) at ../mesa-19.0.0/src/vulkan/wsi/wsi_common_x11.c:427
#6 0x0000555555562693 in ?? ()
#7 0x0000555555557f72 in ?? ()
#8 0x00007ffff7c10223 in __libc_start_main () from /usr/lib/libc.so.6
#9 0x00005555555587be in ?? ()
~~~

You can grab the core file here: https://drive.google.com/open?id=1c7lX82Y2xCkVFVv84SnaqD0X3lCRphAJ
However, it seems to be a problem with the XCB library.
(In reply to Andrzej Broński from comment #6)
> You can grab core file here
> https://drive.google.com/open?id=1c7lX82Y2xCkVFVv84SnaqD0X3lCRphAJ however
> it seems to be the problem with XCB library.

Thanks, that is weird. FYI there is another trace which refers to xcb failure in bug #110261.
Hi, it looks like I reproduced the issue; the kernel freezes completely here as well. UE4 can also be launched if you select the "flight" project, but it then simply crashes because of a Vulkan issue.

I am continuing my investigation.

Test configuration:
Manjaro
Kernel 5.0
Mesa 19.1.0 git-master
UHD 630 GPU (CFL)
(In reply to Denis from comment #8) > hi, looks like I reproduced the issue, kernel is freezing completely also. > Also UE4 engine can be launched if you select "flight" project, but - simply > crashes because of vulkan issue. > > Continue my investigations. > > Test configuration: > Manjaro > Kernel 5.0 > Mesa 19.1.0 git-master > UHD 630 gpu (CFL) You can use "-opengl" switch as workaround.
Thanks for the suggestion, but in my case I think it's unnecessary.

So, what I did:
1. Built and booted into a drm-tip kernel (with debug flags enabled)
2. Captured "top" output while running UE4 (it doesn't look like OOM...)
3. Captured a UE4Editor log while running it

Strangely, no core dump was generated when I got stuck with the hang. I will attach these logs to the bug report. Any ideas and suggestions about what to check or try are appreciated.

BTW, the PC hangs in both cases: when I build an "empty" project with samples, and when I build it without them (I thought the problem might be in some default preset that compiles some shaders, for example).
Created attachment 143842 [details] UE4_debug.log
Created attachment 143843 [details]
journal_log_full

This log contains several "hangs" from testing. To navigate, search for the word "Reboot"; it marks every place where I had to hard-reboot my PC.
Created attachment 143844 [details] top_command_during_running_UE4
(In reply to Denis from comment #10)
> thanks for suggestion, but in my case, I think, its unnecessary.
>
> So, what I did:
> 1. build and boot into drm-tip kernel (with debug flags enabled)
> 2. Got "top" results during running UE4 (it doesn't look like OOM...)
> 3. Got a UE4editor log during running it
>
> Strange that core dump wasn't generated, when I stacked with hang. I will
> attach these logs to the bug report. Any ideas and suggestions, what to
> check or try - appreciated.
>
> BTW - PC hangs in both cases, when I am building "empty" project but with
> samples, and if I build it without them (I thought that problem might be in
> some default pre-set, which can compile some shaders, for example)

I doubt the UE4 logs are useful, because this is a kernel-level problem. I looked into your kernel logs and there is only one interesting part:
~~~
Apr 02 13:12:23 manjaro-pc kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in UE4Editor [4360], hang on rcs0
Apr 02 13:12:23 manjaro-pc kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
-- Reboot --
~~~

However, if you want to capture the full crash log along with /sys/class/drm/card0/error, it's not an easy task - I used netconsole for it. You can contact me at andrzej1_1@o2.pl and I will give you details on how to do it.
Hi Andrzej, I agree that the UE4 logs might be useless, but as I said, my idea was that something UE4 did could have caused the GPU hang, and that action might show up in the logs. That's why I collected them too.

> However if you want to capture full crash log along with /sys/class/drm/card0/error it's not easy task - I used netconsole for it. You can contact me at andrzej1_1@o2.pl and I will give you details how to do it.

That's an interesting point for future investigations, so I would be happy to learn how to do this. Right now I don't see any value in getting the error log myself, because you have already attached it and, as I found out, nothing extra (a debug mesa build, a debug kernel, etc.) makes it any more detailed.

So, as you said, there are 2 issues here: a GPU hang, which leads to a kernel hang.
The hang happens due to compute workload in ANV and indeed fully hangs the system. I haven't been able to identify the cause yet.
I moved this ticket to the Vulkan section, because when I tested this:
> You can use "-opengl" switch as workaround.
I found that there were no hangs at all. I tested all of my (about 5) projects and UE4 was 100% workable.

I was also able to capture a vktrace of a project launch and tested it on another machine. Sometimes (about 1 in 5 tries) it leads to a machine hang.
To replay it you need vktrace (part of vulkan-tools).

https://drive.google.com/file/d/1tCqzmYVx7zKTFauA3YdaO7DgPxJqYizh/view?usp=sharing
(In reply to Denis from comment #17) > I moved this ticket to the vulkan section, because when I tested this: > >You can use "-opengl" switch as workaround. > I found out that no any hangs at all. I tested on all my (about 5) projects, > result - 100% workable UE4. > > Also I could get an vktrace of project launching and tested it on another > machine. And can say that sometimes (like, 1 from 5 tries) it leads to > machine hang. > To replay it you need vktrace (part of vulkan-tools). > > https://drive.google.com/file/d/1tCqzmYVx7zKTFauA3YdaO7DgPxJqYizh/ > view?usp=sharing You attached trace of Quake - is it a mistake?
Oh... I am sorry, yes, that was a mistake.

https://drive.google.com/open?id=1_6fFT_tuIb6mdZK6cQlWZPbFSeh3Bb2d
Here is the correct link.
(In reply to Denis from comment #19)
> oh... I am sorry, yes, that's by mistake.
>
> https://drive.google.com/open?id=1_6fFT_tuIb6mdZK6cQlWZPbFSeh3Bb2d
> Here it is correct link

This trace file is very useful! I ran the following command 5 times:
~~~
$ vkreplay -o vkquake-flicker-3.vktrace
~~~
and it also froze my PC.
That's strange - it definitely shouldn't hang, it doesn't hang on our machines, maybe trace is incompatible... Could you try actual vkQuake?
(In reply to Danylo from comment #21)
> That's strange - it definitely shouldn't hang, it doesn't hang on our
> machines, maybe trace is incompatible... Could you try actual vkQuake?

Sorry for the confusion, I copied the wrong command from my history (a previous test with vkquake). I meant that the following command makes my OS hang:
~~~
$ vkreplay -o ue4_trace3.vktrace
~~~
Created attachment 144093 [details]
standalone hang reproducer

I found a compute shader which hangs the GPU, reduced it and made a standalone reproducer.
Warning: it will hang the system, with a hard reset being the only way out (I think this could be classified as a bug in its own right, because I don't think it should hang so badly).

The shader itself is small and simple:

    #version 430
    layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

    layout(binding = 0) uniform usamplerBuffer StartOffsetGrid; // length 12160
    layout(binding = 1, r32ui) uniform uimageBuffer RWNextCulledLightData;

    void main()
    {
        if (all(lessThan(gl_GlobalInvocationID, uvec3(19, 10, 32))))
        {
            uint u0 = (((gl_GlobalInvocationID.z * uint(10)) + gl_GlobalInvocationID.y) * uint(19)) + 32;
            uint u7 = 6000 + u0; // change to 5000 and the hang will go away
            uint u11 = texelFetch(StartOffsetGrid, int(u7)).x;
            uint _225 = imageAtomicAdd(RWNextCulledLightData, 0, u11);
        }
    }

StartOffsetGrid and RWNextCulledLightData point to the same buffer, with RWNextCulledLightData starting right after StartOffsetGrid.

However, I'm not able to identify the cause. Change "6000 + u0" to "5000 + u0" and there will be no hang; in both cases the index stays inside the buffer. So I need additional help here.
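The claim that the index stays inside the buffer in both cases can be checked by hand. The following is a minimal sketch in Python (not part of the reproducer) that enumerates every invocation allowed by the shader's guard; the function name `u7_range` is made up for illustration, and the buffer length is taken from the `// length 12160` comment in the shader.

```python
# Bounds from the shader's guard: gl_GlobalInvocationID < uvec3(19, 10, 32).
# x does not appear in the index expression, so only y and z matter.
BUFFER_LENGTH = 12160  # texels, from the "// length 12160" comment in the shader


def u7_range(constant):
    """Return (min, max) of u7 = constant + ((z * 10 + y) * 19 + 32)."""
    values = [
        constant + ((z * 10 + y) * 19 + 32)
        for z in range(32)
        for y in range(10)
    ]
    return min(values), max(values)


for constant in (6000, 5000):
    lowest, highest = u7_range(constant)
    # Print the index range and whether it fits inside StartOffsetGrid.
    print(constant, lowest, highest, highest < BUFFER_LENGTH)
```

With the constant 6000 the largest index is 12093, and with 5000 it is 11093, so neither variant reads past the 12160 texels of StartOffsetGrid.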
I've seen mystery hangs before with UE4 and atomics but was never able to really pin it down. Thanks for making a simple reproducer! Mind trying a couple things? 1. Place the buffer views in the other order in the buffer so the atomic comes first. Does that fix it? 2. Leaving the buffer views in the current order, try spacing them out so that there is at least 16K between the two views. If that works, what's the minimum distance? I've got a theory about what might be going wrong here and I don't like it...
Created attachment 144097 [details]
reproducer v2

- Changing the order of the buffers doesn't help.
- Spacing them helps; the necessary spacing depends on the constant value in the "uint u7 = 6080 + u0;" expression.

Results:

Constant  Spacing
6120      512
6080      512
6016      512
6001      512
6000      576
5998      576
5997      512
5760      384
Hellblade: Senua’s Sacrifice most likely suffers from this issue - it fully hangs the system on a compute shader which uses buffers allocated nearby in the similar fashion as this bug. Vulkan rdc made on HD 620: https://mega.nz/#!9ENTCIAa!yitgxJYXcxMqAkC_hnzRIadUy-9oocclGLSSurofh18 Also the reproducer and the game don't hang on HD 5500 (gen8)
First off, thanks for all your digging and I'm sorry I've not gotten back to this yet.

I have a feeling I know what's going wrong here and I really don't like it. When I've played with this before, I've observed that the hangs go away when I disable L3 caching on those buffers via MOCS settings or change the L3$ programming to disable the data cache. What I suspect is happening is that the same cache line is getting pulled into the sampler portion of the L3$ and into the data cache portion of the L3$ at the same time, and something about the atomic is causing the cache to blow up and the chip to hang.

Your first reaction to this might be to say, "aren't cache lines usually 64B?" Yes, that's the usual size or at least in the right area. However, our HW docs expressly say that the sampler sometimes fetches more cache lines than you'd naively think it needs. Also (and I am speculating a bit here), the cache might be doing some speculative pre-fetching that reads outside of the normal boundaries. I'm not sure if the two of those are enough to read 576B outside, but the evidence suggests maybe they are.

So how do we work around the issue? Good question. Unfortunately, because of the way that buffer views work in Vulkan, we can't just force them to space their buffers out. It's possible that we may be able to work around this in the shader by doing some sort of barrier between the texture operation and the atomic to ensure that, at the very least, the two aren't in flight at the same time.
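To put the measured spacings next to the 64B figure, here is a small sketch that converts the "reproducer v2" results into whole cache lines. It assumes the spacings are in bytes (as the "576B" above reads them) and that the line size really is 64 bytes; `spacing_in_cache_lines` is a name invented for this example.

```python
CACHE_LINE_BYTES = 64  # the "usual" size quoted above; an assumption for this GPU

# (shader constant, minimum spacing that avoided the hang), from reproducer v2.
RESULTS = [
    (6120, 512), (6080, 512), (6016, 512), (6001, 512),
    (6000, 576), (5998, 576), (5997, 512), (5760, 384),
]


def spacing_in_cache_lines(spacing_bytes, line_bytes=CACHE_LINE_BYTES):
    """Convert a spacing in bytes to a count of whole cache lines."""
    return spacing_bytes // line_bytes


for constant, spacing in RESULTS:
    print(constant, spacing, spacing_in_cache_lines(spacing))
```

Under those assumptions every measured spacing is a whole number of lines: 6, 8 or 9, which is far more than a single neighbouring cache line.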
Wow, that makes some sense. And the workaround, if this is true, doesn't look great, since it would be problematic to apply it only in cases where such buffers are nearby...

I hope you can confirm this - it's out of my league =)
Update: I tried inserting a barrier() and it did nothing. :-(
Put a bit more time into it. Here's what I have so far:

1. Inserting a barrier() does nothing.
2. Setting MOCS = 0 on the buffer used for the atomic makes the hang go away.
3. If I hack up the driver to use an untyped atomic instead of a typed atomic, no change. It still hangs, so typed vs. untyped doesn't seem to matter.
4. If I hack the driver up to use SURFTYPE_1D instead of SURFTYPE_BUFFER for the uniform texel buffer, the hang goes away.
5. If I use SURFTYPE_1D for the VkBufferView with the atomic and SURFTYPE_BUFFER for the uniform texel buffer, it hangs.

It's very much starting to look like it's just a weird interaction between atomics, the L3$, and the sampler with SURFTYPE_BUFFER. Unfortunately, that means our options for mitigating it are somewhat limited. We can't just use SURFTYPE_1D for all buffer textures because that has a limited width of 16K, and I really don't want to start disabling L3 caching on all atomics.
Created attachment 144550 [details] [review] Patch to use SURFTYPE_1D for texel buffers One more thing before I call it a day. I've attached a hack patch which attempts to use SURFTYPE_1D for texel buffers. It's not a solution for the reasons I've already listed. However, could you take a bit of time and experiment with it on some of the apps that are known to hang and see if it helps them? If so, that at least tells us that we're headed in the right direction.
(In reply to Jason Ekstrand from comment #31)
> Created attachment 144550 [details] [review]
> Patch to use SURFTYPE_1D for texel buffers
>
> One more thing before I call it a day. I've attached a hack patch which
> attempts to use SURFTYPE_1D for texel buffers. It's not a solution for the
> reasons I've already listed. However, could you take a bit of time and
> experiment with it on some of the apps that are known to hang and see if it
> helps them? If so, that at least tells us that we're headed in the right
> direction.

I installed mesa with your patch and there is no longer a GPU hang after launching _Unreal Engine_. For me it's a satisfactory workaround; however, I hope you will figure out the root cause.
> It's not a solution for the reasons I've already listed.
> However, could you take a bit of time and experiment with it on some of the
> apps that are known to hang and see if it helps them?

Unfortunately, Hellblade crashes with this patch on a buffer with:
stride_b: 4, num_elements: 65280, pitch: 3

This seems to be the width limitation you mentioned:
> We can't just use SURFTYPE_1D for all buffer textures because that has a
> limited width of 16K

MOCS = 0 indeed helps.
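The arithmetic behind that crash is straightforward. As a minimal sketch, assuming "16K" means 16384 elements (the thread does not spell out the exact limit) and using a helper name, `fits_in_surftype_1d`, that exists only in this example:

```python
# "16K" from the thread, read here as 16 * 1024 elements (an assumption).
SURFTYPE_1D_MAX_WIDTH = 16 * 1024


def fits_in_surftype_1d(num_elements, max_width=SURFTYPE_1D_MAX_WIDTH):
    """Return True if a texel buffer of this length fits within the 1D width limit."""
    return num_elements <= max_width


# Hellblade buffer reported above, then the reproducer's StartOffsetGrid.
print(fits_in_surftype_1d(65280))
print(fits_in_surftype_1d(12160))
```

The Hellblade buffer is roughly four times over the limit under this reading, while the reproducer's 12160-texel buffer fits, which is consistent with the patch helping UE4 but not Hellblade.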
I've done a bunch of poking at this, including running it on Windows (where it doesn't hang) and getting an aub capture. There's nothing obvious that we're really doing differently. :-( I'm guessing there's some L3$ chicken bit we're missing somewhere.

I also experimented with reworking the test to use a fragment shader instead of a compute shader and was able to reproduce the hang with FS as well. This eliminates any possibility that it's some GPGPU workaround we're missing.
*** Bug 110377 has been marked as a duplicate of this bug. ***
From Chris Wilson:

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 704ace01e7f5..890a3bcfacea 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -667,6 +667,10 @@ gen9_gt_workarounds_init(struct drm_i915_private *i915, struct i915_wa_list *wal
                            MMCD_PCLA | MMCD_HOTSPOT_EN);
        }
 
+       wa_write_masked_or(wal, _MMIO(0xb008), BIT(0), 0);
+       wa_write_masked_or(wal, _MMIO(0xb118), BIT(22), 0);
+       wa_write_masked_or(wal, _MMIO(0xb11c), BIT(8), 0);
+
        /* WaDisableHDCInvalidation:skl,bxt,kbl,cfl */
        wa_write_or(wal,
                    GAM_ECOCHK,

Assuming none of those registers are protected, we should be able to do that from userspace and disable the L3$ for just atomics.
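For readers unfamiliar with the helper, the three added lines each clear a single bit. The sketch below models that in Python under the assumption that wa_write_masked_or(wal, reg, mask, val) behaves as a read-modify-write that clears the bits in `mask` and then ORs in `val`, as its name suggests; the starting register contents are hypothetical and chosen only so the cleared bit is visible.

```python
def bit(n):
    """Equivalent of the kernel's BIT(n) macro for small n."""
    return 1 << n


def masked_or(old, mask, val):
    """Clear the bits in mask, then OR in val, on a 32-bit register value."""
    return ((old & ~mask) | val) & 0xFFFFFFFF


# Hypothetical starting contents; real hardware values will differ.
regs = {0xB008: 0xFFFFFFFF, 0xB118: 0xFFFFFFFF, 0xB11C: 0xFFFFFFFF}

# (register offset, mask, value) triples taken from the diff above.
patch = [(0xB008, bit(0), 0), (0xB118, bit(22), 0), (0xB11C, bit(8), 0)]

for reg, mask, val in patch:
    regs[reg] = masked_or(regs[reg], mask, val)
    print(hex(reg), hex(regs[reg]))
```

Since `val` is 0 in all three calls, the effect in this model is simply that bit 0, bit 22 and bit 8 of the respective registers end up cleared and every other bit is preserved.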
Leaving myself more notes: Looking around at Windows driver code and batches, the Windows driver is definitely leaving "Non-IA coherent atomics enable" enabled in L3SQCREG4 and they don't have this hang so the workaround must be somewhere else.
Chris posted a kernel patch on 110998 which fixes the reproducer case. Could you please try it with UE4 and any other apps we've suspected as having this same hang and see what all it fixes?
Hi, unfortunately I deleted UE4 engine, but I tested on vktrace which 100% reproduced issue. On mesa 17.3.9 it doesn't hang (on later mesa versions vktrace crashed, asked Danylo to check it). Also I tested apitrace for DX11 Hellblade game, it is UE4 engine game, and it also had a hang. Apitrace was played successfully. So, based on this, I would say, that kernel patch helped. Tested on drm-tip from 15 July + mentioned patch.
(In reply to Jason Ekstrand from comment #38) > Chris posted a kernel patch on 110998 which fixes the reproducer case. Could > you please try it with UE4 and any other apps we've suspected as having this > same hang and see what all it fixes? Can patch be tested without recompiling the whole kernel?
I have frequent crashes in mpv that look related:

#0 0x00007fbd8eaa2b45 in xcb_send_request_with_fds64 () from /usr/lib64/libxcb.so.1
#1 0x00007fbd8eaa2ce9 in xcb_send_request () from /usr/lib64/libxcb.so.1
#2 0x00007fbd8eaa9724 in xcb_intern_atom () from /usr/lib64/libxcb.so.1
#3 0x00007fbd8f153ef3 in set_adaptive_sync_property (conn=conn@entry=0x7fbd6c3c8e40, drawable=58720258, state=<optimized out>, state@entry=0) at ../mesa-19.1.4/src/loader/loader_dri3_helper.c:114
#4 0x00007fbd8f15516c in loader_dri3_drawable_init (conn=0x7fbd6c3c8e40, drawable=drawable@entry=58720258, dri_screen=0x7fbd6c841ed0, is_different_gpu=<optimized out>, multiplanes_available=<optimized out>, dri_config=0x7fbd6c2b7920, ext=0x7fbd6c6727f8, vtable=0x7fbd8f165ae0 <egl_dri3_vtable>, draw=0x7fbd6c7778c0) at ../mesa-19.1.4/src/loader/loader_dri3_helper.c:382
#5 0x00007fbd8f14e258 in dri3_create_surface (drv=drv@entry=0x7fbd6c2b8290, disp=disp@entry=0x7fbd6c6c7540, type=type@entry=4, conf=0x7fbd6c2a88b0, native_surface=<optimized out>, attrib_list=0x0) at ../mesa-19.1.4/src/egl/drivers/dri2/platform_x11_dri3.c:179
#6 0x00007fbd8f14e377 in dri3_create_window_surface (drv=0x7fbd6c2b8290, disp=0x7fbd6c6c7540, conf=<optimized out>, native_window=<optimized out>, attrib_list=<optimized out>) at ../mesa-19.1.4/src/egl/drivers/dri2/platform_x11_dri3.c:232
#7 0x00007fbd8f147695 in dri2_create_window_surface (drv=<optimized out>, disp=<optimized out>, conf=<optimized out>, native_window=<optimized out>, attrib_list=<optimized out>) at ../mesa-19.1.4/src/egl/drivers/dri2/egl_dri2.c:1591
#8 0x00007fbd8f13c011 in _eglCreateWindowSurfaceCommon (disp=disp@entry=0x7fbd6c6c7540, config=config@entry=0x7fbd6c2a88b0, native_window=native_window@entry=0x3800002, attrib_list=attrib_list@entry=0x0) at ../mesa-19.1.4/src/egl/main/eglapi.c:929
#9 0x00007fbd8f13c205 in eglCreateWindowSurface (dpy=<optimized out>, config=0x7fbd6c2a88b0, window=58720258, attrib_list=0x0) at ../mesa-19.1.4/src/egl/main/eglapi.c:945
#10 0x000056220fda0571 in mpegl_init ()
#11 0x000056220fd882ff in ra_ctx_create ()
#12 0x000056220fda8b44 in preinit ()
#13 0x000056220fda7f01 in vo_thread ()
#14 0x00007fbd8ecb1458 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fbd8ebdf3cf in clone () from /lib64/libc.so.6

The execution path before xcb_intern_atom varies.
(In reply to Sergey Alirzaev from comment #41) > I have frequent crashes in mpv that look related: That's not at all related. It's not even using the same userspace driver.
Fixed in the kernel:

commit 9d7b01e93526efe79dbf75b69cc5972b5a4f7b37 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 4 11:07:07 2019 +0100

    drm/i915: Restore relaxed padding (OCL_OOB_SUPPRES_ENABLE) for skl+

    This bit was fliped on for "syncing dependencies between camera and
    graphics". BSpec has no recollection why, and it is causing
    unrecoverable GPU hangs with Vulkan compute workloads.

    From BSpec, setting bit5 to 0 enables relaxed padding requirements for
    buffers, 1D and 2D non-array, non-MSAA, non-mip-mapped linear surfaces;
    and *must* be set to 0h on skl+ to ensure "Out of Bounds" case is
    suppressed.

    Reported-by: Jason Ekstrand <jason@jlekstrand.net>
    Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
    Fixes: 8424171e135c ("drm/i915/gen9: h/w w/a: syncing dependencies between camera and graphics")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: denys.kostin@globallogic.com
    Cc: Jason Ekstrand <jason@jlekstrand.net>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v4.1+
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190904100707.7377-1-chris@chris-wilson.co.uk

Solves the immediate test case.
Patch is merged into Linux 5.3.0, however I still get GPU hang. Can someone confirm that issue is solved?
Hmm, weird. I also installed the 5.3.0 kernel (from Manjaro) and was able to reproduce the hang:

Manjaro OS
Linux den-pc 5.3.0-1-MANJARO #1
Mesa 19.1.5
KBL (HD620)
________________________________________________________
But on Ubuntu with a 5.3.0 kernel it works great:

Mesa 19.3.0-devel (git-1e483a87bc)
VERSION="18.04.3 LTS (Bionic Beaver)"
5.3.0-050300-generic
KBL (HD620)

(To be sure that I am not crazy, I also tested Ubuntu with the 4.15 kernel and got a hang.)

So it looks like the fix wasn't included in ALL 5.3.0 kernels, and we need to wait a bit 8-/
But it's already included in Arch: https://git.archlinux.org/linux.git/commit/?h=v5.3-arch1&id=592b8d8759ceb7086e1683e1796c7110e6c2ae8f
Holy ... This bug drove me crazy, even though the answer was close at hand. Here is the patch date from your link:

author Linus Torvalds <torvalds@linux-foundation.org> 2019-09-14 11:54:57 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 2019-09-14 11:54:57 -0700

So the commit landed in the kernel on 14.09, while my current 5.3.0 kernel build is dated Sep 2 18:26:38.
So yes, the fix simply wasn't included!

And one last thing: the Ubuntu kernel (which worked fine) has a 15.09 release date.
upd Linux den-pc 5.3.1-arch1-1-ARCH #1 SMP PREEMPT Sat Sep 21 11:33:49 UTC 2019 x86_64 GNU/Linux tested and worked fine (without hangs) You may also install it from here https://www.archlinux.org/packages/core/x86_64/linux/
@Denis Unfortunately patch (14 sep) was included in linux 5.3.0 (15 sep) for Arch. Today I tested 5.3.1 and cpu is still hanging. I will provide output logs in few days.
Okay, very interesting. May I ask you to run a test with the RenderDoc capture from here? => https://mega.nz/#!9ENTCIAa!yitgxJYXcxMqAkC_hnzRIadUy-9oocclGLSSurofh18
To run it you need RenderDoc => https://renderdoc.org/ (use the nightly version)
>renderdoccmd replay <path_to_trace.rdc>

Since we assumed that the cause of the hang was the same for all the cases found, I re-checked them with this game (I have an apitrace and a RenderDoc capture for Hellblade, and I also tested Darksiders 3 and the attached reproducer from Danylo). Oddly, the original trace (the vktrace for UE4) stopped working for me (no hangs, it just didn't replay).

Secondly, please create a new issue in GitLab, as all tickets and discussions have been migrated there => https://gitlab.freedesktop.org/mesa/mesa/issues
Created attachment 145662 [details] RenderDoc's hellblade replay log
@Denis
1) I replayed the Hellblade trace without any hang. I stopped it after a few minutes, because it showed only the game menu and no new events happened - I attached the log of it. However, this does not mean the bug was fixed correctly, because the RenderDoc docs state that "portability of captures between hardware is not guaranteed".
2) I am also unable to replay the old ue4_trace3.vktrace because of a "Segmentation fault".
3) Okay, I will capture all the logs (as I did when I filed this bug here), then create an issue on GitLab.
The same thing happened for me (something broke vktrace).
About RenderDoc: even though it shows only the main menu, that was enough to cause a GPU hang (you can run it on an older kernel and see it :) ).
If you get a hang every time, you may try to make your own vktrace, as I did before. Possibly it will work and will also lead to a hang (so that we can reproduce it on our side).

I also forgot to ask you to check the reproducer at https://bugs.freedesktop.org/show_bug.cgi?id=110228#c25 (it was based on the vktrace and reproduced that hang 100% of the time).

So, to summarize: since RenderDoc doesn't hang your system (and if the reproducer doesn't hang it either), we are facing some new issue.
Looking forward to your feedback.
@Denis Recently my UE4 started crashing, so installed latest version and there is no hang... Standalone reproducer works fine as well. I am unable to reproduce issue now, so I am happy and I acknowledge issue as fixed. If I ever encounter this problem in future, I will let you know.
(In reply to Andrzej Broński from comment #54) > @Denis Recently my UE4 started crashing, so installed latest version and > there is no hang... Standalone reproducer works fine as well. > > I am unable to reproduce issue now, so I am happy and I acknowledge issue as > fixed. If I ever encounter this problem in future, I will let you know. Great!