Created attachment 136544 [details]
/sys/class/drm/card0/error with mesa 17.3.0
I have an Intel(R) Celeron(R) G3950 @ 3.00GHz CPU (Kaby Lake) with integrated graphics:
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 610 (rev 04)
When I play Portal 2 from Steam on Linux, it often freezes with GPU hang messages in syslog. I'm using the 4.14.6 kernel with the Calculate (basically Gentoo) patchset.
I've tried two versions of mesa: 17.3.0 and git master (commit 0158565924564ec2edca7acd0ccbc33a369ea50d) with libdrm from git as well (commit 831036a6f62005da9fb4a75fe043bd96ce672d27). With both versions I got the hangs.
Additionally I tried applying the patch from https://patchwork.freedesktop.org/series/35325/ (see https://bugs.freedesktop.org/show_bug.cgi?id=102435#c27), but it did not resolve the issue.
I'm attaching two hang logs, one with 17.3.0 and the second with git master + patch.
Created attachment 136545 [details]
/sys/class/drm/card0/error with mesa from git master
Is there any known working Mesa version for this particular issue?
I can reproduce this 100% of the time on my Kabylake GT2. Start the game, type 'map sp_a2_column_blocker' in the developer console, then walk straight ahead to the door. As soon as the door starts to open, the GPU crashes.
I was able to reproduce the hang on master, 17.3-branchpoint, and 17.2-branchpoint.
I can reproduce the crash as described by Kenneth Graunke in the 'sp_a2_column_blocker' level.
Machine specifications are:
CPU: Intel(R) Core(TM) i3-6006U CPU @ 2.00GHz
GPU: Intel HD Graphics 520
OS: Archlinux x86_64
I cannot confirm that it is the same problem, but for me the game also crashes if the 'discouragement redirection cube' interacts with any open portal, for example in the 'sp_a2_laser_stairs' level.
Jason and I discovered that making intel_miptree_choose_aux_usage() never set CCS_E or CCS_D avoids the hang. So it appears to be CCS related again, somehow...
Actually, it looks like they've bound a R8G8B8A8_UNORM texture to a sampler2DShadow, and are calling shadow2D() on it. Pretty sure this is supposed to return undefined results. We're translating it to a sample_c message, which has a restriction that the surface format needs to support shadow sampling. My guess is that if you combine that with CCS and HiZ somehow, it dies badly.
Major kudos to Jason for noticing that. He gave me a hack that seems to fix it.
We'll probably need to be properly defensive against this. Hopefully we should have a proper fix soon...
(In reply to Kenneth Graunke from comment #6)
> Major kudos to Jason for noticing that. He gave me a hack that seems to fix
> it.
> We'll probably need to be properly defensive against this. Hopefully we
> should have a proper fix soon...
Do you have some patch against mesa master that we can try and see if problem is no longer reproducible?
Created attachment 136800 [details] [review]
hack patch from Jason that may help
Here's a hack patch from Jason that seemed to help for me. Let me know if it helps for you as well. It's probably not exactly what we want, but it's close.
Created attachment 136820 [details] [review]
Patch that should fix the hang
Try this patch instead, it's a better version. I sent it out, but the mailing lists on freedesktop.org seem to be down today. Hopefully it'll show up eventually.
(In reply to Kenneth Graunke from comment #9)
> Created attachment 136820 [details] [review]
> Patch that should fix the hang
> Try this patch instead, it's a better version. I sent it out, but the
> mailing lists on freedesktop.org seem to be down today. Hopefully it'll
> show up eventually.
I tried this patch on the latest mesa master (commit d67ef485804cab53499dd763db136070ef107a16) and it seems to work. I cannot reproduce the crash in sp_a2_column_blocker.
Waiting for a real fix :)
Should be fixed by:
Author: Kenneth Graunke <email@example.com>
Date: Wed Jan 17 14:16:04 2018 -0800
i965: Bind null render targets for shadow sampling + color.
Portal 2 appears to bind RGBA8888_UNORM textures to a sampler2DShadow,
and calls shadow2D() on it. This causes undefined behavior in OpenGL.
Unfortunately, our sampler appears to hang in this scenario, which is
not acceptable. Just give them a null surface instead, which returns
Fixes GPU hangs in Portal 2 on Kabylake.
Huge thanks to Jason Ekstrand for noticing this crazy behavior while
sifting through crash dumps.
Reviewed-by: Topi Pohjolainen <firstname.lastname@example.org>
Reviewed-by: Jason Ekstrand <email@example.com>
Thanks again for the bug report and testing those patches!
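For anyone reading this later, here is a rough sketch of the idea in the commit message above: when a shadow sampler is paired with a texture whose format carries no depth, bind a null surface rather than the real one. This is not the actual i965 code. Every type and function name below (tex_base_format, surface_choice, texture_binding, format_supports_shadow_compare, choose_sampler_surface) is a hypothetical stand-in made up for illustration, and the real driver tracks far more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT Mesa's real types. */
enum tex_base_format {
    TEX_BASE_RGBA,          /* e.g. an R8G8B8A8_UNORM color texture */
    TEX_BASE_DEPTH,
    TEX_BASE_DEPTH_STENCIL,
};

enum surface_choice {
    SURFACE_TEXTURE,        /* bind the texture's real surface */
    SURFACE_NULL,           /* bind a null surface instead */
};

struct texture_binding {
    enum tex_base_format base_format;
};

/* Assumption: shadow comparison is only meaningful for depth-bearing formats. */
static bool format_supports_shadow_compare(enum tex_base_format f)
{
    return f == TEX_BASE_DEPTH || f == TEX_BASE_DEPTH_STENCIL;
}

/*
 * Decide which surface to bind for one sampler. A shadow sampler paired
 * with a color texture is undefined behavior in OpenGL, so the driver is
 * free to substitute a null surface instead of risking a hang.
 */
static enum surface_choice
choose_sampler_surface(const struct texture_binding *tex, bool sampler_is_shadow)
{
    assert(tex != NULL);

    if (sampler_is_shadow && !format_supports_shadow_compare(tex->base_format))
        return SURFACE_NULL;

    return SURFACE_TEXTURE;
}
```

With this sketch, an RGBA texture bound to a shadow sampler maps to SURFACE_NULL, while the same texture on a regular sampler, or a depth texture on a shadow sampler, keeps its real surface. The point of checking at bind time is that the driver never has to issue a shadow-compare sample against a format that cannot support it.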