Bug 104487 - [KBL] portal2_linux GPU hang
Summary: [KBL] portal2_linux GPU hang
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Kenneth Graunke
QA Contact: Intel 3D Bugs Mailing List
 
Reported: 2018-01-04 09:11 UTC by Maxim
Modified: 2018-01-18 17:33 UTC

Attachments
/sys/class/drm/card0/error with mesa 17.3.0 (753.55 KB, text/x-log)
2018-01-04 09:11 UTC, Maxim
/sys/class/drm/card0/error with mesa from git master (755.22 KB, text/x-log)
2018-01-04 09:11 UTC, Maxim
hack patch from Jason that may help (1.90 KB, patch)
2018-01-17 08:13 UTC, Kenneth Graunke
Patch that should fix the hang (3.27 KB, patch)
2018-01-17 23:47 UTC, Kenneth Graunke

Description Maxim 2018-01-04 09:11:02 UTC
Created attachment 136544
/sys/class/drm/card0/error with mesa 17.3.0

I have an Intel(R) Celeron(R) CPU G3950 @ 3.00GHz (Kaby Lake) with integrated graphics:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 610 (rev 04)

When I play Portal 2 from Steam on Linux, it often freezes with GPU hang messages in syslog. I'm using the 4.14.6 kernel with the Calculate (basically Gentoo) patchset.

I've tried two versions of mesa: 17.3.0 and git master (commit 0158565924564ec2edca7acd0ccbc33a369ea50d), the latter with libdrm from git as well (commit 831036a6f62005da9fb4a75fe043bd96ce672d27). I got the hangs with both versions.

Additionally I tried applying the patch from https://patchwork.freedesktop.org/series/35325/ (see https://bugs.freedesktop.org/show_bug.cgi?id=102435#c27), but it did not resolve the issue.

I'm attaching two hang logs, one with 17.3.0 and the second with git master + patch.
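
For anyone collecting the same kind of log, here is a minimal sketch of checking whether the kernel currently holds an error state before attaching it. The path comes from the attachment names above. The function names are hypothetical, and the "no error state collected" wording is an assumption about what the file contains when nothing was captured; it may differ between kernel versions. Reading the file usually needs root.

```python
from pathlib import Path
from typing import Optional

# Path taken from the attachment names in this report.
ERROR_STATE_PATH = Path("/sys/class/drm/card0/error")


def has_error_state(text: str) -> bool:
    """Return True if the text looks like a captured GPU error state.

    Assumption: with nothing captured, the file is either empty or holds a
    short "no error state collected" message (wording may vary by kernel).
    """
    stripped = text.strip()
    if not stripped:
        return False
    return "no error state collected" not in stripped.lower()


def read_error_state(path: Path = ERROR_STATE_PATH) -> Optional[str]:
    """Return the error state text, or None if it is missing or unreadable."""
    try:
        text = path.read_text(errors="replace")
    except OSError:
        # Missing file, no permission, or no i915 device on this machine.
        return None
    return text if has_error_state(text) else None


if __name__ == "__main__":
    state = read_error_state()
    if state is None:
        print("no GPU error state available")
    else:
        print(f"captured error state: {len(state)} bytes")
```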
Comment 1 Maxim 2018-01-04 09:11:31 UTC
Created attachment 136545
/sys/class/drm/card0/error with mesa from git master
Comment 2 Elizabeth 2018-01-05 23:22:49 UTC
Hello Maxim, 
Is there any known working Mesa version for this particular issue?
Comment 3 Kenneth Graunke 2018-01-07 04:09:12 UTC
I can reproduce this 100% of the time on my Kabylake GT2.  Start the game, type 'map sp_a2_column_blocker' in the developer console, then walk straight ahead to the door.  As soon as the door starts to open, the GPU crashes.

I was able to reproduce the hang on master, 17.3-branchpoint, and 17.2-branchpoint.
Comment 4 Carlos 2018-01-15 01:54:38 UTC
I can reproduce the crash as described by Kenneth Graunke in the 'sp_a2_column_blocker' level.

Machine specifications are:

CPU: Intel(R) Core(TM) i3-6006U CPU @ 2.00GHz
GPU: Intel HD Graphics 520
OS: Archlinux x86_64
Mesa: 17.3.2-2

I cannot confirm that it is the same problem, but for me the game also crashes if the 'discouragement redirection cube' interacts with any open portal, for example in the 'sp_a2_laser_stairs' level.
Comment 5 Kenneth Graunke 2018-01-17 05:38:27 UTC
Jason and I discovered that making intel_miptree_choose_aux_usage() never set CCS_E or CCS_D avoids the hang.  So it appears to be CCS related again, somehow...
Comment 6 Kenneth Graunke 2018-01-17 07:36:57 UTC
Actually, it looks like they've bound an R8G8B8A8_UNORM texture to a sampler2DShadow, and are calling shadow2D() on it.  Pretty sure this is supposed to return undefined results.  We're translating it to a sample_c message, which has a restriction that the surface format needs to support shadow sampling.  My guess is that if you combine that with CCS and HiZ somehow, it dies badly.

Major kudos to Jason for noticing that.  He gave me a hack that seems to fix it.

We'll probably need to be properly defensive against this.  Hopefully we should have a proper fix soon...
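
The mismatch described above can be sketched as a small check. This is only an illustration under simplifying assumptions, not driver code: the function name and the use of plain strings for GLSL sampler types and GL base formats are hypothetical, and it looks at the texture format only (it ignores the texture compare mode, which also affects whether a lookup is well defined).

```python
# GL base formats that carry depth data, written as plain strings for the sketch.
DEPTH_BASE_FORMATS = {"GL_DEPTH_COMPONENT", "GL_DEPTH_STENCIL"}


def shadow_lookup_is_defined(sampler_type: str, base_format: str) -> bool:
    """Return False for a shadow sampler paired with a non-depth texture.

    Simplification: only the texture's base format is considered.
    """
    # GLSL shadow sampler type names (sampler2DShadow, samplerCubeShadow, ...)
    # all end in "Shadow".
    is_shadow_sampler = sampler_type.endswith("Shadow")
    if not is_shadow_sampler:
        return True
    return base_format in DEPTH_BASE_FORMATS


if __name__ == "__main__":
    # The pairing reported for Portal 2: a color texture on a shadow sampler.
    print(shadow_lookup_is_defined("sampler2DShadow", "GL_RGBA"))
```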
Comment 7 Maxim 2018-01-17 07:39:46 UTC
(In reply to Kenneth Graunke from comment #6)
> 
> Major kudos to Jason for noticing that.  He gave me a hack that seems to fix
> it.
> 
> We'll probably need to be properly defensive against this.  Hopefully we
> should have a proper fix soon...


Do you have a patch against mesa master that we can try, to see whether the problem is still reproducible?
Comment 8 Kenneth Graunke 2018-01-17 08:13:25 UTC
Created attachment 136800
hack patch from Jason that may help

Here's a hack patch from Jason that seemed to help for me.  Let me know if it helps for you as well.  It's probably not exactly what we want, but it's close.
Comment 9 Kenneth Graunke 2018-01-17 23:47:47 UTC
Created attachment 136820
Patch that should fix the hang

Try this patch instead, it's a better version.  I sent it out, but the mailing lists on freedesktop.org seem to be down today.  Hopefully it'll show up eventually.
Comment 10 Maxim 2018-01-18 09:46:03 UTC
(In reply to Kenneth Graunke from comment #9)
> Created attachment 136820
> Patch that should fix the hang
> 
> Try this patch instead, it's a better version.  I sent it out, but the
> mailing lists on freedesktop.org seem to be down today.  Hopefully it'll
> show up eventually.

I tried this patch on the latest mesa master (commit d67ef485804cab53499dd763db136070ef107a16) and it seems to work. I can no longer reproduce the crash in sp_a2_column_blocker.
Waiting for a real fix :)
Comment 11 Kenneth Graunke 2018-01-18 17:33:20 UTC
Should be fixed by:

commit 3e18c53e59457f585de217208e1745f2683be0b9
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Wed Jan 17 14:16:04 2018 -0800

    i965: Bind null render targets for shadow sampling + color.
    
    Portal 2 appears to bind RGBA8888_UNORM textures to a sampler2DShadow,
    and calls shadow2D() on it.  This causes undefined behavior in OpenGL.
    
    Unfortunately, our sampler appears to hang in this scenario, which is
    not acceptable.  Just give them a null surface instead, which returns
    all zeroes.
    
    Fixes GPU hangs in Portal 2 on Kabylake.
    
    Huge thanks to Jason Ekstrand for noticing this crazy behavior while
    sifting through crash dumps.
    
    Cc: mesa-stable@lists.freedesktop.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104487
    Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

Thanks again for the bug report and testing those patches!
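
As a rough model of the approach the commit message describes (a null surface in place of the color texture, which reads back as all zeroes), here is a minimal sketch. It does not reproduce the actual i965 change: the Texture class, surface_to_bind() and sample() are hypothetical names made up for illustration, and None stands in for the null surface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Texel = Tuple[float, float, float, float]

# What a null surface reads back as, per the commit message: all zeroes.
ZERO_TEXEL: Texel = (0.0, 0.0, 0.0, 0.0)


@dataclass(frozen=True)
class Texture:
    """Hypothetical stand-in for a bound texture."""
    label: str
    is_color_format: bool
    texel: Texel  # single representative value, enough for the sketch


def surface_to_bind(texture: Texture, sampler_is_shadow: bool) -> Optional[Texture]:
    """Pick what to bind; None models the null surface."""
    if sampler_is_shadow and texture.is_color_format:
        # Shadow sampling on a color format: bind nothing real.
        return None
    return texture


def sample(surface: Optional[Texture]) -> Texel:
    """Read from the bound surface; a null surface yields zeroes."""
    if surface is None:
        return ZERO_TEXEL
    return surface.texel


if __name__ == "__main__":
    rgba = Texture("RGBA8 color texture", True, (0.2, 0.4, 0.6, 1.0))
    print(sample(surface_to_bind(rgba, sampler_is_shadow=True)))
```

The point of the design is that the undefined case gets a harmless, well-defined result instead of reaching the hardware path that hangs; valid pairings are left untouched.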

