Bug 105290

Summary: [BSW/HD400] SynMark OglCSDof GPU hangs when shaders come from cache
Product: Mesa
Component: Drivers/DRI/i965
Status: VERIFIED DUPLICATE
Severity: normal
Priority: medium
Version: git
Hardware: Other
OS: All
Reporter: Eero Tamminen <eero.t.tamminen>
Assignee: Jordan Justen <jljusten>
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
CC: jljusten, kenneth
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=104636
Attachments: GPU hang error state

Description Eero Tamminen 2018-02-28 13:21:57 UTC
Setup:
- Ubuntu 16.04
- Latest drm-tip kernel or 4.15, and an X server from the same time frame
- Mesa git
- SynMark v7.0

Test-case:
- ./synmark2 OglCSDof

Between the following commits:
2018-02-20 18:43:42 UTC 4c4e6232ee: freedreno/ir3: fix use_count refcnt'ing issue
2018-02-21 18:53:38 UTC 81dd4a7637: radeonsi: enable uvd encode for HEVC main

CSDof has started to hang the GPU.  The hang happens only on BSW.

Of the changes during this period, enabling the shader cache looks like the most likely candidate.  I'm not going to bisect this, but I should have a free BSW machine available tomorrow and can check whether disabling the shader cache helps.
Comment 1 Eero Tamminen 2018-03-01 09:09:06 UTC
Verified that the hang is caused by the shader cache:
* Works fine with "MESA_GLSL_CACHE_DISABLE=true", and also when the cache is enabled but empty
* The GPU hangs when the shader cache is enabled and the shader is already in the cache
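
The three cases above can be run back to back with a small wrapper. This is only a sketch: run_case and SYNMARK are made-up names, and it assumes Mesa honours MESA_GLSL_CACHE_DIR, so that a fresh temporary directory gives a guaranteed-empty cache for the first cached run.

```shell
#!/bin/sh
# Sketch only: run_case and SYNMARK are hypothetical names, not part of
# Mesa or SynMark.
SYNMARK=${SYNMARK:-./synmark2}

# Assumption: pointing MESA_GLSL_CACHE_DIR at a private, initially empty
# directory makes the first cached run compile every shader and the
# second one load them from the cache.
MESA_GLSL_CACHE_DIR=$(mktemp -d)
export MESA_GLSL_CACHE_DIR

run_case() {
    # $1 = label, $2 = "disabled" or "enabled"
    echo "== $1 (shader cache $2)"
    if [ ! -x "$SYNMARK" ]; then
        echo "skipped: $SYNMARK not found"
        return 0
    fi
    if [ "$2" = "disabled" ]; then
        MESA_GLSL_CACHE_DISABLE=true "$SYNMARK" OglCSDof
    else
        # Subshell, so unsetting the variable does not leak to the caller.
        ( unset MESA_GLSL_CACHE_DISABLE; "$SYNMARK" OglCSDof )
    fi
}

run_case "cache disabled" disabled
run_case "cache enabled, empty cache" enabled
run_case "cache enabled, shaders in cache" enabled
```

Going by the results above, only the third run would be expected to hang on BSW.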
Comment 2 Jordan Justen 2018-03-03 02:24:28 UTC
After some debugging, I think this might be something about growing
the instruction cache between batches. The shader cache doesn't
upload the pre-compiled default programs at link time, so some
batches may run with a smaller instruction cache size. As more
programs are used, the instruction cache will grow, but something
is not working properly with this on BSW.

I found that if I disable the shader cache with
MESA_GLSL_CACHE_DISABLE=1 but also set shader_precompile=false,
then I also get the hang.

(Thanks Ken for suggesting shader_precompile=false.)
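
The combination above can be tried from a shell roughly as follows. A sketch, assuming the shader_precompile driconf option can be overridden by an environment variable of the same name; no_precompile and SYNMARK are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: no_precompile and SYNMARK are hypothetical names.
SYNMARK=${SYNMARK:-./synmark2}

no_precompile() {
    # Run "$@" with the shader cache disabled and precompile turned off.
    # Assumption: the driconf option is picked up from the environment.
    env MESA_GLSL_CACHE_DISABLE=1 shader_precompile=false "$@"
}

if [ -x "$SYNMARK" ]; then
    no_precompile "$SYNMARK" OglCSDof
else
    echo "skipped: $SYNMARK not found"
fi
```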
Comment 3 Kenneth Graunke 2018-03-06 02:49:17 UTC
I tried to reproduce this today, and failed - it works fine, no matter what I try.
Comment 4 Eero Tamminen 2018-03-06 08:49:45 UTC
Created attachment 137814 [details]
GPU hang error state
Comment 5 Eero Tamminen 2018-03-06 08:50:19 UTC
(In reply to Kenneth Graunke from comment #3)
> I tried to reproduce this today, and failed - it works fine, no matter what
> I try.

Was it with HD405?  I'm still seeing hangs on HD400 with yesterday evening's commit:
  2018-03-03 04:56:35 UTC 411aa8c322: vbo: Try to reuse the same VAO more often for successive dlists.

-> Maybe this is HD400 specific like bug 104636?  Jordan's comment above sounds like something that could be related to the program cache corruption you see in bug 104636.

PS. bug 101406 is also BSW specific, although not a hang.
Comment 6 Jordan Justen 2018-03-06 16:03:37 UTC
In #intel-3d, Ken mentioned that he suspected that we might be
getting the scratch space wrong on HD 400.

I doubled the scratch space allocated in brw_alloc_stage_scratch,
and I was no longer seeing a hang.
Comment 7 Eero Tamminen 2018-03-07 14:18:00 UTC
(In reply to Jordan Justen from comment #6)
> In #intel-3d, Ken mentioned that he suspected that we might be
> getting the scratch space wrong on HD 400.
> 
> I doubled the scratch space allocated in brw_alloc_stage_scratch,
> and I was no longer seeing a hang.

I can verify that the fix for bug 104636:
https://patchwork.freedesktop.org/patch/208502/

also fixes this hang. -> DUPLICATE?
Comment 8 Eero Tamminen 2018-03-12 16:43:18 UTC

*** This bug has been marked as a duplicate of bug 104636 ***
