Bug 105064

Summary: GPU Hang with single-channel RAM configuration
Product: DRI Reporter: Mark Janes <mark.a.janes>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: medium CC: baker.dylan.c, clayton.a.craft, intel-gfx-bugs, mark.a.janes, martin.peres, mika.kuoppala, volodymyr.los
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: ALL i915 features: GPU hang
Attachments:
Description Flags
card error state none

Description Mark Janes 2018-02-12 21:20:46 UTC
Mesa CI consistently reproduced GPU hangs on a subset of BDWGT3e machines after updating the kernel from 4.9 to 4.15.

On further investigation, we found that all failing machines had RAM in single-channel configuration:  4GB in each of slots A1, A2.  This is a valid (though uncommon) memory configuration.

We have not yet verified if other more common memory configurations (3 slots, 4 slots filled) reproduce the hang.
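A quick way to confirm which slots are populated on a test machine is to parse the output of `dmidecode -t memory`. The sketch below is illustrative only: the sample text is hand-written to mimic the A1/A2 layout described above (it is not output from the failing machines), the function names are made up, and inferring the channel from a `ChannelX` substring in the locator is an assumption that varies by board vendor.

```python
import re

# Hand-written sample in the style of `dmidecode -t memory` (assumption, not real data).
SAMPLE = """\
Handle 0x0040, DMI type 17, 40 bytes
Memory Device
    Size: 4096 MB
    Locator: ChannelA-DIMM0

Handle 0x0041, DMI type 17, 40 bytes
Memory Device
    Size: 4096 MB
    Locator: ChannelA-DIMM1

Handle 0x0042, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Locator: ChannelB-DIMM0

Handle 0x0043, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Locator: ChannelB-DIMM1
"""


def populated_dimms(dmidecode_text):
    """Return (locator, size) pairs for populated 'Memory Device' entries."""
    dimms = []
    for block in re.split(r"\n\s*\n", dmidecode_text):
        lines = [line.strip() for line in block.splitlines()]
        # Match the exact heading so other record types are ignored.
        if "Memory Device" not in lines:
            continue
        fields = {}
        for line in lines:
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        size = fields.get("Size", "")
        if size and not size.startswith("No Module"):
            dimms.append((fields.get("Locator", "?"), size))
    return dimms


def channels_in_use(dimms):
    """Guess the channel letters from locators such as 'ChannelA-DIMM0'."""
    found = set()
    for locator, _size in dimms:
        match = re.search(r"Channel\s*([A-Z])", locator)
        if match:
            found.add(match.group(1))
    return found


dimms = populated_dimms(SAMPLE)
layout = "single-channel" if len(channels_in_use(dimms)) == 1 else "multi-channel"
print(dimms, layout)
```

On a real machine the text would come from running `dmidecode -t memory` as root; locators that do not follow the `ChannelX` pattern would need a different mapping.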

So far, this is reproducible as far back as 4.13 by running piglit in a loop a few times.
Comment 1 Mark Janes 2018-02-12 21:32:50 UTC
Created attachment 137306 [details]
card error state
Comment 2 Chris Wilson 2018-02-12 21:47:44 UTC
It completed the CS interrupt following seqno 0x00bd22f7, but the context switch (following ELSP) did not restart on the new context. ~o~ I'm not sure if a bisect will reveal anything more than a timing change.

Initial guess is that this issue is related to the one fixed in

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write
Comment 3 Dylan Baker 2018-02-12 21:56:04 UTC
Chris, do you think you can write a test case or a patch I can try?
Comment 4 Chris Wilson 2018-02-12 22:02:09 UTC
From the error-state, I would say it just takes stress; you just need to hit the right timing between CS interrupt and ELSP submission. If that is the case, igt should be able to trigger it indirectly.

My assumption may well be off, but I would start with just running texture-gather-offset (or whichever piglit test got caught in that GPU hang) in a loop; probably with a background "find / -type f -exec cat {} \;"
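The suggestion above (one test in a loop, with background file reads to perturb timing) could be scripted roughly as follows. This is a minimal sketch, not a verified reproducer: `loop_test` and `read_files_forever` are hypothetical names, the test command is whatever the caller passes in, and a non-zero exit status is only a stand-in for "the GPU hung" (a real check would also inspect the kernel log or the error state).

```python
import os
import stat
import subprocess
import sys
import threading


def read_files_forever(root, stop):
    """Background I/O load: keep reading every regular file under root."""
    while not stop.is_set():
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if stop.is_set():
                    return
                path = os.path.join(dirpath, name)
                try:
                    # Skip FIFOs, devices and symlinks, like `find -type f`.
                    if not stat.S_ISREG(os.lstat(path).st_mode):
                        continue
                    with open(path, "rb") as handle:
                        while handle.read(1 << 20):
                            if stop.is_set():
                                return
                except OSError:
                    pass  # unreadable files are ignored
        stop.wait(0.1)  # avoid spinning if root is empty


def loop_test(cmd, iterations, io_root=None):
    """Run cmd up to `iterations` times.

    Returns the 1-based index of the first failing run, or None if all pass.
    """
    stop = threading.Event()
    worker = None
    if io_root is not None:
        worker = threading.Thread(
            target=read_files_forever, args=(io_root, stop), daemon=True
        )
        worker.start()
    try:
        for i in range(1, iterations + 1):
            if subprocess.run(cmd).returncode != 0:
                return i
        return None
    finally:
        stop.set()
        if worker is not None:
            worker.join(timeout=5)
```

A call might look like `loop_test(["/path/to/test-binary", "-auto"], 100, io_root="/usr")`, where the binary path and its arguments are placeholders for the actual piglit test.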
Comment 5 Mark Janes 2018-02-12 23:32:36 UTC
In my initial investigation, I ran the texture-gather tests in loops on all cores simultaneously, and couldn't make it fail.  Running the full piglit suite triggered the hang maybe 40% of the time.
Comment 6 Dylan Baker 2018-02-13 00:01:41 UTC
I can reproduce this about 50% of the time running a full piglit suite on SKL as well in a single-channel configuration, so I'm going to change this to all hardware.
Comment 7 Tomi Sarvela 2018-02-13 11:18:37 UTC
GFX CI fi-skl-6700k2 now has one DIMM in one channel, for 8GB of memory.

Change happened after CI_DRM_3760.
Comment 8 Mark Janes 2018-02-14 01:12:10 UTC
Tomi: so far, we have been reproducing this with 2 sticks of RAM in the same channel.

Martin: is it possible to bisect this with EZBench?
Comment 9 Martin Peres 2018-02-14 08:18:25 UTC
(In reply to Mark Janes from comment #8)
> Martin: is it possible to bisect this with EZBench?

It is possible, but it requires writing a custom bisecting job, which would anyway amount to what git bisect is doing, so there is no real point in using ezbench for that :s

I am absolutely swamped with the 3x4K bug and the deployment of cibuglog-ng, so not sure I can help you. If you have a clear reproducing case (run one test even 10 times), then no development is necessary and I can reproduce that.
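For reference, a repeat-N-times reproducer maps onto plain `git bisect run` through the exit status of the script: 0 marks the commit good, 125 skips it, and any other code from 1 to 127 marks it bad. The sketch below assumes hypothetical build and test commands supplied by the caller, and it ignores the fact that a kernel bisect also has to boot each candidate kernel before testing.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_step(test_cmd, repeats, build_cmd=None):
    """Return the exit code to report for the currently checked-out commit.

    The commit is skipped if it does not build, bad on the first failing
    run, and good only if every one of the `repeats` runs passes.
    """
    if build_cmd is not None and subprocess.run(build_cmd).returncode != 0:
        return SKIP
    for _ in range(repeats):
        if subprocess.run(test_cmd).returncode != 0:
            return BAD
    return GOOD
```

A wrapper script would end with something like `sys.exit(bisect_step(test_cmd, 10))` and be passed to `git bisect run`. Because the hang is intermittent, a "good" verdict is only as trustworthy as the number of repeats.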
Comment 10 vadym 2018-02-15 18:00:15 UTC
I tested this on two available platforms: Haswell and Kabylake.
For Haswell I did 3 full piglit runs and didn't reproduce any GPU hangs (4.15.3-041503-generic was used for testing).

But for Kabylake with the same 4.15.3 kernel I got the following error (100% reproducible):

fail: spec/glsl-1.20/execution/uniform-initializer/fs-mat2-array     
running: spec/glsl-1.20/execution/uniform-initializer/vs-mat3-set-by-api
running: spec/glsl-1.20/execution/uniform-initializer/fs-bool-from-const
Traceback (most recent call last):3, warn: 2, fail: 70, crash: 3 -||/   
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 205, in execute
    self.run()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 279, in run
    self._run_command()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 357, in _run_command
Traceback (most recent call last):
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 205, in execute
    self.run()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 279, in run
    raise e
    self._run_command()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 357, in _run_command
OSError: [Errno 12] Cannot allocate memory
    raise e
OSError: [Errno 12] Cannot allocate memory
fail: spec/glsl-1.20/execution/uniform-initializer/vs-mat3-set-by-api
fail: spec/glsl-1.20/execution/uniform-initializer/fs-bool-from-const
running: spec/glsl-1.20/execution/uniform-initializer/fs-float-set-by-other-stage
running: spec/glsl-1.20/execution/uniform-initializer/fs-mat4                    
Killed/54272] skip: 178, pass: 5853, warn: 2, fail: 72, crash: 3 -|-\|

I noticed that at this point no free memory is available on my laptop (~16 GB is allocated). I get this error with the swap area disabled. With the swap area enabled, Linux hangs completely and only a reset helps. With kernel 4.9 this issue is not reproducible.

Kabylake configs:

Platform: Lenovo YOGA 520
CPU: Intel® Core™ i7-8550U CPU @ 1.80GHz × 8 
GPU: Intel® UHD Graphics 620 (Kabylake GT2) 
RAM: 16GB
OS: Ubuntu 16.04 LTS 64-bit
Mesa: 18.1.0-devel (git-fa901768a4)
Kernel: 4.15.3-041503-generic
Piglit: git-4210d072f
Comment 11 Clayton Craft 2018-02-22 19:16:26 UTC
Also observed intermittent test failures for KHR-GLES31.core.tessellation_shader.tessellation_control_to_tessellation_evaluation.gl_tessLevel on HSW.

When compared to the HSW systems that do not show this intermittent failure, the main difference is that the failing systems have 1 DIMM installed (8GB in channel A bank 0) and the non-failing systems have 2 DIMMs installed (4GB each in channel A & B bank 0).

In this case, the failing systems are running kernel 4.9, and the non-failing systems are running either 4.9 or 4.15.
Comment 12 Vladimir Los 2018-03-14 09:37:53 UTC
Platform: HP Z220 SFF Workstation
SKU: ASJ45AV
CPU: i5-3470 @ 3.1 GHz x 4, stepping: 000306A9 00000019
RAM: DDR3 1600MHz (used several options in 2 channels and 4 dimms (1 or 2 memunits x 2Gb))
System BIOS: K51 v01.68
Firmware ver: 8.0.4.1441
OS:  Ubuntu 16.04(.4) LTS 64-bit 
Mesa: 17.2.8
Kernels used: 4.9.x, 4.13.x, 4.15.x
Piglit: git-b8e7cc0e59

No GPU hangs were reproduced and reported.
There were many combinations in different slots with different memory units.
Used the “all” option in the piglit runs.
Comment 13 Jani Saarinen 2018-03-29 07:10:59 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the investigation is still in progress, please ignore this, and I apologize!

If you think the bug is no longer valid, please comment that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), please do so as well to check whether the issue is still present there; if you cannot see it there, please comment on the bug.
Comment 14 Mark Janes 2018-04-06 13:53:11 UTC
yes, this bug is still valid.
Comment 15 Jani Saarinen 2018-04-25 11:20:17 UTC
Mika, Chris, any advice here?
Comment 16 Mika Kuoppala 2018-04-25 14:10:54 UTC
We could try to repro on fi-skl-6700k2
Comment 17 Jani Saarinen 2018-05-04 12:21:30 UTC
Mika, do you know if this was tried on that machine?
Comment 18 Chris Wilson 2018-05-08 10:18:35 UTC
Can you please apply https://patchwork.freedesktop.org/series/42867/ and see if that makes a difference?
Comment 19 Clayton Craft 2018-05-08 22:51:42 UTC
Chris, I had to build 4.16 in order for your patch to apply (it doesn't apply to 4.15, which was used originally to hit this issue).

But I cannot reproduce the issue on a vanilla 4.16 kernel *without* your patch. 10 full runs of piglit did *not* cause a gpu hang on a skl system running with single channel RAM config. If I roll back to 4.15, I can reproduce the gpu hang on the same system after 1-2 piglit runs.

Based on my testing, it seems like this issue is magically resolved by some change in 4.16 from 4.15 (where the issue is hit).
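As a rough sanity check on how strong "10 clean runs" is as evidence: if the hang still occurred at the 40% to 50% per-run rates reported earlier in this bug, and runs were independent (an assumption), the chance of seeing 10 clean runs in a row would be small. The function name below is made up for illustration.

```python
def chance_of_clean_streak(hang_rate, runs):
    """Probability that `runs` independent runs all pass at a given hang rate."""
    return (1.0 - hang_rate) ** runs


for rate in (0.4, 0.5):
    p = chance_of_clean_streak(rate, 10)
    print(f"hang rate {rate:.0%}: chance of 10 clean runs = {p:.2%}")
# hang rate 40%: chance of 10 clean runs = 0.60%
# hang rate 50%: chance of 10 clean runs = 0.10%
```

So under those assumptions, 10 clean runs on 4.16 would be very unlikely if the 4.15 behaviour were still present.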
Comment 20 Jani Saarinen 2018-05-09 05:26:19 UTC
Based on the comments, resolving. Please reopen if it occurs again.
Comment 21 Chris Wilson 2018-05-09 07:15:28 UTC
Likely fix:

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write
    
    The hardware needs some time to process the information received in the
    ExecList Submission Port, and expects us to not write anything more until
    it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or
    PREEMPTED CSB event.
    
    If we do not follow this, the driver could write new data into the ELSP
    before HW had finishing fetching the previous one, putting us in
    'undefined behaviour' space.
    
    This seems to be the problem causing the spurious PREEMPTED & COMPLETE
    events after a COMPLETE like the one below:
    
    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0:  Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0:  Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0:  Execlist CSB[2]: 0x00000018 _ 0x00000007  <<< COMPLETE
    [] vcs0:  Execlist CSB[3]: 0x00000012 _ 0x00000007  <<< PREEMPTED & COMPLETE
    [] vcs0:  Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0:  Execlist CSB[5]: 0x00000014 _ 0x00000006
    
    The ELSP writes that lead to this CSB sequence show that the HW hadn't
    started executing the previous execlist (the one with only ctx 0x6) by the
    time the new one was submitted; this is a bit more clear in the data
    show in the EXECLIST_STATUS register at the time of the ELSP write.
    
    [] vcs0: ELSP[0] = 0x0_0        [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308
    
    Note that having to wait for this ack does not disable lite-restores,
    although it may reduce their numbers.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which was asked to be tested in c3.
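To make the CSB dump in the commit message easier to read, the first word of each entry can be decoded into event names. The bit assignments below are as I recall them from the i915 `GEN8_CTX_STATUS_*` definitions and should be checked against the kernel source; the function name is hypothetical. They do agree with the annotations in the log (0x18 includes COMPLETE, 0x12 is PREEMPTED and COMPLETE).

```python
# Bit positions recalled from the i915 GEN8_CTX_STATUS_* defines (assumption).
CSB_STATUS_BITS = {
    0: "IDLE_ACTIVE",
    1: "PREEMPTED",
    2: "ELEMENT_SWITCH",
    3: "ACTIVE_IDLE",
    4: "COMPLETE",
    15: "LITE_RESTORE",
}


def decode_csb_status(status):
    """Return the event names set in a CSB status word, lowest bit first."""
    names = [
        name
        for bit, name in sorted(CSB_STATUS_BITS.items())
        if status & (1 << bit)
    ]
    known_mask = sum(1 << bit for bit in CSB_STATUS_BITS)
    unknown = status & ~known_mask
    if unknown:
        names.append(f"UNKNOWN(0x{unknown:x})")
    return names


# Status words taken from the vcs0 dump in the commit message.
for word in (0x18, 0x01, 0x18, 0x12, 0x8002, 0x14):
    print(f"0x{word:08x}: {' | '.join(decode_csb_status(word))}")
```

Read this way, the dump shows a PREEMPTED and COMPLETE event (CSB[3]) arriving right after a COMPLETE (CSB[2]), which is the spurious sequence the commit describes.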
Comment 22 Mark Janes 2018-05-09 13:13:22 UTC
Well, that explains why it wouldn't reproduce yesterday on 4.16, when we went to test Chris's patch.