Bug 111453

Summary: System hang in GfxBench CarChase
Product: DRI
Reporter: Eero Tamminen <eero.t.tamminen>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED MOVED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: medium
CC: arkadiusz.hiler, intel-gfx-bugs, tomi.p.sarvela
Version: DRI git
Keywords: regression
Hardware: Other
OS: All
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111936
Whiteboard: Triaged, ReadyForDev
i915 platform: BDW, BXT, SKL
i915 features: GPU hang

Description Eero Tamminen 2019-08-21 10:31:07 UTC
Setup:
- SKL i5 6600K
- Ubuntu 18.04 + updates
- Unity desktop
- drm-tip git kernel (889140938c)
- Rest of the gfx stack built from July 8th git commits:
  - libdrm: 331e51e32f
  - Mesa: 9c7adaeb5f
  - X server: fabc421962
- GfxBench v5 (I assume v4 would work as well)
- FullHD monitor

Test-case:
- Run GfxBench CarChase onscreen test in fullscreen:
  testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4
- Repeat 10 times
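
The repeat step can be scripted. A minimal sketch, assuming a POSIX shell; `repeat_test` is a hypothetical helper name, not part of GfxBench. It prints the run number before each run so that, after a hard hang, the last line seen over ssh shows which run froze:

```shell
# Run a command COUNT times, stopping at the first non-zero exit.
# Usage: repeat_test COUNT command [args...]
repeat_test() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        # Print progress first so the last visible line identifies the
        # run that was in flight if the machine hangs.
        echo "run $i/$count"
        "$@" || { echo "run $i failed (exit $?)"; return 1; }
        i=$((i + 1))
    done
    echo "completed $count runs"
}

# Example (the test-case above, 10 repeats):
# repeat_test 10 testfw_app --gfx glfw --gl_api desktop_core \
#     --width 1920 --height 1080 --fullscreen 1 --test_id gl_4
```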

Expected outcome:
- No problems, as is the case with SKL GT4e, KBL GT3e and BXT

Actual outcome:
- Whole system hangs with CarChase rendering frozen on screen

This started to happen between following drm-tip commits:
* 2019-08-18 12:57:21 e64af2bda2: drm-tip: 2019y-08m-18d-12h-56m-37s UTC integration manifest
* 2019-08-19 16:56:48 c3ddf4084e: drm-tip: 2019y-08m-19d-16h-56m-05s UTC integration manifest

For some reason it doesn't happen (at least within 20x repeats) when using this week's libdrm / Mesa / X server, only with the listed earlier versions of them.

Regardless of user space versions, the kernel shouldn't allow the whole system to hang (network, console, etc.).
Comment 1 Eero Tamminen 2019-08-21 10:36:36 UTC
With slightly newer kernel:
* 2019-08-20 16:56:31 889140938c: drm-tip: 2019y-08m-20d-16h-55m-47s UTC integration manifest

There was a system hang also on BDW GT2 in SynMark TerrainFlyInst with same libdrm/Mesa/Xorg stack, which didn't happen with newer libdrm/Mesa/Xorg.  If it repeats like that tomorrow along with SKL GT2 hang, it could be related.
Comment 2 Chris Wilson 2019-08-21 10:43:45 UTC
c3ddf4084e is not known amongst the gfx-ci tags.
Comment 3 Eero Tamminen 2019-08-21 11:18:06 UTC
I guess commit 889140938c from yesterday (i.e. day later) can be found?
Comment 4 Chris Wilson 2019-08-21 11:27:48 UTC
(In reply to Eero Tamminen from comment #3)
> I guess commit 889140938c from yesterday (i.e. day later) can be found?

Yes, just trying to piece together when the rc5 merge was.

e64af2bda2: intel/CI_DRM_6732
889140938c: intel/CI_DRM_6749

 233 files changed, 2848 insertions(+), 1225 deletions(-)

includes some mm/ changes, dma-mapping changes, as well as

Chris Wilson (15):
      drm/i915: Always wrap the ring offset before resetting
      drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe
      drm/i915: Only emit the 'send bug report' once for a GPU hang
      drm/i915: Serialize against vma moves
      drm/i915: i915_active.retire() is optional
      dma-buf: Introduce selftesting framework
      dma-buf: Add selftests for dma-fence
      drm/i915: Select DMABUF_SELFTESTS for the default i915.ko debug build
      drm/i915: Use 0 for the unordered context
      drm/i915: Assume exclusive access to objects inside resume
      dma-buf: Use %zu for printing sizeof
      dmabuf: Mark up onstack timer for selftests
      drm/i915: Serialize insertion into the file->mm.request_list
      drm/i915: Be defensive when starting vma activity
      drm/i915/gtt: Relax pd_used assertion

which don't on the surface seem alarming.
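
A range summary like the one above can be regenerated from the two CI tags. This is a sketch using standard git commands; which exact invocation produced the numbers in this comment is not stated, so treat it as one plausible way. `range_summary` is a hypothetical helper name:

```shell
# Summarise what changed between two commits or tags.
# Usage: range_summary GOOD BAD
range_summary() {
    # One-line "N files changed, X insertions(+), Y deletions(-)" total.
    git diff --shortstat "$1" "$2"
    # Non-merge commits in the range, grouped by author.
    git shortlog --no-merges "$1..$2"
}

# Example, using the tags listed above:
# range_summary intel/CI_DRM_6732 intel/CI_DRM_6749
```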
Comment 5 Eero Tamminen 2019-08-21 12:00:42 UTC
"dmesg -w" over ssh doesn't show anything before/when the system hangs.

Triggering the issue requires multiple runs of the test-case, so it's possible that it's happened also earlier than e64af2bda2 drm-tip.  However, I checked now also build of the first commit (e64af2bda2) with 20x runs, and wasn't able to trigger the hang, so that seems safe bisect starting point.
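
With a good and a bad commit identified, the range can be narrowed with git bisect. A sketch only: `start_bisect` is a hypothetical wrapper, and because the failure is an intermittent whole-system hang, each step has to be judged by hand after building and booting the candidate kernel and repeating the test. drm-tip is an integration tree, so bisecting across its merges may need extra care:

```shell
# Begin a bisect between a known-good and a known-bad commit.
# Usage: start_bisect GOOD BAD
start_bisect() {
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# Example: start_bisect e64af2bda2 889140938c
#
# Then, for each commit git checks out, build and boot it, run the
# test repeatedly, and record the verdict:
#   git bisect good    # survived the repeated runs
#   git bisect bad     # system hung
#   git bisect skip    # kernel did not build or boot
```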

CPU info:
model name	: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
stepping	: 3
microcode	: 0xc2
...
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs

BIOS is AMI version 3805.
Comment 6 Eero Tamminen 2019-08-21 12:04:12 UTC
(In reply to Eero Tamminen from comment #0)
> This started to happen between following drm-tip commits:
> * 2019-08-18 12:57:21 e64af2bda2: drm-tip: 2019y-08m-18d-12h-56m-37s UTC
> integration manifest

-> rc4

> * 2019-08-19 16:56:48 c3ddf4084e: drm-tip: 2019y-08m-19d-16h-56m-05s UTC
> integration manifest

-> rc5

As it's over an upstream merge, the hang could also be due to non-i915 changes (or a bad rebase).
Comment 7 Eero Tamminen 2019-08-21 12:18:56 UTC
(In reply to Eero Tamminen from comment #5)
> Triggering the issue requires multiple runs of the test-case, so it's
> possible that it's happened also earlier than e64af2bda2 drm-tip.  However,
> I checked now also build of the first commit (e64af2bda2) with 20x runs, and
> wasn't able to trigger the hang, so that seems safe bisect starting point.

Although those 20x runs didn't hang or show anything in dmesg, I got a docker error after them:
docker: Error response from daemon: failed to create endpoint clever_curie on network bridge: failed to add the host (vethea361d4) <=> sandbox (vethe91071f) pair interfaces: operation not supported.

Although same command works fine after boot with same kernel, before the CarChase runs.  No idea how running GfxBench CarChase 3D benchmark could trigger something like that...
Comment 8 Chris Wilson 2019-08-21 17:59:16 UTC
Not in the 18-19th range, but a tiny bit earlier,

commit df403069029dc61e0fc09cbeb0b5900705edec5b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 16 18:16:08 2019 +0100

    drm/i915/execlists: Lift process_csb() out of the irq-off spinlock

might have triggered a GPF inside an irq-off spinlock (so dead machine more or less), now hopefully fixed in

commit a20ab592d1a87218229109d109b8e2feae6f598d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 21 15:23:36 2019 +0100

    drm/i915/execlists: Set priority hint prior to submission


With any luck tomorrow, it'll be stable again...
Comment 9 Eero Tamminen 2019-08-22 09:37:52 UTC
Got the SKL CarChase system hang again, this time with whole stack being from yesterday:
* kernel:
  2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC integration manifest
* Mesa:
  2019-08-21 15-40-56  fc69a5cf7: pan/decode: Cleanup mali_attr printing
* X server:
  2019-08-20 18-06-52 95dcc81cb1: glx: Fix previous context validation in xorgGlxMakeCurrent

Is your commit before or after commit 3790794007?

(BDW had been offline, but I'll see later on whether that still hangs on something too.)
Comment 10 Chris Wilson 2019-08-22 09:44:05 UTC
Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to ping Tomi and the gang on whether the infra is losing track of commits.

commit a20ab592d1a87218229109d109b8e2feae6f598d
Author:     Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Wed Aug 21 15:23:36 2019 +0100
Commit:     Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Wed Aug 21 17:32:27 2019 +0100

    drm/i915/execlists: Set priority hint prior to submission

vs

* kernel:
  2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC integration manifest

So that should be just after my push, but it's close.

You don't like easy answers!
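
The comparison above mixes a +0100 commit timestamp with a UTC manifest timestamp, so it helps to normalise both first. A small sketch using awk for the arithmetic; `utc_seconds` is a hypothetical helper and only handles times on the same calendar day:

```shell
# Print seconds since midnight UTC for a local time and its UTC offset.
# Usage: utc_seconds HH:MM:SS +HHMM|-HHMM
utc_seconds() {
    LC_ALL=C awk -v t="$1" -v off="$2" 'BEGIN {
        split(t, hms, ":")
        sign = (substr(off, 1, 1) == "-") ? -1 : 1
        offset = sign * (substr(off, 2, 2) * 3600 + substr(off, 4, 2) * 60)
        # Local time minus the offset gives UTC.
        print hms[1] * 3600 + hms[2] * 60 + hms[3] - offset
    }'
}

commit=$(utc_seconds 17:32:27 +0100)    # CommitDate of a20ab592d1a8
manifest=$(utc_seconds 16:43:51 +0000)  # from the drm-tip manifest name
echo "manifest is $((manifest - commit)) seconds after the commit date"
```

This gives 684 seconds (11 min 24 s), so the manifest was generated after the commit date, although commit date and push time are not necessarily the same.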
Comment 11 Lakshmi 2019-08-22 10:25:47 UTC
(In reply to Chris Wilson from comment #10)
> Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to
> ping Tomi and the gang on whether the infra is losing track of commits.
> 
> commit a20ab592d1a87218229109d109b8e2feae6f598d
> Author:     Chris Wilson <chris@chris-wilson.co.uk>
> AuthorDate: Wed Aug 21 15:23:36 2019 +0100
> Commit:     Chris Wilson <chris@chris-wilson.co.uk>
> CommitDate: Wed Aug 21 17:32:27 2019 +0100
> 
>     drm/i915/execlists: Set priority hint prior to submission
> 
> vs
> 
> * kernel:
>   2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC
> integration manifest
> 
> So that should be just after my push, but it's close.
> 
> You don't like easy answers!

Adding Tomi and Arek here.
Comment 12 Arek Hiler 2019-08-22 11:07:22 UTC
(In reply to Lakshmi from comment #11)
> (In reply to Chris Wilson from comment #10)
> > Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to
> > ping Tomi and the gang on whether the infra is losing track of commits.
> 
> Adding Tomi and Arek here.

gfx-ci is keeping track of all the tags that were *used by CI*, not of everything that was ever pushed to drm-tip.

If you have multiple force pushes when CI is busy with something, the intermediate commits won't get noticed.

Seems like 3790794007 was one of those revisions that came in-between and did not get a tag:

$ git ls-remote ci-tags | grep 3790794007
<NULL>
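
The same check can be wrapped so that it gives an explicit answer instead of empty output. A sketch, assuming a remote named `ci-tags` as in the command above; `has_ci_tag` is a hypothetical helper. `git ls-remote` prints one `<hash><TAB><ref>` line per ref, so the abbreviated hash is matched at the start of the line:

```shell
# Succeed if any line printed by "git ls-remote REMOTE" starts with PREFIX,
# i.e. some ref on the remote points at an object with that hash prefix.
# Usage: has_ci_tag REMOTE PREFIX
has_ci_tag() {
    git ls-remote "$1" | grep -q "^$2"
}

# Example:
# if has_ci_tag ci-tags 3790794007; then echo "tagged"; else echo "no tag"; fi
```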
Comment 13 Eero Tamminen 2019-08-28 08:21:07 UTC
(In reply to Eero Tamminen from comment #1)
> With slightly newer kernel:
> * 2019-08-20 16:56:31 889140938c: drm-tip: 2019y-08m-20d-16h-55m-47s UTC
> integration manifest
> 
> There was a system hang also on BDW GT2 in SynMark TerrainFlyInst with same
> libdrm/Mesa/Xorg stack, which didn't happen with newer libdrm/Mesa/Xorg.  If
> it repeats like that tomorrow along with SKL GT2 hang, it could be related.

Saw a potential system hang in same test on SKL GT4e few days ago, but with older drm-tip v5.2 kernel, so those are unlikely to be related to this bug.


(In reply to Chris Wilson from comment #8)
> now hopefully fixed in
> 
> commit a20ab592d1a87218229109d109b8e2feae6f598d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Aug 21 15:23:36 2019 +0100
> 
>     drm/i915/execlists: Set priority hint prior to submission

I haven't seen these system hangs since 21st, but it was hard to reproduce.  I'll wait until next week, and if there still aren't any hangs, run large number of rounds to verify.
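
How many rounds count as a large number depends on how often the hang triggers. As a rough sketch: if each run hangs independently with probability p, the chance of n clean runs on a still-broken kernel is (1 - p)^n, so n can be chosen to push that below a threshold. The value of p below is purely hypothetical, since the thread gives no measured hang rate, and `runs_needed` is an illustrative helper:

```shell
# Print the number of hang-free runs n needed for (1 - p)^n <= alpha.
# Usage: runs_needed P ALPHA   (both strictly between 0 and 1)
runs_needed() {
    LC_ALL=C awk -v p="$1" -v alpha="$2" 'BEGIN {
        n = log(alpha) / log(1 - p)
        r = int(n)
        if (r < n) r++      # round up to a whole number of runs
        print r
    }'
}

# Hypothetical: a 1-in-10 hang rate and a 5% chance of a false "fixed".
runs_needed 0.1 0.05
```

With these assumed numbers the answer is 29 runs, so 20 clean runs alone would still leave roughly a 12% chance of missing the bug.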
Comment 14 Eero Tamminen 2019-08-30 09:51:19 UTC
BXT J4205 seems to have hanged in this test too (device disconnected during testing before being automatically rebooted), with latest drm-tip kernel and (nearly) 2 month old Mesa & Xserver git versions.
Comment 15 Eero Tamminen 2019-08-30 10:50:37 UTC
(In reply to Eero Tamminen from comment #14)
> BXT J4205 seems to have hanged in this test too (device disconnected during
> testing before being automatically rebooted), with latest drm-tip kernel and
> (nearly) 2 month old Mesa & Xserver git versions.

Note: I wasn't able to reproduce this manually within 20 onscreen & 20 offscreen test repeats.
Comment 16 Eero Tamminen 2019-09-10 11:27:38 UTC
Got the hang again during nightly, on SKL i5-6600K, with latest drm-tip kernel and (nearly) 2 month old Mesa & Xserver git versions.
Comment 17 Eero Tamminen 2019-09-16 08:05:10 UTC
(In reply to Eero Tamminen from comment #14)
> BXT J4205 seems to have hanged in this test too (device disconnected during
> testing before being automatically rebooted), with latest drm-tip kernel and
> (nearly) 2 month old Mesa & Xserver git versions.

During weekend also with drm-tip v5.2 kernel and latest Mesa.
Comment 18 Eero Tamminen 2019-09-20 09:26:06 UTC
(In reply to Eero Tamminen from comment #16)
> Got the hang again during nightly, on SKL i5-6600K, with latest drm-tip
> kernel and (nearly) 2 month old Mesa & Xserver git versions.

Ditto. Got this now also on BDW GT2 with latest gfx stack / Iris.


(In reply to Eero Tamminen from comment #15)
> (In reply to Eero Tamminen from comment #14)
> > BXT J4205 seems to have hanged in this test too (device disconnected during
> > testing before being automatically rebooted), with latest drm-tip kernel and
> > (nearly) 2 month old Mesa & Xserver git versions.
> 
> Note: I wasn't able to reproduce this manually within 20 onscreen & 20
> offscreen test repeats.

I'm still able to reproduce the system hang on SKL GT2 (i5-6600K), just by rebooting with latest git versions of gfx stack, and running the hanging test few times:
bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4
Comment 19 Eero Tamminen 2019-09-25 10:29:27 UTC
(In reply to Eero Tamminen from comment #18)
> I'm still able to reproduce the system hang on SKL GT2 (i5-6600K), just by
> rebooting with latest git versions of gfx stack, and running the hanging
> test few times:
> bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080
> --fullscreen 1 --test_id gl_4

While this system hang is rare in automated testing runs, currently it seems to happen more often with BXT (J4205 with latest drm-tip kernel + 3D stack from git) than with SKL.
Comment 20 Eero Tamminen 2019-10-10 17:06:15 UTC
CarChase test system disconnect / system hang on BXT with last evening drm-tip, on SKL GT2 with previous evening drm-tip.
Comment 21 Martin Peres 2019-11-29 19:24:26 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/379.
