Summary: | System hang in GfxBench CarChase | ||
---|---|---|---|
Product: | DRI | Reporter: | Eero Tamminen <eero.t.tamminen> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED MOVED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | ||
Priority: | medium | CC: | arkadiusz.hiler, intel-gfx-bugs, tomi.p.sarvela |
Version: | DRI git | Keywords: | regression |
Hardware: | Other | ||
OS: | All | ||
See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=111936 | ||
Whiteboard: | Triaged, ReadyForDev | ||
i915 platform: | BDW, BXT, SKL | i915 features: | GPU hang |
Description
Eero Tamminen
2019-08-21 10:31:07 UTC
With slightly newer kernel:
* 2019-08-20 16:56:31 889140938c: drm-tip: 2019y-08m-20d-16h-55m-47s UTC integration manifest

There was a system hang also on BDW GT2 in SynMark TerrainFlyInst with same libdrm/Mesa/Xorg stack, which didn't happen with newer libdrm/Mesa/Xorg. If it repeats like that tomorrow along with SKL GT2 hang, it could be related.

c3ddf4084e is not known amongst the gfx-ci tags.

I guess commit 889140938c from yesterday (i.e. day later) can be found?

(In reply to Eero Tamminen from comment #3)
> I guess commit 889140938c from yesterday (i.e. day later) can be found?

Yes, just trying to piece together when the rc5 merge was.

e64af2bda2: intel/CI_DRM_6732
889140938c: intel/CI_DRM_6749

233 files changed, 2848 insertions(+), 1225 deletions(-)

includes some mm/ changes, dma-mapping changes, as well as

Chris Wilson (15):
      drm/i915: Always wrap the ring offset before resetting
      drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe
      drm/i915: Only emit the 'send bug report' once for a GPU hang
      drm/i915: Serialize against vma moves
      drm/i915: i915_active.retire() is optional
      dma-buf: Introduce selftesting framework
      dma-buf: Add selftests for dma-fence
      drm/i915: Select DMABUF_SELFTESTS for the default i915.ko debug build
      drm/i915: Use 0 for the unordered context
      drm/i915: Assume exclusive access to objects inside resume
      dma-buf: Use %zu for printing sizeof
      dmabuf: Mark up onstack timer for selftests
      drm/i915: Serialize insertion into the file->mm.request_list
      drm/i915: Be defensive when starting vma activity
      drm/i915/gtt: Relax pd_used assertion

which don't on the surface seem alarming.

"dmesg -w" over ssh doesn't show anything before/when the system hangs.

Triggering the issue requires multiple runs of the test-case, so it's possible that it's happened also earlier than e64af2bda2 drm-tip. However, I checked now also build of the first commit (e64af2bda2) with 20x runs, and wasn't able to trigger the hang, so that seems safe bisect starting point.
CPU info:
model name : Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
stepping : 3
microcode : 0xc2
...
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs

BIOS is AMI version 3805.

(In reply to Eero Tamminen from comment #0)
> This started to happen between following drm-tip commits:
> * 2019-08-18 12:57:21 e64af2bda2: drm-tip: 2019y-08m-18d-12h-56m-37s UTC
> integration manifest -> rc4
> * 2019-08-19 16:56:48 c3ddf4084e: drm-tip: 2019y-08m-19d-16h-56m-05s UTC
> integration manifest -> rc5

As the range spans an upstream merge, the hang could also be due to non-i915 changes (or a bad rebase).

(In reply to Eero Tamminen from comment #5)
> Triggering the issue requires multiple runs of the test-case, so it's
> possible that it's happened also earlier than e64af2bda2 drm-tip. However,
> I checked now also build of the first commit (e64af2bda2) with 20x runs, and
> wasn't able to trigger the hang, so that seems safe bisect starting point.

Although those 20x runs didn't hang or show anything in dmesg, I got a docker error after them:

docker: Error response from daemon: failed to create endpoint clever_curie on network bridge: failed to add the host (vethea361d4) <=> sandbox (vethe91071f) pair interfaces: operation not supported.

The same command works fine after boot with the same kernel, before the CarChase runs. No idea how running GfxBench CarChase 3D benchmark could trigger something like that...
Not in the 18-19th range, but a tiny bit earlier,

commit df403069029dc61e0fc09cbeb0b5900705edec5b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Aug 16 18:16:08 2019 +0100

    drm/i915/execlists: Lift process_csb() out of the irq-off spinlock

might have triggered a GPF inside an irq-off spinlock (so dead machine more or less), now hopefully fixed in

commit a20ab592d1a87218229109d109b8e2feae6f598d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Aug 21 15:23:36 2019 +0100

    drm/i915/execlists: Set priority hint prior to submission

With any luck tomorrow, it'll be stable again...

Got the SKL CarChase system hang again, this time with whole stack being from yesterday:
* kernel: 2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC integration manifest
* Mesa: 2019-08-21 15-40-56 fc69a5cf7: pan/decode: Cleanup mali_attr printing
* X server: 2019-08-20 18-06-52 95dcc81cb1: glx: Fix previous context validation in xorgGlxMakeCurrent

Is your commit before or after commit 3790794007?

(BDW had been offline, but I'll see later on whether that still hangs on something too.)

Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to ping Tomi and the gang on whether the infra is losing track of commits.

commit a20ab592d1a87218229109d109b8e2feae6f598d
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Wed Aug 21 15:23:36 2019 +0100
Commit: Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Wed Aug 21 17:32:27 2019 +0100

    drm/i915/execlists: Set priority hint prior to submission

vs

* kernel: 2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC integration manifest

So that should be just after my push, but it's close.

You don't like easy answers!

(In reply to Chris Wilson from comment #10)
> Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to
> ping Tomi and the gang on whether the infra is losing track of commits.
>
> commit a20ab592d1a87218229109d109b8e2feae6f598d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> AuthorDate: Wed Aug 21 15:23:36 2019 +0100
> Commit: Chris Wilson <chris@chris-wilson.co.uk>
> CommitDate: Wed Aug 21 17:32:27 2019 +0100
>
> drm/i915/execlists: Set priority hint prior to submission
>
> vs
>
> * kernel:
> 2019-08-21 16-44-48 3790794007: drm-tip: 2019y-08m-21d-16h-43m-51s UTC
> integration manifest
>
> So that should be just after my push, but it's close.
>
> You don't like easy answers!

Adding Tomi and Arek here.

(In reply to Lakshmi from comment #11)
> (In reply to Chris Wilson from comment #10)
> > Huh, 3790794007 isn't found after pulling from gfx-ci. I guess we need to
> > ping Tomi and the gang on whether the infra is losing track of commits.
>
> Adding Tomi and Arek here.

gfx-ci is keeping track of all the tags that were *used by CI*, not of everything that was ever pushed to drm-tip.

If you have multiple force pushes when CI is busy with something, the intermediate commits won't get noticed.

Seems like 3790794007 was one of those revisions that came in-between and did not get a tag:

$ git ls-remote ci-tags | grep 3790794007
<NULL>

(In reply to Eero Tamminen from comment #1)
> With slightly newer kernel:
> * 2019-08-20 16:56:31 889140938c: drm-tip: 2019y-08m-20d-16h-55m-47s UTC
> integration manifest
>
> There was a system hang also on BDW GT2 in SynMark TerrainFlyInst with same
> libdrm/Mesa/Xorg stack, which didn't happen with newer libdrm/Mesa/Xorg. If
> it repeats like that tomorrow along with SKL GT2 hang, it could be related.

Saw a potential system hang in same test on SKL GT4e few days ago, but with older drm-tip v5.2 kernel, so those are unlikely to be related to this bug.
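For the "is your commit before or after" question raised earlier in the thread, one possible way to check it locally is `git merge-base --is-ancestor`, which reports whether one commit is reachable from another. The sketch below is only an illustration, not something used in this bug: the function name `commit_included` is made up here, and it assumes it is run inside a checkout where both commits have been fetched.

```shell
#!/bin/sh
# Sketch: report whether commit $1 (e.g. a fix) is contained in the history
# of commit $2 (e.g. a drm-tip integration commit).
# "commit_included" is a hypothetical helper name, not part of git.
commit_included() {
    status=0
    # Exit status: 0 = ancestor, 1 = not an ancestor, anything else = error
    # (for example, one of the commits is not present in this repository).
    git merge-base --is-ancestor "$1" "$2" 2>/dev/null || status=$?
    case $status in
        0) echo "included" ;;
        1) echo "not included" ;;
        *) echo "unknown" ;;
    esac
}
```

Called as `commit_included a20ab592d1a8 3790794007`, this would print "unknown" whenever the integration commit was never fetched, which matches the situation described above where the revision did not get a CI tag.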
(In reply to Chris Wilson from comment #8)
> now hopefully fixed in
>
> commit a20ab592d1a87218229109d109b8e2feae6f598d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Wed Aug 21 15:23:36 2019 +0100
>
> drm/i915/execlists: Set priority hint prior to submission

I haven't seen these system hangs since 21st, but it was hard to reproduce. I'll wait until next week, and if there still aren't any hangs, run large number of rounds to verify.

BXT J4205 seems to have hanged in this test too (device disconnected during testing before being automatically rebooted), with latest drm-tip kernel and (nearly) 2 month old Mesa & Xserver git versions.

(In reply to Eero Tamminen from comment #14)
> BXT J4205 seems to have hanged in this test too (device disconnected during
> testing before being automatically rebooted), with latest drm-tip kernel and
> (nearly) 2 month old Mesa & Xserver git versions.

Note: I wasn't able to reproduce this manually within 20 onscreen & 20 offscreen test repeats.

Got the hang again during nightly, on SKL i5-6600K, with latest drm-tip kernel and (nearly) 2 month old Mesa & Xserver git versions.

(In reply to Eero Tamminen from comment #14)
> BXT J4205 seems to have hanged in this test too (device disconnected during
> testing before being automatically rebooted), with latest drm-tip kernel and
> (nearly) 2 month old Mesa & Xserver git versions.

During weekend also with drm-tip v5.2 kernel and latest Mesa.

(In reply to Eero Tamminen from comment #16)
> Got the hang again during nightly, on SKL i5-6600K, with latest drm-tip
> kernel and (nearly) 2 month old Mesa & Xserver git versions.

Ditto.

Got this now also on BDW GT2 with latest gfx stack / Iris.
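Since the hang only shows up after several repeats and leaves nothing in dmesg, a repeat loop that records each round before starting it may help tell which round was running when the machine died. This is a minimal sketch under assumptions: `repeat_run` is a hypothetical helper, the log file name is arbitrary, and `sync` is only a best-effort attempt to get the log onto disk before a hard hang.

```shell
#!/bin/sh
# Sketch: run a command N times, noting every round in a log file first.
# Usage: repeat_run ROUNDS LOGFILE COMMAND [ARGS...]
# "repeat_run" is a hypothetical helper name used for illustration.
repeat_run() {
    rounds=$1
    log=$2
    shift 2
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Write the round marker and flush it before starting the run, so
        # the last marker in the log identifies the round that hung.
        echo "round $i/$rounds: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$log"
        sync
        "$@" || echo "round $i/$rounds: exit status $?" >> "$log"
        i=$((i + 1))
    done
    echo "completed $rounds rounds" >> "$log"
}
```

For example, `repeat_run 20 carchase.log bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4` would repeat the CarChase invocation quoted in this thread 20 times; a log without the final "completed" line suggests the run was cut short.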
(In reply to Eero Tamminen from comment #15)
> (In reply to Eero Tamminen from comment #14)
> > BXT J4205 seems to have hanged in this test too (device disconnected during
> > testing before being automatically rebooted), with latest drm-tip kernel and
> > (nearly) 2 month old Mesa & Xserver git versions.
>
> Note: I wasn't able to reproduce this manually within 20 onscreen & 20
> offscreen test repeats.

I'm still able to reproduce the system hang on SKL GT2 (i5-6600K), just by rebooting with latest git versions of gfx stack, and running the hanging test a few times:

bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4

(In reply to Eero Tamminen from comment #18)
> I'm still able to reproduce the system hang on SKL GT2 (i5-6600K), just by
> rebooting with latest git versions of gfx stack, and running the hanging
> test few times:
> bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080
> --fullscreen 1 --test_id gl_4

While this system hang is rare in automated testing runs, currently it seems to happen more often with BXT (J4205 with latest drm-tip kernel + 3D stack from git) than with SKL.

CarChase test system disconnect / system hang on BXT with last evening drm-tip, on SKL GT2 with previous evening drm-tip.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/379.