Bug 93693

Summary: [BAT SKL BDW] missed interrupt in gem_storedw_loop/basic-render with *ERROR* Hangcheck timer elapsed...
Product: DRI
Reporter: Daniel Vetter <daniel>
Component: DRM/Intel
Assignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: highest
CC: intel-gfx-bugs, knikkane, mika.kuoppala
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
drm/i915: Force ordering on request submission and hangcheck (flags: none)

Description Daniel Vetter 2016-01-13 10:07:42 UTC
All bdw/skl machines have random gpu hangs when running gem_storedw_loop/basic-render.

Strangely, other engines are all fine, and this testcase only uses CS instructions (so it doesn't even load a full render workload).

This is kinda PO-exit criteria of fail, while bdw/skl are PV ready :(

Long-term history to make this clear can be found on the CI server under /archive/results/CI_IGT_test/igt@gem_storedw_loop@basic-render.html
Comment 1 Chris Wilson 2016-01-13 15:25:15 UTC
Wow, it's telling that the render ring is so slow! :)

I can run this in a loop until I get bored (>10minutes) on -nightly and haven't encountered an issue yet. I'd like to see the error state to see if there are any clues there.
Comment 2 Mika Kuoppala 2016-01-13 16:56:32 UTC
I suspect Daniel got confused by the error message. From what I can see,
gem_storedw_loop triggers the "hangcheck timer elapsed, render ring idle" errors.
Comment 3 Mika Kuoppala 2016-01-13 16:58:36 UTC
Created attachment 121002
drm/i915: Force ordering on request submission and hangcheck
Comment 4 Chris Wilson 2016-01-13 17:34:03 UTC
(In reply to Mika Kuoppala from comment #3)
> Created attachment 121002
> drm/i915: Force ordering on request submission and hangcheck

You can't move the list manipulation just like that! It's time we eliminated that list_empty() check, but this does nothing to paper over the race.
Comment 5 Daniel Vetter 2016-01-15 18:40:31 UTC
(In reply to Mika Kuoppala from comment #2)
> I suspect Daniel got confused by the error message. From what I can see,
> gem_storedw_loop triggers the "hangcheck timer elapsed, render ring idle"
> errors.

Yeah, I screwed up the title. It's "just" that the sw tracking got out of whack with reality; the gpu is actually perfectly fine. After all, the testcase does succeed (and it checks that all the CS dw stores did land).
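
To illustrate the distinction drawn here, the sketch below models how a hangcheck pass could tell a genuinely stuck engine apart from a missed interrupt. All names (`engine_snapshot`, `hangcheck_classify`, `seqno_passed`) are hypothetical and heavily simplified; this is not the i915 code. The assumption is that the driver compares the seqno the GPU has written back against the last seqno it submitted: if the GPU has caught up but someone is still asleep waiting, the work completed and only the wakeup was lost.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what the driver believes about one engine. */
struct engine_snapshot {
    uint32_t hw_seqno;       /* last seqno the GPU wrote back to memory */
    uint32_t prev_hw_seqno;  /* value sampled on the previous hangcheck tick */
    uint32_t last_submitted; /* last seqno the driver queued to the engine */
    bool has_waiters;        /* a thread is still sleeping on an interrupt */
};

enum hangcheck_verdict {
    ENGINE_ACTIVE,     /* seqno advanced since the last tick */
    ENGINE_IDLE,       /* all submitted work done, nobody waiting */
    ENGINE_MISSED_IRQ, /* all submitted work done, but a waiter never woke */
    ENGINE_STUCK       /* work outstanding and no progress */
};

/* Wrap-safe test for "a is at or after b" on 32-bit sequence numbers. */
static bool seqno_passed(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

static enum hangcheck_verdict
hangcheck_classify(const struct engine_snapshot *e)
{
    /* Any progress since the previous tick means the engine is alive. */
    if (e->hw_seqno != e->prev_hw_seqno)
        return ENGINE_ACTIVE;

    /* No progress, but nothing outstanding: the GPU is fine. If a
     * waiter is still asleep, software tracking lost an interrupt. */
    if (seqno_passed(e->hw_seqno, e->last_submitted))
        return e->has_waiters ? ENGINE_MISSED_IRQ : ENGINE_IDLE;

    /* No progress with work outstanding: candidate for a real hang. */
    return ENGINE_STUCK;
}
```

In this model the "render ring idle" case from the error message corresponds to `ENGINE_MISSED_IRQ`, which is why the testcase can still pass: every store landed, and only the notification went astray.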
Comment 6 Daniel Vetter 2016-01-20 13:44:40 UTC
Same bug most likely in gem_sync/basic-render.
Comment 7 Chris Wilson 2016-01-24 11:59:08 UTC
commit 7c17d377374ddbcfb7873366559fc4ed8b296e11
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 20 15:43:35 2016 +0200

    drm/i915: Use ordered seqno write interrupt generation on gen8+ execlists
Comment 8 Chris Wilson 2016-01-24 12:01:05 UTC
For the record, this only happens for me when I have an output connected - suggests some interesting hilarity with memory bw/latency.
