98748 – [KBL] reset request times out (GPU reset failure)


Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2016-11-16 16:23 UTC by Eero Tamminen
Modified:	2017-07-11 16:02 UTC
CC List:	2 users

See Also:
i915 platform:	KBL
i915 features:	GPU hang

Attachments
Dmesg (97.08 KB, text/plain) 2016-11-16 16:23 UTC, Eero Tamminen
Last error state from build where there were no repeated resets (24.63 KB, text/plain) 2016-11-18 10:43 UTC, Eero Tamminen
Last error state from build with the repeated resets (23.50 KB, text/plain) 2016-11-18 10:45 UTC, Eero Tamminen
Last error state from build with the repeated resets, using newer mesa git (26.72 KB, text/plain) 2016-11-18 10:48 UTC, Eero Tamminen

Description Eero Tamminen 2016-11-16 16:23:47 UTC

Created attachment 128014
Dmesg

Test setup:
- KBL-U QL9J (haven't seen this yet on other platforms)
- Fairly up to date Ubuntu 16.04 with DRI3 & Unity desktop
- Kernel and rest of the 3D stack are recent (at most a few weeks old)
  kernel: git://anongit.freedesktop.org/drm-intel at 04145fe15cf8c81c221e62fc9d65d93053f9bd1a (2016-11-15_14-49-57)

Test-case:
- Boot
- Run Unigine, GLBenchmark 2.7, GfxBench 4.0, SynMark 7.0 benchmarks several times

Expected outcome:
- Everything works fine

Actual outcome:
- After SynMark CSDof (spilling compute shader test), the rest of the tests fail with:
    intel_do_flush_locked failed: Input/output error
- After the 3D tests have been stopped and a few minutes have passed, device idle power usage is still very high (3x normal)

Logs show that when the device is idling afterwards:
- Package & cores are in lower power states as expected
- GPU frequency is still at max (allowed by TDP), with 0% in RC6*
- compiz is 100% in (GPU?) IOWAIT
- Unlike in normal situations, powertop shows:
-------------------
Usage;Wakeups/s;GPU ops/s;Disk IO/s;GFX Wakeups/s;Category;Description
 77,9 ms/s;;;;;kWork;i915_hangcheck_elapsed
-------------------
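
The RC6 and frequency observations above could be checked by hand with something like the following. This is only a sketch: it assumes the i915 device is card0 and that the kernel exposes power/rc6_residency_ms and gt_cur_freq_mhz under /sys/class/drm/card0, which may not hold on every kernel or setup. The function names (rc6_percent, rc6_sample, gpu_freq) are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: i915 sysfs files live under this directory.
CARD=${CARD:-/sys/class/drm/card0}

# Percentage of an interval spent in RC6, from two residency samples.
# $1 = residency before (ms), $2 = residency after (ms), $3 = interval (ms)
rc6_percent() {
    echo $(( ($2 - $1) * 100 / $3 ))
}

# Sample RC6 residency over $1 seconds and print the percentage.
rc6_sample() {
    before=$(cat "$CARD/power/rc6_residency_ms") || return 1
    sleep "$1"
    after=$(cat "$CARD/power/rc6_residency_ms") || return 1
    rc6_percent "$before" "$after" $(( $1 * 1000 ))
}

# Print the currently requested GPU frequency in MHz.
gpu_freq() {
    cat "$CARD/gt_cur_freq_mhz"
}
```

On an idle desktop, "rc6_sample 5" would be expected to print a value close to 100; in the state described here it would stay at 0 while gpu_freq keeps reporting the maximum frequency.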

The same issue happens with last night's versions of X server, Intel DDX and Mesa (which should fix one issue with spilling), and with versions of them that are a few weeks older.

Dmesg attached. Earlier GEN bug 92774 seems to have had a similar issue.

Comment 1 Chris Wilson 2016-11-16 16:35:17 UTC

They are all secondary effects of the GPU not resetting.

Comment 2 Eero Tamminen 2016-11-16 16:57:36 UTC

That device has succeeded in running the full test set to the end only twice before this, in early September and on the 27th of October (the latter had the same X, Intel DDX and Mesa as the version which has these extra symptoms).

In both of these cases, there was a hang with CSDof and the GPU reset failed.

However, there were no repeated hang resets and the GPU was completely idle after the tests had finished.

Comment 3 yann 2016-11-16 17:37:40 UTC

Eero, can you also attach the error dump?
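
For reference, one way the error dump could be saved right after a hang is sketched below. It assumes that i915 exposes the last error state at /sys/class/drm/card0/error (the card index may differ, and reading it may need root) and that an empty state reads as some form of "no error state collected". Both are assumptions, and save_error_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only. Assumption: the i915 error state is exposed at this path.
ERROR_FILE=${ERROR_FILE:-/sys/class/drm/card0/error}

# Copy an error state file ($1) to a destination ($2).
# Returns non-zero if the source is unreadable or holds no error state.
save_error_state() {
    [ -r "$1" ] || return 1
    # Assumption: an empty state is reported with this placeholder text.
    if grep -qi 'no error state collected' "$1"; then
        return 1
    fi
    cat "$1" > "$2"
}
```

It could be called as: save_error_state "$ERROR_FILE" "error-state-$(date +%s).txt"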

Comment 4 yann 2016-11-17 08:35:05 UTC

Eero, can you have a try with Chris' patch: https://patchwork.freedesktop.org/series/15471/ ?

Comment 5 Eero Tamminen 2016-11-18 10:43:15 UTC

Created attachment 128054
Last error state from build where there were no repeated resets

Comment 6 Eero Tamminen 2016-11-18 10:45:40 UTC

Created attachment 128055
Last error state from build with the repeated resets

Comment 7 Eero Tamminen 2016-11-18 10:48:58 UTC

Created attachment 128056
Last error state from build with the repeated resets, using newer mesa git

This one uses:
- kernel: 04145fe15cf8c81c221e62fc9d65d93053f9bd1a
- mesa: 341fc0073a3c05fd43e9c7a33613bcb881f25f33

Comment 8 Eero Tamminen 2016-11-18 12:10:16 UTC

(In reply to yann from comment #4)
> Eero, can you have a try with Chris' patch:
> https://patchwork.freedesktop.org/series/15471/ ?

Didn't help.  There are still recurring hangs after the test-case stops.

Valtteri came up with a test-case that triggers the issue within a few minutes:
------ hang.sh --------
#!/bin/sh
# Repeatedly start a SynMark test and kill it mid-run.
# $1 = number of iterations
for i in $(seq "$1"); do
    ./synmark2 OglBatch0 &
    sleep 2
    killall synmark2
done
-----------------------
$ ./hang.sh 100
-----------------------
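
A variant of the loop that stops as soon as the failure shows up could look roughly like this. It is a sketch under assumptions: the command to run is passed as arguments, its output is scanned for the "intel_do_flush_locked failed" message quoted in the description, and the process is stopped with kill on its PID rather than killall, which may not be enough if the benchmark spawns children. run_until_failure, has_flush_failure and DELAY are hypothetical names.

```shell
#!/bin/sh
# Sketch only: run a command up to N times, stopping it after DELAY seconds,
# and report the first iteration whose output contains the flush failure.
DELAY=${DELAY:-2}
FAIL_MSG='intel_do_flush_locked failed'

# True if the log file $1 contains the failure message.
has_flush_failure() {
    grep -q "$FAIL_MSG" "$1"
}

# $1 = max iterations, remaining arguments = command to run
run_until_failure() {
    count=$1
    shift
    log=$(mktemp) || return 2
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" > "$log" 2>&1 &
        pid=$!
        sleep "$DELAY"
        # The process may already have exited, so ignore kill/wait errors.
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
        if has_flush_failure "$log"; then
            echo "failure seen on iteration $i"
            rm -f "$log"
            return 0
        fi
        i=$((i + 1))
    done
    rm -f "$log"
    return 1
}
```

For the case above it could be used as: run_until_failure 100 ./synmark2 OglBatch0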

(Mika's now looking into the issue.)

Comment 9 Mika Kuoppala 2016-12-14 13:43:56 UTC

This does not appear on other gt3 boxes? I suggest we close this and reopen if it does. Eero?

Comment 10 Eero Tamminen 2016-12-27 11:12:37 UTC

(In reply to Mika Kuoppala from comment #9)
> This does not appear on other gt3 boxes? I suggest we close this and reopen
> if it does. Eero?

If you refer to reset request timeouts or higher power usage due to GPU reset failing completely, I haven't seen those on any HW in the last couple of weeks.  But we no longer have the KBL-U QL9J machine in regular testing.

(There have been system hangs on the same CarChase offscreen tests on SKL GT2 & BXT, but I guess that's a different issue.)

Comment 11 Eero Tamminen 2017-01-05 16:04:50 UTC

Something similar may now be happening on SKL GT2, since yesterday.  After the GFXBench CarChase tests (which often hang the GPU), all tests fail.

However, I don't have logs, as Jenkins times out the test run and reboots into another test run.

Comment 12 Eero Tamminen 2017-05-29 11:23:25 UTC

I haven't seen reset request timeout errors this year, so I think this can be closed.

BXT J4205 had higher power consumption after all tests had been run (and CarChase offscreen had hung as before) on 3 days around May 7th, but there were no reset timeouts, so it's a different issue.  I didn't see anything similar on other devices in the last 2 months (or when using newer Mesa, which no longer triggers the hangs so frequently).

Comment 13 Eero Tamminen 2017-07-07 11:26:09 UTC

Haven't seen this in a long time, so marking it as fixed.
