97020 – [hsw] GPU hang on first context use (after long uptime)

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Reported:	2016-07-21 11:52 UTC by Nicolás Lichtmaier
Modified:	2017-07-24 23:15 UTC
CC List:	1 user

i915 platform:	HSW
i915 features:	GPU hang

Attachments
screenshot (20.45 KB, image/png) 2016-07-21 11:52 UTC, Nicolás Lichtmaier
dmesg output (27.03 KB, application/gzip) 2016-07-21 11:53 UTC, Nicolás Lichtmaier
/sys/class/drm/card0/error (399.16 KB, application/gzip) 2016-07-21 11:54 UTC, Nicolás Lichtmaier
Another dump (3.08 MB, application/gzip) 2016-08-25 12:57 UTC, Nicolás Lichtmaier

Description Nicolás Lichtmaier 2016-07-21 11:52:27 UTC

Created attachment 125222
screenshot

I'm attaching a screenshot and the relevant files.

Comment 1 Nicolás Lichtmaier 2016-07-21 11:53:48 UTC

Created attachment 125223
dmesg output

Comment 2 Nicolás Lichtmaier 2016-07-21 11:54:40 UTC

Created attachment 125224
/sys/class/drm/card0/error

Comment 3 Chris Wilson 2016-07-21 12:30:16 UTC

A GPU hang will itself result in rendering errors, so the corruption in the screenshot may well just be a victim of the hang rather than its cause.

Comment 4 yann 2016-08-04 12:23:37 UTC

Could it be linked to GEM/GTT?

The kernel log shows several messages about alignment:

[143743.427883] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[143743.427932] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[143743.428074] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment

and in the dump (if I am reading the current render ring HEAD correctly) IPEHR is 0x0c000000, which is the MI_SET_CONTEXT just before HEAD:

0x0080dc34:      0x11000005: MI_LOAD_REGISTER_IMM
0x0080dc38:      0x00012050:    dword 1
0x0080dc3c:      0x00010001:    dword 2
0x0080dc40:      0x00022050:    dword 3
0x0080dc44:      0x00010001:    dword 4
0x0080dc48:      0x0001a050:    dword 5
0x0080dc4c:      0x00010001:    dword 6
0x0080dc50:      0x00000000: MI_NOOP
0x0080dc54:      0x0c000000: MI_SET_CONTEXT
0x0080dc58:      0x798dd10c:    gtt offset = 0x798dd000
0x0080dc5c: HEAD     0x00000000: MI_NOOP
Bad length (7) in MI_LOAD_REGISTER_IMM, [3, 3]
0x0080dc60:      0x11000005: MI_LOAD_REGISTER_IMM
0x0080dc64:      0x00012050:    dword 1
0x0080dc68:      0x00010000:    dword 2
0x0080dc6c:      0x00022050:    dword 3
0x0080dc70:      0x00010000:    dword 4
0x0080dc74:      0x0001a050:    dword 5
0x0080dc78:      0x00010000:    dword 6
0x0080dc7c:      0x04000001: MI_ARB_ON_OFF
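
For readers following the dump above, here is a minimal Python sketch of how the command header dwords break down. It assumes the usual MI command layout (command type in bits 31:29, opcode in bits 28:23, and for MI_LOAD_REGISTER_IMM a length field in bits 7:0 holding the total dword count minus 2); the opcode table covers only the four commands that appear in this dump, and the function names are made up for illustration.

```python
# Illustrative decoder for the MI command headers seen in the ring dump.
# Assumption: MI commands have type 0 in bits 31:29 and the opcode in bits 28:23.
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x08: "MI_ARB_ON_OFF",
    0x18: "MI_SET_CONTEXT",
    0x22: "MI_LOAD_REGISTER_IMM",
}


def decode_mi_header(dword):
    """Return (name, opcode, low_bits) for a command header dword."""
    cmd_type = dword >> 29
    opcode = (dword >> 23) & 0x3F
    low_bits = dword & 0xFF
    if cmd_type != 0:
        return "not an MI command", opcode, low_bits
    return MI_OPCODES.get(opcode, "unknown"), opcode, low_bits


def lri_total_dwords(header):
    """Total size of an MI_LOAD_REGISTER_IMM packet (length field + 2)."""
    return (header & 0xFF) + 2


for header in (0x11000005, 0x0C000000, 0x00000000, 0x04000001):
    name, opcode, low = decode_mi_header(header)
    print(f"0x{header:08x}: {name} (opcode 0x{opcode:02x}, low bits 0x{low:02x})")

# One header dword plus three (register, value) pairs.
print(lri_total_dwords(0x11000005))
```

Under that reading, 0x11000005 describes a 7-dword packet carrying three register/value pairs, which matches the six dwords listed after each MI_LOAD_REGISTER_IMM. The "Bad length (7)" line therefore looks like the decoding tool expecting a single pair rather than a problem in the ring contents, though that is an inference from the dump and not something stated in this report.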

Comment 5 Chris Wilson 2016-08-04 12:42:54 UTC

The "bogus alignment" errors are themselves bogus (they are self-inflicted by the kernel and don't affect anything). The hang occurs while processing the MI_SET_CONTEXT. This kernel already has the workaround for the PSMI issue, so hopefully this is not a repeat of the last known context hangs. Of interest, though, is that this was a fresh context, so it could be some state that hasn't been cleared, etc.
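
The register writes surrounding the MI_SET_CONTEXT in the dump from comment 4 can be read with a small Python sketch. It rests on two assumptions that are not stated in this report: that the written registers are "masked" registers (upper 16 bits select which of the lower 16 bits are updated), and that the low 12 bits of the MI_SET_CONTEXT operand are flags while the rest is a page-aligned GTT offset. The flag names are my reading of the i915 headers and should be checked against the kernel source; the function names are hypothetical.

```python
# Assumption: masked registers take (mask << 16) | value in a single write.
def decode_masked_write(value):
    """Split a masked-register write into the bits it sets and clears."""
    mask = (value >> 16) & 0xFFFF
    bits = value & 0xFFFF
    return {"enabled": mask & bits, "disabled": mask & ~bits & 0xFFFF}


# Assumption: flag names and bit positions as I recall them from the i915 headers.
SET_CONTEXT_FLAGS = {
    1 << 8: "MI_MM_SPACE_GTT",
    1 << 3: "MI_SAVE_EXT_STATE_EN",
    1 << 2: "MI_RESTORE_EXT_STATE_EN",
    1 << 1: "MI_FORCE_RESTORE",
    1 << 0: "MI_RESTORE_INHIBIT",
}


def decode_set_context_operand(dword):
    """Return (gtt_offset, [flag names]) for the dword after MI_SET_CONTEXT."""
    offset = dword & 0xFFFFF000
    flags = [name for bit, name in SET_CONTEXT_FLAGS.items() if dword & bit]
    return offset, flags


print(decode_masked_write(0x00010001))  # written before MI_SET_CONTEXT
print(decode_masked_write(0x00010000))  # written after MI_SET_CONTEXT
offset, flags = decode_set_context_operand(0x798DD10C)
print(hex(offset), flags)
```

Read this way, bit 0 of the registers at 0x12050, 0x22050 and 0x1a050 is set before the context switch and cleared again afterwards, which is consistent with the workaround bracketing the MI_SET_CONTEXT.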

Comment 6 Nicolás Lichtmaier 2016-08-25 12:57:09 UTC

Created attachment 126031
Another dump

This keeps happening; here is another /sys/class/drm/card0/error dump. This time it happened while the computer was resuming from suspend.

In dmesg:

[215029.932383] [drm] stuck on render ring
[215029.933531] [drm] GPU HANG: ecode 7:0:0x84dfbffe, in chrome [7937], reason: Ring hung, action: reset
[215029.933533] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[215029.933534] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[215029.933536] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[215029.933537] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[215029.933538] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[215029.935896] drm/i915: Resetting chip after gpu hang
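
When collecting several of these reports, the "GPU HANG" line can be pulled apart mechanically. The following Python sketch parses the line format shown above; the regular expression is written against this one sample, and the field names `gen`, `ring` and `ecode` are my interpretation of the three colon-separated values, not something the log itself states.

```python
import re

# Pattern written against the single sample line in this report.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_hang_line(line):
    """Return a dict of fields from a 'GPU HANG' dmesg line, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["gen"] = int(info["gen"])
    info["ring"] = int(info["ring"])
    info["pid"] = int(info["pid"])
    info["ecode"] = int(info["ecode"], 16)
    return info


line = (
    "[215029.933531] [drm] GPU HANG: ecode 7:0:0x84dfbffe, "
    "in chrome [7937], reason: Ring hung, action: reset"
)
print(parse_hang_line(line))
```

Comparing the parsed `ecode` across dumps is a quick way to check whether two hangs are likely to be the same failure before opening the full error state.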

Comment 7 Jari Tahvanainen 2017-03-28 16:11:35 UTC

Nicolás, we seem to have neglected this bug quite a bit, apologies. Do you still see this problem with the latest kernel (preferably the drm-tip branch from git://anongit.freedesktop.org/drm-tip)?
Mark this as REOPENED if you can reproduce it (and attach the kernel log and card0/error), or RESOLVED if you cannot.

Comment 8 Chris Wilson 2017-03-28 16:20:15 UTC

commit 5d4bac5503fcc67dd7999571e243cee49371aef7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 22 20:59:30 2017 +0000

    drm/i915: Restore marking context objects as dirty on pinning

Comment 9 Chris Wilson 2017-03-28 16:21:34 UTC

Weird. The error state matches the expected pattern for the fix, but this report is much, much older than the regression that patch fixes. It could just be an older version of the same bug...
