93579 – [HSW] GPU HANG: ecode 7:0:0x85ddfffc, in john [2277], reason: Ring hung, action: reset


Status:	RESOLVED INVALID

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Ian Romanick
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-01-04 17:41 UTC by Frank Dittrich
Modified:	2017-02-10 22:39 UTC
CC List:	2 users

See Also:
i915 platform:	HSW
i915 features:	GPU hang

Attachments
Contents of /sys/class/drm/card0/error (3.01 MB, text/plain) 2016-01-04 17:41 UTC, Frank Dittrich

Description Frank Dittrich 2016-01-04 17:41:30 UTC

Created attachment 120800
Contents of /sys/class/drm/card0/error

This is from dmesg:


[ 1791.185004] [drm] stuck on render ring
[ 1791.186261] [drm] GPU HANG: ecode 7:0:0x85ddfffc, in john [2277], reason: Ring hung, action: reset
[ 1791.186265] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1791.186268] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1791.186270] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1791.186273] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1791.186275] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1791.188446] drm/i915: Resetting chip after gpu hang
[ 1797.181911] [drm] stuck on render ring
[ 1797.183147] [drm] GPU HANG: ecode 7:0:0x85ddfffc, in john [2277], reason: Ring hung, action: reset
[ 1797.185338] drm/i915: Resetting chip after gpu hang
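For triage it can help to pull the fields out of the "GPU HANG" line programmatically. The following is a minimal sketch, assuming the line format shown in the dmesg excerpt above; the function name parse_gpu_hang and the field names are hypothetical, and reading the three ecode parts as hardware generation, ring id and an error hash is an assumption, not something stated in this report.

```python
import re

# Pattern for the i915 hang summary line as it appears in the dmesg excerpt.
# The group names (gen, ring, code) are an assumed reading of "ecode A:B:0xC".
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\S+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a 'GPU HANG' dmesg line, or None if absent."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields; the hash is printed as hex.
    info["gen"] = int(info["gen"])
    info["ring"] = int(info["ring"])
    info["code"] = int(info["code"], 16)
    info["pid"] = int(info["pid"])
    return info


if __name__ == "__main__":
    sample = (
        "[ 1791.186261] [drm] GPU HANG: ecode 7:0:0x85ddfffc, in john [2277], "
        "reason: Ring hung, action: reset"
    )
    print(parse_gpu_hang(sample))
```

Lines that do not contain a hang summary, such as "stuck on render ring", return None, so the function can be applied to every dmesg line in turn.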


I am attaching the contents of /sys/class/drm/card0/error.

I got the error on a Fedora 22 system with kernel 4.4.0-0.rc6.git1.1.vanilla.knurd.1.fc22.x86_64.

The CPU is Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz.

I built and installed beignet from the latest master commit.
(master)beignet $ git describe --tags 
Release_v1.0.0-654-gf749808

Then I built John the Ripper using a recent bleeding-jumbo commit
https://github.com/magnumripper/JohnTheRipper/commit/ca11872eaf094b0dbe90ba3f74fae5366d2b3125
(bleeding-jumbo)run $ git describe --tags 
1.8.0.6-jumbo-1-1814-gca11872

(bleeding-jumbo)run $ ./john --list=build-info 
Version: 1.8.0.6-jumbo-1-1814-gca11872
Build: linux-gnu 64-bit AVX2-ac OMP
SIMD: AVX2, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
$JOHN is ./
Format interface version: 13
Max. number of reported tunable costs: 3
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
SALT_HASH_SIZE: 1048576
Max. Markov mode level: 400
Max. Markov mode password length: 30
gcc version: 5.3.1
GNU libc version: 2.21 (loaded: 2.21)
OpenCL headers version: 1.2
Crypto library: OpenSSL
OpenSSL library version: 0100010bf
OpenSSL 1.0.1k-fips 8 Jan 2015
GMP library version: 6.0.0
Regex library version: 1.3	(loaded: 1.3.1)
File locking: fcntl()
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's

I got the GPU hang when running the self test for office2013-opencl format.

(bleeding-jumbo)run $ ./john --test=0 --format=office2013-opencl
Will run 4 OpenMP threads
Device 0: Intel(R) HD Graphics Haswell GT2 Desktop
Testing: office2013-opencl, MS Office 2013 (100,000 iterations) [SHA512 OpenCL 2x AES]... (4xOMP) Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=34 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER -DHASH_LOOPS=100 -DUNICODE_LENGTH=96 -DV_WIDTH=2 $JOHN/kernels/office2013_kernel.cl
Local worksize (LWS) 7, global worksize (GWS) 49
drm_intel_gem_bo_context_exec() failed: Input/output error
OpenCL CL_OUT_OF_RESOURCES error in opencl_office2013_fmt_plug.c:323 - failed in clEnqueueNDRangeKernel
(bleeding-jumbo)run $ echo $?
1

I got similar problems with some older Linux kernels, beignet versions or John the Ripper versions.

I have no idea what causes the GPU hang, but I doubt John the Ripper is to blame.

John the Ripper's OpenCL formats work on various other GPUs.
Once I even ran that test after disabling the hang check:

# echo -n 0 > /sys/module/i915/parameters/enable_hangcheck

I had to switch off power after about half an hour.

Please let me know what else to test.

I would like to know what causes the GPU hang and how to fix or work around it.

Comment 1 yann 2016-09-15 16:10:08 UTC

Improvements have been pushed to the kernel and Mesa that should benefit your system, so please re-test with the latest kernel and Mesa to see whether this issue still occurs (you may need to upgrade to a more recent version of Fedora).

In the meantime, I am assigning this to the Mesa product (please let me know if I am mistaken about this GPU hang).

According to the error dump, the hang occurs in the render ring batch, with the active head at 0x7c79f140 and IPEHR set to 0x7a000003 (PIPE_CONTROL).

Batch extract (around 0x7c79f140):

0x7c79f11c:      0x0000007f: MI_NOOP
0x7c79f120:      0xffffffff: UNKNOWN
0x7c79f124:      0x70040000: 3D UNKNOWN: 3d_965 opcode = 0x7004
0x7c79f128:      0x00000000: MI_NOOP
0x7c79f12c:      0x7a000003: PIPE_CONTROL
0x7c79f130:      0x00100020:    no write, cs stall, DC flush,
0x7c79f134:      0x00000000:    destination address
0x7c79f138:      0x00000000:    immediate dword low
0x7c79f13c:      0x00000000:    immediate dword high
0x7c79f140:      0x7a000003: PIPE_CONTROL
0x7c79f144:      0x00101400:    no write, cs stall, render target cache flush, texture cache invalidate,
0x7c79f148:      0x00000000:    destination address
0x7c79f14c:      0x00000000:    immediate dword low
0x7c79f150:      0x00000000:    immediate dword high
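The flag names in the extract follow from individual bits of the second PIPE_CONTROL dword. The sketch below decodes only the four flags that appear in this extract; the bit positions (20, 12, 10, 5) and the post-sync field in bits 15:14 are taken from my recollection of the i915 register definitions and should be treated as assumptions, and the helper names are hypothetical.

```python
# Assumed bit positions for the PIPE_CONTROL flags seen in the extract.
# Only the flags that occur in this batch are listed; the real command has more.
PIPE_CONTROL_FLAGS = {
    20: "cs stall",
    12: "render target cache flush",
    10: "texture cache invalidate",
    5: "DC flush",
}


def decode_pipe_control_flags(dword):
    """Return the names of the known flags set in a PIPE_CONTROL flags dword."""
    # Bits 15:14 select the post-sync operation; 0 means nothing is written.
    post_sync = (dword >> 14) & 0x3
    names = ["no write" if post_sync == 0 else "post-sync op %d" % post_sync]
    names += [name for bit, name in PIPE_CONTROL_FLAGS.items() if dword & (1 << bit)]
    return names


def command_length(header):
    """Total dwords in the command, assuming the low byte holds (length - 2)."""
    return (header & 0xFF) + 2


if __name__ == "__main__":
    for flags in (0x00100020, 0x00101400):
        print(hex(flags), ", ".join(decode_pipe_control_flags(flags)))
    print("dwords per PIPE_CONTROL:", command_length(0x7A000003))
```

Under these assumptions both PIPE_CONTROL commands in the extract request a command streamer stall, and the one at the active head additionally flushes the render target cache and invalidates the texture cache.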

Comment 2 yann 2016-11-04 15:38:01 UTC

Please test a new version of Mesa (12 or 13) and mark as REOPENED if you can reproduce and RESOLVED/* if you cannot reproduce.

If you can reproduce, please capture and upload an apitrace (https://github.com/apitrace/apitrace) so that we can easily reproduce as well.

Comment 3 Annie 2017-02-10 22:39:13 UTC

Dear Reporter,

This Mesa bug has been in the "NEEDINFO" status for over 60 days. I am closing this bug based on lack of response but feel free to reopen if resolution is still needed. Please ensure you're supplying the correct information as requested.

Thank you.
