111396 – GPU HANG: ecode 9:0:0xedddfeff, reason: Hang on render ring, action: reset

Summary: GPU HANG: ecode 9:0:0xedddfeff, reason: Hang on render ring, action: reset

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	18.1
Hardware:	Other All

Importance:	high critical
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2019-08-14 02:52 UTC by yugang
Modified:	2019-09-25 20:34 UTC
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
GPU HANG: ecode 9:0:0xedddfeff (1.60 MB, text/plain) 2019-08-14 02:52 UTC, yugang
raw error state (33.32 KB, text/plain) 2019-08-16 01:24 UTC, yugang
new decode related to the hang (2.33 MB, text/plain) 2019-08-22 03:14 UTC, yugang
new raw error state (28.19 KB, text/plain) 2019-08-22 04:41 UTC, yugang

Description yugang 2019-08-14 02:52:39 UTC

Created attachment 145052
GPU HANG: ecode 9:0:0xedddfeff

Comment 1 yugang 2019-08-14 02:53:08 UTC

We found this on Android with Mesa 18.1. Could you help check whether an existing bug report or patch already addresses this issue? Thanks.

Comment 2 Chris Wilson 2019-08-14 08:35:51 UTC

Someone thought it would be funny to put a picture in the middle of your batch buffer. If the entire driver stack is as old as the kernel, confirming on upstream drivers is a priority.

Comment 3 yugang 2019-08-16 01:24:30 UTC

Created attachment 145074
raw error state

Comment 4 yugang 2019-08-16 02:07:17 UTC

Here is some initial analysis from our side. It is hard for the current project to switch to the latest upstream kernel or Mesa because of production requirements. We also hit another hang, which is tracked in https://bugs.freedesktop.org/show_bug.cgi?id=111395.

'''
GPU HANG: ecode 9:0:0xedddfeff, reason: Hang on render ring, action: reset

ERROR: 0x00000000
FAULT_TLB_DATA: 0x00000002 0x072a312a

ACTHD: 0x00000000 ff4b1770      <-- address of the instruction being parsed by the CS (command streamer)
IPEHR: 0x12020101               <-- header of the instruction that was parsed previously: MI_STORE_REGISTER_MEM

Also, no bit is set in the ERROR register. This means that, at the least, there is no page fault (memory issue).

ACTHD is what we need to look at here, because it points to the actual 3D instruction being parsed by the command streamer when the GPU hang was detected.
The ACTHD value "0x00000000 ff4b1770" points to the GPU instruction at GTT address "0xff4b1770".

"0xff4b1770" points to "0x12020101", which is the same value as IPEHR.

The MI_STORE_REGISTER_MEM command requests a read of a specified memory-mapped register in the device and stores that DWord to memory.
The register address is specified in the command that performs the read.

0xff4b1764: 0x12020101: MI_STORE_REGISTER_MEM
0xff4b1768: 0x00000000: dword 1
0xff4b176c: 0x00000000: dword 2
0xff4b1770: 0xd1160d0e: UNKNOWN
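
The relationship between the IPEHR value and the ACTHD address in the dump above can be checked with a short script. This is only a sketch: the function name `decode_mi_header` is hypothetical, and the field layout it uses (command type in bits 31:29, MI opcode in bits 28:23, DWord length in bits 7:0 stored as total length minus 2) is the usual MI command header convention, assumed here rather than taken from this error state.

```python
MI_STORE_REGISTER_MEM = 0x24  # MI opcode value assumed for this command


def decode_mi_header(dword):
    """Split an MI command header into (command_type, opcode, total_dwords)."""
    command_type = (dword >> 29) & 0x7  # bits 31:29, 0 means an MI command
    opcode = (dword >> 23) & 0x3F       # bits 28:23
    total_dwords = (dword & 0xFF) + 2   # bits 7:0 hold length minus 2 (assumed bias)
    return command_type, opcode, total_dwords


header_addr = 0xFF4B1764  # address of the 0x12020101 dword in the decode
command_type, opcode, total_dwords = decode_mi_header(0x12020101)

# Each dword is 4 bytes, so the command ends where the next one begins.
next_addr = header_addr + 4 * total_dwords

print(f"type={command_type} opcode={opcode:#x} dwords={total_dwords}")
print(f"next instruction at {next_addr:#x}")
```

Under these assumptions the header decodes as a three-dword MI_STORE_REGISTER_MEM starting at 0xff4b1764, so the next instruction begins at 0xff4b1770, which is the address reported in ACTHD and holds the UNKNOWN dword 0xd1160d0e.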

Comment 5 Kenneth Graunke 2019-08-21 07:29:22 UTC

Now that I've looked more closely, this one looks especially strange.

1. The batchbuffer is supposedly 0xff4b0000, but the "Active (render ring)" list includes this:

    00000000_ff4b0000    32768 3e 00 [ 7a0a 00 00 00 00 ] 00 Y dirty uncached

The batch buffer is...Y-tiled?  That's not right.  Plus, there seems to be very little batch in the batch - it's *entirely* image garbage.  If anything, it looks like we actually execbuf2'd an image which may have at one point contained some batch commands due to BO cache reuse or something.

2. The hardware context is entirely garbage as well.  Even the headers and basic structure are missing.  It's almost all zeroes.
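
The check described in point 1 can be sketched in a few lines of Python. The helper `parse_bo_line` is hypothetical, and the field interpretation (64-bit address written as two hex halves joined by an underscore, size in bytes, trailing words after the bracketed group as tiling and cache flags) is an assumption inferred from the line quoted above, not a documented format.

```python
import re


def parse_bo_line(line):
    """Parse one buffer entry from the "Active (render ring)" list."""
    m = re.match(r"\s*([0-9a-fA-F]{8})_([0-9a-fA-F]{8})\s+(\d+)\s+(.*)", line)
    if m is None:
        raise ValueError("unrecognised buffer line: " + line)
    address = int(m.group(1) + m.group(2), 16)
    size = int(m.group(3))
    # Words after the bracketed group are treated as flags (assumption).
    flags = m.group(4).split("]")[-1].split()
    tiling = "Y" if "Y" in flags else "X" if "X" in flags else None
    return {"address": address, "size": size, "tiling": tiling, "flags": flags}


line = "    00000000_ff4b0000    32768 3e 00 [ 7a0a 00 00 00 00 ] 00 Y dirty uncached"
bo = parse_bo_line(line)
acthd = 0xFF4B1770  # ACTHD from the error state in comment 4

# Does the hang address fall inside this buffer, and is the buffer tiled?
in_range = bo["address"] <= acthd < bo["address"] + bo["size"]
print(f"ACTHD inside buffer: {in_range}, tiling: {bo['tiling']}")
```

With the quoted line, the buffer spans 0xff4b0000 to 0xff4b8000, so ACTHD lies inside it, and the entry carries the Y tiling flag that a batch buffer would not be expected to have.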

I'm not sure that Mesa is likely to cause this much damage with a simple buffer underallocation.  It may be possible, but a *lot* of data is clobbered, and in multiple places which are not necessarily contiguous.

Chris, any more bright ideas?

Comment 6 Kenneth Graunke 2019-08-21 18:04:42 UTC

Chris and I are wondering if this batch might have actually originated from the media driver, rather than the GL driver.  There was a lot of media ring usage going on before the GPU crashed, and I'm seeing almost nothing that looks Mesa related in here.  (It might be, but any indicators have been utterly destroyed...)

Comment 7 yugang 2019-08-22 03:14:30 UTC

Created attachment 145123
new decode related to the hang

Attached the new log from the customer. It seems to show the same situation: the batch buffer begins with empty content and may also contain data of uncertain origin.

Comment 8 yugang 2019-08-22 04:41:11 UTC

Created attachment 145126
new raw error state

Comment 9 yugang 2019-08-22 07:54:40 UTC

Also, do you know why this hang's error state has no process name or info? (I mean the log contains nothing like "in Map-GL [6016]", as seen in issue 111395.) Is there a debug option that needs to be enabled?

Comment 10 yugang 2019-08-22 08:44:27 UTC

@Chris and Kenneth, can you give details on whether the batch buffer was doing encoding or decoding? For example, which instructions are more likely to be doing encoding or decoding? This may help the customer isolate the components involved in reproducing the issue (so far, we know the map application had both 3D and media encoding functions running). Thank you.

Comment 11 GitLab Migration User 2019-09-25 20:34:55 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1828.