Created attachment 145052 [details]
GPU HANG: ecode 9:0:0xedddfeff
We found this on Android with Mesa 18.1. Could you help check whether there is a duplicate bug or a patch that already fixes the issue? Thanks.
Someone thought it would be funny to put a picture in the middle of your batch buffer. If the entire driver stack is as old as the kernel, confirming on upstream drivers is a priority.
Created attachment 145074 [details]
raw error state
Some initial analysis from our side follows. It is hard for the current project to switch to the latest upstream kernel or Mesa due to production requirements. We also hit another hang, tracked in https://bugs.freedesktop.org/show_bug.cgi?id=111395.
GPU HANG: ecode 9:0:0xedddfeff, reason: Hang on render ring, action: reset
FAULT_TLB_DATA: 0x00000002 0x072a312a
ACTHD: 0x00000000 ff4b1770 <-- Instruction at this address is being parsed by CS
IPEHR: 0x12020101 <- (head of the instruction which was parsed previously) <- MI_STORE_REGISTER_MEM
Also, there is no bit set in the error register. This means there is no page fault problem (memory issue), at least.
ACTHD is what we need to take a look at here because it helps to point to the actual 3D instructions being parsed by command streamer when GPU hang is detected.
The ACTHD value of "ACTHD: 0x00000000 ff4b1770" points to the GPU instruction at the GTT space address "0xff4b1770".
"0xff4b1770" points to "0x12020101", which is the same value as IPEHR.
The MI_STORE_REGISTER_MEM command reads a specified memory-mapped register in the device and stores that DWord to memory.
The register address is specified along with the command to perform the read.
0xff4b1764: 0x12020101: MI_STORE_REGISTER_MEM
0xff4b1768: 0x00000000: dword 1
0xff4b176c: 0x00000000: dword 2
0xff4b1770: 0xd1160d0e: UNKNOWN
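For anyone who wants to cross-check the dump above by hand, here is a minimal sketch that splits the header dword into fields and computes where the next command should start. The bit positions (command type in bits 31:29, opcode in bits 28:23, DWord length in bits 7:0 with a bias of 2) are assumptions taken from the usual MI command layout, and the function names are made up for illustration, so please verify against the hardware documentation for your platform.

```python
# Assumed MI header layout (verify against the PRM for your platform):
#   bits 31:29  command type (0 = MI)
#   bits 28:23  MI opcode
#   bits  7:0   DWord length, encoded as (total dwords - 2)

def decode_mi_header(header):
    """Split an MI command header dword into its basic fields."""
    return {
        "command_type": (header >> 29) & 0x7,
        "opcode": (header >> 23) & 0x3F,
        "dword_length": header & 0xFF,
    }

def next_command_address(start, header):
    """Address of the first dword after the command that begins at `start`."""
    total_dwords = decode_mi_header(header)["dword_length"] + 2
    return start + 4 * total_dwords

fields = decode_mi_header(0x12020101)
end = next_command_address(0xFF4B1764, 0x12020101)
print(hex(fields["opcode"]), hex(end))  # prints: 0x24 0xff4b1770
```

Under these assumptions the 3-dword command at 0xff4b1764 ends exactly at 0xff4b1770, which is where ACTHD sits, and the dword there (0xd1160d0e) does not decode as a known command.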
Now that I've looked more closely, this one looks especially strange.
1. The batchbuffer is supposedly 0xff4b0000, but the "Active (render ring)" list includes this:
00000000_ff4b0000 32768 3e 00 [ 7a0a 00 00 00 00 ] 00 Y dirty uncached
The batch buffer is...Y-tiled? That's not right. Plus, there seems to be very little batch in the batch - it's *entirely* image garbage. If anything, it looks like we actually execbuf2'd an image which may have at one point contained some batch commands due to BO cache reuse or something.
2. The hardware context is entirely garbage as well. Even the headers and basic structure are missing. It's almost all zeroes.
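The check in point 1 can be scripted if there are more error states to go through. The sketch below is only a rough helper: the line pattern is inferred from the single "Active (render ring)" entry quoted above, not from any documented format, and the names `parse_bo_line` and `is_tiled_batch` are hypothetical.

```python
import re

# Pattern inferred from the one sample line above (an assumption, not a
# documented format): "<hi32>_<lo32> <size> <remaining columns...>"
BO_LINE = re.compile(
    r"^(?P<hi>[0-9a-fA-F]{8})_(?P<lo>[0-9a-fA-F]{8})\s+(?P<size>\d+)\s+(?P<rest>.*)$"
)

def parse_bo_line(line):
    """Return address, size, tiling and dirty flag for one buffer list line."""
    m = BO_LINE.match(line.strip())
    if m is None:
        return None
    tokens = m.group("rest").split()
    return {
        "address": int(m.group("hi") + m.group("lo"), 16),
        "size": int(m.group("size")),
        # A bare "X" or "Y" column is taken to be the tiling mode.
        "tiling": next((t for t in tokens if t in ("X", "Y")), None),
        "dirty": "dirty" in tokens,
    }

def is_tiled_batch(entry, batch_address):
    """True if the buffer at the batch address is marked as tiled."""
    return entry["address"] == batch_address and entry["tiling"] is not None

line = "00000000_ff4b0000 32768 3e 00 [ 7a0a 00 00 00 00 ] 00 Y dirty uncached"
entry = parse_bo_line(line)
print(hex(entry["address"]), entry["tiling"], is_tiled_batch(entry, 0xFF4B0000))
```

A tiled buffer at the batch address is the oddity described in point 1; the script only flags it and says nothing about how it got that way.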
I'm not sure that Mesa is likely to cause this much damage with a simple buffer underallocation. It may be possible, but a *lot* of data is clobbered, and in multiple places which are not necessarily contiguous.
Chris, any more bright ideas?
Chris and I are wondering if this batch might have actually originated from the media driver, rather than the GL driver. There was a lot of media ring usage going on before the GPU crashed, and I'm seeing almost nothing that looks Mesa related in here. (It might be, but any indicators have been utterly destroyed...)
Created attachment 145123 [details]
new decode related to the hang
Attached is the new log from the customer. It seems to show the same situation: the batch buffer begins with empty content and may contain some unidentified data.
Created attachment 145126 [details]
new raw error state
Also, do you know why this hang's error state has no process name/info (I mean the log has nothing like "in Map-GL" as in issue 111395)? Is there a debug option that needs to be enabled?
@Chris and Kenneth, can you point out details about whether the batch buffer was doing encoding or decoding? E.g. which instructions are more likely to be doing encoding or decoding? This may help the customer isolate the components involved in reproducing the issue (so far, we know the map application had both 3D and media encoding functions running). Thank you.