Created attachment 145051 [details]
GPU HANG: ecode 9:0:0x86dfbff9
We found this on Android with Mesa 18.1. Could you help check whether there is a duplicate bug or a patch that already fixes the issue? Thanks.
Do you have a raw error state that wasn't run through intel_error_decode? That tool is old/broken and doesn't decode things correctly.
Created attachment 145073 [details]
raw of error state
(In reply to Kenneth Graunke from comment #2)
> Do you have a raw error state that wasn't run through intel_error_decode?
> That tool is old/broken and doesn't decode things correctly.
Attached the raw error state; sorry for my negligence. By the way, do you know of any other tools for decoding the raw state, or should we just use the latest intel_error_decode? Thank you.
Also attached the initial analysis from our side. Thank you.
GPU HANG: ecode 9:0:0x86dfbff9, in GNaviMap-GL , reason: Hang on render ring, action: reset
Active process (on ring render): GNaviMap-GL , score 0
ERROR: 0x00000000 FAULT_TLB_DATA: 0x0000000c 0xbd1ddf68
ACTHD: 0x00000000 ffffaa9c <-- Instruction at this address is being parsed now by CS
IPEHR: 0x79000002 <-- (head of the instruction which was parsed previously) -> "3DSTATE_DRAWING_RECTANGLE"
Also, there is no bit set in the ERROR register. This means there is no page fault (memory issue), at least.
ACTHD is what we need to look at here, because it points to the actual 3D instruction being parsed by the command streamer when the GPU hang was detected.
The ACTHD value "0x00000000 ffffaa9c" points to the GPU instruction at GTT address 0xffffaa9c.
The DWord just before it, at 0xffffaa98, holds 0x79000002, which is the same value as IPEHR:
0xffffaa98: 0x79000002: 3DSTATE_DRAWING_RECTANGLE
0xffffaa9c: 0x00000000: top left: 0,0
The 3DSTATE_DRAWING_RECTANGLE command is used to set the 3D drawing rectangle and related state.
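The header DWord can be split into fields to double-check what IPEHR is reporting. Below is a minimal Python sketch, assuming the usual Intel 3D command header layout (command type in bits 31:29, sub-type in 28:27, opcode in 26:24, sub-opcode in 23:16, and a length field in bits 7:0 that holds the total DWord count minus 2). `decode_header` is a hypothetical helper, not part of any Mesa or kernel tool.

```python
def decode_header(dword):
    """Split a 3D pipeline command header DWord into its fields.

    Bit positions are an assumption based on the common Intel
    GFXPIPE header layout; check them against the PRM for your GPU.
    """
    return {
        "type": (dword >> 29) & 0x7,
        "subtype": (dword >> 27) & 0x3,
        "opcode": (dword >> 24) & 0x7,
        "subopcode": (dword >> 16) & 0xFF,
        # The length field stores (total DWords - 2).
        "total_dwords": (dword & 0xFF) + 2,
    }


ipehr = decode_header(0x79000002)
print(ipehr)
```

Under that assumption IPEHR describes a 4-DWord packet, so an ACTHD of 0xffffaa9c would sit one DWord past the header at 0xffffaa98, still inside the same 3DSTATE_DRAWING_RECTANGLE packet.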
One possibility is that the GPU hung while executing 3DSTATE_DRAWING_RECTANGLE or an earlier instruction. We do not know for sure, but we found 33 entries of "0x7b000005: 3DPRIMITIVE: fail sequential".
One of these "fail" entries is right before 3DSTATE_DRAWING_RECTANGLE.
The log shows that what was at the top of the RCS queue is 3DPRIMITIVE, which is used to submit vertices to the 3D pipeline. We assume this was invoked by glDrawArrays (or an equivalent draw function).
Below is one of them:
Bad length 7 in (null), expected 6-6
0xffffca9c: 0x7b000005: 3DPRIMITIVE: fail sequential -> previously parsed
0xffffcaa0: 0x00000006: vertex count
0xffffcaa4: 0x00000004: start vertex
0xffffcaa8: 0x00000000: instance count
0xffffcaac: 0x00000001: start instance
0xffffcab0: 0x00000000: index bias
0xffffcab4: 0x00000000: MI_NOOP
0xffffcab8: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS -> currently being parsed
0xffffcabc: 0x00000000: dword 1
0xffffcac0: 0x782a0000: 3DSTATE_BINDING_TABLE_POINTERS_PS
0xffffcac4: 0x00000840: dword 1
0xffffcac8: 0x782f0000: 3DSTATE_SAMPLER_STATE_POINTERS_PS
0xffffcacc: 0x00000860: dword 1
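The "Bad length 7 ... expected 6-6" line can be checked against the header's own length field. This is a rough Python sketch, assuming bits 7:0 of the header hold the total DWord count minus 2. The dump lines are copied from above; `parse_dump` and `next_command` are hypothetical helper names.

```python
import re

DUMP = """\
0xffffca9c: 0x7b000005: 3DPRIMITIVE: fail sequential
0xffffcaa0: 0x00000006: vertex count
0xffffcaa4: 0x00000004: start vertex
0xffffcaa8: 0x00000000: instance count
0xffffcaac: 0x00000001: start instance
0xffffcab0: 0x00000000: index bias
0xffffcab4: 0x00000000: MI_NOOP
0xffffcab8: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS
"""

# Matches the "address: dword:" prefix of each decoded line.
LINE = re.compile(r"^(0x[0-9a-f]{8}): (0x[0-9a-f]{8}):", re.MULTILINE)


def parse_dump(text):
    """Return {address: dword} for every 'addr: dword:' line."""
    return {int(a, 16): int(d, 16) for a, d in LINE.findall(text)}


def next_command(addr, dwords):
    """Address of the command after the one whose header is at addr.

    Assumes header bits 7:0 store (total DWords - 2).
    """
    total = (dwords[addr] & 0xFF) + 2
    return addr + 4 * total


dwords = parse_dump(DUMP)
nxt = next_command(0xFFFFCA9C, dwords)
print(hex(nxt), hex(dwords[nxt]))  # 0xffffcab8 0x78260000
```

If the assumption holds, the 3DPRIMITIVE header declares 7 DWords, so the DWord the old tool labelled MI_NOOP would be the last DWord of the packet and the field labels above may be shifted. That would point to the "fail sequential" text being a decoder artifact rather than evidence of a bad draw, but this is only an inference.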
We couldn't do further analysis with the materials in our hands. We didn't see anything abnormal in the ring/batch buffer (e.g. an empty buffer or an invalid buffer length). The hang may be caused by something near "ffffaa9c", but the instructions in the buffer show nothing abnormal.
Created attachment 145075 [details]
decoded error state with newer tools
Here's a copy of the error state decoded with aubinator_error_decode, from Mesa.
That's the best tool for decoding error states these days. Build Mesa with -Dtools=intel to get aubinator_error_decode. Unfortunately, the version in master doesn't work with error states from kernels this old anymore, but I hacked the tool to make it work again. Patch for that is in the 'old-error-decode' branch of https://gitlab.freedesktop.org/kwg/mesa/
A couple quick observations...
1. 3DSTATE_DRAWING_RECTANGLE is usually not the culprit, it's just a non-pipelined command, which causes the command streamer to stop there until previous draws complete. Those previous draws are likely what's actually hanging.
2. With the updated decode, you can see that the draw immediately before ACTHD is an indexed trilist draw with 66 vertices. Prior to that appears to be a BLORP stencil clear...(?)
3. The beginning of the batch looks somewhat...overwritten with zeroes.
It begins with:
- 6 MI_NOOPs aka DWords of 0
- PIPE_CONTROL <Constant, Texture>
- PIPE_CONTROL <Depth, Render, CS stall>
We never begin a batch with zeroes, so seeing MI_NOOP (0) is suspect. In many cases, that has meant something has written past the end of a buffer and happens to have clobbered adjacent memory, in this case the batch buffer.
The first draw in the batch is a RECTLIST, so we can assume the first operation came from BLORP.
The first thing it does is flush caches, which would perform two PIPE_CONTROLs - the first with <Depth, Render, and CS stall>, then another with <Constant, Texture>. PIPE_CONTROL is 6 DWords long, so it's as if the first one got zeroed out somehow, then the second one is there. (Then again, at the start of a new batch, we shouldn't need to do these cache flushes either...)
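To screen further error states for the same symptom, one could count the zero DWords at the start of the batch and compare against the 6-DWord PIPE_CONTROL size mentioned above. The sketch below uses made-up batch contents for illustration only (0x7a000004 is assumed here to be a 6-DWord PIPE_CONTROL header), and `leading_zero_dwords` is a hypothetical helper.

```python
PIPE_CONTROL_DWORDS = 6  # size quoted in the comment above


def leading_zero_dwords(batch):
    """Count consecutive zero DWords (MI_NOOPs) at the start of a batch."""
    count = 0
    for dword in batch:
        if dword != 0:
            break
        count += 1
    return count


# Hypothetical batch: one zeroed packet followed by a PIPE_CONTROL.
batch = [0] * 6 + [0x7A000004, 0, 0, 0, 0, 0]

zeros = leading_zero_dwords(batch)
# A run of zeros that is a whole number of PIPE_CONTROL packets fits the
# "first flush got zeroed out" theory; it does not prove it.
suspect = zeros > 0 and zeros % PIPE_CONTROL_DWORDS == 0
print(zeros, suspect)
```

A match only says the pattern is consistent with a clobbered first PIPE_CONTROL; it cannot tell what did the overwriting.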
After that, it emits state base address. That does a PIPE_CONTROL with <Depth, DC, Render, CS Stall, Write Immediate>, then SBA, then a PIPE_CONTROL with <State, Texture, Instruction>. Those are all present. After that things look pretty normal again.
I would suspect a buffer underallocation, but that's just an initial guess...
https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on same platform, for your reference
(In reply to Kenneth Graunke from comment #6)
> Created attachment 145075 [details]
> decoded error state with newer tools
> Here's a copy of the error state decoded with aubinator_error_decode, from
> Mesa.
> That's the best tool for decoding error states these days. Build Mesa with
> -Dtools=intel to get aubinator_error_decode. Unfortunately, the version in
> master doesn't work with error states from kernels this old anymore, but I
> hacked the tool to make it work again. Patch for that is in the
> 'old-error-decode' branch of https://gitlab.freedesktop.org/kwg/mesa/
Thank you Kenneth, I built aubinator_error_decode on my side successfully.
Besides the error decode, please let us know if you need additional Mesa logging enabled. (As the reproduction rate is very low, I have only enabled -DDEBUG and MESA_VERBOSE = 0xffff in Mesa so far, but I'm not sure that will yield a useful log.)
Updated the importance because this blocks our production release. Please comment if you have concerns about this. Thank you.
(In reply to yugang from comment #10)
> update the importance due to this block our production release, pls comments
> if you have concerns about this, thank you
Did you try a more up-to-date Mesa to see if the app works there? That would enable us to bisect how it got fixed. I know celadon has a 19.1.4 branch, so maybe it would be worth a try? Also, can you tell us more about the Map-GL app? Is it a web app?
It has only been reproduced on the customer side, without fixed reproduction steps and at a low rate, yet it blocks the release. The customer can't move to a new Mesa/kernel now because the changes and their impact are too big and may affect whole-system quality.
The application is not a web application; it is a map navigation app.
It is unrealistic to expect that the problem can be pinpointed solely by looking at the error state. The entire graphics stack is a big question mark for this product.
If you want progress to be made on this bug, you need to follow up on the requests that have been made:
- test with an upstream recent kernel
- test with the latest mesa release
- Provide reproduction steps
If your customer thinks updating to a newer graphics stack is riskier than shipping the bugs that have been already fixed, you still need to *test* with the new stack to understand which bug fixes need backports to your system.
(In reply to yugang from comment #8)
> https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on same
> platform, for your reference
Based on Chris's comment there, it sounds even more like a BO underallocation somewhere. This could very well be fixed in a newer Mesa - we've fixed several of those in the last couple years, and your Mesa is a year out of date. I understand that upgrading the kernel is difficult, but upgrading only Mesa (at least for testing purposes) should be a lot more doable. It's worth trying that first. Even if your customer can't actually upgrade, it may help you figure out what to backport.
As the two issues are only reproduced on the customer side (in a special app with uncertain steps, and maybe other differences), we can't do more testing on our side so far.
(In reply to Kenneth Graunke from comment #14)
> (In reply to yugang from comment #8)
> > FYI.
> > https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on same
> > platform, for your reference
> Based on Chris's comment there, it sounds even more like a BO
> underallocation somewhere. This could very well be fixed in a newer Mesa -
> we've fixed several of those in the last couple years, and your Mesa is a
> year out of date. I understand that upgrading the kernel is difficult, but
> upgrading only Mesa (at least for testing purposes) should be a lot more
> doable. It's worth trying that first. Even if your customer can't actually
> upgrade, it may help you figure out what to backport.
@Kenneth, do you know of any patches or debug methods that addressed bugs related to BO underallocation? Could you check and provide some examples, so that we can evaluate the current issue and see whether there are logs or methods for further debugging?
These two issues seriously block the customer's milestone; a Mesa/kernel upgrade can only meet the requirements of a future release.
The HW context in this error state also looks entirely garbage. Unlike 111396, the batch here is an ordinary linear buffer, however.
Created attachment 145122 [details]
new log related to the hang
Attached the new log from the customer. It seems to show the same situation: the batch buffer begins with empty content and may contain some uncertain contents.
(In reply to Kenneth Graunke from comment #17)
> The HW context in this error state also looks entirely garbage. Unlike
> 111396, the batch here is an ordinary linear buffer, however.
Hi Kenneth, do you think the application level or the userspace system level (e.g. other non-GL/GLES components) could cause this (I mean the buffer overwrite)?
Created attachment 145125 [details]
new raw error state
Created attachment 145314 [details]
Created attachment 145315 [details]
Created attachment 145316 [details]
Created attachment 145317 [details]
Could you help check the latest two hang error states (two decoded files are also attached) to see whether they show a similar issue in the batch buffer/ring buffer as before (e.g. underallocation with random content)?
This seriously impacts the customer's production, so it is urgent for us to give feedback to the customer. Thank you.
We hit "ecode 9:0:0x84df7cfc" three times on one board. This differs from the previous hangs, which had a different ecode each time. There are some existing bugs, but no detailed fix patch was found:
1. https://bugs.freedesktop.org/show_bug.cgi?id=108557 was fixed in 18.3.2 (not sure whether it needs a kernel upgrade); the customer is using 18.1.0-devel (git-22b909edd7). The low reproduction rate also makes it hard to find the patch.
(In reply to yugang from comment #25)
> hi Kenneth，
> could you help check latest two hang error code(also attached two decode
> files) if they also have the similar issue in batch buffer/ring buffer as
> before(e.g. underallocation with random content)?
> this serious impacts the customer's productions, and so urgent to feedback
> to customer. thank you
0xefdfffff.i915_error_state.txt is total garbage once again - the batch is just obliterated by something scribbling all over memory. Random ecodes make sense, as that value is produced based on the INSTDONE bits and IPEHR (hanging instruction) - which, when your hanging instruction is some random garbage - tends to produce random ecodes.
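To make the point about ecodes concrete, here is a small Python sketch. It assumes the i915 driver of this era forms the ecode as IPEHR XOR INSTDONE for the hung engine, and that the "ecode A:B:C" message is gen:engine:code. The comment above only says the value is "based on" the two registers, so both are assumptions. `parse_ecode` and `instdone_from_ecode` are hypothetical helpers.

```python
import re


def parse_ecode(message):
    """Extract (gen, engine, code) from a 'GPU HANG: ecode G:E:0x...' line."""
    match = re.search(r"ecode (\d+):(\d+):(0x[0-9a-fA-F]+)", message)
    if match is None:
        raise ValueError("no ecode found in message")
    gen, engine, code = match.groups()
    return int(gen), int(engine), int(code, 16)


def instdone_from_ecode(ecode, ipehr):
    """Recover INSTDONE, assuming ecode = IPEHR ^ INSTDONE (XOR is its own inverse)."""
    return ecode ^ ipehr


gen, engine, code = parse_ecode("GPU HANG: ecode 9:0:0x86dfbff9, in GNaviMap-GL")
# IPEHR value taken from the first error state in this report.
print(gen, engine, hex(instdone_from_ecode(code, 0x79000002)))  # 9 0 0xffdfbffb
```

Under that assumption, any change in the "hanging instruction" DWord flips the matching bits of the ecode, which is why a batch full of garbage would produce a different ecode on every hang.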
0x84df7cfc.i915_error_state.02.txt looks more promising, it appears that it has an actual batch. However, I'm not seeing anything amiss right away. The hardware context is again all zeroes, but that may just be an error capture bug in that old kernel (perhaps https://bugs.freedesktop.org/show_bug.cgi?id=107691), and not part of the actual problem. It's really difficult to work with these old logs, there is just a ton of information missing. We started capturing a lot more information with Linux v4.14 and newer Mesa, but that isn't really an option here...
One random idea. It looks like in that log, 3DSTATE_CONSTANT_VS has a pointer of 0xfdd5b900, which looks like a real memory address and not an offset from Dynamic State Base Address. Which means that Mesa must be setting CS_DEBUG_MODE2 to make that an absolute address instead of an offset. At one point, vaapi-intel-driver didn't program CS_DEBUG_MODE2 and expected it to be an offset. The kernel also didn't isolate contexts from each other until d2b4b97933f5adacfba42dc3b9200d0e21fbe2c4, so sometimes a media process would inherit state from another context, where that mode was flipped, and repeatedly hang. The kernel getparam I915_PARAM_HAS_CONTEXT_ISOLATION is supposed to control that. I guess you must have that in your 4.9.x backport, though?
Just quickly checked the kernel on the customer side; it does not have d2b4b97933f5adacfba42dc3b9200d0e21fbe2c4.
Do you know whether backporting only this commit would help with 0x84df7cfc, or are Mesa and media side changes still needed? Thank you.
(In reply to yugang from comment #28)
> just quick checked kernel in customer side, has no
> do you know if only backporting this commit is useful for 0x84df7cfc? or
> still need mesa and media side changes? thank you
It certainly couldn't hurt. Better protecting contexts from one another is definitely helpful for system stability, otherwise all components on the systems need to agree on common values for everything.
If you don't actually have that commit in the kernel, then everybody using the GPU (mesa, media, opencl, etc) must agree on the value of CS_DEBUG_MODE2::Constant Buffer Address Offset Disable. If a driver starts up and inherits the wrong value from another context, it will absolutely crash the GPU. You can get lucky if you start your processes in the right order, I suppose...
I am really curious why your Mesa thinks it can use absolute pointers for constant buffers. It's supposed to query the kernel and only enable that functionality (!compiler->constant_buffer_0_is_relative) if the I915_PARAM_HAS_CONTEXT_ISOLATION getparam is present. Perhaps you don't have that kernel sha, but you do have the getparam? For a while, Mesa was broken and didn't check, but that was in the 17.x era.
For reference, the broken Mesa commit that introduced this without checking was
8ec5a4e4a4a32f4de351c5fc2bf0eb615b6eef1b. But it was reverted in 013d33122028f2492da90a03ae4bc1dab84c3ee9. Proper support was re-added in fa8a764b62588420ac789df79ec0ab858b38639f.
If it is easier, rather than porting the kernel change, you can disable absolute constant addressing in Mesa. It's not really critical to have it.
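One way to answer the "do you have the getparam?" question on a device is to probe the kernel directly. This Python sketch assumes the Linux x86/ARM `_IOWR` ioctl encoding, the `I915_PARAM_HAS_CONTEXT_ISOLATION` value 50 and the `drm_i915_getparam` layout from `i915_drm.h`, and a render node at `/dev/dri/renderD128` (the path is a guess and may differ on Android). It is not how Mesa performs the check; `has_context_isolation` is a hypothetical helper.

```python
import ctypes
import fcntl
import os

I915_PARAM_HAS_CONTEXT_ISOLATION = 50  # assumed from i915_drm.h


class drm_i915_getparam(ctypes.Structure):
    """Mirror of struct drm_i915_getparam { __s32 param; int *value; }."""
    _fields_ = [("param", ctypes.c_int32),
                ("value", ctypes.POINTER(ctypes.c_int))]


def iowr(group, nr, size):
    """Linux _IOWR() encoding (x86/ARM): dir | size | type | nr."""
    return (3 << 30) | (size << 16) | (ord(group) << 8) | nr


def getparam_request():
    # DRM_COMMAND_BASE (0x40) + DRM_I915_GETPARAM (0x06)
    return iowr("d", 0x40 + 0x06, ctypes.sizeof(drm_i915_getparam))


def has_context_isolation(path="/dev/dri/renderD128"):
    """Return True if the kernel reports context isolation, else False."""
    value = ctypes.c_int(0)
    arg = drm_i915_getparam(I915_PARAM_HAS_CONTEXT_ISOLATION,
                            ctypes.pointer(value))
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return False  # no such device, or no permission
    try:
        fcntl.ioctl(fd, getparam_request(), arg)
    except OSError:
        return False  # kernels without the param reject the request
    finally:
        os.close(fd)
    return value.value != 0


if __name__ == "__main__":
    print("context isolation:", has_context_isolation())
```

A `False` result on the customer kernel, combined with absolute constant buffer pointers in the batch, would be consistent with the mismatch described above, although it would not explain why Mesa chose absolute addressing.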
Thank you. Mesa on the customer side has all 3 of those commits:
8ec5a4e4a4a32f4de351c5fc2bf0eb615b6eef1b. 013d33122028f2492da90a03ae4bc1dab84c3ee9. fa8a764b62588420ac789df79ec0ab858b38639f.
We will try reverting fa8a764b62588420ac789df79ec0ab858b38639f.
Created attachment 145411 [details]
Created attachment 145412 [details]
Created attachment 145413 [details]
Created attachment 145414 [details]
Created attachment 145415 [details]
Created attachment 145416 [details]
@Kenneth, I uploaded 3 new hang error states. They seem to match the existing 0x84df7cfc.i915_error_state.02.txt. Could you take a look too, so that we can make sure no new issue has appeared? Thank you.
(In reply to yugang from comment #37)
> @Kenneth, i uploaded 3 new hang error state, seems they have same error
> state comparing with existing 0x84df7cfc.i915_error_state.02.txt, could you
> help to have a look too so that we can make sure there are no new issue
> found. thank you
Those 3 hangs were captured without reverting fa8a764b62588420ac789df79ec0ab858b38639f; they are still from the previous environment. We are now testing with fa8a764b62588420ac789df79ec0ab858b38639f reverted and will share the status soon.
Created attachment 145452 [details]
Created attachment 145453 [details]
For attachment: 0x84d77cfc.crashlog3.20190920.i915_error_state.*
@Kenneth, we can still reproduce the GPU hang with patch fa8a764b62588420ac789df79ec0ab858b38639f reverted.
We also found that there was media usage while the navigation application was running, and some devices could no longer reproduce the GPU hang once we disabled the media usage (the reproduction rate was very high before; the media usage is camera image format conversion by the media driver).
Created attachment 145454 [details]
Created attachment 145455 [details]
Created attachment 145456 [details]
For attachment: 0x84df7cfc.crashlog0.20190920.i915_error_state.*.txt
We found some large vertex buffers (e.g. "Buffer Size: 43524", "vertex buffer 0, size 43524") before "0x7b000005: 3DPRIMITIVE". Not sure whether this is abnormal. Thank you:
0xfdc00f80: 0x78080003: 3DSTATE_VERTEX_BUFFERS
0xfdc00f80: 0x78080003 : Dword 0
DWord Length: 3
0xfdc00f84: 0x0004400c : Dword 1
0xfdc00f88: 0xfeb6a090 : Dword 2
0xfdc00f8c: 0x00000000 : Dword 3
0xfdc00f90: 0x0000aa04 : Dword 4
Vertex Buffer State: <struct VERTEX_BUFFER_STATE>
0xfdc00f84: 0x0004400c : Dword 0
Buffer Pitch: 12
Null Vertex Buffer: false
Address Modify Enable: true
Vertex Buffer MOCS: 4
Memory Object Control State: <struct MEMORY_OBJECT_CONTROL_STATE>
0xfdc00f84: 0x0004400c : Dword 0
Index to MOCS Tables: 2
Vertex Buffer Index: 0
0xfdc00f88: 0xfeb6a090 : Dword 1
0xfdc00f8c: 0x00000000 : Dword 2
Buffer Starting Address: 0xfeb6a090
0xfdc00f90: 0x0000aa04 : Dword 3
Buffer Size: 43524
vertex buffer 0, size 43524
buffer contents unavailable
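The decoded fields above can be cross-checked by hand. This is a short Python sketch, assuming the VERTEX_BUFFER_STATE bit positions implied by the decode (buffer index in bits 31:26, MOCS in 22:16, Address Modify Enable in bit 14, Null Vertex Buffer in bit 13, pitch in 11:0); `decode_vb_dword0` is a hypothetical helper.

```python
def decode_vb_dword0(dword):
    """Decode DWord 0 of VERTEX_BUFFER_STATE (bit positions are assumed)."""
    return {
        "index": (dword >> 26) & 0x3F,
        "mocs": (dword >> 16) & 0x7F,
        "address_modify_enable": bool(dword & (1 << 14)),
        "null_vertex_buffer": bool(dword & (1 << 13)),
        "pitch": dword & 0xFFF,
    }


state = decode_vb_dword0(0x0004400C)  # DWord 0 from the dump above
size = 0x0000AA04                     # DWord 3: buffer size in bytes

# How many whole vertices fit, and is anything left over?
vertices, remainder = divmod(size, state["pitch"])
print(state)
print(size, vertices, remainder)  # 43524 3627 0
```

With a 12-byte pitch, 43524 bytes hold exactly 3627 vertices with no remainder, so the size is at least self-consistent. Whether the draws stay within that range cannot be judged from this state alone.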
I am wondering if this is related to the previous bug:
@dongwonk, I can't find the related functions in my Mesa version, maybe because it is too old. Thank you for the info.
@Kenneth, so far we have only reproduced this when using MSAA=2 (we tried MSAA = 0 and 4 and did not reproduce it; because of the low reproduction rate, we still need more testing to verify this).
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1827.