Bug 111395 - GPU HANG: ecode 9:0:0x86dfbff9, in Map-GL [6016], reason: Hang on render ring, action: reset
Summary: GPU HANG: ecode 9:0:0x86dfbff9, in Map-GL [6016], reason: Hang on render ring...
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: 18.1
Hardware: x86-64 (AMD64) other
Importance: high critical
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
Reported: 2019-08-14 02:45 UTC by yugang
Modified: 2019-08-22 04:40 UTC

Attachments
GPU HANG: ecode 9:0:0x86dfbff9 (2.14 MB, text/plain)
2019-08-14 02:45 UTC, yugang
raw of error state (35.32 KB, text/plain)
2019-08-16 01:21 UTC, yugang
decoded error state with newer tools (5.65 MB, text/plain)
2019-08-16 07:55 UTC, Kenneth Graunke
new log related to the hang (2.28 MB, text/plain)
2019-08-22 03:08 UTC, yugang
new raw error state (27.82 KB, text/plain)
2019-08-22 04:40 UTC, yugang

Description yugang 2019-08-14 02:45:19 UTC
Created attachment 145051 [details]
GPU HANG: ecode 9:0:0x86dfbff9
Comment 1 yugang 2019-08-14 02:47:24 UTC
We found this on Android with Mesa 18.1. Could you help check whether a duplicate bug exists or a patch has already fixed the issue? Thanks.
Comment 2 Kenneth Graunke 2019-08-15 21:35:51 UTC
Do you have a raw error state that wasn't run through intel_error_decode?  That tool is old/broken and doesn't decode things correctly.
Comment 3 yugang 2019-08-16 01:21:39 UTC
Created attachment 145073 [details]
raw of error state
Comment 4 yugang 2019-08-16 01:23:50 UTC
(In reply to Kenneth Graunke from comment #2)
> Do you have a raw error state that wasn't run through intel_error_decode? 
> That tool is old/broken and doesn't decode things correctly.

I have attached the raw error state; sorry for the oversight. By the way, are there other tools for decoding the raw state, or should I just use the latest intel_error_decode? Thank you.
Comment 5 yugang 2019-08-16 02:04:04 UTC
I have also attached our initial analysis below. Thank you.

'''
GPU HANG: ecode 9:0:0x86dfbff9, in GNaviMap-GL [6016], reason: Hang on render ring, action: reset
Active process (on ring render): GNaviMap-GL [6016], score 0
ERROR: 0x00000000 FAULT_TLB_DATA: 0x0000000c 0xbd1ddf68

ACTHD: 0x00000000 ffffaa9c   <-- Instruction at this address is being parsed now by CS
IPEHR: 0x79000002                 <--  (head of the instruction which was parsed previously) -> "3DSTATE_DRAWING_RECTANGLE" 

 

Also, no bit is set in the ERROR register (0x00000000), so at least no page fault (memory issue) was reported.

ACTHD is the register to look at here, because it points at the 3D instruction the command streamer was parsing when the GPU hang was detected.
"ACTHD: 0x00000000 ffffaa9c" corresponds to the GTT address 0xffffaa9c.

The header DWord just before it, at 0xffffaa98, is 0x79000002, which is the same value as IPEHR.

0xffffaa98: 0x79000002: 3DSTATE_DRAWING_RECTANGLE
0xffffaa9c: 0x00000000: top left: 0,0

 

The 3DSTATE_DRAWING_RECTANGLE command sets the 3D drawing rectangle and related state.
The GPU may have hung while executing 3DSTATE_DRAWING_RECTANGLE or an earlier instruction. We do not know for sure, but the decode contains 33 entries of "0x7b000005: 3DPRIMITIVE: fail sequential".
One of these "fail" entries is right before 3DSTATE_DRAWING_RECTANGLE.

The log shows that a 3DPRIMITIVE, which submits vertices to the 3D pipeline, was at the top of the RCS queue. We assume it was issued by glDrawArrays (or an equivalent draw call).
Below is one of them:

Bad length 7 in (null), expected 6-6
0xffffca9c: 0x7b000005: 3DPRIMITIVE: fail sequential    -> previously parsed
0xffffcaa0: 0x00000006: vertex count
0xffffcaa4: 0x00000004: start vertex
0xffffcaa8: 0x00000000: instance count
0xffffcaac: 0x00000001: start instance
0xffffcab0: 0x00000000: index bias
0xffffcab4: 0x00000000: MI_NOOP
0xffffcab8: 0x78260000: 3DSTATE_BINDING_TABLE_POINTERS_VS    -> currently being parsed
0xffffcabc: 0x00000000: dword 1
0xffffcac0: 0x782a0000: 3DSTATE_BINDING_TABLE_POINTERS_PS
0xffffcac4: 0x00000840: dword 1
0xffffcac8: 0x782f0000: 3DSTATE_SAMPLER_STATE_POINTERS_PS
0xffffcacc: 0x00000860: dword 1
'''
We could not take the analysis further with the material we have. We did not see anything abnormal in the ring or batch buffer (for example an empty buffer or an invalid buffer length). The hang may be caused by something near 0xffffaa9c, but the instructions in the buffer show nothing abnormal.
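
For reference, the header DWords quoted in the analysis above can be taken apart by hand. The sketch below assumes the usual layout of Intel 3D (GFXPIPE) command headers: bits 31:29 command type, 28:27 subtype, 26:24 opcode, 23:16 sub-opcode, and 7:0 a length field that holds the total DWord count minus 2. The class and function names are made up for illustration and are not part of any Mesa tool.

```python
from typing import NamedTuple


class GfxPipeHeader(NamedTuple):
    command_type: int
    subtype: int
    opcode: int
    sub_opcode: int
    total_dwords: int


def decode_gfxpipe_header(dword: int) -> GfxPipeHeader:
    """Split a 3D command header into fields (assumed layout, see the text above)."""
    return GfxPipeHeader(
        command_type=(dword >> 29) & 0x7,
        subtype=(dword >> 27) & 0x3,
        opcode=(dword >> 24) & 0x7,
        sub_opcode=(dword >> 16) & 0xFF,
        # The length field stores the total number of DWords minus 2.
        total_dwords=(dword & 0xFF) + 2,
    )


# Header values copied from the error state in this report.
for name, value in [
    ("3DSTATE_DRAWING_RECTANGLE", 0x79000002),
    ("3DPRIMITIVE", 0x7B000005),
    ("3DSTATE_BINDING_TABLE_POINTERS_VS", 0x78260000),
]:
    print(f"{name}: {decode_gfxpipe_header(value)}")
```

Under that assumption 0x7b000005 describes a 7-DWord 3DPRIMITIVE, which lines up with the "Bad length 7 in (null), expected 6-6" message: the decoder appears to know only a 6-DWord form, so the "fail sequential" entries may be a decoder limitation and not evidence of a malformed draw.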
Comment 6 Kenneth Graunke 2019-08-16 07:55:32 UTC
Created attachment 145075 [details]
decoded error state with newer tools

Here's a copy of the error state decoded with aubinator_error_decode, from Mesa.  

That's the best tool for decoding error states these days.  Build Mesa with -Dtools=intel to get aubinator_error_decode.  Unfortunately, the version in master doesn't work with error states from kernels this old anymore, but I hacked the tool to make it work again.  Patch for that is in the 'old-error-decode' branch of https://gitlab.freedesktop.org/kwg/mesa/
Comment 7 Kenneth Graunke 2019-08-16 08:47:44 UTC
A couple quick observations...

1. 3DSTATE_DRAWING_RECTANGLE is usually not the culprit, it's just a non-pipelined command, which causes the command streamer to stop there until previous draws complete.  Those previous draws are likely what's actually hanging.

2. With the updated decode, you can see that the draw immediately before ACTHD is an indexed trilist draw with 66 vertices.  Prior to that appears to be a BLORP stencil clear...(?)

3. The beginning of the batch looks somewhat...overwritten with zeroes.

It begins with:

- 6 MI_NOOPs aka DWords of 0
- PIPE_CONTROL <Constant, Texture>
- PIPE_CONTROL <Depth, Render, CS stall>
- STATE_BASE_ADDRESS

We never begin a batch with zeroes, so seeing MI_NOOP (0) is suspect.  In many cases, that has meant something has written past the end of a buffer and happens to have clobbered adjacent memory, in this case the batch buffer.

The first draw in the batch is a RECTLIST, so we can assume the first operation came from BLORP.

The first thing it does is flush caches, which would perform two PIPE_CONTROLs - the first with <Depth, Render, and CS stall>, then another with <Constant, Texture>.  PIPE_CONTROL is 6 DWords long, so it's as if the first one got zeroed out somehow, then the second one is there.  (Then again, at the start of a new batch, we shouldn't need to do these cache flushes either...)

After that, it emits state base address.  That does a PIPE_CONTROL with <Depth, DC, Render, CS Stall, Write Immediate>, then SBA, then a PIPE_CONTROL with <State, Texture, Instruction>.  Those are all present.  After that things look pretty normal again.

I would suspect a buffer underallocation, but that's just an initial guess...
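
Observation 3 above can be turned into a quick check when comparing several error states: count the zero DWords at the start of the batch and see whether the run is a multiple of the 6-DWord PIPE_CONTROL length. This is only a sketch; the function names are hypothetical, and the example batch is a made-up placeholder, not data from the attached error state.

```python
from typing import Sequence

PIPE_CONTROL_DWORDS = 6  # PIPE_CONTROL length as stated in the comment above


def leading_zero_dwords(batch: Sequence[int]) -> int:
    """Count the zero DWords (MI_NOOP encodings) at the very start of a batch."""
    count = 0
    for dword in batch:
        if dword != 0:
            break
        count += 1
    return count


def looks_clobbered(batch: Sequence[int]) -> bool:
    """Flag a batch that starts with zeroes, which the driver is not expected to emit."""
    return leading_zero_dwords(batch) > 0


# Placeholder batch: six zero DWords followed by an arbitrary non-zero DWord.
example_batch = [0x00000000] * 6 + [0x12345678]
zeros = leading_zero_dwords(example_batch)
print("leading zero DWords:", zeros)
print("multiple of PIPE_CONTROL length:", zeros % PIPE_CONTROL_DWORDS == 0)
print("looks clobbered:", looks_clobbered(example_batch))
```

A zero run that is exactly one command long is consistent with a whole packet having been overwritten, but it does not say who did the overwriting.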
Comment 8 yugang 2019-08-16 09:04:08 UTC
FYI.

https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on the same platform, for your reference.
Comment 9 yugang 2019-08-19 02:50:56 UTC
(In reply to Kenneth Graunke from comment #6)
> Created attachment 145075 [details]
> decoded error state with newer tools
> 
> Here's a copy of the error state decoded with aubinator_error_decode, from
> Mesa.  
> 
> That's the best tool for decoding error states these days.  Build Mesa with
> -Dtools=intel to get aubinator_error_decode.  Unfortunately, the version in
> master doesn't work with error states from kernels this old anymore, but I
> hacked the tool to make it work again.  Patch for that is in the
> 'old-error-decode' branch of https://gitlab.freedesktop.org/kwg/mesa/

Thank you, Kenneth. I built aubinator_error_decode successfully on my side.

Besides the error decode, please let us know if you need additional Mesa logging enabled. The reproduction rate is very low, and so far I have only enabled -DDEBUG and MESA_VERBOSE = 0xffff in Mesa, but I am not sure whether that will produce a useful log.
Comment 10 yugang 2019-08-19 03:52:27 UTC
I have raised the importance because this blocks our production release. Please comment if you have concerns about this. Thank you.
Comment 11 Tapani Pälli 2019-08-19 04:58:11 UTC
(In reply to yugang from comment #10)
> update the importance due to this block our production release, pls comments
> if you have concerns about this, thank you

Did you try with more up-to-date Mesa to see if the app works there? This would enable us to bisect how it got fixed. I know celadon has a 19.1.4 branch so maybe would be worth a try? Also, can you pinpoint to the Map-GL app, is it a web app?
Comment 12 yugang 2019-08-19 05:35:45 UTC
The issue has only been reproduced on the customer side, at a low rate and without fixed reproduction steps, yet it blocks the release. The customer cannot move to a new Mesa/kernel now because the change is too large and could affect the quality of the whole system.

The application is not a web application; it is a map navigation app.
Comment 13 Mark Janes 2019-08-19 06:13:02 UTC
It is unrealistic to expect that the problem can be pinpointed solely by looking at the error state.  The entire graphics stack is a big question mark for this product.

If you want progress to be made on this bug, you need to follow up on the requests that have been made:

 - test with an upstream recent kernel
 - test with the latest mesa release
 - Provide reproduction steps

If your customer thinks updating to a newer graphics stack is riskier than shipping the bugs that have been already fixed, you still need to *test* with the new stack to understand which bug fixes need backports to your system.
Comment 14 Kenneth Graunke 2019-08-19 15:50:53 UTC
(In reply to yugang from comment #8)
> FYI.
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on same
> platform, for your reference

Based on Chris's comment there, it sounds even more like a BO underallocation somewhere.  This could very well be fixed in a newer Mesa - we've fixed several of those in the last couple years, and your Mesa is a year out of date.  I understand that upgrading the kernel is difficult, but upgrading only Mesa (at least for testing purposes) should be a lot more doable.  It's worth trying that first.  Even if your customer can't actually upgrade, it may help you figure out what to backport.
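
To illustrate the underallocation theory, here is a toy model of how a write sized for a larger buffer can zero the start of whatever sits next to it. Everything in it is hypothetical: the sizes, the fill pattern, and the assumption that the small buffer and the batch are adjacent. It does not model how the driver or kernel actually places BOs.

```python
def write_dwords(memory: bytearray, offset: int, count: int, value: int = 0) -> None:
    """Write `count` little-endian DWords at byte `offset`, with no per-buffer bounds check."""
    for i in range(count):
        start = offset + 4 * i
        memory[start:start + 4] = value.to_bytes(4, "little")


def read_dwords(memory: bytearray, offset: int, count: int) -> list:
    """Read `count` little-endian DWords starting at byte `offset`."""
    return [
        int.from_bytes(memory[offset + 4 * i:offset + 4 * i + 4], "little")
        for i in range(count)
    ]


SMALL_BO_DWORDS = 4   # what was allocated (hypothetical)
WRITTEN_DWORDS = 10   # what the writer actually clears (hypothetical)
BATCH_DWORDS = 8      # size of the neighbouring "batch" (hypothetical)

# Toy address space: the small buffer immediately followed by the batch.
memory = bytearray(4 * (SMALL_BO_DWORDS + BATCH_DWORDS))
batch_offset = 4 * SMALL_BO_DWORDS

# Fill the batch with a recognizable non-zero pattern.
write_dwords(memory, batch_offset, BATCH_DWORDS, 0xDEADBEEF)

# Clearing 10 DWords into a 4-DWord buffer spills 6 DWords into the batch.
write_dwords(memory, 0, WRITTEN_DWORDS, 0)

batch = read_dwords(memory, batch_offset, BATCH_DWORDS)
print([hex(d) for d in batch])
```

In this model the first six DWords of the batch read back as zero while the rest are untouched, which is the same shape as the leading MI_NOOPs seen in the error state.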
Comment 15 yugang 2019-08-20 01:46:05 UTC
Both issues have only been reproduced on the customer side (in a specific app, with uncertain steps, and possibly other differences), so we cannot do more testing on our side so far.
Comment 16 yugang 2019-08-20 05:08:05 UTC
(In reply to Kenneth Graunke from comment #14)
> (In reply to yugang from comment #8)
> > FYI.
> > 
> > https://bugs.freedesktop.org/show_bug.cgi?id=111396 also happened on same
> > platform, for your reference
> 
> Based on Chris's comment there, it sounds even more like a BO
> underallocation somewhere.  This could very well be fixed in a newer Mesa -
> we've fixed several of those in the last couple years, and your Mesa is a
> year out of date.  I understand that upgrading the kernel is difficult, but
> upgrading only Mesa (at least for testing purposes) should be a lot more
> doable.  It's worth trying that first.  Even if your customer can't actually
> upgrade, it may help you figure out what to backport.

@Kenneth, do you know of any patches that fixed BO underallocation bugs, or of debugging methods for them? Could you check and provide some examples, so that we can evaluate the current issue and see whether there are logs or methods for further debugging?

These two issues are seriously blocking the customer's milestone; a Mesa/kernel upgrade can only be considered for a future release.
Comment 17 Kenneth Graunke 2019-08-21 07:30:42 UTC
The HW context in this error state also looks entirely garbage.  Unlike bug 111396, the batch here is an ordinary linear buffer, however.
Comment 18 yugang 2019-08-22 03:08:27 UTC
Created attachment 145122 [details]
new log related to the hang

Attached the new log from the customer. It seems to show the same situation: the batch buffer begins with empty content and may contain some unexpected content.
Comment 19 yugang 2019-08-22 03:15:20 UTC
(In reply to Kenneth Graunke from comment #17)
> The HW context in this error state also looks entirely garbage.  Unlike
> 11396, the batch here is an ordinary linear buffer, however.
Hi Kenneth, do you think something at the application level or elsewhere in userspace (e.g. other non-GL/GLES components) could cause this (I mean the buffer overwrite)?
Comment 20 yugang 2019-08-22 04:40:32 UTC
Created attachment 145125 [details]
new raw error state

