Summary: | Stability issue in i915 during continuous 15h transcode workload with MFE | |
---|---|---|---
Product: | DRI | Reporter: | Oleg Makarov <oleg.makarov>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED NOTOURBUG | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | high | CC: | chris, dmitry.ermilov, dmitry.v.rogozhkin, intel-gfx-bugs
Version: | unspecified | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Whiteboard: | Triaged, ReadyForDev | |
i915 platform: | SKL | i915 features: | GPU hang
Description
Oleg Makarov
2019-03-29 16:51:08 UTC
Of particular note,

    batch: [0x00000000_05489000, 0x00000000_05491000]
    BBADDR: 0x00000000_029b5269

(GPU is reading far out of bounds)

And

    HEAD: 0x040019e8 [0x00000ac8]
    head = 0x000019e8, wraps = 32
    TAIL: 0x000014b8 [0x00000b28, 0x00000b48]
    ELSP[0]: pid 3041, ban score 0, seqno 26:0c994e24, prio 2, emitted -184 ms, start ffd4b000, head 00001430, tail 000014b8
    ELSP[1]: pid 3041, ban score 0, seqno 2e:0c994e25, prio 2, emitted -160 ms, start ffc3b000, head 00001158, tail 000011d8
    START: 0xffd4b000

says the CS overran the RING_TAIL. But that doesn't explain

    seqno: 0x0c994354
    last_seqno: 0x0c994e25

as the older seqno is not in the ELSP0 ring (so would not have been rewritten by the overrun). I suspect the pipecontrol writes stopped first.

BTW, if needed, we're ready to collect more details, bisect i915 changes, etc. Please just give us the specific steps to follow. This issue is pretty important for us because it affects MFE (multi-frame encode), the main performance feature of the media stack on Linux.

file.par contains mixed capitalization, DOS line endings, and references to files not in fate-suite.ffmpeg.org. Even reproduce.sh is in DOS format, confusing the decoder with 'file.par\r'.

media-driver b893f25d35a18a24208f7f858341347d491b460b doesn't build due to an unreported dependency on ?? (gmmlib).

Assuming I do get a working stack, where should I find the sample files?

Created attachment 143854 [details]
i915_error_state_on_drm_tip_1c163f4c7b
We suspected that the GPU hang occurs only with the official kernel repository, so I built a kernel from https://github.com/freedesktop/drm-tip.git at commit 1c163f4c7b3f621efff9b28a47abb36f7378d783 and ran the test. After 40 hours of work, the GPU hang occurred there too. The i915_error_state is in the attachment.

(In reply to Oleg Makarov from comment #4)
> Created attachment 143854 [details]
> i915_error_state_on_drm_tip_1c163f4c7b

Sorry for the delay. head != acthd, so I believe this is a Mesa bug.

    rcs0 command stream:
    IDLE?: no
    START: 0xffb39000
    HEAD: 0x000020d0 [0x00001f58]
    head = 0x000020d0, wraps = 0
    TAIL: 0x00002060 [0x00001fb8, 0x00001fd8]
    CTL: 0x00003001
    len=16384, enabled
    MODE: 0x00000000
    HWS: 0xfffce000
    ACTHD: 0x00000000 14b98144
    at ring: 0x00000000
    IPEIR: 0x00000000
    IPEHR: 0x7a000004
    INSTDONE: 0xffdcffff
    busy: CS
    busy: TSG
    busy: VFE
    SC_INSTDONE: 0xffffffff
    SAMPLER_INSTDONE[0][0]: 0xffffffff
    SAMPLER_INSTDONE[0][1]: 0xffffffff
    SAMPLER_INSTDONE[0][2]: 0xffffffff
    SAMPLER_INSTDONE[1][0]: 0xffffffff
    SAMPLER_INSTDONE[1][1]: 0xffffffff
    SAMPLER_INSTDONE[1][2]: 0xffffffff
    SAMPLER_INSTDONE[2][0]: 0xffffffff
    SAMPLER_INSTDONE[2][1]: 0xffffffff
    SAMPLER_INSTDONE[2][2]: 0xffffffff
    ROW_INSTDONE[0][0]: 0xffffffff
    ROW_INSTDONE[0][1]: 0xffffffff
    ROW_INSTDONE[0][2]: 0xffffffff
    ROW_INSTDONE[1][0]: 0xffffffff
    ROW_INSTDONE[1][1]: 0xffffffff
    ROW_INSTDONE[1][2]: 0xffffffff
    ROW_INSTDONE[2][0]: 0xffffffff
    ROW_INSTDONE[2][1]: 0xffffffff
    ROW_INSTDONE[2][2]: 0xffffffff
    batch: [0x00000000_15bb0000, 0x00000000_15bb8000]
    BBADDR: 0x00000000_14b98145
    BB_STATE: 0x00000020
    INSTPS: 0x00009080
    INSTPM: 0x00000000
    FADDR: 0x00000000 14b98300
    RC PSMI: 0x00000010
    FAULT_REG: 0x00000000
    SYNC_0: 0x00000000
    SYNC_1: 0x00000000
    SYNC_2: 0x00000000
    GFX_MODE: 0x00008000
    PDP0: 0x0000000832b1f000
    PDP1: 0x0000000000000000
    PDP2: 0x0000000000000000
    PDP3: 0x0000000000000000
    seqno: 0x1245c424
    last_seqno: 0x1245cee4
    waiting: yes
    ring->head: 0x00001d10
    ring->tail: 0x00002060
    hangcheck stall: yes
    hangcheck action: dead
    hangcheck action timestamp: 0ms (4332067992; epoch)
    engine reset count: 0
    ELSP[0]: pid 5869, ban score 0, seqno 36:1245cee3, prio 3, emitted -516ms, start ffb39000, head 00001fe0, tail 00002060
    ELSP[1]: pid 5869, ban score 0, seqno 23:1245cee4, prio 3, emitted -508ms, start ffda3000, head 00002908, tail 00002988
    Active context: sample_multi_tr[5869] user_handle 16 hw_id 54, prio 0, ban score 0 guilty 0 active 0

    Bad count in PIPE_CONTROL
    0x15bb0000: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
    0x15bb0004: 0x001018bc: destination address
    0x15bb0008: 0x00000000: immediate dword low
    0x15bb000c: 0x00000000: immediate dword high
    Bad count in PIPE_CONTROL
    0x15bb0018: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
    0x15bb001c: 0x0000089c: destination address
    0x15bb0020: 0x00000000: immediate dword low
    0x15bb0024: 0x00000000: immediate dword high
    Bad count in PIPE_CONTROL
    0x15bb0030: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
    0x15bb0034: 0x00104080: destination address
    0x15bb0038: 0x00017560: immediate dword low
    0x15bb003c: 0x00000000: immediate dword high
    0x15bb0048: 0x11000001: MI_LOAD_REGISTER_IMM
    0x15bb004c: 0x00007034: dword 1
    0x15bb0050: 0x80000040: dword 2
    0x15bb0054: 0x69041301: 3DSTATE_PIPELINE_SELECT
    Bad length 19 in STATE_BASE_ADDRESS, expected 6-10
    0x15bb0058: 0x61010011: STATE_BASE_ADDRESS

Please revert the product back to DRM if you disagree.

Definitely not Mesa:

    0x05489060: 0x69041301: PIPELINE_SELECT
    0x05489060: 0x69041301 : Dword 0
        Pipeline Selection: 1 (Media)
        Media Sampler DOP Clock Gate Enable: false
        Force Media Awake: false
        Mask Bits: 19

Must be a media driver.

As Chris indicates above, to me this seems to be a driver/HW bug, but perhaps a tracking bug should also be filed against the media driver at https://github.com/intel/media-driver/ for any input from the media driver team?
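
For anyone re-checking the two error states in this thread, the two observations that drive the analysis (whether BBADDR lies inside the reported batch range, and how the raw HEAD value splits into a ring offset and a wrap count) can be sketched in a few lines of Python. This is only an illustrative sketch, not an i915 tool: the function names are hypothetical, and the HEAD bit masks are an assumption about the ring HEAD register layout (wrap count in bits 31:21, dword-aligned offset in bits 20:2). The masks do agree with the "head = ..., wraps = ..." lines printed in both dumps.

```python
# Hypothetical helpers for sanity-checking values copied from an
# i915_error_state dump. Names and masks are assumptions for illustration.

HEAD_WRAP_COUNT = 0xFFE00000  # assumed: wrap counter in bits 31:21
HEAD_ADDR = 0x001FFFFC        # assumed: dword-aligned ring offset in bits 20:2


def parse_addr(text):
    """Parse an address printed as '0x00000000_05489000' into an int."""
    return int(text.replace("_", ""), 16)


def decode_ring_head(head_reg):
    """Split a raw HEAD register value into (ring offset, wrap count)."""
    offset = head_reg & HEAD_ADDR
    wraps = (head_reg & HEAD_WRAP_COUNT) >> 21
    return offset, wraps


def bbaddr_in_batch(bbaddr, batch_start, batch_end):
    """Return True if bbaddr lies inside the half-open range [start, end)."""
    return batch_start <= bbaddr < batch_end


if __name__ == "__main__":
    # Values from the first error state quoted in this thread.
    start = parse_addr("0x00000000_05489000")
    end = parse_addr("0x00000000_05491000")
    bbaddr = parse_addr("0x00000000_029b5269")
    print("BBADDR inside batch:", bbaddr_in_batch(bbaddr, start, end))

    offset, wraps = decode_ring_head(0x040019E8)
    print("head = 0x%08x, wraps = %d" % (offset, wraps))
```

With the values from either dump, the BBADDR check reports that the address is outside the batch, which matches the "reading far out of bounds" observation above.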
(In reply to ashutosh.dixit from comment #9)
> As Chris indicates above to me this seems to be a driver/HW bug, but perhaps
> a tracking bug should also be filed against the media driver at
> https://github.com/intel/media-driver/ for any input from the media driver
> team?

It sounds reasonable. Oleg, would you mind submitting an issue against https://github.com/intel/media-driver/?

We recently reran the stress tests on Ubuntu 18.04.1 LTS with kernel 5.1.9, and the bug did not reproduce. The bug is no longer relevant as of this kernel version, so we can close it.

(In reply to Oleg Makarov from comment #11)
> Recently we rerun stress tests on Ubuntu18.04.1LTS, KERNEL: 5.1.9., and bug
> not reproduced. The bug not actual from this kernel version, and we can
> close it.

Thanks for the feedback. Closing this bug as NOTOURBUG.
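
For future reproducers attached to reports like this one: the DOS line endings noted earlier in the thread (the decoder being handed 'file.par\r') can be stripped before the files are shared. The following is a minimal sketch, assuming the files are plain text in which every carriage return is part of a CRLF pair; the function names are hypothetical, and the file names in the usage note are the ones mentioned in this thread.

```python
from pathlib import Path


def has_dos_line_endings(data: bytes) -> bool:
    """Return True if the data contains at least one CRLF sequence."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path) -> bool:
    """Rewrite the file in place with LF endings.

    Returns True if the file was changed, False if it had no CRLF pairs.
    """
    target = Path(path)
    original = target.read_bytes()
    if not has_dos_line_endings(original):
        return False
    target.write_bytes(to_unix_line_endings(original))
    return True
```

Calling `convert_file("reproduce.sh")` and `convert_file("file.par")` in the directory that holds the reproducer would rewrite both files in place; working on bytes avoids guessing the text encoding.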