Bug 110285 - Stability issue in i915 during continuous 15h transcode workload with MFE
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-29 16:51 UTC by Oleg Makarov
Modified: 2019-08-06 08:46 UTC

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
i915_error_state_and_reproducer (14.08 KB, application/x-zip-compressed)
2019-03-29 16:51 UTC, Oleg Makarov
i915_error_state_on_drm_tip_1c163f4c7b (13.21 KB, application/x-gzip)
2019-04-03 15:21 UTC, Oleg Makarov

Description Oleg Makarov 2019-03-29 16:51:08 UTC
Created attachment 143808
i915_error_state_and_reproducer

A GPU hang happened after ~15h of continuous transcoding.

Environment:
============
OS: Ubuntu 18.04.1 LTS
KERNEL: 5.0.0
MICROCODE: 0xc6
CPU MODEL: Intel(R) Xeon(R) CPU E3-1578L v5 @ 2.00GHz
ASLR IS ENABLED
HT IS ENABLED
CPUs: 8, THREADS PER CORE: 2

MediaDriver: https://github.com/intel/media-driver, commit 2241415
mediasdk: https://github.com/Intel-Media-SDK/MediaSDK, commit dad7abb
gmmlib: https://github.com/intel/gmmlib, commit 8bee050
libva-utils: https://github.com/intel/libva-utils, commit 72dab0c
libva: https://github.com/intel/libva, commit a99bab1
Kernel commit: 1c163f4c7b3f621efff9b28a47abb36f7378d783

Suspected commit: f36c071f6344e0a335ed4b4e0b3a38c0dd54648b:

Commit description:
drm/i915/ringbuffer: Clear semaphore sync registers on ring init
Ensure that the sync registers are cleared every time we restart the
ring to avoid stale values from creeping in from random neutrinos.

cat /proc/cmdline:
\boot\vmlinuz-5.0.0 root=LABEL=TARGET_OS ro vconsole.font=latarcyrheb-sun16 crashkernel=128M vconsole.keymap=us biosdevname=0 LANG=en_US.UTF-8 systemd.debug modprobe.blacklist=ast,mgag200 intel_pstate=disable intel_idle.max_cstate=1 enable_rc6=0 initrd=boot\initrd.img-5.0.0 

The GPU hang doesn't reproduce on kernel 4.19.5.

The command line for reproducing and the i915 error state are in the attachment.

To reproduce, you can get the streams from http://fate-suite.ffmpeg.org/
Comment 1 Chris Wilson 2019-03-29 17:24:50 UTC
Of particular note,

  batch: [0x00000000_05489000, 0x00000000_05491000]
  BBADDR: 0x00000000_029b5269

(GPU is reading far out of bounds)

And
  HEAD:  0x040019e8 [0x00000ac8]
    head = 0x000019e8, wraps = 32
  TAIL:  0x000014b8 [0x00000b28, 0x00000b48]

  ELSP[0]:  pid 3041, ban score 0, seqno       26:0c994e24, prio 2, emitted -184ms, start ffd4b000, head 00001430, tail 000014b8
  ELSP[1]:  pid 3041, ban score 0, seqno       2e:0c994e25, prio 2, emitted -160ms, start ffc3b000, head 00001158, tail 000011d8

 START: 0xffd4b000

says the CS overran the RING_TAIL.

But that doesn't explain

  seqno: 0x0c994354
  last_seqno: 0x0c994e25

as the older seqno is not in the ELSP0 ring (so would not have been rewritten by the overrun). I suspect the pipecontrol writes stopped first.
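
The two checks described in this comment can be sketched in Python as below. This is only an illustration, not driver code: the function names are hypothetical, the batch range is treated as half-open, and the 16384-byte ring size is an assumption because this excerpt of the error state does not show CTL.

```python
def in_range(addr: int, start: int, end: int) -> bool:
    """Return True if addr lies in [start, end) (half-open by assumption)."""
    return start <= addr < end


def head_in_request(head: int, req_head: int, req_tail: int, ring_size: int) -> bool:
    """Return True if head lies between req_head and req_tail on a circular ring."""
    span = (req_tail - req_head) % ring_size
    return (head - req_head) % ring_size <= span


# Values copied from the error state quoted in this comment.
BATCH_START, BATCH_END = 0x05489000, 0x05491000
BBADDR = 0x029B5269
print(in_range(BBADDR, BATCH_START, BATCH_END))  # False: outside the batch

RING_SIZE = 16384  # assumption: not shown in this excerpt
# RING_HEAD offset 0x19e8 against the ELSP[0] request window [0x1430, 0x14b8].
print(head_in_request(0x19E8, 0x1430, 0x14B8, RING_SIZE))  # False: past the tail
```

Both checks come out False for the quoted values, which matches the reading that BBADDR is outside the batch and that the command streamer went past the request's tail.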
Comment 2 Dmitry Ermilov 2019-04-01 09:39:42 UTC
BTW, if needed, we're ready to collect more details, bisect i915 changes, etc. Please just tell us the specific steps.

This issue is pretty important for us because it affects MFE (multi-frame encode) - the main performance feature of the media stack on Linux.
Comment 3 Chris Wilson 2019-04-03 09:59:30 UTC
file.par contains mixed capitalization, DOS line endings, and references to files not in fate-suite.ffmpeg.org.

Even reproduce.sh is in DOS format, confusing the decoder with 'file.par\r'.

media-driver b893f25d35a18a24208f7f858341347d491b460b doesn't build due to an unreported dependency on ?? (gmmlib).

Assuming I do get a working stack, where should I find the sample files?
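
For anyone else hitting the DOS line endings, a minimal Python sketch for converting the attachment files is below. The helper names are hypothetical, and the file names are simply the ones mentioned in this comment; it assumes the files sit in the current directory.

```python
from pathlib import Path


def dos_to_unix(data: bytes) -> bytes:
    """Replace CRLF line endings with LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: Path) -> bool:
    """Rewrite path in place with LF endings; return True if it changed."""
    original = path.read_bytes()
    converted = dos_to_unix(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False


if __name__ == "__main__":
    # File names taken from the comment above; skipped if not present.
    for name in ("reproduce.sh", "file.par"):
        p = Path(name)
        if p.exists():
            print(name, "converted" if convert_file(p) else "unchanged")
```

This only addresses the line endings; the mixed capitalization and the missing stream files still need fixing by hand.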
Comment 4 Oleg Makarov 2019-04-03 15:21:42 UTC
Created attachment 143854
i915_error_state_on_drm_tip_1c163f4c7b
Comment 5 Oleg Makarov 2019-04-03 15:24:43 UTC
We suspected that the GPU hang occurs only with the official kernel repository, so I built a kernel from https://github.com/freedesktop/drm-tip.git at commit 1c163f4c7b3f621efff9b28a47abb36f7378d783 and ran the test. After 40h of work, I got the GPU hang there too. The i915_error_state is in the attachment.
Comment 6 Lakshmi 2019-06-25 13:44:44 UTC
(In reply to Oleg Makarov from comment #4)
> Created attachment 143854
> i915_error_state_on_drm_tip_1c163f4c7b

Sorry for the delay.

head != acthd; I believe this is a Mesa bug.

rcs0 command stream:
  IDLE?: no
  START: 0xffb39000
  HEAD:  0x000020d0 [0x00001f58]
    head = 0x000020d0, wraps = 0
  TAIL:  0x00002060 [0x00001fb8, 0x00001fd8]
  CTL:   0x00003001
    len=16384, enabled
  MODE:  0x00000000
  HWS:   0xfffce000
  ACTHD: 0x00000000 14b98144
    at ring: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdcffff
    busy: CS
    busy: TSG
    busy: VFE
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  SAMPLER_INSTDONE[1][0]: 0xffffffff
  SAMPLER_INSTDONE[1][1]: 0xffffffff
  SAMPLER_INSTDONE[1][2]: 0xffffffff
  SAMPLER_INSTDONE[2][0]: 0xffffffff
  SAMPLER_INSTDONE[2][1]: 0xffffffff
  SAMPLER_INSTDONE[2][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[1][0]: 0xffffffff
  ROW_INSTDONE[1][1]: 0xffffffff
  ROW_INSTDONE[1][2]: 0xffffffff
  ROW_INSTDONE[2][0]: 0xffffffff
  ROW_INSTDONE[2][1]: 0xffffffff
  ROW_INSTDONE[2][2]: 0xffffffff
  batch: [0x00000000_15bb0000, 0x00000000_15bb8000]
  BBADDR: 0x00000000_14b98145
  BB_STATE: 0x00000020
  INSTPS: 0x00009080
  INSTPM: 0x00000000
  FADDR: 0x00000000 14b98300
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000
  SYNC_1: 0x00000000
  SYNC_2: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000832b1f000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  seqno: 0x1245c424
  last_seqno: 0x1245cee4
  waiting: yes
  ring->head: 0x00001d10
  ring->tail: 0x00002060
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 0ms (4332067992; epoch)
  engine reset count: 0
  ELSP[0]:  pid 5869, ban score 0, seqno       36:1245cee3, prio 3, emitted -516ms, start ffb39000, head 00001fe0, tail 00002060
  ELSP[1]:  pid 5869, ban score 0, seqno       23:1245cee4, prio 3, emitted -508ms, start ffda3000, head 00002908, tail 00002988
  Active context: sample_multi_tr[5869] user_handle 16 hw_id 54, prio 0, ban score 0 guilty 0 active 0

Bad count in PIPE_CONTROL
0x15bb0000:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb0004:      0x001018bc:    destination address
0x15bb0008:      0x00000000:    immediate dword low
0x15bb000c:      0x00000000:    immediate dword high
Bad count in PIPE_CONTROL
0x15bb0018:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb001c:      0x0000089c:    destination address
0x15bb0020:      0x00000000:    immediate dword low
0x15bb0024:      0x00000000:    immediate dword high
Bad count in PIPE_CONTROL
0x15bb0030:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb0034:      0x00104080:    destination address
0x15bb0038:      0x00017560:    immediate dword low
0x15bb003c:      0x00000000:    immediate dword high
0x15bb0048:      0x11000001: MI_LOAD_REGISTER_IMM
0x15bb004c:      0x00007034:    dword 1
0x15bb0050:      0x80000040:    dword 2
0x15bb0054:      0x69041301: 3DSTATE_PIPELINE_SELECT
Bad length 19 in STATE_BASE_ADDRESS, expected 6-10
0x15bb0058:      0x61010011: STATE_BASE_ADDRESS
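
As a cross-check of the decoded HEAD and CTL lines in the dump above, here is a small Python sketch. The bit layouts are an assumption (head offset in bits 20:2, wrap count in bits 31:21, ring length in 4 KiB pages minus one in bits 20:12, enable in bit 0), and the function names are hypothetical.

```python
PAGE_SIZE = 4096


def decode_ring_head(value: int) -> tuple[int, int]:
    """Split a RING_HEAD value into (head offset, wrap count)."""
    head = value & 0x001FFFFC  # assumed bits 20:2
    wraps = value >> 21        # assumed bits 31:21
    return head, wraps


def decode_ring_ctl(value: int) -> tuple[int, bool]:
    """Split a RING_CTL value into (ring length in bytes, enabled)."""
    pages = ((value >> 12) & 0x1FF) + 1  # assumed bits 20:12, pages minus one
    return pages * PAGE_SIZE, bool(value & 1)


head, wraps = decode_ring_head(0x000020D0)
print(f"head = 0x{head:08x}, wraps = {wraps}")  # head = 0x000020d0, wraps = 0

length, enabled = decode_ring_ctl(0x00003001)
print(f"len={length}, {'enabled' if enabled else 'disabled'}")  # len=16384, enabled
```

Under the same assumptions, the HEAD value 0x040019e8 from comment 1 decodes to head 0x19e8 with 32 wraps, which agrees with the line printed there.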
Comment 7 Lakshmi 2019-06-26 10:31:27 UTC
Please revert the product to DRM if you disagree.
Comment 8 Lionel Landwerlin 2019-06-26 10:58:54 UTC
Definitely not Mesa:

0x05489060:  0x69041301:  PIPELINE_SELECT                                                         
0x05489060:  0x69041301 : Dword 0
    Pipeline Selection: 1 (Media)
    Media Sampler DOP Clock Gate Enable: false
    Force Media Awake: false
    Mask Bits: 19

Must be a media driver.
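
The field values in the decode above can be reproduced by hand with the Python sketch below. The bit positions are an assumption (pipeline selection in bits 1:0, DOP clock gate enable in bit 4, force media awake in bit 5, mask bits in bits 15:8), and the function name and the selection name table are hypothetical.

```python
# Assumed encoding of the Pipeline Selection field.
PIPELINE_NAMES = {0: "3D", 1: "Media", 2: "GPGPU"}


def decode_pipeline_select(dword: int) -> dict:
    """Extract the PIPELINE_SELECT fields from dword 0 (assumed layout)."""
    selection = dword & 0x3
    return {
        "pipeline_selection": selection,
        "pipeline_name": PIPELINE_NAMES.get(selection, "unknown"),
        "media_sampler_dop_clock_gate_enable": bool((dword >> 4) & 1),
        "force_media_awake": bool((dword >> 5) & 1),
        "mask_bits": (dword >> 8) & 0xFF,
    }


fields = decode_pipeline_select(0x69041301)
for name, value in fields.items():
    print(f"{name}: {value}")
```

For 0x69041301 this gives pipeline selection 1 (Media), both enable bits false, and mask bits 19, matching the decode above. A batch that selects the media pipeline is what points away from Mesa here.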
Comment 9 ashutosh.dixit 2019-07-21 19:41:10 UTC
As Chris indicates above, this looks to me like a driver/HW bug, but perhaps a tracking bug should also be filed against the media driver at https://github.com/intel/media-driver/ to get input from the media driver team?
Comment 10 Dmitry Ermilov 2019-07-22 09:22:34 UTC
(In reply to ashutosh.dixit from comment #9)
> As Chris indicates above, this looks to me like a driver/HW bug, but perhaps
> a tracking bug should also be filed against the media driver at
> https://github.com/intel/media-driver/ to get input from the media driver
> team?

That sounds reasonable. Oleg, would you mind submitting an issue against https://github.com/intel/media-driver/?
Comment 11 Oleg Makarov 2019-08-05 15:14:15 UTC
We recently reran the stress tests on Ubuntu 18.04.1 LTS with kernel 5.1.9, and the bug did not reproduce. The bug no longer occurs as of this kernel version, so we can close it.
Comment 12 Lakshmi 2019-08-06 08:46:50 UTC
(In reply to Oleg Makarov from comment #11)
> We recently reran the stress tests on Ubuntu 18.04.1 LTS with kernel 5.1.9,
> and the bug did not reproduce. The bug no longer occurs as of this kernel
> version, so we can close it.

Thanks for the feedback. Closing this bug as NOTOURBUG.

