Created attachment 143808 [details]
i915_error_state_and_reproducer

GPU hang happened after ~15h.

Environment:
============
OS: Ubuntu18.04.1LTS
KERNEL: 5.0.0
MICROCODE: 0xc6
CPU MODEL: Intel(R) Xeon(R) CPU E3-1578L v5 @ 2.00GHz
ASLR IS ENABLED
HT IS ENABLED
CPU's: 8, THREAD's TO CORE: 2

MediaDriver: https://github.com/intel/media-driver, commit: 2241415
mediasdk: https://github.com/Intel-Media-SDK/MediaSDK, commit dad7abb
gmmlib: https://github.com/intel/gmmlib, commit 8bee050
libva-utils: https://github.com/intel/libva-utils, commit 72dab0c
libva: https://github.com/intel/libva, commit a99bab1

Kernel commit: 1c163f4c7b3f621efff9b28a47abb36f7378d783
Suspected commit: f36c071f6344e0a335ed4b4e0b3a38c0dd54648b
Commit description:
    drm/i915/ringbuffer: Clear semaphore sync registers on ring init

    Ensure that the sync registers are cleared every time we restart the ring to avoid stale values from creeping in from random neutrinos.

cat /proc/cmdline:
\boot\vmlinuz-5.0.0 root=LABEL=TARGET_OS ro vconsole.font=latarcyrheb-sun16 crashkernel=128M vconsole.keymap=us biosdevname=0 LANG=en_US.UTF-8 systemd.debug modprobe.blacklist=ast,mgag200 intel_pstate=disable intel_idle.max_cstate=1 enable_rc6=0 initrd=boot\initrd.img-5.0.0

GPU hang doesn't reproduce on Kernel 4.19.5.
Command line for reproducing and "i915 error state" are in the attachment.
For reproducing you can get the streams from http://fate-suite.ffmpeg.org/
Of particular note,

  batch: [0x00000000_05489000, 0x00000000_05491000]
  BBADDR: 0x00000000_029b5269

(GPU is reading far out of bounds)

And

  HEAD: 0x040019e8 [0x00000ac8]
  head = 0x000019e8, wraps = 32
  TAIL: 0x000014b8 [0x00000b28, 0x00000b48]
  ELSP[0]: pid 3041, ban score 0, seqno 26:0c994e24, prio 2, emitted -184 ms, start ffd4b000, head 00001430, tail 000014b8
  ELSP[1]: pid 3041, ban score 0, seqno 2e:0c994e25, prio 2, emitted -160 ms, start ffc3b000, head 00001158, tail 000011d8
  START: 0xffd4b000

says the CS overran the RING_TAIL.

But that doesn't explain

  seqno: 0x0c994354
  last_seqno: 0x0c994e25

as the older seqno is not in the ELSP0 ring (so would not have been rewritten by the overrun).

I suspect the pipecontrol writes stopped first.
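For anyone re-checking these numbers, here is a minimal sketch of the arithmetic in Python. The mask values are an assumption taken from the i915 RING_HEAD layout (wrap count in bits 31:21, dword-aligned offset in bits 20:2); the helper names are made up for illustration and are not part of any tool.

```python
# Assumed RING_HEAD layout: wrap count in bits 31:21, offset in bits 20:2.
HEAD_WRAP_COUNT = 0xFFE00000
HEAD_ADDR = 0x001FFFFC


def decode_ring_head(value):
    """Split a raw RING_HEAD value into (offset, wrap count)."""
    return value & HEAD_ADDR, (value & HEAD_WRAP_COUNT) >> 21


def in_batch(bbaddr, start, end):
    """True if bbaddr lies inside the half-open batch range [start, end)."""
    return start <= bbaddr < end


head, wraps = decode_ring_head(0x040019E8)
print(hex(head), wraps)  # 0x19e8 32

# BBADDR from the error state against the reported batch bounds.
print(in_batch(0x029B5269, 0x05489000, 0x05491000))  # False

# Naive comparison with TAIL (0x14b8). It ignores the legitimate case where
# the tail has wrapped around the ring, so treat it as a hint only.
print(head > 0x14B8)  # True
```

The decoded offset and wrap count match the "head = 0x000019e8, wraps = 32" line printed in the error state.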
BTW, if needed, we're ready to collect more details, bisect i915 changes, etc. Please just tell us which specific steps to take. This issue is pretty important for us because it affects MFE (multi-frame encode), the main performance feature of the media stack on Linux.
file.par contains mixed capitalization, DOS line endings, and references to files not in fate-suite.ffmpeg.org. Even reproduce.sh is in DOS format, confusing the decoder with 'file.par\r'.

media-driver b893f25d35a18a24208f7f858341347d491b460b doesn't build due to an unreported dependency on ?? (gmmlib).

Assuming I do get a working stack, where should I find the sample files?
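The line-ending problem can be worked around locally before running the reproducer. A minimal sketch in Python, assuming the files are small enough to read into memory; the function names are illustrative only, and the file names on the command line would be whatever the attachment actually contains (e.g. reproduce.sh and file.par).

```python
import sys
from pathlib import Path


def dos_to_unix(data: bytes) -> bytes:
    """Replace CRLF line endings with LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file at `path` in place with Unix line endings."""
    p = Path(path)
    p.write_bytes(dos_to_unix(p.read_bytes()))


if __name__ == "__main__":
    # Usage: python3 dos2unix.py reproduce.sh file.par
    for name in sys.argv[1:]:
        convert_file(name)
```

This only fixes the stray '\r'; the mixed capitalization and the missing stream files still have to be sorted out by hand.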
Created attachment 143854 [details] i915_error_state_on_drm_tip_1c163f4c7b
We suspected that the GPU hang occurs only with the official kernel repository, so I built a kernel from https://github.com/freedesktop/drm-tip.git at commit 1c163f4c7b3f621efff9b28a47abb36f7378d783 and ran the test. After 40h of work, the GPU hang occurred there too. The i915_error_state is in the attachment.
(In reply to Oleg Makarov from comment #4)
> Created attachment 143854 [details]
> i915_error_state_on_drm_tip_1c163f4c7b

Sorry for the delay. head != acthd, I believe this is a Mesa bug.

rcs0 command stream:
  IDLE?: no
  START: 0xffb39000
  HEAD: 0x000020d0 [0x00001f58]
  head = 0x000020d0, wraps = 0
  TAIL: 0x00002060 [0x00001fb8, 0x00001fd8]
  CTL: 0x00003001
  len=16384, enabled
  MODE: 0x00000000
  HWS: 0xfffce000
  ACTHD: 0x00000000 14b98144
  at ring: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdcffff
    busy: CS
    busy: TSG
    busy: VFE
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  SAMPLER_INSTDONE[1][0]: 0xffffffff
  SAMPLER_INSTDONE[1][1]: 0xffffffff
  SAMPLER_INSTDONE[1][2]: 0xffffffff
  SAMPLER_INSTDONE[2][0]: 0xffffffff
  SAMPLER_INSTDONE[2][1]: 0xffffffff
  SAMPLER_INSTDONE[2][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[1][0]: 0xffffffff
  ROW_INSTDONE[1][1]: 0xffffffff
  ROW_INSTDONE[1][2]: 0xffffffff
  ROW_INSTDONE[2][0]: 0xffffffff
  ROW_INSTDONE[2][1]: 0xffffffff
  ROW_INSTDONE[2][2]: 0xffffffff
  batch: [0x00000000_15bb0000, 0x00000000_15bb8000]
  BBADDR: 0x00000000_14b98145
  BB_STATE: 0x00000020
  INSTPS: 0x00009080
  INSTPM: 0x00000000
  FADDR: 0x00000000 14b98300
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000
  SYNC_1: 0x00000000
  SYNC_2: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000832b1f000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  seqno: 0x1245c424
  last_seqno: 0x1245cee4
  waiting: yes
  ring->head: 0x00001d10
  ring->tail: 0x00002060
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 0ms (4332067992; epoch)
  engine reset count: 0
  ELSP[0]: pid 5869, ban score 0, seqno 36:1245cee3, prio 3, emitted -516ms, start ffb39000, head 00001fe0, tail 00002060
  ELSP[1]: pid 5869, ban score 0, seqno 23:1245cee4, prio 3, emitted -508ms, start ffda3000, head 00002908, tail 00002988
  Active context: sample_multi_tr[5869] user_handle 16 hw_id 54, prio 0, ban score 0 guilty 0 active 0

Bad count in PIPE_CONTROL
0x15bb0000: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb0004: 0x001018bc: destination address
0x15bb0008: 0x00000000: immediate dword low
0x15bb000c: 0x00000000: immediate dword high
Bad count in PIPE_CONTROL
0x15bb0018: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb001c: 0x0000089c: destination address
0x15bb0020: 0x00000000: immediate dword low
0x15bb0024: 0x00000000: immediate dword high
Bad count in PIPE_CONTROL
0x15bb0030: 0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0x15bb0034: 0x00104080: destination address
0x15bb0038: 0x00017560: immediate dword low
0x15bb003c: 0x00000000: immediate dword high
0x15bb0048: 0x11000001: MI_LOAD_REGISTER_IMM
0x15bb004c: 0x00007034: dword 1
0x15bb0050: 0x80000040: dword 2
0x15bb0054: 0x69041301: 3DSTATE_PIPELINE_SELECT
Bad length 19 in STATE_BASE_ADDRESS, expected 6-10
0x15bb0058: 0x61010011: STATE_BASE_ADDRESS
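The "Bad count" and "Bad length" messages look like the decoder's own command tables disagreeing with the lengths encoded in the command headers, not necessarily a malformed batch. A minimal sketch of the length arithmetic in Python, assuming the usual convention for commands that carry a length field (DWord length in bits 7:0, biased by 2); `total_dwords` is a made-up helper name.

```python
def total_dwords(header):
    """Total command size in dwords, assuming a length field in bits 7:0
    that stores (total dwords - 2). Single-dword commands without a length
    field, such as PIPELINE_SELECT, must not be passed through this."""
    return (header & 0xFF) + 2


print(total_dwords(0x7A000004))  # 6  (PIPE_CONTROL)
print(total_dwords(0x11000001))  # 3  (MI_LOAD_REGISTER_IMM)
print(total_dwords(0x61010011))  # 19 (STATE_BASE_ADDRESS)
```

Six dwords per PIPE_CONTROL agrees with the 0x18-byte spacing between the PIPE_CONTROL headers in the dump, and 19 is the length the decoder complains about for STATE_BASE_ADDRESS.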
Please move the product back to DRM if you disagree.
Definitely not Mesa:

0x05489060: 0x69041301: PIPELINE_SELECT
0x05489060: 0x69041301 : Dword 0
    Pipeline Selection: 1 (Media)
    Media Sampler DOP Clock Gate Enable: false
    Force Media Awake: false
    Mask Bits: 19

Must be a media driver.
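The decode above can be reproduced by hand. A minimal sketch in Python, assuming the Gen9 PIPELINE_SELECT field layout (pipeline selection in bits 1:0, Media Sampler DOP Clock Gate Enable in bit 4, Force Media Awake in bit 5, mask bits in bits 15:8); the function and dictionary names are illustrative only.

```python
# Assumed Gen9 encoding of the Pipeline Selection field.
PIPELINES = {0: "3D", 1: "Media", 2: "GPGPU"}


def decode_pipeline_select(dw0):
    """Decode the payload bits of a PIPELINE_SELECT dword."""
    return {
        "pipeline": PIPELINES.get(dw0 & 0x3, "reserved"),
        "media_sampler_dop_clock_gate_enable": bool(dw0 & (1 << 4)),
        "force_media_awake": bool(dw0 & (1 << 5)),
        "mask_bits": (dw0 >> 8) & 0xFF,
    }


fields = decode_pipeline_select(0x69041301)
print(fields["pipeline"], fields["mask_bits"])  # Media 19
```

A batch that selects the media pipeline is what points away from Mesa here.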
As Chris indicates above to me this seems to be a driver/HW bug, but perhaps a tracking bug should also be filed against the media driver at https://github.com/intel/media-driver/ for any input from the media driver team?
(In reply to ashutosh.dixit from comment #9)
> As Chris indicates above to me this seems to be a driver/HW bug, but perhaps
> a tracking bug should also be filed against the media driver at
> https://github.com/intel/media-driver/ for any input from the media driver
> team?

It sounds reasonable. Oleg, would you mind submitting an issue against https://github.com/intel/media-driver/?
We recently reran the stress tests on Ubuntu18.04.1LTS, KERNEL: 5.1.9, and the bug did not reproduce. The bug is no longer relevant as of this kernel version, so we can close it.
(In reply to Oleg Makarov from comment #11)
> Recently we rerun stress tests on Ubuntu18.04.1LTS, KERNEL: 5.1.9., and bug
> not reproduced. The bug not actual from this kernel version, and we can
> close it.

Thanks for the feedback. Closing this bug as NOTOURBUG.