Bug 97104

Summary: [KBL IVB] kicked stuck waiters missed irq when running drv_missed_irq
Product: DRI Reporter: cprigent <christophe.prigent>
Component: DRM/Intel Assignee: Chris Wilson <chris>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: IVB, KBL i915 features: GEM/Other
Attachments:
Description Flags
IVB-drv_missed_irq-kern.log none
IVB-kms_setmode__clone_exclusive_crtc-kern.log none
KBLU-drv_missed_irq-kern.log none
KBLU-drv_missed_irq-output none

Description cprigent 2016-07-28 09:44:02 UTC
Created attachment 125363 [details]
IVB-drv_missed_irq-kern.log

Platform: IVB
CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (family 6, model 58, stepping 9)
Motherboard version: DH77EB
GPU: Intel® HD Graphics 4000 - Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller
Software
Bios: EBH7710H.86A.0096.2012.1012.1645
Linux distribution: Ubuntu 16.04 64 bits
Kernel: 4.7.0-rc7 7eeb04a from http://cgit.freedesktop.org/drm-intel/
  commit 7eeb04a101316645916d4d9df058a9341797f1af
  Author: Chris Wilson <chris@chris-wilson.co.uk>
  Date:   Sun Jul 24 11:00:31 2016 +0100
  drm-intel-nightly: 2016y-07m-24d-09h-59m-54s UTC integration manifest
drm: libdrm-2.4.70 0caa84c from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xserver: xorg-server-1.18.0-497 0b2f308 from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-687 6988b87 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-26 c36971c from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-58 e554446 from git://git.freedesktop.org/git/vaapi/intel-driver
Intel-Gpu-Tools 1.15-140 e3abb20 from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git

Steps:
------
1. Execute IGT test: drv_missed_irq

Actual result
--------------
1. Test fails:
root@IVB102:/opt/X11R7/src/intel-gpu-tools/tests# ./drv_missed_irq
IGT-Version: 1.15-ge3abb20 (x86_64) (Linux: 4.7.0-rc7-nightly+ x86_64)
(drv_missed_irq:4592) CRITICAL: Test assertion failure function __real_main115, file drv_missed_irq.c:165:
(drv_missed_irq:4592) CRITICAL: Failed assertion: missed_rings == expect_rings
(drv_missed_irq:4592) CRITICAL: error: 0 != 0x7
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [_start+0x0]
  #2 [__libc_start_main+0xf0]
  #3 [_start+0x29]
  #4 [<unknown>+0x29]
Test drv_missed_irq failed.
**** DEBUG ****
(drv_missed_irq:4592) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(drv_missed_irq:4592) drmtest-DEBUG: Test requirement passed: !(fd<0)
(drv_missed_irq:4592) DEBUG: Test requirement passed: gem_mmap__has_wc(fd)
(drv_missed_irq:4592) DEBUG: Testing rings 7
(drv_missed_irq:4592) DEBUG: Executing on ring render [1]
(drv_missed_irq:4592) DEBUG: Executing on ring bsd [2]
(drv_missed_irq:4592) DEBUG: Executing on ring bsd1 [2002]
(drv_missed_irq:4592) DEBUG: Executing on ring bsd2 [4002]
(drv_missed_irq:4592) DEBUG: Executing on ring blt [3]
(drv_missed_irq:4592) DEBUG: Executing on ring vebox [4]
(drv_missed_irq:4592) CRITICAL: Test assertion failure function __real_main115, file drv_missed_irq.c:165:
(drv_missed_irq:4592) CRITICAL: Failed assertion: missed_rings == expect_rings
(drv_missed_irq:4592) CRITICAL: error: 0 != 0x7
****  END  ****
FAIL (7.458s)

Expected result:
----------------
1. Test passes
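
The failed assertion in the log compares two ring bitmasks. Reading "error: 0 != 0x7" as missed_rings on the left and expect_rings on the right (consistent with "Testing rings 7"), the driver reported no missed interrupts while the test expected three rings to be flagged. A minimal Python sketch of that comparison follows; the helper names are hypothetical and are not part of IGT or i915, and no ring names are attached to the bit positions because the log does not state that mapping.

```python
def ring_bits(mask: int) -> list[int]:
    """Return the bit positions that are set in a ring bitmask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def unreported_rings(missed: int, expected: int) -> int:
    """Mask of rings the test expected to be flagged but the driver did not report."""
    return expected & ~missed


# Values taken from the failing run: "error: 0 != 0x7".
missed_rings = 0x0
expect_rings = 0x7

print(ring_bits(unreported_rings(missed_rings, expect_rings)))  # prints [0, 1, 2]
```

In this run every expected ring is unreported, which matches a detection path that never fired rather than a problem with one particular ring.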
Comment 1 Chris Wilson 2016-07-28 09:54:19 UTC
Hmm, yes, this was why that hangcheck -> stuck rings markup was in the idle worker. That code had an inherent race condition, just need to find a better means of flagging this failure -- less of a concern as the test is explicitly trying to break the driver.
Comment 2 cprigent 2016-08-04 12:13:01 UTC
Created attachment 125520 [details]
IVB-kms_setmode__clone_exclusive_crtc-kern.log
Comment 3 cprigent 2016-08-04 12:13:32 UTC
Created attachment 125521 [details]
KBLU-drv_missed_irq-kern.log
Comment 4 cprigent 2016-08-04 12:14:37 UTC
Created attachment 125522 [details]
KBLU-drv_missed_irq-output

Reproduced on KBL-U.

KBL-U
Hardware
Platform: KABY LAKE-U
CPU : Intel(R) Core(TM) @ 2.60GHz
MCP : KBL-U G0 2+2
QDF : QYQ8
Chipset PCH: SPT-LP C1
CRB : KABY LAKE U DDR3L RVP7 CRB FAB1

Software
BIOS: 38_07 KBLSE2R1.R00.X038.P07.1606200632 from https://ubit-artifactory-ba.intel.com/artifactory/simple/one-windows-local/Submissions/ifwi/KBL_PURPLE_IFWI_2016_WW26_1_00_HR'16/IFWI-KBL_PURPLE_IFWI_2016_WW26_1_00_HR'16-R.7z
ME FW: 11.5.0.1058
EC FW: 1.19
Ksc (EC FW): 1.20
Linux distribution: Ubuntu 16.04 64 bits
Kernel: 4.7.0 6f87e85 from http://cgit.freedesktop.org/drm-intel/
  commit 6f87e85fa302ffdb4cb9f4cd712691165923c7a2
  Author: Chris Wilson <chris@chris-wilson.co.uk>
  Date:   Mon Aug 1 15:53:41 2016 +0100
  drm-intel-nightly: 2016y-08m-01d-14h-53m-17s UTC integration manifest
drm: libdrm-2.4.70 f19cd3a from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xserver: xorg-server-1.18.0-502 c833c08 from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-688 49daf5d from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-40 f7e2263 from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-64 1cd6795 from git://git.freedesktop.org/git/vaapi/intel-driver
GuC 9.14 from http://rdvivi-hillsboro.jf.intel.com/firmware/kbl_guc_ver9_14.tar.bz2 
DMC 1.01 from: https://01.org/linuxgraphics/downloads/kabylake-dmc-1.01 
Intel-Gpu-Tools 1.15-188 53b4dfd from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 5 Chris Wilson 2016-08-05 09:45:04 UTC
In my tree, I have moved the kicking from hangcheck to intel_breadcrumbs.c which prevents this problem. I'll try and get that polished up...
Comment 6 yann 2016-08-05 11:08:37 UTC
Christophe, please try with Chris' patch: https://patchwork.freedesktop.org/series/10711/
Comment 7 Chris Wilson 2016-08-05 12:44:27 UTC
Had some fun with the test case as well. It appears that on execlists (at least) we could get spurious interrupts due to the IIR bit being set for the user interrupt and a second context-switch interrupt causing processing of the wakeup.
Comment 8 Chris Wilson 2016-08-10 09:40:45 UTC
commit 83348ba84ee0d5d4d982e5382bfbc8b2a2d05e75
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 9 17:47:51 2016 +0100

    drm/i915: Move missed interrupt detection from hangcheck to breadcrumbs
    
    In commit 2529d57050af ("drm/i915: Drop racy markup of missed-irqs from
    idle-worker") the racy detection of missed interrupts was removed when
    we went idle. This however opened up the issue that the stuck waiters
    were not being reported, causing a test case failure. If we move the
    stuck waiter detection out of hangcheck and into the breadcrumb
    mechanism (i.e. the waiter) itself, we can avoid this issue entirely.
    This leaves hangcheck looking for a stuck GPU (inspecting for request
    advancement and HEAD motion), and breadcrumbs looking for a stuck
    waiter - hopefully make both easier to understand by their segregation.
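
The idea in the commit message, letting the waiter itself notice that its request completed without an interrupt ever waking it, can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in intel_breadcrumbs.c: the Engine class, wait_for_seqno(), and the timeout and poll-count parameters are all hypothetical, and a threading.Event stands in for the user interrupt.

```python
import threading


class Engine:
    """Toy stand-in for a GPU engine; not the i915 data structure."""

    def __init__(self) -> None:
        self.seqno = 0                  # last request the "GPU" has completed
        self.irq = threading.Event()    # set when the "interrupt" is delivered


def wait_for_seqno(engine: Engine, target: int,
                   timeout: float = 0.01, max_polls: int = 3) -> tuple[bool, bool]:
    """Wait for `target` to complete; return (completed, missed_irq).

    missed_irq is True when the request had completed but the waiter was
    only released by its own timeout, i.e. no interrupt woke it.
    """
    for _ in range(max_polls):
        woken = engine.irq.wait(timeout)   # False means the wait timed out
        engine.irq.clear()
        if engine.seqno >= target:
            return True, not woken
    # The request never completed: that is a stuck GPU, which the commit
    # message leaves to hangcheck rather than to the waiter.
    return False, False


# The request is already complete, but the interrupt was never delivered.
engine = Engine()
engine.seqno = 5
print(wait_for_seqno(engine, 5))  # prints (True, True)
```

Because the check runs in the waiter, it does not depend on a separate periodic worker observing the engine at the right moment, which is the kind of race the earlier comments describe.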
