Bug 105363

Summary: [CI][SHARDS] igt@kms_flip@flip-vs-expired-vblank - Failed assertion: gettime_us() - start < 500
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: NEW --- QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ALL i915 features: display/Other

Description Martin Peres 2018-03-06 10:45:02 UTC
(kms_flip:7832) CRITICAL: Test assertion failure function run_test_step, file kms_flip.c:709:
(kms_flip:7832) CRITICAL: Failed assertion: gettime_us() - start < 500
(kms_flip:7832) CRITICAL: Last errno: 25, Inappropriate ioctl for device
Subtest flip-vs-expired-vblank failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3874/shard-snb3/igt@kms_flip@flip-vs-expired-vblank.html
Comment 1 Lakshmi 2019-02-22 12:08:32 UTC
Platform is set as ALL as this failure occurs in ALL machines.
Comment 2 Martin Peres 2019-04-23 12:34:20 UTC
Also happening in shards, bumping the priority!
Comment 3 Mika Kahola 2019-06-03 12:51:34 UTC
It seems that we exceed the timeout when switching to test another pipe i.e. from pipe A to pipe B

flip-vs-expired-vblank-interruptible on pipe A, connector HDMI-A-1: PASSED

(kms_flip:7192) INFO: Beginning flip-vs-expired-vblank-interruptible on pipe B, connector HDMI-A-1
(kms_flip:7192) igt_fb-DEBUG: igt_create_fb_with_bo_size(width=1920, height=1080, format=XR24(0x34325258), modifier=0x0, size=0)
(kms_flip:7192) igt_fb-DEBUG: igt_create_fb_with_bo_size(handle=2, pitch=7680)
(kms_flip:7192) ioctl_wrappers-DEBUG: Test requirement passed: igt_has_fb_modifiers(fd)
(kms_flip:7192) igt_fb-DEBUG: igt_create_fb_with_bo_size(width=1920, height=1080, format=XR24(0x34325258), modifier=0x0, size=0)
(kms_flip:7192) igt_fb-DEBUG: igt_create_fb_with_bo_size(handle=3, pitch=7680)
(kms_flip:7192) ioctl_wrappers-DEBUG: Test requirement passed: igt_has_fb_modifiers(fd)
(kms_flip:7192) igt_fb-DEBUG: Test requirement passed: cairo_surface_status(fb->cairo_surface) == CAIRO_STATUS_SUCCESS
(kms_flip:7192) igt_fb-DEBUG: Test requirement passed: cairo_surface_status(fb->cairo_surface) == CAIRO_STATUS_SUCCESS
(kms_flip:7192) igt_kms-INFO:   1920x1080 60 1920 2008 2052 2200 1080 1083 1088 1125 0x9 0x48 148500
(kms_flip:7192) DEBUG: name = flip
last_ts = 0.000000
last_received_ts = 0.000000
last_seq = 0
current_ts = 1611.736816
current_received_ts = 1611.736206
current_seq = 65053
count = 0
seq_step = 1
(kms_flip:7192) DEBUG: name = flip

The last_ts is initialized as 0. We grab the current_ts which is within normal range. However, the difference between these two values exceed the test limit of 500.
Comment 4 Arek Hiler 2019-06-19 12:51:10 UTC
The test is making sure that blocking wait on a expired vblank (that has already
happened, so the current_seq == requested_seq) should be "immediate", i.e. taking
no longer than 500us.

This tests fails in all its variants, 2x, interruptible, non-interruptible.
500us seems like a plenty of time, but I have especially bad intuition when
it comes to time scales.

I would advise extending logging to that we igt_debug() the time it takes for a
successful and failed wait, so we can inspect the time scale, and confirm that
500us is good enough. Another culprit may be the method of getting the current
time and type casting done along the way.

Reading both the kernel handler for the IOCTL and libdrm wrapper I have no idea
where we could waste time other than some wonky OS scheduling / context
switching.

User impact: hard to tell, without more information on the failed waits, but
should be no more than "medium", assuming they are not taking forever.
Comment 5 emersion 2019-06-24 08:30:10 UTC
Posted a patch to improve logging: https://patchwork.freedesktop.org/patch/313695/
Comment 6 CI Bug Log 2019-06-25 11:26:44 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* all machines: igt@kms_flip@(2x-)flip-vs-expired-vblank(-interruptible) - Failed assertion: end - start &lt; 500
  (No new failures associated)
Comment 7 emersion 2019-06-25 11:30:44 UTC
commit 15ad664534413628f06c0f172aac11598bfdb895
Author: Simon Ser <simon.ser@intel.com>
Date:   Fri Jun 14 13:52:23 2019 +0300

    tests/kms_flip: improve logging when timing out
    
    Signed-off-by: Simon Ser <simon.ser@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105363
    Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Comment 8 emersion 2019-06-26 10:07:56 UTC
(kms_flip:1523) DEBUG: Vblank took 78us
(kms_flip:1523) DEBUG: Vblank took 225us
(kms_flip:1523) DEBUG: Vblank took 95us
(kms_flip:1523) DEBUG: Vblank took 202us
(kms_flip:1523) DEBUG: Vblank took 78us
(kms_flip:1523) DEBUG: Vblank took 77us
(kms_flip:1523) DEBUG: Vblank took 47us
(kms_flip:1523) DEBUG: Vblank took 78us
(kms_flip:1523) DEBUG: Vblank took 77us
(kms_flip:1523) DEBUG: Vblank took 80us
(kms_flip:1523) DEBUG: Vblank took 79us
(kms_flip:1523) DEBUG: Vblank took 78us
(kms_flip:1523) DEBUG: Vblank took 61us
(kms_flip:1523) DEBUG: Vblank took 78us
(kms_flip:1523) DEBUG: Vblank took 80us
(kms_flip:1523) DEBUG: Vblank took 525us
(kms_flip:1523) CRITICAL: Test assertion failure function run_test_step, file ../tests/kms_flip.c:698:
(kms_flip:1523) CRITICAL: Failed assertion: end - start < 500
Comment 9 Arek Hiler 2019-07-19 12:37:12 UTC
Few notes:

1. The message could be improved. It seems to suggest that this is interval between vblanks instead of the time it takes doing a wait on expired one

2. Going through recent logs the time is usually ~25us with occasional spikes to ~150us. And there are the outliers which we fail on:

(kms_flip:1816) DEBUG: Vblank took 1456us
(kms_flip:5124) DEBUG: Vblank took 1088us
(kms_flip:1373) DEBUG: Vblank took 501us
(kms_flip:2150) DEBUG: Vblank took 842us
(kms_flip:5198) DEBUG: Vblank took 3301us

Still no idea what may cause this. It seems to be too much of a lag even for very unfavourable scheduling. The only locking that happens on the call path is in drm_vblank_get() which takes on a few spinlocks. I believe it's worth investigating.

3ms of delay for 16ms frames (60fps) is a lot.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.