Bug 105106

Summary: [CI] igt@perf_pmu@busy*-vcs0 - Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 - (tolerance)) * (double)(ts[1] - ts[0]) (perf_pmu:1787) CRITICAL: 'val' != 'ts[1] - ts[0]' (455000000.000000 not
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Francesco Balestrieri <francesco.balestrieri>
Status: RESOLVED MOVED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: BYT, G45, HSW, I915G, IVB, SNB i915 features: Perf/PMU

Description Marta Löfstedt 2018-02-15 06:43:05 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3773/shard-snb4/igt@perf_pmu@busy-start-vcs0.html

(perf_pmu:1787) CRITICAL: Test assertion failure function busy_start, file perf_pmu.c:246:
(perf_pmu:1787) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 - (tolerance)) * (double)(ts[1] - ts[0])
(perf_pmu:1787) CRITICAL: 'val' != 'ts[1] - ts[0]' (455000000.000000 not within +5.000000%/-5.000000% tolerance of 506593737.000000)
Subtest busy-start-vcs0 failed.
Comment 1 Chris Wilson 2018-02-15 07:27:22 UTC
What's telling about this is the

(perf_pmu:1787) DEBUG: slept=506588707 perf=506593737
(perf_pmu:1787) CRITICAL: 'val' != 'ts[1] - ts[0]' (455000000.000000 not within +5.000000%/-5.000000% tolerance of 506593737.000000)

so it's not just a timing issue of how long we slept as we had hoped.

Plan B was to wait for the spin batch to start.
Comment 2 Marta Löfstedt 2018-02-28 06:48:03 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4307/shard-snb5/igt@perf_pmu@busy-vcs0.html

(perf_pmu:3148) CRITICAL: Test assertion failure function single, file perf_pmu.c:211:
(perf_pmu:3148) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(flags & (1) ? slept : 0.f) && (double)(val) >= (1.0 - (tolerance)) * (double)(flags & (1) ? slept : 0.f)
(perf_pmu:3148) CRITICAL: 'val' != 'flags & (1) ? slept : 0.f' (410000000.000000 not within +5.000000%/-5.000000% tolerance of 500072256.000000)
Subtest busy-vcs0 failed.
Comment 3 Marta Löfstedt 2018-02-28 06:49:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3841/shard-snb1/igt@perf_pmu@busy-check-all-vcs0.html

(perf_pmu:1619) CRITICAL: Test assertion failure function busy_check_all, file perf_pmu.c:380:
(perf_pmu:1619) CRITICAL: Failed assertion: (double)(val[busy_idx]) <= (1.0 + (tolerance)) * (double)(slept) && (double)(val[busy_idx]) >= (1.0 - (tolerance)) * (double)(slept)
(perf_pmu:1619) CRITICAL: 'val[busy_idx]' != 'slept' (475000000.000000 not within +5.000000%/-5.000000% tolerance of 510344059.000000)
Subtest busy-check-all-vcs0 failed.

Not sure this is considered the same issue.
Comment 4 Marta Löfstedt 2018-03-16 12:55:20 UTC
this was heavily hit on IVB on the drmtip_1 and drmtip_2 runs with shards testlist on BAT machines. 

I am not going to spam the bug with hits but here are some examples:

check cibuglog history and or respecive 
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_1/fi-ivb-3520m/igt@perf_pmu@render-node-busy-vcs0.html
(perf_pmu:2153) CRITICAL: Test assertion failure function single, file ../tests/perf_pmu.c:213:
(perf_pmu:2153) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(flags & (1) ? slept : 0.f) && (double)(val) >= (1.0 - (tolerance)) * (double)(flags & (1) ? slept : 0.f)
(perf_pmu:2153) CRITICAL: 'val' != 'flags & (1) ? slept : 0.f' (240000000.000000 not within +5.000000%/-5.000000% tolerance of 504372512.000000)
Subtest render-node-busy-vcs0 failed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_2/fi-ivb-3770/igt@perf_pmu@render-node-busy-vcs0.html
(perf_pmu:1993) CRITICAL: Test assertion failure function single, file ../tests/perf_pmu.c:213:
(perf_pmu:1993) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(flags & (1) ? slept : 0.f) && (double)(val) >= (1.0 - (tolerance)) * (double)(flags & (1) ? slept : 0.f)
(perf_pmu:1993) CRITICAL: 'val' != 'flags & (1) ? slept : 0.f' (145000000.000000 not within +5.000000%/-5.000000% tolerance of 509359744.000000)
Subtest render-node-busy-vcs0 failed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_1/fi-ivb-3770/igt@perf_pmu@busy-vcs0.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_2/fi-ivb-3520m/igt@perf_pmu@busy-vcs0.html
Comment 5 Tvrtko Ursulin 2018-04-03 10:39:18 UTC
Should be improved with:

commit a9741da52ad1963f7632ef1e852cbe1c3bcc601e (u-180326)
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Thu Mar 22 17:24:17 2018 +0000

    tests/perf_pmu: Improve accuracy by waiting on spinner to start
Comment 6 Martin Peres 2018-04-16 14:01:36 UTC
Was seen on drmtip_9 on blb hardware: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_9/fi-blb-e6850/igt@perf_pmu@cpu-hotplug.html

(perf_pmu:1378) CRITICAL: Test assertion failure function cpu_hotplug, file ../tests/perf_pmu.c:1166:
(perf_pmu:1378) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 - (tolerance)) * (double)(ts[1] - ts[0])
(perf_pmu:1378) CRITICAL: 'val' != 'ts[1] - ts[0]' (45000000.000000 not within +5.000000%/-5.000000% tolerance of 48568116.000000)
Subtest cpu-hotplug failed.
Comment 7 Chris Wilson 2018-05-11 18:39:47 UTC
Wait, there's more!

commit d502f055ac4500cada758876a512ac4f14b34851
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Wed Apr 4 10:51:52 2018 +0100

    tests/perf_pmu: Avoid RT thread for accuracy test
    
    Realtime scheduling interferes with execlists submission (tasklet) so try
    to simplify the PWM loop in a few ways:
    
     * Drop RT.
     * Longer batches for smaller systematic error.
     * More truthful test duration calculation.
     * Less clock queries.
     * No self-adjust - instead just report the achieved cycle and let the
       parent check against it.
     * Report absolute cycle error.
    
    v2:
     * Bring back self-adjust. (Chris Wilson)
       (But slightly fixed version with no overflow.)
    
    v3:
     * Log average and mean calibration for each pass.
    
    v4:
     * Eliminate development leftovers.
     * Fix variance logging.
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Presumptively closing.
Comment 8 Chris Wilson 2018-05-11 18:40:54 UTC
(In reply to Martin Peres from comment #6)
> Was seen on drmtip_9 on blb hardware:
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_9/fi-blb-e6850/
> igt@perf_pmu@cpu-hotplug.html
> 
> (perf_pmu:1378) CRITICAL: Test assertion failure function cpu_hotplug, file
> ../tests/perf_pmu.c:1166:
> (perf_pmu:1378) CRITICAL: Failed assertion: (double)(val) <= (1.0 +
> (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 -
> (tolerance)) * (double)(ts[1] - ts[0])
> (perf_pmu:1378) CRITICAL: 'val' != 'ts[1] - ts[0]' (45000000.000000 not
> within +5.000000%/-5.000000% tolerance of 48568116.000000)
> Subtest cpu-hotplug failed.

Please be really careful not to mix up subtests inside perf_pmu, there are quite a few distinct issues to worry about.
Comment 10 Martin Peres 2018-05-28 14:10:25 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_39/fi-ivb-3520m/igt@perf_pmu@multi-client-vcs0.html

(In reply to Chris Wilson from comment #8)
> (In reply to Martin Peres from comment #6)
> > Was seen on drmtip_9 on blb hardware:
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_9/fi-blb-e6850/
> > igt@perf_pmu@cpu-hotplug.html
> > 
> > (perf_pmu:1378) CRITICAL: Test assertion failure function cpu_hotplug, file
> > ../tests/perf_pmu.c:1166:
> > (perf_pmu:1378) CRITICAL: Failed assertion: (double)(val) <= (1.0 +
> > (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 -
> > (tolerance)) * (double)(ts[1] - ts[0])
> > (perf_pmu:1378) CRITICAL: 'val' != 'ts[1] - ts[0]' (45000000.000000 not
> > within +5.000000%/-5.000000% tolerance of 48568116.000000)
> > Subtest cpu-hotplug failed.
> 
> Please be really careful not to mix up subtests inside perf_pmu, there are
> quite a few distinct issues to worry about.

We still see a lot of issues. Should I create a new bug?
Comment 11 Martin Peres 2018-06-14 14:46:17 UTC
Re-opening since it is still a problem in almost all drmtip runs...
Comment 12 Chris Wilson 2018-09-06 08:30:41 UTC
Another try. The accuracy ones should be about done; pm-runtime measurements seem to be the hold out.
Comment 13 Martin Peres 2018-09-07 17:40:19 UTC
(In reply to Chris Wilson from comment #12)
> Another try. The accuracy ones should be about done; pm-runtime measurements
> seem to be the hold out.

The only test failing is https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_105/fi-elk-e7500/igt@perf_pmu@cpu-hotplug.html, on a wide range of machines:

(perf_pmu:1285) CRITICAL: Test assertion failure function cpu_hotplug, file ../tests/perf_pmu.c:1157:
(perf_pmu:1285) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 - (tolerance)) * (double)(ts[1] - ts[0])
(perf_pmu:1285) CRITICAL: 'val' != 'ts[1] - ts[0]' (35280953.000000 not within +5.000000%/-5.000000% tolerance of 37769634.000000)
Subtest cpu-hotplug failed.
Comment 14 Chris Wilson 2018-09-07 17:50:12 UTC
(In reply to Martin Peres from comment #13)
> (In reply to Chris Wilson from comment #12)
> > Another try. The accuracy ones should be about done; pm-runtime measurements
> > seem to be the hold out.
> 
> The only test failing is
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_105/fi-elk-e7500/
> igt@perf_pmu@cpu-hotplug.html, on a wide range of machines:
> 
> (perf_pmu:1285) CRITICAL: Test assertion failure function cpu_hotplug, file
> ../tests/perf_pmu.c:1157:
> (perf_pmu:1285) CRITICAL: Failed assertion: (double)(val) <= (1.0 +
> (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 -
> (tolerance)) * (double)(ts[1] - ts[0])
> (perf_pmu:1285) CRITICAL: 'val' != 'ts[1] - ts[0]' (35280953.000000 not
> within +5.000000%/-5.000000% tolerance of 37769634.000000)
> Subtest cpu-hotplug failed.

3.5 vs 3.77. Close enough that we should just call that victory and be done with it. Let's hope there aren't any worse fails.
Comment 15 Martin Peres 2018-10-16 07:58:58 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_128/fi-apl-guc/igt@perf_pmu@cpu-hotplug.html

Starting subtest: cpu-hotplug
(perf_pmu:918) CRITICAL: Test assertion failure function cpu_hotplug, file ../tests/perf_pmu.c:1157:
(perf_pmu:918) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) && (double)(val) >= (1.0 - (tolerance)) * (double)(ts[1] - ts[0])
(perf_pmu:918) CRITICAL: 'val' != 'ts[1] - ts[0]' (70394268.000000 not within +5.000000%/-5.000000% tolerance of 66944504.000000)
Subtest cpu-hotplug failed.
Comment 16 Francesco Balestrieri 2018-10-23 07:06:49 UTC
> 3.5 vs 3.77. Close enough that we should just call that victory and be done with it. > Let's hope there aren't any worse fails.

So can we call this a victory? Martin reported another failure above, but the values aren't that far off either (I think).
Comment 17 Francesco Balestrieri 2018-11-27 13:07:07 UTC
Ping. I'm inclined to close this if there are no objections.
Comment 18 Francesco Balestrieri 2018-11-28 07:01:06 UTC
Or if this is really OK, maybe change the tolerance in the testcase so that it's not reported as an error?
Comment 19 CI Bug Log 2019-09-23 12:26:27 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CTG BWR IVB BLB ELK ILK BYT SNB HSW APL: igt@perf_pmu@.* - fail - Failed assertion: (double)(val) &lt;= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) &amp;&amp; ... -}
{+ CTG ELK BWR IVB BLB ELK ILK BYT SNB HSW APL: igt@perf_pmu@.* - fail - Failed assertion: (double)(val) &lt;= (1.0 + (tolerance)) * (double)(ts[1] - ts[0]) &amp;&amp; ... +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_374/fi-elk-e7500/igt@perf_pmu@cpu-hotplug.html
Comment 20 CI Bug Log 2019-11-19 14:50:47 UTC
A CI Bug Log filter associated to this bug has been updated:

{- snb ivb: igt@perf_pmu@busy(-idle)-check-all-vcs[01] - Failed assertion: (double)(val[busy_idx]) &lt;= (1.0 + (tolerance)) * (double)(slept) &amp;&amp;... -}
{+ snb ivb: igt@perf_pmu@busy(-idle)-check-all-vcs[01] - Failed assertion: (double)(val[busy_idx]) &lt;= (1.0 + (tolerance)) * (double)(slept) &amp;&amp;... +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_404/fi-bsw-kefka/igt@perf_pmu@busy-idle-check-all-vcs0.html
Comment 21 Martin Peres 2019-11-29 17:40:26 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/75.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.