Bug 111245 - [CI][SHARDS] igt@perf_pmu@busy-hang-bcs0 - fail - Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(0) && (double)(val) >= (1.0 - (tolerance)) * (double)(0)
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: low normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-29 07:49 UTC by Lakshmi
Modified: 2019-08-08 21:07 UTC

See Also:
i915 platform: ICL
i915 features:


Description Lakshmi 2019-07-29 07:49:29 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6557/shard-iclb1/igt@perf_pmu@busy-hang-bcs0.html

Starting subtest: busy-hang-bcs0
(perf_pmu:1387) CRITICAL: Test assertion failure function single, file ../tests/perf_pmu.c:306:
(perf_pmu:1387) CRITICAL: Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(0) && (double)(val) >= (1.0 - (tolerance)) * (double)(0)
(perf_pmu:1387) CRITICAL: 'val' != '0' (8780.000000 not within +5.000000%/-5.000000% tolerance of 0.000000)
Subtest busy-hang-bcs0 failed.
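
Note on the assertion above: because the reference value is 0, both bounds of the expression collapse to 0, so the 5% tolerance has no effect and any non-zero sample fails the check. The following is a minimal C sketch of that arithmetic. It assumes a hypothetical helper, `within_epsilon`, that simply mirrors the expression printed in the log; it is not the IGT macro itself.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of IGT): mirrors the expression shown in
 * the failed assertion, with 'ref' standing in for the literal 0.
 */
bool within_epsilon(double val, double ref, double tolerance)
{
	/* With ref == 0.0 both bounds are 0.0, so only val == 0.0 passes. */
	return val <= (1.0 + tolerance) * ref &&
	       val >= (1.0 - tolerance) * ref;
}
```

With the logged values, `within_epsilon(8780.0, 0.0, 0.05)` is false, while `within_epsilon(0.0, 0.0, 0.05)` is true. In other words, the subtest expects the sampled value to be exactly 0.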
Comment 1 CI Bug Log 2019-07-29 07:50:06 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@perf_pmu@busy-hang-bcs0 - fail - Failed assertion: (double)(val) <= (1.0 + (tolerance)) * (double)(0) && (double)(val) >= (1.0 - (tolerance)) * (double)(0)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6557/shard-iclb1/igt@perf_pmu@busy-hang-bcs0.html
Comment 2 Lakshmi 2019-07-29 10:21:09 UTC
@Don, can you assess this bug and set the appropriate priority/severity?
Comment 3 Don Hiatt 2019-07-31 20:50:51 UTC
The IGT 'tests/perf_pmu.c' busy-hang subtest attempts to hang the GPU and then issues an igt_force_gpu_reset(). In this case, the assert is triggered because the GPU is not coming out of reset within the expected tolerance.

I'll try and figure out the priority/severity next.
Comment 4 Don Hiatt 2019-07-31 20:56:37 UTC

This seems to have failed only once, in CI_DRM_6557_full (4 days, 21 hours old) (http://gfx-ci.fi.intel.com/cibuglog-ng/runcfg/23470), and has been passing from CI_DRM_6563_full up to CI_DRM_6586_full (6 hours, 45 minutes old).

As this bug appears to be a one-off, I think we can set it to low priority and monitor.

http://gfx-ci.fi.intel.com/cibuglog-ng/results/all?query=test_name+%3D+%27igt%40perf_pmu%40busy-hang-bcs0%27+AND+machine_name+ICONTAINS+%27shard-iclb1%27
Comment 5 Chris Wilson 2019-08-08 20:59:59 UTC
If my guess is correct, this is due to not idling correctly after the hang, and so sampling the idle-barrier. So

commit c7302f204490f3eb4ef839bec228315bcd3ba43f (drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 8 21:27:58 2019 +0100

    drm/i915: Defer final intel_wakeref_put to process context
    
    As we need to acquire a mutex to serialise the final
    intel_wakeref_put, we need to ensure that we are in process context at
    that time. However, we want to allow operation on the intel_wakeref from
    inside timer and other hardirq context, which means that need to defer
    that final put to a workqueue.
    
    Inside the final wakeref puts, we are safe to operate in any context, as
    we are simply marking up the HW and state tracking for the potential
    sleep. It's only the serialisation with the potential sleeping getting
    that requires careful wait avoidance. This allows us to retain the
    immediate processing as before (we only need to sleep over the same
    races as the current mutex_lock).
    
    v2: Add a selftest to ensure we exercise the code while lockdep watches.
    v3: That test was extremely loud and complained about many things!
    v4: Not a whale!
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111295
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111245
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111256
    Fixes: 18398904ca9e ("drm/i915: Only recover active engines")
    Fixes: 51fbd8de87dc ("drm/i915/pmu: Atomically acquire the gt_pm wakeref")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190808202758.10453-1-chris@chris-wilson.co.uk

should help.
Comment 6 Don Hiatt 2019-08-08 21:07:53 UTC
Hey Chris,

I think that commit, as well as your 'v3-drm-i915-pmu-Use-GT-parked-for-estimating-RC6-while-asleep.patch', will also fix https://bugs.freedesktop.org/show_bug.cgi?id=110877, as you indicated in our email. I have finally started to understand the 'gt-parked' code, and 'perf_pmu --run-subtest rc6' is passing just fine with those changes. I'm still trying to understand this 'defer-final' change set.

Thanks!

don

