Bug 102254

Summary: [CI] igt@perf@oa-exponents fails
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/Intel Assignee: Marta Löfstedt <marta.lofstedt>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: hector.franciscox.velazquez.suriano, intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: HSW i915 features: Perf/OA

Description Martin Peres 2017-08-16 12:35:55 UTC
The test igt@perf@oa-exponents hits the following assert:

(perf:1635) CRITICAL: Test assertion failure function read_2_oa_reports, file perf.c:1201:
(perf:1635) CRITICAL: Failed assertion: !"reached"

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2968/shard-hsw5/igt@perf@oa-exponents.html
Comment 1 Lionel Landwerlin 2017-08-16 17:40:50 UTC
I have this series to help out with the flakiness of this test:
https://patchwork.freedesktop.org/series/28373/

Not landed yet, hopefully soon!
Comment 2 Elizabeth 2017-08-28 19:04:09 UTC
*** Bug 102421 has been marked as a duplicate of this bug. ***
Comment 3 Jari Tahvanainen 2017-09-04 08:08:40 UTC
Tested-by: series 28373 seems to improve the situation on my dev-skl-i5-6600k.

Without series (IGT-Version: 1.19-g5ce65a9a):
Subtest oa-exponents: FAIL (0,214s)
Subtest per-context-mode-unprivileged: FAIL (0,004s)
Subtest polling: FAIL (10,032s)
Subtest short-reads: FAIL (0,001s)
Subtest mi-rpc: FAIL (0,001s)
Subtest rc6-disable: FAIL (0,001s)
Subtest create-destroy-userspace-config: FAIL (0,003s)

With series 28373 applied on above:
Subtest i915-ref-count: SUCCESS (0,043s)
Subtest sysctl-defaults: SUCCESS (0,000s)
Subtest non-system-wide-paranoid: SUCCESS (0,015s)
Subtest invalid-open-flags: SUCCESS (0,000s)
Subtest invalid-oa-metric-set-id: SUCCESS (0,007s)
Subtest invalid-oa-format-id: SUCCESS (0,008s)
Subtest missing-sample-flags: SUCCESS (0,000s)
Subtest oa-formats: SUCCESS (0,073s)
Subtest invalid-oa-exponent: SUCCESS (0,007s)
Subtest low-oa-exponent-permissions: SUCCESS (0,015s)
Subtest oa-exponents: SUCCESS (15,035s)
Test requirement not met in function __real_main4515, file perf.c:4580:
Test requirement: IS_HASWELL(devid)
Subtest per-context-mode-unprivileged: SKIP (0,000s)
Subtest buffer-fill: SUCCESS (1,734s)
Subtest disabled-read-error: SUCCESS (0,037s)
Subtest non-sampling-read-error: SUCCESS (0,007s)
Subtest enable-disable: SUCCESS (1,730s)
Subtest blocking: SUCCESS (10,022s)
Subtest polling: SUCCESS (10,010s)
Subtest short-reads: SUCCESS (0,020s)
Subtest mi-rpc: SUCCESS (0,009s)
Test requirement not met in function __real_main4515, file perf.c:4608:
Test requirement: IS_HASWELL(devid)
Subtest unprivileged-single-ctx-counters: SKIP (0,000s)
Subtest gen8-unprivileged-single-ctx-counters: SUCCESS (0,027s)
Subtest rc6-disable: SUCCESS (1,510s)
Subtest invalid-create-userspace-config: SUCCESS (0,000s)
Subtest invalid-remove-userspace-config: SUCCESS (0,007s)
Subtest create-destroy-userspace-config: SUCCESS (0,022s)
Subtest whitelisted-registers-userspace-config: SUCCESS (0,000s)

For HSW, APL, KBL see shards results https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_132/shards.html
E.g. HSW has Test perf:
        Subgroup polling:
                fail       -> PASS       (shard-hsw) fdo#102252
        Subgroup oa-exponents:
                fail       -> PASS       (shard-hsw) fdo#102254
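
For comparing runs like the two pasted above, a small script can tally the per-subtest results. This is only a sketch: it assumes the line format shown in this comment ("Subtest NAME: STATUS (SECONDS)" with a decimal comma), and the names `parse_subtests`, `tally` and `SAMPLE` are made up for illustration; they are not part of IGT.

```python
import re
from collections import Counter

# Matches lines such as "Subtest oa-exponents: SUCCESS (15,035s)".
# Assumption: the format is exactly what was pasted in this report.
SUBTEST_RE = re.compile(
    r"^Subtest (?P<name>[\w-]+): (?P<status>[A-Z]+) \((?P<secs>\d+[.,]\d+)s\)$"
)


def parse_subtests(log_text):
    """Return a list of (name, status, seconds) tuples found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = SUBTEST_RE.match(line.strip())
        if match:
            # The pasted output uses a decimal comma, so normalise it first.
            seconds = float(match.group("secs").replace(",", "."))
            results.append((match.group("name"), match.group("status"), seconds))
    return results


def tally(results):
    """Count how many subtests ended in each status."""
    return Counter(status for _, status, _ in results)


# A few lines copied from the "with series" run above.
SAMPLE = """\
Subtest oa-exponents: SUCCESS (15,035s)
Test requirement not met in function __real_main4515, file perf.c:4580:
Test requirement: IS_HASWELL(devid)
Subtest per-context-mode-unprivileged: SKIP (0,000s)
Subtest polling: SUCCESS (10,010s)
"""

if __name__ == "__main__":
    print(tally(parse_subtests(SAMPLE)))
```

Lines that are not subtest results (such as the "Test requirement" messages) are simply ignored, so a full log can be fed in unfiltered.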
Comment 4 Jari Tahvanainen 2017-09-04 08:13:47 UTC
Changing component to IGT since fix identified in intel-gpu-tools git.
Comment 6 Marta Löfstedt 2017-10-05 11:57:40 UTC
The issue is not reproduced since:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3172 on HSW-shards
Comment 7 Marta Löfstedt 2017-10-09 11:08:33 UTC
Issue is reproduced on APL-shards:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3190/shard-apl3/igt@perf@oa-exponents.html
Comment 10 Lionel Landwerlin 2017-11-10 15:04:55 UTC
I've been thinking about this a bit.
Since https://patchwork.freedesktop.org/patch/180544/ seems to have fixed the problem on big cores, we probably have a power management issue on the atoms...
Comment 12 Marta Löfstedt 2018-01-05 11:18:16 UTC
I noted that this test also quite frequently fails like this:

(perf:1560) CRITICAL: Test assertion failure function test_oa_exponents, file perf.c:1922:
(perf:1560) CRITICAL: Failed assertion: n_reports == (sizeof(reports)/sizeof(reports[0]))
(perf:1560) CRITICAL: error: 10 != 30
Subtest oa-exponents failed.

for example:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3602/shard-apl5/igt@perf@oa-exponents.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3600/shard-glkb4/igt@perf@oa-exponents.html
Comment 13 Lionel Landwerlin 2018-02-15 18:50:11 UTC
I've got a rewrite of that test that seems a lot better than the current one:

https://patchwork.freedesktop.org/series/38372/

It completes within a fraction of the time (~1s vs ~10s) and seems a lot more reliable.
Comment 15 Lionel Landwerlin 2018-02-23 17:03:57 UTC
Are those tests manually stopped or is the machine hanging? : https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3827/shard-kbl2/pstore6-1519347265_Oops_1.log
Comment 16 Marta Löfstedt 2018-02-26 08:03:44 UTC
(In reply to Lionel Landwerlin from comment #15)
> Are those tests manually stopped or is the machine hanging? :
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3827/shard-kbl2/pstore6-1519347265_Oops_1.log

Lionel, owatch will trigger the kernel softdog after 370 seconds of "inactivity".
Comment 17 Marta Löfstedt 2018-02-26 08:05:06 UTC
Lionel, owatch is part of ezbench: https://cgit.freedesktop.org/ezbench
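
To illustrate the "inactivity" idea: a rough sketch of an inactivity timer is below. This is not owatch's code. The class name `InactivityWatch`, its methods, and the choice of `>=` at the boundary are all hypothetical; the only figure taken from this thread is the 370-second timeout.

```python
import time


class InactivityWatch:
    """Hypothetical inactivity timer; not the actual owatch implementation."""

    def __init__(self, timeout_s=370.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the timer can be tested without waiting
        self._last_activity = clock()

    def touch(self):
        """Record activity, e.g. whenever the watched test produces output."""
        self._last_activity = self._clock()

    def expired(self):
        """True once timeout_s seconds have passed with no recorded activity."""
        return self._clock() - self._last_activity >= self._timeout_s


if __name__ == "__main__":
    watch = InactivityWatch(timeout_s=370.0)
    watch.touch()
    print("expired:", watch.expired())
```

In a setup like this, a test that keeps running but produces no activity for the whole timeout would look the same as a hung machine, which is why the logs alone cannot tell the two apart.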
Comment 18 Lionel Landwerlin 2018-02-26 11:17:32 UTC
Thanks.

I'm also trying to understand why the enable-disable subtest fails with 0 reports from time to time.
I think this might be the same issue.
The current theory is that, after a context switch to the preempt context, the value stored in the OACONTROL register is messed up, so the OA unit doesn't output reports anymore.
Comment 19 Lionel Landwerlin 2018-03-01 16:01:31 UTC
With 

commit 41d3fdcd15d5ecf29cc73e8b79c2327ebb54b960
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Thu Mar 1 11:06:13 2018 +0000

    drm/i915/perf: fix perf stream opening lock

landed, I really hope this is finally fixed for good.
Comment 20 Marta Löfstedt 2018-03-02 07:22:33 UTC
The patch was integrated in CI_DRM_3860. So far, the last softdog was at https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3859/shard-apl8/igt@perf@oa-exponents.html

I will monitor this over the weekend.
