Bug 100367 - [BAT][SKL] igt@kms_pipe_crc_basic@suspend-read-crc-pipe-[abc] fails on CI
Summary: [BAT][SKL] igt@kms_pipe_crc_basic@suspend-read-crc-pipe-[abc] fails on CI
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high critical
Assignee: Mika Kahola
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-24 09:18 UTC by Martin Peres
Modified: 2017-10-23 10:43 UTC
CC List: 2 users

See Also:
i915 platform: SKL
i915 features: power/suspend-resume


Attachments
spinlock shuffle (739 bytes, patch)
2017-04-20 13:18 UTC, Mika Kahola
Reset GPU before running test (735 bytes, patch)
2017-05-18 13:17 UTC, Mika Kahola

Description Martin Peres 2017-03-24 09:18:08 UTC
The machine fi-skl-6700k failed the test igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a on CI_DRM_2387. The failure has not reproduced in the 7 runs since.

The main relevant part:
(kms_pipe_crc_basic:10156) igt-debugfs-CRITICAL: Test assertion failure function igt_assert_crc_equal, file igt_debugfs.c:312:
(kms_pipe_crc_basic:10156) igt-debugfs-CRITICAL: Failed assertion: a->crc[i] == b->crc[i]
(kms_pipe_crc_basic:10156) igt-debugfs-CRITICAL: Last errno: 9, Bad file descriptor
(kms_pipe_crc_basic:10156) igt-debugfs-CRITICAL: error: 0xbed119d0 != 0x521eeb85

Here are all the logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2387/fi-skl-6700k/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
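
For readers unfamiliar with the check, the failed assertion compares two CRC readings word by word. A minimal Python sketch of that comparison follows, using the two values from the log above. The helper name `first_crc_mismatch` and the list-of-integers representation are assumptions made for illustration; this is not the IGT implementation.

```python
def first_crc_mismatch(crc_a, crc_b):
    """Return (index, a_word, b_word) for the first differing CRC word, or None.

    Hypothetical helper: CRCs are modelled as equal-length lists of integers.
    """
    if len(crc_a) != len(crc_b):
        raise ValueError("CRCs must have the same number of words")
    for i, (word_a, word_b) in enumerate(zip(crc_a, crc_b)):
        if word_a != word_b:
            return i, word_a, word_b
    return None


# The two values reported in the log above; one word is enough to show the mismatch.
crc_a = [0xbed119d0]
crc_b = [0x521eeb85]
mismatch = first_crc_mismatch(crc_a, crc_b)
if mismatch is not None:
    i, word_a, word_b = mismatch
    print(f"word {i}: {word_a:#010x} != {word_b:#010x}")
```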
Comment 1 Martin Peres 2017-03-24 09:18:54 UTC
Setting the platform and elevating the priority because it involves our CI.
Comment 2 Jani Saarinen 2017-04-06 06:49:45 UTC
Statistics: Failure rate 3/94 run(s) (3%)
Comment 3 Jani Saarinen 2017-04-20 06:29:49 UTC
Also seen on a Patchwork (PW) run, on the pipe-c subtest:
https://intel-gfx-ci.01.org/CI/Patchwork_4521/fi-skl-6700k/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html

Marked this test for CI as well.
Comment 4 Jani Saarinen 2017-04-20 06:53:31 UTC
For pipe A: Statistics: Failure rate 3/153 run(s) (1%)
Comment 5 Mika Kahola 2017-04-20 13:18:08 UTC
Created attachment 130942
spinlock shuffle

This may be a bit of a long shot, but maybe give this patch a go; it shuffles the spinlock acquire/release around. The failure rate is quite low, so we need to give this a long run.
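
To put a rough number on "a long run": if each run is assumed to fail independently with probability p, the chance of n consecutive passes while the bug is still present is (1 - p)^n. The Python sketch below estimates how many clean runs are needed before that chance drops below a chosen threshold. `runs_needed` is a hypothetical helper, and the independence assumption is only an assumption, not something established in this report.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest n with (1 - failure_rate)**n <= 1 - confidence.

    Assumes every run fails independently with the same probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Failure rate for pipe A reported in comment 4: 3 failures in 153 runs.
print(runs_needed(3 / 153))
```

Under the same assumption, a failure rate around 1% needs roughly 300 clean runs for 95% confidence, so a handful of passing CI runs says little either way.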
Comment 6 krisman 2017-04-24 14:20:12 UTC
(In reply to Mika Kahola from comment #5)
> Created attachment 130942
> spinlock shuffle
> 
> This may be a bit of a long shot, but maybe give this patch a go; it
> shuffles the spinlock acquire/release around. The failure rate is quite
> low, so we need to give this a long run.

copy_to_user() may sleep, so you can't hold the spinlock across it. From a quick look, I'm not sure it will help for testing either: I think this patch will deadlock the interrupt handler when it adds a new CRC entry.
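
A common way around this constraint is to take the entry out of the shared buffer while holding the lock, drop the lock, and only then do the copy that may block. A userspace Python analogy of that pattern is sketched below. It is not the i915 code: `CrcQueue` and its methods are hypothetical names, and `threading.Lock` merely stands in for the kernel spinlock.

```python
import threading
from collections import deque


class CrcQueue:
    """Toy stand-in for a CRC entry buffer shared by a producer and a reader."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the kernel spinlock
        self._entries = deque()

    def push(self, crc):
        # Producer side (the role the interrupt handler plays): short critical section.
        with self._lock:
            self._entries.append(crc)

    def read_into(self, sink):
        # Take one entry while holding the lock...
        with self._lock:
            if not self._entries:
                return False
            entry = self._entries.popleft()
        # ...and do the possibly slow copy only after the lock is released,
        # so the producer is never blocked behind it.
        sink(entry)
        return True


queue = CrcQueue()
queue.push(0x521eeb85)
received = []
queue.read_into(received.append)
print([hex(crc) for crc in received])
```

The point of the split is that the lock is held only for the dequeue, never while the reader is busy copying.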
Comment 7 Ricardo 2017-05-09 16:50:52 UTC
Adding the ReadyForDev tag to the "Whiteboard" field.
The bug is still active.
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 8 Mika Kahola 2017-05-18 13:17:02 UTC
Created attachment 131406
Reset GPU before running test

This patch is related to another bug but could be tested against this one as well. Because this bug occurs relatively rarely (<3%), one possible explanation is that the GPU is occasionally left in some weird state. Therefore, the patch resets the GPU before entering the subtests. Let's see what happens in CI with this patch applied.
Comment 9 Mika Kahola 2017-05-18 13:21:11 UTC
I forgot to mention that, running this test alone a couple of hundred times, I wasn't able to trigger the reported behavior.
Comment 10 Martin Peres 2017-05-19 10:01:39 UTC
(In reply to Mika Kahola from comment #8)
> Created attachment 131406
> Reset GPU before running test
> 
> This patch is related to another bug but could be tested against this one
> as well. Because this bug occurs relatively rarely (<3%), one possible
> explanation is that the GPU is occasionally left in some weird state.
> Therefore, the patch resets the GPU before entering the subtests. Let's
> see what happens in CI with this patch applied.

Isn't that cheating? Why not reset the gpu before suspend then?

In other news, today, we hit the same bug on pipe B too: https://intel-gfx-ci.01.org/CI/CI_DRM_2634/fi-skl-6700k/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html
Comment 11 Mika Kahola 2017-06-06 10:52:51 UTC
Well, I think we could start the test "fresh" and hence the reset before running the tests.
Comment 12 Jani Saarinen 2017-06-07 10:33:38 UTC
These are really hard to reproduce. We may just need to wait for a few more runs and close this if it does not reproduce.
Comment 13 Mika Kahola 2017-06-14 11:46:46 UTC
Unable to replicate the issue and the issue hasn't surfaced on CI runs either.
Comment 14 Martin Peres 2017-07-04 08:19:22 UTC
(In reply to Mika Kahola from comment #13)
> Unable to replicate the issue and the issue hasn't surfaced on CI runs
> either.

Guess who's back? https://intel-gfx-ci.01.org/CI/CI_DRM_2788/fi-skl-6700k/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
Comment 15 Mika Kahola 2017-07-04 08:30:44 UTC
oh, that surfaced again. Not so cool.
Comment 16 Jani Saarinen 2017-09-14 07:26:36 UTC
Test: affected machines (last seen on)
igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c: fi-skl-6700k, last seen on CI_DRM_2998 (2017-08-24, 87 runs ago) with result 'fail'; failure rate of 8 / 795 runs (1 %)
igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b: fi-skl-6700k, last seen on CI_DRM_2987 (2017-08-22, 98 runs ago) with result 'fail'; failure rate of 2 / 795 runs (0 %)
igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a: fi-skl-6700k, last seen on CI_DRM_2788 (2017-06-30, 273 runs ago) with result 'fail'; failure rate of 5 / 795 runs (1 %)
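
The percentages in the table are whole numbers, which hides the difference between the three subtests. The short Python sketch below recomputes the rates from the counts listed above to two decimals. `failure_rate_percent` is a hypothetical helper, not part of the CI tooling.

```python
def failure_rate_percent(failures, runs):
    """Failure rate as a percentage, rounded to two decimals."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return round(100.0 * failures / runs, 2)


# (failures, runs) per subtest, copied from the table above.
counts = {
    "suspend-read-crc-pipe-c": (8, 795),
    "suspend-read-crc-pipe-b": (2, 795),
    "suspend-read-crc-pipe-a": (5, 795),
}
for subtest, (failures, runs) in counts.items():
    print(f"{subtest}: {failure_rate_percent(failures, runs)} %")
```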

https://intel-gfx-ci.01.org/cibuglog/index.html%3Faction_failures_history=92.html

Really sporadic. Dropping priority.

