Bug 100989 - [BAT][BSW] igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b hung in CI
Summary: [BAT][BSW] igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b hung in CI
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest critical
Assignee: krisman
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-10 16:07 UTC by Martin Peres
Modified: 2017-07-17 08:45 UTC
CC List: 2 users

See Also:
i915 platform: BSW/CHT
i915 features: power/suspend-resume


Description Martin Peres 2017-05-10 16:07:23 UTC
On CI_DRM_2600, the machine fi-bsw-n3050 hung on igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.

This may be related to the thousands of "[drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x71450064" found in the logs.

Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2600/fi-bsw-n3050/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html
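
To gauge how many of these messages a log contains, a short script can count them per status code. This is only a minimal sketch and not part of the CI tooling: the function name `count_aux_timeouts` and the sample lines are hypothetical, and the regular expression assumes the message format quoted above.

```python
import re
from collections import Counter

# Matches the message format quoted in this report; other kernels may differ.
AUX_TIMEOUT_RE = re.compile(r"dp_aux_ch timeout status (0x[0-9a-fA-F]+)")


def count_aux_timeouts(lines):
    """Return a Counter mapping each timeout status code to its occurrence count."""
    counts = Counter()
    for line in lines:
        match = AUX_TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass the lines of a saved kernel log.
    sample = [
        "[drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x71450064",
        "[drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x71450064",
        "some unrelated kernel message",
    ]
    for status, count in count_aux_timeouts(sample).items():
        print(status, count)
```

Grouping by status code makes it easy to see whether the flood is one repeated failure or several distinct ones.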
Comment 1 krisman 2017-05-10 18:56:14 UTC
(In reply to Martin Peres from comment #0)
> On CI_DRM_2600, the machine fi-bsw-n3050 hung on
> igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.
> 
> This may be related to the thousands of "[drm:intel_dp_aux_ch [i915]]
> dp_aux_ch timeout status 0x71450064" found in the logs.
> 
> Full logs:
> https://intel-gfx-ci.01.org/CI/CI_DRM_2600/fi-bsw-n3050/
> igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html

The hpd task is completing OK as far as I understand. We can try the patch that I mentioned in Bug 100215 to see if it helps [1], but I don't think it will fix it. Did the system hard hang, or just the IGT task? Also, did it come back from suspend? I think I saw some hangs while reading CRC that I could no longer reproduce on SNB or SKL (not sure). I'll give it another try.

Thanks, 

[1] https://patchwork.freedesktop.org/patch/151486/
Comment 2 Martin Peres 2017-05-11 12:36:44 UTC
(In reply to krisman from comment #1)
> (In reply to Martin Peres from comment #0)
> > On CI_DRM_2600, the machine fi-bsw-n3050 hung on
> > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.
> > 
> > This may be related to the thousands of "[drm:intel_dp_aux_ch [i915]]
> > dp_aux_ch timeout status 0x71450064" found in the logs.
> > 
> > Full logs:
> > https://intel-gfx-ci.01.org/CI/CI_DRM_2600/fi-bsw-n3050/
> > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html
> 
> The hpd task is completing ok as far as I understand.  we can try the patch
> that I mentioned in Bug 100215 to see if it helps[1] but I don't think it
> will fix it. 

Our system already tested it and found that it fixes the issue: https://patchwork.freedesktop.org/series/23299/

But I guess you meant that we should apply it permanently. The only way to do this is to get it upstream :)

> Did the system hard hanged or just the IGT task? Also did it
> get back from the suspend?  

The system hard-hung, or at least could not be reached from the network after resume, and the controller could no longer read the results (https://intel-gfx-ci.01.org/CI/CI_DRM_2600/fi-bsw-n3050/igt.log).

> I think I saw some hangs while reading CRC that
> I couldn't reproduce anymore on SNB or SKLs (not sure), I'll give it another
> try.


Thanks for looking into it!
Comment 3 krisman 2017-05-11 12:48:19 UTC
(In reply to Martin Peres from comment #2)
> (In reply to krisman from comment #1)
> > (In reply to Martin Peres from comment #0)
> > > On CI_DRM_2600, the machine fi-bsw-n3050 hung on
> > > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.
> > > 
> > > This may be related to the thousands of "[drm:intel_dp_aux_ch [i915]]
> > > dp_aux_ch timeout status 0x71450064" found in the logs.
> > > 
> > > Full logs:
> > > https://intel-gfx-ci.01.org/CI/CI_DRM_2600/fi-bsw-n3050/
> > > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html
> > 
> > The hpd task is completing ok as far as I understand.  we can try the patch
> > that I mentioned in Bug 100215 to see if it helps[1] but I don't think it
> > will fix it. 
> 
> Our system already tested it and found it was fixing the issue:
> https://patchwork.freedesktop.org/series/23299/
> 
> But I guess you meant that we should apply it permanently. The only way to
> do this is to get it upstream :)

Thanks Martin!  I am working on a new version of that patch to go upstream.  I'll make myself the assignee for this one.
Comment 4 Jani Saarinen 2017-06-02 08:02:07 UTC
Last seen: 2017-05-10
Statistics: Failure rate 1/88 run(s) (1%).
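
For reference, the percentage above is a rounded value. A minimal sketch of the arithmetic, using a hypothetical helper name (`failure_rate_percent`) that is not part of any CI tool:

```python
def failure_rate_percent(failures, runs):
    """Return the failure rate as a percentage of total runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 100.0 * failures / runs


if __name__ == "__main__":
    # 1 failure in 88 runs, as reported above.
    print(f"{failure_rate_percent(1, 88):.2f}%")
```

One failure in 88 runs is slightly above 1%, which the statistics line rounds down.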
Comment 5 Jani Saarinen 2017-06-07 10:42:07 UTC
krisman, any ETA for patch to try?
Comment 6 krisman 2017-06-07 12:53:40 UTC
(In reply to Jani Saarinen from comment #5)
> krisman, any ETA for patch to try?

Last week I submitted a new version under the name "drm: i915: Don't try detecting sinks on ports already in use". It got feedback from Ville and will need more rework.

I wonder if this got resolved by Maarten's work for Bug 100215.
Comment 7 Jani Saarinen 2017-06-08 07:19:54 UTC
I think that IGT change was reverted?
Comment 8 Marta Löfstedt 2017-06-08 11:31:32 UTC
I am confused about this bug. According to:

https://intel-gfx-ci.01.org/CI/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html

igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b has been skipped for a very long time, and then all of a sudden the result is incomplete on CI_DRM_2699.
Comment 9 Petri Latvala 2017-06-08 11:45:24 UTC
The test is doing some suboptimal things. It first does the suspend/resume cycle and only then checks whether the tested pipe has valid connectors, so it suspends even when it is going to skip.

In the case of CI_DRM_2699, there were about 3 successful suspend/resumes before suspend-read-crc-pipe-b came along, where the DUT never recovered from the suspend. The point here is that it's the suspend/resume that jams, not this particular subtest.
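
The ordering issue can be illustrated with a small sketch that checks the precondition before paying for a suspend/resume cycle. This is not IGT code: the function `run_suspend_subtest` and its three callables are hypothetical stand-ins, used only to show the check-first ordering.

```python
def run_suspend_subtest(has_valid_connector, suspend_and_resume, read_crc):
    """Run a suspend subtest, skipping before the suspend when nothing can be tested.

    All three callables are hypothetical stand-ins, not IGT functions.
    """
    # Check the precondition first so that a skip costs no suspend/resume cycle.
    if not has_valid_connector():
        return "skip"
    suspend_and_resume()
    read_crc()
    return "pass"


if __name__ == "__main__":
    events = []
    result = run_suspend_subtest(
        has_valid_connector=lambda: False,
        suspend_and_resume=lambda: events.append("suspend"),
        read_crc=lambda: events.append("crc"),
    )
    print(result, events)
```

With the check first, a pipe without valid connectors is skipped without ever suspending the machine, so a skip cannot trigger a hang on resume.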
Comment 10 krisman 2017-06-20 19:24:24 UTC
(In reply to Petri Latvala from comment #9)
> The test is doing some suboptimal things. It first does the suspend/resume
> cycle, and then checks if the tested pipe has valid connectors, so it will
> suspend anyway even when it will skip.

For the record, I submitted a patch to address the issue with the test suspending before skipping. Petri already reviewed it and pushed it to igt.

31f71d62d5ff ("igt/kms_pipe_crc_basic: Skip test before system suspend")

> In the case of CI_DRM_2699, there were about 3 successful suspend/resumes
> before suspend-read-crc-pipe-b came along where the DUT never recovered from
> the suspend. The point being here that it's the suspend/resume that jams,
> not this particular subtest.

Agreed. The issue is more related to the recovery from suspend than to the read_crc itself. Just need to mention, though, that bsw-n3050 has connectors attached, as well as VGA, meaning that it's not a case where my patch will make igt skip.

We are also discussing on the list a way to reduce the overhead caused by the thousands of dp_aux_ch timeout messages below:

[drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x71450064
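
This thread does not say which approach was being discussed on the list. As an illustration only, one common way to cut the cost of a repeated message is rate limiting: emit a few messages per time window and count the rest. The sketch below is a generic example, not the i915 implementation; the class `RateLimitedLogger` and its parameters are hypothetical.

```python
import time


class RateLimitedLogger:
    """Emit at most `burst` messages per `interval` seconds and count the rest."""

    def __init__(self, interval=5.0, burst=10, clock=time.monotonic, emit=print):
        self.interval = interval
        self.burst = burst
        self.clock = clock
        self.emit = emit
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0

    def log(self, message):
        """Emit the message if the window allows it; return True when emitted."""
        now = self.clock()
        # Open a new window when none exists or the current one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            if self.suppressed:
                self.emit(f"{self.suppressed} messages suppressed")
            self.window_start = now
            self.emitted = 0
            self.suppressed = 0
        if self.emitted < self.burst:
            self.emitted += 1
            self.emit(message)
            return True
        self.suppressed += 1
        return False


if __name__ == "__main__":
    logger = RateLimitedLogger(interval=5.0, burst=2)
    for _ in range(1000):
        logger.log("dp_aux_ch timeout status 0x71450064")
```

The suppressed count is reported when the next window opens, so the log still shows that the flood happened without carrying every line.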
Comment 11 Martin Peres 2017-07-17 08:45:18 UTC
(In reply to krisman from comment #10)
> (In reply to Petri Latvala from comment #9)
> > The test is doing some suboptimal things. It first does the suspend/resume
> > cycle, and then checks if the tested pipe has valid connectors, so it will
> > suspend anyway even when it will skip.
> 
> For the record, I submitted a patch to address the issue with the test
> suspending before skipping.  Petri already reviewed and pushed to igt.
> 
> 31f71d62d5ff ("igt/kms_pipe_crc_basic: Skip test before system suspend")

Thanks for this! Reducing the noise and the execution time is a Yay from me :)

> 
> > In the case of CI_DRM_2699, there were about 3 successful suspend/resumes
> > before suspend-read-crc-pipe-b came along where the DUT never recovered from
> > the suspend. The point being here that it's the suspend/resume that jams,
> > not this particular subtest.
> 
> Agreed.  The issue is more related to the recovery of the suspend than the
> read_crc itself.  Just need to mention, though, that bsw-n3050 has
> connectors attached, as well as VGA, meaning that it's not a case where my
> patch will make igt skip.

Right, I will still close the bug as the issue is somewhere else than our driver...


> 
> We are also discussing on the list a way to reduce the overhead provoked by
> the thousands of dp_aux_ch timeout messages below:
> 
> [drm:intel_dp_aux_ch [i915]] dp_aux_ch timeout status 0x71450064

Right, any success on that?
Comment 12 Martin Peres 2017-07-17 08:45:53 UTC
Since the test is now skipping immediately, this is not a problem anymore for us.

