108743 – [CI][DRMTIP] igt@*suspend* - incomplete

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	sujaritha.sundaresan
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-11-14 14:48 UTC by Martin Peres
Modified:	2019-11-29 17:59 UTC
CC List:	4 users

See Also:
i915 platform:	ICL, KBL, SKL
i915 features:

Attachments
dmesg from serial console (103.50 KB, text/plain) 2019-09-27 23:28 UTC, Don Hiatt

Description Martin Peres 2018-11-14 14:48:05 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_140/fi-kbl-guc/igt@drv_suspend@forcewake.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-skl-guc/igt@drv_suspend@shrink.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-skl-guc/igt@drv_suspend@debugfs-reader.html

Comment 1 CI Bug Log 2019-01-31 11:04:11 UTC

A CI Bug Log filter associated to this bug has been updated:

{- GUC: igt@*(suspend|s3)* - incomplete -}
{+ GUC: igt@*(suspend|s3)* - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_191/fi-kbl-guc/igt@i915_suspend@forcewake.html
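The filter above mixes a shell-style `*` wildcard with a regex-style `(suspend|s3)` alternation. The following is a minimal sketch of how such a pattern could be matched against test names. It assumes that `*` means "any run of characters" and that the parenthesised group is a plain alternation; the actual matching rules of CI Bug Log are not shown in this thread, and `filter_to_regex` and `matches` are hypothetical helper names.

```python
import re


def filter_to_regex(pattern: str) -> "re.Pattern":
    """Translate a filter such as 'igt@*(suspend|s3)*' into a compiled regex.

    Assumption (not confirmed by the thread): '*' is a wildcard, the
    characters '(', '|' and ')' keep their regex meaning, and every other
    character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch in "(|)":
            parts.append(ch)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


suspend_filter = filter_to_regex("igt@*(suspend|s3)*")


def matches(test_name: str) -> bool:
    """Return True if the whole test name is covered by the filter."""
    return suspend_filter.fullmatch(test_name) is not None


# Names taken from this bug; the last one is only an illustrative non-match.
for name in (
    "igt@i915_suspend@forcewake",
    "igt@gem_ctx_isolation@vcs0-s3",
    "igt@gem_exec_basic@basic",
):
    print(name, matches(name))
```

Under these assumptions the filter is broad: any test whose name contains `suspend` or `s3` anywhere after `igt@` is caught, regardless of which driver area it exercises.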

Comment 2 Jon Ewins 2019-02-15 23:59:15 UTC

Further investigation will be deferred until after the upcoming update to the GuC version.

Comment 3 Daniele Ceraolo Spurio 2019-07-26 22:43:55 UTC

This is not happening with the new FW, so closing.

Comment 4 Martin Peres 2019-08-21 13:29:20 UTC

(In reply to Daniele Ceraolo Spurio from comment #3)
> This is not happening with the new FW, so closing.

It has happened 9 times in the last 14 drmtip runs, and has never gone unseen for more than 4 runs in a row. This means you did not follow the 10x shown in the bug assessment process. Please follow all the steps carefully and do not skip directly to closing the issue.
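The numbers quoted here (hits in the last N runs, longest streak without a hit) can be derived mechanically from a per-run pass/fail history. The sketch below is only an illustration: the bug assessment process is not quoted in this thread, so the closing threshold is left as a caller-supplied parameter, and `example_history` is made-up data chosen to be consistent with "9 of 14 runs, never more than 4 clean runs in a row".

```python
def summarize_runs(history):
    """Summarize a run history.

    history: list of bools, oldest run first; True means the failure was seen.
    Returns (hits, longest_clean_streak, trailing_clean_streak).
    """
    hits = sum(1 for seen in history if seen)

    longest_clean = 0
    current = 0
    for seen in history:
        current = 0 if seen else current + 1
        longest_clean = max(longest_clean, current)

    # Count the clean runs since the most recent failure.
    trailing_clean = 0
    for seen in reversed(history):
        if seen:
            break
        trailing_clean += 1

    return hits, longest_clean, trailing_clean


def can_close(history, required_clean_runs):
    """True only if the most recent clean streak reaches the given threshold.

    required_clean_runs is an assumption supplied by the caller; the thread
    does not state the exact rule.
    """
    _, _, trailing_clean = summarize_runs(history)
    return trailing_clean >= required_clean_runs


# Hypothetical 14-run history (not real CI data): 9 hits, longest gap of 4.
T, F = True, False
example_history = [T, T, F, F, F, F, T, T, T, T, T, F, T, T]
print(summarize_runs(example_history))
```

With a history like this, the failure was seen in the most recent run, so no threshold greater than zero would allow closing.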

Comment 5 sujaritha.sundaresan 2019-09-11 15:49:38 UTC

I'm currently following this bug. In the last look through the CI results I can see that this is still occurring but I haven't been able to identify the exact issue yet.

Comment 6 sujaritha.sundaresan 2019-09-11 20:01:13 UTC

I have been able to reproduce this bug on an ICL with the gem_ctx_isolation@vcs0-s3 and i915_suspend@forcewake tests. On these runs I completely lose the DUT after the failed test run. The next step will be to get some serial logs for this.

Comment 7 sujaritha.sundaresan 2019-09-16 22:30:10 UTC

This issue is seen repeatedly on the following five tests: kms_vblank@pipe-c/b-continuation-suspend, gem_workarounds@suspend-resume-context, gem_ctx_isolation@vcs0-s3, i915_suspend@forcewake. For all of these tests, I can also see the issue happening locally without guc.

Comment 8 Jon Ewins 2019-09-16 22:48:18 UTC

As the issue has been reproduced on SKL and ICL both with and without GuC, I am changing the i915 feature selection from firmware/guc to power/suspend/resume.

Comment 9 Jon Ewins 2019-09-16 23:19:36 UTC

Local test result confirmed, but the CI evidence of being seen only on our -guc machines is compelling.  Issue on kms tests might be a new regression.  Re-adding firmware/guc to i915/feature.

Comment 10 Don Hiatt 2019-09-27 23:17:55 UTC

Suja and I have been working on trying to duplicate this.

On ICL, the i915_suspend test just appears to hang (see below)

gta@ubt-18:~/ril-src/igt-gpu-tools$ sudo ./build/tests/i915_suspend
IGT-Version: 1.24-g5a6c6856 (x86_64) (Linux: 5.3.0+ x86_64)
Starting subtest: fence-restore-tiled2untiled
[cmd] rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:17 2019
checking the first canary object
checking the second canary object
Subtest fence-restore-tiled2untiled: SUCCESS (7.957s)
Starting subtest: fence-restore-untiled
[cmd] rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:39 2019
checking the first canary object
checking the second canary object

Subtest fence-restore-untiled: SUCCESS (6.978s)
Starting subtest: debugfs-reader
[cmd] rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:43:02 2019 <------- seems to hang here?

However, with a serial port connected it turns out that the dut does not die after all: we still have an interactive console and can see kernel messages.
It seems that the netdev isn't waking up, which is why the test appears to hang and you can't ssh into the dut again.

Also, looking at the running processes, the test still appears to be running.

Lastly, we're seeing "PM: Cannot get swap device, try swapon -a" and
"PM: Cannot get swap writer" on the console. I'm wondering if the test is trying to hibernate and is expecting swap space?

I have the console going and it looks like the machine is not really dead.
The serial port is still interactive but the network appears dead, that is why you don’t see any output on your terminal, nor
can you ssh into the dut.

From the serial console, the test is still running.

The error on the serial console seems to imply it is expecting the machine to have a swap space enabled. Perhaps that is
the reason the test just appears to hang. We now know the device does come out of suspend, only that the network isn’t
restarted.

Comment 11 Don Hiatt 2019-09-27 23:20:43 UTC

(In reply to Don Hiatt from comment #10)


> I have the console going and it looks like the machine is not really dead.
> The serial port is still interactive but the network appears dead, that is
> why you don’t see any output on your terminal, nor
> can you ssh into the dut.
> 
> From the serial console, the test is still running.
> 
> The error on the serial console seems to imply it is expecting the machine
> to have a swap space enabled. Perhaps that is
> the reason the test just appears to hang. We now know the device does come
> out of suspend, only that the network isn’t
> restarted.

Sorry, this was a cut and paste repeat of what I was saying.

Comment 12 Don Hiatt 2019-09-27 23:28:26 UTC

Created attachment 145560
dmesg from serial console

Comment 13 Don Hiatt 2019-09-27 23:40:58 UTC

After enabling swap on the dut, the tests are passing.
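Since the "PM: Cannot get swap device" messages only appear when no swap is active, a quick pre-flight check on the dut can rule this out before running the suspend tests. The sketch below assumes the standard Linux `/proc/swaps` layout (one header line followed by one line per active swap area); `active_swap_devices` and `read_active_swap` are hypothetical helper names, and the `/dev/sda2` entry in the sample is made up for illustration.

```python
from typing import List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the names of active swap areas from /proc/swaps content.

    The first line is the column header; every following non-empty line
    describes one swap area, with the file or device name in column one.
    """
    lines = proc_swaps_text.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.split()]


def read_active_swap(path: str = "/proc/swaps") -> List[str]:
    """Read the live swap table from the running kernel (Linux only)."""
    with open(path, "r") as handle:
        return active_swap_devices(handle.read())


# Illustrative samples, not taken from the attached dmesg.
header_only = "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
with_swap = header_only + "/dev/sda2\tpartition\t8388604\t0\t-2\n"

print(active_swap_devices(header_only))
print(active_swap_devices(with_swap))
```

An empty list from `read_active_swap()` would correspond to the situation described in comment 10, where `swapon -a` is suggested by the kernel message.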

Comment 14 sujaritha.sundaresan 2019-10-03 22:08:37 UTC

This bug has not been seen for about a week now on any of the platforms it was previously seen on. I will continue to track this bug and update if there are any changes.

Comment 15 sujaritha.sundaresan 2019-10-30 23:02:56 UTC

This issue was recently seen again on the gem_eio@in-flight-suspend and kms_pipe_crc_basic@suspend-read-crc-pipe-b tests. Initially, the previously incomplete tests passed after swap was enabled on the guc devices. After assessing the new logs, it looks like neither of these issues is guc-specific: the same issues are seen on non-guc systems as well. This particular bug log appears to be capturing general issues that happen to be seen on guc systems, and I do not think they are specific to guc.

Comment 16 sujaritha.sundaresan 2019-11-08 22:33:00 UTC

This issue is possibly being seen again primarily on TGL systems.

Comment 17 Martin Peres 2019-11-29 17:59:24 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/184.
