https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_140/fi-kbl-guc/igt@drv_suspend@forcewake.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-skl-guc/igt@drv_suspend@shrink.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-skl-guc/igt@drv_suspend@debugfs-reader.html
A CI Bug Log filter associated to this bug has been updated:
{- GUC: igt@*(suspend|s3)* - incomplete -}
{+ GUC: igt@*(suspend|s3)* - incomplete +}
New failures caught by the filter:
* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_191/fi-kbl-guc/igt@i915_suspend@forcewake.html
Further investigation will be deferred until after the upcoming update to the GuC version.
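To illustrate which test names a filter like the one above would catch, here is a minimal sketch. It assumes the `*` in `igt@*(suspend|s3)*` means "any run of characters" and that the rest behaves like a regular expression; the real CI Bug Log matching rules are not described in this thread, so treat the translation as an assumption. The function name is hypothetical.

```python
import re

# Assumed translation of the filter "igt@*(suspend|s3)*":
# each "*" is read as "any characters" (".*" in regex terms).
SUSPEND_FILTER = re.compile(r"igt@.*(suspend|s3).*")


def caught_by_filter(test_name):
    """Return True if the full test name matches the assumed filter."""
    return SUSPEND_FILTER.fullmatch(test_name) is not None


if __name__ == "__main__":
    # Test names taken from this thread, with the "igt@" prefix added where needed.
    for name in (
        "igt@drv_suspend@forcewake",
        "igt@i915_suspend@forcewake",
        "igt@gem_ctx_isolation@vcs0-s3",
    ):
        print(name, caught_by_filter(name))
```

Under this reading, both the old `drv_suspend` names and the renamed `i915_suspend` names are matched, which is consistent with the filter text being unchanged in the update.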
This is not happening with the new FW, so closing.
(In reply to Daniele Ceraolo Spurio from comment #3)
> This is not happening with the new FW, so closing.

It has happened 9 times in the last 14 drmtip runs, and has never gone unseen for more than 4 runs at a time. This means you did not follow the 10x shown in the bug assessment process. Please follow all the steps carefully and do not skip directly to closing the issue.
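The two figures quoted above (hits in the last N runs, and the longest stretch of runs without a hit) can be computed from a per-run history. The sketch below assumes a simple list of booleans, oldest run first; the sample history is made up to match the quoted numbers and is not real CI data. How the "10x" criterion is applied to these figures is not spelled out in this thread, so it is not encoded here.

```python
def longest_clean_streak(seen):
    """Longest run of consecutive results in which the failure was NOT seen.

    seen: iterable of booleans, oldest first; True means the failure
    occurred in that run.
    """
    longest = current = 0
    for hit in seen:
        # A hit resets the streak; a clean run extends it.
        current = 0 if hit else current + 1
        longest = max(longest, current)
    return longest


# Hypothetical 14-run history (illustrative only).
history = [True, True, False, False, False, False, True,
           True, True, False, True, True, True, True]

if __name__ == "__main__":
    print("hits:", sum(history), "of", len(history))
    print("longest clean streak:", longest_clean_streak(history))
```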
I'm currently following this bug. In the last look through the CI results I can see that this is still occurring but I haven't been able to identify the exact issue yet.
I have been able to reproduce this bug on an ICL with the gem_ctx_isolation@vcs0-s3 and i915_suspend@forcewake tests. On these runs I completely lose the DUT after the failed test run. The next step will be to get some serial logs for this.
This issue is recurrently seen on the following five tests: kms_vblank@pipe-c/b-continuation-suspend, gem_workarounds@suspend-resume-context, gem_ctx_isolation@vcs0-s3, i915_suspend@forcewake. For all these tests, locally I can see this issue happening without GuC as well.
As the issue reproduces on SKL and ICL with and without GuC, changing the i915 feature selection from firmware/guc to power/suspend/resume.
Local test result confirmed, but the CI evidence of being seen only on our -guc machines is compelling. Issue on kms tests might be a new regression. Re-adding firmware/guc to i915/feature.
Suja and I have been working on trying to duplicate this. On ICL, the i915_suspend test just appears to hang (see below):

  gta@ubt-18:~/ril-src/igt-gpu-tools$ sudo ./build/tests/i915_suspend
  IGT-Version: 1.24-g5a6c6856 (x86_64) (Linux: 5.3.0+ x86_64)
  Starting subtest: fence-restore-tiled2untiled
  [cmd] rtcwake: assuming RTC uses UTC ...
  rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:17 2019
  checking the first canary object
  checking the second canary object
  Subtest fence-restore-tiled2untiled: SUCCESS (7.957s)
  Starting subtest: fence-restore-untiled
  [cmd] rtcwake: assuming RTC uses UTC ...
  rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:39 2019
  checking the first canary object
  checking the second canary object
  Subtest fence-restore-untiled: SUCCESS (6.978s)
  Starting subtest: debugfs-reader
  [cmd] rtcwake: assuming RTC uses UTC ...
  < rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:43:02 2019 <------- seems to hang here?

However, with a serial port connected it turns out that the DUT does not die after all, as we still have an interactive console and can see kernel messages. It seems that the netdev isn't waking up, and that is why the test appears to hang and you can't ssh into it again. Also, looking at the running processes, the test appears to be running.

Lastly, we're seeing "PM: Cannot get swap device, try swapon -a" and "PM: Cannot get swap writer" on the console. I'm wondering if the test is trying to hibernate and is expecting swap space?

I have the console going and it looks like the machine is not really dead. The serial port is still interactive but the network appears dead, that is why you don’t see any output on your terminal, nor can you ssh into the DUT.

From the serial console, the test is still running.

The error on the serial console seems to imply it is expecting the machine to have a swap space enabled. Perhaps that is the reason the test just appears to hang.
We now know the device does come out of suspend, only that the network isn’t restarted.
(In reply to Don Hiatt from comment #10)
> I have the console going and it looks like the machine is not really dead.
> The serial port is still interactive but the network appears dead, that is
> why you don’t see any output on your terminal, nor
> can you ssh into the dut.
>
> From the serial console, the test is still running.
>
> The error on the serial console seems to imply it is expecting the machine
> to have a swap space enabled. Perhaps that is
> the reason the test just appears to hang. We now know the device does come
> out of suspend, only that the network isn’t
> restarted.

Sorry, this was a cut and paste repeat of what I was saying.
Created attachment 145560 [details] dmesg from serial console
After enabling swap on the dut, the tests are passing.
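Since the missing swap space was what made these tests appear to hang, a pre-flight check on the DUT may help. The following is a minimal sketch, assuming a Linux system that exposes active swap areas in `/proc/swaps` (one header line followed by one line per swap device or file). The function names are hypothetical and not part of IGT.

```python
def active_swap_devices(proc_swaps_text):
    """Return the device/file names listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header (Filename Type Size Used Priority);
    # every following non-empty line describes one active swap area.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def swap_enabled(path="/proc/swaps"):
    """Return True if at least one swap area is active on this machine."""
    with open(path) as f:
        return bool(active_swap_devices(f.read()))


if __name__ == "__main__":
    if not swap_enabled():
        print("No active swap; consider running 'swapon -a' before the suspend tests")
```

The parsing is kept separate from the file read so it can be exercised on captured text, for example from a serial console log.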
This bug has not been seen for about a week now on any of the platforms it was previously seen on. I will continue to track this bug and update if there are any changes.
This issue was recently seen again on the gem_eio@in-flight-suspend and kms_pipe_crc_basic@suspend-read-crc-pipe-b tests. Initially the incomplete tests were successful after enabling swap on GuC devices. After assessing the new logs, it looks like neither of these issues is GuC specific. The same issues are seen across non-GuC systems as well. This particular bug log appears to be capturing general issues seen on GuC systems. I do not think they are specific to GuC.
This issue is possibly being seen again primarily on TGL systems.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/184.