Bug 94390

Summary: [SKL-Y] GuC does not load when resuming from suspend to DISK
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: SKL
i915 features: firmware/guc, power/suspend-resume
Reporter: cprigent <christophe.prigent>
Assignee: Dave Gordon <dg11491352>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: chris.harris, harmanpreet.s.bambah, intel-gfx-bugs, yu.dai
Attachments:
Description             Flags
dmesg                   none
i915_guc_load_status    none

Description cprigent 2016-03-03 18:33:46 UTC
Created attachment 122103
dmesg

Hardware
Platform: SKY LAKE Y A0 QUAL
CPU : Intel(R) Core(TM) m5-6Y57 CPU @ 1.10GHz (family: 6, model: 78  stepping: 3)
MCP : SKL-Y  D1 2+2 (or ULX-D1)
QDF : QJK9 
CPU : SKL D0
Chipset PCH: Sunrise Point LP C0       
CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2
Reworks : All Mandatories + FBS02 & FBS03, O-06
Software 
Linux OS : Ubuntu 15.04 64 bits
BIOS : SKLSE2R1.R00.B104.B01.1511110114
ME FW : 11.0.0.1191
Ksc (EC FW): 1.20
Kernel: drm-intel-nightly 4.5.0-rc6 from http://cgit.freedesktop.org/drm-intel
With patch applied from:
https://patchwork.freedesktop.org/patch/74962/
https://lists.freedesktop.org/archives/intel-gfx/2016-February/088390.html
Guc 6.1


Steps:
-------
1. Execute command:
sudo -s
echo disk > /sys/power/state
2. Wait a moment
3. Resume with the keyboard
4. Check GuC load status from kernel debugfs with:
cat /sys/kernel/debug/dri/0/i915_guc_load_status

Actual result:
--------------
4. GuC is not loaded, confirmed from kernel log:
[   91.298855] [drm:intel_guc_ucode_load] GuC fw status: path i915/skl_guc_ver6.bin, fetch SUCCESS, load SUCCESS
[   91.298858] [drm:intel_guc_ucode_load] GuC fw status: fetch SUCCESS, load PENDING
[   91.403585] [drm:guc_ucode_xfer_dma] DMA status 0x10, GuC status 0x800570ec
[   91.403585] [drm:guc_ucode_xfer_dma] returning -110
[   91.403586] [drm:intel_guc_ucode_load] GuC firmware load fail, err -110
[   91.423518] [drm:host2guc_action [i915]] *ERROR* GUC: host2guc action 0x20 failed. ret=-110 status=0x00000020 response=0x40000000
[   91.423585] [drm:intel_guc_ucode_load] falling back to execlist mode, err 0
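
The failing transfer can be picked out of a saved dmesg log mechanically. The following is a minimal sketch, assuming the message formats seen in this particular log (other kernel versions may word them differently). The function name `summarize_guc_load` is hypothetical. On Linux, error 110 is ETIMEDOUT, so "err -110" means the load timed out.

```python
import re

# Linux errno value seen in the log; 110 is ETIMEDOUT on Linux.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Patterns copied from the message formats in this log (assumption: other
# kernel versions may word these messages differently).
XFER_RE = re.compile(r"DMA status (0x[0-9a-fA-F]+), GuC status (0x[0-9a-fA-F]+)")
FAIL_RE = re.compile(r"GuC firmware load fail, err (-?\d+)")


def summarize_guc_load(log_text):
    """Return DMA status, GuC status and error name found in a dmesg excerpt."""
    summary = {}
    for line in log_text.splitlines():
        m = XFER_RE.search(line)
        if m:
            summary["dma_status"] = int(m.group(1), 16)
            summary["guc_status"] = int(m.group(2), 16)
        m = FAIL_RE.search(line)
        if m:
            err = int(m.group(1))
            summary["err"] = err
            summary["err_name"] = ERRNO_NAMES.get(-err, "unknown")
    return summary


SAMPLE = """\
[   91.403585] [drm:guc_ucode_xfer_dma] DMA status 0x10, GuC status 0x800570ec
[   91.403585] [drm:guc_ucode_xfer_dma] returning -110
[   91.403586] [drm:intel_guc_ucode_load] GuC firmware load fail, err -110
"""

print(summarize_guc_load(SAMPLE))
```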

Expected result:
----------------
4. GuC is successfully loaded

Info
----
Not reproduced with suspend to freeze or suspend to RAM.
Comment 1 cprigent 2016-03-03 18:35:17 UTC
Created attachment 122104
i915_guc_load_status
Comment 2 Dave Gordon 2016-03-04 14:52:58 UTC
Hi,
you seem to have two entirely different scenarios here.

The second attachment labelled "i915_guc_load_status" unfortunately isn't the output from the debugfs file of the same name, it's a dmesg log showing a situation where the GuC firmware isn't loaded at all; the request_firmware() call fails to find the firmware, triggers the user-helper, which waits a minute (waiting for rootfs?) before failing. In the absence of any valid firmware, the driver correctly falls back to execlist mode as expected.

The first attachment (correctly labelled "dmesg") is rather more interesting. At startup, the GuC firmware has been fetched and successfully loaded into the GuC. After the suspend-resume cycle, the firmware is reloaded into the GuC -- but the GuC doesn't complete the startup handshake!

My first suspicion would have been that the GuC firmware image in host memory might have been lost or corrupted during the suspend-resume cycle, but the GuC status code tells us that the GuC's BootRom has validated the newly-loaded image and the RSA signature is (still) correct. This essentially proves that the host-side image as copied by the DMA engine was still intact after it was re-copied into the GuC's memory.

So what went wrong? The GuC status decodes as "uKernel got unexpected exception". If you did manage to capture the contents of the debugfs i915_guc_load_status, that might contain some more details, as the exception handler may (depending on GuC build configuration options) put additional data about the exception in the shared registers for the host to capture. Otherwise there's not much more we can deduce from this.
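
For readers who want to repeat the decode, here is a minimal sketch that splits the status word 0x800570ec from the log into fields. The bit layout (BootRom status in bits 7:1, uKernel status in bits 15:8, MIA state in bits 18:16, authentication status in bits 31:30) is an assumption based on the i915 GuC register definitions and is not stated in this report, so check it against the register header of the kernel in use. The function name `decode_guc_status` is hypothetical.

```python
# Assumed bit layout of the GuC status register; not stated in this report,
# so verify it against the i915 GuC register header of the kernel in use.
GS_BOOTROM_SHIFT, GS_BOOTROM_MASK = 1, 0x7F
GS_UKERNEL_SHIFT, GS_UKERNEL_MASK = 8, 0xFF
GS_MIA_SHIFT, GS_MIA_MASK = 16, 0x07
GS_AUTH_SHIFT, GS_AUTH_MASK = 30, 0x03


def decode_guc_status(status):
    """Split a 32-bit GuC status value into its (assumed) fields."""
    return {
        "bootrom": (status >> GS_BOOTROM_SHIFT) & GS_BOOTROM_MASK,
        "ukernel": (status >> GS_UKERNEL_SHIFT) & GS_UKERNEL_MASK,
        "mia": (status >> GS_MIA_SHIFT) & GS_MIA_MASK,
        "auth": (status >> GS_AUTH_SHIFT) & GS_AUTH_MASK,
    }


fields = decode_guc_status(0x800570EC)
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

Under this assumed layout the BootRom field is 0x76 and the uKernel field is 0x70. That would be consistent with the reading above: the BootRom stage completed and handed over to the uKernel, which then reported an error code instead of reaching its ready state.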

Does it happen every time, or at least frequently? Or just once or very rarely? We do know of a h/w bug that can occasionally result in the GuC reload not working; Arun Siluvery recently posted a patch to work around this. If this is also just occasional, the same workaround will fix it too. If it's every time, it probably means the GuC h/w is not being reinitialised properly on resume, and we will need to change something, maybe power sequencing?

.Dave.
Comment 3 Harmanpreet Singh 2016-03-16 16:18:32 UTC
Is there any update on the status of this issue?
Comment 4 Chris Harris 2016-03-16 18:02:23 UTC
A fix for this issue is in progress. It's based on the change posted at https://patchwork.freedesktop.org/series/3985/ and an update should be posted in the next few days.
Comment 5 Dave Gordon 2016-03-23 18:47:59 UTC
It turns out that the Linux hibernate/resume cycle is weirdly asymmetric :(

In a suspend (to RAM) cycle, devices are frozen, then the CPU power is turned "off", leaving the system in a low-power state. On resume, the CPU power comes up, then devices are thawed and normal operation continues. This is therefore essentially symmetrical.

In hibernation, OTOH, things are much stranger. First, devices are frozen, including the display/GPU and presumably the HD/SSD; then, the kernel builds a compressed hibernation image; then, devices are *thawed* again, and the image written out. Finally, the power is turned off; there doesn't appear to be any re-freeze or controlled/phased shutdown at this stage. Therefore, the final state of devices just before poweroff is "active" -- not a problem for i915 or the GuC, but potentially an issue for self-powered devices or those with nonvolatile state.

It's resume-from-hibernation where things really get strange. First, the system boots as though from cold: BIOS, GRUB, kernel. Drivers are initialised and the GuC firmware may be loaded. THEN the system discovers the hibernation image. It doesn't (AFAICT) freeze devices at this point; it simply overwrites the running kernel with the hibernated image.

This means that the hardware (e.g. GuC) is left active, having been initialised by the booted kernel. But the reloaded driver state says the device is frozen, and must be reinitialised, including (re)loading the GuC firmware.

For most devices, reinitialising an already-active device wouldn't be a problem; but the GuC's program memory is write-once. So when the restored driver tries to reload the GuC, the write fails.

There are several ways to resolve this; for example, we could ignore the failure and discover afterwards that the GuC is in fact running anyway. Or we could check in advance, before even trying to load the GuC, and skip the entire sequence if it's already loaded. However, the preferred solution will be to reset the GuC just before loading it, thus guaranteeing that it starts out in a known state. (Otherwise, we would at least have to reprogram various options, and it might still be possible that the state of the running firmware is not identical with that which the driver would have loaded).
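
The effect of the reset-before-load approach can be illustrated with a toy model. This is only a sketch and not the driver code: `ToyGuc` and `resume_load` are hypothetical names, and the model assumes that a reset makes the write-once program memory writable again.

```python
class ToyGuc:
    """Toy model of a device whose program memory is write-once per reset."""

    def __init__(self):
        self.firmware = None  # None means program memory is still writable

    def reset(self):
        # Assumption: a reset clears program memory and re-enables writes.
        self.firmware = None

    def load(self, image):
        if self.firmware is not None:
            # Write-once: a second upload without a reset is rejected.
            return False
        self.firmware = image
        return True


def resume_load(guc, image, reset_first):
    """What the restored driver does on resume (hypothetical helper)."""
    if reset_first:
        guc.reset()
    return guc.load(image)


# Resume from hibernation: the freshly booted kernel has already loaded the
# firmware before the hibernation image overwrites that kernel.
guc = ToyGuc()
guc.load("image loaded by boot kernel")
without_reset = resume_load(guc, "image from restored driver", reset_first=False)

guc = ToyGuc()
guc.load("image loaded by boot kernel")
with_reset = resume_load(guc, "image from restored driver", reset_first=True)

print(without_reset, with_reset)  # False True
```

With the reset in place, the firmware that ends up running is always the image the restored driver loaded, which is the "known state" argument made above.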

This solution has been coded, tested locally and by the reporter, and sent to the upstream mailing list.
Comment 7 cprigent 2016-07-07 15:36:00 UTC
For info, I cannot reproduce it on KBL-U with GuC 9.14.
