Summary: | After S3 resume, kernel: [drm] psp command failed and response status is (-65529) at 27th time of S3. 28th time of S3 freeze the system.
---|---
Product: | DRI
Component: | DRM/AMDgpu
Status: | RESOLVED MOVED
Severity: | normal
Priority: | medium
Version: | unspecified
Hardware: | x86-64 (AMD64)
OS: | Linux (All)
Reporter: | Kai-Heng Feng <kai.heng.feng>
Assignee: | Default DRI bug account <dri-devel>
CC: | alexdeucher, christian.koenig, samantham
Description
Kai-Heng Feng
2019-06-11 06:17:40 UTC
Created attachment 144498 [details]
Full kernel log
Created attachment 144502 [details]
Another kind of fail
Jun 11 03:02:41 u-HP-ProBook-645-G4 kernel: [drm] psp command failed and response status is (-65529)
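The status in the log line above appears as a negative decimal. The following is a minimal C sketch of viewing the same value as an unsigned 32-bit bit pattern, which can be easier to compare against hexadecimal status codes. The helper names `psp_status_as_u32` and `print_psp_status` are hypothetical and not part of the amdgpu driver, and what the resulting value means to the PSP firmware is not stated in this report.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: view a signed status as its 32-bit unsigned bit pattern. */
static uint32_t psp_status_as_u32(int32_t status)
{
    /* Signed-to-unsigned conversion is defined as reduction modulo 2^32. */
    return (uint32_t)status;
}

/* Hypothetical helper: print the status in both decimal and hexadecimal form. */
static void print_psp_status(int32_t status)
{
    printf("status %d = 0x%08X\n", (int)status,
           (unsigned int)psp_status_as_u32(status));
}
```

Calling `print_psp_status(-65529)` prints `status -65529 = 0xFFFF0007`.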
Is this a regression? If so, can you bisect?

(In reply to Alex Deucher from comment #3)
> Is this a regression? If so, can you bisect?
No this is not a regression. This issue (S3 resume fail) also happens on previous kernel versions, but without any stack trace logged. On amd-staging-drm-next we can observe the same issue and a stacktrace.

Does disabling the IOMMU help?

Created attachment 145044 [details]
failed log when iommu is disabled.
I also tried disabling GFXOFF but the same issue still happens:
diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
index a24beaa4fb01..62a8394b1f5f 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c
@@ -173,6 +173,7 @@ int hwmgr_early_init(struct pp_hwmgr *hwmgr)
         case AMDGPU_FAMILY_RV:
                 switch (hwmgr->chip_id) {
                 case CHIP_RAVEN:
+                        hwmgr->feature_mask &= ~PP_GFXOFF_MASK;
                         hwmgr->od_enabled = false;
                         hwmgr->smumgr_funcs = &smu10_smu_funcs;
                         smu10_init_function_pointers(hwmgr);

(In reply to Kai-Heng Feng from comment #6)
> Created attachment 145044 [details]
> failed log when iommu is disabled.
What was the failure with IOMMU disabled? Is it the same as with IOMMU enabled? In the log I only see PSP errors on resume. Can you confirm that this is the only failure/error you observed in the log in that use case?
Can you please provide your FW versions by
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

(In reply to Andrey Grodzovsky from comment #8)
> (In reply to Kai-Heng Feng from comment #6)
> > Created attachment 145044 [details]
> > failed log when iommu is disabled.
>
> What was the failure with IOMMU disabled?
Blanked screen. Graphics no longer works.
> Is it the same as with IOMMU enabled?
Yes.
> In the log I only see PSP errors on resume. Can you confirm that this is the only
> failure/error you observed in the log in that use case?
Yes. I haven't seen
"[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:57:crtc-0] flip_done timed out"
for a while.
Now it always shows PSP fail.
> Can you please provide your FW versions by
> cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 40, firmware version: 0x00000099
PFP feature version: 40, firmware version: 0x000000ae
CE feature version: 40, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 40, firmware version: 0x0000018b
MEC2 feature version: 40, firmware version: 0x0000018b
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x001ad4d4
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00001e4f
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: SWBRT32481.001

I am getting this same issue (at least I believe it is the same). It is in the 5.2 series but not in the 5.1 series of the kernel. If needed I can post my logs. I have a Lenovo A485 w/ 2700U.

(In reply to Samantha McVey from comment #10)
> I am getting this same issue (at least I believe it is the same). It is in the 5.2
> series but not in the 5.1 series of the kernel. If needed I can post my
> logs. I have a Lenovo A485 w/ 2700U.
Can you please build a kernel from branch [1], reproduce the issue, and attach `journalctl -b -1 -k` so we can check whether it is really the same issue.
[1] https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

> Now it always shows PSP fail.
I've dug up more info about this issue. It always times out in psp_cmd_submit_buf(). Particularly, this code section:
while (*((unsigned int *)psp->fence_buf) != index) {
        if (--timeout == 0)
                break;
        msleep(1);
}
psp->fence_buf is stuck at 406 and index is stuck at 407, so it eventually times out.
This _always_ happens at the 27th S3, and the 28th S3 attempt freezes the whole system.
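The timeout described in this comment can be modelled outside the kernel. The following is a simplified userspace C sketch, not the driver code: the function name `wait_for_fence` and its `max_polls` parameter are hypothetical, and the 1 ms sleep between polls in the kernel loop is replaced here by a plain poll counter.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Simplified userspace model of the polling loop quoted in this comment.
 * The kernel loop calls msleep(1) between polls; this model only counts polls.
 * Returns 0 once *fence_buf equals index, or -ETIMEDOUT when the poll
 * budget runs out first.
 */
static int wait_for_fence(const volatile uint32_t *fence_buf, uint32_t index,
                          int max_polls)
{
    while (*fence_buf != index) {
        if (--max_polls <= 0)
            return -ETIMEDOUT; /* fence never reached the expected index */
    }
    return 0;
}
```

With the values from this report, a fence holding 406 polled for index 407 never matches, so `wait_for_fence(&fence, 407, 100)` returns `-ETIMEDOUT`; a fence that already holds 407 returns 0 immediately.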
Created attachment 145085 [details]
amd-staging-drm-net dmesg log
Created attachment 145086 [details]
amd-staging-drm-next xorg log
I have uploaded my dmesg log and xorg log from amd-staging-drm-next

(In reply to Samantha McVey from comment #13)
> Created attachment 145085 [details]
> amd-staging-drm-net dmesg log
Doesn't look like the same one.

Does this system support conventional S3 or is it a reduced ACPI platform that only supports suspend to idle?

(In reply to Alex Deucher from comment #17)
> Does this system support conventional S3 or is it a reduced ACPI platform
> that only supports suspend to idle?
This system defaults to S3, and the issue happens under S3. Is there any first-gen Raven Ridge that supports s2idle?

(In reply to Kai-Heng Feng from comment #9)
> (In reply to Andrey Grodzovsky from comment #8)
> > (In reply to Kai-Heng Feng from comment #6)
> > > Created attachment 145044 [details]
> > > failed log when iommu is disabled.
> >
> > What was the failure with IOMMU disabled?
> Blanked screen. Graphics no longer works.
>
> > Is it the same as with IOMMU enabled?
> Yes.
>
> > In the log I only see PSP errors on resume. Can you confirm that this is the only
> > failure/error you observed in the log in that use case?
> Yes. I haven't seen
> "[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
> [CRTC:57:crtc-0] flip_done timed out"
> for a while.
>
> Now it always shows PSP fail.
>
> > Can you please provide your FW versions by
> > cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
> VCE feature version: 0, firmware version: 0x00000000
> UVD feature version: 0, firmware version: 0x00000000
> MC feature version: 0, firmware version: 0x00000000
> ME feature version: 40, firmware version: 0x00000099
> PFP feature version: 40, firmware version: 0x000000ae
> CE feature version: 40, firmware version: 0x0000004d
> RLC feature version: 1, firmware version: 0x00000213
> RLC SRLC feature version: 1, firmware version: 0x00000001
> RLC SRLG feature version: 1, firmware version: 0x00000001
> RLC SRLS feature version: 1, firmware version: 0x00000001
> MEC feature version: 40, firmware version: 0x0000018b
> MEC2 feature version: 40, firmware version: 0x0000018b
> SOS feature version: 0, firmware version: 0x00000000
> ASD feature version: 0, firmware version: 0x001ad4d4
> TA XGMI feature version: 0, firmware version: 0x00000000
> TA RAS feature version: 0, firmware version: 0x00000000
> SMC feature version: 0, firmware version: 0x00001e4f
> SDMA0 feature version: 41, firmware version: 0x000000a9
> VCN feature version: 0, firmware version: 0x0110901c
> DMCU feature version: 0, firmware version: 0x00000000
> VBIOS version: SWBRT32481.001
Can you please confirm that the issue happens regardless of whether graphics is enabled? Load the system in console mode and verify that you still observe the problem.
(In reply to Kai-Heng Feng from comment #12)
> > Now it always shows PSP fail.
> I've dug up more info about this issue. It always times out in
> psp_cmd_submit_buf(). Particularly, this code section:
>
> while (*((unsigned int *)psp->fence_buf) != index) {
>         if (--timeout == 0)
>                 break;
>         msleep(1);
> }
>
> psp->fence_buf is stuck at 406 and index is stuck at 407, so it eventually times
> out.
> This _always_ happens at the 27th S3, and the 28th S3 attempt freezes the
> whole system.
Does it also happen when there is no acceleration in the system, i.e. if you do S3 cycles from console mode?

(In reply to Andrey Grodzovsky from comment #19)
> Can you please confirm that the issue happens regardless of whether graphics
> is enabled? Load the system in console mode and verify that you still observe
> the problem.
I guess you mean without a graphical session? Yes, I already tested that.
1. If amdgpu.ko is loaded, the issue happens under both console and graphical sessions.
2. If amdgpu.ko is not loaded, the issue doesn't happen, regardless of console or graphical session.
> Does it also happen when there is no acceleration in the system, i.e. if you
> do S3 cycles from console mode?
Please refer to point 2 above.

In fact, please rebase onto the latest drm-next from here - https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next, there are 2 changes by Alex in communication with the PSP which might help:
drm/amdgpu/psp: invalidate the hdp read cache before reading the psp response
drm/amdgpu/psp: flush HDP write fifo after submitting cmds to the psp
See if the PSP errors go away with that.

(In reply to Andrey Grodzovsky from comment #21)
> In fact, please rebase onto the latest drm-next from here -
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next, there
> are 2 changes by Alex in communication with the PSP which might help:
>
> drm/amdgpu/psp: invalidate the hdp read cache before reading the psp
> response
> drm/amdgpu/psp: flush HDP write fifo after submitting cmds to the psp
>
> See if the PSP errors go away with that.
A slightly different error message still popped up after the 27th S3, and the 28th S3 attempt froze the system:
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: [drm:psp_hw_start.cold [amdgpu]] *ERROR* PSP load asd failed!
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Sep 28 05:38:44 u-HP-ProBook-645-G4 kernel: PM: Device 0000:04:00.0 failed to resume async: error -22
$ journalctl -b -1 -k | grep "suspend entry (deep)" | wc -l
28

Created attachment 145576 [details]
journalctl last boot kernel message
(In reply to Kai-Heng Feng from comment #23)
> Created attachment 145576 [details]
> journalctl last boot kernel message
Can you retry with the latest FW from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git and also load the kernel with drm.debug=1, as there seems to be a failure in PSP command submission during FW loading but the actual failure code is now under the debug log level.

(In reply to Andrey Grodzovsky from comment #24)
> (In reply to Kai-Heng Feng from comment #23)
> > Created attachment 145576 [details]
> > journalctl last boot kernel message
>
> Can you retry with the latest FW from
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
Still the same issue.
> and also load the kernel with drm.debug=1, as there seems to be a failure in PSP
> command submission during FW loading but the actual failure code is now
> under the debug log level.
I can reproduce the issue on the latest firmware ("amdgpu: update vega20 ucode for 19.30") and the latest amd-staging-drm-next ("drm/amdgpu: remove redundant variable r and redundant return statement").
I don't see how continuing to try the latest kernel/firmware gets us anywhere. If you need physical hardware to debug, please just let me know.

Created attachment 145666 [details]
PSP failed with drm.debug=1
Created attachment 145667 [details]
ring test failed with drm.debug=1
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/822.