Bug 110360 - AMD system hits AMD-Vi: Completion-Wait loop timed out on Acer Squirtle_SR laptop
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-09 05:37 UTC by jian-hong
Modified: 2019-04-16 02:29 UTC



Attachments
The error log (209.44 KB, text/plain)
2019-04-09 05:37 UTC, jian-hong
The dmesg of disabled amdgpu's runpm (65.65 KB, text/plain)
2019-04-10 09:19 UTC, jian-hong
The dmesg of disabled pci ats (66.50 KB, text/x-log)
2019-04-11 05:04 UTC, jian-hong
lspci -nnv (12.98 KB, text/x-log)
2019-04-12 02:36 UTC, jian-hong

Description jian-hong 2019-04-09 05:37:41 UTC
Created attachment 143905 [details]
The error log

We have an Acer Squirtle_SR laptop equipped with an AMD A9-9420e RADEON R5, 5 COMPUTE CORES 2C+3G and an [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900].  We tested it with Linux kernel 5.1.0-rc4.  The system hits the following error and then hangs:

Apr 09 11:28:57 endless kernel: AMD-Vi: Completion-Wait loop timed out
Apr 09 11:28:57 endless kernel: iwlwifi 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xff814000 flags=0x0050]
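
Note that the IO_PAGE_FAULT event in the second line is attributed to the iwlwifi device at 0000:03:00.0, not to the GPU. When going through a long log such as the attached one, it can help to pull out the device, address and flags of each event. The following is a minimal Python sketch, assuming every event line follows the layout of the single sample quoted above; the helper name `parse_amd_vi_event` is made up for illustration and the pattern may need adjusting for other kernel versions.

```python
import re

# Pattern derived from the one event line quoted in this report.
# Assumption: other AMD-Vi event lines use the same field layout.
EVENT_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[(?P<event>\w+) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_amd_vi_event(line):
    """Return the fields of an AMD-Vi event line as a dict, or None."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the hexadecimal fields to integers for easier comparison.
    for key in ("domain", "address", "flags"):
        fields[key] = int(fields[key], 16)
    return fields


sample = (
    "Apr 09 11:28:57 endless kernel: iwlwifi 0000:03:00.0: AMD-Vi: "
    "Event logged [IO_PAGE_FAULT domain=0x0000 address=0xff814000 flags=0x0050]"
)
event = parse_amd_vi_event(sample)
print(event["driver"], event["bdf"], event["event"], hex(event["address"]))
```

Lines that do not match, such as the "Completion-Wait loop timed out" message, simply return None.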

In the worst case, disk blocks may be corrupted, and we have to reinstall the system if fsck cannot repair them.

If we blacklist the amdgpu module, the system does not hit the error, but there is no GUI, only the console.

If iommu=soft is appended to the kernel command line, the system works fine.
Comment 1 jian-hong 2019-04-09 06:00:24 UTC
Details of the [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900] GPU:

01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900] (rev c3)
	Subsystem: Acer Incorporated [ALI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1025:1217]
	Physical Slot: 0
	Flags: bus master, fast devsel, latency 0, IRQ 44
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 3000 [size=256]
	Memory at d1400000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at d1440000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] #19
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
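
The listing above shows that the GPU advertises the ATS, PRI and PASID capabilities. To check the same thing on other units, the capability lines can be extracted from saved lspci output. The following Python sketch assumes the text has the same "Capabilities: [offset] name" layout as the listing above; the function names are illustrative only.

```python
import re

# Matches lines such as "\tCapabilities: [2b0] Address Translation Service (ATS)".
CAP_RE = re.compile(r"^\s*Capabilities: \[(?P<offset>[0-9a-f]+)\] (?P<name>.+)$")


def list_capabilities(lspci_text):
    """Return (offset, name) tuples for each capability line in the text."""
    caps = []
    for line in lspci_text.splitlines():
        match = CAP_RE.match(line)
        if match:
            caps.append((int(match.group("offset"), 16), match.group("name").strip()))
    return caps


def has_capability(lspci_text, needle):
    """True if any capability name contains the given substring."""
    return any(needle in name for _, name in list_capabilities(lspci_text))


# A few lines copied from the listing in this comment.
sample = """\
\tCapabilities: [50] Power Management version 3
\tCapabilities: [2b0] Address Translation Service (ATS)
\tCapabilities: [2c0] Page Request Interface (PRI)
\tCapabilities: [2d0] Process Address Space ID (PASID)
\tKernel driver in use: amdgpu
"""
print(has_capability(sample, "(ATS)"))
print([hex(offset) for offset, _ in list_capabilities(sample)])
```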
Comment 2 Alex Deucher 2019-04-09 14:22:56 UTC
Does booting with amdgpu.runpm=0 on the kernel command line in grub help?
Comment 3 jian-hong 2019-04-10 09:19:55 UTC
Created attachment 143916 [details]
The dmesg of disabled amdgpu's runpm

(In reply to Alex Deucher from comment #2)
> Does booting with amdgpu.runpm=0 on the kernel command line in grub help?

The system boots correctly with amdgpu.runpm=0 on the kernel command line.
Comment 4 jian-hong 2019-04-11 05:04:51 UTC
Created attachment 143930 [details]
The dmesg of disabled pci ats

I also tested with 'pci=noats' on the kernel command line, as mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=194521#c24
The system also boots fine.
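
Three kernel parameters have been reported to avoid the hang so far: iommu=soft, amdgpu.runpm=0 and pci=noats. When comparing dmesg logs from several boots, it is easy to lose track of which one was active. The Python sketch below parses a command line string and reports which of the three is present. The sample command line is hypothetical; on a running system the string could be read from /proc/cmdline instead.

```python
import shlex

# The three workarounds mentioned in this report, as (parameter, value) pairs.
WORKAROUNDS = {
    ("iommu", "soft"),
    ("amdgpu.runpm", "0"),
    ("pci", "noats"),
}


def parse_cmdline(cmdline):
    """Map each parameter name to its list of comma-separated values."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params.setdefault(name, []).extend(value.split(",") if sep else [])
    return params


def active_workarounds(cmdline):
    """Return the workarounds from this report found on the command line."""
    params = parse_cmdline(cmdline)
    return sorted(
        f"{name}={value}"
        for name, value in WORKAROUNDS
        if value in params.get(name, [])
    )


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 quiet amdgpu.runpm=0 pci=noats,nomsi"
print(active_workarounds(sample))
```

Splitting values on commas matters for pci=, which accepts several options in one parameter.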
Comment 5 Alex Deucher 2019-04-11 15:21:54 UTC
Please attach the output of lspci -vnn
Comment 6 jian-hong 2019-04-12 02:36:58 UTC
Created attachment 143946 [details]
lspci -nnv
Comment 7 jian-hong 2019-04-15 02:37:02 UTC
Is there anything else I can help with?  Do you need more tests, information, or logs? :)
Comment 8 Alex Deucher 2019-04-15 04:31:21 UTC
https://patchwork.kernel.org/patch/10889269/
Comment 9 Alex Deucher 2019-04-15 04:32:16 UTC
(In reply to Alex Deucher from comment #8)
> https://patchwork.kernel.org/patch/10889269/

I think it's actually a problem with runtime pm and some pci state.  I may ask you to help debug that when I get a chance.
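
For anyone gathering data on the runtime PM side in the meantime, the kernel exposes each PCI device's runtime PM setting and state under its sysfs power directory (the control and runtime_status files). The following Python sketch reads both. It assumes the GPU sits at 0000:01:00.0, based on the lspci output in comment 1 with PCI domain 0000 assumed; the helper name is illustrative.

```python
from pathlib import Path


def read_runtime_pm(device_dir):
    """Return the runtime PM 'control' and 'runtime_status' values.

    A value is None if the corresponding file cannot be read.
    """
    power = Path(device_dir) / "power"
    result = {}
    for name in ("control", "runtime_status"):
        try:
            result[name] = (power / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


# Assumed device address; adjust to match the lspci output on the test unit.
print(read_runtime_pm("/sys/bus/pci/devices/0000:01:00.0"))
```

A control value of "auto" means the device is allowed to runtime suspend, while "on" keeps it powered.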
Comment 10 Daniel Drake 2019-04-16 02:29:19 UTC
Thanks Alex. We will have to return this unit to the vendor at some point, but we will try to hold onto it for another month so that we can run any tests you request.

Alternatively, we may be able to get an affected unit shipped to you on a 1-month loan. Would that be useful?

