Bug 107898 - "kfd: Failed to resume IOMMU for device 1002:15dd" on Raven Ridge
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/amdkfd
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
Reported: 2018-09-11 07:33 UTC by Marvin Damschen
Modified: 2019-11-19 07:53 UTC

Attachments
dmesg 4.19-rc3 (129.72 KB, text/plain)
2018-09-11 07:33 UTC, Marvin Damschen
Add iommu init instrumentation (2.72 KB, patch)
2018-09-12 05:21 UTC, Felix Kühling
dmesg 4.19-rc3 with iommu init instrumentation (87.60 KB, text/plain)
2018-09-12 10:17 UTC, Marvin Damschen
ROCm 1.9 info on 4.19-rc4 (5.78 KB, text/plain)
2018-09-20 09:01 UTC, Marvin Damschen

Description Marvin Damschen 2018-09-11 07:33:16 UTC
Created attachment 141520
dmesg 4.19-rc3

Hey,

I wanted to try the newly added support for Raven Ridge in amdkfd, but initialization fails with
"kfd: Failed to resume IOMMU for device 1002:15dd" on an AMD Ryzen 5 2500U (Lenovo E485) running 4.19-rc3. The IOMMU itself seems to initialize fine (as I understand it, the "AMD-Vi: Unable to write to IOMMU perf counter." message can be ignored). The full log is attached.

Best regards
Marvin
Comment 1 Oded Gabbay 2018-09-11 08:53:43 UTC
Added Felix to CC
Comment 2 Felix Kuehling 2018-09-11 20:48:41 UTC
The AMD-Vi messages in the log look OK. I'm seeing the same on my Raven system (Ryzen 5 2400G desktop).

I'm currently running a 4.19-rc1+ kernel from Alex Deucher's drm-next-4.20-wip branch. I haven't tried rc3 from the master branch yet. I'll try it tonight and see if I can reproduce the issue.
Comment 3 Felix Kühling 2018-09-12 05:19:59 UTC
I'm not seeing this problem on my Raven system with 4.19-rc3+ (git describe: v4.19-rc3-21-g5e335542de83).

The most likely explanation is that IOMMUv2 is not enabled on your system. That may be a BIOS setting. If your BIOS setup doesn't allow you to enable IOMMUv2, then you may be out of luck. I'll attach a patch that adds some extra error messages, which should either confirm this or point to a different source of the problem.
Comment 4 Felix Kühling 2018-09-12 05:21:21 UTC
Created attachment 141532
Add iommu init instrumentation
Comment 5 Marvin Damschen 2018-09-12 10:16:09 UTC
Output with patch applied:

Sep 12 12:08:20 zen kernel: kfd kfd: Allocated 3969056 bytes on gart
Sep 12 12:08:20 zen kernel: Topology: Add APU node [0x15dd:0x1002]
Sep 12 12:08:20 zen kernel: Failed to attache to group
Sep 12 12:08:20 zen kernel: amd_iommu_init_device failed: -22
Sep 12 12:08:20 zen kernel: kfd kfd: Failed to resume IOMMU for device 1002:15dd
Sep 12 12:08:20 zen kernel: Creating topology SYSFS entries
Sep 12 12:08:20 zen kernel: kfd kfd: device 1002:15dd NOT added due to errors


Full log attached.

Thank you
Marvin
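
For readers following the log above: the value -22 in "amd_iommu_init_device failed: -22" is a negated errno, and errno 22 on Linux is EINVAL ("Invalid argument"). The following is a minimal sketch for decoding such return codes; the helper name describe_kernel_errno is hypothetical and not part of any kernel or ROCm tooling.

```python
import errno
import os


def describe_kernel_errno(ret: int) -> str:
    """Map a negative kernel-style return code to its symbolic errno name.

    Hypothetical helper for illustration only.
    """
    code = abs(ret)
    # errno.errorcode maps numeric codes to names such as "EINVAL".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The code reported by amd_iommu_init_device in the log above.
    print(describe_kernel_errno(-22))
```

The symbolic name only says that an argument was rejected; it does not by itself identify which check inside the IOMMU code failed.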
Comment 6 Marvin Damschen 2018-09-12 10:17:03 UTC
Created attachment 141533
dmesg 4.19-rc3 with iommu init instrumentation
Comment 7 Felix Kuehling 2018-09-13 20:05:18 UTC
Good timing. We were just given a laptop that has similar problems and found a partial workaround: Try adding "iommu=pt" to your kernel command line. This may at least get you through the KFD initialization, but there are likely more problems down the line.

The problems are due to BIOS bugs. We're looking into more workarounds to ignore or patch incorrect information in the CRAT ACPI table that describes the compute devices for KFD.
Comment 8 Marvin Damschen 2018-09-14 07:40:03 UTC
KFD initializes without errors using "iommu=pt". I will see whether I can get ROCm running on top of that.

Unfortunately, the BIOS has been terrible so far on the Raven-based Lenovo laptops. I am happy to try any patches or workarounds you have, just let me know.
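
To confirm that the "iommu=pt" workaround is actually in effect after a reboot, one option is to inspect the running kernel's command line in /proc/cmdline. This is a rough sketch under that assumption; the function names has_kernel_param and running_with_iommu_pt are made up for illustration.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token on the command line."""
    # Splitting on whitespace avoids false matches on substrings,
    # e.g. "amd_iommu=pt_foo" must not count as "iommu=pt".
    return param in cmdline.split()


def running_with_iommu_pt(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel."""
    return has_kernel_param(Path(path).read_text(), "iommu=pt")


if __name__ == "__main__":
    if Path("/proc/cmdline").exists():
        print("iommu=pt active:", running_with_iommu_pt())
```

How the parameter is added persistently depends on the boot loader and distribution, so that step is left out here.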
Comment 9 Marvin Damschen 2018-09-20 09:00:17 UTC
ROCm 1.9 runs OpenCL on GPU on top of mainline kfd and seems stable. However:
- CPU is not detected as a compute device (rocminfo attached)
- Performance, at least in darktable, is quite low (the "bench.SRW" benchmark in OpenCL on GPU takes more than 3 times longer than on CPU without OpenCL). The problem could be that memory buffers are too small, clinfo reports:
"Max memory allocation                           268435456 (256MiB)"
which seems quite small to me (?).

Are these problems a result of incorrect information in CRAT?

Best regards
Marvin
Comment 10 Marvin Damschen 2018-09-20 09:01:16 UTC
Created attachment 141657
ROCm 1.9 info on 4.19-rc4
Comment 11 Felix Kuehling 2018-10-01 18:40:10 UTC
rocminfo reports both the CPU and the GPU.

If OpenCL can't use the CPU as a compute device, that's probably a limitation of the OpenCL implementation.

The max memory allocation size is strange. rocminfo reports a single 16GB memory pool attached to the CPU. That's system memory from the CRAT table and looks reasonable. It should be possible to use at least 3/8 of that (about 6GB) with the upstream KFD. If clinfo is reporting something different, I suspect it's an OpenCL limitation rather than a ROCm limitation.

If you're interested in the raw information reported by KFD to user mode, check out /sys/class/kfd/kfd/topology/nodes. On an APU there should be only one node (0). Underneath that you'll find node properties as well as memory properties that may be interesting.
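
A small sketch of how the topology files mentioned above could be read from user space. It assumes each properties file holds one "name value" pair per line with unsigned integer values; the sample text in the test-style usage and the helper names parse_properties and dump_topology are illustrative and not taken from this report.

```python
from pathlib import Path

TOPOLOGY_ROOT = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text: str) -> dict:
    """Parse 'name value' lines into a dict of integers.

    Lines that do not match the assumed two-column format are skipped.
    """
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def dump_topology(root: Path = TOPOLOGY_ROOT) -> None:
    """Print node and memory-bank properties for every KFD node found."""
    files = sorted(root.glob("*/properties"))
    files += sorted(root.glob("*/mem_banks/*/properties"))
    for prop_file in files:
        try:
            props = parse_properties(prop_file.read_text())
        except OSError as exc:
            # Some files may need elevated privileges to read.
            print(f"{prop_file}: unreadable ({exc})")
            continue
        print(prop_file)
        for name, value in props.items():
            print(f"  {name} = {value}")
        if "size_in_bytes" in props:
            print(f"  (about {props['size_in_bytes'] / 2**30:.1f} GiB)")


if __name__ == "__main__":
    if TOPOLOGY_ROOT.is_dir():
        dump_topology()
```

On a system without a loaded amdkfd driver the directory does not exist and the script prints nothing.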
Comment 12 Marvin Damschen 2018-10-10 12:07:43 UTC
Thanks a lot for the info. /sys/class/kfd/kfd/topology/nodes/0/mem_banks/0/properties correctly reports 16GB of RAM.
As the issues seem to come from BIOS/OpenCL (not from kfd) and kfd successfully initializes with "iommu=pt", I will close this bug report as resolved.

Best regards
Marvin
Comment 13 Chí-Thanh Christopher Nguyễn 2018-11-14 10:44:37 UTC
I have the same issue on Dell Latitude 5495 with Linux kernel 4.19.1 and iommu=pt is a workaround here too.

But as AMD is working around other BIOS bugs[1] (rather than getting them fixed quickly with their business partners), I think this bug report should be left open for now.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=44d8cc6f1a905e4bb1d4221a898abb0d7e9d100a
Comment 14 Martin Peres 2019-11-19 07:53:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/4.

