On Pierre-Loup's desktop, the reported total VRAM size and CPU-visible VRAM size are sometimes wildly wrong. For example, in the most recent report the kernel reported a total VRAM size of 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09 GiB). This is a large-aperture system, so both values should be ~8 GiB, and most of the time they are. The system typically boots in a good state, transitions to the bad state after a while, and sometimes transitions back. We have not been able to find a pattern, and in one of the earlier cases the debugfs vram_mm file still reported a reasonable size and utilization. The kernels tried so far include amd-staging-drm-next and a released kernel (I need to look up whether it was 4.15 or 4.16, if that information is needed).
Please attach the dmesg output from the affected system, preferably captured while or after the problem occurs.
(In reply to Bas Nieuwenhuizen from comment #0)
> e.g. the last report we had kernel reported total VRAM size of
> 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09
> GiB).
Not the other way around?
One potential issue I can see is that only BOs created with AMDGPU_GEM_CREATE_NO_CPU_ACCESS are counted toward invisible_pin_size. But such BOs can still end up pinned, at least partially, in CPU-visible VRAM, which would throw off the calculation of how much visible VRAM is currently not pinned. One possible solution would be for amdgpu_bo_pin_restricted and amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how much of each of them lies in visible or invisible VRAM.
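A rough sketch of the per-node split suggested above, written as standalone C rather than kernel code. The `vram_node` struct, the `node_vis_size` and `bo_pin_sizes` functions, and the `visible_limit` parameter are hypothetical stand-ins chosen for illustration; the real driver works on its own memory-node structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one contiguous VRAM allocation of a BO. */
struct vram_node {
    uint64_t start; /* byte offset of the node within VRAM */
    uint64_t size;  /* length of the node in bytes */
};

/* Bytes of a single node that lie below the CPU-visible limit. */
static uint64_t node_vis_size(const struct vram_node *node,
                              uint64_t visible_limit)
{
    uint64_t end = node->start + node->size;

    assert(end >= node->start); /* the range must not wrap around */

    if (node->start >= visible_limit)
        return 0;
    return (end < visible_limit ? end : visible_limit) - node->start;
}

/* Walk all nodes of a BO and split its size into visible and invisible bytes. */
static void bo_pin_sizes(const struct vram_node *nodes, size_t count,
                         uint64_t visible_limit,
                         uint64_t *vis, uint64_t *invis)
{
    *vis = 0;
    *invis = 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t v = node_vis_size(&nodes[i], visible_limit);

        *vis += v;
        *invis += nodes[i].size - v;
    }
}
```

Using the same split at pin and unpin time means a BO that straddles the visible boundary adds to and subtracts from each counter by the same amounts, regardless of its creation flags.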
(In reply to Michel Dänzer from comment #1)
> One possible solution for this would be for amdgpu_bo_pin_restricted and
> amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how
> much of each of them lies in visible or invisible VRAM.
We actually have a helper for that in amdgpu_vram_mgr_vis_size().
Apart from that, if the problem only occurs after a certain time, it looks like we have a mismatch between adding the pinned size and subtracting it again. Or, alternatively, some sort of memory corruption. Have you tried running it with KASAN enabled for a while?
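To make the add/subtract mismatch hypothesis concrete: if a 64-bit unsigned counter has more subtracted from it than was added, it wraps around to a value just below 2^64. The following standalone C sketch (the function names `wrap_deficit` and `account_unpin` are hypothetical, not driver code) shows that the total size from comment #0 would correspond to a shortfall of 0x4cf8f1000 bytes, roughly 19.2 GiB, assuming it really is the result of such a wrap.

```c
#include <stdint.h>

/*
 * Distance from v up to 2^64, i.e. how far "below zero" an unsigned
 * 64-bit value would have to go in order to read as v.
 * Unsigned arithmetic wraps modulo 2^64, so this is well defined.
 */
static uint64_t wrap_deficit(uint64_t v)
{
    return (uint64_t)0 - v;
}

/*
 * Hypothetical unpin accounting without an underflow check:
 * subtracting more than was previously added wraps the counter.
 */
static uint64_t account_unpin(uint64_t pin_size, uint64_t bytes)
{
    return pin_size - bytes;
}
```

For example, `account_unpin(4096, 8192)` yields 0xfffffffffffff000 rather than a negative number, which is the same kind of near-2^64 value as the one reported.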
Created attachment 140160: drm/amdgpu: Refactor amdgpu_vram_mgr_bo_sizes helper
Created attachment 140161: drm/amdgpu: Make amdgpu_vram_mgr_bo_sizes accurate in all cases
Created attachment 140162: drm/amdgpu: Add debugging output related to bogus pin_size values
Do these patches help? If not, this patch will hopefully show where we're going wrong.
There is also a bug in amdgpu_bo_unpin():
    r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
    ...
    if (bo->tbo.mem.mem_type == TTM_PL_VRAM) {
There is no guarantee that bo->placement still points to the placement we used for pinning. So we need to do the housekeeping before trying to re-validate the BO.
(In reply to Christian König from comment #6)
> There is no guarantee that bo->placement still points to the placement we
> used for pinning.
I thought about that as well, but I couldn't think of a reason why ttm_bo_validate would move the BO somewhere else. Anyway, better safe than sorry I guess; I'll take care of this as well, thanks.
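The ordering concern from comment #6 can be illustrated with a toy model. Everything below (`toy_bo`, `toy_dev`, `toy_validate`, and the two unpin variants) is hypothetical and only mimics the shape of the problem; it assumes, for the sake of the example, that the placement changed while the BO was pinned, which may or may not happen in the real driver.

```c
#include <stdint.h>

/* Stand-ins for the GTT and VRAM memory domains. */
enum mem_type { MEM_GTT, MEM_VRAM };

struct toy_bo {
    enum mem_type mem_type;  /* where the BO currently lives */
    enum mem_type placement; /* where the next validation will put it */
    uint64_t size;
};

struct toy_dev {
    uint64_t vram_pin_size;  /* bytes currently pinned in VRAM */
};

/* Toy validation: simply moves the BO to its requested placement. */
static void toy_validate(struct toy_bo *bo)
{
    bo->mem_type = bo->placement;
}

static void toy_pin(struct toy_dev *dev, struct toy_bo *bo)
{
    bo->placement = MEM_VRAM;
    toy_validate(bo);
    dev->vram_pin_size += bo->size;
}

/* Validate first, then check where the BO is: misses the subtraction if it moved. */
static void toy_unpin_validate_first(struct toy_dev *dev, struct toy_bo *bo)
{
    toy_validate(bo);
    if (bo->mem_type == MEM_VRAM)
        dev->vram_pin_size -= bo->size;
}

/* Do the housekeeping while the BO is still where it was pinned. */
static void toy_unpin_account_first(struct toy_dev *dev, struct toy_bo *bo)
{
    if (bo->mem_type == MEM_VRAM)
        dev->vram_pin_size -= bo->size;
    toy_validate(bo);
}
```

In this model, if `placement` is changed to `MEM_GTT` between pin and unpin, the validate-first variant leaves `vram_pin_size` at the BO's size, while the account-first variant returns it to zero.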
BTW, Christian, do you think the pin_size values should be atomic, like the usage ones?
Created attachment 140175: drm/amdgpu: Add debugging output related to bogus pin_size values
Debugging patch rebased on top of https://patchwork.freedesktop.org/series/44837/ .
This series should fix it: https://patchwork.freedesktop.org/series/46325/
Thanks for the report. Should be fixed with the following commits in 4.19, which are being backported to stable branches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ddc21af4d0f37f42b33c54cb69b215997fe5b082
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a5ccfe5c20740f2fbf00291490cdf8d2373ec255
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15e6b76880e65be24250e30986084b5569b7a06f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=456607d816d89a442a3d5ec98b02c8bc950b5228
Sounds like we can close this one now.