Bug 106872

Summary: vram sizes reported by the kernel totally off
Product: DRI Reporter: Bas Nieuwenhuizen <bas>
Component: DRM/AMDgpu Assignee: Michel Dänzer <michel>
Status: CLOSED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: ckoenig.leichtzumerken
Version: unspecified   
Hardware: Other   
OS: All   
Attachments (description; flags):
drm/amdgpu: Refactor amdgpu_vram_mgr_bo_sizes helper; none
drm/amdgpu: Make amdgpu_vram_mgr_bo_sizes accurate in all cases; none
drm/amdgpu: Add debugging output related to bogus pin_size values; none
drm/amdgpu: Add debugging output related to bogus pin_size values; none

Description Bas Nieuwenhuizen 2018-06-09 22:47:28 UTC
On Pierre-Loup's desktop we have an issue where the reported VRAM and CPU visible VRAM sizes are totally off.

For example, in the last report we had, the kernel reported a total VRAM size of 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09 GiB).

The system is a large aperture system, so the expected value is ~8 GiB for both, and that is what is reported most of the time. The system typically boots in a good state, transitions to a bad state after a while, and sometimes transitions back.
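For reference, the 16.0 EiB figure looks like an unsigned wraparound rather than a real size. The sketch below assumes (this is not confirmed anywhere in this report) that the reported size is computed as the real VRAM size minus a pinned-size counter in unsigned 64-bit arithmetic. `reported_vram` and `as_signed` are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical model (an assumption, not the driver's code): the reported
 * size is total VRAM minus a pinned-size counter, both unsigned 64-bit.
 */
static uint64_t reported_vram(uint64_t total, uint64_t pinned)
{
    /* Wraps modulo 2^64 when pinned > total. */
    return total - pinned;
}

/*
 * Reinterpret a possibly wrapped value as signed two's complement, without
 * relying on implementation-defined conversions, to see the deficit.
 */
static int64_t as_signed(uint64_t v)
{
    if (v <= (uint64_t)INT64_MAX)
        return (int64_t)v;
    return -(int64_t)(~v) - 1;
}
```

Under that assumption, `as_signed(0xfffffffb3070f000)` is -20662128640, i.e. the counter being subtracted would have exceeded the real VRAM size by roughly 19.2 GiB.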

We have not been able to find a pattern, and in one of the earlier cases the debugfs vram_mm file reported a reasonable size and utilization.

The kernels that have been tried include amd-staging-drm-next and a released kernel (I would need to look up whether it was 4.15 or 4.16, if that information is needed).
Comment 1 Michel Dänzer 2018-06-11 17:15:34 UTC
Please attach the dmesg output from the affected system, preferably captured while or after the problem occurs.

(In reply to Bas Nieuwenhuizen from comment #0)
> e.g. the last report we had kernel reported total VRAM size of
> 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09
> GiB).

Not the other way around?

One potential issue I can see is that only BOs with AMDGPU_GEM_CREATE_NO_CPU_ACCESS are counted towards invisible_pin_size. But such BOs can still end up pinned at least partially in CPU visible VRAM, which would mess up the calculation of how much visible VRAM currently isn't pinned.

One possible solution for this would be for amdgpu_bo_pin_restricted and amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how much of each of them lies in visible or invisible VRAM.
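A minimal userspace sketch of that idea, for illustration only: `struct vram_node`, `node_vis_size` and `bo_pin_sizes` are hypothetical names, and the model assumes CPU visible VRAM is the range from offset 0 up to `visible_limit`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one contiguous VRAM memory node of a BO. */
struct vram_node {
    uint64_t start; /* byte offset into VRAM */
    uint64_t size;  /* length in bytes */
};

/* Bytes of one node that lie below the CPU visible limit. */
static uint64_t node_vis_size(const struct vram_node *n, uint64_t visible_limit)
{
    if (n->start >= visible_limit)
        return 0;

    uint64_t end = n->start + n->size;

    return (end > visible_limit ? visible_limit : end) - n->start;
}

/* Walk all nodes and split the BO size into visible and invisible parts. */
static void bo_pin_sizes(const struct vram_node *nodes, size_t count,
                         uint64_t visible_limit,
                         uint64_t *vis, uint64_t *invis)
{
    *vis = 0;
    *invis = 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t v = node_vis_size(&nodes[i], visible_limit);

        *vis += v;
        *invis += nodes[i].size - v;
    }
}
```

With this split, pin and unpin would add and subtract the same two numbers regardless of the BO's creation flags, so a BO that straddles the visible boundary is accounted for on both sides.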
Comment 2 Christian König 2018-06-12 06:59:07 UTC
(In reply to Michel Dänzer from comment #1)
> One possible solution for this would be for amdgpu_bo_pin_restricted and
> amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how
> much of each of them lies in visible or invisible VRAM.

We actually have a helper for that in amdgpu_vram_mgr_vis_size().

Apart from that, if the problem only occurs after a certain time, it looks like we have a mismatch between adding the pinned size and subtracting it again. Or alternatively some sort of memory corruption.

Have you tried running it with KASAN enabled for a while?
Comment 3 Michel Dänzer 2018-06-14 15:47:51 UTC
Created attachment 140160
drm/amdgpu: Refactor amdgpu_vram_mgr_bo_sizes helper
Comment 4 Michel Dänzer 2018-06-14 15:48:24 UTC
Created attachment 140161
drm/amdgpu: Make amdgpu_vram_mgr_bo_sizes accurate in all cases
Comment 5 Michel Dänzer 2018-06-14 15:50:25 UTC
Created attachment 140162
drm/amdgpu: Add debugging output related to bogus pin_size values

Do these patches help? If not, this patch will hopefully show where we're going wrong.
Comment 6 Christian König 2018-06-14 19:44:29 UTC
There is also a bug in amdgpu_bo_unpin():

r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
...
if (bo->tbo.mem.mem_type == TTM_PL_VRAM) {

There is no guarantee that bo->placement still points to the placement we used for pinning.

So we need to do the housekeeping before trying to re-validate the BO.
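A toy model of that ordering, not the actual amdgpu code: every name below (`model_bo`, `model_dev`, `model_validate`, `model_pin`, `model_unpin`) is hypothetical, and the BO's location is reduced to a single enum.

```c
#include <stdint.h>

/* Hypothetical, simplified placements. */
enum placement { PL_SYSTEM, PL_VRAM };

struct model_bo {
    enum placement mem_type; /* where the BO currently lives */
    uint64_t size;
    unsigned pin_count;
};

struct model_dev {
    uint64_t vram_pin_size;  /* bytes currently pinned in VRAM */
};

/* Stand-in for re-validation: it may move the BO somewhere else. */
static void model_validate(struct model_bo *bo, enum placement target)
{
    bo->mem_type = target;
}

static void model_pin(struct model_dev *dev, struct model_bo *bo)
{
    if (bo->pin_count++ == 0 && bo->mem_type == PL_VRAM)
        dev->vram_pin_size += bo->size;
}

static void model_unpin(struct model_dev *dev, struct model_bo *bo,
                        enum placement revalidate_to)
{
    if (bo->pin_count == 0 || --bo->pin_count != 0)
        return;

    /* Housekeeping first: mem_type still says where the BO was pinned. */
    if (bo->mem_type == PL_VRAM)
        dev->vram_pin_size -= bo->size;

    /* Only now re-validate, which is allowed to move the BO. */
    model_validate(bo, revalidate_to);
}
```

If the subtraction ran after `model_validate` and the BO had moved out of VRAM, the pinned bytes would never be subtracted, which is the kind of add/subtract mismatch that accumulates over time.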
Comment 7 Michel Dänzer 2018-06-15 08:20:33 UTC
(In reply to Christian König from comment #6)
> There is no guarantee that bo->placement still points to the placement we
> used for pinning.

I thought about that as well, but I couldn't think of a reason why ttm_bo_validate would move the BO somewhere else. Anyway, better safe than sorry I guess; I'll take care of this as well, thanks.
Comment 8 Michel Dänzer 2018-06-15 09:06:09 UTC
BTW, Christian, do you think the pin_size values should be atomic, like the usage ones?
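To make the question concrete, here is a userspace sketch using C11 atomics as a stand-in; in the kernel this would presumably be atomic64_t with atomic64_add() and atomic64_sub(). The struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counters, updated without holding a common lock. */
struct pin_stats {
    _Atomic uint64_t vram_pin_size;
    _Atomic uint64_t visible_pin_size;
};

/* Account for a newly pinned BO: total bytes and the CPU visible part. */
static void pin_account(struct pin_stats *s, uint64_t total, uint64_t vis)
{
    atomic_fetch_add(&s->vram_pin_size, total);
    atomic_fetch_add(&s->visible_pin_size, vis);
}

/* Undo the accounting with exactly the values that were added. */
static void unpin_account(struct pin_stats *s, uint64_t total, uint64_t vis)
{
    atomic_fetch_sub(&s->vram_pin_size, total);
    atomic_fetch_sub(&s->visible_pin_size, vis);
}

/* Visible VRAM that is not pinned, given the visible aperture size. */
static uint64_t unpinned_visible(struct pin_stats *s, uint64_t visible_limit)
{
    return visible_limit - atomic_load(&s->visible_pin_size);
}
```

Each counter update is then safe against concurrent pin and unpin calls, although a reader can still observe the two counters at slightly different points in time.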
Comment 9 Michel Dänzer 2018-06-15 14:49:35 UTC
Created attachment 140175
drm/amdgpu: Add debugging output related to bogus pin_size values

Debugging patch rebased on top of https://patchwork.freedesktop.org/series/44837/ .
Comment 10 Michel Dänzer 2018-07-11 16:24:59 UTC
This series should fix it: https://patchwork.freedesktop.org/series/46325/
Comment 12 Christian König 2018-09-13 13:51:53 UTC
Sounds like we can close this one now.