Bug 106872

Summary:

vram sizes reported by the kernel totally off

Product:

DRI

Reporter:

Bas Nieuwenhuizen <bas>

Component:

DRM/AMDgpu

Assignee:

Michel Dänzer <michel>

Status:

CLOSED FIXED

QA Contact:

Default DRI bug account <dri-devel>

Severity:

normal

Priority:

medium

CC:

ckoenig.leichtzumerken

Version:

unspecified

Hardware:

Other

OS:

All

Attachments:

Description	Flags
drm/amdgpu: Refactor amdgpu_vram_mgr_bo_sizes helper	none
drm/amdgpu: Make amdgpu_vram_mgr_bo_sizes accurate in all cases	none
drm/amdgpu: Add debugging output related to bogus pin_size values	none
drm/amdgpu: Add debugging output related to bogus pin_size values	none

Description Bas Nieuwenhuizen 2018-06-09 22:47:28 UTC

On Pierre-Loup's desktop we have an issue where the reported total VRAM size and CPU-visible VRAM size are totally off.

For example, in the last report we had, the kernel reported a total VRAM size of 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09 GiB).

The system is a large-aperture system, so the expected value is ~8 GiB for both, and that is what is reported most of the time. The system typically boots in a good state, transitions to a bad state after a while, and sometimes transitions back.
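
One way to read the 16.0 EiB figure, offered as an assumption rather than a confirmed diagnosis, is as an unsigned 64-bit subtraction that wrapped around: 0xfffffffb3070f000 is exactly 2^64 minus 0x4cf8f1000 bytes (roughly 19.2 GiB). The sketch below shows that arithmetic in plain C. The function names are hypothetical and are not taken from the amdgpu driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper: "remaining = total - pinned" computed in unsigned
 * 64-bit arithmetic. If pinned > total, the result wraps modulo 2^64 and
 * shows up as an absurdly large size instead of a negative one.
 */
static uint64_t unpinned_bytes(uint64_t total, uint64_t pinned)
{
    return total - pinned;
}

/*
 * Hypothetical helper: how far below zero a wrapped value is, i.e.
 * 2^64 - reported. Unsigned negation is well defined in C.
 */
static uint64_t wrap_deficit(uint64_t reported)
{
    return UINT64_C(0) - reported;
}
```

Under that assumption, wrap_deficit(0xfffffffb3070f000) yields 0x4cf8f1000, and subtracting a pinned total of 8 GiB plus that deficit from an 8 GiB total reproduces the reported value.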

We have not been able to find a pattern, and in one of the earlier cases the debugfs vram_mm file reported a reasonable size and utilization.

The kernels that have been tried include amd-staging-drm-next and a released kernel (I need to look up whether it was 4.15 or 4.16, if that information is needed).

Comment 1 Michel Dänzer 2018-06-11 17:15:34 UTC

Please attach the dmesg output from the affected system, preferably captured while or after the problem occurs.

(In reply to Bas Nieuwenhuizen from comment #0)
> e.g. the last report we had kernel reported total VRAM size of
> 0xfffffffb3070f000 (16.0 EiB) and a visible VRAM size of 0x86135000 (2.09
> GiB).

Not the other way around?

One potential issue I can see is that only BOs with AMDGPU_GEM_CREATE_NO_CPU_ACCESS are accounted for in invisible_pin_size. But such BOs can still end up pinned at least partially in CPU-visible VRAM, which would mess up the calculation of how much visible VRAM currently isn't pinned.

One possible solution for this would be for amdgpu_bo_pin_restricted and amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how much of each of them lies in visible or invisible VRAM.
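
The per-node calculation described above can be sketched as follows. This is a minimal illustration in plain C, not the driver's code: struct vram_node, struct vram_split, and both functions are hypothetical, and sizes are tracked in bytes for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one contiguous VRAM allocation node. */
struct vram_node {
    uint64_t start; /* byte offset of the node within VRAM */
    uint64_t size;  /* length of the node in bytes */
};

/* Hypothetical result type: bytes on each side of the visible limit. */
struct vram_split {
    uint64_t visible;
    uint64_t invisible;
};

/* Bytes of [start, start + size) that lie below the CPU-visible limit. */
static uint64_t node_visible_bytes(const struct vram_node *node,
                                   uint64_t visible_limit)
{
    uint64_t end = node->start + node->size;

    if (node->start >= visible_limit)
        return 0;
    if (end > visible_limit)
        end = visible_limit; /* node straddles the limit: clamp it */
    return end - node->start;
}

/* Walk all nodes of a buffer and split its size exactly. */
static struct vram_split bo_split(const struct vram_node *nodes, size_t count,
                                  uint64_t visible_limit)
{
    struct vram_split split = { 0, 0 };

    for (size_t i = 0; i < count; i++) {
        uint64_t vis = node_visible_bytes(&nodes[i], visible_limit);

        split.visible += vis;
        split.invisible += nodes[i].size - vis;
    }
    return split;
}
```

Adding the same split on pin and subtracting it on unpin keeps both counters exact even when a buffer straddles the visible limit.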

Comment 2 Christian König 2018-06-12 06:59:07 UTC

(In reply to Michel Dänzer from comment #1)
> One possible solution for this would be for amdgpu_bo_pin_restricted and
> amdgpu_bo_unpin to walk the list of memory nodes and calculate exactly how
> much of each of them lies in visible or invisible VRAM.

We actually have a helper for that in amdgpu_vram_mgr_vis_size().

Apart from that, if the problem only occurs after a certain time, it looks like we have a mismatch between adding the pinned size and subtracting it again. Alternatively, it could be some sort of memory corruption.

Have you tried running it with KASAN enabled for a while?

Comment 3 Michel Dänzer 2018-06-14 15:47:51 UTC

Created attachment 140160
drm/amdgpu: Refactor amdgpu_vram_mgr_bo_sizes helper

Comment 4 Michel Dänzer 2018-06-14 15:48:24 UTC

Created attachment 140161
drm/amdgpu: Make amdgpu_vram_mgr_bo_sizes accurate in all cases

Comment 5 Michel Dänzer 2018-06-14 15:50:25 UTC

Created attachment 140162
drm/amdgpu: Add debugging output related to bogus pin_size values

Do these patches help? If not, this patch will hopefully show where we're going wrong.

Comment 6 Christian König 2018-06-14 19:44:29 UTC

There is also a bug in amdgpu_bo_unpin():

r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
...
if (bo->tbo.mem.mem_type == TTM_PL_VRAM) {

There is no guarantee that bo->placement still points to the placement we used for pinning.

So we need to do the housekeeping before trying to re-validate the BO.
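
The ordering argued for here can be sketched in plain C. Everything below is hypothetical (the fake_* types and functions are illustrative stand-ins, not the amdgpu or TTM API); the only point is that the pin-size bookkeeping runs while the buffer's recorded location still matches where it was pinned, and re-validation happens afterwards.

```c
#include <stdint.h>

/* Hypothetical memory domains. */
enum fake_mem_type { FAKE_MEM_SYSTEM, FAKE_MEM_VRAM };

/* Hypothetical buffer object. */
struct fake_bo {
    enum fake_mem_type mem_type; /* where the buffer currently lives */
    uint64_t size;               /* size in bytes */
    unsigned int pin_count;      /* number of outstanding pins */
};

/* Hypothetical per-device accounting. */
struct fake_dev {
    uint64_t vram_pin_size;      /* bytes currently pinned in VRAM */
};

/* Stand-in for re-validation, which may move the buffer elsewhere. */
static void fake_revalidate(struct fake_bo *bo, enum fake_mem_type target)
{
    bo->mem_type = target;
}

static void fake_unpin(struct fake_dev *dev, struct fake_bo *bo,
                       enum fake_mem_type target)
{
    if (bo->pin_count == 0 || --bo->pin_count != 0)
        return; /* not pinned, or still pinned by someone else */

    /* Housekeeping first: mem_type still reflects the pinned location. */
    if (bo->mem_type == FAKE_MEM_VRAM)
        dev->vram_pin_size -= bo->size;

    /* Only now let the buffer be re-validated and possibly moved. */
    fake_revalidate(bo, target);
}
```

If the check on mem_type ran after fake_revalidate() had moved the buffer to system memory, the subtraction would be skipped and vram_pin_size would keep a stale contribution.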

Comment 7 Michel Dänzer 2018-06-15 08:20:33 UTC

(In reply to Christian König from comment #6)
> There is no guarantee that bo->placement still points to the placement we
> used for pinning.

I thought about that as well, but I couldn't think of a reason why ttm_bo_validate would move the BO somewhere else. Anyway, better safe than sorry I guess; I'll take care of this as well, thanks.

Comment 8 Michel Dänzer 2018-06-15 09:06:09 UTC

BTW, Christian, do you think the pin_size values should be atomic, like the usage ones?
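
For illustration only, an atomic pin-size counter could look like the userspace sketch below, written with C11 <stdatomic.h>. The struct and function names are hypothetical; kernel code would use the kernel's own atomic64_t primitives rather than these, and this thread does not record which approach was chosen.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical container for an atomically updated pin-size counter. */
struct pin_stats {
    atomic_uint_fast64_t vram_pin_size; /* bytes currently pinned */
};

/* Account for a newly pinned buffer. */
static void pin_stats_add(struct pin_stats *stats, uint64_t bytes)
{
    atomic_fetch_add_explicit(&stats->vram_pin_size, bytes,
                              memory_order_relaxed);
}

/* Remove a buffer's contribution when it is unpinned. */
static void pin_stats_sub(struct pin_stats *stats, uint64_t bytes)
{
    atomic_fetch_sub_explicit(&stats->vram_pin_size, bytes,
                              memory_order_relaxed);
}

/* Read the current total without taking a lock. */
static uint64_t pin_stats_read(struct pin_stats *stats)
{
    return atomic_load_explicit(&stats->vram_pin_size,
                                memory_order_relaxed);
}
```

Atomic updates prevent lost increments when pin and unpin race, but they do not by themselves fix a mismatch between the amount added and the amount subtracted.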

Comment 9 Michel Dänzer 2018-06-15 14:49:35 UTC

Created attachment 140175
drm/amdgpu: Add debugging output related to bogus pin_size values

Debugging patch rebased on top of https://patchwork.freedesktop.org/series/44837/ .

Comment 10 Michel Dänzer 2018-07-11 16:24:59 UTC

This series should fix it: https://patchwork.freedesktop.org/series/46325/

Comment 11 Michel Dänzer 2018-09-13 13:47:32 UTC

Thanks for the report. Should be fixed with the following commits in 4.19, which are being backported to stable branches:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ddc21af4d0f37f42b33c54cb69b215997fe5b082
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a5ccfe5c20740f2fbf00291490cdf8d2373ec255
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15e6b76880e65be24250e30986084b5569b7a06f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=456607d816d89a442a3d5ec98b02c8bc950b5228

Comment 12 Christian König 2018-09-13 13:51:53 UTC

Sounds like we can close this one now.
