Created attachment 144518 [details]
dmesg (kernel options used: drm.debug=0x7)
Since d6edccee8da38d4802020d5aa4d9e11bb7aae801 GDM + wayland won't start on my machines.
The cursor appears but I can't move it and the background stays black.
* mutter --wayland -r --no-x11 fails in the same way
* GDM with x11 still works correctly.
* using master and reverting a7ecf78b900c28aafdc1cd1e1a4117feb30a66c9, 4e3297f7d4d87618bf896ac503e1f036a7b6befb and d6edccee8da38d4802020d5aa4d9e11bb7aae801 fixes the problem (the first two commits don't cause the problem themselves, but depend on the problematic one).
* mutter: 3.30.2-7 (from Debian)
* mesa / drm : master
* linux: 40cc64619a2580b (from amd-staging-drm-next)
Created attachment 144519 [details]
The content of /sys/kernel/debug/dri/0/amdgpu_gem_info while mutter is running
I just ran into this as well yesterday.
The problem seems to be that mutter ends up using a DRM render node FD for GBM/drawing, so the GEM BOs it creates do not exist in the namespace of the DRM card node FD used for KMS, and mutter ends up unable to display anything.
mutter does pass in the KMS FD via the EGL_DRM_MASTER_FD_EXT attribute.
It seems mutter probes the EGLDevice codepath first - eglGetPlatformDisplay + eglInitialize - and then checks for the required extensions.
In case they're not found, eglTerminate is not called.
Thus when the GBM codepath is hit, due to the libdrm_amdgpu amdgpu_device cache, we end up reusing the amdgpu_device created for the EGLDevice (EGL_EXT_platform_device) path, which seemingly does not work.
I'm working on a mutter MR and will add the links shortly. Including backports all the way back to 3.30.
Created attachment 144522 [details] [review]
amdgpu: Compare DRM node type as well in fd_compare
It breaks because the cached amdgpu_device unavoidably uses a different DRM file descriptor from the one mutter uses for KMS (which it gets independently from libdrm(_amdgpu)/EGL/GBM), so GEM objects created in the former aren't visible in the latter.
The libdrm_amdgpu patch I attached makes this particular case work, because the two file descriptors are for different DRM node types (render vs. card). But it could still break if the cached amdgpu_device is for the same node type as the new file descriptor passed in.
Christian, can you remind us in what cases re-using a cached amdgpu_device is required?
AFAICT the original idea behind the amdgpu_device cache is to allow sharing resources between primary/render node of the same device.
As-is, the patch will break that, effectively making the handle to/from fd handling (et al.) dead code.
Then again I don't know the AMDGPU stack that well to see where this would be a problem.
Submitted fix for Mutter:
Given the above, I'm closing this as NOTOURBUG.
Fwiw I don't have a strong preference for/against the libdrm_amdgpu patch, although it "feels" wrong.
(In reply to Michel Dänzer from comment #4)
> Created attachment 144522 [details] [review] [review]
> amdgpu: Compare DRM node type as well in fd_compare
Using this patch I can start GDM + wayland properly but all glx stuff seems broken (despite having Xwayland running). For instance:
$ glxinfo
name of display: :0
Error: couldn't find RGB GLX visual or fbconfig
Does it work for you?
(In reply to Pierre-Eric Pelloux-Prayer from comment #8)
> (In reply to Michel Dänzer from comment #4)
> > Created attachment 144522 [details] [review] [review] [review]
> > amdgpu: Compare DRM node type as well in fd_compare
> Using this patch I can start GDM + wayland properly but all glx stuff seems
> broken (despite having Xwayland running). For instance:
> $ glxinfo
> name of display: :0
> Error: couldn't find RGB GLX visual or fbconfig
> Does it work for you?
Sorry, this was an incorrect setup on my machine.
You can add "Tested-by: Pierre-Eric Pelloux-Prayer <email@example.com>" on your patch if you decide to merge it.
Thanks for the mutter fix Emil, but breakage like this could also be triggered under different circumstances, without the higher layers doing anything wrong.
I'm exploring an idea for a solution between radeonsi and libdrm_amdgpu.
Michel - fully agreed. Hope you find a way to preserve the original cool idea while preventing this bug.
Out of curiosity - please CC me on the fix.
(In reply to Michel Dänzer from comment #5)
> Christian, can you remind us in what cases re-using a cached amdgpu_device
> is required?
This is to allow for correct DRI2 synchronization, as well as sharing the same VM among different client libraries.
Using the same VM stems from our closed source guys wanting to use the same virtual address for a buffer in different clients. That never materialized, so you can pretty much ignore it.
But the DRI2 synchronization issue is absolutely real, and because of it we intentionally always use the same device for all the different kinds of file descriptors passed in.
I think the underlying problem is that amdgpu_bo_export() doesn't know which file descriptor a handle of type amdgpu_bo_handle_type_kms should be valid on.
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1226 fixes this.
The fix is merged to Mesa master, thanks for the report and testing.