Bug 110903

Summary: Can't start mutter or GDM with wayland
Product: Mesa Reporter: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer>
Component: Drivers/Gallium/radeonsi Assignee: Michel Dänzer <michel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: ckoenig.leichtzumerken, emil.l.velikov
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: dmesg (kernel options used: drm.debug=0x7)
The content of /sys/kernel/debug/dri/0/amdgpu_gem_info while mutter is running
amdgpu: Compare DRM node type as well in fd_compare

Description Pierre-Eric Pelloux-Prayer 2019-06-12 09:57:10 UTC
Created attachment 144518 [details]
dmesg (kernel options used: drm.debug=0x7)

Since d6edccee8da38d4802020d5aa4d9e11bb7aae801 GDM + wayland won't start on my machines.
The cursor appears but I can't move it and the background stays black.

Observations:
* mutter --wayland -r --no-x11 fails in the same way
* GDM with x11 still works correctly.
* using master and reverting a7ecf78b900c28aafdc1cd1e1a4117feb30a66c9, 4e3297f7d4d87618bf896ac503e1f036a7b6befb and d6edccee8da38d4802020d5aa4d9e11bb7aae801 fixes the problem (the first two commits don't cause the problem but depend on the problematic one).

Versions:
* mutter: 3.30.2-7 (from Debian)
* mesa / drm : master
* linux: 40cc64619a2580b (from amd-staging-drm-next)
Comment 1 Pierre-Eric Pelloux-Prayer 2019-06-12 09:57:47 UTC
Created attachment 144519 [details]
The content of /sys/kernel/debug/dri/0/amdgpu_gem_info while mutter is running
Comment 2 Michel Dänzer 2019-06-12 10:58:51 UTC
I just ran into this as well yesterday.

The problem seems to be that mutter ends up using a DRM render node FD for GBM/drawing, so the GEM BOs created with that do not exist in the DRM card node FD used for KMS, so mutter ends up unable to display anything.

mutter does pass in the KMS FD via the EGL_DRM_MASTER_FD_EXT attribute.
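To illustrate why a buffer created through one DRM FD is invisible through another, here is a toy model in plain C. It makes no real DRM calls; toy_drm_file, toy_bo and both helper functions are hypothetical names invented for this sketch. The only thing it is meant to show is that GEM handles live in a per-open-file table, so a handle number obtained from the render node FD means nothing to the card node FD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES 8

/* Toy stand-in for a buffer object owned by the kernel (hypothetical). */
struct toy_bo {
    int id;
};

/* Toy stand-in for one open DRM file: each one has its own handle table. */
struct toy_drm_file {
    struct toy_bo *handles[MAX_HANDLES]; /* slot i holds handle i + 1 */
};

/* Create a handle for bo in this file only; returns 0 if the table is full. */
static uint32_t toy_handle_create(struct toy_drm_file *file, struct toy_bo *bo)
{
    assert(file && bo);
    for (uint32_t i = 0; i < MAX_HANDLES; i++) {
        if (!file->handles[i]) {
            file->handles[i] = bo;
            return i + 1; /* 0 is reserved as the invalid handle */
        }
    }
    return 0;
}

/* Look a handle up in this file's table; NULL if it does not exist here. */
static struct toy_bo *toy_handle_lookup(const struct toy_drm_file *file,
                                        uint32_t handle)
{
    if (handle == 0 || handle > MAX_HANDLES)
        return NULL;
    return file->handles[handle - 1];
}
```

In this model, a handle created on a "render" toy_drm_file resolves there but returns NULL when looked up on a separate "card" toy_drm_file, until the same buffer is given its own handle in the second table. That second step is roughly what a proper import into the KMS FD would have to achieve.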
Comment 3 Emil Velikov 2019-06-12 15:30:48 UTC
Seems like mutter checks the EGLDevice codepath - eglGetPlatformDisplay + eglInitialize and then checks the required extensions.

In case they're not found, eglTerminate is not called.

Thus, when the GBM codepath is hit, the libdrm_amdgpu amdgpu_device cache means we end up using the EGLDevice (EGL_EXT_platform_device) amdgpu_device, which seemingly does not work.
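The interaction between the missing eglTerminate and the device cache can be sketched with a small refcounted cache. This is a toy model under stated assumptions, not the libdrm_amdgpu implementation: toy_device, toy_device_get and toy_device_put are hypothetical names, the FD numbers are made up, and the cache is keyed on a plain gpu_id integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached amdgpu_device. */
struct toy_device {
    int gpu_id;   /* identifies the physical GPU */
    int fd;       /* the FD the device was first created with */
    int refcount; /* 0 means the slot is free */
};

#define CACHE_SLOTS 4
static struct toy_device cache[CACHE_SLOTS];

/* Return the live cached device for gpu_id, or create one bound to fd. */
static struct toy_device *toy_device_get(int gpu_id, int fd)
{
    struct toy_device *free_slot = NULL;

    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].refcount > 0 && cache[i].gpu_id == gpu_id) {
            cache[i].refcount++;
            return &cache[i]; /* cache hit: the caller's fd is ignored */
        }
        if (cache[i].refcount == 0 && !free_slot)
            free_slot = &cache[i];
    }
    if (!free_slot)
        return NULL;

    free_slot->gpu_id = gpu_id;
    free_slot->fd = fd;
    free_slot->refcount = 1;
    return free_slot;
}

/* Drop one reference; the slot becomes reusable when the count hits 0. */
static void toy_device_put(struct toy_device *dev)
{
    assert(dev && dev->refcount > 0);
    dev->refcount--;
}
```

If the probing codepath takes a reference with one FD and never calls toy_device_put (the analogue of skipping eglTerminate), a later toy_device_get for the same GPU with a different FD gets the old device back, still bound to the first FD. Releasing the probe reference first lets the second caller create a device bound to its own FD.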

I'm working on a mutter MR and will add the links shortly. Including backports all the way back to 3.30.
Comment 4 Michel Dänzer 2019-06-12 16:01:10 UTC
Created attachment 144522 [details] [review]
amdgpu: Compare DRM node type as well in fd_compare
Comment 5 Michel Dänzer 2019-06-12 16:15:24 UTC
It breaks because the cached amdgpu_device unavoidably uses a different DRM file descriptor from the one mutter uses for KMS (which it gets independently from libdrm(_amdgpu)/EGL/GBM), so GEM objects created in the former aren't visible in the latter.

The libdrm_amdgpu patch I attached makes this particular case work, because the two file descriptors are for different DRM node types (render vs. card). But it could still break if the cached amdgpu_device is for the same node type as the new file descriptor passed in.
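A minimal sketch of the idea behind comparing the node type, assuming the usual Linux DRM minor numbering (minors 0-63 are primary/card nodes, 64-127 control nodes, 128-191 render nodes). It is not the attached patch: toy_node, toy_node_type_from_minor and toy_fd_compare are hypothetical names, and the GPU is identified by a plain integer rather than a device name.

```c
#include <assert.h>

/* Node types, numbered the way the minor ranges are laid out. */
enum toy_node_type {
    TOY_NODE_INVALID = -1,
    TOY_NODE_PRIMARY = 0, /* /dev/dri/cardN */
    TOY_NODE_CONTROL = 1,
    TOY_NODE_RENDER  = 2  /* /dev/dri/renderDN */
};

/* Derive the node type from the device minor (each type spans 64 minors). */
static int toy_node_type_from_minor(int minor)
{
    if (minor < 0)
        return TOY_NODE_INVALID;

    int type = minor >> 6;
    return type <= TOY_NODE_RENDER ? type : TOY_NODE_INVALID;
}

/* Hypothetical description of an open DRM node. */
struct toy_node {
    int gpu_id; /* which physical GPU the node belongs to */
    int minor;  /* device minor number of the node */
};

/*
 * Comparator in the style of a cache lookup: returns 0 when the cached
 * device may be reused for the new node, non-zero otherwise. Reuse needs
 * the same GPU and, with the extra check, the same node type.
 */
static int toy_fd_compare(const struct toy_node *a, const struct toy_node *b)
{
    assert(a && b);
    if (a->gpu_id != b->gpu_id)
        return 1;
    return toy_node_type_from_minor(a->minor) !=
           toy_node_type_from_minor(b->minor);
}
```

With this comparator a card node and a render node of the same GPU no longer match, which covers the mutter case. Two different file descriptors for the same node type still compare equal, which is the remaining failure mode described above.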

Christian, can you remind us in what cases re-using a cached amdgpu_device is required?
Comment 6 Emil Velikov 2019-06-12 16:47:45 UTC
AFAICT the original idea behind the amdgpu_device cache is to allow sharing resources between primary/render node of the same device.

As-is the patch will break that, while effectively making the handle to/from fd handling (et al) dead code.

Then again I don't know the AMDGPU stack that well to see where this would be a problem.
Comment 7 Emil Velikov 2019-06-12 17:40:29 UTC
Submitted fix for Mutter:
https://gitlab.gnome.org/GNOME/mutter/merge_requests/619

As of the above, I'm closing this as NOTOURBUG.

Fwiw I don't have a strong preference for/against the libdrm_amdgpu patch, although it "feels" wrong.
Comment 8 Pierre-Eric Pelloux-Prayer 2019-06-12 17:48:09 UTC
(In reply to Michel Dänzer from comment #4)
> Created attachment 144522 [details] [review]
> amdgpu: Compare DRM node type as well in fd_compare

Using this patch I can start GDM + wayland properly but all glx stuff seems broken (despite having Xwayland running). For instance:

$ glxinfo
name of display: :0
Error: couldn't find RGB GLX visual or fbconfig

Does it work for you?
Comment 9 Pierre-Eric Pelloux-Prayer 2019-06-13 12:28:10 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #8)
> (In reply to Michel Dänzer from comment #4)
> > Created attachment 144522 [details] [review]
> > amdgpu: Compare DRM node type as well in fd_compare
> 
> Using this patch I can start GDM + wayland properly but all glx stuff seems
> broken (despite having Xwayland running). For instance:
> 
> $ glxinfo
> name of display: :0
> Error: couldn't find RGB GLX visual or fbconfig
> 
> Does it work for you?

Sorry, this was an incorrect setup on my machine.

You can add "Tested-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>" on your patch if you decide to merge it.
Comment 10 Michel Dänzer 2019-06-13 14:59:09 UTC
Thanks for the mutter fix Emil, but breakage like this could also be triggered under different circumstances, without the higher layers doing anything wrong.

I'm exploring an idea for a solution between radeonsi and libdrm_amdgpu.
Comment 11 Emil Velikov 2019-06-13 16:10:40 UTC
Michel - fully agreed. Hope you find a way to preserve the original cool idea while preventing this bug.

Out of curiosity - please CC me on the fix.
Comment 12 Christian König 2019-06-14 16:07:06 UTC
(In reply to Michel Dänzer from comment #5)
> Christian, can you remind us in what cases re-using a cached amdgpu_device
> is required?

This is to allow for correct DRI2 synchronization as well as sharing the same VM across different client libraries.

The reason for sharing the same VM was that our closed source guys wanted to use the same virtual address for a buffer in different clients. That never materialized, so you can pretty much ignore this.

But the DRI2 synchronization issue is absolutely real, and because of this we intentionally always use the same device for all the different kinds of file descriptors passed in.

I think the underlying problem is that amdgpu_bo_export() doesn't know which file descriptor a handle of type amdgpu_bo_handle_type_kms should be valid on.
Comment 13 Michel Dänzer 2019-07-02 10:41:34 UTC
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1226 fixes this.
Comment 14 Michel Dänzer 2019-07-03 09:52:18 UTC
The fix is merged to Mesa master, thanks for the report and testing.
