Bug 111588

Summary: Framebuffer corruption when a fb which is not being scanned out gets removed
Product: DRI Reporter: Hans de Goede <jwrdegoede>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: not set    
Priority: not set    
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Hans de Goede 2019-09-08 10:00:19 UTC
This is a weird issue, which I noticed when working on this plymouth fix:
https://gitlab.freedesktop.org/plymouth/plymouth/merge_requests/59

On boot, to ensure a smooth handover of the framebuffer from plymouth to gdm, the following happens:

1) plymouth starts, does an addfb, becomes master, and sets the fb as the one the crtc should scan out
2) gdm starts and tells plymouth to "deactivate"; plymouth drops master (but does not exit)
3) gdm becomes master and installs its own fb for scanout; the fb being scanned out is now owned by gdm
4) gdm tells plymouth it may quit now
5) plymouth exits without calling rmfb or closing the /dev/dri/card0 fd, relying on the kernel to clean up
6) all is well

The bug fix from the above merge request makes plymouth actually clean up after itself; this is necessary to avoid issues with hot-unplug (see the plymouth MR for details). My first attempt at this simply made plymouth always do the cleanup, both on hot-unplug and on exit, as that was the most straightforward thing to do.
This changes the sequence to:

1-4) Same as above
5) Plymouth internally calls src/plugins/renderers/drm/plugin.c:
   ply_renderer_buffer_free() this does:
    drmModeRmFB(...);
    munmap (buffer->map_address, buffer->map_size);
    destroy_dumb_buffer_request.handle = buffer->handle;
    drmIoctl (fd, DRM_IOCTL_MODE_DESTROY_DUMB, &destroy_dumb_buffer_request);
   Followed by calling close() on the fd.
6) Plymouth exits
7) 5 and/or 6 cause the gdm framebuffer to be all messed up; it looks like a
   wrong pitch or tiling setting

Note that when step 5 is executed, plymouth is no longer master and the fb being removed is no longer being scanned out, so this really should not be able to influence the current kms state, yet it does.

This is 100% reproducible for me with Fedora 30 + master plymouth + 5.3.0-rc7 on a R7 250E (SAPPHIRE Ultimate Radeon R7 250) using the amdgpu driver. I know that the default is to use the radeon driver with the R7 250E, but I was using the amdgpu driver deliberately to reproduce: https://bugzilla.redhat.com/show_bug.cgi?id=1490490

For now I've modified the plymouth fix to only call ply_renderer_buffer_free() + close() on hot-unplug and to still leave cleanup to the kernel on exit, but it would be nice to get to the bottom of this.
Comment 1 Hans de Goede 2019-09-08 12:37:39 UTC
I just realized I left out one bit of info which might be useful. To debug this, I added the following change to the kernel:

diff --git a/drivers/gpu/drm/drm_framebuffer.c b/drivers/gpu/drm/drm_framebuffer.c
index 57564318ceea..4712bfb9ae05 100644
--- a/drivers/gpu/drm/drm_framebuffer.c
+++ b/drivers/gpu/drm/drm_framebuffer.c
@@ -464,6 +464,7 @@ int drm_mode_rmfb(struct drm_device *dev, u32 fb_id,
 	if (drm_framebuffer_read_refcount(fb) > 1) {
 		struct drm_mode_rmfb_work arg;
 
+		pr_err("drm_modr_rmfb calling drm_framebuffer_remove\n");
 		INIT_WORK_ONSTACK(&arg.work, drm_mode_rmfb_work_fn);
 		INIT_LIST_HEAD(&arg.fbs);
 		list_add_tail(&fb->filp_head, &arg.fbs);
@@ -471,8 +472,10 @@ int drm_mode_rmfb(struct drm_device *dev, u32 fb_id,
 		schedule_work(&arg.work);
 		flush_work(&arg.work);
 		destroy_work_on_stack(&arg.work);
-	} else
+	} else {
+		pr_err("drm_modr_rmfb calling drm_framebuffer_put\n");
 		drm_framebuffer_put(fb);
+	}
 
 	return 0;
 
@@ -669,11 +672,13 @@ void drm_fb_release(struct drm_file *priv)
 	 */
 	list_for_each_entry_safe(fb, tfb, &priv->fbs, filp_head) {
 		if (drm_framebuffer_read_refcount(fb) > 1) {
+			pr_err("drm_fb_release calling drm_framebuffer_remove\n");
 			list_move_tail(&fb->filp_head, &arg.fbs);
 		} else {
 			list_del_init(&fb->filp_head);
 
 			/* This drops the fpriv->fbs reference. */
+			pr_err("drm_fb_release calling drm_framebuffer_put\n");
 			drm_framebuffer_put(fb);
 		}
 	}
@@ -863,6 +868,8 @@ static int atomic_remove_fb(struct drm_framebuffer *fb)
 		if (plane->state->fb != fb)
 			continue;
 
+		pr_err("atomic_remove_fb found plane still using fb\n");
+
 		plane_state = drm_atomic_get_plane_state(state, plane);
 		if (IS_ERR(plane_state)) {
 			ret = PTR_ERR(plane_state);


In the working case, i.e. where we let the kernel do the fb cleanup itself, I see:

Plymouth removes the fb it creates to test for 32bpp support:
kernel: drm_modr_rmfb calling drm_framebuffer_put

gdm starts, does page-flipping, resulting in a number of:

kernel: drm_modr_rmfb calling drm_framebuffer_put
kernel: drm_modr_rmfb calling drm_framebuffer_put
...

lines

And then plymouth exits without any cleanup, so we get:
kernel: drm_fb_release calling drm_framebuffer_put

Followed by more:

kernel: drm_modr_rmfb calling drm_framebuffer_put
kernel: drm_modr_rmfb calling drm_framebuffer_put
...

From gdm.

In the broken case, where ply_renderer_buffer_free() gets called on plymouth-quit, I only see:

kernel: drm_modr_rmfb calling drm_framebuffer_put
kernel: drm_modr_rmfb calling drm_framebuffer_put
...

lines, which is expected as the fb is rmfb-ed before the fd is closed.

Note that we never hit:

@@ -863,6 +868,8 @@ static int atomic_remove_fb(struct drm_framebuffer *fb)
 		if (plane->state->fb != fb)
 			continue;
 
+		pr_err("atomic_remove_fb found plane still using fb\n");
+
 		plane_state = drm_atomic_get_plane_state(state, plane);
 		if (IS_ERR(plane_state)) {
 			ret = PTR_ERR(plane_state);


So AFAICT userspace is doing everything correctly even in the broken case.
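
To make the reasoning about the log lines easier to follow, here is a small userspace toy model of the two checks the debug patch instruments. It is only a sketch: the toy_* names are made up for illustration and are not kernel structures or APIs. It assumes nothing beyond what the patch shows, namely that rmfb takes the "remove" path only when the fb refcount is above 1, and that atomic_remove_fb only acts on planes whose current fb is the one being removed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; NOT the real DRM structures. */
struct toy_fb {
    unsigned int refcount;   /* 1 = only the creating file holds a reference */
};

struct toy_plane {
    const struct toy_fb *fb; /* framebuffer currently attached to the plane */
};

enum toy_rmfb_path {
    TOY_RMFB_PUT,     /* logged as "calling drm_framebuffer_put" */
    TOY_RMFB_REMOVE,  /* logged as "calling drm_framebuffer_remove" */
};

/* Mirrors the "refcount > 1" test instrumented in drm_mode_rmfb(). */
enum toy_rmfb_path toy_rmfb_path(const struct toy_fb *fb)
{
    return fb->refcount > 1 ? TOY_RMFB_REMOVE : TOY_RMFB_PUT;
}

/* Counts planes still pointing at fb, like the check in atomic_remove_fb(). */
size_t toy_planes_using_fb(const struct toy_plane *planes, size_t n,
                           const struct toy_fb *fb)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (planes[i].fb == fb)
            count++;
    }
    return count;
}
```

In this model, the plymouth fb at quit time corresponds to refcount 1 with no plane pointing at it, which matches seeing only the "drm_framebuffer_put" message and never the "plane still using fb" one.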
Comment 2 Michel Dänzer 2019-09-09 16:14:58 UTC
(In reply to Hans de Goede from comment #0)
> 5) Plymouth internally calls src/plugins/renderers/drm/plugin.c:
>    ply_renderer_buffer_free() this does:
>     drmModeRmFB(...);
>     munmap (buffer->map_address, buffer->map_size);
>     destroy_dumb_buffer_request.handle = buffer->handle;
>     drmIoctl (fd, DRM_IOCTL_MODE_DESTROY_DUMB, &destroy_dumb_buffer_request);
>    Followed by calling close() on the fd.
> 6) Plymouth exits
> 7) 5 and/or 6 cause the gdm framebuffer being all messed up, it looks like a
>    wrong pitch or tiling setting

It would be interesting if you could further narrow down exactly which step (or even sub-step of 5) triggers the problem.
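
One possible way to do that narrowing, sketched here as plain C: run each sub-step of the teardown separately, log it, and pause so the screen can be checked before the next one runs. This is a hypothetical helper, not plymouth code; teardown_step, run_teardown_steps and count_step are names invented for this sketch, and count_step is only a placeholder where the real drmModeRmFB(), munmap(), DRM_IOCTL_MODE_DESTROY_DUMB and close() calls would be plugged in.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* One sub-step of the teardown; run() returns 0 on success. */
struct teardown_step {
    const char *name;
    int (*run)(void *ctx);
};

/* Placeholder step: only counts invocations. Replace with the real call. */
int count_step(void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    return 0;
}

/*
 * Runs the steps in order, logging each one and sleeping pause_seconds
 * afterwards so the display can be inspected before the next step.
 * Stops at the first failing step and returns how many steps succeeded.
 */
size_t run_teardown_steps(const struct teardown_step *steps, size_t n,
                          void *ctx, unsigned int pause_seconds)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        fprintf(stderr, "teardown: running step %zu (%s)\n",
                i + 1, steps[i].name);
        if (steps[i].run(ctx) != 0) {
            fprintf(stderr, "teardown: step %zu (%s) failed\n",
                    i + 1, steps[i].name);
            break;
        }
        done++;
        if (pause_seconds > 0)
            sleep(pause_seconds);
    }
    return done;
}
```

With a pause of a few seconds per step, the last step name logged before the gdm framebuffer gets corrupted would point at the call that triggers it.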
Comment 3 Martin Peres 2019-11-19 09:51:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/902.
