Bug 106960

Summary: Leases RROutput invalid access / segfault with xorg-server-1.20.0
Product: xorg Reporter: john.frankish
Component: Server/Ext/RandRAssignee: Keith Packard <keithp>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: highest CC: ajax, keithp, thellstrom
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
gdb and xorg output
none
gdb output
none
Proposed fix
none
During reset/shutdown, clean up leases in DIX instead of each driver
none
Patch that potentially fixes the problem none

Description john.frankish 2018-06-19 08:29:06 UTC
Created attachment 140221 [details]
gdb and xorg output

Using xf86-video-vmware 13.2.1 with mesa-18.0.0 (and earlier) and xorg-server-1.19.5 (and earlier) in vmware workstation 14 player works fine.

Using both xf86-video-vmware 13.2.1 and 13.3.0 with mesa-18.0.0 and xorg-server-1.20.0 results in a segfault. Note that with the vesa or modesetting drivers and xf86-input-vmmouse it works fine.

gdb output and Xorg.0.log attached
Comment 1 john.frankish 2018-06-19 08:31:18 UTC
Created attachment 140222 [details]
gdb output

gdb output missing from above
Comment 2 Michel Dänzer 2018-06-19 08:43:38 UTC
Looks like https://patchwork.freedesktop.org/patch/226987/ should fix this.
Comment 3 john.frankish 2018-06-19 08:47:10 UTC
Thanks - I should have mentioned that I tried that patch with xorg-server-1.20.0 and it didn't help.

I also tried mesa-18.1.2 with no success either.
Comment 4 Thomas Hellström 2018-06-19 08:51:12 UTC
Thanks for reporting.

A quick look at xf86-video-vmware seems like we're on screen close are calling into the crtc subsystem when it has already been taken down. I'll prepare a patch as soon as time allows.
Comment 5 Thomas Hellström 2018-06-19 08:55:11 UTC
Created attachment 140224 [details] [review]
Proposed fix

@john,

Could you try this patch to xf86-video-vmware?

Thanks,
Thomas
Comment 6 john.frankish 2018-06-19 09:28:22 UTC
Thanks for the rapid response.

Unfortunately, it doesn't fix things - I tried with both the xf86-video-vmware patch alone and together with the xorg-server patch referenced above.
Comment 7 Thomas Hellström 2018-06-19 10:47:42 UTC
Indeed. Seems to be some kind of memory corruption in the closescreen path. I'll probably need some more time to debug this.
Comment 8 john.frankish 2018-06-19 10:50:39 UTC
OK - let me know if you'd like me to do anything at this end.
Comment 9 Thomas Hellström 2018-06-20 13:56:40 UTC
So this is actually caused by the new leases code, accessing randr_outputs from CloseScreen where they have already been freed. I'll forward to KeithP to come up with a suitable solution.

This happens also with the modesetting driver, although it typically doesn't generate a visible error, although valgrind catches it:



==8844== Invalid read of size 8
==8844==    at 0xCEB7D79: drmmode_terminate_leases (drmmode_display.c:3330)
==8844==    by 0xCEAFB9C: CloseScreen (driver.c:1816)
==8844==    by 0x53AAC3: RRCloseScreen (randr.c:108)
==8844==    by 0x4D4BC8: xf86CrtcCloseScreen (xf86Crtc.c:743)
==8844==    by 0x48DE01: DGACloseScreen (xf86DGA.c:288)
==8844==    by 0x49C3CD: CMapCloseScreen (xf86cmap.c:250)
==8844==    by 0x511FB7: XvCloseScreen (xvmain.c:309)
==8844==    by 0x4AE21E: xf86XVCloseScreen (xf86xv.c:1168)
==8844==    by 0x569F42: present_close_screen (present_screen.c:70)
==8844==    by 0x4FD12A: CursorCloseScreen (cursor.c:205)
==8844==    by 0x56728A: AnimCurCloseScreen (animcur.c:100)
==8844==    by 0x4F7E91: compCloseScreen (compinit.c:86)
==8844==  Address 0x15939540 is 352 bytes inside a block of size 368 free'd
==8844==    at 0x4C2EDAC: free (vg_replace_malloc.c:530)
==8844==    by 0x53AAA0: RRCloseScreen (randr.c:106)
==8844==    by 0x4D4BC8: xf86CrtcCloseScreen (xf86Crtc.c:743)
==8844==    by 0x48DE01: DGACloseScreen (xf86DGA.c:288)
==8844==    by 0x49C3CD: CMapCloseScreen (xf86cmap.c:250)
==8844==    by 0x511FB7: XvCloseScreen (xvmain.c:309)
==8844==    by 0x4AE21E: xf86XVCloseScreen (xf86xv.c:1168)
==8844==    by 0x569F42: present_close_screen (present_screen.c:70)
==8844==    by 0x4FD12A: CursorCloseScreen (cursor.c:205)
==8844==    by 0x56728A: AnimCurCloseScreen (animcur.c:100)
==8844==    by 0x4F7E91: compCloseScreen (compinit.c:86)
==8844==    by 0xA4DA544: glxCloseScreen (glxscreens.c:171)
==8844==  Block was alloc'd at
==8844==    at 0x4C2FB06: calloc (vg_replace_malloc.c:711)
==8844==    by 0x53B31C: RRScreenInit (randr.c:308)
==8844==    by 0x4E373C: xf86RandR12Init (xf86RandR12.c:870)
==8844==    by 0x4D4CD4: xf86CrtcScreenInit (xf86Crtc.c:778)
==8844==    by 0xCEAF69E: ScreenInit (driver.c:1660)
==8844==    by 0x495A55: xf86ScreenInit (xf86Init.c:317)
==8844==    by 0x43998B: AddScreen (dispatch.c:3915)
==8844==    by 0x496D39: InitOutput (xf86Init.c:744)
==8844==    by 0x43EDA2: dix_main (main.c:193)
==8844==    by 0x42136D: main (stubmain.c:34)
==8844== 
==8844== Invalid read of size 8
==8844==    at 0xCEB7D88: drmmode_terminate_leases (drmmode_display.c:3330)
==8844==    by 0xCEAFB9C: CloseScreen (driver.c:1816)
==8844==    by 0x53AAC3: RRCloseScreen (randr.c:108)
==8844==    by 0x4D4BC8: xf86CrtcCloseScreen (xf86Crtc.c:743)
==8844==    by 0x48DE01: DGACloseScreen (xf86DGA.c:288)
==8844==    by 0x49C3CD: CMapCloseScreen (xf86cmap.c:250)
==8844==    by 0x511FB7: XvCloseScreen (xvmain.c:309)
==8844==    by 0x4AE21E: xf86XVCloseScreen (xf86xv.c:1168)
==8844==    by 0x569F42: present_close_screen (present_screen.c:70)
==8844==    by 0x4FD12A: CursorCloseScreen (cursor.c:205)
==8844==    by 0x56728A: AnimCurCloseScreen (animcur.c:100)
==8844==    by 0x4F7E91: compCloseScreen (compinit.c:86)
==8844==  Address 0x15939540 is 352 bytes inside a block of size 368 free'd
==8844==    at 0x4C2EDAC: free (vg_replace_malloc.c:530)
==8844==    by 0x53AAA0: RRCloseScreen (randr.c:106)
==8844==    by 0x4D4BC8: xf86CrtcCloseScreen (xf86Crtc.c:743)
==8844==    by 0x48DE01: DGACloseScreen (xf86DGA.c:288)
==8844==    by 0x49C3CD: CMapCloseScreen (xf86cmap.c:250)
==8844==    by 0x511FB7: XvCloseScreen (xvmain.c:309)
==8844==    by 0x4AE21E: xf86XVCloseScreen (xf86xv.c:1168)
==8844==    by 0x569F42: present_close_screen (present_screen.c:70)
==8844==    by 0x4FD12A: CursorCloseScreen (cursor.c:205)
==8844==    by 0x56728A: AnimCurCloseScreen (animcur.c:100)
==8844==    by 0x4F7E91: compCloseScreen (compinit.c:86)
==8844==    by 0xA4DA544: glxCloseScreen (glxscreens.c:171)
==8844==  Block was alloc'd at
==8844==    at 0x4C2FB06: calloc (vg_replace_malloc.c:711)
==8844==    by 0x53B31C: RRScreenInit (randr.c:308)
==8844==    by 0x4E373C: xf86RandR12Init (xf86RandR12.c:870)
==8844==    by 0x4D4CD4: xf86CrtcScreenInit (xf86Crtc.c:778)
==8844==    by 0xCEAF69E: ScreenInit (driver.c:1660)
==8844==    by 0x495A55: xf86ScreenInit (xf86Init.c:317)
==8844==    by 0x43998B: AddScreen (dispatch.c:3915)
==8844==    by 0x496D39: InitOutput (xf86Init.c:744)
==8844==    by 0x43EDA2: dix_main (main.c:193)
==8844==    by 0x42136D: main (stubmain.c:34)
==8844== 


In the xf86-video-vmware case, it looks different, and the server
segfaults.

==9797== Invalid read of size 8
==9797==    at 0x5430AE: RROutputIsLeased (rrlease.c:123)
==9797==    by 0x4D9F82: xf86DPMSSet (xf86Crtc.c:2967)
==9797==    by 0xCEB3613: vmwgfx_disable_scanout (vmwgfx_crtc.c:117)
==9797==    by 0xCEB1022: drv_leave_vt (vmwgfx_driver.c:1271)
==9797==    by 0xCEB1288: drv_close_screen (vmwgfx_driver.c:1327)
==9797==    by 0x511FB7: XvCloseScreen (xvmain.c:309)
==9797==    by 0x4AE21E: xf86XVCloseScreen (xf86xv.c:1168)
==9797==    by 0x5EAA60: SyncCloseScreen (misync.c:156)
==9797==    by 0x4FD12A: CursorCloseScreen (cursor.c:205)
==9797==    by 0x56728A: AnimCurCloseScreen (animcur.c:100)
==9797==    by 0x4F7E91: compCloseScreen (compinit.c:86)
==9797==    by 0x569F42: present_close_screen (present_screen.c:70)
==9797==  Address 0x1377ba68 is 8 bytes inside a block of size 153 free'd
==9797==    at 0x4C2EDAC: free (vg_replace_malloc.c:530)
==9797==    by 0x547727: RROutputDestroyResource (rroutput.c:410)
==9797==    by 0x46BE1F: doFreeResource (resource.c:880)
==9797==    by 0x46C830: FreeClientResources (resource.c:1146)
==9797==    by 0x46C91F: FreeAllResources (resource.c:1161)
==9797==    by 0x43F170: dix_main (main.c:292)
==9797==    by 0x42136D: main (stubmain.c:34)
==9797==  Block was alloc'd at
==9797==    at 0x4C2DBAB: malloc (vg_replace_malloc.c:299)
==9797==    by 0x546967: RROutputCreate (rroutput.c:83)
==9797==    by 0x4E5B53: xf86RandR12CreateObjects12 (xf86RandR12.c:1734)
==9797==    by 0x4E747A: xf86RandR12Init12 (xf86RandR12.c:2375)
==9797==    by 0x4E383E: xf86RandR12Init (xf86RandR12.c:895)
==9797==    by 0x4D4CD4: xf86CrtcScreenInit (xf86Crtc.c:778)
==9797==    by 0xCEB1D06: drv_screen_init (vmwgfx_driver.c:1200)
==9797==    by 0x495A55: xf86ScreenInit (xf86Init.c:317)
==9797==    by 0x43998B: AddScreen (dispatch.c:3915)
==9797==    by 0x496D39: InitOutput (xf86Init.c:744)
==9797==    by 0x43EDA2: dix_main (main.c:193)
==9797==    by 0x42136D: main (stubmain.c:34)
Comment 10 Keith Packard 2018-06-26 23:51:27 UTC
Created attachment 140355 [details] [review]
During reset/shutdown, clean up leases in DIX instead of each driver

Thanks for the careful analysis. I've created a proposed fix which moves lease termination out of the driver and does it during destruction of the CRTC and Output resources instead. This patch has also been posted to the xorg-devel mailing list.
Comment 11 Thomas Hellström 2018-06-27 06:56:40 UTC
(In reply to Keith Packard from comment #10)
> Created attachment 140355 [details] [review] [review]
> During reset/shutdown, clean up leases in DIX instead of each driver
> 
> Thanks for the careful analysis. I've created a proposed fix which moves
> lease termination out of the driver and does it during destruction of the
> CRTC and Output resources instead. This patch has also been posted to the
> xorg-devel mailing list.

But I don't think the patch catches the case where RROutputIsLeased() dereferences the RROutput pointer which has been freed. (Even if no leases ever existed). I think RROutputIsLeased() needs a screen-wide guard for whether resources have been freed or not....
Comment 12 Keith Packard 2018-06-27 08:34:21 UTC
The fix here is to remove all of the lease-related code from the video driver CloseScreen function. That's where you were seeing calls to RROutputIsLeased being made from.

I can't see any other paths which reach RROutputIsLeased or RRCrtcIsLeased during server shutdown -- they're all in mode configuration routines.

I have posted a patch which catches the other kind of errors -- calls made before the DIX resources are created during server startup. That's referenced in https://bugs.freedesktop.org/show_bug.cgi?id=106772 but only affects Nouveau in zaphod mode as far as I know.

If we do find a path where these pointers are getting used, we can use the above patch along with an additional callback that clears the pointers when the resources are freed. I think that's effectively the same as your suggestion for a global check to see if the resources have been freed?
Comment 13 Thomas Hellström 2018-06-27 10:47:57 UTC
So the xf86-video-vmware driver hits

CloseScreen()
LeaveVT()
xf86DPMSSet()
RROutputIsLeased() -> Access that becomes invalid after resource takedown.

Even if it's completely unaware of leases. Will your patch fix that?

Thanks,
Thomas
Comment 14 Keith Packard 2018-06-28 18:45:37 UTC
I can't tell if it will fix this one -- the randr_crtc and randr_output fields are cleared in xf86CrtcCloseScreen, so if that happens before the driver CloseScreen is called, then all will be well. Hrm. I think we should be clearing the randr_crtc and randr_output fields *before* we do anything else in xf86CrtcCloseScreen -- we know those pointers are invalid due to the resources being freed already. With that and the null checking wrappers around IsLeased, we should be in business.

The whole management of the DIX vs DDX resources is a bit of a mess though; the DDX resources are more-or-less permanent while the DIX structures come and go with server generations.

In any case, I've posted a two patch series which should fix this problem and cc'd you on it. Let me know what you think, and thanks for pointing out this additional issue.
Comment 15 john.frankish 2018-06-30 10:10:36 UTC
I'd like to check if the two patch series fixes the problem - where can I find the patches?
Comment 16 john.frankish 2018-07-09 08:44:47 UTC
The two patches in xorg-server git seem to fix the xf86-video-vmware problem, but there still seems to be a timing related issue with modesetting?
Comment 17 Olivier Fourdan 2018-11-21 13:30:52 UTC
(In reply to Keith Packard from comment #14)
> I can't tell if it will fix this one -- the randr_crtc and randr_output
> fields are cleared in xf86CrtcCloseScreen, so if that happens before the
> driver CloseScreen is called, then all will be well. Hrm. I think we should
> be clearing the randr_crtc and randr_output fields *before* we do anything
> else in xf86CrtcCloseScreen -- we know those pointers are invalid due to the
> resources being freed already. With that and the null checking wrappers
> around IsLeased, we should be in business.
> 
> The whole management of the DIX vs DDX resources is a bit of a mess though;
> the DDX resources are more-or-less permanent while the DIX structures come
> and go with server generations.
> 
> In any case, I've posted a two patch series which should fix this problem
> and cc'd you on it. Let me know what you think, and thanks for pointing out
> this additional issue.

FWIW, the crash still occurs with xf86-video-vmware and xserver-1.20.1 which has https://gitlab.freedesktop.org/xorg/xserver/commit/101d15c
Comment 18 Michel Dänzer 2018-11-21 14:51:57 UTC
(In reply to Olivier Fourdan from comment #17)
> FWIW, the crash still occurs with xf86-video-vmware and xserver-1.20.1 which
> has https://gitlab.freedesktop.org/xorg/xserver/commit/101d15c

(The equivalent of) drmmode_terminate_leases also needs to be removed from the driver.
Comment 19 Olivier Fourdan 2018-11-22 15:27:44 UTC
(In reply to Michel Dänzer from comment #18)
> (The equivalent of) drmmode_terminate_leases also needs to be removed from
> the driver.

I don't think the vmware driver uses that (can't find any reference to that).

The crash occurs in `dixGetPrivate()` called from `RROutputIsLeased()` because output->pScreen is invalid.

the vmware driver does:

`drv_close_screen()` -> `LeaveVT()` → `drv_leave_vt()` → `vmwgfx_disable_scanout()` → `xf86DPMSSet()` → `xf86OutputIsLeased()` → `RROutputIsLeased()`

The RROutput passed to `RROutputIsLeased()` from `xf86OutputIsLeased()` is the xf86OutputPtr's `output->randr_output` but I suspect this is already freed when we get to that point on teardown.
Comment 20 Olivier Fourdan 2018-11-22 15:50:18 UTC
Actually, `drv_close_screen()` should be called after `xf86CrtcCloseScreen()` which does:

```
 763     /* The randr_output and randr_crtc pointers are already invalid as
 764      * the DIX resources were freed when the associated resources were
 765      * freed. Clear them now; referencing through them during the rest
 766      * of the CloseScreen sequence will not end well.
 767      */
 768     for (o = 0; o < config->num_output; o++) {
 769         xf86OutputPtr output = config->output[o];
 770 
 771         output->randr_output = NULL;
 772     }
```

So the output given to `xf86OutputIsLeased()` should be `NULL` and `RROutputIsLeased()` should be called (as `xf86OutputIsLeased()` check for a `NULL`  output->randr_output)...

The thing is vmware's `drv_screen_init()` does the `vmwgfx_wrap(ms, pScreen, CloseScreen, drv_close_screen);` before calling `xf86CrtcScreenInit(pScreen)` so I think we are missing attachment 140224 [details] [review] (to move it *after* `xf86CrtcScreenInit(pScreen)` which installs `xf86CrtcCloseScreen()`)
Comment 21 Thomas Hellström 2018-11-23 06:59:57 UTC
(In reply to Olivier Fourdan from comment #20)
> Actually, `drv_close_screen()` should be called after
> `xf86CrtcCloseScreen()` which does:
> 
> ```
>  763     /* The randr_output and randr_crtc pointers are already invalid as
>  764      * the DIX resources were freed when the associated resources were
>  765      * freed. Clear them now; referencing through them during the rest
>  766      * of the CloseScreen sequence will not end well.
>  767      */
>  768     for (o = 0; o < config->num_output; o++) {
>  769         xf86OutputPtr output = config->output[o];
>  770 
>  771         output->randr_output = NULL;
>  772     }
> ```
> 

Actually, I don't think that will help since then LeaveVT, which turns off the screens will be called AFTER xf86CrtcCloseScreen() and more misery will occur.

The root problem is still that the above code setting randr_output to NULL should be called immediately after the screen resources were taken down. Not in xf86CrtcCloseScreen.

As an ugly workaround we could perhaps duplicate that code in the vmware driver just before calling LeaveVT.
Comment 22 Thomas Hellström 2018-11-23 07:15:09 UTC
Created attachment 142585 [details] [review]
Patch that potentially fixes the problem

I've attached a workaround to xf86-video-vmware that might fix the problem. Could anybody that is experiencing this try it out?

Thanks,
Thomas
Comment 23 Olivier Fourdan 2018-11-23 07:40:11 UTC
(In reply to Thomas Hellström from comment #22)
> Created attachment 142585 [details] [review] [review]
> Patch that potentially fixes the problem
> 
> I've attached a workaround to xf86-video-vmware that might fix the problem.
> Could anybody that is experiencing this try it out?

Changing xf86OutputPtr output's behind xf86 code (I mean in the driver) seems odd.

I don't use vmware so I can't help with testing, but changing xf86OutputPtr output's behind xf86 code (I mean in the driver) seems odd.
Comment 24 Thomas Hellström 2018-11-23 08:43:57 UTC
(In reply to Olivier Fourdan from comment #23)

> 
> Changing xf86OutputPtr output's behind xf86 code (I mean in the driver)
> seems odd.
> 
> I don't use vmware so I can't help with testing, but changing xf86OutputPtr
> output's behind xf86 code (I mean in the driver) seems odd.

I agree. that's why I called it an ugly workaround. But until Keith comes up with a better solution, I guess we have to live with this (provided it works).

The problem is that when these resources are destroyed, there is no way to notify the xf86Crtc layer so that it can clean the pointers. A reasonable solution would be either to add such callbacks or refcount the objects so that they are not destroyed until xf86CrtcCloseScreen. Both non-trivial.
Comment 25 Thomas Hellström 2018-11-23 10:00:20 UTC
Hmm,

FWIW, I can't reproduce this with Fedora Rawhide X Server 1.20.3-1.fc30, Perhaps Fedora carries some additional patch to fix this?
Comment 26 Olivier Fourdan 2018-11-23 10:05:40 UTC
(In reply to Thomas Hellström from comment #25)
> Hmm,
> 
> FWIW, I can't reproduce this with Fedora Rawhide X Server 1.20.3-1.fc30,
> Perhaps Fedora carries some additional patch to fix this?

Nope, Fedora is pretty bog standard in this regard, most patches in the xserver are backports for Xwayland and for the vmwre driver, the only patch is for the 1.20 compat (fbGetRotatedPixmap)
Comment 27 Olivier Fourdan 2018-11-23 10:07:02 UTC
Considering the memory is freed but not necessarily changed, I reckon the crash would be random.
Comment 28 Olivier Fourdan 2018-11-23 12:57:50 UTC
Humm, I'm confused now...

I added traces to the entry point of the relevant functions and I see, when closing:

 xf86CrtcCloseScreen()
 RRCloseScreen()
 drv_close_screen()
 vmwgfx_disable_scanout()

So by the time we get to `vmwgfx_disable_scanout()`, the `xf86OutputPtr output->randr_output` should be all `NULL` already.... And valgrind doesn't seem to report anything wrong either.
Comment 29 Olivier Fourdan 2018-11-23 13:33:48 UTC
Oh blimey!

I found out, it's due to a *downstream* patch which calls LeaveVT() fromxf86CrtcCloseScreen() just *before* the output_randr_output are set to NULL!

Completely sorry, all my apologies for the false alert!
Comment 30 Thomas Hellström 2018-11-26 13:58:40 UTC
Marking this bug as fixed now.

Workaround is basically no calls into the xf86crtc module in the closeScreen Path.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.