Bug 99635 - It is impossible to screenshot a user selected window.
Summary: It is impossible to screenshot a user selected window.
Status: RESOLVED FIXED
Alias: None
Product: Wayland
Classification: Unclassified
Component: wayland
Version: 1.2.x
Hardware: Other All
Importance: medium enhancement
Assignee: Wayland bug list
Reported: 2017-02-01 23:46 UTC by naelstrof
Modified: 2018-06-04 08:29 UTC

Description naelstrof 2017-02-01 23:46:43 UTC
With Wayland's current security model, there is no way to let a user select a window to screenshot, which is a coveted feature of screenshot tools on X11.

There are multiple things that would have to happen for this to work properly:
1. Gnome would have to add a "surface-handle" parameter to their org.gnome.Shell.Screenshot.ScreenshotWindow function. (other compositors would have to do something similar)
2. Wayland OR Gnome's compositor would need to publicly expose window positions, sizes, and "surface-handles".

I put "surface-handle" in quotes because exposing a real surface handle would be an obvious security flaw. It doesn't have to be a REAL surface handle, just some sort of identification number that tells the compositor which window needs to be captured. Anytime I refer to "surface-handles" I actually mean an identification number referring to a specific surface.
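
To illustrate what such an identification number could look like, here is a minimal sketch of a compositor-side lookup table. It is an assumption for illustration only: `SurfaceHandleTable` and its methods are hypothetical names, not part of any compositor or of Wayland. The idea is that the client only ever holds a random token, while the compositor alone can map it back to a surface.

```python
import secrets


class SurfaceHandleTable:
    """Hypothetical compositor-side map from opaque tokens to surfaces."""

    def __init__(self):
        self._by_token = {}

    def issue(self, surface):
        # The token is random, so it reveals nothing about the surface
        # (no pointer value, no index, no creation order).
        token = secrets.token_hex(16)
        self._by_token[token] = surface
        return token

    def resolve(self, token):
        # Unknown or revoked tokens resolve to None.
        return self._by_token.get(token)

    def revoke(self, token):
        # Called when the surface goes away or access is withdrawn.
        self._by_token.pop(token, None)


table = SurfaceHandleTable()
token = table.issue({"title": "Terminal"})
surface = table.resolve(token)
```

Because the token is just a lookup key, the compositor stays free to decide who may receive one and when it stops being valid.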

So my question is: to make screenshots on Wayland more robust, is it the responsibility of the compositor or of Wayland itself to expose surface information?

As for security, I don't think exposing surface positions or sizes is bad at all since it's already possible to make global silent screenshots through gnome's compositor. As for exposing "surface-handles" I can see that reducing security somewhat, since that'd allow a Wayland application to see even off-screen windows. Though I don't see the utility of scrying off-screen windows from the perspective of a malicious application.

The reason I need window positions and sizes is because I'm porting [slop](https://github.com/naelstrof/slop) to Wayland. The reason I need "surface-handles" is for [maim](https://github.com/naelstrof/maim).

P.S. Here are some reasons for including this feature:
A user might write a script that screenshots a minimized/off-screen Counter-Strike game repeatedly to check to see if a match has been found to send him an email/text/sound.
Time-lapses of specific applications, even if they get minimized or moved off-screen.
Screen shooting specific locations in games (like a healthbar, dps graph in path of exile).
Computer-illiterates would appreciate if gnome-screenshot's --interactive option would easily allow for responsive, automatic window selection. (Especially if the window is poorly placed or only partially visible.)

Thanks for reading!
Comment 1 naelstrof 2017-02-02 00:54:42 UTC
OK, I just chatted with SardemFF7 on freenode about it. Here are some useful snippets:

SardemFF7: two pieces of information: Wayland is not a process 
SardemFF7: it’s just a protocol, it cannot “expose handles” 
SardemFF7: it’ll always be up to the compositor (gnome-shell, kwin, e, weston and others) to implement anything that might be “handles” 
SardemFF7: and second thing: surfaces off-screen (minimized, hidden or anything) are supposed *not* to render (if they play nice, they can always render-for-nothing), so you would not be really capable of monitoring an off-screen surface 

---

naelstrof: being in control of the compositing, wouldn't gnome be capable of asking for an up-to-date buffer on screenshot? 
SardemFF7: it can “ask” (= fake that the current frame hit screen so the client would render the next one) but it’d have to wait 
SardemFF7: can you (the user) wait? :-) 

---

SardemFF7: (you could make a perfect DE-independent solution btw, it would be really nice, but really really hard ;-) ) 
naelstrof: oh you've got me interested 
naelstrof: difficulty isn't part of the equation here 
naelstrof: could you brief me? 
SardemFF7: political and technical security difficulty 
naelstrof: lmaoo

---

From this chat I understand it'd be easier to ask Gnome to expose some window information to help with screenshots, and hopefully their implementation would be good enough for other DEs to follow along.

Ideally I'd like a DE-independent solution, but as SardemFF7 said, it's probably beyond my poor communication/security skills.
Comment 2 Pekka Paalanen 2017-02-02 09:03:05 UTC
Hi,

you may need to split and reorganize the ideas and concepts somewhat.

For instance, selecting a window in a client program would require not just exposing window handles, but also augmenting the input interfaces to talk in terms of window handles, and having the compositor allow grabbing e.g. the pointer so that the client program can get input outside its own windows (of which there might be none, which currently disallows grabbing altogether). Each of these is a fairly awkward feature to design and justify.

If the only thing you wanted to do is to make a one-shot capture of one user selected window, it would be much easier to create e.g. a D-Bus command to do just that: "ask the user to select a window and give me the screenshot of it". This would also allow the compositor not just use the pointer as we have done on X11 for decades, but e.g. display a list of windows, including those you cannot currently reach with the pointer.

Exposing surface positions is also not a trivial thing. What if a surface has multiple positions? There are compositors that do that. What if the position is not on a linear 2D space, but in a 3D space? Maybe curved?

You can of course choose to limit the protocol extension to the usual cases, but then you might make it hard to implement for some compositors.

The reason we have worked very hard to avoid any global position information in the Wayland core set of protocols is that having it would exclude some inventive, funny, strange, or awesome use cases.

Rather than blindly copying concepts from the way we have always done things, it pays off to first think about what the user really wants to do and how best to achieve that, instead of starting from how it has been implemented before. One might find much better ways, or whole new use cases may become possible. Or one might find that the old way really is the way, in which case one needs to figure out how to integrate it in the new world order. And some things might actually be better left out.

Your listed use cases are far from easy to design for, but I wouldn't call them impossible or unacceptable. I would encourage you to work on them. The success of your proposals will be measured in how many compositor projects implement your proposal.

Making your extensions optional is key, because no-one can force compositor projects to accept this kind of feature, and we probably would not have it in the Wayland core set of protocols. What the latter means is that you should not add window handle methods to the wl_pointer or wl_surface interfaces, for instance.
Comment 3 Daniel Stone 2017-02-02 11:04:22 UTC
(In reply to Pekka Paalanen from comment #2)
> If the only thing you wanted to do is to make a one-shot capture of one user
> selected window, it would be much easier to create e.g. a D-Bus command to
> do just that: "ask the user to select a window and give me the screenshot of
> it". This would also allow the compositor not just use the pointer as we
> have done on X11 for decades, but e.g. display a list of windows, including
> those you cannot currently reach with the pointer.

Definitely. I think Flatpak portals are great prior art here: e.g. for opening a file, rather than exposing the full file system, they request the host system open a file-selection dialog, and then pass back a file descriptor.
Comment 4 naelstrof 2017-02-02 11:13:49 UTC
Thanks for the insightful comment. I should've approached my proposal from that perspective this whole time. Is there a name for it? Anyway, I've done a bit of thinking and come to some conclusions.

> What does the user really want to do?

The user wants to share what he/she sees. Easy.
Breaking down what needs to happen into steps:

1. The user must let the computer know he/she wants to share something.
2. What is to be shared needs to be selected somehow.
3. The data must be encoded somewhere/somehow.
4. The encoded data must be uploaded somewhere.
5. The content URL/data should be available in a clipboard.

There's obviously no one-size-fits-all implementation for any one of these steps. Some people might have tiny hands and struggle to press certain button combinations. Some people may not be able to access certain image sharing websites, or may not want their screenshots publicly available. Instead of a link to an image put into the clipboard, someone might prefer the actual pixel data to show up there for posting in mumble or the like. Some people prefer JPEG over PNG because their hard-disk space is low. etc. etc.

My point is that every step of this process is very expressive, personal, and equally important. If any one of these steps suffers, so does the user trying to share their screen.

As of right now, every step is perfectly customizable. Every step could easily be filled with its own application, and offer infinite customization options. Except for step 2: The user is currently forced to draw a rectangle around what he wants to share on Wayland.
"Oh but you can capture the active window instead!"
NO. That'd mean I would have to bind TWO screenshot buttons to serve the same purpose.
(In reply to Pekka Paalanen from comment #2)
> ...If the only thing you wanted to do is to make a one-shot capture of one user selected window, it would be much easier to create e.g. a D-Bus command to do just that...
NO. There's times where I need to select a window, and times where I need to crop it. Again I would have to bind TWO buttons to serve the same purpose.

(In reply to Pekka Paalanen from comment #2)
> What if a surface has multiple positions? There are compositors that do that. What if the position is not on a linear 2D space, but in a 3D space? Maybe curved?
Now that the purpose has been more clearly defined, I believe this becomes the root of the problem. How could one develop a selection protocol/extension that could cover these cases, not just the 2D rectangle ones. This makes it seem like selection should be up to the compositor, since with 2D rectangles the solution is really simple.

Exposing surface positions is non-trivial and allowing applications to grab input is dumb and unnecessary. You've mentioned extensions, Pekka. Could you describe that a bit more? It's hard for me to imagine how to implement this currently.
Comment 5 naelstrof 2017-02-02 11:33:24 UTC
Oh I want to clarify: Input grabbing is completely unnecessary. The wayland port of slop just covers the whole screen as a fullscreen application which works great.
Slop just needs a way to know where windows are in relation to the pointer. In X11 slop can simply ask for all surfaces intersecting with the pointer.

Maybe the extension could be similar, where compositors could be pretty clever with it and make it return actual windows when hovering over a window list on a task bar, or on a mini-window inside of a workspace.
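
For reference, the X11-style query described above can be sketched as follows. This assumes windows are plain 2D rectangles in a single global coordinate space, which is exactly the assumption that does not hold for every Wayland compositor. `Window` and `windows_under_pointer` are made-up names for illustration, not slop's or any compositor's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        # Half-open rectangle: the right and bottom edges are excluded.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def windows_under_pointer(stack, px, py):
    """Return the windows containing (px, py), topmost first.

    `stack` is ordered bottom-to-top, like a stacking order.
    """
    return [w for w in reversed(stack) if w.contains(px, py)]


stack = [
    Window("editor", 0, 0, 800, 600),
    Window("terminal", 100, 100, 400, 300),
]
hits = windows_under_pointer(stack, 150, 150)
names = [w.name for w in hits]  # ["terminal", "editor"]
```

Note that the query only answers for the point under the pointer, so a window that is fully covered can only be reached by walking down the returned list.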
Comment 6 Daniel Stone 2017-02-02 11:41:05 UTC
(In reply to naelstrof from comment #5)
> Oh I want to clarify: Input grabbing is completely unnecessary. The wayland
> port of slop just covers the whole screen as a fullscreen application which
> works great.
> Slop just needs a way to know where windows are in relation to the pointer.
> In X11 slop can simply ask for all surfaces intersecting with the pointer.

... that doesn't work for windows which are currently occluded. For that reason, Chrome's screencasting offers a grid view of all current top-level windows, with previews. This to me seems a far better approach than xwininfo-style get-window-under-pointer.

> Maybe the extension could be similar, where compositors could be pretty
> clever with it and make it return actual windows when hovering over a window
> list on a task bar, or on a mini-window inside of a workspace.

Sure, but then a malicious app could use that to just get all windows all the time, which wouldn't be fun. I'd much rather see something like the Flatpak document portal:
http://flatpak.org/xdg-desktop-portal/portal-docs.html#gdbus-org.freedesktop.portal.FileChooser

The compositor could then present a list of windows to select from (in whichever manner it feels is appropriate), and either provide a screenshot or cast immediately. This makes the intent unambiguous, sidestepping the security issue, and also avoids the indirection in your suggestion.

tl;dr the request should be 'I would like to shot/cast a window'; let the compositor sort the rest
Comment 7 naelstrof 2017-02-02 11:54:43 UTC
(In reply to Daniel Stone from comment #6)
> (In reply to naelstrof from comment #5)
> > Oh I want to clarify: Input grabbing is completely unnecessary. The wayland
> > port of slop just covers the whole screen as a fullscreen application which
> > works great.
> > Slop just needs a way to know where windows are in relation to the pointer.
> > In X11 slop can simply ask for all surfaces intersecting with the pointer.
> 
> ... that doesn't work for windows which are currently occluded. For that
> reason, Chrome's screencasting offers a grid view of all current top-level
> windows, with previews. This to me seems a far better approach than
> xwininfo-style get-window-under-pointer.
> 
> > Maybe the extension could be similar, where compositors could be pretty
> > clever with it and make it return actual windows when hovering over a window
> > list on a task bar, or on a mini-window inside of a workspace.
> 
> Sure, but then a malicious app could use that to just get all windows all
> the time, which wouldn't be fun. I'd much rather see something like the
> Flatpak document portal:
> http://flatpak.org/xdg-desktop-portal/portal-docs.html#gdbus-org.freedesktop.portal.FileChooser
> 
> The compositor could then present a list of windows to select from (in
> whichever manner it feels is appropriate), and either provide a screenshot
> or cast immediately. This makes the intent unambiguous, sidestepping the
> security issue, and also avoids the indirection in your suggestion.
> 
> tl;dr the request should be 'I would like to shot/cast a window'; let the
> compositor sort the rest

But my request is to share something. It won't always be a window. Sometimes it'll be a crop of a window. Sometimes it would be the whole desktop, and sometimes it would be a single screen.

It's starting to seem unrealistic for compositors to implement such a frivolous request.
Comment 8 Daniel Stone 2017-02-02 12:03:26 UTC
(In reply to naelstrof from comment #7)
> But my request is to share something. It won't always be a window. Sometimes
> it'll be a crop of a window. Sometimes it would be the whole desktop, and
> sometimes it would be a single screen.
> 
> It's starting to seem unrealistic for compositors to implement such a
> frivolous request.

Given how much it's needed, I don't think you can call it frivolous! :) Implementing cropping really doesn't seem that onerous for a compositor, either.
Comment 9 naelstrof 2017-02-02 12:07:14 UTC
So to start implementing this, should I look into making a dbus function that does such a task?

I could use gnome's screenshot dbus interface as a base.
Comment 10 Pekka Paalanen 2017-02-02 12:08:24 UTC
(In reply to naelstrof from comment #4)
> (In reply to Pekka Paalanen from comment #2)
> > ...If the only thing you wanted to do is to make a one-shot capture of one user selected window, it would be much easier to create e.g. a D-Bus command to do just that...
> NO. There's times where I need to select a window, and times where I need to
> crop it. Again I would have to bind TWO buttons to serve the same purpose.

Why should cropping further down to a window sub-region be part of capture? Wouldn't that be done with, say, an image manipulation program, like one that submits the image to a web service if one uses such?

The compositor should probably be concerned about cropping on the scale of: all outputs, a specific output, a specific top-level window. To me that seems like a nice trade-off between complexity of use vs. security (what is guaranteed to not be captured). The same way one picks a specific window, there could be options for a specific output or all outputs.

This kind of split between the compositor and client program responsibilities would be natural: the compositor can keep working with complete buffers, while the client program never needs to care about window positions or how they might be warped on screen. If you capture a window, you get the 2D image the application has drawn for the window. If you capture an output, you get what has been sent to a monitor.
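
As a rough sketch of the client side of that split: once the compositor has delivered a complete window image, cropping to a sub-region needs no position information at all. The buffer layout here (a flat, row-major list of pixels) and the `crop` helper are assumptions for illustration, not a format any compositor is known to deliver.

```python
def crop(pixels, width, height, x, y, crop_w, crop_h):
    """Crop a row-major buffer of `width * height` pixels.

    The rectangle is clamped to the image, so the result may be smaller
    than requested. Returns (cropped_pixels, cropped_width, cropped_height).
    """
    if len(pixels) != width * height:
        raise ValueError("buffer size does not match width * height")
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + crop_w), min(height, y + crop_h)
    if x1 <= x0 or y1 <= y0:
        return [], 0, 0  # rectangle lies entirely outside the image
    out = []
    for row in range(y0, y1):
        start = row * width
        out.extend(pixels[start + x0:start + x1])
    return out, x1 - x0, y1 - y0


# A 4x3 test image whose pixel values are just their indices.
image = list(range(12))
region, region_w, region_h = crop(image, 4, 3, 1, 1, 2, 2)
```

The coordinates passed to `crop` are relative to the captured image, so the client never learns where the window sits on screen.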

> (In reply to Pekka Paalanen from comment #2)
> > What if a surface has multiple positions? There are compositors that do that. What if the position is not on a linear 2D space, but in a 3D space? Maybe curved?
> Now that the purpose has been more clearly defined, I believe this becomes
> the root of the problem. How could one develop a selection
> protocol/extension that could cover these cases, not just the 2D rectangle
> ones. This makes it seem like selection should be up to the compositor,
> since with 2D rectangles the solution is really simple.

My first go would be something like:
1. A client program asks the compositor to pick something to be captured.
2. The compositor asks the user what he wants to pick.
3. The compositor passes back an abstract object referring to the picked target.
4. The object delivers e.g. snapshots once or on-demand, continuous video, notifies of the target disappearing, etc.

This leaves it completely unspecified how the compositor gets the selection from the user, which is a good thing as it depends on how the compositor works. One possibility could be an exposé kind of overview where each window is shown clearly separate plus previews for each output and all outputs together. Or maybe per virtual desktop views. Or something completely different, like drawing a box for where to capture from.

If the capture target is or becomes off-screen, the compositor could make sure it keeps updating as if it was on-screen as long as the object exists.
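
The four steps above can be sketched as a toy model. Everything here is hypothetical: `Compositor`, `CaptureTarget`, and the `chooser` callback are illustrative stand-ins, not a real protocol or D-Bus interface, and the model does not attempt to show how off-screen targets are kept updating.

```python
class CaptureTarget:
    """Abstract object handed back to the client (step 3)."""

    def __init__(self, compositor, target_id):
        self._compositor = compositor
        self._target_id = target_id
        self._gone_callbacks = []

    def snapshot(self):
        # Step 4: deliver a snapshot on demand. The client sees only the
        # image, never positions or real handles.
        return self._compositor._render(self._target_id)

    def on_gone(self, callback):
        self._gone_callbacks.append(callback)

    def _notify_gone(self):
        for callback in self._gone_callbacks:
            callback()


class Compositor:
    def __init__(self, chooser):
        # `chooser` stands in for whatever UI asks the user (step 2):
        # an overview, a list of windows, a drawn box, ...
        self._chooser = chooser
        self._contents = {}
        self._targets = []

    def add_window(self, target_id, image):
        self._contents[target_id] = image

    def request_capture(self):
        # Step 1: the client states its intent and names no target.
        picked = self._chooser(sorted(self._contents))
        if picked not in self._contents:
            return None  # user cancelled or nothing valid was picked
        target = CaptureTarget(self, picked)
        self._targets.append(target)
        return target

    def close_window(self, target_id):
        self._contents.pop(target_id, None)
        for target in self._targets:
            if target._target_id == target_id:
                target._notify_gone()

    def _render(self, target_id):
        return self._contents.get(target_id)
```

In this shape the selection UI lives entirely inside the compositor, so a client cannot enumerate windows; it can only receive what the user chose to hand over.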

> Exposing surface positions is non-trivial and allowing applications to grab
> input is dumb and unnecessary. You've mentioned extensions, Pekka. Could you
> describe that a bit more? It's hard for me to imagine how to implement this
> currently.

Everything in Wayland is an extension. There are a bunch in:
https://cgit.freedesktop.org/wayland/wayland-protocols/
https://cgit.freedesktop.org/wayland/weston/tree/protocol

The major point there is that each extension does its thing without the need to modify what is in wayland.xml file.

An extension is essentially (at least) one global interface advertised through wl_registry, which provides discovery of the feature and a binding point, plus any further interfaces needed.
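
A toy model of that discovery step, to show why extensions can stay optional. This is not the libwayland API: `Registry` and `find_and_bind` are simplified stand-ins, and `example_capture_manager_v1` is a made-up interface name. The only parts taken from the real protocol are that each global is advertised with a numeric name, an interface string, and a version, and that a client binds by name.

```python
class Registry:
    """Toy stand-in for wl_registry: advertises globals, lets clients bind."""

    def __init__(self):
        self._globals = {}  # numeric name -> (interface, version, factory)
        self._next_name = 1

    def advertise(self, interface, version, factory):
        name = self._next_name
        self._next_name += 1
        self._globals[name] = (interface, version, factory)
        return name

    def list_globals(self):
        # Mirrors the stream of "global" announcements a client receives.
        return [(name, iface, ver)
                for name, (iface, ver, _) in self._globals.items()]

    def bind(self, name, interface, version):
        iface, max_version, factory = self._globals[name]
        if iface != interface or version > max_version:
            raise ValueError("interface or version mismatch")
        return factory(version)


def find_and_bind(registry, wanted_interface, client_version):
    """Return a bound object, or None if the extension is not offered."""
    for name, iface, ver in registry.list_globals():
        if iface == wanted_interface:
            # Bind at the highest version both sides support.
            return registry.bind(name, iface, min(ver, client_version))
    return None


registry = Registry()
registry.advertise("wl_compositor", 4, lambda v: ("wl_compositor", v))
capture = find_and_bind(registry, "example_capture_manager_v1", 1)
# `capture` is None here: this compositor does not offer the extension,
# so the client falls back to some other behaviour.
```

A client written this way keeps working on compositors that never implement the extension, which is what makes an optional extension acceptable to ship.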

But depending on the case it does not need to be Wayland, it could be something else like D-Bus.

The reason to use Wayland is when the application is already using Wayland anyway and needs to address its own objects as part of the operation. If that is not readily true, then one needs to think which communication mechanism is the most appropriate. Could e.g. D-Bus be more suitable because of the features it has that Wayland does not?
Comment 11 Daniel Stone 2018-06-04 08:29:01 UTC
GNOME now implements exactly this using PipeWire, in part driven by the Flatpak-portal use case I mentioned above. It's not fully complete, but discussing it with them would be the best way to go about getting it fully functional.

