Bug 103536

Summary:	Windows filenames should be in UTF-8 encoding
Product:	cairo	Reporter:	Tom Schoonjans <Tom.Schoonjans>
Component:	win32 backend	Assignee:	cairo-bugs mailing list <cairo-bugs>
Status:	RESOLVED FIXED	QA Contact:	cairo-bugs mailing list <cairo-bugs>
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	Windows (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	Patch to use UTF-8 filenames on Windows Patch to use UTF-8 filenames on Windows Windows UTF-8 patch

Description Tom Schoonjans 2017-11-01 15:55:42 UTC

Currently all cairo functions involving writing or reading from files, use the fopen call to access those files. This works fine on Linux and macOS, but is troublesome on Windows since it assumes that the filename string is encoded in the ANSI codepage (see https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx#Anchor_2 for more on this).

This means that UTF-8 encoded filenames cannot be passed to these cairo methods, since any non-ASCII characters will be misinterpreted leading to run-time fopen errors (file not found etc). This is particularly unfortunate when writing cross platform software using Gtk+/Glib, which use UTF-8 filenames throughout the API. 

Keeping in mind that Microsoft is actively discouraging the reliance on codepages (https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx), I was wondering if the cairo maintainers would be willing to accept a patch that fixes this issue by switching to the internal usage of _wfopen (https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx) for the Windows platform, which would convert the UTF-8 filenames to their UTF-16 wide character string counterparts that will then be passed to _wfopen, very similar to what glib does with its g_fopen function.

I actually already found a 10 year old patch on the mailing list (https://lists.cairographics.org/archives/cairo/2007-February/009591.html), which was unfortunately never merged in, and may serve as a starting point for a new patch, compatible with the master branch.

I would be more than willing to perform the necessary work for this.

Thanks in advance for your reply,

Tom

Comment 1 Bill Spitzak 2017-11-01 19:29:20 UTC

This is a really good idea.

The patch only calls a function named _cairo_utf8_to_utf16 (which already exists for font support), it does not call the functions cairo_win32_filename_from_unicode or cairo_win32_filename_from_ansi. These functions are also really confusing as it sounds like they return a windows filename, but in fact are conversions *from* windows filenames to utf8. I would just omit those functions.

Comment 2 Bill Spitzak 2017-11-01 19:48:52 UTC

A few comments on this, hopefully pointing out that a number of people are strongly in favor of doing it:

http://utf8everywhere.org/
http://cppcms.com/files/nowide/html/

Comment 3 Tom Schoonjans 2017-11-03 08:08:34 UTC

Created attachment 135220 [details] [review]
Patch to use UTF-8 filenames on Windows

Glad to read you like my proposal. I am attaching a patch that accomplishes exactly this.

Comment 4 Bill Spitzak 2017-11-03 16:43:13 UTC

Looks right to me, only problem is the comments mention the non-existent function cairo_win32_filename_from_unicode(). Probably should just read like this:

@filename: name of PNG file to load. On Windows this is encoded in UTF-8.

Comment 5 Tom Schoonjans 2017-11-03 16:53:08 UTC

Created attachment 135230 [details] [review]
Patch to use UTF-8 filenames on Windows

Sorry about that. I removed those references from the patch

Comment 6 Uli Schlachter 2017-11-05 08:44:37 UTC

Comment on attachment 135230 [details] [review]
Patch to use UTF-8 filenames on Windows

Review of attachment 135230 [details] [review]:
-----------------------------------------------------------------

Any specific reason why not all calls to fopen() would get this new treatment? (Although I do agree that it would be a bit stupid for those cases with hardcoded ASCII file names, but still, consistency-wise it might be a good idea)

$ git grep fopen | wc -l
60
$ git grep fopen src/ | wc -l
21

(Only src/ might make sense since stuff outside of libcairo does not get access to libcairo-internal symbols)

I don't have much of an opinion here, I'm just asking.

And API-doc-wise I want to mention the following callers of _cairo_output_stream_create_for_filename which in turn end up getting the file name from some public API:

$ git grep _cairo_output_stream_create_for_filename
src/cairo-output-stream-private.h:_cairo_output_stream_create_for_filename (const char *filename);
src/cairo-output-stream.c:_cairo_output_stream_create_for_filename (const char *filename)
src/cairo-pdf-surface.c:    output = _cairo_output_stream_create_for_filename (filename);
src/cairo-ps-surface.c:    stream = _cairo_output_stream_create_for_filename (filename);
src/cairo-script-surface.c:    stream = _cairo_output_stream_create_for_filename (filename);
src/cairo-svg-surface.c:    stream = _cairo_output_stream_create_for_filename (filename);
src/cairo-xml-surface.c:    stream = _cairo_output_stream_create_for_filename (filename);

Comment 7 Adrian Johnson 2017-11-05 10:43:24 UTC

_cairo_fopen() belongs in cairo-misc.c. Just before _cairo_win32_tmpfile() would be a good place.

It needs to pass through the status returned by cairo_utf8_to_utf16. So something like:

cairo_status_t
_cairo_fopen (const char *utf8_filename, const char *mode, FILE **file_out);

And add a comment that the status is for the filename conversion. It can return success and file_out = NULL.

I wouldn't bother using it for hard coded file names. Only the calls to fopen() where the user supplies the filename.

> On Unix platforms, the filename is
> + * passed directly to the system

It is all platforms except windows.

Comment 8 Adrian Johnson 2017-11-06 08:18:10 UTC

Also, I'd be inclined to use a for/while loop and fixed size uint16_t array to convert the mode. Seems pointless using the overhead of _cairo_utf8_to_utf16() for a string of between 1 and 3 ASCII characters.

Comment 9 Tom Schoonjans 2017-11-06 12:55:30 UTC

Created attachment 135258 [details] [review]
Windows UTF-8 patch

I think I addressed the comments with this new patch except the last suggestion regarding using a for/while loop for the fopen mode. Though I agree that there will be some overhead caused by using _cairo_utf8_to_utf16, I think that it's a reasonable one, especially since these methods are unlikely to be called a lot during the execution of a program.

I have also not gone ahead with replacing every occurrence in the source of fopen to _cairo_fopen. Unless I missed one, the remaining ones are in DEBUG_* sections only, and they are always passed ASCII string filenames, either directly or indirectly.

Comment 10 Bill Spitzak 2017-11-13 19:51:12 UTC

I personally am not thrilled that the function has two different methods of returning errors. I would instead just return null if the utf8->utf16 conversion fails and force errno to whatever it gets if you send invalid utf16 as a filename to Windows. Then it is easier to drop-in replacement of fopen.

If you don't do that it may be better to return the status number for any failure, rather than the case statements that remain below both calls to this.

Comment 11 Adrian Johnson 2017-11-13 20:31:08 UTC

(In reply to Bill Spitzak from comment #10)
> I personally am not thrilled that the function has two different methods of
> returning errors. I would instead just return null if the utf8->utf16
> conversion fails and force errno to whatever it gets if you send invalid
> utf16 as a filename to Windows. Then it is easier to drop-in replacement of
> fopen.

It is an internal function. All internal functions that can generate a cairo status return the status number so that it will be propagated through to the cairo object that provides a get status.

Comment 12 Bill Spitzak 2017-11-13 20:34:28 UTC

In that case maybe it should contain the case statement that is now after the calls to it currently, which checks errno and turns it into status codes. The idea is that if it returns anything other than SUCCESS then the caller can return that code immediately. And SUCCESS means the file really opened.

Comment 13 Adrian Johnson 2017-11-26 09:52:34 UTC

Pushed

Comment 14 Tom Schoonjans 2017-11-26 11:13:28 UTC

Many thanks for merging this patch in!

In case there are plans for another 1.14.x release, would it be possible to include this patch in there as well?

Comment 15 Adrian Johnson 2017-11-26 11:49:27 UTC

> In case there are plans for another 1.14.x release, would it be possible to
> include this patch in there as well?

It is more a feature than a bug fix. Not worth the small chance it may break something.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.