Bug 103693 - Support unicode args and console output on windows
Summary: Support unicode args and console output on windows
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils (show other bugs)
Version: unspecified
Hardware: Other Windows (All)
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-12 00:48 UTC by Adrian Johnson
Modified: 2017-11-15 08:29 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Move UTF8.h to UnicodeMapFuncs.h (3.01 KB, patch)
2017-11-12 00:49 UTC, Adrian Johnson
Details | Splinter Review
sort encodings (1006 bytes, patch)
2017-11-12 00:49 UTC, Adrian Johnson
Details | Splinter Review
fix some mingw warnings (2.85 KB, patch)
2017-11-12 00:50 UTC, Adrian Johnson
Details | Splinter Review
support unicode on windows console (29.39 KB, patch)
2017-11-12 00:50 UTC, Adrian Johnson
Details | Splinter Review
support unicode on windows console v2 (30.47 KB, patch)
2017-11-12 10:26 UTC, Adrian Johnson
Details | Splinter Review

Description Adrian Johnson 2017-11-12 00:48:20 UTC
Currently, using the utils in a windows console does not work with unicode, either for the command line arguments, or for the output.

I looked into the possible solutions for this. The command line arguments are easy to fix. There is GetCommandLineW() to get the unicode command line.

Getting unicode output on the console is a bit more work. There are three ways:

- Set the code page to UTF-8 then printf with UTF-8 should just work - except for when it doesn't. The UTF-8 support is a second class citizen in windows and very buggy. Just redirecting the output to "more" crashed it.

- Use the wide IO functions eg wprintf(). This only works if the IO mode is set to 16-bit. ie _setmode(_fileno(stdout), _O_U16TEXT). The only problems is once the mode is changed to 16-bit, the 8-bit IO functions no longer work. printf() crashes.

- Use the WriteConsoleW() function. This seems to be the best solution as it can be mixed with printf. I did some testing and windows flushes stdio after each end of line so printf and WriteConsoleW() can be interleaved providing they are not used for the same line.

The next problem is finding the easiest way to support this in poppler. For the command line argument I create a Win32Console class that converts the args to UTF-8. eg

int main(int argc, char **argv)
{
  Win32Console win32Console(&argc, &argv);
  ...
}

On windows it will get the unicode args, convert to UTF-8 and store in argc/argv. On other platforms it is a no-op.

For the output there a 3 ways it could be done:

- Create a DLL that overrides the printf and related functions. I couldn't get this to work. I seems to depend on the order of linking and MSVC links the system libraries first.

- Write our own utf8_printf that calls WriteConsoleW on windows or printf on other platforms. I would prefer not to use non standard functions that need to be used everywhere in the code.

- #define printf to a replacement function on windows.

I ended up going the #define solution. The Win32Console header redefines printf/fprintf/fputc on windows to use WriteConsoleW if outputting to the console.
Comment 1 Adrian Johnson 2017-11-12 00:49:12 UTC
Created attachment 135409 [details] [review]
Move UTF8.h to UnicodeMapFuncs.h
Comment 2 Adrian Johnson 2017-11-12 00:49:31 UTC
Created attachment 135410 [details] [review]
sort encodings
Comment 3 Adrian Johnson 2017-11-12 00:50:09 UTC
Created attachment 135411 [details] [review]
fix some mingw warnings
Comment 4 Adrian Johnson 2017-11-12 00:50:39 UTC
Created attachment 135412 [details] [review]
support unicode on windows console
Comment 5 Adrian Johnson 2017-11-12 00:54:49 UTC
One problem I did find while testing is this doesn't work the pdftotext outputing to stdout because the TextOuputDev printf is in the poppler lib, not in pdftotext.cc. However, I don't think this a much of an issue as you would not normally output pdftotext to the console due to the size of the output. Normally you would pipe it through "more".

The unicode console output is intended more for things like pdfinfo, and file names.
Comment 6 Adrian Johnson 2017-11-12 01:02:03 UTC
The next thing I want to look at is the windows filename support in poppler. Currently is is a mess of duplicated code and #ifdefs everywhere. I'd like to simplify it to pass UTF-8 filenames through the code and just convert to wchar_t at the point where the file is opened (in goo/gfile.cc).

goo/gfile.cc curently has:

- openFile() takes UTF-8 filenames, has it's own UTF-8 to wchar_t conversion.

- GooFile::open() has separate UTF-8 and wchar_t variants.

- the GdirEntry code does not support unicode on windows.
Comment 7 Albert Astals Cid 2017-11-12 10:14:08 UTC
You need to copy the license text from the decoder, not just link to the web.

Are you sure we're not breaking people's code by changing the PDFDoc constructor behaviour?
Comment 8 Adrian Johnson 2017-11-12 10:26:01 UTC
Created attachment 135422 [details] [review]
support unicode on windows console v2

Include license.
Comment 9 Adrian Johnson 2017-11-12 10:29:01 UTC
(In reply to Albert Astals Cid from comment #7)
> Are you sure we're not breaking people's code by changing the PDFDoc
> constructor behaviour?

Old behavior is on win32 it calls GooFile::open(const GooString *fileName) which uses CreateFileA. This only supports ASCII. So new code is backwards compatible.
Comment 10 Albert Astals Cid 2017-11-13 23:08:17 UTC
Having to have 

Win32Console win32Console(&argc, &argv);

in each and every of the apps is a bit meh, but i guess there's no other way around it.

About the improvements, do you want to finish them before commiting or you want to commit what you already have?
Comment 11 Adrian Johnson 2017-11-14 10:44:30 UTC
I'd like to commit this first. I've got some other work in progress that I'd like to finish first then I'll open a new bug for the windows file/path unicode improvements.

I noticed also that the code has a mixture of mostly gmalloc/gfree but sometimes new/delete.Do you have a preference for one or the other in new code?
Comment 12 Albert Astals Cid 2017-11-14 23:14:40 UTC
(In reply to Adrian Johnson from comment #11)
> I'd like to commit this first. I've got some other work in progress that I'd
> like to finish first then I'll open a new bug for the windows file/path
> unicode improvements.

I don't see anything obviously wrong there and since it's mostly windows changes (which personally is less important for me) i guess you can go on.

> 
> I noticed also that the code has a mixture of mostly gmalloc/gfree but
> sometimes new/delete.Do you have a preference for one or the other in new
> code?

The only benefit i see in gmalloc is the checkoverflow variants, otherwise i'd say use new/delete ?
Comment 13 Adrian Johnson 2017-11-15 08:29:05 UTC
pushed


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.