40817 – MemoryError with strange utf8 chars

Bug 40817 - MemoryError with strange utf8 chars

Summary: MemoryError with strange utf8 chars

Status:	RESOLVED FIXED

Alias:	None

Product:	dbus
Classification:	Unclassified
Component:	python (show other bugs)
Version:	1.4.x
Hardware:	Other All

Importance:	medium critical
Assignee:	Simon McVittie
QA Contact:	John (J5) Palmieri

URL:
Whiteboard:
Keywords:

Depends on:	39549
Blocks:
	Show dependency tree / graph

Reported:	2011-09-13 00:12 UTC by Yann Leboulanger
Modified:	2012-06-25 13:43 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
test script (675 bytes, text/x-python) 2011-09-13 00:12 UTC, Yann Leboulanger	Details
View All

Description Yann Leboulanger 2011-09-13 00:12:03 UTC

Created attachment 51099 [details]
test script

I attached a smapp script that raises a MemoryError when signaling an object with a strange dbus.String()

Comment 1 Simon McVittie 2011-09-13 03:02:34 UTC

U+FDEF isn't a Unicode character: it's one of the 66 Unicode non-characters (U+xxxFE and U+xxxFF, plus U+FDD0..U+FDEF). It seems Python allows those non-characters to appear in unicode objects.

D-Bus rejects non-characters, U+0000, and invalid or over-long UTF-8; all are considered to be programming errors, and produce a warning on stderr and a failed function call (which normally means out-of-memory).

To give you a sensible error instead of MemoryError, dbus-python should perform the same validation that libdbus does, before appending a string to a message.

Comment 2 Thiago Macieira 2011-09-13 03:23:35 UTC

We had the same problem with Qt. And it was a source of remote Denial-of-Service.

Overlong sequences are definitely an encoding error. They are the same as having bad UTF-8 sequences. However, the non-characters are a different issue: they are properly encoded and the codepoints exist.

Comment 3 Florian Zeitz 2011-09-13 17:09:30 UTC

From the Unicode Standard:
"Applications are free to use any of these noncharacter code points internally"
I.e. Python is not wrong in allowing them.

"but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text."
How about following that recommendation instead of failing completely?

Comment 4 Thiago Macieira 2011-09-13 23:08:09 UTC

(In reply to comment #3)
> From the Unicode Standard:
> "Applications are free to use any of these noncharacter code points internally"
> "but should never attempt to exchange them. If a noncharacter is received in
> open interchange, an application is not required to interpret it in any way. 

My argument was that transmitting those characters over D-Bus from one front-end to its backend is not open interchange, but still inside "internal use".

Comment 5 Yann Leboulanger 2011-09-14 00:47:26 UTC

much more grave:

running this same script on Arch with those version:
dbus 1.4.14-1
dbus-core 1.4.14-1
dbus-glib 0.94-2
dbus-python 0.84.0-1
lib32-dbus-core 1.4.0-2

make python crash!

process 12879: arguments to dbus_message_iter_append_basic() were incorrect, assertion "_dbus_check_is_valid_utf8 (*string_p)" failed in file dbus-message.c line 2534.
This is normally a bug in some application using the D-Bus library.
  D-Bus not built with -rdynamic so unable to print a backtrace
Aborted

on my system (Debian) with those version:
dbus 1.4.14-1
dbus-x11 1.4.14-1
libdbus-1-3 1.4.14-1
libdbus-glib-1-2 0.94-4
python-dbus 0.84.0-2

I have this output:
process 28558: arguments to dbus_message_iter_append_basic() were incorrect, assertion "_dbus_check_is_valid_utf8 (*string_p)" failed in file ../../dbus/dbus-message.c line 2534.
This is normally a bug in some application using the D-Bus library.
Traceback (most recent call last):
  File "t3.py", line 22, in <module>
    getattr(signal_object, 'test')(dbo)
  File "/usr/lib/python2.6/dist-packages/dbus/decorators.py", line 309, in emit_signal
    message.append(signature=signature, *args)
MemoryError

Comment 6 Yann Leboulanger 2011-09-20 01:16:31 UTC

As this makes applications using D-Bus crash, I set importance to critical.

Comment 7 Sjoerd Simons 2011-11-01 13:52:46 UTC

This in some ways all goes back to:

https://bugzilla.gnome.org/show_bug.cgi?id=107427

Which is the same code used by dbus and glib. From reading up on this it seems that both D-Bus and GLib, when they say Valid UTF-8/Unicode they actually mean Valid Unicode *Characters* and not just Valid Unicode Codepoints..

That said, I don't have a real opinion on what's more correct here. In any case, it should get documented that there is a difference here in the D-Bus spec if the current checking stays in place.

Comment 8 Simon McVittie 2011-11-02 08:05:29 UTC

This is awkward to fix fully in dbus-python (disallowing surrogates like U+D800) because Python unicode objects can either be UTF-16 or UCS-4, depending on platform and build options. The easiest way would be for libdbus to finally have proper API for validation (Bug #39549), so dbus-python can just use that rather than second-guessing what libdbus thinks is valid.

(In reply to comment #7)
> Which is the same code used by dbus and glib. From reading up on this it seems
> that both D-Bus and GLib, when they say Valid UTF-8/Unicode they actually mean
> Valid Unicode *Characters* and not just Valid Unicode Codepoints..

Yes. Last time this came up on the mailing list (because D-Bus was being a bit less strict than GLib), most participants said they preferred to keep this approach, and make D-Bus as strict as GLib; that's what we did. (Notable exception: Thiago disagreed, and thought D-Bus should accept non-characters.)

> That said, I don't have a real opinion on what's more correct here. In any
> case, it should get documented that there is a difference here in the D-Bus
> spec if the current checking stays in place.

Care to propose some wording for the spec?

Comment 9 Simon McVittie 2012-06-25 13:40:46 UTC

(In reply to comment #8)
> The easiest way would be for libdbus to finally
> have proper API for validation

Now that it does, dbus-python 1.1.1 uses it if available. Non-characters and surrogates raise a UnicodeError when serialized into a message.

If compiled against libdbus < 1.6, dbus-python does the same checks as libdbus instead. It's not very efficient but should work the same.

Comment 10 Simon McVittie 2012-06-25 13:43:41 UTC

(In reply to comment #7)
> it should get documented that there is a difference here in the D-Bus
> spec

Fixed in dbus git, this text will be in dbus 1.7.0:

"The UTF-8 text must be validated strictly: in particular, it must not contain overlong sequences, noncharacters such as U+FFFE, or codepoints above U+10FFFF."

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.