Created attachment 51099: test script

I attached a small script that raises a MemoryError when emitting a signal whose argument is a strange dbus.String().
U+FDEF isn't a Unicode character: it's one of the 66 Unicode non-characters (U+xxFFFE and U+xxFFFF in each of the 17 planes, plus U+FDD0..U+FDEF). It seems Python allows those non-characters to appear in unicode objects.

D-Bus rejects non-characters, U+0000, and invalid or over-long UTF-8. libdbus treats all of these as programming errors: it prints a warning on stderr and the function call fails. A failed call normally means out-of-memory, which is why dbus-python reports it as MemoryError.

To give you a sensible error instead of MemoryError, dbus-python should perform the same validation that libdbus does, before appending a string to a message.
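A minimal sketch of the kind of pre-validation described above, written for Python 3 rather than the Python 2.6 used in this report. The names is_noncharacter and check_dbus_string are hypothetical and are not part of dbus-python; the rules are only those listed in this comment, plus surrogates, which cannot be encoded as valid UTF-8.

```python
def is_noncharacter(cp):
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF (32 code points), plus the last two code points
    # of each of the 17 planes: U+xxFFFE and U+xxFFFF (34 code points).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_dbus_string(text):
    """Hypothetical helper: raise UnicodeError for strings that the
    validation described in this report would reject."""
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0:
            raise UnicodeError("embedded U+0000 at index %d" % index)
        if 0xD800 <= cp <= 0xDFFF:
            raise UnicodeError("surrogate U+%04X at index %d" % (cp, index))
        if is_noncharacter(cp):
            raise UnicodeError("noncharacter U+%04X at index %d" % (cp, index))
    return text
```

Calling check_dbus_string("abc\ufdef") would raise UnicodeError in this sketch, instead of the string reaching libdbus and failing there.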
We had the same problem with Qt. And it was a source of remote Denial-of-Service. Overlong sequences are definitely an encoding error. They are the same as having bad UTF-8 sequences. However, the non-characters are a different issue: they are properly encoded and the codepoints exist.
From the Unicode Standard:

"Applications are free to use any of these noncharacter code points internally"

I.e. Python is not wrong in allowing them.

"but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text."

How about following that recommendation instead of failing completely?
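For illustration, the recommendation quoted above (replace noncharacters with U+FFFD rather than fail) could look roughly like this in Python 3. This is only a sketch: replace_noncharacters is a hypothetical name, and nothing in this report says that dbus-python or libdbus behaves this way.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def is_noncharacter(cp):
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_noncharacters(text):
    """Hypothetical helper: substitute U+FFFD for every noncharacter."""
    return "".join(
        REPLACEMENT_CHARACTER if is_noncharacter(ord(ch)) else ch
        for ch in text
    )
```

With this sketch, replace_noncharacters("ab\ufdefc") gives "ab\ufffdc", and a string without noncharacters is returned unchanged.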
(In reply to comment #3)
> From the Unicode Standard:
> "Applications are free to use any of these noncharacter code points internally"
> "but should never attempt to exchange them. If a noncharacter is received in
> open interchange, an application is not required to interpret it in any way.

My argument was that transmitting those characters over D-Bus from one front-end to its backend is not open interchange, but still inside "internal use".
Much more serious: running this same script on Arch with these versions:

  dbus 1.4.14-1
  dbus-core 1.4.14-1
  dbus-glib 0.94-2
  dbus-python 0.84.0-1
  lib32-dbus-core 1.4.0-2

makes Python crash:

  process 12879: arguments to dbus_message_iter_append_basic() were incorrect, assertion "_dbus_check_is_valid_utf8 (*string_p)" failed in file dbus-message.c line 2534.
  This is normally a bug in some application using the D-Bus library.
  D-Bus not built with -rdynamic so unable to print a backtrace
  Aborted

On my system (Debian) with these versions:

  dbus 1.4.14-1
  dbus-x11 1.4.14-1
  libdbus-1-3 1.4.14-1
  libdbus-glib-1-2 0.94-4
  python-dbus 0.84.0-2

I get this output:

  process 28558: arguments to dbus_message_iter_append_basic() were incorrect, assertion "_dbus_check_is_valid_utf8 (*string_p)" failed in file ../../dbus/dbus-message.c line 2534.
  This is normally a bug in some application using the D-Bus library.
  Traceback (most recent call last):
    File "t3.py", line 22, in <module>
      getattr(signal_object, 'test')(dbo)
    File "/usr/lib/python2.6/dist-packages/dbus/decorators.py", line 309, in emit_signal
      message.append(signature=signature, *args)
  MemoryError
As this makes applications using D-Bus crash, I set importance to critical.
This in some ways all goes back to:

https://bugzilla.gnome.org/show_bug.cgi?id=107427

Which is the same code used by dbus and glib. From reading up on this it seems that both D-Bus and GLib, when they say Valid UTF-8/Unicode they actually mean Valid Unicode *Characters* and not just Valid Unicode Codepoints.

That said, I don't have a real opinion on what's more correct here. In any case, it should get documented that there is a difference here in the D-Bus spec if the current checking stays in place.
This is awkward to fix fully in dbus-python (disallowing surrogates like U+D800) because Python unicode objects can either be UTF-16 or UCS-4, depending on platform and build options.

The easiest way would be for libdbus to finally have proper API for validation (Bug #39549), so dbus-python can just use that rather than second-guessing what libdbus thinks is valid.

(In reply to comment #7)
> Which is the same code used by dbus and glib. From reading up on this it seems
> that both D-Bus and GLib, when they say Valid UTF-8/Unicode they actually mean
> Valid Unicode *Characters* and not just Valid Unicode Codepoints.

Yes. Last time this came up on the mailing list (because D-Bus was being a bit less strict than GLib), most participants said they preferred to keep this approach, and make D-Bus as strict as GLib; that's what we did. (Notable exception: Thiago disagreed, and thought D-Bus should accept non-characters.)

> That said, I don't have a real opinion on what's more correct here. In any
> case, it should get documented that there is a difference here in the D-Bus
> spec if the current checking stays in place.

Care to propose some wording for the spec?
(In reply to comment #8)
> The easiest way would be for libdbus to finally
> have proper API for validation

Now that it does, dbus-python 1.1.1 uses it if available. Non-characters and surrogates raise a UnicodeError when serialized into a message.

If compiled against libdbus < 1.6, dbus-python does the same checks as libdbus instead. It's not very efficient but should work the same.
(In reply to comment #7)
> it should get documented that there is a difference here in the D-Bus
> spec

Fixed in dbus git, this text will be in dbus 1.7.0:

"The UTF-8 text must be validated strictly: in particular, it must not contain overlong sequences, noncharacters such as U+FFFE, or codepoints above U+10FFFF."
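As a rough illustration of what "validated strictly" in that wording could mean at the byte level, here is a Python 3 sketch. The function name is_strict_utf8 is hypothetical. The sketch relies on Python 3's strict UTF-8 decoder, which already rejects overlong sequences, encoded surrogates and code points above U+10FFFF, and adds only the noncharacter check on top; it is not the libdbus implementation.

```python
def is_noncharacter(cp):
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_strict_utf8(data):
    """Hypothetical helper: True if the bytes are well-formed UTF-8
    and contain no noncharacters."""
    try:
        # Strict decoding fails on overlong forms, surrogates and
        # anything that would decode to a value above U+10FFFF.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any(is_noncharacter(ord(ch)) for ch in text)
```

For example, the overlong encoding b"\xc0\x80", the encoded noncharacter b"\xef\xbf\xbe" (U+FFFE) and the out-of-range sequence b"\xf4\x90\x80\x80" are all rejected by this sketch.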