libdbus and the D-Bus Specification currently disallow Unicode non-characters (U+FDD0..U+FDEF, U+xFFFE, U+xFFFF) in UTF-8 strings. This is consistent with pre-2013 versions of GLib.

There has been considerable discussion of this in the past, including:
<http://lists.freedesktop.org/archives/dbus/2010-February/012182.html>
<https://bugs.freedesktop.org/show_bug.cgi?id=40817>
<https://bugzilla.gnome.org/show_bug.cgi?id=107427>

However, Unicode Corrigendum 9 <http://www.unicode.org/versions/corrigendum9.html> clarifies that this was not the intention of the standard, and g_utf8_validate() has been changed <https://bugzilla.gnome.org/show_bug.cgi?id=694669> to consider noncharacters to be valid. This matches the interpretation Thiago advocated in our previous discussions.

We should consider changing the D-Bus Specification, the reference implementation, and any bindings that do their own validity checking (notably dbus-python, at least in git master) to allow non-characters.

As a practical note, GDBus uses g_utf8_validate() to check for validity, so it will happily send messages that dbus-daemon considers to be invalid (and get kicked off the bus as a result).
Should this change also be made in D-Bus 1.6? Answers on a postcard.

For: if an application using new-GDBus sends a message containing Corrigendum 9 UTF-8, making this change in D-Bus 1.6 means it won't get rejected.

Against: an application expecting a message in "GLib 2.34 UTF-8" could receive an unexpected message in "Corrigendum 9 UTF-8" via a stable-branch dbus-daemon, and crash.

If we're going to make this change at all then my inclination would be to say "yes, also change D-Bus 1.6".
"yes, also change D-Bus 1.6" The number of applications that depend on not receiving non-characters via D-Bus must be vanishingly small.
Created attachment 78331
[1.6, master] Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE (for n in the range 0 to 0x10), U+nFFFF (for n in the same range), and U+FDD0..U+FDEF are valid for interchange, and their presence does not make a string ill-formed. GLib 2.36 made the corresponding change in its definition of UTF-8 as used by g_utf8_validate() and similar functions.

Created attachment 78332
[master] Specification: explicitly allow the Unicode noncharacters

This follows Unicode Corrigendum #9.

Created attachment 78333
[1.6, master] [v2] Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE (for n in the range 0 to 0x10), U+nFFFF (for n in the same range), and U+FDD0..U+FDEF are valid for interchange, and their presence does not make a string ill-formed. GLib 2.36 made the corresponding change in its definition of UTF-8 as used by g_utf8_validate() and similar functions.

---
v2: also fix the comment above UNICODE_VALID().
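For readers who want to see what the change amounts to at the code-point level, here is a minimal sketch in C. It is not the actual libdbus patch or the real UNICODE_VALID() definition; the function names (is_noncharacter, is_scalar_value, valid_before_corrigendum9, valid_after_corrigendum9) are hypothetical and chosen only for illustration. It assumes the old rule was "Unicode scalar value and not a noncharacter" and the new rule is simply "Unicode scalar value", as the patch descriptions above suggest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; none of these are libdbus API. */

/* True for the noncharacters named in Corrigendum #9: U+FDD0..U+FDEF,
 * plus U+nFFFE and U+nFFFF for every plane n in 0..0x10. */
static bool is_noncharacter(uint32_t c)
{
    if (c > 0x10FFFF)
        return false;
    return (c >= 0xFDD0 && c <= 0xFDEF) || ((c & 0xFFFE) == 0xFFFE);
}

/* True for Unicode scalar values: within U+0000..U+10FFFF and not a
 * UTF-16 surrogate (U+D800..U+DFFF). */
static bool is_scalar_value(uint32_t c)
{
    return c < 0x110000 && (c & 0xFFFFF800u) != 0xD800;
}

/* Assumed old behaviour: noncharacters are rejected. */
static bool valid_before_corrigendum9(uint32_t c)
{
    return is_scalar_value(c) && !is_noncharacter(c);
}

/* Assumed new behaviour: noncharacters are accepted; surrogates and
 * out-of-range values are still rejected. */
static bool valid_after_corrigendum9(uint32_t c)
{
    return is_scalar_value(c);
}
```

Under these assumptions, a string containing U+FFFF would be rejected by the first predicate and accepted by the second, while a lone surrogate such as U+D800 is rejected by both.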
Comment on attachment 78331
[1.6, master] Accept non-characters when validating Unicode

Ship it!

Comment on attachment 78332
[master] Specification: explicitly allow the Unicode noncharacters

Ship it!

Comment on attachment 78333
[1.6, master] [v2] Accept non-characters when validating Unicode

Ship it!
Fixed in git for 1.7.2, 1.6.10. Any chance you could review Bug #63166, which breaks the build on recent Linux systems, including mine? I think that's the only release blocker at the moment.