Ruby 2: Recognizing decomposed utf8 in XML entities (NFD)
The problem is simple: I have XML containing this value:
mu&#168;ller
This appears to be a valid XML format for representing a u with an umlaut, like this:
müller
But all the parsers we have tried so far result in u¨ -- two distinct characters.
This form of Unicode (UTF-8) uses two codepoints to represent a single character; it is called Normalization Form D (canonical decomposition), or NFD, and in binary it is \303\274.
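To see what the parsers are handing back, it can help to list the codepoints of the resulting string. The following is a minimal sketch, assuming the parsed text is "u" followed by U+00A8 as described above; the variable names are illustrative only.

```ruby
# Assumption: this is the text a parser returns for the value above,
# i.e. "u" followed by U+00A8, written here with \u escapes.
parsed = "mu\u00A8ller"

# List each codepoint as U+XXXX so the separate characters are visible.
codepoints = parsed.codepoints.map { |cp| format("U+%04X", cp) }
puts codepoints.join(" ")

# Seven characters rather than six: "u" and the diaeresis stay separate.
puts parsed.length
```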
Most characters can also be represented by a single codepoint and entity, including this case. The XML could have included &#252; or &#xfc; or &uuml;, in binary \195\188. This is called Normalization Form C (canonical composition), or NFC. Any of these work fine.
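The byte values quoted above can be checked directly. The sketch below, with illustrative variable names, prints the UTF-8 bytes of the composed ü (U+00FC) and of the decomposed pair "u" + U+0308. It suggests that \303\274 (octal) and \195\188 (decimal) describe the same two bytes, both belonging to the composed form, while the decomposed form is three bytes long.

```ruby
composed   = "\u00FC"   # ü as a single codepoint (NFC)
decomposed = "u\u0308"  # "u" followed by COMBINING DIAERESIS (NFD)

composed_bytes   = composed.bytes    # => [195, 188]
decomposed_bytes = decomposed.bytes  # => [117, 204, 136]

# The same composed bytes written in octal, as in "\303\274".
composed_octal = composed_bytes.map { |b| format("\\%o", b) }.join

puts composed_bytes.inspect
puts decomposed_bytes.inspect
puts composed_octal
```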
So I think my question is one of:
Is there a parser (it doesn't seem to be Nokogiri) that can observe this and normalize to our preferred form?
Is there a reasonable way to reliably observe entities in NFD form and convert them to NFC form (or is there one already out there)?
Thanks!
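One possible approach, independent of the parser, is to normalize the text after parsing. This is a sketch assuming Ruby 2.2 or later, where String#unicode_normalize is available; `to_nfc` is a hypothetical helper name, and the input is assumed to be genuinely decomposed text ("u" followed by the combining mark U+0308).

```ruby
# Hypothetical helper: bring parsed text into NFC before using it.
# Requires Ruby 2.2+ for String#unicode_normalize.
def to_nfc(text)
  text.unicode_normalize(:nfc)
end

nfd = "mu\u0308ller"  # "u" + COMBINING DIAERESIS, 7 codepoints
nfc = to_nfc(nfd)     # "müller" with a single U+00FC, 6 codepoints

puts nfd.length
puts nfc.length
```

Already-composed input passes through unchanged, so the helper could be applied to every text value without first checking its form.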
The character you're using, U+00A8 (DIAERESIS), isn't a combining character; it is distinct from U+0308 (COMBINING DIAERESIS). (I've only just discovered this myself; I don't know what the use of the non-combining diaeresis is.)
It looks like in this case the behaviour is right and the XML is wrong (it should be using &#776;, not &#168;).
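The difference between the two characters can be illustrated in Ruby. This is a sketch assuming Ruby 2.2 or later; it checks whether each character belongs to a Unicode "Mark" category using the regex property \p{M}, then shows that NFC normalization composes only the combining one with the preceding "u". Variable names are illustrative.

```ruby
spacing   = "\u00A8"  # DIAERESIS (a standalone symbol)
combining = "\u0308"  # COMBINING DIAERESIS (a combining mark)

# \p{M} matches characters in the Unicode Mark categories.
spacing_is_mark   = !!(spacing =~ /\p{M}/)    # => false
combining_is_mark = !!(combining =~ /\p{M}/)  # => true

# NFC leaves "u" + U+00A8 alone but composes "u" + U+0308 into U+00FC.
with_spacing   = ("u" + spacing).unicode_normalize(:nfc)
with_combining = ("u" + combining).unicode_normalize(:nfc)

puts with_spacing.length    # 2
puts with_combining.length  # 1
```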
ruby xml utf-8