Ruby 2: Recognizing decomposed utf8 in XML entities (NFD)




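Problem

The problem is simple: I have XML containing the value

mu¨ller

where the ¨ is written as a numeric character entity. This appears to be a valid XML format for representing the u with umlaut, like this:

müller

But all the parsers we have tried so far result in mu¨ller, that is, 2 distinct characters.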

Background

This form of Unicode (UTF-8) uses 2 codepoints to represent a single character; it is called Normalization Form Decomposed, or NFD, and in binary is \303\274.

Most characters can be represented as a single codepoint, and as a single entity, including in this case. The XML could have included ü in any of its entity spellings, which in binary is \195\188. This is called Normalization Form Composed, or NFC. All of these work fine.
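To make the two forms concrete, here is a minimal Ruby sketch, assuming Ruby 2.2 or later (where `String#unicode_normalize` is available). It prints the codepoints and UTF-8 bytes of ü in both forms. Note that \303\274 (octal) and \195\188 (decimal) denote the same two bytes, namely the composed form; the decomposed form is a plain u followed by the bytes of U+0308.

```ruby
# Composed (NFC) and decomposed (NFD) forms of u with umlaut.
nfc = "\u00FC"                      # single codepoint U+00FC
nfd = nfc.unicode_normalize(:nfd)   # "u" followed by U+0308

p nfc.codepoints                    # => [252]
p nfd.codepoints                    # => [117, 776]

p nfc.bytes                         # => [195, 188]
p nfd.bytes                         # => [117, 204, 136]

# The same NFC bytes written in octal, as in \303\274.
p nfc.bytes.map { |b| b.to_s(8) }   # => ["303", "274"]

# Normalizing the decomposed form gives back the composed one.
p nfd.unicode_normalize(:nfc) == nfc  # => true
```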

Getting to the right question

So I think the question is one of:

Is there a parser (it doesn't seem to be Nokogiri) that can observe this and normalize it to our preferred form?

Is there a reasonable way to reliably observe entities in NFD form and convert them to NFC form (or is there something out there that already does this)?

Thanks!

The character you’re using, U+00A8 (DIAERESIS), isn’t a combining character; it is distinct from U+0308 (COMBINING DIAERESIS). (I’ve only just discovered this myself, and I don’t know what the use of the non-combining diaeresis is.)
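The difference between the two characters shows up under normalization. The following is a small sketch, assuming Ruby 2.2 or later for `String#unicode_normalize`; the sample strings are illustrative values modelled on the question, not taken from the real data.

```ruby
# Hypothetical sample values; "mu\u00A8ller" mirrors the value in the question.
spacing   = "mu\u00A8ller"  # u followed by U+00A8 DIAERESIS (not combining)
combining = "mu\u0308ller"  # u followed by U+0308 COMBINING DIAERESIS
composed  = "m\u00FCller"   # single codepoint U+00FC

# NFC leaves the spacing diaeresis alone: still 2 distinct characters.
p spacing.unicode_normalize(:nfc) == spacing     # => true
p spacing.unicode_normalize(:nfc).length         # => 7

# NFC folds u + combining diaeresis into the single composed character.
p combining.unicode_normalize(:nfc) == composed  # => true
p combining.unicode_normalize(:nfc).length       # => 6
```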

It looks like in this case the parsers' behaviour is right and the XML is wrong (it should be using U+0308, not U+00A8).
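If the XML is corrected to use the combining diaeresis, the parsed text can be normalized to NFC after parsing. Here is a rough sketch using REXML (which ships with standard Ruby installations) rather than Nokogiri, assuming Ruby 2.2 or later. The helper name `root_text_nfc` and the `<name>` element are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: parse the XML, take the root element's text
# (REXML resolves numeric character entities here), and normalize it to NFC.
def root_text_nfc(xml)
  REXML::Document.new(xml).root.text.unicode_normalize(:nfc)
end

# Assumed sample documents; the element name is illustrative only.
fixed  = "<name>mu&#x308;ller</name>"  # combining diaeresis, U+0308
broken = "<name>mu&#xA8;ller</name>"   # spacing diaeresis, U+00A8

puts root_text_nfc(fixed)   # u + U+0308 is composed into U+00FC
puts root_text_nfc(broken)  # U+00A8 is left as a separate character
```

Normalizing after parsing keeps the approach independent of the parser, so the same `unicode_normalize(:nfc)` call should apply to text obtained from Nokogiri as well.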

ruby xml utf-8
