Stripping diacritics from input

By Ben Bullock on September 10, 2018 3:48 PM

If your input contains a lot of characters with Unicode diacritics and you need to convert them into the equivalent ASCII characters, there are several options on CPAN. My module Unicode::Diacritic::Strip offers two: a slow but reliable method which uses Unicode::UCD, and a fast method which uses a tr/// operator.
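To illustrate the trade-off between the two kinds of method, here is a minimal sketch. It is not the code from Unicode::Diacritic::Strip: the slow routine swaps in the core module Unicode::Normalize (decompose, then delete combining marks) rather than Unicode::UCD, and the fast routine uses a tiny hand-written tr/// table covering only four letters. The names strip_by_decomposition and strip_by_table are made up for this example.

```perl
use strict;
use warnings;
use utf8;                        # the literals below are UTF-8 characters
use Unicode::Normalize 'NFD';    # core module, used here instead of Unicode::UCD

# Slow but general (hypothetical helper): decompose each character into
# base letter plus combining marks, then delete the marks.
sub strip_by_decomposition {
    my ($text) = @_;
    my $decomposed = NFD ($text);
    $decomposed =~ s/\p{Mn}//g;  # \p{Mn} matches nonspacing marks
    return $decomposed;
}

# Fast but only as complete as its table (hypothetical helper).
# This illustrative table knows just four characters.
sub strip_by_table {
    my ($text) = @_;
    $text =~ tr/āīūḥ/aiuh/;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
my $name = "Jalālu'd Dīn Muḥammad Rūmī";
print strip_by_decomposition ($name), "\n";
print strip_by_table ($name), "\n";
```

Both calls print the same ASCII string for this input, but only because every accented letter in it happens to be in the table. The decomposition approach has no table to maintain, although it only helps for characters which have a canonical decomposition, so something like "ø" passes through unchanged.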

Today I was examining user logs for a web application, and I noticed that the fast method had failed on the input "Jalālu'd Dīn Muḥammad Rūmī": it had not caught the "ḥ" in the middle of "Muḥammad". Looking at the Unicode characters, I found a whole block of Latin characters which I'd omitted. I've now added them to the module in version 0.11.
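A failure like this is easy to miss, because the output still looks mostly right. One way to notice it, sketched below under the assumption that the goal is pure ASCII output, is to list any non-ASCII characters left after stripping. The table in incomplete_fast_strip is deliberately incomplete and is not the module's real table, and both subroutine names are invented for this example.

```perl
use strict;
use warnings;
use utf8;            # the literals below are UTF-8 characters
use charnames ();    # core module, provides charnames::viacode

# Hypothetical stand-in for a fast method whose table lacks "ḥ".
sub incomplete_fast_strip {
    my ($text) = @_;
    $text =~ tr/āīū/aiu/;
    return $text;
}

# Return each distinct non-ASCII character still present in $text.
sub leftover_non_ascii {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $text =~ /([^\x00-\x7F])/g;
}

binmode STDOUT, ':encoding(UTF-8)';
my $stripped = incomplete_fast_strip ("Jalālu'd Dīn Muḥammad Rūmī");
for my $char (leftover_non_ascii ($stripped)) {
    # Report the code point and its Unicode name.
    printf "U+%04X %s\n", ord $char, charnames::viacode (ord $char);
}
```

For this input the loop reports a single leftover, U+1E25 LATIN SMALL LETTER H WITH DOT BELOW. Running a check like this over logged input would show which characters a tr/// table still needs.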

Tagged as: Diacritics, Latin, Unicode

1 Comment

Mohammad S Anwar | September 10, 2018 11:53 PM | Reply

Excellent :-)

About Ben Bullock

Perl user since about 2006, I have also released some CPAN modules.
