How old is a Unicode character?

Posted on Thu 02 February 2017 in articles

I often need to find out when a particular character was encoded in The Unicode Standard. The Unicode Character Database (UCD) has a file called DerivedAge.txt that contains such information. This UCD file provides the Age property for characters, which can be queried to obtain the version of Unicode …

Continue reading
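
As a rough illustration of the lookup this post describes, here is a minimal sketch that parses text in the DerivedAge.txt layout (`<code point or range> ; <age> # comment`) and returns the Age value for a character. The function names `parse_derived_age` and `age_of` are hypothetical, and `SAMPLE` is a three-line stand-in written in the file's layout rather than the full UCD file, so any character outside those lines returns `None`.

```python
# Stand-in for the real file: three lines written in the DerivedAge.txt layout.
SAMPLE = """\
# <code point or range> ; <age> # comment
0000..01F5    ; 1.1 # [502] <control>..LATIN SMALL LETTER G WITH ACUTE
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
"""


def parse_derived_age(text):
    """Return a list of (start, end, age) tuples (hypothetical helper)."""
    ranges = []
    for line in text.splitlines():
        # Drop the trailing comment, then skip blank and comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        span, age = (field.strip() for field in data.split(";"))
        # A span is either a single code point or "first..last".
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, age))
    return ranges


def age_of(char, ranges):
    """Return the Age string for one character, or None if no range matches."""
    cp = ord(char)
    for start, end, age in ranges:
        if start <= cp <= end:
            return age
    return None


ranges = parse_derived_age(SAMPLE)
print(age_of("\u20ac", ranges))  # looks up the euro sign
```

To use it on real data, the same function could be fed the contents of a locally downloaded DerivedAge.txt instead of `SAMPLE`.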

Visualizing the growth of Unicode using matplotlib

Posted on Mon 30 January 2017 in articles

In a previous post, I presented information about the growth of Unicode in terms of the number of code points assigned in each version. The data was displayed as text tables using the PrettyTable package. As we are visual beings, I think it would be useful to also present that data …

Continue reading

Exploring the growth of Unicode using the UCD and Python

Posted on Sat 28 January 2017 in articles

How many characters are in Unicode? How many new characters have been added for each version of Unicode? From a programmatic standpoint, figuring out the number of characters published in a given version of Unicode is not as straightforward as one might imagine. While it is true that Unicode contains …

Continue reading
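
For a quick feel for the counting problem, the sketch below tallies code points by General Category using Python's built-in `unicodedata` module rather than the UCD files the post works with. It only reflects the single Unicode version bundled with the running interpreter, and treating everything except unassigned (`Cn`), surrogate (`Cs`), and private-use (`Co`) code points as "assigned" is an assumption of this sketch, not necessarily the definition the post settles on.

```python
import sys
import unicodedata
from collections import Counter


def count_by_category():
    """Tally every code point (U+0000..U+10FFFF) by its General Category."""
    return Counter(
        unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
    )


counts = count_by_category()

# Assumption: "assigned" means anything that is not unassigned (Cn),
# a surrogate (Cs), or private use (Co). Controls and format characters
# are therefore included in the total.
EXCLUDED = {"Cn", "Cs", "Co"}
assigned = sum(n for cat, n in counts.items() if cat not in EXCLUDED)

print("Unicode version:", unicodedata.unidata_version)
print("Assigned code points:", assigned)
```

The total changes with the Python release, because each release ships a different version of the character database.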

Python function to transliterate Devanagari

Posted on Thu 15 September 2016 in articles

Transliteration between complex scripts and Latin often requires more than a one-to-one mapping table. The underlying difficulty arises from the typology of complex scripts, such as those of the Indic family, which are alpha-syllabic. One issue with transliterating Indic scripts to Latin is handling the inherent vowel "a" of consonant letters …

Continue reading
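
To make the inherent-vowel issue concrete, here is a minimal sketch, not the function from the post. It assumes a tiny, hand-picked mapping table (the names `CONSONANTS`, `VOWEL_SIGNS`, `VOWELS`, and `transliterate` are all hypothetical) and shows why a plain one-to-one lookup is not enough: each consonant carries an implicit "a" that must be dropped when a vowel sign or a virama follows.

```python
# Deliberately small tables; a real transliterator needs the full script.
CONSONANTS = {"क": "k", "त": "t", "न": "n", "म": "m", "र": "r", "ल": "l", "स": "s"}
VOWEL_SIGNS = {
    "\u093e": "ā",  # DEVANAGARI VOWEL SIGN AA
    "\u093f": "i",  # DEVANAGARI VOWEL SIGN I
    "\u0940": "ī",  # DEVANAGARI VOWEL SIGN II
    "\u0941": "u",  # DEVANAGARI VOWEL SIGN U
    "\u0947": "e",  # DEVANAGARI VOWEL SIGN E
}
VOWELS = {"अ": "a", "आ": "ā", "इ": "i"}
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel


def transliterate(text):
    out = []
    pending_a = False  # True while the last consonant still owes its inherent "a"
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")  # previous consonant keeps its inherent vowel
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # the sign replaces the inherent vowel
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False  # inherent vowel is cancelled
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(VOWELS.get(ch, ch))  # unknown characters pass through
    if pending_a:
        out.append("a")  # this sketch always keeps a word-final inherent vowel
    return "".join(out)


print(transliterate("कमल"))
```

Keeping the word-final "a" is a simplification that suits Sanskrit-style romanization; languages such as Hindi often drop it, which this sketch does not attempt.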

Tokenizing Indic strings by syllables in Python

Posted on Mon 05 September 2016 in articles

Processing text in non-Latin scripts in Python has become much easier thanks to the native support for Unicode in Python 3. There are still some challenges, but these relate more to individual writing systems than to Python. One such issue is the manner in which strings of …

Continue reading