Nearest Chicago Public Library
Revisualizing the Unicode Roadmap
I undertook this project to present a single view of the Unicode Roadmap to the SMP, which stretches over 90 versions from 2001 to the present and encompasses 298 entries for scripts and blocks. I wrote a web scraper in Python using requests and beautifulsoup4 to parse each version. The data processing initially relied on a set of dicts generated from the raw data, but I switched to pandas for greater efficiency. The visualization resembles a Gantt chart and is plotted with matplotlib.
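To illustrate the kind of processing a Gantt-style view needs, here is a minimal sketch in plain Python. It assumes each bar runs from the first to the last roadmap version in which an entry is listed, which is my assumption rather than a description of the actual chart. The function name entry_spans and the sample records are hypothetical placeholders, not scraped data.

```python
from collections import defaultdict

def entry_spans(records):
    """Map each entry name to (first_version, last_version) in which it appears.

    `records` is an iterable of (version, entry_name) pairs. This shape is an
    assumption made for the sketch, not the project's real schema.
    """
    seen = defaultdict(list)
    for version, name in records:
        seen[name].append(version)
    # One bar per entry: it starts at the earliest version and ends at the latest.
    return {name: (min(versions), max(versions)) for name, versions in seen.items()}

# Placeholder records, not real roadmap data.
records = [
    (1, "Script A"), (2, "Script A"), (3, "Script A"),
    (2, "Script B"), (3, "Script B"), (4, "Script B"),
]
spans = entry_spans(records)
```

Each (start, end) pair can then be drawn as one horizontal bar, and the same grouping can be expressed with a pandas groupby once the records are in a DataFrame.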
Devanagari Syllable Analyzer
A Python implementation of Unicode Standard Annex (UAX) #29, “Unicode Text Segmentation”, that analyzes grapheme clusters of Devanagari. These clusters correspond to the orthographic syllables of the script. The algorithm will be extended to identify Sanskrit meters using machine learning. Note: iOS devices may display some glyphs incorrectly, or not at all, on account of known bugs in Apple’s Devanagari font.
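The idea of grouping code points into orthographic syllables can be sketched as follows. This is a deliberately simplified approximation and not the full UAX #29 rule set: it attaches combining marks and joiners to the preceding cluster and keeps a consonant with the cluster when it follows a virama. The function names and the consonant ranges are assumptions made for this illustration.

```python
import unicodedata

VIRAMA = "\u094D"                # DEVANAGARI SIGN VIRAMA
JOINERS = {"\u200C", "\u200D"}   # ZWNJ, ZWJ

def is_consonant(ch):
    # Assumed ranges: KA..HA plus the nukta consonants QA..YYA.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"

def is_extender(ch):
    # Vowel signs, virama, anusvara, visarga, etc. are Mn or Mc.
    return unicodedata.category(ch) in ("Mn", "Mc") or ch in JOINERS

def syllables(text):
    """Split text into approximate orthographic syllables."""
    clusters = []
    for ch in text:
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and is_consonant(ch)
        )
        if clusters and (is_extender(ch) or joins_conjunct):
            clusters[-1] += ch      # extend the current syllable
        else:
            clusters.append(ch)     # start a new syllable
    return clusters

# "sanskrit" in Devanagari: SA, ANUSVARA, SA, VIRAMA, KA, VOCALIC R sign, TA
word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"
parts = syllables(word)  # three clusters: sam, skri, ta
```

This sketch ignores cases such as a joiner placed between a virama and the next consonant, which a complete implementation of the annex would need to handle.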
Statistical Transliteration of Devanagari to Arabic
Automated transliteration of Devanagari into the Arabic script cannot be performed accurately using rule-based approaches alone. Not only do the structural differences between the alpha-syllabic system of Devanagari and the abjad system of Arabic pose challenges, but the incongruent character repertoires of the two scripts present additional hurdles. This Python prototype converts Hindi/Urdu text written in Devanagari into the Arabic script using both rule-based and simple statistical techniques. The core function is a simple spelling validator. As expected, some Arabic ‘corrections’ are wrong. An enhancement based upon n-gram analysis using TensorFlow is being developed.
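One way to combine rules with a simple statistic is sketched below; it is an illustration, not necessarily how the prototype works. Rules expand each Devanagari character into its possible Arabic-script letters, and a word-frequency table then picks the most common attested spelling. The mapping is heavily abbreviated, and the frequency counts are invented toy values, not corpus data.

```python
from itertools import product

# Hypothetical, heavily abbreviated rule table: one Devanagari character
# can correspond to several Arabic-script letters.
CANDIDATES = {
    "\u0938": ["\u0633", "\u0635", "\u062B"],  # SA -> seen, sad, theh
    "\u0924": ["\u062A", "\u0637"],            # TA -> teh, tah
    "\u092C": ["\u0628"],                      # BA -> beh
    "\u0932": ["\u0644"],                      # LA -> lam
    "\u093E": ["\u0627"],                      # vowel sign AA -> alef
    "\u093F": [""],                            # vowel sign I: short vowel, unwritten
}

# Toy frequency table (invented counts) standing in for a real word list.
TOY_FREQ = {
    "\u0633\u0627\u0644": 50,        # "saal" (year)
    "\u062B\u0627\u0628\u062A": 20,  # "saabit" (proven)
}

def transliterate(word, freq):
    """Return the most frequent attested spelling, or None if none is attested."""
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    spellings = ("".join(parts) for parts in product(*options))
    attested = [s for s in spellings if s in freq]
    return max(attested, key=freq.get) if attested else None

saal = transliterate("\u0938\u093E\u0932", TOY_FREQ)
saabit = transliterate("\u0938\u093E\u092C\u093F\u0924", TOY_FREQ)
```

The same Devanagari letter SA resolves to a different Arabic letter in each word, which shows why a lookup against attested spellings helps, and also why a wrong or missing lexicon entry produces a wrong ‘correction’.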
Jumble Solver with WordNet Integration
Generates permutations of a scrambled string and validates them against the WordNet lexicon to return plausible English words and their definitions. I coded this as a prototype for a validator of jumbled strings in South Asian languages, which are written in alpha-syllabic scripts. Permuting text in these languages requires working with syllabic tokens rather than individual letters, as in English.
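The difference between permuting letters and permuting syllabic tokens can be sketched as follows. A small in-memory set stands in for the WordNet lexicon, and the function name solve_jumble and the sample words are assumptions made for this example.

```python
from itertools import permutations

def solve_jumble(tokens, lexicon):
    """Return every ordering of `tokens` that spells a word in `lexicon`.

    `tokens` may be a string (letters as tokens) or a list of
    pre-segmented syllables.
    """
    candidates = {"".join(order) for order in permutations(tokens)}
    return sorted(candidates & set(lexicon))

# Toy lexicon standing in for WordNet.
TOY_LEXICON = {
    "act", "cat", "dog",
    "\u0915\u093F\u0924\u093E\u092C",  # Hindi "kitaab" (book)
}

# English: each letter is a token.
english = solve_jumble("tca", TOY_LEXICON)

# Devanagari: each token is a syllable (consonant plus vowel sign), so a
# vowel sign is never separated from its consonant.
scrambled = ["\u0924\u093E", "\u092C", "\u0915\u093F"]  # taa, ba, ki
hindi = solve_jumble(scrambled, TOY_LEXICON)
```

Because the number of orderings grows factorially with the token count, this brute-force approach only suits short inputs. In a fuller version the set lookup could be replaced by a WordNet query, for example through the NLTK wordnet corpus reader.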