Tokenizing Indic strings by syllables in Python

Posted on Mon 05 September 2016 in articles

Processing text in non-Latin scripts has become much easier in Python thanks to the native Unicode support in Python 3. Some challenges remain, but they relate more to individual writing systems than to Python. One such issue is how strings in different types of script are handled. Handling strings in alphabetic systems, such as the Latin script in which this post is written, is relatively straightforward; scripts with an alphasyllabic structure often require additional processing.

For instance, the string “English” has the following number of characters:

len('English')

7

list('English')

['E', 'n', 'g', 'l', 'i', 's', 'h']

Now consider the string ‘देवनागरी’, written in Devanagari, a script that belongs to the large ‘Indic’ family of writing systems:

len('देवनागरी')

8

list('देवनागरी')

['द', 'े', 'व', 'न', 'ा', 'ग', 'र', 'ी']

Seems reasonable, doesn’t it? At some level: yes. For practical purposes: no.

For Indic scripts, splitting a string into its characters is not as meaningful as it is for Latin. The reason is that Indic scripts are alphasyllabic writing systems, while Latin is an alphabetic system. In an alphasyllabic script the basic orthographic unit is the syllable rather than the individual character: a vowel sign such as ‘े’ does not stand alone but attaches to the base that precedes it. For effective text processing, the Devanagari string should be analyzed as:

['दे', 'व', 'ना', 'ग', 'री']
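To see why len() reports eight characters for five syllables, it can help to look at the code points themselves. The following is a minimal sketch using the standard unicodedata module; the variable names word and rows are only illustrative. General categories that start with 'M' are combining marks, which here are the dependent vowel signs.

```python
import unicodedata

word = 'देवनागरी'

# One row per code point: hex value, general category, and Unicode name.
rows = [
    (f'U+{ord(ch):04X}', unicodedata.category(ch), unicodedata.name(ch))
    for ch in word
]

for code, category, name in rows:
    print(code, category, name)

# Categories starting with 'M' are combining marks (the vowel signs here).
marks = [row for row in rows if row[1].startswith('M')]
print(len(word), 'code points,', len(marks), 'of them combining marks')
```

The five remaining code points are the bases, one per syllable in the list above.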

Let’s find an approach to tokenize Indic strings by syllables in Python.

Approach

Import the re module for processing regular expressions.

import re

Define character classes

An orthographic syllable may be defined in terms of character classes. Each syllable has a base, which is either an independent vowel letter or consonant letter. One or more combining signs may attach to a base. Analyze the Unicode Devanagari charts and classify characters.

vowels = '\u0904-\u0914\u0960-\u0961\u0972-\u0977'
consonants = '\u0915-\u0939\u0958-\u095F\u0978-\u097C\u097E-\u097F'
glottal = '\u097D'

vowel_signs = '\u093E-\u094C\u093A-\u093B\u094E-\u094F\u0955-\u0957\u0962-\u0963\u1CF8-\u1CF9'
nasals = '\u0900-\u0902\u1CF2-\u1CF6'
visarga = '\u0903'
nukta = '\u093C'
avagraha = '\u093D'
virama = '\u094D'

vedic_signs = '\u0951-\u0952\u1CD0-\u1CE1\u1CED'
visarga_modifiers = '\u1CE2-\u1CE8'
combining = '\uA8E0-\uA8F1'

om = '\u0950'

accents = '\u0953-\u0954'
dandas = '\u0964-\u0965'
digits = '\u0966-\u096F'
abbreviation = '\u0970'
spacing = '\u0971'

vedic_nasals = '\uA8F2-\uA8F7\u1CE9-\u1CEC\u1CEE-\u1CF1'
fillers = '\uA8F8-\uA8F9'
caret = '\uA8FA'
headstroke = '\uA8FB'

space = '\u0020'
joiners = '\u200C-\u200D'
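Each class above is a string of ranges meant to be placed inside a regular expression character class. As a rough illustration of how such strings can be used, the sketch below labels every character of a word with the class it falls into. The classes dictionary repeats only a subset of the definitions above so that the snippet runs on its own, and classify() is a hypothetical helper that is not used elsewhere in this post.

```python
import re

# A subset of the classes defined above, repeated so the snippet runs alone.
classes = {
    'vowel': '\u0904-\u0914\u0960-\u0961\u0972-\u0977',
    'consonant': '\u0915-\u0939\u0958-\u095F\u0978-\u097C\u097E-\u097F',
    'vowel_sign': '\u093E-\u094C\u093A-\u093B\u094E-\u094F\u0955-\u0957',
    'nasal': '\u0900-\u0902',
    'virama': '\u094D',
}

def classify(char):
    """Return the name of the first class that contains char, else 'other'."""
    for name, ranges in classes.items():
        if re.fullmatch('[' + ranges + ']', char):
            return name
    return 'other'

# The Hindi word उन्हें, spelled out as code points to keep it unambiguous.
word = '\u0909\u0928\u094D\u0939\u0947\u0902'
labels = [(ch, classify(ch)) for ch in word]

for ch, label in labels:
    print(f'U+{ord(ch):04X}', label)
```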

Define functions

syllabify() : the basic syllable tokenization function.

def syllabify(inputtext):

    syllables = []
    curr = ''

    # iterate over each character in the input. if a char belongs to a 
    # class that can be part of a syllable, then add it to the curr 
    # buffer. otherwise, output it to syllables[] right away.

    for char in inputtext:

        if re.match('[' + vowels + avagraha + glottal + om + ']', char):

            # need to handle non-initial independent vowel letters,
            # avagraha, and om

            if curr != '':
                syllables.append(curr)
                curr = char
            else:
                curr = curr + char

        elif re.match('[' + consonants + ']', char):

            # if last in curr is not virama, output curr as syllable
            # else add present consonant to curr

            if len(curr) > 0 and curr[-1] != virama:
                syllables.append(curr)
                curr = char
            else:
                curr = curr + char

        elif re.match('[' + vowel_signs + visarga + vedic_signs + ']', char):
            curr = curr + char

        elif re.match('[' + visarga_modifiers + ']', char):

            if len(curr) > 0 and curr[-1] == visarga:
                curr = curr + char
                syllables.append(curr)
                curr = ''
            else:
                syllables.append(curr)
                curr = ''

        elif re.match('[' + nasals + vedic_nasals + ']', char):

            # a nasal sign (anusvara, candrabindu, etc.) closes the syllable:
            # add it to curr, also after a vowel sign, and output curr

            curr = curr + char
            syllables.append(curr)
            curr = ''

        elif re.match('[' + nukta + ']', char):
            curr = curr + char

        elif re.match('[' + virama + ']', char):
            curr = curr + char

        elif re.match('[' + digits + ']', char):
            curr = curr + char

        elif re.match('[' + fillers + headstroke + ']', char):
            syllables.append(char)

        elif re.match('[' + joiners + ']', char):
            curr = curr + char

        else:
            pass
            #print ("unhandled: " + char + " ", char.encode('unicode_escape'))

    # handle remaining curr
    if curr != '':
        syllables.append(curr)
        curr = ''

    # return each syllable as item in a list
    return syllables
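For comparison, the same idea can be written as a single regular expression instead of a character-by-character loop. This is an alternative sketch, not the function above: it covers only the core Devanagari letters and signs, ignores joiners, digits and most Vedic marks, and the names CLUSTER and SYLLABLE are made up for this example.

```python
import re

CONSONANT = '[\u0915-\u0939\u0958-\u095F]'
VOWEL = '[\u0904-\u0914\u0960\u0961]'
VOWEL_SIGN = '[\u093E-\u094C\u0962\u0963]'
MODIFIER = '[\u0900-\u0903]'   # candrabindu, anusvara, visarga
ACCENT = '[\u0951\u0952]'      # udatta, anudatta
NUKTA = '\u093C'
VIRAMA = '\u094D'

# Zero or more "consonant + virama" pairs, a final consonant, and then
# either a vowel sign or a closing virama.
CLUSTER = (
    f'(?:{CONSONANT}{NUKTA}?{VIRAMA})*'
    f'{CONSONANT}{NUKTA}?(?:{VIRAMA}|{VOWEL_SIGN})?'
)

# A syllable is a consonant cluster or an independent vowel, optionally
# followed by accents and one modifier.
SYLLABLE = re.compile(f'(?:{CLUSTER}|{VOWEL}){ACCENT}*{MODIFIER}?{ACCENT}*')

words = ['देवनागरी', 'बुद्धि', 'अन्तरात्मा']
for w in words:
    print(w, ':', '\u00B7'.join(SYLLABLE.findall(w)))
```

Like the else branch of syllabify(), findall() silently skips any character that no part of the pattern accounts for.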

getSyllables() : splits input string into words and passes them to syllabify().

def getSyllables(inputtext):

    word_syllables = []
    all_words = []

    for word in inputtext.split():

        word = word.strip()
        word = re.sub(r'[\s\u0964\u0965.]', '', word)

        word_syllables = syllabify(word)
        #number_syllables = len(word_syllables)

        #joined_syllables = '\u00B7'.join(word_syllables)
        joined_syllables = word_syllables

        # make list of lists containing each word
        #all_words.append([word, joined_syllables, number_syllables])
        all_words.append([word, joined_syllables])

    return all_words
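The cleanup step inside getSyllables() can be tried on its own. The sketch below assumes a hypothetical helper, clean_words(), and a made-up two-word sample. The pattern is written as a raw string so that \s reaches the re module unchanged; re itself understands the \uXXXX escapes for the danda (U+0964) and double danda (U+0965).

```python
import re

def clean_words(text):
    """Split on whitespace, then strip dandas and full stops from each word."""
    return [re.sub(r'[\s\u0964\u0965.]', '', w) for w in text.split()]

# Made-up sample: two words, each followed by a danda or double danda.
sample = 'राम\u0964 सीता\u0965'
cleaned = clean_words(sample)
print(cleaned)
```

A token that consists only of punctuation becomes an empty string here, so a caller may want to filter those out before syllabifying.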

getSyllableStats() : get number of unique words and syllables in input text.

def getSyllableStats(inputtext):

    syllcount = {}
    wordcount = {}
    syllablestatus = {}

    words = getSyllables(inputtext)

    for entry in words:

        word = entry[0]
        syllables = entry[1]
        #count = entry[2]

        #word_syllables = syllables.split('\u00B7')
        word_syllables = syllables

        # count all words in input
        if word in wordcount:
            wordcount[word] += 1
        else:
            wordcount[word] = 1

        # count all syllables in input
        for syll in word_syllables:
            if syll in syllcount:
                syllcount[syll] += 1
            else:
                syllcount[syll] = 1

    syllablestatus.update({'words' : wordcount})
    syllablestatus.update({'syllables' : syllcount})

    return (syllablestatus)
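The counting loops can also be expressed with collections.Counter from the standard library. The sketch below is an alternative, not a replacement for the function above: syllable_stats() is a hypothetical name, and it takes the list of [word, syllables] pairs that getSyllables() returns rather than the raw text. The sample pairs are typed in by hand.

```python
from collections import Counter

def syllable_stats(word_entries):
    """Count words and syllables in a list of [word, syllables] pairs."""
    words = Counter(word for word, _ in word_entries)
    syllables = Counter(s for _, sylls in word_entries for s in sylls)
    return {'words': words, 'syllables': syllables}

# Hand-written pairs in the shape that getSyllables() produces.
sample = [
    ['और', ['औ', 'र']],
    ['करना', ['क', 'र', 'ना']],
    ['और', ['औ', 'र']],
]

stats = syllable_stats(sample)
for kind, counts in stats.items():
    print(kind, dict(counts))
    print('total', kind, '=', sum(counts.values()))
    print('unique', kind, '=', len(counts))
```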

Test the functions

Some text in Sanskrit

text = "अ॒ग्निमी॑ळे पु॒रोहि॑तं य॒ज्ञस्य॑ दे॒वमृ॒त्विज॑म्। होता॑रं रत्न॒धात॑मम्॥"

syllabify(text)

['अ॒',
 'ग्नि',
 'मी॑',
 'ळे',
 'पु॒',
 'रो',
 'हि॑',
 'तं',
 'य॒',
 'ज्ञ',
 'स्य॑',
 'दे॒',
 'व',
 'मृ॒',
 'त्वि',
 'ज॑',
 'म्हो',
 'ता॑',
 'रं',
 'र',
 'त्न॒',
 'धा',
 'त॑',
 'म',
 'म्']

Note that syllabify() skips spaces and dandas, so the final ‘म्’ of the first sentence merges with the first syllable of the next word and comes out as ‘म्हो’. getSyllables() avoids this by splitting the text into words first:

results = getSyllables(text)

for x in results:
    print (x[0], ":", '\u00B7'.join(x[1]), ":", len(x[1]))

अ॒ग्निमी॑ळे : अ॒·ग्नि·मी॑·ळे : 4
पु॒रोहि॑तं : पु॒·रो·हि॑·तं : 4
य॒ज्ञस्य॑ : य॒·ज्ञ·स्य॑ : 3
दे॒वमृ॒त्विज॑म् : दे॒·व·मृ॒·त्वि·ज॑·म् : 6
होता॑रं : हो·ता॑·रं : 3
रत्न॒धात॑मम् : र·त्न॒·धा·त॑·म·म् : 6

from IPython.display import HTML, display
import tabulate
display(HTML(tabulate.tabulate(results, tablefmt='html')))

अ॒ग्निमी॑ळे	['अ॒', 'ग्नि', 'मी॑', 'ळे']
पु॒रोहि॑तं	['पु॒', 'रो', 'हि॑', 'तं']
य॒ज्ञस्य॑	['य॒', 'ज्ञ', 'स्य॑']
दे॒वमृ॒त्विज॑म्	['दे॒', 'व', 'मृ॒', 'त्वि', 'ज॑', 'म्']
होता॑रं	['हो', 'ता॑', 'रं']
रत्न॒धात॑मम्	['र', 'त्न॒', 'धा', 'त॑', 'म', 'म्']

stats = getSyllableStats(text)
for k, v in stats.items():
    print (k)
    for x, y in v.items():
        print ("\t", x, ":", y)
    print ("total", k, "=", sum(v.values()))
    print ("unique", k, "=", len(v), end='\n\n')

words
     अ॒ग्निमी॑ळे : 1
     पु॒रोहि॑तं : 1
     य॒ज्ञस्य॑ : 1
     दे॒वमृ॒त्विज॑म् : 1
     होता॑रं : 1
     रत्न॒धात॑मम् : 1
total words = 6
unique words = 6

syllables
     अ॒ : 1
     ग्नि : 1
     मी॑ : 1
     ळे : 1
     पु॒ : 1
     रो : 1
     हि॑ : 1
     तं : 1
     य॒ : 1
     ज्ञ : 1
     स्य॑ : 1
     दे॒ : 1
     व : 1
     मृ॒ : 1
     त्वि : 1
     ज॑ : 1
     म् : 2
     हो : 1
     ता॑ : 1
     रं : 1
     र : 1
     त्न॒ : 1
     धा : 1
     त॑ : 1
     म : 1
total syllables = 26
unique syllables = 25

Some text in Hindi

text = "उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए"

results = getSyllables(text)

for x in results:
    print (x[0], ":", '\u00B7'.join(x[1]), ":", len(x[1]))

उन्हें : उ·न्हें : 2
बुद्धि : बु·द्धि : 2
और : औ·र : 2
अन्तरात्मा : अ·न्त·रा·त्मा : 4
की : की : 1
देन : दे·न : 2
प्राप्त : प्रा·प्त : 2
है : है : 1
और : औ·र : 2
परस्पर : प·र·स्प·र : 4
उन्हें : उ·न्हें : 2
भाईचारे : भा·ई·चा·रे : 4
के : के : 1
भाव : भा·व : 2
से : से : 1
बर्ताव : ब·र्ता·व : 3
करना : क·र·ना : 3
चाहिए : चा·हि·ए : 3

stats = getSyllableStats(text)
for k, v in stats.items():
    print (k)
    for x, y in v.items():
        print ("\t", x, ":", y)
    print ("total", k, "=", sum(v.values()))
    print ("unique", k, "=", len(v), end='\n\n')

words
     उन्हें : 2
     बुद्धि : 1
     और : 2
     अन्तरात्मा : 1
     की : 1
     देन : 1
     प्राप्त : 1
     है : 1
     परस्पर : 1
     भाईचारे : 1
     के : 1
     भाव : 1
     से : 1
     बर्ताव : 1
     करना : 1
     चाहिए : 1
total words = 18
unique words = 16

syllables
     उ : 2
     न्हें : 2
     बु : 1
     द्धि : 1
     औ : 2
     र : 5
     अ : 1
     न्त : 1
     रा : 1
     त्मा : 1
     की : 1
     दे : 1
     न : 1
     प्रा : 1
     प्त : 1
     है : 1
     प : 1
     स्प : 1
     भा : 2
     ई : 1
     चा : 2
     रे : 1
     के : 1
     व : 2
     से : 1
     ब : 1
     र्ता : 1
     क : 1
     ना : 1
     हि : 1
     ए : 1
total syllables = 41
unique syllables = 31

Edge cases

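A ZWJ (U+200D) or ZWNJ (U+200C) after a virama might suggest the termination of a grapheme cluster. In syllabify() the joiner is added to the buffer, so the next consonant no longer finds a virama as the last character and starts a new syllable: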

text = 'क्क क्‍क क्‌क'

syllabify(text)

['क्क', 'क्\u200d', 'क', 'क्\u200c', 'क']
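If a use case calls for keeping such sequences together, one option is to look past any trailing joiners when checking for a virama. The sketch below shows only that check, under the assumption that a joiner should not end the cluster; ends_with_virama() is a hypothetical helper and is not part of syllabify() above.

```python
VIRAMA = '\u094D'
JOINERS = '\u200C\u200D'  # ZWNJ and ZWJ

def ends_with_virama(buffer):
    """True if the last non-joiner character in buffer is a virama."""
    return buffer.rstrip(JOINERS).endswith(VIRAMA)

# The consonant KA (U+0915) in the four situations from the example above.
checks = {
    'virama': ends_with_virama('\u0915\u094D'),
    'virama + ZWJ': ends_with_virama('\u0915\u094D\u200D'),
    'virama + ZWNJ': ends_with_virama('\u0915\u094D\u200C'),
    'bare consonant': ends_with_virama('\u0915'),
}

for label, result in checks.items():
    print(label, ':', result)
```

Using this check in place of the curr[-1] comparison in the consonant branch would keep each joined sequence as a single item. Whether that is desirable depends on the application, since ZWNJ explicitly asks for a visible virama.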

Next steps

The drawback of the above approach is that the character classes have to be defined by hand. That is manageable for a single script. But a use case that requires analysis of, say, all twenty-two official languages of India must handle not just Devanagari but up to ten additional scripts. The upside is that these additional scripts are based upon the same orthographic structure as Devanagari; the downside is that each script has its own features. In an upcoming post, I will show how to derive character classes programmatically from the Unicode Character Database.
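As a small taste of what the standard library already exposes, the sketch below groups the code points of a block by their Unicode general category using the unicodedata module. It is only a possible starting point and not necessarily the method a later post would use: the general category separates letters, marks and digits, but it does not distinguish vowels from consonants. The function name classes_by_category() is an assumption made for this example.

```python
import unicodedata
from collections import defaultdict

def classes_by_category(start, end):
    """Group assigned code points in [start, end] by general category."""
    groups = defaultdict(list)
    for cp in range(start, end + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category != 'Cn':  # skip unassigned code points
            groups[category].append(ch)
    return dict(groups)

# The Devanagari block runs from U+0900 to U+097F.
devanagari = classes_by_category(0x0900, 0x097F)

for category, chars in sorted(devanagari.items()):
    print(category, len(chars))
```

The same call with another block range, for example U+0980 to U+09FF for Bengali, gives the corresponding groups for that script.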