Processing text in non-Latin scripts in Python has become much easier on account
of the native support for Unicode in version 3. There are still some challenges,
but these are issues related more to individual writing systems than to Python.
One such issue is the manner in which strings of various script typologies are
handled. While handling strings in alphabetic systems, such as the Latin
script in which this post is written, is relatively straightforward, scripts
that have an alphasyllabic structure often require additional processing.
For instance, the string “English” consists of the following characters:
['E', 'n', 'g', 'l', 'i', 's', 'h']
Now consider the string ‘देवनागरी’, written in the Devanagari script,
which belongs to the large ‘Indic’ family of writing systems. Split the same way, it yields:
['द', 'े', 'व', 'न', 'ा', 'ग', 'र', 'ी']
Seems reasonable, doesn’t it? At some level: yes. For practical purposes: no.
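The two character lists above can be reproduced with the built-in list(), which splits a string into Unicode code points. As a minimal sketch, the standard unicodedata module also shows that the Devanagari vowel signs are combining marks rather than letters. The variable names here are illustrative only.

```python
import unicodedata

latin = list("English")
devanagari = list("देवनागरी")

print(latin)       # one code point per letter
print(devanagari)  # more code points than the syllables a reader sees

# Print each code point with its Unicode general category and name.
for ch in devanagari:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Vowel signs are combining marks (categories Mn or Mc), not letters (Lo).
marks = [ch for ch in devanagari if unicodedata.category(ch).startswith("M")]
print(marks)
```

The marks are exactly the code points that a character-by-character split separates from their base consonants.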
For Indic scripts, splitting a string into individual characters is not as meaningful as it is for Latin. The reason is that Indic scripts are alphasyllabic
writing systems, while Latin is an alphabetic system. In an alphasyllabic system,
the basic orthographic unit that carries meaning is not the individual character
but the syllable: a base letter together with the signs attached to it. For effective text processing, the Devanagari string should be analyzed as:
['दे', 'व', 'ना', 'ग', 'री']
Let’s find an approach to tokenize Indic strings by syllables in Python.
Approach
Import the re module for processing regular expressions.
import re
Define character classes
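An orthographic syllable may be defined in terms of character classes. Each syllable has a base, which is either an independent vowel letter or a consonant letter. One or more combining signs may attach to a base. Analyze the Unicode Devanagari charts and classify the characters. Each class below is written as a string of code point ranges, ready to be placed between the square brackets of a regular expression character class.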
vowels = '\u0904-\u0914\u0960-\u0961\u0972-\u0977'
consonants = '\u0915-\u0939\u0958-\u095F\u0978-\u097C\u097E-\u097F'
glottal = '\u097D'
vowel_signs = '\u093E-\u094C\u093A-\u093B\u094E-\u094F\u0955-\u0957\u1CF8-\u1CF9'
nasals = '\u0900-\u0902\u1CF2-\u1CF6'
visarga = '\u0903'
nukta = '\u093C'
avagraha = '\u093D'
virama = '\u094D'
vedic_signs = '\u0951-\u0952\u1CD0-\u1CE1\u1CED'
visarga_modifiers = '\u1CE2-\u1CE8'
combining = '\uA8E0-\uA8F1'
om = '\u0950'
accents = '\u0953-\u0954'
dandas = '\u0964-\u0965'
digits = '\u0966-\u096F'
abbreviation = '\u0970'
spacing = '\u0971'
vedic_nasals = '\uA8F2-\uA8F7\u1CE9-\u1CEC\u1CEE-\u1CF1'
fillers = '\uA8F8-\uA8F9'
caret = '\uA8FA'
headstroke = '\uA8FB'
space = '\u0020'
joiners = '\u200C-\u200D'
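To illustrate how these range strings are used, here is a minimal sketch that wraps a few of them in square brackets and labels each character of a word. The classes dictionary and the classify() helper are hypothetical names introduced for this example, and only a subset of the classes is copied so that the block runs on its own.

```python
import re

# A subset of the classes defined above, copied so the sketch is self-contained.
classes = {
    'vowel': '\u0904-\u0914',
    'consonant': '\u0915-\u0939',
    'vowel_sign': '\u093E-\u094C',
    'nasal': '\u0900-\u0902',
    'virama': '\u094D',
}

def classify(char):
    """Return the name of the first class whose ranges contain char."""
    for name, ranges in classes.items():
        # '[' + ranges + ']' turns the range string into a regex character class
        if re.fullmatch('[' + ranges + ']', char):
            return name
    return 'other'

labels = [classify(c) for c in 'बुद्धि']
print(labels)
```

For ‘बुद्धि’ the labels alternate between bases and signs, with a virama joining the two consonants of the conjunct. That is the pattern the tokenizer has to group into syllables.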
Define functions
syllabify() : the basic syllable tokenization function.
def syllabify(inputtext):
    syllables = []
    curr = ''
    # iterate over each character in the input. if a char belongs to a
    # class that can be part of a syllable, then add it to the curr
    # buffer. otherwise, output it to syllables[] right away.
    for char in inputtext:
        if re.match('[' + vowels + avagraha + glottal + om + ']', char):
            # need to handle non-initial independent vowel letters,
            # avagraha, and om
            if curr != '':
                syllables.append(curr)
                curr = char
            else:
                curr = curr + char
        elif re.match('[' + consonants + ']', char):
            # if last in curr is not virama, output curr as syllable
            # else add present consonant to curr
            if len(curr) > 0 and curr[-1] != virama:
                syllables.append(curr)
                curr = char
            else:
                curr = curr + char
        elif re.match('[' + vowel_signs + visarga + vedic_signs + ']', char):
            curr = curr + char
        elif re.match('[' + visarga_modifiers + ']', char):
            if len(curr) > 0 and curr[-1] == visarga:
                curr = curr + char
                syllables.append(curr)
                curr = ''
            else:
                syllables.append(curr)
                curr = ''
        elif re.match('[' + nasals + vedic_nasals + ']', char):
            # add present vowel modifier to curr and output as syllable.
            # note: re.match anchors at the start of curr, so the first
            # branch only fires when curr is a single vowel sign
            vowelsign = re.match('[' + vowel_signs + ']$', curr)
            if vowelsign:
                syllables.append(curr)
                curr = ''
            else:
                curr = curr + char
                syllables.append(curr)
                curr = ''
        elif re.match('[' + nukta + ']', char):
            curr = curr + char
        elif re.match('[' + virama + ']', char):
            curr = curr + char
        elif re.match('[' + digits + ']', char):
            curr = curr + char
        elif re.match('[' + fillers + headstroke + ']', char):
            syllables.append(char)
        elif re.match('[' + joiners + ']', char):
            curr = curr + char
        else:
            pass
            #print ("unhandled: " + char + " ", char.encode('unicode_escape'))
    # handle remaining curr
    if curr != '':
        syllables.append(curr)
        curr = ''
    # return each syllable as item in a list
    return syllables
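For comparison, the core of the same idea can be sketched as a single regular expression instead of a character-by-character loop. This is a different technique from syllabify() above, not a replacement for it. It assumes only the basic consonant, vowel, vowel sign, modifier, nukta, and virama classes, and it deliberately ignores Vedic signs, digits, and joiners. The names SYLLABLE and regex_syllables() are hypothetical.

```python
import re

# Core classes only; Vedic signs, digits, and joiners are left out on purpose.
CONS = '\u0915-\u0939\u0958-\u095F'
VOWELS = '\u0904-\u0914'
SIGNS = '\u093E-\u094C'
MODIFIERS = '\u0900-\u0903'   # candrabindu signs, anusvara, visarga
NUKTA = '\u093C'
VIRAMA = '\u094D'

SYLLABLE = re.compile(
    # zero or more (consonant + virama) pairs, then a consonant, then an
    # optional vowel sign or final virama, then optional modifiers
    f'(?:[{CONS}]{NUKTA}?{VIRAMA})*[{CONS}]{NUKTA}?(?:{VIRAMA}|[{SIGNS}])?[{MODIFIERS}]*'
    # or an independent vowel with optional modifiers
    f'|[{VOWELS}][{MODIFIERS}]*'
)

def regex_syllables(word):
    """Return the orthographic syllables that the pattern finds in word."""
    return SYLLABLE.findall(word)

for word in ['देवनागरी', 'बुद्धि', 'उन्हें']:
    print(word, ':', '\u00B7'.join(regex_syllables(word)))
```

The loop-based function is easier to extend with special cases such as Vedic accents, while the pattern makes the syllable grammar visible at a glance. Characters outside the pattern are silently skipped by findall(), so this sketch should only be fed plain, unaccented words.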
getSyllables() : splits input string into words and passes them to syllabify().
def getSyllables(inputtext):
    word_syllables = []
    all_words = []
    for word in inputtext.split():
        word = word.strip()
        # remove whitespace, dandas, and full stops; the raw string avoids
        # the invalid escape sequence warning for \s
        word = re.sub(r'[\s\u0964\u0965.]', '', word)
        word_syllables = syllabify(word)
        #number_syllables = len(word_syllables)
        #joined_syllables = '\u00B7'.join(word_syllables)
        joined_syllables = word_syllables
        # make list of lists containing each word
        #all_words.append([word, joined_syllables, number_syllables])
        all_words.append([word, joined_syllables])
    return all_words
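The word-cleaning step can be tried in isolation. The sketch below splits a string on whitespace and strips dandas and full stops from each word, as getSyllables() does. The clean_words() helper is a hypothetical name, and unlike the function above it also drops tokens that become empty, which happens when a danda is set off by spaces.

```python
import re

DANDAS = '\u0964\u0965'  # danda and double danda

def clean_words(text):
    """Split text on whitespace and strip dandas and full stops from each word."""
    words = []
    for word in text.split():
        word = re.sub('[' + DANDAS + '.]', '', word)
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words

sample = 'देवम्\u0964 होता \u0965'
words = clean_words(sample)
print(words)
```

Splitting before syllabifying matters because a word-final virama would otherwise attach to the first consonant of the next word.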
getSyllableStats() : get number of unique words and syllables in input text.
def getSyllableStats(inputtext):
    syllcount = {}
    wordcount = {}
    word = ''
    syllable = ''
    count = ''
    syllablestatus = {}
    words = getSyllables(inputtext)
    for entry in words:
        word = entry[0]
        syllables = entry[1]
        #count = entry[2]
        #word_syllables = syllables.split('\u00B7')
        word_syllables = syllables
        # count all words in input
        if word in wordcount:
            wordcount[word] += 1
        else:
            wordcount[word] = 1
        # count all syllables in input
        for syll in word_syllables:
            if syll in syllcount:
                syllcount[syll] += 1
            else:
                syllcount[syll] = 1
    syllablestatus.update({'words' : wordcount})
    syllablestatus.update({'syllables' : syllcount})
    return (syllablestatus)
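The same tallies can be sketched more compactly with collections.Counter from the standard library. The syllable_stats() helper below is a hypothetical alternative, not the function used in this post. It assumes its input is a list of [word, syllables] pairs in the shape returned by getSyllables(), and the sample pairs are written by hand so that the block runs on its own.

```python
from collections import Counter

def syllable_stats(word_entries):
    """Count words and syllables from a list of [word, syllables] pairs."""
    words = Counter()
    syllables = Counter()
    for word, sylls in word_entries:
        words[word] += 1
        syllables.update(sylls)  # adds one count per syllable in the list
    return {'words': words, 'syllables': syllables}

# Hand-written pairs in the shape getSyllables() returns.
entries = [
    ['और', ['औ', 'र']],
    ['करना', ['क', 'र', 'ना']],
    ['और', ['औ', 'र']],
]
stats = syllable_stats(entries)
print(sum(stats['syllables'].values()), len(stats['syllables']))
```

A Counter is a dict subclass, so the printing loops used in the tests below work on it unchanged, and most_common() gives a frequency ranking for free.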
Test the functions
Some text in Sanskrit
text = "अ॒ग्निमी॑ळे पु॒रोहि॑तं य॒ज्ञस्य॑ दे॒वमृ॒त्विज॑म्। होता॑रं रत्न॒धात॑मम्॥"
Applying syllabify() directly to the whole string gives the list below. Spaces and dandas fall through to the final else branch and are skipped, so the word-final 'म्' merges with the following 'हो' into 'म्हो'. getSyllables() avoids this by splitting the text into words first.
['अ॒',
'ग्नि',
'मी॑',
'ळे',
'पु॒',
'रो',
'हि॑',
'तं',
'य॒',
'ज्ञ',
'स्य॑',
'दे॒',
'व',
'मृ॒',
'त्वि',
'ज॑',
'म्हो',
'ता॑',
'रं',
'र',
'त्न॒',
'धा',
'त॑',
'म',
'म्']
results = getSyllables(text)
for x in results:
    print(x[0], ":", '\u00B7'.join(x[1]), ":", len(x[1]))
अ॒ग्निमी॑ळे : अ॒·ग्नि·मी॑·ळे : 4
पु॒रोहि॑तं : पु॒·रो·हि॑·तं : 4
य॒ज्ञस्य॑ : य॒·ज्ञ·स्य॑ : 3
दे॒वमृ॒त्विज॑म् : दे॒·व·मृ॒·त्वि·ज॑·म् : 6
होता॑रं : हो·ता॑·रं : 3
रत्न॒धात॑मम् : र·त्न॒·धा·त॑·म·म् : 6
from IPython.display import HTML, display
import tabulate
display(HTML(tabulate.tabulate(results, tablefmt='html')))
अ॒ग्निमी॑ळे | ['अ॒', 'ग्नि', 'मी॑', 'ळे']
पु॒रोहि॑तं | ['पु॒', 'रो', 'हि॑', 'तं']
य॒ज्ञस्य॑ | ['य॒', 'ज्ञ', 'स्य॑']
दे॒वमृ॒त्विज॑म् | ['दे॒', 'व', 'मृ॒', 'त्वि', 'ज॑', 'म्']
होता॑रं | ['हो', 'ता॑', 'रं']
रत्न॒धात॑मम् | ['र', 'त्न॒', 'धा', 'त॑', 'म', 'म्']
stats = getSyllableStats(text)
for k, v in stats.items():
    print(k)
    for x, y in v.items():
        print("\t", x, ":", y)
    print("total", k, "=", sum(v.values()))
    print("unique", k, "=", len(v), end='\n\n')
words
अ॒ग्निमी॑ळे : 1
पु॒रोहि॑तं : 1
य॒ज्ञस्य॑ : 1
दे॒वमृ॒त्विज॑म् : 1
होता॑रं : 1
रत्न॒धात॑मम् : 1
total words = 6
unique words = 6
syllables
अ॒ : 1
ग्नि : 1
मी॑ : 1
ळे : 1
पु॒ : 1
रो : 1
हि॑ : 1
तं : 1
य॒ : 1
ज्ञ : 1
स्य॑ : 1
दे॒ : 1
व : 1
मृ॒ : 1
त्वि : 1
ज॑ : 1
म् : 2
हो : 1
ता॑ : 1
रं : 1
र : 1
त्न॒ : 1
धा : 1
त॑ : 1
म : 1
total syllables = 26
unique syllables = 25
Some text in Hindi
text = "उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए"
results = getSyllables(text)
for x in results:
    print(x[0], ":", '\u00B7'.join(x[1]), ":", len(x[1]))
उन्हें : उ·न्हें : 2
बुद्धि : बु·द्धि : 2
और : औ·र : 2
अन्तरात्मा : अ·न्त·रा·त्मा : 4
की : की : 1
देन : दे·न : 2
प्राप्त : प्रा·प्त : 2
है : है : 1
और : औ·र : 2
परस्पर : प·र·स्प·र : 4
उन्हें : उ·न्हें : 2
भाईचारे : भा·ई·चा·रे : 4
के : के : 1
भाव : भा·व : 2
से : से : 1
बर्ताव : ब·र्ता·व : 3
करना : क·र·ना : 3
चाहिए : चा·हि·ए : 3
stats = getSyllableStats(text)
for k, v in stats.items():
    print(k)
    for x, y in v.items():
        print("\t", x, ":", y)
    print("total", k, "=", sum(v.values()))
    print("unique", k, "=", len(v), end='\n\n')
words
उन्हें : 2
बुद्धि : 1
और : 2
अन्तरात्मा : 1
की : 1
देन : 1
प्राप्त : 1
है : 1
परस्पर : 1
भाईचारे : 1
के : 1
भाव : 1
से : 1
बर्ताव : 1
करना : 1
चाहिए : 1
total words = 18
unique words = 16
syllables
उ : 2
न्हें : 2
बु : 1
द्धि : 1
औ : 2
र : 5
अ : 1
न्त : 1
रा : 1
त्मा : 1
की : 1
दे : 1
न : 1
प्रा : 1
प्त : 1
है : 1
प : 1
स्प : 1
भा : 2
ई : 1
चा : 2
रे : 1
के : 1
व : 2
से : 1
ब : 1
र्ता : 1
क : 1
ना : 1
हि : 1
ए : 1
total syllables = 41
unique syllables = 31
Edge cases
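A zero-width joiner (ZWJ, U+200D) or zero-width non-joiner (ZWNJ, U+200C) after a virama might suggest the termination of a grapheme cluster. syllabify() keeps the joiner with the preceding virama and starts a new syllable at the next consonant: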
['क्क', 'क्\u200d', 'क', 'क्\u200c', 'क']
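The input that produced the list above is not shown, so the string in the sketch below is a guess that reproduces it. The split_clusters() helper is hypothetical and mimics only the virama, joiner, and consonant branches of syllabify(). It assumes every other character is a base consonant.

```python
VIRAMA = '\u094D'
JOINERS = '\u200C\u200D'  # ZWNJ and ZWJ

def split_clusters(text):
    """Group consonants, viramas, and joiners the way syllabify() does."""
    clusters = []
    curr = ''
    for ch in text:
        if ch == VIRAMA or ch in JOINERS:
            # viramas and joiners always attach to the current cluster
            curr += ch
        elif curr and curr[-1] != VIRAMA:
            # a consonant after anything but a bare virama starts a new cluster
            clusters.append(curr)
            curr = ch
        else:
            curr += ch
    if curr:
        clusters.append(curr)
    return clusters

# Assumed input: a plain conjunct, then virama + ZWJ, then virama + ZWNJ.
sample = 'क्कक्\u200dकक्\u200cक'
clusters = split_clusters(sample)
print([c.encode('unicode_escape').decode('ascii') for c in clusters])
```

Because the joiner, not the virama, is the last character in the buffer when the next consonant arrives, the cluster is closed at that point. That behavior matches the intent of both joiners, which exist to prevent the default conjunct form.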
Next steps
The issue with the above approach is that it requires manual definition of character classes. This is suitable for a single script. But if a use case requires analysis of, say, all twenty-two official languages of India, then there will be a need to handle not just Devanagari, but up to ten additional scripts. The upside is that these additional scripts are based upon the same orthographic structure as Devanagari; the downside is that each script has its own features. In an upcoming post, I will show how to derive character classes programmatically from the Unicode Character Database.
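As a rough preview of what programmatic derivation could look like, the sketch below groups the code points of the Devanagari block by Unicode general category using only the standard library. This is an approximation under stated assumptions, not the method of the upcoming post: Python's unicodedata module exposes the general category but not the finer Indic syllabic properties of the Unicode Character Database, and classes_by_category() is a hypothetical helper.

```python
import unicodedata
from collections import defaultdict

def classes_by_category(start, end):
    """Group the code points in [start, end] by Unicode general category."""
    groups = defaultdict(list)
    for cp in range(start, end + 1):
        ch = chr(cp)
        groups[unicodedata.category(ch)].append(ch)
    return groups

# The main Devanagari block; other scripts only need a different range.
devanagari = classes_by_category(0x0900, 0x097F)
for category, chars in sorted(devanagari.items()):
    sample = ' '.join(f'{ord(c):04X}' for c in chars[:5])
    print(category, len(chars), sample)
```

The general category separates letters (Lo), combining marks (Mn, Mc), and digits (Nd), but it does not tell vowels from consonants or a virama from a nukta, so a finer property source would still be needed for the classes used in this post.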