Python function to transliterate Devanagari

Posted on Thu 15 September 2016 in articles

Transliteration between complex scripts and Latin often requires more than a one-to-one mapping table. The difficulty arises from the typology of these scripts: those of the Indic family, for example, are alphasyllabic. One issue in transliterating Indic scripts to Latin is handling the inherent a of consonant letters. In some contexts a consonant letter should be transliterated as the bare consonant, i.e. k, kh, g, gh, ṅ, while in others the inherent a is required, i.e. ka, kha, ga, gha, ṅa.
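
To make the problem concrete, here is a minimal sketch of a naive one-to-one lookup. The three-entry naive_table and the function naive_transliterate() are hypothetical names used only for this illustration; they are not part of the code that follows.

```python
# Hypothetical three-letter table, for illustration only: क, म, ल
naive_table = {'\u0915': 'k', '\u092E': 'm', '\u0932': 'l'}

def naive_transliterate(word):
    # Map each character independently, with no knowledge of its context.
    return ''.join(naive_table.get(char, char) for char in word)

# कमल is conventionally romanised as "kamala", but the naive lookup
# has no way of knowing where the inherent a belongs.
print(naive_transliterate('\u0915\u092E\u0932'))  # kml
```

The lookup alone cannot tell a bare consonant from one that carries the inherent a, which is why the function in this post inspects the following character.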

import re

Define conversion maps

Define mapping table as dictionary:

conversiontable = {
    'ॐ' : 'oṁ',
    'ऀ' : 'ṁ',
    'ँ' : 'ṃ',
    'ं' : 'ṃ',
    'ः' : 'ḥ',
    'अ' : 'a',
    'आ' : 'ā',
    'इ' : 'i',
    'ई' : 'ī',
    'उ' : 'u',
    'ऊ' : 'ū',
    'ऋ' : 'r̥',
    'ॠ' : 'r̥̄',
    'ऌ' : 'l̥',
    'ॡ' : 'l̥̄',
    'ऍ' : 'ê',
    'ऎ' : 'e',
    'ए' : 'e',
    'ऐ' : 'ai',
    'ऑ' : 'ô',
    'ऒ' : 'o',
    'ओ' : 'o',
    'औ' : 'au',
    'ा' : 'ā',
    'ि' : 'i',
    'ी' : 'ī',
    'ु' : 'u',
    'ू' : 'ū',
    'ृ' : 'r̥',
    'ॄ' : 'r̥̄',
    'ॢ' : 'l̥',
    'ॣ' : 'l̥̄',
    'ॅ' : 'ê',
    'े' : 'e',
    'ै' : 'ai',
    'ॉ' : 'ô',
    'ो' : 'o',
    'ौ' : 'au',
    'क़' : 'q',
    'क' : 'k',
    'ख़' : 'x',
    'ख' : 'kh',
    'ग़' : 'ġ',
    'ग' : 'g',
    'ॻ' : 'g',
    'घ' : 'gh',
    'ङ' : 'ṅ',
    'च' : 'c',
    'छ' : 'ch',
    'ज़' : 'z',
    'ज' : 'j',
    'ॼ' : 'j',
    'झ' : 'jh',
    'ञ' : 'ñ',
    'ट' : 'ṭ',
    'ठ' : 'ṭh',
    'ड़' : 'ṛ',
    'ड' : 'ḍ',
    'ॸ' : 'ḍ',
    'ॾ' : 'd',
    'ढ़' : 'ṛh',
    'ढ' : 'ḍh',
    'ण' : 'ṇ',
    'त' : 't',
    'थ' : 'th',
    'द' : 'd',
    'ध' : 'dh',
    'न' : 'n',
    'प' : 'p',
    'फ़' : 'f',
    'फ' : 'ph',
    'ब' : 'b',
    'ॿ' : 'b',
    'भ' : 'bh',
    'म' : 'm',
    'य' : 'y',
    'र' : 'r',
    'ल' : 'l',
    'ळ' : 'ḷ',
    'व' : 'v',
    'श' : 'ś',
    'ष' : 'ṣ',
    'स' : 's',
    'ह' : 'h',
    'ऽ' : '\'',
    '्' : '',
    '़' : '',
    '०' : '0',
    '१' : '1',
    '२' : '2',
    '३' : '3',
    '४' : '4',
    '५' : '5',
    '६' : '6',
    '७' : '7',
    '८' : '8',
    '९' : '9',
    'ꣳ' : 'ṁ',
    '।' : '.',
    '॥' : '..',
    ' ' : ' ',
}

Define character classes:

consonants = '\u0915-\u0939\u0958-\u095F\u0978-\u097C\u097E-\u097F'
vowelsigns = '\u093E-\u094C\u093A-\u093B\u094E-\u094F\u0955-\u0957'
nukta = '\u093C'
virama = '\u094D'

devanagarichars = '\u0900-\u097F\u1CD0-\u1CFF\uA8E0-\uA8FF'
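
As a quick check of how these range strings are meant to be used, the sketch below wraps one of them in square brackets to form a regular-expression character class. The helper is_in_class() is a hypothetical name introduced for this example; the range strings are repeated so the snippet runs on its own.

```python
import re

# Same range strings as above, repeated so this snippet is self-contained.
consonants = '\u0915-\u0939\u0958-\u095F\u0978-\u097C\u097E-\u097F'
vowelsigns = '\u093E-\u094C\u093A-\u093B\u094E-\u094F\u0955-\u0957'

def is_in_class(char, charclass):
    # Wrapping the ranges in [...] turns them into a regex character class.
    return re.match('[' + charclass + ']', char) is not None

print(is_in_class('\u0915', consonants))  # True: क is a consonant letter
print(is_in_class('\u093F', vowelsigns))  # True: ि is a vowel sign
print(is_in_class('\u0905', consonants))  # False: अ is an independent vowel
```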

Define functions


Primary transliteration function. Performs the following general operations:

  1. Loops over each character in a word
  2. If character is a Devanagari character, then continue processing, otherwise go to next character
  3. Most characters can be transliterated by mapping to Latin characters in the conversiontable dictionary. Consonants need additional attention.
  4. Perform a lookahead check to determine whether an inherent a needs to be appended after a consonant.

def deva_to_latn(text):

    word = text.strip()

    # define a buffer to store the transliteration
    curr = ''

    for index, char in enumerate(word):

        # check if char is a Devanagari character. if true then continue processing.
        # otherwise, output char to curr

        if re.match('[' + devanagarichars + ']', char):

            # if char = consonant, then its transliteration is dependent upon various
            # factors. need to check if next char = nukta, virama, vowel sign.

            if re.match('[' + consonants + ']', char):

                # check next char. slicing yields '' at the end of the word
                # instead of wrapping around to the first char
                nextchar = word[index + 1:index + 2]

                # if next char = nukta, then add present char and nukta
                # to 'cons' and look at the char after the nukta instead.
                # else just add present char

                if nextchar == nukta:
                    cons = char + nextchar
                    nextchar = word[index + 2:index + 3]
                else:
                    cons = char

                # if next char = vowel sign or virama, convert consonant
                # without "a". otherwise (including at the end of the word),
                # add "a" to consonant

                if re.match('[' + vowelsigns + virama + ']', nextchar):
                    trans = conversiontable.get(cons, '')
                    curr = curr + trans
                else:
                    trans = conversiontable.get(cons, '')
                    trans = trans + "a"
                    curr = curr + trans

            # transliterate all other chars

            else:
                trans = conversiontable.get(char, '')
                curr = curr + trans

        # char is not Devanagari. output char to curr

        else:
            curr = curr + char

    return curr


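Wrapper function that splits the Devanagari input into whitespace-separated words, passes each word to deva_to_latn(), and rejoins the results:

def getLatin(inputtext):

    # collect the transliteration of each word
    all_words = []

    for word in inputtext.split():

        latin_output = deva_to_latn(word)
        all_words.append(latin_output)

    # join only after every word has been processed
    joined_all_words = ' '.join(all_words)

    return joined_all_words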

Test transliteration routines

Test some Sanskrit text in Devanagari with Vedic ‘accents’. Accents are ignored at present.

text = "अ॒ग्निमी॑ळे पु॒रोहि॑तं य॒ज्ञस्य॑ दे॒वमृ॒त्विज॑म्। होता॑रं रत्न॒धात॑मम्॥"
getLatin(text)
'agnimīḷe purohitaṃ yajñasya devamr̥tvijam. hotāraṃ ratnadhātamam..'
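
The accents disappear because the accent marks lie inside the Devanagari block but have no entry in the mapping table, so the dictionary lookup falls back to an empty string. The sketch below illustrates this with a hypothetical one-entry stand_in_table rather than the full table above.

```python
import unicodedata

# The two accent marks that appear in the sample text above.
accents = ['\u0951', '\u0952']

for accent in accents:
    # Both fall inside the basic Devanagari block U+0900-U+097F.
    assert '\u0900' <= accent <= '\u097F'
    print('U+%04X' % ord(accent), unicodedata.name(accent))

# A lookup with a default of '' silently drops unmapped characters.
stand_in_table = {'\u0905': 'a'}  # hypothetical one-entry table
print(repr(stand_in_table.get('\u0951', '')))  # ''
```

Preserving the accents would mean adding entries for these marks to the table, with whatever Latin notation the target convention calls for.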

Test some Hindi text. Word-final inherent a is generally not transliterated in modern Hindi, but the function does not yet drop it, as the output shows.

text = "उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए"
getLatin(text)
'unheṃ buddhi aura antarātmā kī dena prāpta hai aura paraspara unheṃ bhāīcāre ke bhāva se bartāva karanā cāhie'
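
One possible post-processing step for Hindi, sketched here as a deliberately naive heuristic and not as a real schwa-deletion algorithm, strips a word-final short a when the word contains another vowel. The function name drop_final_a() and the vowel inventory LATIN_VOWELS are assumptions made for this example.

```python
# Latin vowels that can appear in the transliteration (assumed inventory).
LATIN_VOWELS = 'aāiīuūeoêô'

def drop_final_a(word):
    # Leave the word unchanged unless it ends in a short "a".
    if not word.endswith('a'):
        return word
    stem = word[:-1]
    # Keep the a if it directly follows another vowel.
    if not stem or stem[-1] in LATIN_VOWELS:
        return word
    # Keep the a if it is the only vowel in the word (e.g. "na").
    if not any(ch in LATIN_VOWELS for ch in stem):
        return word
    return stem

sample = 'aura dena prāpta bhāva na kī'
print(' '.join(drop_final_a(w) for w in sample.split()))
# aur den prāpt bhāv na kī
```

This does nothing about word-medial deletion (karanā stays as it is), and it would be wrong for Sanskrit, so it would need to be an optional step.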

Test some mixed script text.

text = "This is in the देवनागरी script"
getLatin(text)
'This is in the devanāgarī script'

Next steps

The deva_to_latn() function can be adapted for transliterating other Indic scripts.
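
One way to begin such an adaptation, sketched here for Bengali, relies on the fact that the Unicode blocks of the major Indic scripts were laid out in parallel, so many letters sit at the same offset within their block. The function bengali_to_deva() is a hypothetical helper, and the assumption that a code point has a Devanagari counterpart at the same offset holds for the basic letters but not for every character.

```python
DEVANAGARI_START = 0x0900
BENGALI_START = 0x0980
BENGALI_END = 0x09FF

def bengali_to_deva(text):
    # Shift each Bengali code point to the same offset in the Devanagari block.
    # Assumption: the character has a counterpart at that offset; script-specific
    # letters would need explicit handling instead.
    out = []
    for char in text:
        cp = ord(char)
        if BENGALI_START <= cp <= BENGALI_END:
            out.append(chr(cp - BENGALI_START + DEVANAGARI_START))
        else:
            out.append(char)
    return ''.join(out)

# কমল (Bengali) maps letter for letter onto कमल (Devanagari).
print(bengali_to_deva('\u0995\u09AE\u09B2'))  # कमल
```

The shifted text could then be passed to deva_to_latn(), although a dedicated mapping table and character classes for each script would be the more robust route.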