Querying the age of a Unicode code point

Posted on Thu 02 February 2017 in articles

I used the DerivedAge.txt file from the UCD to explore the growth of Unicode. Because the data records the Age property of characters, we can also use it to query the version of Unicode in which a code point was assigned. To do so, we can take the derivedage list we created and define a new function.

Understand the data

The derivedage list of lists has the following structure and format:

derivedage[0:5]
[['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
 ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
 ['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
 ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
 ['1.1', '00AD', '1', 'SOFT HYPHEN']]
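
For readers who do not have the list from the earlier post to hand, here is a minimal sketch of how one line of DerivedAge.txt could be turned into an entry of this shape. The function name parse_derivedage_line and the regular expression are assumptions made for illustration, not the code used to build derivedage; single code points are given a count of '1' to match the sample above.

```python
import re

# Assumed line layout: "<code or range> ; <version> # [<count>] <name>"
LINE_RE = re.compile(
    r'^([0-9A-F.]+)\s*;\s*([0-9.]+)\s*#\s*(?:\[(\d+)\])?\s*(.*)$'
)

def parse_derivedage_line(line):
    """Return [version, coderange, count, name], or None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # comment lines and blank lines do not match
    coderange, version, count, name = match.groups()
    # Lines for a single code point carry no "[n]" count
    return [version, coderange, count or '1', name.strip()]

print(parse_derivedage_line(
    '00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN'))
print(parse_derivedage_line(
    '00AD          ; 1.1 #       SOFT HYPHEN'))
```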

Approach

DerivedAge.txt records each entry as either a range of code points or a single code point:

00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN
00AD          ; 1.1 #       SOFT HYPHEN

There are two ways to approach the task:

  1. Expand all code ranges to produce an entry for each individual code point
  2. Test whether a code point falls within a range of code points

The first approach would require generating new entries for each code point in a given range:

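coderange = ['00A0', '00AC']
# int() with base 16 parses the hex string directly; no "0x" prefix is needed
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)

# range() excludes its end point, so add 1 to include the upper bound
for i in range(hexlower, hexupper + 1):
    print(hex(i))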
0xa0
0xa1
0xa2
0xa3
0xa4
0xa5
0xa6
0xa7
0xa8
0xa9
0xaa
0xab
0xac

One benefit of this approach is the ability to produce a dictionary that contains every code point as a key and its age as the value:

derivedage = {
    '00A0' : '1.1',
    '00A1' : '1.1',
    ...
    }

The downside is that it will significantly expand the dataset as there are more than 160k codepoints as of Unicode version 9.0.
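
The expansion can be sketched as follows, assuming entries shaped like those in derivedage. The helper name expand_ranges is hypothetical, and the keys are formatted as uppercase hex padded to at least four digits so that they match the strings used in the data.

```python
def expand_ranges(entries):
    """Build a {code point: version} dict with one key per code point."""
    expanded = {}
    for version, coderange, *_ in entries:
        bounds = coderange.split('..')
        lower = int(bounds[0], 16)
        upper = int(bounds[-1], 16)  # same as lower for a single code point
        for cp in range(lower, upper + 1):
            expanded[format(cp, '04X')] = version
    return expanded

sample = [
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]
expanded = expand_ranges(sample)
print(len(expanded))     # 14
print(expanded['00A3'])  # 1.1
```

A lookup is then a single dictionary access, at the cost of holding one key for every assigned code point.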

The other approach is to search within a range at runtime and to return the associated age:

coderange = ['00A0', '00AC']
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)

char = int('00A3', 16)

# Chained comparison tests membership in the inclusive range
if hexlower <= char <= hexupper:
    print('found', hex(char), 'within', hex(hexlower), 'and', hex(hexupper))
else:
    print('not found')
found 0xa3 within 0xa0 and 0xac

Define function

The function relies upon the derivedage container, a list in which each entry has the following structure:

['1.1', '0020..007E', '95', 'SPACE..TILDE']

For our present purposes, we need to operate on the second element (index 1), which is either a single code point or a range of code points. If it is a range, we split it in two to obtain the lower and upper bounds:

['0020', '007E']
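
The split can be sketched with str.split, assuming the range separator is always the two-dot sequence seen in the data. The helper name split_range is hypothetical; it returns the same value twice for a single code point so that both cases can be handled alike.

```python
def split_range(coderange):
    """Return (lower, upper) hex strings for a range or a single code point."""
    bounds = coderange.split('..')
    # A single code point yields a one-element list, so [0] and [-1] coincide
    return bounds[0], bounds[-1]

print(split_range('0020..007E'))  # ('0020', '007E')
print(split_range('00AD'))        # ('00AD', '00AD')
```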

The function takes a list of one or more code points, each expressed as a hexadecimal string, e.g. ‘0065’. For each element in the list, the function scans derivedage for the entry that contains the code point and records that entry's version, or None if no entry matches.

def getDerivedAge(inputlist_):
    """Map each hex code point string to the Unicode version that assigned it."""

    results = {}

    for codepoint in inputlist_:

        age = None  # stays None if no entry contains the code point
        char = int(codepoint, 16)

        for entry in derivedage:

            version = entry[0]
            coderange = entry[1].split('..')

            # A single code point is treated as a range of length one.
            # Comparing integers avoids mismatches in case or zero padding.
            hexlower = int(coderange[0], 16)
            hexupper = int(coderange[-1], 16)

            if hexlower <= char <= hexupper:
                age = version
                break  # each code point has one age, so stop at the first match

        results[codepoint] = age

    return results

Test

codepoints = ['0035', '0913', 'A8F0', 'BBBB', '11080', '14000']
getDerivedAge(codepoints)
{'0035': '1.1',
 '0913': '1.1',
 '11080': '5.2',
 '14000': None,
 'A8F0': '5.2',
 'BBBB': '2.0'}
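
Because getDerivedAge walks through the list for every code point, lookups over many code points may become slow. One alternative, not used in this post, is a binary search over the sorted lower bounds with Python's bisect module. The sketch below assumes entries shaped like those in derivedage and ranges that do not overlap; build_index and lookup_age are hypothetical names.

```python
import bisect

def build_index(entries):
    """Return (lowers, ranges) sorted by lower bound for binary search."""
    ranges = []
    for version, coderange, *_ in entries:
        bounds = coderange.split('..')
        ranges.append((int(bounds[0], 16), int(bounds[-1], 16), version))
    ranges.sort()
    lowers = [r[0] for r in ranges]
    return lowers, ranges

def lookup_age(codepoint, lowers, ranges):
    """Return the version for a hex code point string, or None."""
    char = int(codepoint, 16)
    # Find the last range whose lower bound is <= char
    i = bisect.bisect_right(lowers, char) - 1
    if i >= 0:
        lower, upper, version = ranges[i]
        if char <= upper:
            return version
    return None

sample = [
    ['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
    ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
    ['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]
lowers, ranges = build_index(sample)
print(lookup_age('0035', lowers, ranges))   # 1.1
print(lookup_age('14000', lowers, ranges))  # None
```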

Assessment

The DerivedAge.txt file presents the ‘age’ of a code point, that is, the version of Unicode in which it was assigned in the standard. When this version information is aggregated with other data in the UCD, it is possible to generate holistic metadata for code points in Unicode. In a future post I will explore the Blocks.txt and Scripts.txt files in order to illustrate how code points are grouped within the Unicode paradigms of ‘blocks’ and ‘scripts’.