How old is a Unicode character?

Posted on Thu 02 February 2017 in articles

I often need to find out when a particular character was encoded in The Unicode Standard. The Unicode Character Database (UCD) has a file called DerivedAge.txt that contains such information. This UCD file provides the Age property for characters, which can be queried to obtain the version of Unicode in which a codepoint was assigned. However, deriving the age of a character from DerivedAge.txt is not so straightforward on account of the file’s format. Here, I will parse that file and define a new function in order to obtain the necessary data with ease.

Understand the data

I used DerivedAge.txt in a previous post to explore the growth of Unicode in terms of the number of characters added in each version. Recall that the DerivedAge.txt file has the following presentation:

    # Age=V1_1

    # Assigned as of Unicode 1.1.0 (June, 1993)
    # [excluding removed Hangul Syllables]

    0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>
    0020..007E    ; 1.1 #  [95] SPACE..TILDE
    007F..009F    ; 1.1 #  [33] <control-007F>..<control-009F>
    00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN
    00AD          ; 1.1 #       SOFT HYPHEN

In that previous post, I parsed this data and stored it in derivedage, a list of lists with the following structure:

derivedage[0:5]
[['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
 ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
 ['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
 ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
 ['1.1', '00AD', '1', 'SOFT HYPHEN']]
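The parsing itself was covered in the earlier post. For readers starting here, a minimal sketch that produces the same list-of-lists shape might look like the following. The function name parse_derived_age and the regular expression are assumptions based only on the sample lines shown above, not the original parser.

```python
import re

# Hypothetical pattern, inferred from the sample lines of DerivedAge.txt:
#   <code or range> ; <age> # [<count>] <name or name range>
LINE_RE = re.compile(
    r'^(?P<code>[0-9A-F]{4,6}(?:\.\.[0-9A-F]{4,6})?)\s*;\s*'
    r'(?P<age>\d+\.\d+)\s*#\s*'
    r'(?:\[(?P<count>\d+)\]\s*)?'
    r'(?P<name>.*)$'
)

def parse_derived_age(lines):
    """Return [age, codepoint-or-range, count, name] lists from DerivedAge.txt lines."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # comment, header, or blank line
        entries.append([
            match.group('age'),
            match.group('code'),
            match.group('count') or '1',  # single codepoints carry no [n] count
            match.group('name').strip(),
        ])
    return entries

sample = [
    "# Age=V1_1",
    "",
    "0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>",
    "00AD          ; 1.1 #       SOFT HYPHEN",
]
parsed = parse_derived_age(sample)
print(parsed)
```

To process the real file, pass an open file object instead of the sample list, since the function only needs an iterable of lines.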

Notice that the second element contains the codepoint. However, there are actually two types of data masquerading as one. In some lists, the second element is a single codepoint, e.g. 00AD, while in others it stands for a range, e.g. 00A0..00AC.

Approach the problem

The issue becomes clear: searching for a particular codepoint will have one of two results. If the query matches the exact value of the second element, then the Age property is readily available. However, if there is no exact match, then additional processing is necessary.

There are two ways to approach the task:

  1. Expand all code ranges to produce entries for each literal codepoint
  2. Infer a codepoint’s membership within a range of codepoints

The first approach would require generating new entries for each codepoint in a given range:

coderange = ['00A0', '00AC']
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)

for i in range(hexlower, hexupper + 1):
    print(hex(i))
0xa0
0xa1
0xa2
0xa3
0xa4
0xa5
0xa6
0xa7
0xa8
0xa9
0xaa
0xab
0xac

One benefit of this approach is the ability to produce a dictionary that contains every codepoint as a key and its age as the value:

derivedage = {
    '00A0' : '1.1',
    '00A1' : '1.1',
    ...
    }

The downside is that it will significantly expand the dataset as there are more than 160k codepoints as of Unicode version 9.0.
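Applied to the whole dataset, the expansion could be sketched roughly as follows. The helper name expand_ranges and the two-entry sample are illustrative assumptions. Keys are written as uppercase hex padded to at least four digits, to match the notation used in DerivedAge.txt.

```python
def expand_ranges(entries):
    """Build a {codepoint: age} dict with one key per literal codepoint."""
    expanded = {}
    for age, code, _count, _name in entries:
        bounds = code.split('..')
        lower = int(bounds[0], 16)
        upper = int(bounds[-1], 16)  # equals lower for a single codepoint
        for value in range(lower, upper + 1):
            expanded[format(value, '04X')] = age
    return expanded

# Small excerpt in the same shape as the derivedage list
sample = [
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]

ages = expand_ranges(sample)
print(len(ages))      # 14
print(ages['00A3'])   # 1.1
```

Lookups then become plain dictionary accesses, at the cost of holding one entry per codepoint in memory.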

The other approach is to search within a range at runtime and to return the associated age:

coderange = ['00A0', '00AC']
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)

char = '00A3'

char = int(char, 16)
if hexlower <= char <= hexupper:
    print('found', hex(char), 'within', hex(hexlower), 'and', hex(hexupper))
else:
    print('not found')
found 0xa3 within 0xa0 and 0xac

Define the function

The function relies upon the derivedage container, which is a list of lists with the following structure:

['1.1', '0020..007E', '95', 'SPACE..TILDE']

For our present purposes, we need to perform an operation on the second element, which is either a single codepoint value or a range of codepoints. If it is a range, then we need to split it in two to obtain the lower and upper bounds:

['0020', '007E']
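One way to handle both forms uniformly is to treat a single codepoint as a range whose two bounds coincide. The helper below, parse_bounds, is a hypothetical name used only for illustration; it is not used by the function that follows.

```python
def parse_bounds(code):
    """Return (lower, upper) integers for '00AD' or '0020..007E'."""
    parts = code.split('..')
    lower = int(parts[0], 16)
    upper = int(parts[-1], 16)  # parts[-1] is parts[0] when there is no '..'
    return lower, upper

print(parse_bounds('0020..007E'))  # (32, 126)
print(parse_bounds('00AD'))        # (173, 173)
```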

The function will take a list of one or more codepoints. Each codepoint is expressed as a string, e.g. ‘0065’. For each element in the list, the function will scan derivedage to determine whether the codepoint matches a single entry or falls within a range:

def getDerivedAge(inputlist_):

    results = {}

    for codepoint in inputlist_:

        age = None
        char = int(codepoint, 16)

        for entry in derivedage:

            version = entry[0]
            coderange = entry[1].split('..')

            if len(coderange) == 2:

                # Range entry: test membership between the two bounds
                lower = int(coderange[0], 16)
                upper = int(coderange[1], 16)

                if lower <= char <= upper:
                    age = version
                    break

            elif len(coderange) == 1:

                # Single entry: compare as integers so '35', '0035' and '00ad' all match
                if char == int(coderange[0], 16):
                    age = version
                    break

            else:

                # Malformed entry in derivedage
                age = 'Error'

        results[codepoint] = age

    return results
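The function above scans derivedage once per codepoint, which is fine for a handful of lookups. For many lookups, one alternative is to sort the ranges once and binary-search them with Python's bisect module. This is a different technique from the one used in this post, and it assumes the ranges do not overlap. The names build_index and lookup_age are assumptions for illustration, and the sample is only the five-entry excerpt shown earlier.

```python
import bisect

def build_index(entries):
    """Return parallel lists (lowers, uppers, ages) sorted by lower bound."""
    rows = []
    for age, code, _count, _name in entries:
        parts = code.split('..')
        rows.append((int(parts[0], 16), int(parts[-1], 16), age))
    rows.sort()
    lowers = [row[0] for row in rows]
    uppers = [row[1] for row in rows]
    ages = [row[2] for row in rows]
    return lowers, uppers, ages

def lookup_age(codepoint, index):
    """Return the age for a hex codepoint string, or None if it is in no range."""
    lowers, uppers, ages = index
    char = int(codepoint, 16)
    # Position of the last range whose lower bound is <= char
    pos = bisect.bisect_right(lowers, char) - 1
    if pos >= 0 and char <= uppers[pos]:
        return ages[pos]
    return None

# Excerpt only; a real index would be built from the full derivedage list
sample = [
    ['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
    ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
    ['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]

index = build_index(sample)
print(lookup_age('0035', index))   # 1.1
print(lookup_age('14000', index))  # None
```

The index is built once, after which each lookup takes a logarithmic number of comparisons instead of a full pass over the list.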

Test the function

codepoints = ['0035', '0913', 'A8F0', 'BBBB', '11080', '14000']
getDerivedAge(codepoints)
{'0035': '1.1',
 '0913': '1.1',
 '11080': '5.2',
 '14000': None,
 'A8F0': '5.2',
 'BBBB': '2.0'}

The function returns the version in which each of our specified codepoints was encoded. But we got None for 14000. Is there an error somewhere? Searching DerivedAge.txt for 14000, either as a single codepoint or within a range, yields nothing. A quick glance at the Unicode Roadmap shows that 14000 is an unencoded codepoint in a block that is tentatively allocated for Egyptian Hieroglyphs. So, there is no character in Unicode associated with this codepoint. The value None is accurate.

Assessment

The DerivedAge.txt file presents the ‘age’ of a codepoint, that is, the version of Unicode in which it was assigned in the standard. When this version information is aggregated with other data in the UCD, it is possible to generate holistic metadata for codepoints in Unicode. In a future post I will explore the Blocks.txt and Scripts.txt files in order to illustrate how codepoints are grouped within the Unicode paradigms of ‘blocks’ and ‘scripts’.