I often need to find out when a particular character was encoded in The
Unicode Standard. The Unicode Character Database (UCD) has a file called
DerivedAge.txt that contains such information. This UCD file provides
the Age property for characters, which can be queried to obtain the
version of Unicode in which a codepoint was assigned. However, deriving
the age of a character from DerivedAge.txt is not so straightforward on
account of the file’s format. Here, I will parse that file and define a
new function in order to obtain the necessary data with ease.
Understand the data
I used DerivedAge.txt in a previous post to explore the growth of
Unicode in terms of the number of characters added in each version.
Recall that the DerivedAge.txt file has the following presentation:
# Age=V1_1
# Assigned as of Unicode 1.1.0 (June, 1993)
# [excluding removed Hangul Syllables]
0000..001F ; 1.1 # [32] <control-0000>..<control-001F>
0020..007E ; 1.1 # [95] SPACE..TILDE
007F..009F ; 1.1 # [33] <control-007F>..<control-009F>
00A0..00AC ; 1.1 # [13] NO-BREAK SPACE..NOT SIGN
00AD ; 1.1 # SOFT HYPHEN
In that previous post, I parsed this data and stored it in the
derivedage list of lists, which has the following structure:
[['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
['1.1', '0020..007E', '95', 'SPACE..TILDE'],
['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
['1.1', '00AD', '1', 'SOFT HYPHEN']]
Notice that the second element contains the codepoint. However,
there are actually two types of data masquerading as one. In some lists,
the second element is a single codepoint, e.g. 00AD, while in others
it stands for a range, e.g. 00A0..00AC.
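The parsing code from the earlier post is not reproduced here. As a rough sketch of how rows of that shape could be produced, assuming every data line follows the `codes ; age # [count] name` layout shown above, something like the following might work. The names `LINE_RE` and `parse_derived_age` are hypothetical and are not taken from the earlier post.

```python
import re

# Hypothetical helper (not from the earlier post): turns DerivedAge.txt
# lines into rows shaped like [age, codepoint-or-range, count, name].
LINE_RE = re.compile(
    r"^([0-9A-F]{4,6}(?:\.\.[0-9A-F]{4,6})?)"  # codepoint or range
    r"\s*;\s*([0-9.]+)"                        # age, e.g. 1.1
    r"\s*#\s*(?:\[(\d+)\]\s*)?(.*)$"           # optional [count], then name
)

def parse_derived_age(lines):
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip comment lines and blank lines
        codes, age, count, name = match.groups()
        # Single-codepoint lines carry no [count]; assume a count of 1
        rows.append([age, codes, count or '1', name.strip()])
    return rows

sample = [
    "# Age=V1_1",
    "00A0..00AC ; 1.1 # [13] NO-BREAK SPACE..NOT SIGN",
    "00AD ; 1.1 # SOFT HYPHEN",
]
print(parse_derived_age(sample))
```

Filling in a count of '1' for single-codepoint lines is a choice made here to match the structure shown above; the file itself omits the bracketed count on those lines.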
Approach the problem
The issue becomes clear: searching for a particular codepoint will have
one of two results. If the query matches the exact value of the second
element, then the Age property is readily available. However, if there
is no exact match, then additional processing is necessary.
There are two ways to approach the task:
- Expand all code ranges to produce entries for each literal codepoint
- Infer a codepoint’s membership within a range of codepoints
The first approach would require generating new entries for each codepoint in a given range:
coderange = ['00A0', '00AC']
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)
for i in range(hexlower, hexupper + 1):
    print(hex(i))
0xa0
0xa1
0xa2
0xa3
0xa4
0xa5
0xa6
0xa7
0xa8
0xa9
0xaa
0xab
0xac
One benefit of this approach is the ability to produce a dictionary that contains every codepoint as a key and its age as the value:
derivedage = {
'00A0' : '1.1',
'00A1' : '1.1',
...
}
The downside is that it will significantly expand the dataset as there are more than 160k codepoints as of Unicode version 9.0.
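As a minimal sketch of this first approach, assuming rows shaped like the derivedage entries shown earlier, the expansion could look like the following. The `expand_ages` helper and the `rows` sample are illustrative names, not part of the original code.

```python
# Sample rows in the same shape as the derivedage list shown earlier.
rows = [
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]

def expand_ages(rows):
    """Map every literal codepoint (uppercase hex, at least 4 digits) to its age."""
    expanded = {}
    for age, codes, _count, _name in rows:
        bounds = codes.split('..')
        lower = int(bounds[0], 16)
        upper = int(bounds[-1], 16)  # same as lower for a single codepoint
        for value in range(lower, upper + 1):
            expanded[format(value, '04X')] = age
    return expanded

ages = expand_ages(rows)
print(len(ages))     # 14 entries: 13 from the range plus SOFT HYPHEN
print(ages['00A3'])  # 1.1
```

Lookups then become a single dictionary access, at the cost of holding one entry per codepoint in memory.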
The other approach is to search within a range at runtime and to return the associated age:
coderange = ['00A0', '00AC']
hexlower = int(coderange[0], 16)
hexupper = int(coderange[1], 16)
char = '00A3'
char = int(char, 16)
if hexlower <= char <= hexupper:
    print('found', hex(char), 'within', hex(hexlower), 'and', hex(hexupper))
else:
    print('not found')
found 0xa3 within 0xa0 and 0xac
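Because the ranges in the file do not overlap, the runtime search does not have to test every range in turn. The sketch below swaps in a binary search over sorted lower bounds using Python's standard `bisect` module. This is an alternative technique, not the method this post goes on to use, and the `build_index` and `lookup_age` names are hypothetical.

```python
from bisect import bisect_right

rows = [
    ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
    ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]

def build_index(rows):
    """Return sorted (lower, upper, age) tuples plus the list of lower bounds."""
    spans = []
    for age, codes, _count, _name in rows:
        bounds = codes.split('..')
        spans.append((int(bounds[0], 16), int(bounds[-1], 16), age))
    spans.sort()  # the file is grouped by version, so sort by codepoint first
    starts = [lower for lower, _upper, _age in spans]
    return spans, starts

def lookup_age(codepoint, spans, starts):
    """Return the age for a hex codepoint string, or None if no range holds it."""
    value = int(codepoint, 16)
    # Index of the last span whose lower bound is at or before value
    position = bisect_right(starts, value) - 1
    if position >= 0:
        lower, upper, age = spans[position]
        if lower <= value <= upper:
            return age
    return None

spans, starts = build_index(rows)
print(lookup_age('00A3', spans, starts))  # 1.1
print(lookup_age('007F', spans, starts))  # None (not covered by the sample rows)
```

The index is built once, after which each lookup needs only a handful of comparisons, without expanding every range into individual entries.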
Define the function
The function relies upon the derivedage container, which is a list of
lists with the following structure:
['1.1', '0020..007E', '95', 'SPACE..TILDE']
For our present purposes, we need to operate on the second element, which is
either a single codepoint value or a range of codepoints. If it is a range, then
we need to split it in two to define the lower and upper bounds.
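The split itself can be sketched in isolation. The `parse_bounds` helper below is a hypothetical name used only for illustration. It treats a single codepoint as a range whose lower and upper bounds are equal, which is one possible simplification and not necessarily how the function in this post handles it.

```python
def parse_bounds(codes):
    """Return (lower, upper) integers for '00AD' or '00A0..00AC'."""
    parts = codes.split('..')
    lower = int(parts[0], 16)
    upper = int(parts[-1], 16)  # a single codepoint is its own upper bound
    return lower, upper

print(parse_bounds('00A0..00AC'))  # (160, 172)
print(parse_bounds('00AD'))        # (173, 173)
```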
The function takes a list of one or more codepoints, each expressed
as a string, e.g. ‘0065’. For each element in the list, the function scans
derivedage to determine the version in which that codepoint was assigned:
def getDerivedAge(inputlist_):
    # Map each queried codepoint string to its age, or None if unassigned
    resultslist = {}
    for codepoint in inputlist_:
        scriptinfo = None
        char = int(codepoint, 16)
        for script in derivedage:
            version = script[0]
            coderange = script[1].split('..')
            if len(coderange) == 2:
                # Range entry: test membership between the bounds
                hexlower = int(coderange[0], 16)
                hexupper = int(coderange[1], 16)
                if hexlower <= char <= hexupper:
                    scriptinfo = version
            elif len(coderange) == 1:
                # Single entry: compare numerically, so the case and
                # zero-padding of the query do not matter
                if char == int(coderange[0], 16):
                    scriptinfo = version
            else:
                scriptinfo = 'Error'
        resultslist[codepoint] = scriptinfo
    return resultslist
Test the function
codepoints = ['0035', '0913', 'A8F0', 'BBBB', '11080', '14000']
getDerivedAge(codepoints)
{'0035': '1.1',
'0913': '1.1',
'11080': '5.2',
'14000': None,
'A8F0': '5.2',
'BBBB': '2.0'}
The function returns the version in which each of our specified
codepoints was encoded. But we got None for 14000… is there an error
somewhere? Searching DerivedAge.txt for 14000, either as a single
codepoint or within a range, yields nothing. A quick glance at the
Unicode Roadmap shows that 14000 is an unencoded codepoint in a block
that is tentatively allocated for Egyptian Hieroglyphs. So, there is no
character in Unicode associated with this codepoint. The value None
is accurate.
Assessment
The DerivedAge.txt file presents the ‘age’ of a codepoint, that is,
the version of Unicode in which it was assigned in the standard.
When this version information is aggregated with other data in the UCD,
it is possible to generate holistic metadata for codepoints in Unicode.
In a future post I will explore the Blocks.txt and Scripts.txt files
in order to illustrate how codepoints are grouped within the Unicode
paradigms of ‘blocks’ and ‘scripts’.