Exploring the growth of Unicode using the UCD and Python

Posted on Sat 28 January 2017 in articles

How many characters are in Unicode? How many new characters have been added for each version of Unicode? From a programmatic standpoint, figuring out the number of characters published in a given version of Unicode is not as straightforward as one might imagine. While it is true that Unicode contains ‘characters’ from writing systems, there are also entities that are not truly ‘characters’ in the conventional sense. There are entities such as control characters, private-use characters, surrogates, and even noncharacters. Therefore, to get a sense of the number of entities in Unicode, it is practical to think instead in terms of ‘code points’.

In this exercise, I process data from the Unicode Character Database (UCD) to produce an overview of the longitudinal growth of Unicode. There are certainly other ways to approach the question; my goal is to do so programmatically using available data. Specifically, I will explore a UCD file called DerivedAge.txt, which lists the code point assignments for each version, from 1.1 to 9.0.

I will process this file to extract the number of code points in each version of Unicode, and also to distinguish at a high level between characters and other entities.

Approach

Import the Python modules needed for this task:

import re

import requests
from prettytable import PrettyTable

Get the data

Fetch the plain text file DerivedAge.txt from the Unicode Character Database using requests, and get the body of the response as a string from its .text attribute:

url = "http://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt"
r = requests.get(url).text
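
The `latest` directory moves with each Unicode release, so a rerun may pick up versions beyond 9.0. Below is a minimal sketch for pinning the download to one release. It assumes the versioned directory layout `/Public/<x.y.z>/ucd/` on unicode.org and uses the standard library's urllib in place of requests. The names `derived_age_url` and `fetch_derived_age` are hypothetical helpers, not part of the original code.

```python
from urllib.request import urlopen

# Assumption: versioned UCD directories follow /Public/<x.y.z>/ucd/
BASE = "https://www.unicode.org/Public/{version}/ucd/DerivedAge.txt"


def derived_age_url(version="9.0.0"):
    """Build the URL for DerivedAge.txt in a specific Unicode release."""
    return BASE.format(version=version)


def fetch_derived_age(version="9.0.0", timeout=30):
    """Download DerivedAge.txt for one release and return it as text."""
    with urlopen(derived_age_url(version), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_derived_age()` would then return the same kind of string as `r` above, but tied to a fixed release.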

Inspect the data

len(r)

Split the requests text by newline into a list named filestream, then take a slice of the list to get a sense of the data:

filestream = r.split('\n')

for line in filestream[51:70]:
    print (line)

# Age=V1_1

# Assigned as of Unicode 1.1.0 (June, 1993)
# [excluding removed Hangul Syllables]

0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>
0020..007E    ; 1.1 #  [95] SPACE..TILDE
007F..009F    ; 1.1 #  [33] <control-007F>..<control-009F>
00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN
00AD          ; 1.1 #       SOFT HYPHEN
00AE..01F5    ; 1.1 # [328] REGISTERED SIGN..LATIN SMALL LETTER G WITH ACUTE
01FA..0217    ; 1.1 #  [30] LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE..LATIN SMALL LETTER U WITH INVERTED BREVE
0250..02A8    ; 1.1 #  [89] LATIN SMALL LETTER TURNED A..LATIN SMALL LETTER TC DIGRAPH WITH CURL
02B0..02DE    ; 1.1 #  [47] MODIFIER LETTER SMALL H..MODIFIER LETTER RHOTIC HOOK
02E0..02E9    ; 1.1 #  [10] MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER EXTRA-LOW TONE BAR
0300..0345    ; 1.1 #  [70] COMBINING GRAVE ACCENT..COMBINING GREEK YPOGEGRAMMENI
0360..0361    ; 1.1 #   [2] COMBINING DOUBLE TILDE..COMBINING DOUBLE INVERTED BREVE
0374..0375    ; 1.1 #   [2] GREEK NUMERAL SIGN..GREEK LOWER NUMERAL SIGN
037A          ; 1.1 #       GREEK YPOGEGRAMMENI

Observations

The file lists the codepoints encoded in each version. It specifies the property Age, which is indicated both as a comment before each group of codepoints

# Age=V1_1

and in the data for every code point or range of codepoints:

037A          ; 1.1 #       GREEK YPOGEGRAMMENI

The above record tells us that the character U+037A GREEK YPOGEGRAMMENI was encoded in Unicode version 1.1.
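
As a quick cross-check, the character name in that record can be compared with the name Python reports through its built-in `unicodedata` module. This is only a sanity check on the name; `unicodedata` does not expose the Age property.

```python
import unicodedata

codepoint = 0x037A
# look up the character name for the code point and print it in U+XXXX form
print("U+{:04X} {}".format(codepoint, unicodedata.name(chr(codepoint))))
```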

The total number of code points is given as a comment after the list for each version:

# Total code points: 33979

However, the number of codepoints does not tell us about the number of characters, or other entities. Such details are available in DerivedAge.txt, but we will need to obtain them programmatically from the data. First, let’s take a look at the format and contents of the file.

Description of the data format

A sample of data in DerivedAge.txt:

00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN
00AD          ; 1.1 #       SOFT HYPHEN
4E00..9FA5    ; 1.1 # [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
E000..F8FF    ; 1.1 # [6400] <private-use-E000>..<private-use-F8FF>

As shown, DerivedAge.txt is a CSV-like flat file. There are two primary fields, separated by a semicolon (;). The information after # may be considered a comment, but it can be split into two further fields, giving four in total:

field 0: codepoint or codepoint range for a character or group of contiguous characters
field 1: version of Unicode in which the character(s) was/were encoded
field 2: number of codepoints in a range, in square brackets, empty if a single codepoint
field 3: name or names of first and last item in a contiguous group; characters are UPPERCASE, while noncharacters are lowercase within < … > delimiters

The format is generally uniform, apart from inconsistencies in the spacing of the fields after #. Field 2 is allotted seven columns between the # and field 3. In most cases, the […] fits within those seven columns: when the codepoint count is present, there is at least one space between the # and […], and one space between […] and field 3. If the […] takes more than five characters, the starting column of the character names shifts right.
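
Because the spacing varies, it is safer to split on the delimiters than on column positions. Here is a minimal sketch of that idea using plain string methods rather than a regex. The function name `split_record` is hypothetical, and the sketch assumes that character names never contain `#` and never start with `[`.

```python
def split_record(line):
    """Split one DerivedAge.txt data line into fields 0 to 3."""
    data, _, comment = line.partition("#")
    coderange, _, version = data.partition(";")
    comment = comment.strip()
    count = ""
    if comment.startswith("["):
        # the bracketed count is present only for ranges
        count, _, comment = comment[1:].partition("]")
    return [coderange.strip(), version.strip(), count, comment.strip()]


sample = "4E00..9FA5    ; 1.1 # [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5"
print(split_record(sample))
print(split_record("00AD          ; 1.1 #       SOFT HYPHEN"))
```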

Define regex for extracting fields

The fields can be extracted using a regex:

^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)

  ^^^^^^^        ^^          ^^^^^^^^^      ^^
     1            2              3           4

Four capture groups are defined:

([\w\.]*) for the codepoint range or single code point
(.*) for the version number
(?:\[(\d*)\])? is an optional non-capturing group around […]; the inner (\d*) captures the count, and the group is None for a single code point
(.*) for the character name(s)
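
To see the groups in action, the pattern can be applied to two of the sample lines shown earlier, one with a range and one with a single code point. The constant name `RECORD` is introduced here only for illustration.

```python
import re

# same pattern as above, written as a raw string and compiled once
RECORD = re.compile(r'^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)')

samples = [
    "00A0..00AC    ; 1.1 #  [13] NO-BREAK SPACE..NOT SIGN",
    "00AD          ; 1.1 #       SOFT HYPHEN",
]

for line in samples:
    # groups() returns the four captures; the third is None for a single code point
    print(RECORD.search(line).groups())
```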

Wrangle the data

Now wrangle the data from its native format into a more usable container:

derivedage = []

derivedage will be a list of lists with a structure of:

[
  ['version number', 'code point range', 'count', 'character names']
]

Also, let’s use a dictionary to capture the Age property values that have been explicitly specified. These values can be used for validating the data that we will derive from the file.

derivedage_properties = {}

Now we’ll read the file line by line, and extract information from lines:

for line in filestream:

    # use the regex defined above:

    m = re.search(r'^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)', line)

    if m:

        m_coderange = m.group(1)
        m_version = m.group(2)

        # m.group(3) is None when the optional (?:\[(\d*)\])? segment does not match

        if m.group(3) is None:
            m_count = '1'
        else:
            m_count = m.group(3)

        m_charname = m.group(4)

        derivedage.append([m_version, m_coderange, m_count, m_charname])


    # now capture the age property details:

    age_property_match = re.search(r'\# Age=V(.*)', line)
    age_count_match = re.search(r'\# Total code points: (.*)', line)

    if age_property_match:
        age_property = age_property_match.group(1)
        age_property = age_property.replace('_', '.')
        derivedage_properties.update({age_property : 0})

    if age_count_match:
        age_count = age_count_match.group(1)
        derivedage_properties[age_property] = int(age_count)

Verify derivedage:

derivedage[0:10]

[['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
 ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
 ['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
 ['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
 ['1.1', '00AD', '1', 'SOFT HYPHEN'],
 ['1.1',
  '00AE..01F5',
  '328',
  'REGISTERED SIGN..LATIN SMALL LETTER G WITH ACUTE'],
 ['1.1',
  '01FA..0217',
  '30',
  'LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE..LATIN SMALL LETTER U WITH INVERTED BREVE'],
 ['1.1',
  '0250..02A8',
  '89',
  'LATIN SMALL LETTER TURNED A..LATIN SMALL LETTER TC DIGRAPH WITH CURL'],
 ['1.1',
  '02B0..02DE',
  '47',
  'MODIFIER LETTER SMALL H..MODIFIER LETTER RHOTIC HOOK'],
 ['1.1',
  '02E0..02E9',
  '10',
  'MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER EXTRA-LOW TONE BAR']]
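
The bracketed count stored in each row offers a second check: it should equal the inclusive length of the code point range. The following is a minimal sketch of that cross-check on three rows copied from the output above; `range_length` is a hypothetical helper name.

```python
def range_length(coderange):
    """Number of code points in 'XXXX' or 'XXXX..YYYY' (inclusive)."""
    first, _, last = coderange.partition('..')
    # a single code point has no '..', so fall back to first
    return int(last or first, 16) - int(first, 16) + 1


# rows copied from the derivedage output above
rows = [
    ['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
    ['1.1', '00AE..01F5', '328', 'REGISTERED SIGN..LATIN SMALL LETTER G WITH ACUTE'],
]

# collect any row whose stated count disagrees with its range
mismatches = [row for row in rows if range_length(row[1]) != int(row[2])]
print(mismatches)
```

An empty list means the stated counts and the ranges agree for these rows.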

Verify derivedage_properties:

derivedage_properties

{'1.1': 33979,
 '2.0': 144521,
 '2.1': 2,
 '3.0': 10307,
 '3.1': 44978,
 '3.2': 1016,
 '4.0': 1226,
 '4.1': 1273,
 '5.0': 1369,
 '5.1': 1624,
 '5.2': 6648,
 '6.0': 2088,
 '6.1': 732,
 '6.2': 1,
 '6.3': 5,
 '7.0': 2834,
 '8.0': 7716,
 '9.0': 7500}

Process the data

Define dictionaries for capturing the data:

DA_codepoints = {}
DA_chars = {}
DA_other = {}

DA_entities = {}

Now iterate over the derivedage list of lists:

for entry in derivedage:

    version = entry[0]
    coderange = entry[1].split('..')
    count = entry[2]
    charname = entry[3]

    # 1. get codepoint range. if len(coderange) == 2, get the number of
    # codepoints in the range by subtracting coderange[0] from coderange[1]
    # and adding 1. the values in these lists are string representations
    # of hexadecimal numbers, so convert them to integers first:

    if len(coderange) == 2:
        range_length = int(coderange[1], 16) - int(coderange[0], 16) + 1
    elif len(coderange) == 1:
        range_length = 1

    # 2. store number of total codepoints per version:

    if version in DA_codepoints:
        DA_codepoints[version] += range_length
    else:
        DA_codepoints[version] = range_length


    # 3. store number of characters and other codepoints.
    # check for codepoint types. noncharacters are identified in 
    # DerivedAge.txt using descriptions in <..> brackets

    nonchar_match = re.search(r'^\<', charname)

    # if entry is not a character:
    if nonchar_match:

        # keep a record of the number of noncharacters per version:

        if version in DA_other:
            DA_other[version] += range_length
        else:
            DA_other[version] = range_length

    # if entry is a character:
    else:

        # keep a record of the number of characters per version:

        if version in DA_chars:
            DA_chars[version] += range_length
        else:
            DA_chars[version] = range_length

    # 4. store number of codepoint types per version:

    nonchar_type_match = re.search(r'^\<(\w*)\-', charname)

    if nonchar_type_match:
        entity_type = nonchar_type_match.group(1)
    else:
        entity_type = 'character'

    if version in DA_entities:               
        if entity_type in DA_entities[version]:
            DA_entities[version][entity_type] += range_length
        else:
            DA_entities[version].update({entity_type : range_length})
    else:
        DA_entities[version] = {entity_type : range_length}
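
The repeated "if key in dict ... else ..." pattern can also be expressed with `collections.defaultdict` and `collections.Counter`. The sketch below is an alternative formulation of step 4, not the code used for the results in this post. It runs on three sample rows in the same shape as derivedage, and the variable name `entities` is introduced only for illustration.

```python
import re
from collections import Counter, defaultdict

# a few rows in the same shape as derivedage
rows = [
    ['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
    ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
    ['1.1', 'E000..F8FF', '6400', '<private-use-E000>..<private-use-F8FF>'],
]

# version -> Counter of entity type -> number of code points
entities = defaultdict(Counter)

for version, coderange, _count, charname in rows:
    first, _, last = coderange.partition('..')
    length = int(last or first, 16) - int(first, 16) + 1
    m = re.search(r'^<(\w*)-', charname)
    entity_type = m.group(1) if m else 'character'
    # missing keys start at zero, so no membership tests are needed
    entities[version][entity_type] += length

print(dict(entities['1.1']))
```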

Review generated data

Number of codepoints by version:

DA_codepoints

{'1.1': 33979,
 '2.0': 144521,
 '2.1': 2,
 '3.0': 10307,
 '3.1': 44978,
 '3.2': 1016,
 '4.0': 1226,
 '4.1': 1273,
 '5.0': 1369,
 '5.1': 1624,
 '5.2': 6648,
 '6.0': 2088,
 '6.1': 732,
 '6.2': 1,
 '6.3': 5,
 '7.0': 2834,
 '8.0': 7716,
 '9.0': 7500}

Does our codepoint count match the Age property values explicitly given in DerivedAge.txt?:

DA_codepoints == derivedage_properties

True

Number of new characters per version:

DA_chars

{'1.1': 27512,
 '2.0': 11373,
 '2.1': 2,
 '3.0': 10307,
 '3.1': 44946,
 '3.2': 1016,
 '4.0': 1226,
 '4.1': 1273,
 '5.0': 1369,
 '5.1': 1624,
 '5.2': 6648,
 '6.0': 2088,
 '6.1': 732,
 '6.2': 1,
 '6.3': 5,
 '7.0': 2834,
 '8.0': 7716,
 '9.0': 7500}

Number of new noncharacters per version:

DA_other

{'1.1': 6467, '2.0': 133148, '3.1': 32}

Breakdown of codepoint types by version:

DA_entities

{'1.1': {'character': 27512,
  'control': 65,
  'noncharacter': 2,
  'private': 6400},
 '2.0': {'character': 11373,
  'noncharacter': 32,
  'private': 131068,
  'surrogate': 2048},
 '2.1': {'character': 2},
 '3.0': {'character': 10307},
 '3.1': {'character': 44946, 'noncharacter': 32},
 '3.2': {'character': 1016},
 '4.0': {'character': 1226},
 '4.1': {'character': 1273},
 '5.0': {'character': 1369},
 '5.1': {'character': 1624},
 '5.2': {'character': 6648},
 '6.0': {'character': 2088},
 '6.1': {'character': 732},
 '6.2': {'character': 1},
 '6.3': {'character': 5},
 '7.0': {'character': 2834},
 '8.0': {'character': 7716},
 '9.0': {'character': 7500}}
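
The per-version breakdown can be collapsed into totals per entity type. This minimal sketch uses `collections.Counter` on the three entries copied from the output above that contain non-character types; the names `sample_entities` and `totals` are introduced only for illustration.

```python
from collections import Counter

# three entries copied from the DA_entities output above
sample_entities = {
    '1.1': {'character': 27512, 'control': 65, 'noncharacter': 2, 'private': 6400},
    '2.0': {'character': 11373, 'noncharacter': 32, 'private': 131068, 'surrogate': 2048},
    '3.1': {'character': 44946, 'noncharacter': 32},
}

totals = Counter()
for counts in sample_entities.values():
    # Counter.update adds the counts instead of replacing them
    totals.update(counts)

print(dict(totals))
```

Since DA_other lists only versions 1.1, 2.0 and 3.1, the non-character totals from these three entries cover the whole file; the 'character' total here covers only those three versions.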

Present the data

We will use the PrettyTable module to present the data as text tables.

New codepoints and codepoint types by version

x = PrettyTable(['version', 'new codepoints', 'new chars', 'new other'])

for k, v in DA_codepoints.items():
    x.add_row([k, v, DA_chars.get(k, '-'), DA_other.get(k, '-')])

print (x)

+---------+----------------+-----------+-----------+
| version | new codepoints | new chars | new other |
+---------+----------------+-----------+-----------+
|   1.1   |     33979      |   27512   |    6467   |
|   2.0   |     144521     |   11373   |   133148  |
|   2.1   |       2        |     2     |     -     |
|   3.0   |     10307      |   10307   |     -     |
|   3.1   |     44978      |   44946   |     32    |
|   3.2   |      1016      |    1016   |     -     |
|   4.0   |      1226      |    1226   |     -     |
|   4.1   |      1273      |    1273   |     -     |
|   5.0   |      1369      |    1369   |     -     |
|   5.1   |      1624      |    1624   |     -     |
|   5.2   |      6648      |    6648   |     -     |
|   6.0   |      2088      |    2088   |     -     |
|   6.1   |      732       |    732    |     -     |
|   6.2   |       1        |     1     |     -     |
|   6.3   |       5        |     5     |     -     |
|   7.0   |      2834      |    2834   |     -     |
|   8.0   |      7716      |    7716   |     -     |
|   9.0   |      7500      |    7500   |     -     |
+---------+----------------+-----------+-----------+
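
The row order above depends on the iteration order of the dictionary, which follows insertion order (and therefore file order) on Python 3.7 and later. A minimal sketch of an explicit sort key that does not rely on that behavior is shown below. `version_key` is a hypothetical helper, and '10.0' is included only to show where plain string sorting would go wrong.

```python
def version_key(version):
    """Turn '5.2' into (5, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


# '10.0' is an illustrative later version, not part of the data above
versions = ['9.0', '10.0', '1.1', '2.1', '2.0']

print(sorted(versions))                   # string order puts '10.0' before '2.0'
print(sorted(versions, key=version_key))  # numeric order
```

In the table loop, this would amount to iterating over `sorted(DA_codepoints, key=version_key)` and looking up each value by key.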

Growth of codepoints, characters, noncharacters by version

Declare three lists and three running counters:

DA_char_totals = []
DA_other_totals = []
DA_codepoint_totals = []

count_codepoints = 0
count_chars = 0
count_other = 0

z = PrettyTable(['version', 'total codepoints', 'total chars', 'total other'])

# for each version in DA_codepoints, get the corresponding values 
# from DA_chars and DA_other. Produce running count of the values 
# for each version.

for k, v in DA_codepoints.items():

    count_codepoints += DA_codepoints.get(k, 0)
    count_chars += DA_chars.get(k, 0)
    count_other += DA_other.get(k, 0)

    DA_char_totals.append([k, DA_chars.get(k, 0), count_chars])
    DA_other_totals.append([k, DA_other.get(k, 0), count_other])
    DA_codepoint_totals.append([k, DA_codepoints.get(k, 0), count_codepoints])

    z.add_row([k, count_codepoints, count_chars, count_other])

print (z)

+---------+------------------+-------------+-------------+
| version | total codepoints | total chars | total other |
+---------+------------------+-------------+-------------+
|   1.1   |      33979       |    27512    |     6467    |
|   2.0   |      178500      |    38885    |    139615   |
|   2.1   |      178502      |    38887    |    139615   |
|   3.0   |      188809      |    49194    |    139615   |
|   3.1   |      233787      |    94140    |    139647   |
|   3.2   |      234803      |    95156    |    139647   |
|   4.0   |      236029      |    96382    |    139647   |
|   4.1   |      237302      |    97655    |    139647   |
|   5.0   |      238671      |    99024    |    139647   |
|   5.1   |      240295      |    100648   |    139647   |
|   5.2   |      246943      |    107296   |    139647   |
|   6.0   |      249031      |    109384   |    139647   |
|   6.1   |      249763      |    110116   |    139647   |
|   6.2   |      249764      |    110117   |    139647   |
|   6.3   |      249769      |    110122   |    139647   |
|   7.0   |      252603      |    112956   |    139647   |
|   8.0   |      260319      |    120672   |    139647   |
|   9.0   |      267819      |    128172   |    139647   |
+---------+------------------+-------------+-------------+
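
The running totals can also be produced with `itertools.accumulate` from the standard library. This is a minimal sketch on the first four per-version values from the table of new codepoints; it is an alternative to the manual counters, not the code that generated the table.

```python
from itertools import accumulate

versions = ['1.1', '2.0', '2.1', '3.0']
new_codepoints = [33979, 144521, 2, 10307]  # per-version values from the earlier table

# accumulate yields the running sum after each element
for version, total in zip(versions, accumulate(new_codepoints)):
    print(version, total)
```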

z = PrettyTable(['version', 'new codepoints', 'total codepoints'])

for e in DA_codepoint_totals:
    z.add_row(e)

print (z)

+---------+----------------+------------------+
| version | new codepoints | total codepoints |
+---------+----------------+------------------+
|   1.1   |     33979      |      33979       |
|   2.0   |     144521     |      178500      |
|   2.1   |       2        |      178502      |
|   3.0   |     10307      |      188809      |
|   3.1   |     44978      |      233787      |
|   3.2   |      1016      |      234803      |
|   4.0   |      1226      |      236029      |
|   4.1   |      1273      |      237302      |
|   5.0   |      1369      |      238671      |
|   5.1   |      1624      |      240295      |
|   5.2   |      6648      |      246943      |
|   6.0   |      2088      |      249031      |
|   6.1   |      732       |      249763      |
|   6.2   |       1        |      249764      |
|   6.3   |       5        |      249769      |
|   7.0   |      2834      |      252603      |
|   8.0   |      7716      |      260319      |
|   9.0   |      7500      |      267819      |
+---------+----------------+------------------+

z = PrettyTable(['version', 'new chars', 'total chars'])

for e in DA_char_totals:
    z.add_row(e)

print (z)

+---------+-----------+-------------+
| version | new chars | total chars |
+---------+-----------+-------------+
|   1.1   |   27512   |    27512    |
|   2.0   |   11373   |    38885    |
|   2.1   |     2     |    38887    |
|   3.0   |   10307   |    49194    |
|   3.1   |   44946   |    94140    |
|   3.2   |    1016   |    95156    |
|   4.0   |    1226   |    96382    |
|   4.1   |    1273   |    97655    |
|   5.0   |    1369   |    99024    |
|   5.1   |    1624   |    100648   |
|   5.2   |    6648   |    107296   |
|   6.0   |    2088   |    109384   |
|   6.1   |    732    |    110116   |
|   6.2   |     1     |    110117   |
|   6.3   |     5     |    110122   |
|   7.0   |    2834   |    112956   |
|   8.0   |    7716   |    120672   |
|   9.0   |    7500   |    128172   |
+---------+-----------+-------------+

z = PrettyTable(['version', 'new other', 'total other'])

for e in DA_other_totals:
    z.add_row(e)

print (z)

+---------+-----------+-------------+
| version | new other | total other |
+---------+-----------+-------------+
|   1.1   |    6467   |     6467    |
|   2.0   |   133148  |    139615   |
|   2.1   |     0     |    139615   |
|   3.0   |     0     |    139615   |
|   3.1   |     32    |    139647   |
|   3.2   |     0     |    139647   |
|   4.0   |     0     |    139647   |
|   4.1   |     0     |    139647   |
|   5.0   |     0     |    139647   |
|   5.1   |     0     |    139647   |
|   5.2   |     0     |    139647   |
|   6.0   |     0     |    139647   |
|   6.1   |     0     |    139647   |
|   6.2   |     0     |    139647   |
|   6.3   |     0     |    139647   |
|   7.0   |     0     |    139647   |
|   8.0   |     0     |    139647   |
|   9.0   |     0     |    139647   |
+---------+-----------+-------------+

Commentary

This exercise helped us to analyze the number of codepoints added to Unicode over time using DerivedAge.txt. The data provides a high-level overview of the number and types of codepoints that have been assigned. The UCD contains other data that provide more details on the characters, noncharacters, and other entities in the standard. I will explore these files in subsequent posts in order to expand our understanding of Unicode data through the Unicode Character Database.