How many characters are in Unicode? How many new characters have been added for each version of Unicode? From a programmatic standpoint, figuring out the number of characters published in a given version of Unicode is not as straightforward as one might imagine. While it is true that Unicode contains ‘characters’ from writing systems, there are also entities that are not truly ‘characters’ in the conventional sense. There are entities such as control characters, private-use characters, surrogates, and even noncharacters. Therefore, to get a sense of the number of entities in Unicode, it is practical to think instead in terms of ‘code points’.
In this exercise, I process data from the Unicode Character Database (UCD) to produce an overview of the longitudinal growth of Unicode. There are certainly other ways to solve the challenge; my goal is to do so programmatically using available data. Specifically, I will explore a UCD file called DerivedAge.txt, which provides all code point assignments for each version, from 1.1 to 9.0.
I will process this file to extract the number of code points in each version of Unicode, and also to distinguish at a high level between characters and other entities.
Approach
Import the Python modules needed for this task:
import re
import requests
from prettytable import PrettyTable
Get the data
Fetch the plain text file DerivedAge.txt from the Unicode Character Database using requests, and get the body of the response as a string from its .text attribute:
url = "http://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt"
r = requests.get(url).text
Inspect the data
Split the response text by newline into a list named filestream, then take a slice of the list to get a sense of the data:
filestream = r.split('\n')
for line in filestream[51:70]:
    print(line)
# Age=V1_1
# Assigned as of Unicode 1.1.0 (June, 1993)
# [excluding removed Hangul Syllables]
0000..001F ; 1.1 # [32] <control-0000>..<control-001F>
0020..007E ; 1.1 # [95] SPACE..TILDE
007F..009F ; 1.1 # [33] <control-007F>..<control-009F>
00A0..00AC ; 1.1 # [13] NO-BREAK SPACE..NOT SIGN
00AD ; 1.1 # SOFT HYPHEN
00AE..01F5 ; 1.1 # [328] REGISTERED SIGN..LATIN SMALL LETTER G WITH ACUTE
01FA..0217 ; 1.1 # [30] LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE..LATIN SMALL LETTER U WITH INVERTED BREVE
0250..02A8 ; 1.1 # [89] LATIN SMALL LETTER TURNED A..LATIN SMALL LETTER TC DIGRAPH WITH CURL
02B0..02DE ; 1.1 # [47] MODIFIER LETTER SMALL H..MODIFIER LETTER RHOTIC HOOK
02E0..02E9 ; 1.1 # [10] MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER EXTRA-LOW TONE BAR
0300..0345 ; 1.1 # [70] COMBINING GRAVE ACCENT..COMBINING GREEK YPOGEGRAMMENI
0360..0361 ; 1.1 # [2] COMBINING DOUBLE TILDE..COMBINING DOUBLE INVERTED BREVE
0374..0375 ; 1.1 # [2] GREEK NUMERAL SIGN..GREEK LOWER NUMERAL SIGN
037A ; 1.1 # GREEK YPOGEGRAMMENI
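The slice indices 51:70 depend on how long the header of the file happens to be, so they may not point at the same lines in another release of the file. A minimal sketch of an alternative, assuming the group header keeps the `# Age=V1_1` form shown above. The helper name `lines_after_header` and the first line of `sample` are hypothetical, not part of the original notebook:

```python
def lines_after_header(lines, header, n=5):
    """Return up to n lines, starting at the first line equal to header."""
    for i, line in enumerate(lines):
        if line.strip() == header:
            return lines[i:i + n]
    return []  # header not found


# A tiny stand-in for filestream; the first entry is a placeholder comment.
sample = [
    "# placeholder header comment",
    "",
    "# Age=V1_1",
    "# Assigned as of Unicode 1.1.0 (June, 1993)",
    "0000..001F ; 1.1 # [32] <control-0000>..<control-001F>",
]

for line in lines_after_header(sample, "# Age=V1_1", n=3):
    print(line)
```

With the real data, `lines_after_header(filestream, "# Age=V1_1", 19)` should show roughly the same slice as above, provided the header text is unchanged.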
Observations
The file lists the codepoints encoded in each version. It specifies the property Age, which is indicated both as a comment before each group of codepoints and in the data for every code point or range of codepoints:
037A ; 1.1 # GREEK YPOGEGRAMMENI
The above record tells us that the character U+037A GREEK YPOGEGRAMMENI was encoded in Unicode version 1.1.
The total number of code points is given as a comment after the list for each version:
# Total code points: 33979
However, the number of codepoints does not tell us about the number of characters or other entities. Such details are available in DerivedAge.txt, but we will need to obtain them programmatically from the data. First, let’s take a look at the format and contents of the file.
Description of the data format
A sample of data in DerivedAge.txt:
00A0..00AC ; 1.1 # [13] NO-BREAK SPACE..NOT SIGN
00AD ; 1.1 # SOFT HYPHEN
4E00..9FA5 ; 1.1 # [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
E000..F8FF ; 1.1 # [6400] <private-use-E000>..<private-use-F8FF>
As shown, DerivedAge.txt is a CSV-like flat file. There are two primary fields, separated by a ; (semicolon). The information after # may be considered a comment, but it can be separated into two further fields:
- field 0: codepoint or codepoint range for a character or group of contiguous characters
- field 1: version of Unicode in which the character(s) was/were encoded
- field 2: number of codepoints in a range, in square brackets; absent for a single codepoint
- field 3: name of the item, or names of the first and last items in a contiguous group; character names are UPPERCASE, while other entities are labelled in lowercase within < … > delimiters
The format is generally uniform, apart from inconsistencies in the spacing of the fields after #. Field 2 is allotted seven columns between the # and field 3. In most cases, the […] fits within the 7 columns. When the codepoint count is present, there is at least 1 space between the # and […], and 1 space between […] and field 3. If the […] is too wide to fit, the starting column of the character names is shifted right.
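To make the field layout concrete, here is a minimal sketch that splits one record using plain string methods rather than a regex. It assumes the `range ; version # [count] names` layout described above; the function name `split_record` is hypothetical.

```python
def split_record(line):
    """Split one data line into (code range, version, count, names).

    count is an int, or None when the line has no [..] field.
    """
    data, _, comment = line.partition('#')
    coderange, _, version = data.partition(';')
    comment = comment.strip()
    count = None
    if comment.startswith('['):
        bracket, _, comment = comment.partition(']')
        count = int(bracket[1:])  # drop the leading '['
    return coderange.strip(), version.strip(), count, comment.strip()


print(split_record("00A0..00AC ; 1.1 # [13] NO-BREAK SPACE..NOT SIGN"))
print(split_record("00AD ; 1.1 # SOFT HYPHEN"))
```

Because every field is stripped, this approach does not care how many spaces pad the columns.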
Define regex for extracting fields
The fields can be extracted using a regex:
^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)
  ^^^^^^^        ^^          ^^^^^^^^^      ^^
     1           2               3          4
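Four capture groups are defined:
- ([\w\.]*) for the codepoint range or single code point
- (.*) for the version number
- (\d*) for the codepoint count; it sits inside (?:\[…\])?, an optional non-capturing group (not a lookahead), so that lines for a single code point, which have no […], still match
- (.*) for the character name(s)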
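As a quick check of the four groups, the sketch below applies the same pattern to two sample records. The lines are padded with several spaces, which is how I believe the published file lays them out; that spacing is an assumption. The names `RECORD`, `samples` and `results` are only for illustration.

```python
import re

# The same pattern as above, written as a raw string.
RECORD = re.compile(r'^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)')

samples = [
    "4E00..9FA5    ; 1.1 # [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5",
    "00AD          ; 1.1 #       SOFT HYPHEN",
]

results = [RECORD.search(line).groups() for line in samples]
for groups in results:
    # For a single code point, the third group (the count) is None.
    print(groups)
```

Note that the pattern has two `\s+` around the optional count, so a single-code-point line needs at least two whitespace characters after the `#`. A line whose padding has been collapsed to a single space, such as `00AD ; 1.1 # SOFT HYPHEN`, would not match.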
Wrangle the data
Now wrangle the data from its native format into a more usable container: derivedage will be a list of lists with a structure of:
[
['version number', 'code point range', 'count', 'character names']
]
Also, let’s use a dictionary to capture the Age property values that have been explicitly specified. These values can be used for validating the data that we will derive from the file.
derivedage = []
derivedage_properties = {}
Now we’ll read the file line by line, and extract information from lines:
for line in filestream:
    # use the regex defined above:
    m = re.search(r'^([\w\.]*)\s*\; (.*) \#\s+(?:\[(\d*)\])?\s+(.*)', line)
    if m:
        m_coderange = m.group(1)
        m_version = m.group(2)
        # m.group(3) is None if the optional (?:\[(\d*)\])? segment did not match
        if m.group(3) is None:
            m_count = '1'
        else:
            m_count = m.group(3)
        m_charname = m.group(4)
        derivedage.append([m_version, m_coderange, m_count, m_charname])
    # now capture the age property details:
    age_property_match = re.search(r'\# Age=V(.*)', line)
    age_count_match = re.search(r'\# Total code points: (.*)', line)
    if age_property_match:
        age_property = age_property_match.group(1)
        # 'V1_1' in the file corresponds to version '1.1'
        age_property = age_property.replace('_', '.')
        derivedage_properties.update({age_property : 0})
    if age_count_match:
        age_count = age_count_match.group(1)
        derivedage_properties[age_property] = int(age_count)
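The Age bookkeeping can be tried in isolation on a few comment lines. This is a minimal sketch using the comment formats quoted earlier in this post; the names `sample`, `totals` and `current` are hypothetical, and the counts are the ones reported for versions 1.1 and 2.1. Unlike the loop above, it ignores a total that appears before any Age header.

```python
import re

sample = [
    "# Age=V1_1",
    "# Total code points: 33979",
    "# Age=V2_1",
    "# Total code points: 2",
]

totals = {}
current = None  # version of the group currently being read
for line in sample:
    age = re.search(r'# Age=V(\d+)_(\d+)', line)
    total = re.search(r'# Total code points: (\d+)', line)
    if age:
        current = '.'.join(age.groups())  # ('1', '1') -> '1.1'
        totals[current] = 0
    elif total and current is not None:
        totals[current] = int(total.group(1))

print(totals)
```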
Verify derivedage:
[['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
['1.1', '0020..007E', '95', 'SPACE..TILDE'],
['1.1', '007F..009F', '33', '<control-007F>..<control-009F>'],
['1.1', '00A0..00AC', '13', 'NO-BREAK SPACE..NOT SIGN'],
['1.1', '00AD', '1', 'SOFT HYPHEN'],
['1.1',
'00AE..01F5',
'328',
'REGISTERED SIGN..LATIN SMALL LETTER G WITH ACUTE'],
['1.1',
'01FA..0217',
'30',
'LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE..LATIN SMALL LETTER U WITH INVERTED BREVE'],
['1.1',
'0250..02A8',
'89',
'LATIN SMALL LETTER TURNED A..LATIN SMALL LETTER TC DIGRAPH WITH CURL'],
['1.1',
'02B0..02DE',
'47',
'MODIFIER LETTER SMALL H..MODIFIER LETTER RHOTIC HOOK'],
['1.1',
'02E0..02E9',
'10',
'MODIFIER LETTER SMALL GAMMA..MODIFIER LETTER EXTRA-LOW TONE BAR']]
Verify derivedage_properties:
{'1.1': 33979,
'2.0': 144521,
'2.1': 2,
'3.0': 10307,
'3.1': 44978,
'3.2': 1016,
'4.0': 1226,
'4.1': 1273,
'5.0': 1369,
'5.1': 1624,
'5.2': 6648,
'6.0': 2088,
'6.1': 732,
'6.2': 1,
'6.3': 5,
'7.0': 2834,
'8.0': 7716,
'9.0': 7500}
Process the data
Define dictionaries for capturing the data:
DA_codepoints = {}
DA_chars = {}
DA_other = {}
DA_entities = {}
Now iterate over the derivedage list of lists:
for entry in derivedage:
    version = entry[0]
    coderange = entry[1].split('..')
    count = entry[2]
    charname = entry[3]
    # 1. get codepoint range. if len(coderange) == 2, get the number of
    #    codepoints in the range by subtracting coderange[0] from
    #    coderange[1] and adding 1. the values in these lists are string
    #    representations of hex codes, so convert them to integers:
    if len(coderange) == 2:
        range_length = int(coderange[1], 16) - int(coderange[0], 16) + 1
    elif len(coderange) == 1:
        range_length = 1
    # 2. store number of total codepoints per version:
    if version in DA_codepoints:
        DA_codepoints[version] += range_length
    else:
        DA_codepoints[version] = range_length
    # 3. store number of characters and other codepoints.
    #    check for codepoint types. entries that are not characters are
    #    identified in DerivedAge.txt using descriptions in <..> brackets
    nonchar_match = re.search(r'^\<', charname)
    # if entry is not a character:
    if nonchar_match:
        # keep a record of the number of other codepoints per version:
        if version in DA_other:
            DA_other[version] += range_length
        else:
            DA_other[version] = range_length
    # if entry is a character:
    else:
        # keep a record of the number of characters per version:
        if version in DA_chars:
            DA_chars[version] += range_length
        else:
            DA_chars[version] = range_length
    # 4. store number of codepoint types per version:
    nonchar_type_match = re.search(r'^\<(\w*)\-', charname)
    if nonchar_type_match:
        entity_type = nonchar_type_match.group(1)
    else:
        entity_type = 'character'
    if version in DA_entities:
        if entity_type in DA_entities[version]:
            DA_entities[version][entity_type] += range_length
        else:
            DA_entities[version].update({entity_type : range_length})
    else:
        DA_entities[version] = {entity_type : range_length}
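The same counting can be written more compactly with `collections.Counter`, which removes the repeated "if key in dict" checks. This is only a sketch of an alternative, not the code used to produce the results below; the helper names `range_length` and `entity_type` and the three-entry `entries` list are assumptions for illustration.

```python
from collections import Counter, defaultdict


def range_length(coderange):
    """Number of code points in 'XXXX' or 'XXXX..YYYY' (hex strings)."""
    first, _, last = coderange.partition('..')
    return int(last or first, 16) - int(first, 16) + 1


def entity_type(name):
    """'<control-0000>..<control-001F>' -> 'control'; otherwise 'character'."""
    if name.startswith('<') and '-' in name:
        return name[1:name.index('-')]
    return 'character'


# A few entries in the same shape as the derivedage list.
entries = [
    ['1.1', '0000..001F', '32', '<control-0000>..<control-001F>'],
    ['1.1', '0020..007E', '95', 'SPACE..TILDE'],
    ['1.1', '00AD', '1', 'SOFT HYPHEN'],
]

by_type = defaultdict(Counter)  # version -> Counter of entity types
for version, coderange, _count, name in entries:
    by_type[version][entity_type(name)] += range_length(coderange)

print(dict(by_type['1.1']))
```

The computed range lengths (32, 95 and 1) agree with the counts given in the third column of each entry.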
Review generated data
**Number of new codepoints by version:**
{'1.1': 33979,
'2.0': 144521,
'2.1': 2,
'3.0': 10307,
'3.1': 44978,
'3.2': 1016,
'4.0': 1226,
'4.1': 1273,
'5.0': 1369,
'5.1': 1624,
'5.2': 6648,
'6.0': 2088,
'6.1': 732,
'6.2': 1,
'6.3': 5,
'7.0': 2834,
'8.0': 7716,
'9.0': 7500}
Does our codepoint count match the Age property values explicitly given in DerivedAge.txt?
DA_codepoints == derivedage_properties
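A plain equality test only says whether the two dictionaries agree. If they ever differ, a small helper can show where. The sketch below uses a hypothetical function `count_mismatches` and made-up input in which one declared total is deliberately wrong:

```python
def count_mismatches(derived, declared):
    """Return {version: (derived, declared)} for versions whose counts differ."""
    versions = set(derived) | set(declared)
    return {
        v: (derived.get(v), declared.get(v))
        for v in versions
        if derived.get(v) != declared.get(v)
    }


derived = {'1.1': 33979, '2.1': 2}
declared = {'1.1': 33979, '2.1': 3}  # deliberately wrong, for illustration
print(count_mismatches(derived, declared))
```

With the real data, `count_mismatches(DA_codepoints, derivedage_properties)` returns an empty dictionary exactly when the comparison above is True.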
**Number of new characters per version:**
{'1.1': 27512,
'2.0': 11373,
'2.1': 2,
'3.0': 10307,
'3.1': 44946,
'3.2': 1016,
'4.0': 1226,
'4.1': 1273,
'5.0': 1369,
'5.1': 1624,
'5.2': 6648,
'6.0': 2088,
'6.1': 732,
'6.2': 1,
'6.3': 5,
'7.0': 2834,
'8.0': 7716,
'9.0': 7500}
**Number of new non-character codepoints (control, private-use, surrogate, noncharacter) per version:**
{'1.1': 6467, '2.0': 133148, '3.1': 32}
**Breakdown of codepoint types by version:**
{'1.1': {'character': 27512,
'control': 65,
'noncharacter': 2,
'private': 6400},
'2.0': {'character': 11373,
'noncharacter': 32,
'private': 131068,
'surrogate': 2048},
'2.1': {'character': 2},
'3.0': {'character': 10307},
'3.1': {'character': 44946, 'noncharacter': 32},
'3.2': {'character': 1016},
'4.0': {'character': 1226},
'4.1': {'character': 1273},
'5.0': {'character': 1369},
'5.1': {'character': 1624},
'5.2': {'character': 6648},
'6.0': {'character': 2088},
'6.1': {'character': 732},
'6.2': {'character': 1},
'6.3': {'character': 5},
'7.0': {'character': 2834},
'8.0': {'character': 7716},
'9.0': {'character': 7500}}
Present the data
We will use the PrettyTable class from the prettytable module to present the data as text tables.
New codepoints and codepoint types by version
x = PrettyTable(['version', 'new codepoints', 'new chars', 'new other'])
for k, v in DA_codepoints.items():
    x.add_row([k, v, DA_chars.get(k, '-'), DA_other.get(k, '-')])
print(x)
+---------+----------------+-----------+-----------+
| version | new codepoints | new chars | new other |
+---------+----------------+-----------+-----------+
| 1.1 | 33979 | 27512 | 6467 |
| 2.0 | 144521 | 11373 | 133148 |
| 2.1 | 2 | 2 | - |
| 3.0 | 10307 | 10307 | - |
| 3.1 | 44978 | 44946 | 32 |
| 3.2 | 1016 | 1016 | - |
| 4.0 | 1226 | 1226 | - |
| 4.1 | 1273 | 1273 | - |
| 5.0 | 1369 | 1369 | - |
| 5.1 | 1624 | 1624 | - |
| 5.2 | 6648 | 6648 | - |
| 6.0 | 2088 | 2088 | - |
| 6.1 | 732 | 732 | - |
| 6.2 | 1 | 1 | - |
| 6.3 | 5 | 5 | - |
| 7.0 | 2834 | 2834 | - |
| 8.0 | 7716 | 7716 | - |
| 9.0 | 7500 | 7500 | - |
+---------+----------------+-----------+-----------+
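The rows appear in the order in which versions were inserted into the dictionary, which here follows the order of the file. If the order ever needs to be enforced, note that sorting version strings alphabetically would misplace a two-digit major version. A minimal sketch, where `version_key` is a hypothetical helper and '10.0' is a made-up version that is not in the data used here:

```python
def version_key(version):
    """'6.3' -> (6, 3), so that versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


versions = ['9.0', '10.0', '1.1', '6.3']
print(sorted(versions))                   # alphabetical: '10.0' lands before '6.3'
print(sorted(versions, key=version_key))  # numeric: '10.0' comes last
```

The table loop could then iterate over `sorted(DA_codepoints, key=version_key)` instead of `DA_codepoints.items()`.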
Growth of codepoints, characters, and other entities by version
Declare three lists and three running counters:
DA_char_totals = []
DA_other_totals = []
DA_codepoint_totals = []
count_codepoints = 0
count_chars = 0
count_other = 0
z = PrettyTable(['version', 'total codepoints', 'total chars', 'total other'])
# for each version in DA_codepoints, get the corresponding values
# from DA_chars and DA_other. Produce running count of the values
# for each version.
for k, v in DA_codepoints.items():
    count_codepoints += DA_codepoints.get(k, 0)
    count_chars += DA_chars.get(k, 0)
    count_other += DA_other.get(k, 0)
    DA_char_totals.append([k, DA_chars.get(k, 0), count_chars])
    DA_other_totals.append([k, DA_other.get(k, 0), count_other])
    DA_codepoint_totals.append([k, DA_codepoints.get(k, 0), count_codepoints])
    z.add_row([k, count_codepoints, count_chars, count_other])
print(z)
+---------+------------------+-------------+-------------+
| version | total codepoints | total chars | total other |
+---------+------------------+-------------+-------------+
| 1.1 | 33979 | 27512 | 6467 |
| 2.0 | 178500 | 38885 | 139615 |
| 2.1 | 178502 | 38887 | 139615 |
| 3.0 | 188809 | 49194 | 139615 |
| 3.1 | 233787 | 94140 | 139647 |
| 3.2 | 234803 | 95156 | 139647 |
| 4.0 | 236029 | 96382 | 139647 |
| 4.1 | 237302 | 97655 | 139647 |
| 5.0 | 238671 | 99024 | 139647 |
| 5.1 | 240295 | 100648 | 139647 |
| 5.2 | 246943 | 107296 | 139647 |
| 6.0 | 249031 | 109384 | 139647 |
| 6.1 | 249763 | 110116 | 139647 |
| 6.2 | 249764 | 110117 | 139647 |
| 6.3 | 249769 | 110122 | 139647 |
| 7.0 | 252603 | 112956 | 139647 |
| 8.0 | 260319 | 120672 | 139647 |
| 9.0 | 267819 | 128172 | 139647 |
+---------+------------------+-------------+-------------+
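The running totals can also be produced with `itertools.accumulate`. This is a sketch of an alternative to the manual counters, using the first four per-version codepoint counts from the table of new codepoints; the variable names are hypothetical.

```python
from itertools import accumulate

# New codepoints per version, in version order (first four rows only).
new_codepoints = {'1.1': 33979, '2.0': 144521, '2.1': 2, '3.0': 10307}

# accumulate() yields the running sum; zip pairs each sum with its version.
running = dict(zip(new_codepoints, accumulate(new_codepoints.values())))
print(running)
# {'1.1': 33979, '2.0': 178500, '2.1': 178502, '3.0': 188809}
```

The values agree with the 'total codepoints' column for the same versions.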
z = PrettyTable(['version', 'new codepoints', 'total codepoints'])
for e in DA_codepoint_totals:
    z.add_row(e)
print(z)
+---------+----------------+------------------+
| version | new codepoints | total codepoints |
+---------+----------------+------------------+
| 1.1 | 33979 | 33979 |
| 2.0 | 144521 | 178500 |
| 2.1 | 2 | 178502 |
| 3.0 | 10307 | 188809 |
| 3.1 | 44978 | 233787 |
| 3.2 | 1016 | 234803 |
| 4.0 | 1226 | 236029 |
| 4.1 | 1273 | 237302 |
| 5.0 | 1369 | 238671 |
| 5.1 | 1624 | 240295 |
| 5.2 | 6648 | 246943 |
| 6.0 | 2088 | 249031 |
| 6.1 | 732 | 249763 |
| 6.2 | 1 | 249764 |
| 6.3 | 5 | 249769 |
| 7.0 | 2834 | 252603 |
| 8.0 | 7716 | 260319 |
| 9.0 | 7500 | 267819 |
+---------+----------------+------------------+
z = PrettyTable(['version', 'new chars', 'total chars'])
for e in DA_char_totals:
    z.add_row(e)
print(z)
+---------+-----------+-------------+
| version | new chars | total chars |
+---------+-----------+-------------+
| 1.1 | 27512 | 27512 |
| 2.0 | 11373 | 38885 |
| 2.1 | 2 | 38887 |
| 3.0 | 10307 | 49194 |
| 3.1 | 44946 | 94140 |
| 3.2 | 1016 | 95156 |
| 4.0 | 1226 | 96382 |
| 4.1 | 1273 | 97655 |
| 5.0 | 1369 | 99024 |
| 5.1 | 1624 | 100648 |
| 5.2 | 6648 | 107296 |
| 6.0 | 2088 | 109384 |
| 6.1 | 732 | 110116 |
| 6.2 | 1 | 110117 |
| 6.3 | 5 | 110122 |
| 7.0 | 2834 | 112956 |
| 8.0 | 7716 | 120672 |
| 9.0 | 7500 | 128172 |
+---------+-----------+-------------+
z = PrettyTable(['version', 'new other', 'total other'])
for e in DA_other_totals:
    z.add_row(e)
print(z)
+---------+-----------+-------------+
| version | new other | total other |
+---------+-----------+-------------+
| 1.1 | 6467 | 6467 |
| 2.0 | 133148 | 139615 |
| 2.1 | 0 | 139615 |
| 3.0 | 0 | 139615 |
| 3.1 | 32 | 139647 |
| 3.2 | 0 | 139647 |
| 4.0 | 0 | 139647 |
| 4.1 | 0 | 139647 |
| 5.0 | 0 | 139647 |
| 5.1 | 0 | 139647 |
| 5.2 | 0 | 139647 |
| 6.0 | 0 | 139647 |
| 6.1 | 0 | 139647 |
| 6.2 | 0 | 139647 |
| 6.3 | 0 | 139647 |
| 7.0 | 0 | 139647 |
| 8.0 | 0 | 139647 |
| 9.0 | 0 | 139647 |
+---------+-----------+-------------+
Commentary
This exercise helped us to analyze the number of codepoints added to Unicode over time using DerivedAge.txt. The data provides a high-level overview of the number and types of codepoints that have been assigned. The UCD contains other data that provide more details on the characters, noncharacters, and other entities in the standard. I will explore these files in subsequent posts in order to expand our understanding of Unicode data through the Unicode Character Database.