Visualizing the growth of Unicode using matplotlib

Posted on Mon 30 January 2017 in articles

In a previous post, I presented information about the growth of Unicode in terms of the number of codepoints assigned in each version. The data was displayed as text tables using the PrettyTable package. As we are visual beings, I think it would be useful to also present that data using charts. Here I use matplotlib and the same data sets to produce such visualizations.

Prerequisites

Recall the lists of lists we produced: DA_codepoint_totals, DA_char_totals, and DA_other_totals:

Growth of codepoints by version:

DA_codepoint_totals
[['1.1', 33979, 33979],
 ['2.0', 144521, 178500],
 ['2.1', 2, 178502],
 ['3.0', 10307, 188809],
 ['3.1', 44978, 233787],
 ['3.2', 1016, 234803],
 ['4.0', 1226, 236029],
 ['4.1', 1273, 237302],
 ['5.0', 1369, 238671],
 ['5.1', 1624, 240295],
 ['5.2', 6648, 246943],
 ['6.0', 2088, 249031],
 ['6.1', 732, 249763],
 ['6.2', 1, 249764],
 ['6.3', 5, 249769],
 ['7.0', 2834, 252603],
 ['8.0', 7716, 260319],
 ['9.0', 7500, 267819]]

Growth of characters per version:

DA_char_totals
[['1.1', 27512, 27512],
 ['2.0', 11373, 38885],
 ['2.1', 2, 38887],
 ['3.0', 10307, 49194],
 ['3.1', 44946, 94140],
 ['3.2', 1016, 95156],
 ['4.0', 1226, 96382],
 ['4.1', 1273, 97655],
 ['5.0', 1369, 99024],
 ['5.1', 1624, 100648],
 ['5.2', 6648, 107296],
 ['6.0', 2088, 109384],
 ['6.1', 732, 110116],
 ['6.2', 1, 110117],
 ['6.3', 5, 110122],
 ['7.0', 2834, 112956],
 ['8.0', 7716, 120672],
 ['9.0', 7500, 128172]]

Growth of noncharacters per version:

DA_other_totals
[['1.1', 6467, 6467],
 ['2.0', 133148, 139615],
 ['2.1', 0, 139615],
 ['3.0', 0, 139615],
 ['3.1', 32, 139647],
 ['3.2', 0, 139647],
 ['4.0', 0, 139647],
 ['4.1', 0, 139647],
 ['5.0', 0, 139647],
 ['5.1', 0, 139647],
 ['5.2', 0, 139647],
 ['6.0', 0, 139647],
 ['6.1', 0, 139647],
 ['6.2', 0, 139647],
 ['6.3', 0, 139647],
 ['7.0', 0, 139647],
 ['8.0', 0, 139647],
 ['9.0', 0, 139647]]
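
The three lists appear to be related: for every version, the codepoint figures are the sum of the character figures and the other-entity figures. The following is only a quick sanity check on three rows copied from the lists above (the names `codepoints`, `chars`, `others`, and `checks` are illustrative and not part of the original notebook):

```python
# Rows for versions 1.1, 2.0 and 3.1, copied from the three lists above
codepoints = [['1.1', 33979, 33979], ['2.0', 144521, 178500], ['3.1', 44978, 233787]]
chars = [['1.1', 27512, 27512], ['2.0', 11373, 38885], ['3.1', 44946, 94140]]
others = [['1.1', 6467, 6467], ['2.0', 133148, 139615], ['3.1', 32, 139647]]

checks = []
for cp, ch, ot in zip(codepoints, chars, others):
    # new codepoints should equal new characters + new other entities
    new_ok = cp[1] == ch[1] + ot[1]
    # the cumulative totals should add up the same way
    total_ok = cp[2] == ch[2] + ot[2]
    checks.append((cp[0], new_ok, total_ok))

for version, new_ok, total_ok in checks:
    print(version, new_ok, total_ok)
```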

Approach

Import the Python modules needed for this task:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn

Use Jupyter magic to show plots within the notebook:

%matplotlib inline

Plot line graphs

The x axis uses the positions produced by np.arange over the number of versions, not the version numbers themselves; the version numbers are applied afterwards as tick labels. The y axis is the cumulative number of characters. To get the values for both axes, convert the list of lists to a numpy array, then read columns from the array: x labels from column 0, y values from column 2. Use a numpy slice such as [:,0] to read the array by column, and convert the y values to type int when slicing.
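
As a minimal sketch of the slicing step, here are the first three rows of DA_char_totals run through the same conversion (the names `sample`, `arr`, `versions`, `totals`, and `positions` are illustrative). Because each row mixes a string with integers, numpy stores every element as a string, which is why the y values need `.astype(int)`. This assumes Python 3.

```python
import numpy as np

# First three rows of DA_char_totals, copied from the data above
sample = [['1.1', 27512, 27512],
          ['2.0', 11373, 38885],
          ['2.1', 2, 38887]]

arr = np.array(sample)

# Mixed str/int rows are coerced to a single Unicode string dtype
print(arr.dtype.kind)

versions = arr[:, 0]                   # column 0: version labels, left as strings
totals = arr[:, 2].astype(int)         # column 2: cumulative totals, converted to int
positions = np.arange(len(versions))   # 0, 1, 2: the actual x coordinates

print(versions.tolist(), totals.tolist(), positions.tolist())
```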

Line graph: number of characters per version

char_array = np.array(DA_char_totals)
fig, ax = plt.subplots()

plt.title('Number of characters encoded in Unicode')
plt.ylabel('Characters')
plt.xlabel('Versions')

x_values = char_array[:,0]
x_range = np.arange(len(x_values))
y_values = char_array[:,2].astype(int)

ax.set_xticks(x_range)
ax.set_xticklabels(x_values, rotation=45)

ax.plot(x_range, y_values)
[<matplotlib.lines.Line2D at 0x973c610>]

image

As we will be plotting two more graphs of the same type and using the same type of container, it is practical to create a function for the above plot type:

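def plot_line_graph(list_, labels_):
    """Plot the cumulative totals (column 2) of list_ per version (column 0).

    labels_ holds three strings: plot title, y-axis label, x-axis label.
    """

    # process parameters
    char_array = np.array(list_)
    plt_title, plt_ylabel, plt_xlabel = labels_

    # set up plot
    fig, ax = plt.subplots()

    plt.title(plt_title)
    plt.ylabel(plt_ylabel)
    plt.xlabel(plt_xlabel)

    # x positions are 0..n-1; the version strings become the tick labels
    x_values = char_array[:,0]
    x_range = np.arange(len(x_values))
    y_values = char_array[:,2].astype(int)

    ax.set_xticks(x_range)
    ax.set_xticklabels(x_values, rotation=45)

    ax.plot(x_range, y_values)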

Now we can pass a container and a list of labels to the function to generate additional plots:

labels = ['Number of characters encoded in Unicode', 'Characters', 'Versions']
plot_line_graph(DA_char_totals, labels)

image

Line graph: number of other entities per version

labels = ['Number of other entities encoded in Unicode', 'Entities', 'Versions']
plot_line_graph(DA_other_totals, labels)

image

Line graph: number of codepoints assigned per version

labels = ['Number of codepoints encoded in Unicode', 'Codepoints', 'Versions']
plot_line_graph(DA_codepoint_totals, labels)

image

Plot stacked bar graphs

We can also plot the growth in the number of characters using a stacked bar graph. Each bar is divided into segments that show the smaller units making up the whole value.

As before, the x axis uses the positions from np.arange over the number of versions, not the version numbers themselves; we display the actual version numbers by modifying the tick labels of the plot. Our stacked bar requires two y values: the first (bottom) is the number of new characters in a version; the second (top) is the total carried over from the previous version, that is, the total characters for the given version minus the new characters added.

We’ll convert the list of lists to a numpy array, then read columns from the array using a numpy slice, e.g. [:,0].

Bar graph: growth of number of characters per version

char_array = np.array(DA_char_totals)

Get the difference between the total characters and the new characters for each version. This is expressed as follows:

diff_array = char_array[:,2].astype(int) - char_array[:,1].astype(int)
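
To see what this difference represents, the sketch below applies the same subtraction to the first four rows of DA_char_totals and checks that each result equals the cumulative total of the preceding version (the names `sample`, `new_chars`, `totals`, and `previous` are illustrative):

```python
import numpy as np

# First four rows of DA_char_totals, copied from the data above
sample = [['1.1', 27512, 27512],
          ['2.0', 11373, 38885],
          ['2.1', 2, 38887],
          ['3.0', 10307, 49194]]

arr = np.array(sample)
new_chars = arr[:, 1].astype(int)   # characters added in each version
totals = arr[:, 2].astype(int)      # cumulative total after each version

# Same subtraction as diff_array: total minus new characters
previous = totals - new_chars
print(previous.tolist())  # [0, 27512, 38885, 38887]

# From the second version on, the difference is the previous version's total
print(np.array_equal(previous[1:], totals[:-1]))
```

The first version has a difference of 0 because nothing precedes it, so its bar consists of the bottom segment only.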

Set up the plot:

fig, ax = plt.subplots()

ax.set_title('Characters per version')
ax.set_ylabel('Number of characters')
ax.set_xlabel('Versions')

# x positions are 0..n-1; the version strings become the tick labels
x_values = char_array[:,0]
x_range = np.arange(len(x_values))

# bottom segment: new characters; top segment: total from previous versions
y1_values = char_array[:,1].astype(int)
y2_values = diff_array

ax.xaxis.set_major_locator(ticker.FixedLocator(x_range))
ax.xaxis.set_major_formatter(ticker.FixedFormatter(x_values))
ax.set_xticklabels(x_values, rotation=45)

ax.bar(x_range, y1_values, width=1, label='New chars')
ax.bar(x_range, y2_values, width=1, bottom=y1_values, label='Chars from previous versions')

# show the bar labels in a legend
ax.legend()

image
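
Following the same reasoning that led to plot_line_graph, the stacked bar steps could also be wrapped in a function. The sketch below is one possible way to do that, not part of the original notebook: `plot_stacked_bar` is a hypothetical helper, the segment labels are my own wording, and the `Agg` backend is selected only so the code runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_stacked_bar(list_, labels_):
    """Hypothetical helper: stacked bars of new vs. previously encoded counts."""
    data = np.array(list_)
    plt_title, plt_ylabel, plt_xlabel = labels_

    x_values = data[:, 0]
    x_range = np.arange(len(x_values))
    new_values = data[:, 1].astype(int)
    # total for the version minus what the version added
    previous_values = data[:, 2].astype(int) - new_values

    fig, ax = plt.subplots()
    ax.set_title(plt_title)
    ax.set_ylabel(plt_ylabel)
    ax.set_xlabel(plt_xlabel)

    ax.set_xticks(x_range)
    ax.set_xticklabels(x_values, rotation=45)

    # bottom segment first, then the carried-over total stacked on top
    ax.bar(x_range, new_values, width=1, label='New')
    ax.bar(x_range, previous_values, width=1, bottom=new_values,
           label='From previous versions')
    ax.legend()
    return fig, ax


# Usage with the first three rows of DA_char_totals
sample = [['1.1', 27512, 27512],
          ['2.0', 11373, 38885],
          ['2.1', 2, 38887]]
labels = ['Characters per version', 'Number of characters', 'Versions']
fig, ax = plot_stacked_bar(sample, labels)
```

Returning `fig` and `ax` is a design choice that lets the caller adjust or save the figure afterwards; the same call should work with DA_codepoint_totals or DA_other_totals, given suitable labels.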

Commentary

This exercise helped us to convert our data from lists into graphs using matplotlib. I produced text tables in the previous exercise, but the visualizations give us another perspective on the growth of Unicode.