ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION 
Due Friday, May 1, 2020, 11:59 PM 
1. Introduction 
For this assignment, we will write a simple file compression utility for text files implementing the Huffman 
code algorithm. This utility reads a text file, analyzes the frequency distribution of characters in the file, 
and calculates an optimal symbol code (Huffman code) for that distribution. It then writes a binary file 
encoded in that Huffman code, and an auxiliary file containing the Python representation of the code in the 
form of a dictionary so that it can be decoded later. 
2. Instructions 
2.1. Code and submission. Create a Python module called compress.py and upload it to the assignment 
folder on D2L. You will need to do the following imports: 
from bitstring import Bits 
import numpy as np 
and then implement the functions described in the specifications below. 
You might have to install the bitstring library if you haven’t yet. This can be installed through pip or 
conda. 
Documentation for the Bits class can be found at https://pythonhosted.org/bitstring/constbitarray. 
html 
2.2. Documentation. Your script must contain a header docstring containing your name, ISTA 311, the 
date, and a brief description of the module. Each function must contain a docstring. Each function docstring 
should include a description of the function’s purpose, the name, type, and purpose of each parameter, and 
the type and meaning of the function’s return value. 
2.3. Testing and required files. A test script, test compress.py, will be available on D2L along with 
some auxiliary text and data files necessary to run the tests. 
2.4. Grading. Your submission will be graded by running it through the test script, running some tests on 
other files, and examining your code. Code should be clear and concise, but you will only lose points for 
style if your code is a real mess. Include inline comments to explain tricky lines and summarize sections of 
code. 
3. Function specifications 
Implement the following functions: 
• get frequency dict: This function takes a single string argument and returns a dictionary whose 
keys are the characters appearing in the string, and whose values are the normalized frequencies 
(number of times each character appears, divided by the total number of characters in the string). 
1 
2 ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION 
• dict entropy: This function takes a frequency dictionary and returns the entropy of the probability 
distribution. Recall that the entropy is computed by the formula: 
H(X) = − 
∑ 
i 
P (xi) log2 P (xi) 
where xi are the individual outcomes. Here that means the characters in the dictionary. 
• huffman code: This takes a frequency dictionary and returns a Huffman code dictionary, where the 
keys are the characters appearing in the string and the values are the code words (strings of 0s 
and 1s) assigned to each character. (This may be simpler with some helper functions; see the tree 
algorithm in the Jupyter notebook encoding trees filled.ipynb, for instance.) 
• invert dict: This function takes a dictionary representing an encoding, with characters mapping 
to code words, and reverses it so that it maps code words to characters. (This is useful in decoding.) 
• encode string: This function takes two arguments: a string and a dictionary representing an 
encoding. Encode the string into a string of 0s and 1s and return it. 
• decode string: This function takes two arguments: a string of 0s and 1s and a dictionary repre- 
senting a reversed encoding. Use the dictionary to decode the string. 
• pad bitstring: Remember that bit strings we write to a file must be a whole number of bytes (i.e. a 
multiple of 8 bits.) This function takes a string of 1s and 0s and appends the number of 0s necessary 
onto the end of the string to make the length of the string a multiple of 8. Then it prepends onto 
the string an 8-bit bitstring representation of that number of 1s and 0s. Then return the padded 
string. 
• write encoded file: This function takes a string as its only argument, and does the following: 
– Passes the string to get frequency dict to get a frequency dictionary; 
– Finds the Huffman code for that frequency dictionary; 
– Converts the contents of the file into a bit string using the Huffman code, pads it using 
pad bitstring, and writes it to a (binary) file. The filename of the binary file should be 
the original filename with ’.compressed’ replacing the .txt extension; 
– Writes (in plain text) a representation of the Huffman code dictionary. Use repr to get a string 
representation of the dictionary. The filename of the text file should be the original filename 
with .code replacing the .txt extension. 
After all this, the function should return [file len, dict len], where file len is the length of 
the compressed file in bytes (which is the length of your bit string divided by 8) and dict len is the 
length of the string representation of the dictionary. 
• decode encoded file: This function takes a filename stem (a filename without an extension). Then 
do the following: 
– Open the file filename.compressed in binary read mode and read it to a Bits object. 
– Open the file filename.code in text read mode, read it to a string, and call eval() on that 
string to obtain the encoding dictionary; reverse it to get a decoding dictionary. 
– Access the bin attribute of the Bits object to obtain a string of 0s and 1s. 
– Interpret the first 8 bits in the bit string as an integer to learn how many digits of zero padding 
are on the end of the bit string. 
– Strip the initial 8 bits and any zero padding from the end, and pass the resulting bit string to 
decode string along with the decoding dictionary, then return the resulting string. 
• main: Finally, write a main function. The main function should: 
– Open the file saguaros.txt, reads the text, and passes the text to write encoded file to 
write a compressed version of the file. 
– Print the length of the file saguaros.txt in bytes (which is the same as the length of the string 
you read from it). Also print the entropy of the frequency dictionary, length of the compressed 
file in bytes and the length of the dictionary, formatting the output like this: 
Length of input file: xxxx 
Input file entropy: xxxx 
Length of compressed file: xxxx 
Length of encoding dict: xxxx 
where the xxxx are the various numerical quantities. 
ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION 3 
– Pass the filename stem poem to decode encoded file. Take the string that is returned and 
write it to a file called poem.txt. 
Hint: The code necessary for encode string, decode string, pad bitstring, invert dict, 
get frequency dict, and part of write encoded file can be found in the Jupyter notebook 
encoding trees filled.ipynb on D2L.