CS602 - Data-Driven Development with Python             Spring 2020                         Programming Assignment 5  
Programming Assignment 5  
Getting started  
Review class handouts, executing all examples shown in them in Eclipse, as well as creating and typing  
examples for functions that are listed, but not illustrated with an example. This is essential for your  
understanding of the Pandas package functionality required to complete this assignment. Complete the  
reading and practice assignments posted on the course schedule.    
Programming Project: Recommend           worth: 25 points  
Rating-based movie recommendation.  
Data and program overview  
In this assignment you will be working with data on movies and people’s ratings of these movies. The task  
will be to create movie recommendations for a person, based on the match between personal ratings and  
critics’ ratings of the movies.   
I provide two data sets for this assignment, in zip files called data.zip and data-tiny.zip. Download and unzip  
the files in your project folder. Unzipping should result in two folders added to your project folder: data and  
data-tiny.    
The following data is provided in csv files:     
• A table with movie information (IMDB.csv); we will call this the movies data.   
• A table with ratings of all movies listed in the movies data, by 100 critics (ratings.csv); let’s call  
this the critics data. The column names in the critics data correspond to the name of each critic.  
All ratings are integer numbers in the 1..10 range.  
• A table with one person’s ratings of a subset of the movies in the movies data set (pX.csv), the  
person data, where X is a number. The name of the person is provided as a column title in the file.  
Review the data files to familiarize yourself with their content and structure.    
The program that you write must work as follows.   
1. Ask the user to specify the subfolder in the current working directory, where the files are stored,  
along with the names of the movies, critics, and person data files.  
2. Determine and output the names of three critics, whose ratings of the movies are closest to the  
person’s ratings based on the Euclidean distance metric (as described later within definition of  
function findClosestCritics()).  
3. Use the ratings by the critics identified in item 2 to determine which movies to recommend:   
• The movie recommendations must be based on the average ratings of movies by the three  
critics identified in step 2 above, and consist of the highest rated movies in each movie  
genre1. (see also the definition of function recommendMovies()).  
1 Movie genre is determined by the Genre1 column of the movies data.  
CS602 - Data-Driven Development with Python             Spring 2020                         Programming Assignment 5  
2  
4. Display information about recommended movies as described below and illustrated by the sample  
interactions below.   
• Recommendations must be listed in alphabetical order by genre.   
• Missing data (e.g. running time) should not be included.  
The sample interactions below demonstrate the running of the program.   
Sample interactions  
First, let’s use the tiny data set. The interaction below uses personal data file tinyp.csv that contains  
movie ratings by a person named Kimberwick. Critics Aldbridge, Moon, Benris had the closest  
recommendations.  
Please enter the name of the folder with files, the name of movies file,  
the name of critics file,  the name of personal ratings file, separated by spaces:  
data-tiny  tinyIMDB.csv tinyratings.csv tinyp.csv  
The following critics had reviews closest to the person's:  
Aldbridge, Moon, Benris  
Recommendations for Kimberwick  
"127 Hours"  (Adventure), rating: 8.0, 2010, runs 94 min  
"50/50"      (Comedy), rating: 7.0, 2011, runs 100 min  
"About Time" (Comedy), rating: 7.0, 2013, runs 123 min  
The next interaction shows the output given the larger data set  
Please enter the name of the folder with files, the name of movies file,  
the name of critics file, the name of personal ratings file, separated by spaces:  
data  IMDB.csv ratings.csv   p8.csv   
The following critics had reviews closest to the person's:  
Quartermaine, Arvon, Merrison  
Recommendations for Catulpa:  
"Star Wars: The Force Awakens"    (Action), rating: 9.67, 2015, runs 136 min  
"The Grand Budapest Hotel"        (Adventure), rating: 9.0, 2014, runs 99 min  
"The Martian"                     (Adventure), rating: 9.0, 2015, runs 144 min  
"How to Train Your Dragon"        (Animation), rating: 9.67, 2010  
"Kubo and the Two Strings"        (Animation), rating: 9.67, 2016  
"Hacksaw Ridge"                   (Biography), rating: 9.33, 2016, runs 139 min  
"What We Do in the Shadows"       (Comedy), rating: 9.0, 2014  
"Prisoners"                       (Crime), rating: 8.33, 2013, runs 153 min  
"Spotlight"                       (Crime), rating: 8.33, 2015, runs 128 min  
"The Perks of Being a Wallflower" (Drama), rating: 9.67, 2012, runs 102 min  
"Shutter Island"                  (Mystery), rating: 8.33, 2010, runs 138 mi  
CS602 - Data-Driven Development with Python             Spring 2020                         Programming Assignment 5  
3  
Note that in the above interaction there are sometimes more than one movie listed per genre. As, for  
instance, is the case with the two Adventure movies, both of them had the highest average rating, hence  
both are included in the list.   
Important Notes and Requirements  
In addition to the requirements stated so far, your code must satisfy the following to gain full credit:  
• Your program should not use any global variables and should have no code outside of function  
definitions, except for a single call to main.   
• All file related operations should use device-independent handling of paths (use  os.getcwd()  and  
os.path.join() functions to create paths, instead of hardcoding them).  
• You must define and use functions specified below in the Required Functions section. You may and  
should define other methods as appropriate.  
• You should use the pandas data structures effectively to efficiently achieve the goals of your functions  
and programs.   
• The formatting of the recommendation printout should use the length of the longest movie in the list  
(which should be computed in the program) in formatting the output in a way that aligns categories.   
Required Functions  
You must define and use the following functions, plus define and use others as you see fit:   
a. Function findClosestCritics() which will be used to identify three critics, whose recommendations  
are closest to the person’s recommendations. The function should take two parameters of type  
DataFrame, the first one providing data about critics ratings,  and the second – about personal ratings.  
The function must return a list of three critics, whose ratings of movies are most similar to those  
provided in the personal ratings data, based on Euclidean distance.   
Euclidean distance of two vectors p = (p 1, p 2, … , p n) and c = (c1, c2, … ,cn) is computed as  
�(p1 − 1 )2  + (p2 − 2 )2  +⋯+ (p −  )2  . To compute how similar ratings of a critic are to the  
ratings of the person, we compute the distance between a vector, in which the coordinates (c1, c2, … ,cn)    
are the critic’s ratings of each movie, and the vector composed of the person’s ratings (p 1, p 2, … , p n).   
The lower the distance, the closer, thus more similar, the critic’s ratings are to the person’s.  
For example, if the personal data included three ratings (4, 7, 6), where the critic rated the same  
movies as (4, 5, 6), the Euclidean distance would equal �(4− 4)2  + (5 − 7)2  + (6− 6)2  = 2.0.  
H int: for this function, create a DataFrame with critics names as its columns, movie titles as its index,  
and data in each column  set to be the difference between the critic’s and the person’s score of each movie,  
squared. Then, calculate the sum of all column values and find the smallest three values using sorting.  
Return a list of the associated critics’ names.   
b. Function recommendMovies () which will be used to generate movie recommendations based on  
ratings by the chosen critics. The function must accept four parameters:  the critics and personal  
ratings data frames, the list of three critics most similar to the person, and the movie data frame. The  
function should determine out of the set of movies that are not rated in the personal data, but are rated  
by the critics, which movies have the highest average of the rating by the most similar critics in each  
movie genre (specified by the Genre1 column of movie data).   In other words, you need to compute  
CS602 - Data-Driven Development with Python             Spring 2020                         Programming Assignment 5  
4  
the top-rated unwatched movies in each genre category, based on the average of the three critics’  
ratings.   
You may assume that the critics data will always be complete, i.e. will include ratings of all movies.  
The function must then (a) put together information about these top-rated movies sufficient to produce  
the printout, showing the details of each of the recommended movies as illustrated by the interactions,  
and (b) return it using some data structure of your choice.   
H int: An easy way to generate a list of unwatched movies with all critics’ ratings is to merge the  
person data with the critics data and select the portion of it, which has missing values in the person’s  
column. After that, to find the highest rated movies per genre, first join the resulting file with the  
movies data to have full movie information. In this joint DataFrame, compute the average rating by the  
three critics (storing the average in a new column) and then select the highest rating in each genre  
(using groupby). After that, select those movies in each genre, that have the rating not lower than the  
highest in the genre.  
c. Function printRecommendations () with two parameters: the first containing information about the  
recommended movies, and the second – the name of the person, for whom the recommendation is made  
(the name is specified in the header of the personal ratings data file).  The function must produce a  
printout of all recommendations passed in via the first parameter, in alphabetical order by the genre, as  
shown in the sample interactions.  Make sure to examine the sample interactions and implement the  
details of the printout.  The function should return no value.  
d. Function main(), which will be called to start the program and works according to your design.  
More Hints  
• For some csv data, you may not need all of the columns. You can specify which columns to import  
into a data frame, or you could drop unnecessary ones to improve performance and simplify testing  
and debugging.  
• Keeping your data frames indexed by the title should help in making joins easy. Note that the title can  
be both an index and one of the columns, if necessary.  
• Make sure to inspect intermediate results via printing them or saving to files. To see the data frames  
completely when they are printed, you can include the following function calls in your program to set  
display parameters for pandas   
pd.set_option('display.max_columns', 1000)   
pd.set_option('display.max_rows', 1000)   
pd.set_option('display.width', 1000)   
• When you open and save the csv files in Excel, Excel may change the encoding used for file characters.  
To return the encoding to the UTF-8 standard that Python likes, open the file in Notepad, and when  
saving it, specify the encoding as UTF-8.   
• Although I have provided a sample small data sets, for testing purposes, I encourage you to create  
your own one, for which you should know the result in advance.  
CS602 - Data-Driven Development with Python             Spring 2020                         Programming Assignment 5  
5  
Grading    
The grading schema for this project is roughly as follows:   
Five points will be awarded for the correct implementation of each of the four functions above (which may  
call other functions that you define), which uses data structures, methods and functions of the pandas/numpy  
package appropriately and effectively.    
Three points will be awarded for making the code sufficiently general to handle different input files, i.e. not  
tied to the specific content of the files that you are given (though it might be somewhat dependent on the  
structure of those files, i.e. what is provided by the columns, rows, etc.)  
Two points will be awarded for style, as defined by the guidelines in Handout 1.   
Created by Tamara Babaian on November 8, 2019