COMP3314 Machine Learning  
Homework:   
Release date: April 18, 2020  
Due date: 11:59pm, April 23, 2020  
(20% deduction for every 24 hours after the due date)  
1. Given a convolution function y = f(x | w) and the value of x, perform one step of
convolution with weight tensor w, zero padding set to 1 pixel and stride set to 1. (1) Calculate
the forward pass result y (4 points); (2) calculate the Jacobian matrix ∂y/∂w (5 points); (3)
assume L is the final loss function and ∂L/∂y is given, calculate ∂L/∂w using the chain rule
(5 points).
Note: x is a 4x4 matrix, y is a 4x4 matrix, w is a 3x3 matrix, ∂L/∂y is a 4x4 matrix, ∂y/∂w is a
16x9 matrix and ∂L/∂w is a 3x3 matrix.
x =
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

w =
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9

∂L/∂y =
0.0625 0 0 0
0 0.0625 0 0
0 0 0 0.0625
0.0625 0.0625 0 0
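For checking the arithmetic, here is a minimal NumPy sketch of this computation, assuming the convolution is implemented as a cross-correlation (no kernel flip), as is standard for CNN layers, and pairing the matrices above with x, w and ∂L/∂y according to the dimensions stated in the note (the function names are illustrative only):

import numpy as np

def conv2d_forward(x, w, pad=1, stride=1):
    # Single-channel cross-correlation with zero padding.
    k = w.shape[0]
    xp = np.pad(x, pad)                       # zero-pad both axes by `pad`
    out_h = (xp.shape[0] - k) // stride + 1
    out_w = (xp.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(xp[i*stride:i*stride + k, j*stride:j*stride + k] * w)
    return y

def conv2d_grad_w(x, dL_dy, k=3, pad=1, stride=1):
    # Chain rule: dL/dw[a, b] = sum over (i, j) of dL/dy[i, j] * x_padded[i + a, j + b].
    xp = np.pad(x, pad)
    dL_dw = np.zeros((k, k))
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            dL_dw += dL_dy[i, j] * xp[i*stride:i*stride + k, j*stride:j*stride + k]
    return dL_dw

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])
dL_dy = np.array([[0.0625, 0, 0, 0],
                  [0, 0.0625, 0, 0],
                  [0, 0, 0, 0.0625],
                  [0.0625, 0.0625, 0, 0]])

print(conv2d_forward(x, w))       # forward pass y (4x4)
print(conv2d_grad_w(x, dL_dy))    # dL/dw (3x3)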
2. Given a maxpooling function y = f(x), perform one step of maxpooling with pooling
kernel size set to 3. (1) Calculate the forward pass result y (5 points); (2) assume L is the
final loss function and ∂L/∂y is given, calculate ∂L/∂x using the chain rule (5 points). Note:
(i) no padding is needed and the pooling stride is set to 1; (ii) x is a 5x5 matrix, y is a 3x3
matrix, ∂L/∂y is a 3x3 matrix and ∂L/∂x is a 5x5 matrix; (iii) when multiple max values exist
in the kernel, select the one with the smallest index (see the example kernel index below).

x =
6 2 5 4 4
9 1 5 3 7
7 2 4 3 8
4 5 5 1 0
0 2 8 0 8

∂L/∂y =
0.1111 -0.0007 0
0 0.1104 0.1111
0.1111 0.1111 0.1111

Example kernel index:
1 2 3
4 5 6
7 8 9
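A minimal NumPy sketch of the max-pooling step is given below; np.argmax scans each window in row-major order, which matches the smallest-index tie-break and the kernel index convention shown above (the function names are illustrative only):

import numpy as np

def maxpool_forward(x, k=3, stride=1):
    # Max pooling without padding; also records the argmax position of each window.
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    argmax = np.zeros((out_h, out_w, 2), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i*stride:i*stride + k, j*stride:j*stride + k]
            flat = int(np.argmax(window))     # first maximum in row-major order
            argmax[i, j] = (i*stride + flat // k, j*stride + flat % k)
            y[i, j] = window.flat[flat]
    return y, argmax

def maxpool_backward(dL_dy, argmax, x_shape):
    # Each upstream gradient entry is routed to the input position that produced the max.
    dL_dx = np.zeros(x_shape)
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            r, c = argmax[i, j]
            dL_dx[r, c] += dL_dy[i, j]
    return dL_dx

x = np.array([[6, 2, 5, 4, 4],
              [9, 1, 5, 3, 7],
              [7, 2, 4, 3, 8],
              [4, 5, 5, 1, 0],
              [0, 2, 8, 0, 8]], dtype=float)
dL_dy = np.array([[0.1111, -0.0007, 0.0],
                  [0.0, 0.1104, 0.1111],
                  [0.1111, 0.1111, 0.1111]])

y, argmax = maxpool_forward(x)
print(y)                                            # forward pass y (3x3)
print(maxpool_backward(dL_dy, argmax, x.shape))     # dL/dx (5x5)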
3. Given the following dataset:   
V W X Y  
0 0 0 0  
0 1 0 1  
1 0 0 1  
1 1 0 0  
1 1 1 0  
Your task is to build a decision tree for classifying variable Y. (You can think of the dataset as
being replicated many times, i.e. overfitting is not an issue here.)
1) Compute the information gains IG(Y|V), IG(Y|W) and IG(Y|X). Remember, information
gain is defined as
IG(D_p, f) = I(D_p) − ∑_{j=1}^{m} (N_j / N_p) I(D_j)
where
I(t) = 1 − ∑_{i=1}^{c} p(i|t)²
c is the number of classes, D_p and D_j are the datasets of the parent and the j-th child node, I is
the Gini impurity, N_p is the total number of samples at the parent node and N_j is the number of
samples in the j-th child node (12 points). (A short Python sketch of this computation is given
after part 2.)
2) Write down the entire decision tree built with Gini impurity (8 points).
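The following sketch mirrors the Gini and information gain formulas above, assuming (as the column order suggests) that Y is the class label; it can be used to check the hand-computed gains:

import numpy as np

def gini(labels):
    # I(t) = 1 - sum_i p(i|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # IG = I(D_p) - sum_j (N_j / N_p) * I(D_j)
    n_p = len(parent)
    return gini(parent) - sum(len(c) / n_p * gini(c) for c in children)

# Rows of the table above; the last column (Y) is taken as the class label.
data = np.array([[0, 0, 0, 0],
                 [0, 1, 0, 1],
                 [1, 0, 0, 1],
                 [1, 1, 0, 0],
                 [1, 1, 1, 0]])
V, W, X, Y = data.T

for name, feature in (("V", V), ("W", W), ("X", X)):
    children = [Y[feature == v] for v in np.unique(feature)]
    print(f"IG(Y|{name}) = {information_gain(Y, children):.4f}")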
4. Given the following dataset and the points' information, use a logistic regression classifier to
solve the problem.
In logistic regression classification, the initial decision boundary is defined as −2.7a − 2.2b +
15.2 = 0, where a and b are the two feature variables below. Apparently, this boundary is not
optimal, and the initial accuracy with this boundary is 83.33%. Your task is to:
(1). With this initial decision boundary, use Gradient Descent and Minibatch Gradient Descent to
calculate a new decision boundary respectively. Note: in Minibatch Gradient Descent, the batch
sampling order is fixed and defined as:
order = [6, 1, 4, 2, 5, 3]
Number Var a Var b Label  
1 2.5 3.0 1  
2 3.0 2.0 1  
3 3.5 3.0 1  
4 5.0 3.5 0  
5 4.0 4.5 0  
6 4.0 4.0 0  
The minibatch size is set to 2, the learning rate is set to 0.01 and the number of training
iterations is set to 3. Write down the final decision boundary equation and the calculation steps
required to calculate the boundary (including all variable gradients and the cost value at each
step) (12 points). (A NumPy sketch of the update steps is given after part (2).)
(2). Suppose the decision boundary is −2.4a − 3.8b + 21.7 = 0. Predict the class of the
following datapoints and write down the corresponding probabilities (the Sigmoid function is used
as the activation function, and the probability threshold is set to 0.5) (6 points).
Number Var a Var b Label Prob
1 3.5 4.0
2 6.0 1.5
3 4.5 3.3
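A NumPy sketch of the gradient descent, minibatch gradient descent and sigmoid prediction steps is given below. The mean-based cross-entropy cost and gradient used here are one common convention, and whether a "training iteration" means one full-batch update or one full pass over the minibatches is an assumption; follow the course convention for the graded answer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dataset from the table above: columns are (Var a, Var b); labels are 1/0.
X = np.array([[2.5, 3.0], [3.0, 2.0], [3.5, 3.0],
              [5.0, 3.5], [4.0, 4.5], [4.0, 4.0]])
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)
Xb = np.hstack([X, np.ones((len(X), 1))])      # append 1 for the bias term

def cost(theta, Xb, y):
    p = sigmoid(Xb @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(theta, Xb, y):
    p = sigmoid(Xb @ theta)
    return Xb.T @ (p - y) / len(y)

theta0 = np.array([-2.7, -2.2, 15.2])          # initial boundary -2.7a - 2.2b + 15.2 = 0
lr = 0.01

# Full-batch gradient descent, 3 iterations.
theta_gd = theta0.copy()
for it in range(3):
    theta_gd -= lr * grad(theta_gd, Xb, y)
    print(f"GD iter {it + 1}: cost = {cost(theta_gd, Xb, y):.4f}, theta = {theta_gd}")

# Minibatch gradient descent with fixed order [6, 1, 4, 2, 5, 3], batch size 2.
order = np.array([6, 1, 4, 2, 5, 3]) - 1       # convert to 0-based indices
theta_mb = theta0.copy()
for it in range(3):
    for start in range(0, len(order), 2):
        idx = order[start:start + 2]
        theta_mb -= lr * grad(theta_mb, Xb[idx], y[idx])
    print(f"MBGD iter {it + 1}: cost = {cost(theta_mb, Xb, y):.4f}, theta = {theta_mb}")

# Part (2): prediction with boundary -2.4a - 3.8b + 21.7 = 0 and threshold 0.5.
theta2 = np.array([-2.4, -3.8, 21.7])
pts = np.array([[3.5, 4.0], [6.0, 1.5], [4.5, 3.3]])
probs = sigmoid(np.hstack([pts, np.ones((3, 1))]) @ theta2)
print(probs, (probs >= 0.5).astype(int))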
5. Consider a classification problem: there is a feature vector x ∈ ℝ^{N×1} with N features and the
class label y can be 1 or -1. Denote the weight for the slack variables ξ_i as C. There are M
samples in total and for each sample i the corresponding annotations are x_i, y_i, ξ_i
respectively. Let the equation of the decision boundary be
w^T x + b = 0
(1) We denote the standard QP form as:
min_z (1/2) z^T P z + q^T z
s.t. Gz ≤ h, Az = b
Please derive the standard QP form variables P, q, z, G, h, A, b of a linear SVM by using the
variables x_i, y_i, ξ_i, w, b, C and other necessary constant variables. (8 points) (A sketch of
one possible matrix assembly is given after part (3).)
(2) Let's assume we have solved the quadratic programming problem, and the attained normalized
direction of the decision hyperplane is ŵ = [1/2, 1/2, 1/2, 1/2]. There is a negative support
vector at [0, 0, 0, 0] and a positive support vector at [1, 2, 3, 4]. What is the width of the
margin (twice the distance from the decision boundary to the closest training samples)? What is
the weight w and bias b? (12 points)
(3) Let there be two positive samples at [1, 1, 1, 1] and [4, 4, 4, 4]. What are the slack
variables for them? (4 points)
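Below is a sketch of one common way to assemble the primal soft-margin QP matrices, using the variable ordering z = [w (N entries), b, ξ (M entries)]; this ordering is an illustrative choice, not the only acceptable answer. The resulting matrices can be handed to a general QP solver (e.g. cvxopt.solvers.qp), with A and b empty since there are no equality constraints here:

import numpy as np

def svm_primal_qp(X, y, C):
    # Standard QP form  min_z 1/2 z^T P z + q^T z  s.t.  G z <= h,
    # encoding  -y_i (w^T x_i + b) - xi_i <= -1  and  -xi_i <= 0.
    M, N = X.shape
    n_var = N + 1 + M                  # z = [w, b, xi]

    P = np.zeros((n_var, n_var))
    P[:N, :N] = np.eye(N)              # only w enters the quadratic term (1/2 ||w||^2)

    q = np.zeros(n_var)
    q[N + 1:] = C                      # C * sum_i xi_i in the linear term

    G = np.zeros((2 * M, n_var))
    h = np.zeros(2 * M)
    G[:M, :N] = -y[:, None] * X        # margin constraints
    G[:M, N] = -y
    G[:M, N + 1:] = -np.eye(M)
    h[:M] = -1.0
    G[M:, N + 1:] = -np.eye(M)         # slack non-negativity: -xi_i <= 0
    return P, q, G, h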
6. As a simple illustration of a k-means algorithm, consider the following data set consisting of  
the scores of two variables on each of 7 individuals:  
x y  
1.0 1.0  
0.3 0.4  
1.0 1.5  
3.0 0.2  
3.5 -0.8  
4.5 -0.6  
3.5 -0.5  
The dataset has roughly 2 clusters, and we set the initial cluster centers to [4.5, −1] and [3, 0]
respectively. It is apparent that the two cluster centers are initialized in the same cluster. Use
the Sum of Squared Errors (SSE) as the distance metric. (A short NumPy sketch of the procedure is
given after part (3).)
(1) What is the initial clustering result given these two initial cluster centers? (4 points)  
(2) What is the clustering result after 1 iteration? (5 points)  
(3) What is the clustering result after 4 iterations? (5 points)
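A short NumPy sketch of the procedure, using squared Euclidean (SSE) distances and the given initial centers; here one "iteration" is taken to mean a center update followed by reassignment, which should be adapted to the course's definition (neither cluster becomes empty for this data, so the simple mean update suffices):

import numpy as np

# The 7 two-dimensional points from the table above.
X = np.array([[1.0, 1.0], [0.3, 0.4], [1.0, 1.5], [3.0, 0.2],
              [3.5, -0.8], [4.5, -0.6], [3.5, -0.5]])
centers = np.array([[4.5, -1.0], [3.0, 0.0]])      # given initial cluster centers

def assign(X, centers):
    # Assign each point to the center with the smallest squared Euclidean distance.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

labels = assign(X, centers)
print("initial assignment:", labels)               # part (1)

for it in range(4):
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    labels = assign(X, centers)
    print(f"after iteration {it + 1}:", labels, centers)   # parts (2) and (3)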