COMP3314 Machine Learning  
Homework:   
Release date: April 18, 2020  
Due date: 11:59pm, April 23, 2020  
(20% deduction for every 24 hours after the due date)  
1. Given a convolution function y = f(x | w) and the value of x, perform one step of
convolution with weight tensor w, zero padding set to 1 pixel and stride set to 1. (1) Calculate
the forward pass result y (4 points); (2) calculate the Jacobian matrix ∂y/∂w (5 points); (3)
assume L is the final loss function and ∂L/∂y is given, calculate ∂L/∂w using the chain rule
(5 points).
Note: x is a 4x4 matrix, y is a 4x4 matrix, w is a 3x3 matrix, ∂L/∂y is a 4x4 matrix, ∂y/∂w is a
16x9 matrix and ∂L/∂w is a 3x3 matrix.
x =
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

w =
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9

∂L/∂y =
0.0625 0 0 0
0 0.0625 0 0
0 0 0 0.0625
0.0625 0.0625 0 0
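For checking the arithmetic, here is a minimal NumPy sketch of this computation, assuming the convolution is implemented as a cross-correlation (no kernel flip), as is standard for CNN layers, and pairing the matrices above with x, w and ∂L/∂y according to the dimensions stated in the note (the function names are illustrative only):

import numpy as np

def conv2d_forward(x, w, pad=1, stride=1):
    # Single-channel cross-correlation with zero padding.
    k = w.shape[0]
    xp = np.pad(x, pad)                       # zero-pad both axes by `pad`
    out_h = (xp.shape[0] - k) // stride + 1
    out_w = (xp.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(xp[i*stride:i*stride + k, j*stride:j*stride + k] * w)
    return y

def conv2d_grad_w(x, dL_dy, k=3, pad=1, stride=1):
    # Chain rule: dL/dw[a, b] = sum over (i, j) of dL/dy[i, j] * x_padded[i + a, j + b].
    xp = np.pad(x, pad)
    dL_dw = np.zeros((k, k))
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            dL_dw += dL_dy[i, j] * xp[i*stride:i*stride + k, j*stride:j*stride + k]
    return dL_dw

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])
dL_dy = np.array([[0.0625, 0, 0, 0],
                  [0, 0.0625, 0, 0],
                  [0, 0, 0, 0.0625],
                  [0.0625, 0.0625, 0, 0]])

print(conv2d_forward(x, w))       # forward pass y (4x4)
print(conv2d_grad_w(x, dL_dy))    # dL/dw (3x3)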
2. Given a maxpooling function y = f(x), perform one step of maxpooling with pooling
kernel size set to 3. (1) Calculate the forward pass result y (5 points); (2) assume L is the
final loss function and ∂L/∂y is given, calculate ∂L/∂x using the chain rule (5 points). Note:
(i) no padding is needed and the pooling stride is set to 1; (ii) x is a 5x5 matrix, y is a 3x3
matrix, ∂L/∂y is a 3x3 matrix and ∂L/∂x is a 5x5 matrix; (iii) when multiple max values exist
in the kernel, select the one with the smallest index (see the example kernel index below).

x =
6 2 5 4 4
9 1 5 3 7
7 2 4 3 8
4 5 5 1 0
0 2 8 0 8

∂L/∂y =
0.1111 -0.0007 0
0 0.1104 0.1111
0.1111 0.1111 0.1111

Example kernel index:
1 2 3
4 5 6
7 8 9
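A minimal NumPy sketch of the max-pooling step is given below; np.argmax scans each window in row-major order, which matches the smallest-index tie-break and the kernel index convention shown above (the function names are illustrative only):

import numpy as np

def maxpool_forward(x, k=3, stride=1):
    # Max pooling without padding; also records the argmax position of each window.
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.zeros((out_h, out_w))
    argmax = np.zeros((out_h, out_w, 2), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i*stride:i*stride + k, j*stride:j*stride + k]
            flat = int(np.argmax(window))     # first maximum in row-major order
            argmax[i, j] = (i*stride + flat // k, j*stride + flat % k)
            y[i, j] = window.flat[flat]
    return y, argmax

def maxpool_backward(dL_dy, argmax, x_shape):
    # Each upstream gradient entry is routed to the input position that produced the max.
    dL_dx = np.zeros(x_shape)
    for i in range(dL_dy.shape[0]):
        for j in range(dL_dy.shape[1]):
            r, c = argmax[i, j]
            dL_dx[r, c] += dL_dy[i, j]
    return dL_dx

x = np.array([[6, 2, 5, 4, 4],
              [9, 1, 5, 3, 7],
              [7, 2, 4, 3, 8],
              [4, 5, 5, 1, 0],
              [0, 2, 8, 0, 8]], dtype=float)
dL_dy = np.array([[0.1111, -0.0007, 0.0],
                  [0.0, 0.1104, 0.1111],
                  [0.1111, 0.1111, 0.1111]])

y, argmax = maxpool_forward(x)
print(y)                                            # forward pass y (3x3)
print(maxpool_backward(dL_dy, argmax, x.shape))     # dL/dx (5x5)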
3. Given the following dataset:   
V W X Y  
0 0 0 0  
0 1 0 1  
1 0 0 1  
1 1 0 0  
1 1 1 0  
Your task is to build a decision tree for classifying variable Y. (You can think of the dataset as
being replicated many times, i.e. overfitting is not an issue here.)
1) Compute the information gains IG(Y|V), IG(Y|W) and IG(Y|X). Remember, information
gain is defined as
IG(D_p, f) = I(D_p) − ∑_{j=1}^{m} (N_j / N_p) I(D_j)
where
I(t) = 1 − ∑_{i=1}^{c} p(i|t)²
c is the number of classes, D_p and D_j are the datasets of the parent and the j-th child node, I is
the Gini impurity, N_p is the total number of samples at the parent node and N_j is the number of
samples in the j-th child node (12 points). (A short Python sketch of this computation is given
after part 2.)
2) Write down the entire decision tree built with Gini impurity (8 points).
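The following sketch mirrors the Gini and information gain formulas above, assuming (as the column order suggests) that Y is the class label; it can be used to check the hand-computed gains:

import numpy as np

def gini(labels):
    # I(t) = 1 - sum_i p(i|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # IG = I(D_p) - sum_j (N_j / N_p) * I(D_j)
    n_p = len(parent)
    return gini(parent) - sum(len(c) / n_p * gini(c) for c in children)

# Rows of the table above; the last column (Y) is taken as the class label.
data = np.array([[0, 0, 0, 0],
                 [0, 1, 0, 1],
                 [1, 0, 0, 1],
                 [1, 1, 0, 0],
                 [1, 1, 1, 0]])
V, W, X, Y = data.T

for name, feature in (("V", V), ("W", W), ("X", X)):
    children = [Y[feature == v] for v in np.unique(feature)]
    print(f"IG(Y|{name}) = {information_gain(Y, children):.4f}")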
4. Given the following dataset and the points' information, use a logistic regression classifier to
solve the problem.
In logistic regression classification, the initial decision boundary is defined as −2.7a − 2.2b +
15.2 = 0, where a and b are the two feature variables below. Apparently, this boundary is not
optimal, and the initial accuracy with this boundary is 83.33%. Your task is to:
(1). With this initial decision boundary, use Gradient Descent and Minibatch Gradient Descent to
calculate a new decision boundary respectively. Note: in Minibatch Gradient Descent, the batch
sampling order is fixed and defined as:
order = [6, 1, 4, 2, 5, 3]
Number Var a Var b Label  
1 2.5 3.0 1  
2 3.0 2.0 1  
3 3.5 3.0 1  
4 5.0 3.5 0  
5 4.0 4.5 0  
6 4.0 4.0 0  
The minibatch size is set to 2, the learning rate is set to 0.01 and the number of training
iterations is set to 3. Write down the final decision boundary equation and the calculation steps
required to calculate the boundary (including all variable gradients and the cost value at each
step) (12 points). (A NumPy sketch of the update steps is given after part (2).)
(2). Suppose the decision boundary is −2.4a − 3.8b + 21.7 = 0. Predict the class of the
following datapoints and write down the corresponding probabilities (the Sigmoid function is used
as the activation function, and the probability threshold is set to 0.5) (6 points).
Number Var a Var b Label Prob
1 3.5 4.0
2 6.0 1.5
3 4.5 3.3
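A NumPy sketch of the gradient descent, minibatch gradient descent and sigmoid prediction steps is given below. The mean-based cross-entropy cost and gradient used here are one common convention, and whether a "training iteration" means one full-batch update or one full pass over the minibatches is an assumption; follow the course convention for the graded answer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dataset from the table above: columns are (Var a, Var b); labels are 1/0.
X = np.array([[2.5, 3.0], [3.0, 2.0], [3.5, 3.0],
              [5.0, 3.5], [4.0, 4.5], [4.0, 4.0]])
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)
Xb = np.hstack([X, np.ones((len(X), 1))])      # append 1 for the bias term

def cost(theta, Xb, y):
    p = sigmoid(Xb @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(theta, Xb, y):
    p = sigmoid(Xb @ theta)
    return Xb.T @ (p - y) / len(y)

theta0 = np.array([-2.7, -2.2, 15.2])          # initial boundary -2.7a - 2.2b + 15.2 = 0
lr = 0.01

# Full-batch gradient descent, 3 iterations.
theta_gd = theta0.copy()
for it in range(3):
    theta_gd -= lr * grad(theta_gd, Xb, y)
    print(f"GD iter {it + 1}: cost = {cost(theta_gd, Xb, y):.4f}, theta = {theta_gd}")

# Minibatch gradient descent with fixed order [6, 1, 4, 2, 5, 3], batch size 2.
order = np.array([6, 1, 4, 2, 5, 3]) - 1       # convert to 0-based indices
theta_mb = theta0.copy()
for it in range(3):
    for start in range(0, len(order), 2):
        idx = order[start:start + 2]
        theta_mb -= lr * grad(theta_mb, Xb[idx], y[idx])
    print(f"MBGD iter {it + 1}: cost = {cost(theta_mb, Xb, y):.4f}, theta = {theta_mb}")

# Part (2): prediction with boundary -2.4a - 3.8b + 21.7 = 0 and threshold 0.5.
theta2 = np.array([-2.4, -3.8, 21.7])
pts = np.array([[3.5, 4.0], [6.0, 1.5], [4.5, 3.3]])
probs = sigmoid(np.hstack([pts, np.ones((3, 1))]) @ theta2)
print(probs, (probs >= 0.5).astype(int))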
5. Consider a classification problem: there is a feature vector x ∈ ℝ^{N×1} with N features and the
class label y can be 1 or -1. Denote the weight for the slack variables ξ_i as C. There are M
samples in total and for each sample i the corresponding annotations are x_i, y_i, ξ_i
respectively. Let the equation of the decision boundary be
w^T x + b = 0
(1) We denote the standard QP form as:
min_z (1/2) z^T P z + q^T z
s.t. Gz ≤ h, Az = b
Please derive the standard QP form variables P, q, z, G, h, A, b of a linear SVM by using the
variables x_i, y_i, ξ_i, w, b, C and other necessary constant variables. (8 points) (A sketch of
one possible matrix assembly is given after part (3).)
(2) Let's assume we have solved the quadratic programming problem, and the attained normalized
direction of the decision hyperplane is ŵ = [1/2, 1/2, 1/2, 1/2]. There is a negative support
vector at [0, 0, 0, 0] and a positive support vector at [1, 2, 3, 4]. What is the width of the
margin (twice the distance from the decision boundary to the closest training samples)? What is
the weight w and bias b? (12 points)
(3) Let there be two positive samples at [1, 1, 1, 1] and [4, 4, 4, 4]. What are the slack
variables for them? (4 points)
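Below is a sketch of one common way to assemble the primal soft-margin QP matrices, using the variable ordering z = [w (N entries), b, ξ (M entries)]; this ordering is an illustrative choice, not the only acceptable answer. The resulting matrices can be handed to a general QP solver (e.g. cvxopt.solvers.qp), with A and b empty since there are no equality constraints here:

import numpy as np

def svm_primal_qp(X, y, C):
    # Standard QP form  min_z 1/2 z^T P z + q^T z  s.t.  G z <= h,
    # encoding  -y_i (w^T x_i + b) - xi_i <= -1  and  -xi_i <= 0.
    M, N = X.shape
    n_var = N + 1 + M                  # z = [w, b, xi]

    P = np.zeros((n_var, n_var))
    P[:N, :N] = np.eye(N)              # only w enters the quadratic term (1/2 ||w||^2)

    q = np.zeros(n_var)
    q[N + 1:] = C                      # C * sum_i xi_i in the linear term

    G = np.zeros((2 * M, n_var))
    h = np.zeros(2 * M)
    G[:M, :N] = -y[:, None] * X        # margin constraints
    G[:M, N] = -y
    G[:M, N + 1:] = -np.eye(M)
    h[:M] = -1.0
    G[M:, N + 1:] = -np.eye(M)         # slack non-negativity: -xi_i <= 0
    return P, q, G, h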
6. As a simple illustration of a k-means algorithm, consider the following data set consisting of  
the scores of two variables on each of 7 individuals:  
x y  
1.0 1.0  
0.3 0.4  
1.0 1.5  
3.0 0.2  
3.5 -0.8  
4.5 -0.6  
3.5 -0.5  
The dataset has roughly 2 clusters, and we set the initial cluster centers to [4.5, −1] and [3, 0]
respectively. It is apparent that the two cluster centers are initialized in the same cluster. Use
the Sum of Squared Errors (SSE) as the distance metric. (A short NumPy sketch of the procedure is
given after part (3).)
(1) What is the initial clustering result given these two initial cluster centers? (4 points)  
(2) What is the clustering result after 1 iteration? (5 points)  
(3) What is the clustering result after 4 iterations? (5 points)
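A short NumPy sketch of the procedure, using squared Euclidean (SSE) distances and the given initial centers; here one "iteration" is taken to mean a center update followed by reassignment, which should be adapted to the course's definition (neither cluster becomes empty for this data, so the simple mean update suffices):

import numpy as np

# The 7 two-dimensional points from the table above.
X = np.array([[1.0, 1.0], [0.3, 0.4], [1.0, 1.5], [3.0, 0.2],
              [3.5, -0.8], [4.5, -0.6], [3.5, -0.5]])
centers = np.array([[4.5, -1.0], [3.0, 0.0]])      # given initial cluster centers

def assign(X, centers):
    # Assign each point to the center with the smallest squared Euclidean distance.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

labels = assign(X, centers)
print("initial assignment:", labels)               # part (1)

for it in range(4):
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    labels = assign(X, centers)
    print(f"after iteration {it + 1}:", labels, centers)   # parts (2) and (3)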