ECM3420 
(with Answers) 
UNIVERSITY OF EXETER 
COLLEGE OF ENGINEERING, MATHEMATICS 
AND PHYSICAL SCIENCES 
COMPUTER SCIENCE 
Examination, May 2019 
Learning from Data 
Module Leader: Dr. Lorenzo Livi 
Duration: TWO HOURS 
Answer question 1, and two out of the other four questions. 
Question 1 is worth 40 marks. Other questions are worth 30 marks each. 
This assessment will contribute 60% to the final mark for this module. 
No electronic calculators of any sort are to be used during the course of this 
examination. 
This is a CLOSED BOOK examination. 
SECTION A (Compulsory Question) 
Question 1 
(a) What is the difference between supervised and unsupervised learning? 
Provide an example of each. What is meant by the term semi-supervised 
learning? 
(8 marks) 
In supervised learning, the information regarding the target/desired 
output is available and exploited during training. On the other hand, 
in unsupervised learning this information is not available. [3] Examples: 
linear regression (supervised) and clustering (unsupervised). [2] 
In semi-supervised learning a small amount of labelled training data is 
available, but usually a large quantity of unlabelled data. Learning can 
take advantage of the structure of the data implicit in the unlabelled data 
as well as the labels. [3] 
(b) Briefly explain what is meant by the kernel trick and how it may be used 
to obtain a nonlinear classifier from a linear one. Name and provide 
expressions for at least two (positive definite) kernel functions. 
(10 marks) 
The kernel trick refers to the fact that inner products $\langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle$ [2] between vectors $\mathbf{x}$ and $\mathbf{y}$ mapped to a (usually high-dimensional) feature space by $\varphi(\cdot)$ can be computed via a kernel $k(\mathbf{x}, \mathbf{y})$ in the original space (subject to the kernel being a Mercer kernel) [3].
If the computation of the linear classifier can be expressed entirely in terms of inner products, then classification can be performed after nonlinearly mapping the data to a high-dimensional space via $\varphi(\cdot)$, while the computations are still carried out in the original space. This is the basis of support vector machines and kernel logistic regression. [3]
Two well-known (positive definite) kernel functions are [2]:
• Polynomial: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + v)^p$, with $v \ge 0$ and $p \ge 0$;
• RBF: $k(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|_2^2\right)$, which is known as the Gaussian RBF when $\gamma = \frac{1}{2\sigma^2}$.
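A minimal sketch of these two kernels in Python (the parameter values $v = 1$, $p = 2$ and $\gamma = 0.5$ are illustrative assumptions, not values from the paper):

```python
import numpy as np

def polynomial_kernel(x, z, v=1.0, p=2):
    """Polynomial kernel k(x, z) = (x . z + v)^p."""
    return (np.dot(x, z) + v) ** p

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||_2^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z))  # inner product computed in the original space
print(rbf_kernel(x, z))         # implicit high-dimensional feature space
```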
(c) Explain the difference between covariance and correlation. The covariance 
matrix for two variables is given by
\[
S = \begin{bmatrix} 10.2 & -45.6 \\ -45.6 & 8.1 \end{bmatrix}.
\]
Calculate the correlation between them and provide comments on the 
outcome. 
Briefly explain why two decorrelated variables need not be independent. 
(8 marks) 
Correlation is a standardised measure of covariance [2]. They are related by $\rho = \mathrm{cov}_{xy}/(\sigma_x \sigma_y)$, where $\sigma_x$ and $\sigma_y$ denote the standard deviations of $x$ and $y$, respectively. Here $\rho = -45.6/(10.2 \times 8.1) \approx -0.55$. [2] (Marks for identifying the correct elements of the covariance matrix and for the calculation.) Therefore, the two variables are moderately anti-correlated [2].
Correlation is a linear measure, so the variables might be positively related in one region and negatively related in another, with the contributions summing to zero; independence requires $p(x, y) = p(x)p(y)$, which is a stronger condition [2].
Standard material, but not all appreciate the differences.
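A quick numerical check of the arithmetic above (this mirrors the model answer, which uses the product $10.2 \times 8.1$ for $\sigma_x \sigma_y$):

```python
import numpy as np

S = np.array([[10.2, -45.6],
              [-45.6, 8.1]])

# Off-diagonal entry is the covariance; the model answer divides by
# the product of the diagonal entries.
rho = S[0, 1] / (S[0, 0] * S[1, 1])
print(round(rho, 2))  # -0.55
```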
(d) Explain what is meant by the terms generalisation error and over-fitting. 
How can over-fitting be combatted? 
(6 marks) 
Generalisation error: The average error made on unseen test data after 
training. [2] 
Overfitting: The phenomenon of learning the details of the training data 
rather than the trends. [2] 
Over-fitting is combatted by controlling the complexity of the learning machine, for example through regularisation or by limiting the model size. [2]
(e) With the aid of a carefully labelled diagram, explain what is meant by a 
convex function. Why is it desirable for an error function to be convex? 
(8 marks) 
Several possible definitions. For a function of one variable: every secant line lies on or above the graph, i.e. $f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda)f(y)$ for all $\lambda \in [0, 1]$; equivalently, the second derivative is non-negative everywhere. For multivariate functions: the eigenvalues of the Hessian are all non-negative, or the epigraph of the function is a convex set. [4] for a multivariate definition; less for a univariate definition only. Diagram: [2].
It is desirable to have a convex error function because a convex function has no local minima other than its global minima. Therefore, if a minimum is located, it is a global minimum. [2]
(Total 40 marks) 
SECTION B (Optional Questions) 
Question 2 
(a) Write the equation and draw a diagram describing the perceptron model. 
Specify all variables involved in the model (using formal notation) and 
describe their role. 
(10 marks) 
The equation describing a perceptron reads:
\[
y = \varphi(v) = \varphi\!\left(\sum_{i=1}^{d} w_i x_i + b\right).
\]
[4] Let us describe the elements involved in the above expression: [6]
• $\mathbf{x} = [x_1, x_2, \ldots, x_d]^{\mathsf{T}}$ is the input vector (feature vector);
• $\mathbf{w} = [w_1, w_2, \ldots, w_d]^{\mathsf{T}}$ are the synaptic weights;
• $b \in \mathbb{R}$ is called the bias;
• $v = \sum_{i=1}^{d} w_i x_i + b$ is called the local field (or pre-activation);
• $\varphi : \mathbb{R} \to \mathbb{R}$ is called the activation (or transfer) function;
• $y = \varphi(v)$ is the neuron output.
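A minimal sketch of the perceptron's forward computation, assuming a sign activation function (the activation choice and example values are illustrative):

```python
import numpy as np

def perceptron(x, w, b, phi=np.sign):
    """Perceptron output y = phi(v), with local field v = w . x + b."""
    v = np.dot(w, x) + b    # local field / pre-activation
    return phi(v)           # activation (transfer) function

x = np.array([0.5, -1.2, 3.0])   # input (feature) vector
w = np.array([0.4, 0.1, -0.7])   # synaptic weights
b = 0.2                          # bias
print(perceptron(x, w, b))       # neuron output
```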
(b) Write the equations describing a three-layer MLP. Specify all variables and 
functions involved in the model and describe their role. Assume one hidden 
layer with L neurons and one output neuron. 
(10 marks) 
We assume the bias is included in the network weights. Let us start from the output neuron:
\[
y = \varphi(v), \qquad v = \sum_{i=0}^{L} \alpha_i x_i. \tag{1}
\]
[3] Each component $x_i$ in Eq. (1) is the output $y_i^{(H)}$ of the activation function of a neuron in the hidden layer, i.e., $x_i \equiv y_i^{(H)}$ [3].
Therefore, a three-layer MLP can be described as follows:
\[
y = \varphi\!\left(\sum_{i=0}^{L} \alpha_i \underbrace{\varphi\!\left(\sum_{j=0}^{d} w_{ji} x_j\right)}_{y_i^{(H)}}\right), \tag{2}
\]
where $d$ is the input dimensionality. [4]
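A minimal sketch of the forward pass of Eq. (2), assuming a tanh activation and with biases absorbed into the weights (the sizes and random values are illustrative assumptions):

```python
import numpy as np

def mlp_forward(x, W, alpha, phi=np.tanh):
    """Three-layer MLP: one hidden layer of L neurons, one output neuron.
    W has shape (L, d); alpha has shape (L,)."""
    y_hidden = phi(W @ x)          # hidden outputs y_i^(H)
    return phi(alpha @ y_hidden)   # network output y

rng = np.random.default_rng(0)
d, L = 3, 5                        # input dimensionality and hidden width
x = rng.normal(size=d)
W = rng.normal(size=(L, d))        # input-to-hidden weights w_ji
alpha = rng.normal(size=L)         # hidden-to-output weights alpha_i
print(mlp_forward(x, W, alpha))
```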
(c) Write the update rule and describe the Least-Mean-Square (LMS) algorithm. 
(10 marks) 
Least-Mean-Square (LMS) is an online learning algorithm used to find the weights of a linear model. LMS updates the solution after the presentation of each input datum. [5] The LMS update rule reads:
\[
\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\, \mathbf{x}(k)\, e(k), \tag{3}
\]
where $\mathbf{w}(k+1)$ denotes the updated weights at time-step $k+1$, $\mathbf{w}(k)$ are the current weights, $\mu > 0$ is the learning rate, $\mathbf{x}(k)$ is the current input, and $e(k) = d(k) - y(k)$ is the instantaneous error, given by the difference between the desired and computed outputs. [5]
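A minimal sketch of the LMS rule of Eq. (3) for a linear model (the learning rate and synthetic data are illustrative assumptions):

```python
import numpy as np

def lms_epoch(w, X, d, mu=0.01):
    """One online pass of LMS: update after each datum is presented."""
    for x_k, d_k in zip(X, d):
        e_k = d_k - np.dot(w, x_k)   # instantaneous error e(k) = d(k) - y(k)
        w = w + mu * x_k * e_k       # w(k+1) = w(k) + mu x(k) e(k)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w_true = np.array([1.5, -0.5])
d = X @ w_true                       # desired outputs
w = np.zeros(2)
for _ in range(50):
    w = lms_epoch(w, X, d)
print(w)                             # converges towards w_true
```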
(Total 30 marks) 
Question 3 
A start-up wishes to use machine learning to predict the polluting emissions in 
car exhaust from the characteristics of the car, such as engine size, fuel, driving 
speed, etc. They have collected a large data set of measured emissions and the 
corresponding features for each measurement. 
(a) In supervised learning, an error function, E(w), may be defined as: 
\[
E(\mathbf{w}) = -\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}_n; \mathbf{w}).
\]
State the meaning of each of the symbols w, N,xn and tn in this definition 
and carefully explain how the error function may be derived from the 
principle of maximum likelihood. 
(11 marks) 
• $\mathbf{w}$: the parameters/weights of the model. [1]
• $N$: the number of training data pairs. [1]
• $\mathbf{x}_n$: the training features. [1]
• $t_n$: the training targets. [1]
The likelihood is $\prod_{n=1}^{N} p(t_n, \mathbf{x}_n \mid \mathbf{w})$. [1] Assuming that the data pairs are independent [2], taking negative logarithms, and writing $p(t_n, \mathbf{x}_n \mid \mathbf{w}) = p(t_n \mid \mathbf{x}_n, \mathbf{w})\, p(\mathbf{x}_n)$ [2] obtains the error function to be minimised (maximising the likelihood is equivalent to minimising its negative logarithm because $\log$ is monotonic [1]), once the additive constant due to $p(\mathbf{x}_n)$ is dropped. [1]
Straightforward, but many find it difficult. 
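The steps can be written out explicitly. A worked version of the derivation, using only the quantities defined above:
\begin{align*}
L(\mathbf{w}) &= \prod_{n=1}^{N} p(t_n, \mathbf{x}_n \mid \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n; \mathbf{w})\, p(\mathbf{x}_n), \\
-\log L(\mathbf{w}) &= -\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}_n; \mathbf{w}) \;-\; \sum_{n=1}^{N} \log p(\mathbf{x}_n).
\end{align*}
The second sum does not depend on $\mathbf{w}$, so minimising the first sum, which is exactly $E(\mathbf{w})$, is equivalent to maximising the likelihood.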
(b) The start-up uses a radial basis function neural network with a mean squared 
error function to learn the relationship between features and emissions. 
Explain what the use of the mean squared error function implies about the 
noise or measurement errors in the data. 
(4 marks) 
The mean squared error function is derived from the assumption that the 
errors between the systematic model and the measurements are Gaussian 
distributed. [4] 
Discussed in lectures. 
(c) Explain how the parameters of the radial basis function network may be 
determined with the mean squared error function. 
(7 marks) 
Learning proceeds by selecting a number of data points as RBF centres and then performing linear regression on the basis function activations. [3] This requires constructing a design/data matrix $X$ whose $(n, j)$-th entry is the activation of basis function $j$ on feature vector $\mathbf{x}_n$. The weights are found by $\mathbf{w} = X^{\dagger}\mathbf{y}$, where $X^{\dagger}$ is the pseudo-inverse and $\mathbf{y}$ is the vector of targets. [2] The regularisation constant is found by cross-validation. [2]
Discussed in lectures and illustrated in a workshop. 
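A minimal sketch of this procedure, with Gaussian basis functions and randomly chosen data points as centres (the width $\gamma$, the number of centres, and the synthetic data are illustrative assumptions; the regularisation term is omitted):

```python
import numpy as np

def rbf_design_matrix(X, centres, gamma=1.0):
    """Entry (n, j) is the activation of basis function j on input x_n."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)    # noisy targets

centres = X[rng.choice(len(X), size=10, replace=False)]  # data points as centres
Phi = rbf_design_matrix(X, centres)
w = np.linalg.pinv(Phi) @ y          # w = X^dagger y (pseudo-inverse)
print(np.mean((Phi @ w - y) ** 2))   # training mean squared error
```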
(d) In order to cope with occasional outliers the start-up decides to use a sum of 
absolute errors function. Explain why the sum of absolute errors function 
may help in this context. 
(4 marks) 
The sum of absolute errors function corresponds to Laplacian-distributed errors [2], $p(t \mid \mathbf{x}; \mathbf{w}) \propto \exp(-|t - y(\mathbf{w})|)$, under which large deviations from the prediction are more likely, so outliers do not have as large an influence on the weights as in the Gaussian case. [2]
(e) Suggest how the parameters of the radial basis function network could be 
learned using the sum of absolute errors. 
(4 marks) 
Numerical optimisation rather than linear algebra must be used. [2] 
Either Nelder-Mead simplex method or a gradient-based method. [2] 
This was illustrated in a workshop for ordinary linear regression, but 
most candidates will not put the RBF training and optimisation ideas 
together. 
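A minimal sketch putting the two ideas together: the same RBF design matrix as before, but with the weights found by numerically minimising the sum of absolute errors via Nelder-Mead (the data, the injected outliers, and the optimiser settings are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
y[::10] += 5.0                       # inject occasional outliers

centres = X[rng.choice(len(X), size=8, replace=False)]
Phi = np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))

def sum_abs_errors(w):
    return np.abs(Phi @ w - y).sum()   # L1 error: robust to outliers

res = minimize(sum_abs_errors, x0=np.zeros(len(centres)),
               method="Nelder-Mead", options={"maxiter": 50000})
print(res.fun)                         # final sum of absolute errors
```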
(Total 30 marks) 
Question 4 
(a) Explain what is meant by the term recurrent neural network and what 
recurrent neural networks are used for. 
(4 marks) 
Recurrent neural networks feed the output signal back to the inputs. [2] 
Used for modelling and predicting time series. [2] 
Bookwork 
(b) Illustrating your answer with a diagram, explain what is meant by an echo
state network. Your answer should include equations giving the update rule 
and output for an echo state network. 
(11 marks) 
Diagram showing input layer, reservoir, output layer and feedback. [4] 
Discussed in lectures 
Update rule:
\begin{align*}
\mathbf{x}_t &= \varphi(W^{\mathrm{r}} \mathbf{x}_{t-1} + W^{\mathrm{i}} \mathbf{u}_t + \boldsymbol{\epsilon}), \\
\mathbf{z}_t &= W^{\mathrm{o}} \mathbf{x}_t,
\end{align*}
[2] where $\varphi(\cdot)$ is the activation function (typically, but not necessarily, a sigmoidal function), $\mathbf{x}_t$ is the ESN state at time $t$, $\mathbf{u}_t$ and $\mathbf{z}_t$ are the input and output at time $t$, respectively, and $\boldsymbol{\epsilon}$ is an additive white-noise term. $W^{\mathrm{r}}$ are the weights of the recurrent layer (connections between hidden neurons) and $W^{\mathrm{i}}$ are the weights between the input and the recurrent layer. [2] The $W^{\mathrm{o}}$ weights connect the ESN state to the output and form the so-called read-out layer. [3]
Standard material. 
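A minimal sketch of the state update, assuming a tanh activation and a reservoir of 50 neurons whose recurrent weights are scaled to a spectral radius below one (a common choice for the echo state property; all sizes and scalings here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
Nr, Ni = 50, 1                                        # reservoir and input sizes
W_r = rng.normal(size=(Nr, Nr))
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))   # spectral radius 0.9
W_i = rng.normal(size=(Nr, Ni))

def esn_states(U, noise=1e-4):
    """Drive the reservoir with an input sequence U of shape (T, Ni)."""
    x = np.zeros(Nr)
    states = []
    for u in U:
        x = np.tanh(W_r @ x + W_i @ u + noise * rng.normal(size=Nr))
        states.append(x)
    return np.array(states)                           # collected states, (T, Nr)

U = np.sin(np.linspace(0, 8 * np.pi, 400))[:, None]
print(esn_states(U).shape)                            # (400, 50)
```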
(c) Describe in detail how the read-out weights of echo state networks are 
obtained. You should include in your answer the expressions describing the 
related optimization problem and how it is solved. 
(11 marks) 
The read-out weights $W^{\mathrm{o}}$ are obtained by solving the following regularised [1] least-squares problem [2]:
\[
\operatorname*{arg\,min}_{W \in \mathbb{R}^{N_r \times N_o}} \; \frac{1}{2}\|XW - \mathbf{t}\|^2 + \frac{\lambda}{2}\|W\|^2,
\]
[3] where $\lambda \ge 0$ is the regularisation parameter [1], $X \in \mathbb{R}^{N \times N_r}$ contains the ESN states produced by the action of an input sequence of length $N$, and $\mathbf{t} \in \mathbb{R}^{N}$ is a vector containing the target values. [2]
The closed-form solution is obtained as:
\[
W^{\mathrm{o}} = \left(X^{\mathsf{T}} X + \lambda I\right)^{-1} X^{\mathsf{T}} \mathbf{t}, \tag{4}
\]
where $I$ is the $N_r \times N_r$ identity matrix. [2]
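A minimal sketch of Eq. (4) for a single output (the stand-in matrices and the value of $\lambda$ are illustrative assumptions; in practice $X$ would hold the collected reservoir states):

```python
import numpy as np

def readout_weights(X, t, lam=1e-2):
    """Solve (X^T X + lam I) W = X^T t for the read-out weights."""
    Nr = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(Nr), X.T @ t)

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 50))    # stand-in for ESN states, N x Nr
t = rng.normal(size=400)          # stand-in target vector
W_o = readout_weights(X, t)
print(W_o.shape)                  # (50,)
```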
(d) Explain why regularisation is used in finding the read-out weights. 
(4 marks) 
Regularization improves the numerical stability of the matrix inversion 
operation and reduces the magnitudes of entries in Wo, thus mitigating 
sensitivity to noise and overfitting. [4] 
(Total 30 marks) 
Question 5 
(a) Explain the difference between a hierarchical clustering algorithm and a 
clustering algorithm that yields a partition. Explain when you might prefer 
to use a hierarchical algorithm. 
(5 marks) 
A hierarchical algorithm gives clusters at different resolutions, while a partitional algorithm just divides the data into a fixed number of clusters [3]. Hierarchical algorithms are useful for examining the structure of the data at different resolutions and when the number of clusters is unknown. [2]
(b) Explain how agglomerative clustering is performed. 
(6 marks) 
Agglomerative clustering begins with each data point as a singleton cluster. [1] At each stage the two closest clusters are merged to form a new cluster. [2] The algorithm terminates when a single cluster remains. [1] Distances between clusters can be measured in a variety of ways, for example as the minimum or maximum distance between any two points, one from each cluster (single and complete linkage, respectively). [2]
Standard, but many will not appreciate the different distance measures.
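A minimal sketch using SciPy's hierarchical clustering (the synthetic data and the choice of single linkage, i.e. the minimum inter-cluster distance, are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),    # two well-separated blobs
               rng.normal(5, 1, size=(20, 2))])

Z = linkage(X, method="single")                   # repeatedly merge closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
print(labels)
```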
(c) Carefully defining the symbols you use, give an expression for the objective 
function minimised by the k-means algorithm. 
(6 marks) 
Objective function:
\[
\sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|_2^2, \quad [2]
\]
where $\|\cdot\|_2^2$ is the squared Euclidean norm, $C_j$ is the $j$th cluster in the partition $P$, $k = |P|$, and $\boldsymbol{\mu}_j$, $j = 1, \ldots, k$, are the $k$ cluster representatives. [2] If clusters are represented by centroids, then $\boldsymbol{\mu}_j$ is defined as:
\[
\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i. \quad [2]
\]
(d) Giving pseudo-code, explain how the k-means clustering is achieved. 
(7 marks) 
Here is detailed pseudo-code, but full marks for a less formal presentation that includes the initialisation [2], assignment [2], update [2], and termination [1] steps.
Input: data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, partition order $k$, a distance function $d : X \times X \to \mathbb{R}^{+}_{0}$, maximum number of iterations MAX
Output: a partition $P = \{C_1, \ldots, C_k\}$
1: $t = 0$, $P = \emptyset$
2: Initialisation: assign the $k$ cluster representatives $\boldsymbol{\mu}_i$ as randomly chosen points $\mathbf{x}_i$
3: loop
4: $t = t + 1$
5: Assignment step: assign each sample $\mathbf{x}_j$ to the cluster with the closest representative:
6: $C_i^{(t)} = \{\mathbf{x}_j : d(\mathbf{x}_j, \boldsymbol{\mu}_i) \le d(\mathbf{x}_j, \boldsymbol{\mu}_h) \text{ for all } h = 1, \ldots, k\}$
7: Update step: update the representatives:
8: $\boldsymbol{\mu}_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{\mathbf{x}_j \in C_i^{(t)}} \mathbf{x}_j$
9: Update the partition with the modified clusters: $P^t = \{C_1^{(t)}, \ldots, C_k^{(t)}\}$
10: if $t \ge$ MAX or $P^t = P^{t-1}$ then
11: return $P^t$
12: end if
13: end loop
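A minimal Python sketch of the pseudo-code above (the synthetic data and the random initialisation seed are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: random initialisation, assignment and update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()  # initialisation
    labels = np.full(len(X), -1)
    for t in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)          # assignment step
        if np.array_equal(new_labels, labels):     # partition unchanged: stop
            break
        labels = new_labels
        for j in range(k):                         # update step: recompute centroids
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    return labels, mu

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, mu = kmeans(X, k=2)
print(mu)                                          # the two cluster centroids
```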
(e) Two users run the k-means algorithm on the same data set, but find that it 
gives two different partitions. Explain how this may occur and what may be 
done in practice to cope with this phenomenon. 
(6 marks) 
The reason is that the cluster representatives are initialised at random, so the algorithm may converge to different local minima of the objective function. [3] In practice the algorithm is run several times from different initialisations and the partition with the minimum objective function value is returned. [3]
Standard, but not appreciated by many. 
(Total 30 marks) 
End of Paper