ECM3420 
(with Answers) 
UNIVERSITY OF EXETER 
COLLEGE OF ENGINEERING, MATHEMATICS 
AND PHYSICAL SCIENCES 
COMPUTER SCIENCE 
Examination, May 2019 
Learning from Data 
Module Leader: Dr. Lorenzo Livi 
Duration: TWO HOURS 
Answer question 1, and two out of the other four questions. 
Question 1 is worth 40 marks. Other questions are worth 30 marks each. 
This assessment will contribute 60% to the final mark for this module. 
No electronic calculators of any sort are to be used during the course of this 
examination. 
This is a CLOSED BOOK examination. 
SECTION A (Compulsory Question) 
Question 1 
(a) What is the difference between supervised and unsupervised learning? 
Provide an example of each. What is meant by the term semi-supervised 
learning? 
(8 marks) 
In supervised learning, the information regarding the target/desired 
output is available and exploited during training. On the other hand, 
in unsupervised learning this information is not available. [3] Examples: 
linear regression (supervised) and clustering (unsupervised). [2] 
In semi-supervised learning a small amount of labelled training data is 
available, but usually a large quantity of unlabelled data. Learning can 
take advantage of the structure of the data implicit in the unlabelled data 
as well as the labels. [3] 
(b) Briefly explain what is meant by the kernel trick and how it may be used 
to obtain a nonlinear classifier from a linear one. Name and provide 
expressions for at least two (positive definite) kernel functions. 
(10 marks) 
The kernel trick refers to the fact that inner products $\langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle$ [2] between vectors $\mathbf{x}$ and $\mathbf{y}$ mapped to a (usually high-dimensional) feature space by $\varphi(\cdot)$ can be computed via a kernel $k(\mathbf{x}, \mathbf{y})$ in the original space (subject to the kernel being a Mercer kernel) [3].
If the computation of the linear classifier can be expressed entirely in terms of inner products, then classification can be performed after nonlinearly mapping the data to a high-dimensional space via $\varphi(\cdot)$, while the computations are still carried out in the original space. This is the basis of support vector machines and kernel logistic regression. [3]
Two well-known (positive definite) kernel functions are [2]:
• Polynomial: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + v)^p$, with $v \ge 0$ and $p \ge 0$;
• RBF: $k(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|_2^2\right)$, which is known as the Gaussian RBF when $\gamma = \frac{1}{2\sigma^2}$.
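A minimal sketch of these two kernels in Python (the parameter values $v = 1$, $p = 2$ and $\gamma = 0.5$ are illustrative assumptions, not values from the paper):

```python
import numpy as np

def polynomial_kernel(x, z, v=1.0, p=2):
    """Polynomial kernel k(x, z) = (x . z + v)^p."""
    return (np.dot(x, z) + v) ** p

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||_2^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z))  # inner product computed in the original space
print(rbf_kernel(x, z))         # implicit high-dimensional feature space
```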
(c) Explain the difference between covariance and correlation. The covariance 
matrix for two variables is given by
\[
S = \begin{bmatrix} 10.2 & -45.6 \\ -45.6 & 8.1 \end{bmatrix}.
\]
Calculate the correlation between them and provide comments on the 
outcome. 
Briefly explain why two decorrelated variables need not be independent. 
(8 marks) 
Correlation is a standardised measure of covariance [2]. They are related by $\rho = \mathrm{cov}_{xy}/(\sigma_x \sigma_y)$, where $\sigma_x$ and $\sigma_y$ denote the standard deviations of $x$ and $y$, respectively. Here $\rho = -45.6/(10.2 \times 8.1) \approx -0.55$. [2] (Marks for identifying the correct elements of the covariance matrix and for the calculation.) Therefore, the two variables are moderately anti-correlated [2].
Correlation is a linear measure, so the variables might be positively related in one region and negatively related in another, with the contributions summing to zero; independence requires $p(x, y) = p(x)p(y)$, which is a stronger condition [2].
Standard material, but not all appreciate the differences.
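A quick numerical check of the arithmetic above (this mirrors the model answer, which uses the product $10.2 \times 8.1$ for $\sigma_x \sigma_y$):

```python
import numpy as np

S = np.array([[10.2, -45.6],
              [-45.6, 8.1]])

# Off-diagonal entry is the covariance; the model answer divides by
# the product of the diagonal entries.
rho = S[0, 1] / (S[0, 0] * S[1, 1])
print(round(rho, 2))  # -0.55
```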
(d) Explain what is meant by the terms generalisation error and over-fitting. 
How can over-fitting be combatted? 
(6 marks) 
Generalisation error: The average error made on unseen test data after 
training. [2] 
Overfitting: The phenomenon of learning the details of the training data 
rather than the trends. [2] 
Over-fitting is combatted by controlling the complexity of the learning machine, for example through regularisation or by limiting the model size. [2]
(e) With the aid of a carefully labelled diagram, explain what is meant by a 
convex function. Why is it desirable for an error function to be convex? 
(8 marks) 
Several possible definitions. For a function of one variable: every secant line lies on or above the graph, i.e. $f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda)f(y)$ for all $\lambda \in [0, 1]$; equivalently, the second derivative is non-negative everywhere. For multivariate functions: the eigenvalues of the Hessian are all non-negative, or the epigraph of the function is a convex set. [4] for a multivariate definition; less for a univariate definition only. Diagram: [2].
It is desirable to have a convex error function because a convex function has no local minima other than its global minima. Therefore, if a minimum is located, it is a global minimum. [2]
(Total 40 marks) 
SECTION B (Optional Questions) 
Question 2 
(a) Write the equation and draw a diagram describing the perceptron model. 
Specify all variables involved in the model (using formal notation) and 
describe their role. 
(10 marks) 
The equation describing a perceptron reads:
\[
y = \varphi(v) = \varphi\!\left(\sum_{i=1}^{d} w_i x_i + b\right).
\]
[4] Let us describe the elements involved in the above expression: [6]
• $\mathbf{x} = [x_1, x_2, \ldots, x_d]^{\mathsf{T}}$ is the input vector (feature vector);
• $\mathbf{w} = [w_1, w_2, \ldots, w_d]^{\mathsf{T}}$ are the synaptic weights;
• $b \in \mathbb{R}$ is called the bias;
• $v = \sum_{i=1}^{d} w_i x_i + b$ is called the local field (or pre-activation);
• $\varphi : \mathbb{R} \to \mathbb{R}$ is called the activation (or transfer) function;
• $y = \varphi(v)$ is the neuron output.
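A minimal sketch of the perceptron's forward computation, assuming a sign activation function (the activation choice and example values are illustrative):

```python
import numpy as np

def perceptron(x, w, b, phi=np.sign):
    """Perceptron output y = phi(v), with local field v = w . x + b."""
    v = np.dot(w, x) + b    # local field / pre-activation
    return phi(v)           # activation (transfer) function

x = np.array([0.5, -1.2, 3.0])   # input (feature) vector
w = np.array([0.4, 0.1, -0.7])   # synaptic weights
b = 0.2                          # bias
print(perceptron(x, w, b))       # neuron output
```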
(b) Write the equations describing a three-layer MLP. Specify all variables and 
functions involved in the model and describe their role. Assume one hidden 
layer with L neurons and one output neuron. 
(10 marks) 
We assume the bias is included in the network weights. Let us start from the output neuron:
\[
y = \varphi(v), \qquad v = \sum_{i=0}^{L} \alpha_i x_i. \tag{1}
\]
[3] Each component $x_i$ in Eq. (1) is the output $y_i^{(H)}$ of the activation function of a neuron in the hidden layer, i.e., $x_i \equiv y_i^{(H)}$ [3].
Therefore, a three-layer MLP can be described as follows:
\[
y = \varphi\!\left(\sum_{i=0}^{L} \alpha_i \underbrace{\varphi\!\left(\sum_{j=0}^{d} w_{ji} x_j\right)}_{y_i^{(H)}}\right), \tag{2}
\]
where $d$ is the input dimensionality. [4]
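A minimal sketch of the forward pass of Eq. (2), assuming a tanh activation and with biases absorbed into the weights (the sizes and random values are illustrative assumptions):

```python
import numpy as np

def mlp_forward(x, W, alpha, phi=np.tanh):
    """Three-layer MLP: one hidden layer of L neurons, one output neuron.
    W has shape (L, d); alpha has shape (L,)."""
    y_hidden = phi(W @ x)          # hidden outputs y_i^(H)
    return phi(alpha @ y_hidden)   # network output y

rng = np.random.default_rng(0)
d, L = 3, 5                        # input dimensionality and hidden width
x = rng.normal(size=d)
W = rng.normal(size=(L, d))        # input-to-hidden weights w_ji
alpha = rng.normal(size=L)         # hidden-to-output weights alpha_i
print(mlp_forward(x, W, alpha))
```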
(c) Write the update rule and describe the Least-Mean-Square (LMS) algorithm. 
(10 marks) 
Least-Mean-Square (LMS) is an online learning algorithm used to find the weights of a linear model. LMS updates the solution after the presentation of each input datum. [5] The LMS update rule reads:
\[
\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\, \mathbf{x}(k)\, e(k), \tag{3}
\]
where $\mathbf{w}(k+1)$ denotes the updated weights at time-step $k+1$, $\mathbf{w}(k)$ are the current weights, $\mu > 0$ is the learning rate, $\mathbf{x}(k)$ is the current input, and $e(k) = d(k) - y(k)$ is the instantaneous error, given by the difference between the desired and computed outputs. [5]
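A minimal sketch of the LMS rule of Eq. (3) for a linear model (the learning rate and synthetic data are illustrative assumptions):

```python
import numpy as np

def lms_epoch(w, X, d, mu=0.01):
    """One online pass of LMS: update after each datum is presented."""
    for x_k, d_k in zip(X, d):
        e_k = d_k - np.dot(w, x_k)   # instantaneous error e(k) = d(k) - y(k)
        w = w + mu * x_k * e_k       # w(k+1) = w(k) + mu x(k) e(k)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w_true = np.array([1.5, -0.5])
d = X @ w_true                       # desired outputs
w = np.zeros(2)
for _ in range(50):
    w = lms_epoch(w, X, d)
print(w)                             # converges towards w_true
```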
(Total 30 marks) 
Question 3 
A start-up wishes to use machine learning to predict the polluting emissions in 
car exhaust from the characteristics of the car, such as engine size, fuel, driving 
speed, etc. They have collected a large data set of measured emissions and the 
corresponding features for each measurement. 
(a) In supervised learning, an error function, E(w), may be defined as: 
\[
E(\mathbf{w}) = -\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}_n; \mathbf{w}).
\]
State the meaning of each of the symbols w, N,xn and tn in this definition 
and carefully explain how the error function may be derived from the 
principle of maximum likelihood. 
(11 marks) 
• $\mathbf{w}$: the parameters/weights of the model. [1]
• $N$: the number of training data pairs. [1]
• $\mathbf{x}_n$: the training features. [1]
• $t_n$: the training targets. [1]
The likelihood is $\prod_{n=1}^{N} p(t_n, \mathbf{x}_n \mid \mathbf{w})$. [1] Assuming that the data pairs are independent [2], taking negative logarithms, and writing $p(t_n, \mathbf{x}_n \mid \mathbf{w}) = p(t_n \mid \mathbf{x}_n, \mathbf{w})\, p(\mathbf{x}_n)$ [2] obtains the error function to be minimised (maximising the likelihood is equivalent to minimising its negative logarithm because $\log$ is monotonic [1]), once the additive constant due to $p(\mathbf{x}_n)$ is dropped. [1]
Straightforward, but many find it difficult. 
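The steps can be written out explicitly. A worked version of the derivation, using only the quantities defined above:
\begin{align*}
L(\mathbf{w}) &= \prod_{n=1}^{N} p(t_n, \mathbf{x}_n \mid \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n; \mathbf{w})\, p(\mathbf{x}_n), \\
-\log L(\mathbf{w}) &= -\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}_n; \mathbf{w}) \;-\; \sum_{n=1}^{N} \log p(\mathbf{x}_n).
\end{align*}
The second sum does not depend on $\mathbf{w}$, so minimising the first sum, which is exactly $E(\mathbf{w})$, is equivalent to maximising the likelihood.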
(b) The start-up uses a radial basis function neural network with a mean squared 
error function to learn the relationship between features and emissions. 
Explain what the use of the mean squared error function implies about the 
noise or measurement errors in the data. 
(4 marks) 
The mean squared error function is derived from the assumption that the 
errors between the systematic model and the measurements are Gaussian 
distributed. [4] 
Discussed in lectures. 
(c) Explain how the parameters of the radial basis function network may be 
determined with the mean squared error function. 
(7 marks) 
Learning proceeds by selecting a number of data points as RBF centres and then performing linear regression on the basis function activations. [3] This requires constructing a design/data matrix $X$ whose $(n, j)$-th entry is the activation of basis function $j$ on feature vector $\mathbf{x}_n$. The weights are found by $\mathbf{w} = X^{\dagger}\mathbf{y}$, where $X^{\dagger}$ is the pseudo-inverse and $\mathbf{y}$ is the vector of targets. [2] The regularisation constant is found by cross-validation. [2]
Discussed in lectures and illustrated in a workshop. 
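A minimal sketch of this procedure, with Gaussian basis functions and randomly chosen data points as centres (the width $\gamma$, the number of centres, and the synthetic data are illustrative assumptions; the regularisation term is omitted):

```python
import numpy as np

def rbf_design_matrix(X, centres, gamma=1.0):
    """Entry (n, j) is the activation of basis function j on input x_n."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)    # noisy targets

centres = X[rng.choice(len(X), size=10, replace=False)]  # data points as centres
Phi = rbf_design_matrix(X, centres)
w = np.linalg.pinv(Phi) @ y          # w = X^dagger y (pseudo-inverse)
print(np.mean((Phi @ w - y) ** 2))   # training mean squared error
```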
(d) In order to cope with occasional outliers the start-up decides to use a sum of 
absolute errors function. Explain why the sum of absolute errors function 
may help in this context. 
(4 marks) 
The sum of absolute errors function corresponds to Laplacian-distributed errors [2], $p(t \mid \mathbf{x}; \mathbf{w}) \propto \exp(-|t - y(\mathbf{w})|)$, under which large deviations from the prediction are more likely, so outliers do not have as large an influence on the weights as in the Gaussian case. [2]
(e) Suggest how the parameters of the radial basis function network could be 
learned using the sum of absolute errors. 
(4 marks) 
Numerical optimisation rather than linear algebra must be used. [2] 
Either Nelder-Mead simplex method or a gradient-based method. [2] 
This was illustrated in a workshop for ordinary linear regression, but 
most candidates will not put the RBF training and optimisation ideas 
together. 
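A minimal sketch putting the two ideas together: the same RBF design matrix as before, but with the weights found by numerically minimising the sum of absolute errors via Nelder-Mead (the data, the injected outliers, and the optimiser settings are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
y[::10] += 5.0                       # inject occasional outliers

centres = X[rng.choice(len(X), size=8, replace=False)]
Phi = np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))

def sum_abs_errors(w):
    return np.abs(Phi @ w - y).sum()   # L1 error: robust to outliers

res = minimize(sum_abs_errors, x0=np.zeros(len(centres)),
               method="Nelder-Mead", options={"maxiter": 50000})
print(res.fun)                         # final sum of absolute errors
```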
(Total 30 marks) 
Question 4 
(a) Explain what is meant by the term recurrent neural network and what 
recurrent neural networks are used for. 
(4 marks) 
Recurrent neural networks feed the output signal back to the inputs. [2] 
Used for modelling and predicting time series. [2] 
Bookwork 
(b) Illustrating your answer with a diagram, explain what is meant by an echo
state network. Your answer should include equations giving the update rule 
and output for an echo state network. 
(11 marks) 
Diagram showing input layer, reservoir, output layer and feedback. [4] 
Discussed in lectures 
Update rule:
\begin{align*}
\mathbf{x}_t &= \varphi(W^{\mathrm{r}} \mathbf{x}_{t-1} + W^{\mathrm{i}} \mathbf{u}_t + \boldsymbol{\epsilon}), \\
\mathbf{z}_t &= W^{\mathrm{o}} \mathbf{x}_t,
\end{align*}
[2] where $\varphi(\cdot)$ is the activation function (typically, but not necessarily, a sigmoidal function), $\mathbf{x}_t$ is the ESN state at time $t$, $\mathbf{u}_t$ and $\mathbf{z}_t$ are the input and output at time $t$, respectively, and $\boldsymbol{\epsilon}$ is an additive white-noise term. $W^{\mathrm{r}}$ are the weights of the recurrent layer (connections between hidden neurons) and $W^{\mathrm{i}}$ are the weights between the input and the recurrent layer. [2] The $W^{\mathrm{o}}$ weights connect the ESN state to the output and form the so-called read-out layer. [3]
Standard material. 
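A minimal sketch of the state update, assuming a tanh activation and a reservoir of 50 neurons whose recurrent weights are scaled to a spectral radius below one (a common choice for the echo state property; all sizes and scalings here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
Nr, Ni = 50, 1                                        # reservoir and input sizes
W_r = rng.normal(size=(Nr, Nr))
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))   # spectral radius 0.9
W_i = rng.normal(size=(Nr, Ni))

def esn_states(U, noise=1e-4):
    """Drive the reservoir with an input sequence U of shape (T, Ni)."""
    x = np.zeros(Nr)
    states = []
    for u in U:
        x = np.tanh(W_r @ x + W_i @ u + noise * rng.normal(size=Nr))
        states.append(x)
    return np.array(states)                           # collected states, (T, Nr)

U = np.sin(np.linspace(0, 8 * np.pi, 400))[:, None]
print(esn_states(U).shape)                            # (400, 50)
```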
(c) Describe in detail how the read-out weights of echo state networks are 
obtained. You should include in your answer the expressions describing the 
related optimization problem and how it is solved. 
(11 marks) 
The read-out weights $W^{\mathrm{o}}$ are obtained by solving the following regularised [1] least-squares problem [2]:
\[
\operatorname*{arg\,min}_{W \in \mathbb{R}^{N_r \times N_o}} \; \frac{1}{2}\|XW - \mathbf{t}\|^2 + \frac{\lambda}{2}\|W\|^2,
\]
[3] where $\lambda \ge 0$ is the regularisation parameter [1], $X \in \mathbb{R}^{N \times N_r}$ contains the ESN states produced by the action of an input sequence of length $N$, and $\mathbf{t} \in \mathbb{R}^{N}$ is a vector containing the target values. [2]
The closed-form solution is obtained as:
\[
W^{\mathrm{o}} = \left(X^{\mathsf{T}} X + \lambda I\right)^{-1} X^{\mathsf{T}} \mathbf{t}, \tag{4}
\]
where $I$ is the $N_r \times N_r$ identity matrix. [2]
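A minimal sketch of Eq. (4) for a single output (the stand-in matrices and the value of $\lambda$ are illustrative assumptions; in practice $X$ would hold the collected reservoir states):

```python
import numpy as np

def readout_weights(X, t, lam=1e-2):
    """Solve (X^T X + lam I) W = X^T t for the read-out weights."""
    Nr = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(Nr), X.T @ t)

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 50))    # stand-in for ESN states, N x Nr
t = rng.normal(size=400)          # stand-in target vector
W_o = readout_weights(X, t)
print(W_o.shape)                  # (50,)
```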
(d) Explain why regularisation is used in finding the read-out weights. 
(4 marks) 
Regularization improves the numerical stability of the matrix inversion 
operation and reduces the magnitudes of entries in Wo, thus mitigating 
sensitivity to noise and overfitting. [4] 
(Total 30 marks) 
Question 5 
(a) Explain the difference between a hierarchical clustering algorithm and a 
clustering algorithm that yields a partition. Explain when you might prefer 
to use a hierarchical algorithm. 
(5 marks) 
A hierarchical algorithm gives clusters at different resolutions, while a partitional algorithm just divides the data into a fixed number of clusters [3]. Hierarchical algorithms are useful for examining the structure of the data at different resolutions and when the number of clusters is unknown. [2]
(b) Explain how agglomerative clustering is performed. 
(6 marks) 
Agglomerative clustering begins with each data point as a singleton cluster. [1] At each stage the two closest clusters are merged to form a new cluster. [2] The algorithm terminates when a single cluster remains. [1] Distances between clusters can be measured in a variety of ways, for example as the minimum or maximum distance between any two points, one from each cluster (single and complete linkage, respectively). [2]
Standard, but many will not appreciate the different distance measures.
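A minimal sketch using SciPy's hierarchical clustering (the synthetic data and the choice of single linkage, i.e. the minimum inter-cluster distance, are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),    # two well-separated blobs
               rng.normal(5, 1, size=(20, 2))])

Z = linkage(X, method="single")                   # repeatedly merge closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
print(labels)
```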
(c) Carefully defining the symbols you use, give an expression for the objective 
function minimised by the k-means algorithm. 
(6 marks) 
Objective function:
\[
\sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|_2^2, \quad [2]
\]
where $\|\cdot\|_2^2$ is the squared Euclidean norm, $C_j$ is the $j$th cluster in the partition $P$, $k = |P|$, and $\boldsymbol{\mu}_j$, $j = 1, \ldots, k$, are the $k$ cluster representatives. [2] If clusters are represented by centroids, then $\boldsymbol{\mu}_j$ is defined as:
\[
\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i. \quad [2]
\]
(d) Giving pseudo-code, explain how the k-means clustering is achieved. 
(7 marks) 
Here is detailed pseudo-code, but full marks for a less formal presentation that includes the initialisation [2], assignment [2], update [2], and termination [1] steps.
Input: data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, partition order $k$, a distance function $d : X \times X \to \mathbb{R}^{+}_{0}$, maximum number of iterations MAX
Output: a partition $P = \{C_1, \ldots, C_k\}$
1: $t = 0$, $P = \emptyset$
2: Initialisation: assign the $k$ cluster representatives $\boldsymbol{\mu}_i$ as randomly chosen points $\mathbf{x}_i$
3: loop
4: $t = t + 1$
5: Assignment step: assign each sample $\mathbf{x}_j$ to the cluster with the closest representative:
6: $C_i^{(t)} = \{\mathbf{x}_j : d(\mathbf{x}_j, \boldsymbol{\mu}_i) \le d(\mathbf{x}_j, \boldsymbol{\mu}_h) \text{ for all } h = 1, \ldots, k\}$
7: Update step: update the representatives:
8: $\boldsymbol{\mu}_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{\mathbf{x}_j \in C_i^{(t)}} \mathbf{x}_j$
9: Update the partition with the modified clusters: $P^t = \{C_1^{(t)}, \ldots, C_k^{(t)}\}$
10: if $t \ge$ MAX or $P^t = P^{t-1}$ then
11: return $P^t$
12: end if
13: end loop
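A minimal Python sketch of the pseudo-code above (the synthetic data and the random initialisation seed are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: random initialisation, assignment and update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()  # initialisation
    labels = np.full(len(X), -1)
    for t in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)          # assignment step
        if np.array_equal(new_labels, labels):     # partition unchanged: stop
            break
        labels = new_labels
        for j in range(k):                         # update step: recompute centroids
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    return labels, mu

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, mu = kmeans(X, k=2)
print(mu)                                          # the two cluster centroids
```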
(e) Two users run the k-means algorithm on the same data set, but find that it 
gives two different partitions. Explain how this may occur and what may be 
done in practice to cope with this phenomenon. 
(6 marks) 
The reason is that the cluster representatives are initialised at random, so the algorithm may converge to different local minima of the objective function. [3] In practice the algorithm is run several times from different initialisations and the partition with the minimum objective function value is returned. [3]
Standard, but not appreciated by many. 
(Total 30 marks) 
End of Paper