Machine Learning & Artificial Intelligence for Data Scientists
COMPSCI5100
Thursday 12 December 2024
Question 1: Regression (Total marks: 20)
Consider using regression to predict the birth rate in the US using the data shown in the following figure:
Figure 1.1 Birth rate (per 1,000) from 1909 to 2008
(a) Consider fitting the data with a polynomial regression of order 10. Identify the numerical issue with model fitting and propose a solution in sufficient detail. [4 marks]
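The ill-conditioning at issue in (a) can be seen directly by comparing the condition number of a degree-10 design matrix built from the raw years with one built from standardised inputs. This is an illustrative sketch (synthetic inputs, NumPy assumed), not part of the exam solution:

```python
import numpy as np

# Degree-10 polynomial design matrix from raw years (1909-2008)
# versus from standardised inputs.
years = np.arange(1909, 2009, dtype=float)

raw = np.vander(years, 11)   # columns years**10 ... years**0: huge scale spread
std = np.vander((years - years.mean()) / years.std(), 11)

print(f"raw inputs:          cond = {np.linalg.cond(raw):.3e}")
print(f"standardised inputs: cond = {np.linalg.cond(std):.3e}")
```

Because 2008**10 is of order 1e33, the raw design matrix is numerically rank-deficient and least-squares fitting is unstable; standardising the inputs before forming the polynomial features keeps the columns on comparable scales.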
(b) Consider fitting the data with a polynomial regression of order 2. Identify the two regions where data points are most likely to be poorly fitted, and explain why. [6 marks]
(c) Consider fitting the data in Figure 1.1 with a linear regression model using the sigmoid basis function φ_k(x) = σ((x − μ_k)/s), where σ(a) = 1/(1 + e^(−a)). Explain the choice of the hyperparameters μ_k and s that could lead to the following fitted model. [4 marks]
Figure 1.2 A linear regression model using sigmoid basis functions fitted to the data
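A model of the kind shown in Figure 1.2 can be sketched as follows. The centres μ_k and width s below are hypothetical placeholders (in the exam, μ_k should sit where the curve changes level and s controls how sharp each transition is), and the targets are a toy stand-in for the birth-rate data:

```python
import numpy as np

def sigmoid_design(x, mus, s):
    """Design matrix: a bias column plus one sigmoid basis function
    phi_k(x) = sigma((x - mu_k) / s) per centre mu_k."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), phi])

x = np.linspace(1909, 2008, 100)
mus = np.array([1925.0, 1945.0, 1965.0])   # assumed change points
s = 3.0                                    # small s -> step-like transitions

Phi = sigmoid_design(x, mus, s)
w, *_ = np.linalg.lstsq(Phi, np.sin(x / 10.0), rcond=None)  # toy targets
print(Phi.shape, w.shape)
```

Each sigmoid contributes one smooth step centred at its μ_k, so the fitted curve is a sum of steps whose sharpness is set by s.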
(d) We used two fitting strategies, namely ridge regression and lasso, and obtained the fitted models shown in Figure 1.3 A and B. Identify which fitting strategy was used in each figure, and explain why and how the chosen fitting method could have generated the result. (Note: each method is used exactly once.) [6 marks]
Figure 1.3 A
Figure 1.3 B
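The qualitative difference probed in (d) can be sketched on synthetic data (scikit-learn assumed; the toy targets and penalty strengths below are illustrative choices): lasso's L1 penalty drives some coefficients exactly to zero, while ridge's L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)[:, None]
y = 1.5 * x.ravel() ** 2 + rng.normal(scale=0.1, size=60)  # toy curve

ridge = make_pipeline(PolynomialFeatures(10, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0)).fit(x, y)
lasso = make_pipeline(PolynomialFeatures(10, include_bias=False),
                      StandardScaler(), Lasso(alpha=0.05)).fit(x, y)

ridge_w = ridge.named_steps["ridge"].coef_
lasso_w = lasso.named_steps["lasso"].coef_
print("ridge coefficients exactly zero:", np.count_nonzero(ridge_w == 0))
print("lasso coefficients exactly zero:", np.count_nonzero(lasso_w == 0))
```

Sparse coefficient vectors produce simpler fitted curves dominated by a few basis terms, which is the visual signature that distinguishes a lasso fit from a ridge fit.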
Question 2: Classification (Total marks: 20)
a) Assume the following training data in the two-dimensional plane of x1 and x2 is available (Figure 2). The target variable is +1 for the red points and -1 for the blue points. We summarise the data as the following tuples: <(2,0), 1>, <(0,2), -1>, <(0,-2), 1>, and <(-2,0), 1>, respectively.
Figure 2
i. Design a k-NN classifier with k=1 and use it to determine the class variables C1
through C4 for the following test data points: <(0,1), C1>, <(1.5,1), C2>, <(-0.5,1), C3>, and <(0,0), C4>. [4 marks]
ii. What would the class variable C4 be if we had used k=3? [2 marks]
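The setup in (i) and (ii) can be checked with a minimal nearest-neighbour sketch over the four training tuples (plain NumPy; ties broken by training-point order, which matters because (0,0) is equidistant from all four points):

```python
import numpy as np

# Training data from the question: <(x1, x2), label>
X = np.array([(2, 0), (0, 2), (0, -2), (-2, 0)], dtype=float)
y = np.array([+1, -1, +1, +1])

def knn_predict(q, k):
    """Majority vote over the k nearest training points (Euclidean)."""
    d = np.linalg.norm(X - q, axis=1)
    nearest = y[np.argsort(d)[:k]]
    return 1 if nearest.sum() > 0 else -1

for q in [(0, 1), (1.5, 1), (-0.5, 1)]:
    print(q, "->", knn_predict(np.array(q), k=1))

# (0, 0) lies at distance 2 from all four training points, so k = 1 is a
# tie-break; with k = 3, any three neighbours contain at least two +1 points.
print((0, 0), "->", knn_predict(np.array((0, 0)), k=3))
```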
iii. Write down the equations that specify the decision boundary between the two classes. [4 marks]
b) For the same data set in Figure 2, we apply a linear SVM model with predictor y(x1, x2) for classification.
i. Which data points are the support vectors? Write down the equation for y(x1,x2).
(Hint: First visually assess the data to determine the decision boundary and the support vectors. Observe the constraints for the margin and the SVM classifier.) [6 marks]
ii. Specify the Lagrange multipliers α1, α2, α3, α4 for each of the data points in the training data (2,0), (0,2), (-2,0), and (0,-2), respectively. [4 marks]
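The SVM quantities asked for in b(i) and b(ii) can be recovered numerically with scikit-learn's `SVC` (a sketch, using a large C to approximate the hard-margin solution on this separable data):

```python
import numpy as np
from sklearn.svm import SVC

# Points in the order of b(ii), with labels from the question.
X = np.array([(2, 0), (0, 2), (-2, 0), (0, -2)], dtype=float)
y = np.array([+1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)   # the three margin points
print("alpha_i * y_i =", clf.dual_coef_[0])
```

The learned weights and intercept give the predictor y(x1, x2) directly, and `dual_coef_` holds the products α_i·y_i for the support vectors; the point with y(x1, x2) strictly greater than 1 gets a zero multiplier.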
Question 3: Unsupervised learning (Total marks 20)
Consider using the K-means algorithm to perform clustering in the scenario shown in Figure 3.1 A. We expect to form two clusters as shown in Figure 3.1 B.
Figure 3.1 A: Original Data
Figure 3.1 B: Expected Clusters
(a) Outline what would happen if we directly apply K-means with Euclidean distance to this data. Can it achieve the clustering objective? How will it split/group the data and why? [3 marks]
(b) An alternative approach is to use kernel K-means. Would kernel K-means help on this dataset, and why? [2 marks]
(c) An alternative approach is to use mixture models. Would mixture models cluster this dataset better than K-means, and why? [3 marks]
(d) The plot in Figure 3.2 shows some 2D data. PCA is applied to this data. Explain how the first principal component would look if it were overlaid on the plot, and explain your reasoning. (Note: there is no need to make a drawing. You can provide a description of the shape based on the coordinate system provided in the original figure.) [2 marks]
Figure 3.2 2D Points
(e) Similar to the previous question, explain what the second principal component would look like and why. (Note: there is no need to make a drawing. You can provide a description of the shape based on the coordinate system provided in the original figure.) [2 marks]
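The geometry behind (d) and (e) can be sketched on a synthetic elongated 2D cloud (NumPy only; the rotation angle and axis scales are arbitrary choices): the first principal component lies along the direction of greatest variance and the second is orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Elongated 2D cloud: stretch the axes, then rotate by 0.5 rad.
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
R = np.array([[np.cos(0.5), -np.sin(0.5)], [np.sin(0.5), np.cos(0.5)]])
X = X @ R.T

Xc = X - X.mean(axis=0)
# PCA via eigendecomposition of the sample covariance matrix.
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(vals)[::-1]
pc1, pc2 = vecs[:, order[0]], vecs[:, order[1]]

print("PC1 direction:", pc1)   # along the long axis of the cloud
print("PC2 direction:", pc2)   # perpendicular to PC1
print("variance ratio:", vals[order[0]] / vals[order[1]])
```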
(f) Describe the four-step process you should use to determine the number of clusters in kernel K-means. (Hint: each step gets a mark.) [4 marks]
(g) Describe two approaches you could take to manage the curse of dimensionality in, for example, genetic data. How would you overcome this if you had a high-dimensional dataset with thousands of genetic features but only hundreds of subjects? [4 marks]
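Two such approaches, dimensionality reduction and embedded feature selection, can be sketched on a synthetic p >> n dataset (scikit-learn assumed; the feature counts and penalty strength are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# p >> n, as in the genetics example: 2000 features, 200 subjects.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=10, random_state=0)

# Approach 1: project onto a small number of principal components.
X_red = PCA(n_components=20).fit_transform(X)
print("reduced shape:", X_red.shape)

# Approach 2: an L1-penalised model as embedded feature selection.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print("features with nonzero weight:", np.count_nonzero(clf.coef_))
```

Both routes fit far fewer effective parameters than the raw feature count, which is what makes estimation feasible with only hundreds of subjects.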