CS 471 Intro Artificial Intelligence  
2020 Spring  
Practice Final  
Note: These questions are meant to help guide your studying for the exam only. It
is not guaranteed that this practice final covers all of the material (though we
have tried to cover as much as possible), and it is not guaranteed that every topic
on this practice final will show up on the actual exam. Please note that while the final
exam will have an emphasis on material covered since Midterm 2, material from
earlier parts of the course may be tested as well.
We will go over this practice final and its solutions in the review session in class.
Good luck!
Bayes Net  
1. After consulting a behavioral psychologist, we obtained the following complete set of  
conditional relationships:   
• HW depends only on Party and Smart   
• Mac depends only on Smart and Creative   
• Project depends only on Smart and Creative   
• Success depends only on HW and Project   
• Happy depends only on Party, Mac, and Success   
i. Draw the Bayesian network.  
ii. Write joint distribution as a product of conditional probabilities.  
iii. What is the number of independent parameters needed for each conditional probability table?  
iv. What is the total number of independent parameters?  
v. Using only the Bayesian network structure from part (i), answer the following True/False
questions and provide a brief explanation:
a. Party is independent of Success given HW.   
b. Party is independent of Smart given Success.
c. Party is independent of Creative given Happy.  
vi. Given the simple Bayes net shown above, what is the probability that it is raining, given that  
the grass is wet (P(R = T | G = T))? Show your work.  
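A quick way to sanity-check the counting in parts (iii) and (iv) is one independent probability per parent configuration for each node. The sketch below assumes every variable is binary (the question does not state domain sizes); the `parents` dictionary simply encodes the dependencies listed above.

```python
# Counting sketch for parts (iii)-(iv), assuming all variables are binary.
parents = {
    "Party": [], "Smart": [], "Creative": [],
    "HW": ["Party", "Smart"],
    "Mac": ["Smart", "Creative"],
    "Project": ["Smart", "Creative"],
    "Success": ["HW", "Project"],
    "Happy": ["Party", "Mac", "Success"],
}
# A binary node with k parents needs 2**k independent parameters:
# one P(X = true | parent assignment) per configuration of its parents.
total = sum(2 ** len(p) for p in parents.values())
print(total)  # 3*1 + 4*4 + 1*8 = 27 under the all-binary assumption
```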
2.  
(a) Write down the joint probability distribution associated with the following Bayes Net.
Express the answer as a product of terms representing the individual conditional probability
tables associated with this Bayes Net:
(b) Draw the Bayes net associated with the following joint distribution: P(A) · P(B) · P(C|A,  
B) · P(D|C) · P(E|B, C)  
(c) Do the following products of factors correspond to a valid joint distribution over the
variables A, B, C, D? (Circle TRUE or FALSE and give a one-sentence explanation.)
(i) TRUE FALSE   P(A) · P(B) · P(C|A) · P(C|B) · P(D|C)  
(ii) TRUE FALSE  P(A) · P(B|A) · P(C) · P(D|B, C)  
(iii) TRUE FALSE   P(A) · P(B|A) · P(C) · P(C|A) · P(D)  
(iv) TRUE FALSE   P(A|B) · P(B|C) · P(C|D) · P(D|A)  
(d) What factor can be multiplied with the following factors to form a valid joint distribution?  
(Write “none” if the given set of factors can’t be turned into a joint by the inclusion of exactly  
one more factor.)  
(i) P(A) · P(B|A) · P(C|A) · P(E|B, C, D)  
(ii) P(D) · P(B) · P(C|D, B) · P(E|C, D, A)  
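For parts (c) and (d), a product of factors defines a valid joint exactly when every variable appears on the left-hand side of exactly one factor and the implied parent graph is acyclic. The brute-force checker below is only an illustration of that rule (the function name and the dictionary representation are made up for this sketch), not part of the exam.

```python
from itertools import permutations

def is_valid_joint(parents, variables):
    """parents maps each left-hand-side variable to its conditioning set,
    e.g. P(D|B,C) becomes "D": ["B", "C"]. The product is a valid joint iff
    every variable is defined exactly once and some ordering of the variables
    puts each variable's parents before it (i.e. the parent graph is acyclic)."""
    if set(parents) != set(variables):
        return False
    return any(
        all(set(parents[v]) <= set(order[:i]) for i, v in enumerate(order))
        for order in permutations(variables)
    )

# (c)(ii): P(A) * P(B|A) * P(C) * P(D|B,C) -- each variable defined once, no cycle
print(is_valid_joint({"A": [], "B": ["A"], "C": [], "D": ["B", "C"]},
                     ["A", "B", "C", "D"]))  # True
```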
(e) Answer the next questions based on the Bayes Net below:
All variables have domains of {-1, 0, 1}  
(i) Before eliminating any variables or including any evidence, how many entries does the  
factor at G have?  
(ii) Now we observe e = 1 and want to query P(D | e = 1), and you get to pick the first variable
to be eliminated.
• Which choice would create the largest new factor f1 (the factor with the most entries)?
• Which choice would create the smallest new factor f1?
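The specific network for part (e) is in the figure (not reproduced here), but the size bookkeeping is the same for any network: a factor's table has one entry per joint assignment of the variables it mentions, so with the {-1, 0, 1} domains above a factor over k variables has 3^k entries. A tiny illustrative helper:

```python
from math import prod

def factor_size(domain_sizes):
    # one table entry per joint assignment of the factor's variables
    return prod(domain_sizes)

print(factor_size([3, 3, 3]))  # a factor over three ternary variables has 27 entries
```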
Regression  
3. We would like to classify some data. We have N samples, where each sample consists of a
feature vector x = (x1, · · · , xk) and a label y ∈ {0, 1}. We introduce a new type of classifier
called logistic regression, which produces predictions as follows:
where s(γ) is the logistic function, exp(x) = e^x, and w = (w1, · · · , wk) are the learned weights.
Let’s find the weights wi for logistic regression using stochastic gradient descent. We would like
to minimize the following loss function for each sample:
(a) Find dL/dwi. Hint: s'(γ) = s(γ)(1 − s(γ)).
(b) Write the stochastic gradient descent update for wi. Our step size is η.
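The exact prediction and loss formulas are the ones given above; the sketch below assumes the standard choice, with prediction y_hat = s(w · x) and cross-entropy loss L = −[y log y_hat + (1 − y) log(1 − y_hat)], for which dL/dwi = (y_hat − y) xi.

```python
import numpy as np

def logistic_sgd_step(w, x, y, eta):
    # prediction y_hat = s(w . x), with s(g) = 1 / (1 + exp(-g))
    y_hat = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    # under cross-entropy loss, dL/dw_i = (y_hat - y) * x_i, so the SGD step is:
    return w - eta * (y_hat - y) * x
```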
4. In many real-world scenarios our data has millions of dimensions, but a given example has
only hundreds of non-zero features. For example, in document analysis with word counts for
features, our dictionary may have millions of words, but a given document has only hundreds of
unique words. In this question we will make l2-regularized SGD efficient when our input data is
sparse. Recall that in l2-regularized logistic regression, we want to maximize the following
objective (in this problem we have excluded w0 for simplicity):
where l(x^(j), y^(j), w) is the logistic objective function
and the remaining sum is our regularization penalty.
When we do stochastic gradient descent on point (x^(j), y^(j)), we are approximating the objective
function as
Definition of sparsity: Assume that our input data has d features, i.e. x^(j) ∈ R^d. In this problem,
we will consider the scenario where x^(j) is sparse. Formally, let s be the average number of nonzero
elements in each example. We say the data is sparse when s << d. In the following questions,
your answer should take the sparsity of x^(j) into consideration when possible. Note: When we use
a sparse data structure, we can iterate over the non-zero elements in O(s) time, whereas a dense
data structure requires O(d) time.
a) Let us first consider the case when λ = 0. Write down the SGD update rule for wi when λ = 0,
using step size η, given the example (x^(j), y^(j)).
b) If we use a dense data structure, what is the average time complexity to update wi when λ = 0?  
What if we use a sparse data structure? Justify your answer in one or two sentences.  
c) Now let us consider the general case when λ > 0. Write down the SGD update rule for wi when
λ > 0, using step size η, given the example (x^(j), y^(j)).
d)  If we use a dense data structure, what is the average time complexity to update wi when λ >  
0?  
e) Let w_i^(t) be the weight w_i after the t-th update. Now imagine that we perform k SGD updates
on w using examples (x^(t+1), y^(t+1)), · · · , (x^(t+k), y^(t+k)), where x_i^(j) = 0 for every example in the
sequence (i.e. the i-th feature is zero for all of the examples in the sequence). Express the new
weight w_i^(t+k) in terms of w_i^(t), k, η, and λ.
f) Using your answer in the previous part, come up with an efficient algorithm for regularized  
SGD when we use a sparse data structure. What is the average time complexity per example?  
(Hint: when do you need to update wi?)  
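One possible shape for the algorithm in part (f), assuming that an update in which xi = 0 reduces to the multiplicative shrinkage wi ← (1 − ηλ)wi (so k such steps multiply wi by (1 − ηλ)^k, as in part (e)): keep a per-weight timestamp and only bring wi up to date when a nonzero feature actually touches it, which costs O(s) on average per example. All names below are made up for this sketch.

```python
def lazy_sparse_sgd_step(w, last_update, t, nonzero_features, data_grad, eta, lam):
    """w: dense weight array; last_update[i]: step at which w_i was last current;
    t: index of the current step; nonzero_features: dict {i: x_i} of the nonzero
    entries of the current example; data_grad(i): the data-term gradient dl/dw_i."""
    for i in nonzero_features:
        # apply all the skipped, regularization-only shrinkage steps at once
        w[i] *= (1.0 - eta * lam) ** (t - last_update[i])
        # then take the usual regularized (gradient-ascent) step for this example
        w[i] += eta * (data_grad(i) - lam * w[i])
        last_update[i] = t + 1
    return w
```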
5. Below is a data set to be linearly separated into two sets (+) and (o) by the blue line. The blue
line is initialized as x2 = -2*x1 by the weights w = [ 0 1 0.5 ]'. Use the data points from the set
below to test their classification by the blue line. Once you find a point that is incorrectly
classified, use it to update the weights; repeat until weights that classify every point correctly are found. Use η = 0.2.
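The data points themselves are in the figure (not reproduced here), but one pass of the update looks like the sketch below, with weights in homogeneous form w = [w0, w1, w2]' and each point augmented as [1, x1, x2]; treating (+) as label 1 and (o) as label 0 is an assumption of this sketch.

```python
import numpy as np

def perceptron_step(w, point, label, eta=0.2):
    x = np.array([1.0, point[0], point[1]])    # augmented point [1, x1, x2]
    predicted = 1 if np.dot(w, x) >= 0 else 0  # which side of the line?
    if predicted != label:                     # misclassified: move the line
        w = w + eta * (label - predicted) * x
    return w
```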
6. We are dealing with samples x where x is a single value. We would like to test two alternative
regression models:
1. y = ax + e
2. y = ax + bx^2 + e
We make the same assumptions we had in class about the distribution of e (e ~ N(0, s^2)).
a) Assume we have n samples x1, …, xn with their corresponding y values y1, …, yn. Derive
the value assigned to b in model 2. You can use a in the equation for b.
b) Which of the two models is more likely to fit the training data better? Explain your answer.
i) model 1.  
ii) model 2.  
iii) both will fit equally well  
iv) impossible to tell  
c) Which of the two models is more likely to fit the test data better? Explain your answer.
i) model 1.  
ii) model 2.  
iii) both will fit equally well  
iv) impossible to tell  
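As a sketch of how the two models would be fit in practice (least squares is the maximum-likelihood fit under the e ~ N(0, s^2) assumption), and as a reminder for parts (b) and (c) that model 2 contains model 1 as the special case b = 0:

```python
import numpy as np

def fit_model1(x, y):
    # minimizes sum_j (y_j - a * x_j)^2 over a
    return np.sum(x * y) / np.sum(x * x)

def fit_model2(x, y):
    # design matrix with one column for a*x and one for b*x^2;
    # least squares minimizes sum_j (y_j - a*x_j - b*x_j^2)^2 over a and b
    X = np.column_stack([x, x ** 2])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b
```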
Perceptron  
7. You are learning a multi-class perceptron between 4 classes. The weight vectors are currently
[1, 0]^T, [0, 1]^T, [−1, 0]^T, [0, −1]^T for the classes A, B, C, and D. The next training example x
has a label of A and feature vector f(x). For the following questions, do not make any
assumptions about tie-breaking. (Do not write down a solution that creates a tie.) Show your
work.
(i) Write down a feature vector in which no weight vectors will be updated.  
f(x) = [ ___ , ___ ]   or   Not possible
(ii) Write down a feature vector in which only wA will be updated by the perceptron.  
f(x) = [ ___ , ___ ]   or   Not possible
(iii) Write down a feature vector in which only wA and wB will be updated by the perceptron.
f(x) = [ ___ , ___ ]   or   Not possible
(iv) Write down a feature vector in which only wA and wC will be updated by the perceptron.  
f(x) = [ ___ , ___ ]   or   Not possible
The weight vectors are the same as before, but now there is a bias feature with a value of 1 for all x,
and the weights for this bias feature are 0, −2, 1, −1 for classes A, B, C, and D respectively.
As before, the next training example x has a label of A and a feature vector f(x). The always “1”  
bias feature is the first entry in f(x).  
(v) Write down a feature vector in which only wB and wC will be updated by the perceptron.
f(x) = [ 1 , ___ , ___ ]   or   Not possible
(vi) Write down a feature vector in which only wA and wC will be updated by the perceptron.  
f(x) = [ 1 , ___ , ___ ]   or   Not possible
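As a reminder, the usual multi-class perceptron rule (which this question appears to assume) is: predict the class whose weight vector gives the highest score on f(x); on a mistake, add f(x) to the true class's weights and subtract it from the predicted class's weights. A minimal sketch:

```python
import numpy as np

def multiclass_perceptron_step(W, f, true_label):
    """W: dict mapping class name -> weight vector; f: feature vector f(x)."""
    scores = {c: float(np.dot(w, f)) for c, w in W.items()}
    predicted = max(scores, key=scores.get)      # highest-scoring class wins
    if predicted != true_label:                  # mistake: update two classes
        W[true_label] = W[true_label] + f
        W[predicted] = W[predicted] - f
    return W
```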
8. We would like to use a perceptron to train a classifier that takes a drone position in the 3D
world coordinate system and outputs the label 1 (active tracking of adversaries) or 0 (passive
surveillance). Let’s use a learning rate of α = 0.25. Consider the following labeled training
data:
Features – Drone Position (x, y, z)    Label y*
(-1, 2, 3)                             1
(3, -1, 3)                             0
(1, 2, 3)                              0
(3, 1, 3)                              1
(a) Our perceptron weights have been initialized to w1 = 2, w2 = −2, w3 = 0.5. After
processing the first point with the perceptron learning rule, what will be the updated values for
these weights?
(b) After how many steps will the perceptron algorithm converge? Write “never” if it will never
converge. Note: one step means processing one point. Points are processed in order and then
repeated, until convergence.
(c) Instead of the perceptron update rule, we decide to treat the perceptron as a single-node neural
network and update the weights using gradient descent on a loss function.
The loss function for one data point is Loss(y, y*) = (y − y*)^2, where y* is the training label for
a given point and y is the output of our single-node network for that point.
(i) Given a general activation function g(z) and its derivative g'(z), what is the derivative of the
loss function with respect to w2, in terms of g, g', y*, x1, x2, x3, w1, w2 and w3?
(ii) For this question, the ReLU activation function is used:
g(z) = z if z ≥ 0 and g(z) = 0 if z < 0
Use the following gradient descent equation to update the weights given a single data point.
With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after processing
the first point? Gradient descent update equation: wi = wi − α ∂Loss/∂wi. Show all of your work.
(iii) For this question, a TanH activation function is used:
g(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
Use the following gradient descent equation to update the weights given a single data point.
With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after processing
the second point? Gradient descent update equation: wi = wi − α ∂Loss/∂wi. Show all of your
work. (You can round values to 1 digit after the decimal.)
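For part (c), with y = g(w1 x1 + w2 x2 + w3 x3) and Loss = (y − y*)^2, the chain rule gives ∂Loss/∂wi = 2(y − y*) · g'(z) · xi, where z = w1 x1 + w2 x2 + w3 x3. A small sketch of that computation (g and g' are passed in because parts (ii) and (iii) use different activations):

```python
import numpy as np

def single_node_gradient(w, x, y_star, g, g_prime):
    z = np.dot(w, x)   # pre-activation z = w1*x1 + w2*x2 + w3*x3
    y = g(z)           # network output for this point
    # dLoss/dw_i = 2 * (y - y*) * g'(z) * x_i, returned for all i at once
    return 2.0 * (y - y_star) * g_prime(z) * np.asarray(x)
```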
Neural networks  
9. The network below is a neural network with inputs x1 and x2, and outputs y1 and y2. The
internal nodes are computed below. All variables are scalar values.   
Note that ReLU(x) = max(0, x).   
The expressions for the internal nodes in the network are given here for convenience:  
h1 = w11x1 + w12x2      h2 = w21x1 + w22x2      h3 = w31x1 + w32x2
r1 = ReLU(h1)      r2 = ReLU(h2)      r3 = ReLU(h3)
s1 = x1 + r1 + r2      s2 = x2 + r2 + r3
y1 = exp(s1) / (exp(s1) + exp(s2))      y2 = exp(s2) / (exp(s1) + exp(s2))
(a) Forward Propagation: Suppose, for this part only, x1 = 3, x2 = 5, w11 = −10, w12 = 7, w21 = 2,
w22 = 5, w31 = 4, w32 = −4. What are the values of the following internal nodes? Please simplify  
any fractions.  
(i) h1 =     (ii) s1 =     (iii) y2 =   
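The values in part (a) can be checked by coding the node definitions above directly; this is only a numerical sanity check, not the expected written solution.

```python
import numpy as np

x1, x2 = 3.0, 5.0
w11, w12, w21, w22, w31, w32 = -10.0, 7.0, 2.0, 5.0, 4.0, -4.0
h1, h2, h3 = w11*x1 + w12*x2, w21*x1 + w22*x2, w31*x1 + w32*x2
r1, r2, r3 = max(0.0, h1), max(0.0, h2), max(0.0, h3)   # ReLU(h) = max(0, h)
s1, s2 = x1 + r1 + r2, x2 + r2 + r3
y2 = np.exp(s2) / (np.exp(s1) + np.exp(s2))             # softmax output
print(h1, s1, y2)   # h1 = 5, s1 = 39, y2 = 1 / (1 + e^3)
```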
(b) Back Propagation: Compute the following gradients analytically. The answer should be an
expression of any of the nodes in the network (x1, x2, h1, h2, h3, r1, r2, r3, s1, s2, y1, y2) or the weights
w11, w12, w21, w22, w31. In cases where the gradient depends on the values of nodes in the
network, please list all possible analytical expressions, corresponding to the ReLU units being active or inactive,
separated by commas.
Hint 1: If z is a function of y, and y is a function of x, the chain rule for derivatives is: dz/dx = (dz/dy) · (dy/dx).
Hint 2: Recall that for functions of the form y = exp(a) / (exp(a) + exp(b)), we have ∂y/∂a = y(1 − y) and ∂y/∂b = −y(1 − y).
(i)   
(ii)   
(iii)   
(iv)   
(v)   
(vi)   
(c) In roughly 15 words, what role could the non-negative value of node r2 play, given its
location in the network architecture?
10.  
For each of the piecewise-linear functions below, mark all networks from the list above that can  
represent the function exactly on the range x ∈ (−∞, ∞). In the networks above, relu denotes the  
element-wise ReLU nonlinearity: relu(z) = max(0, z). The networks Gi use 1-dimensional layers,  
while the networks Hi have some 2-dimensional intermediate layers. Briefly explain your reasoning.