                CP1407 Assignment 2 
 
 
 
Note: This is an individual assignment. While it is expected that students will 
discuss their ideas with one another, students need to be aware of their 
responsibilities in ensuring that they do not deliberately or inadvertently 
plagiarise the work of others. 
 
 
Assignment 2 – Practice on various Machine Learning algorithms 
 
 
 
 1. [Data Pre-Processing, Clustering] [10 marks] 
Why is attribute scaling of data important? The following table contains sample 
records giving the number of employees and the total revenue generated by particular 
stores of a supermarket. Use the table as an example to discuss the necessity of 
normalisation in any proximity measurement for clustering purposes. 
 
Supermarket ID   Employee Count   Revenue
001              38               $5,500,000
002              29               $5,000,000
003              24               $5,000,000
004              10               $890,000
005              40               $2,500,000
006              31               $3,200,000
007              14               $678,000
008              35               $5,200,000
009              30               $5,300,000
010              22               $5,500,000
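
The effect of scale on a proximity measure can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses Euclidean distance and min-max scaling to the range [0, 1], neither of which is prescribed by the question, and the names STORES, euclidean and min_max_scale are hypothetical helpers introduced here.

```python
import math

# (employee count, revenue in dollars) per store, copied from the table above.
STORES = {
    "001": (38, 5_500_000),
    "002": (29, 5_000_000),
    "003": (24, 5_000_000),
    "004": (10, 890_000),
    "005": (40, 2_500_000),
    "006": (31, 3_200_000),
    "007": (14, 678_000),
    "008": (35, 5_200_000),
    "009": (30, 5_300_000),
    "010": (22, 5_500_000),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(records):
    """Rescale every attribute to [0, 1] using its own minimum and maximum."""
    columns = list(zip(*records.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        key: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for key, row in records.items()
    }


SCALED = min_max_scale(STORES)

# Compare store 001 with two other stores, before and after scaling.
for other in ("002", "010"):
    raw = euclidean(STORES["001"], STORES[other])
    scaled = euclidean(SCALED["001"], SCALED[other])
    print(f"001 vs {other}: raw = {raw:,.1f}, scaled = {scaled:.3f}")
```

On the raw values, store 010 looks far closer to store 001 than store 002 does, because any difference in revenue swamps the employee counts. After scaling, both attributes contribute on comparable terms and the ordering reverses.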
 
 
 
 
2. [Classification – Decision Tree algorithm] [20 marks] 
Use the soybean dataset (diabetes.arff) to perform decision tree induction in Weka 
using three different decision tree induction algorithms: J48, REPTree, and 
RandomTree. Investigate different options, paying particular attention to differences 
between pruned and unpruned trees. In discussing your results, consider the following 
questions. 
 
a) What are the effects of pruning on the results for the soybean dataset? 
b) Are there differences in the performances of the three decision tree algorithms? 
c) What impacts do other parameters of the algorithms have on the results? 
 
3. [Classification – Naïve Bayes algorithm] [30 marks] 
Suppose we have data on a few individuals randomly examined in a basic health check. 
The following table gives the data on these individuals’ health-related attributes. 
 
Body Weight   Body Height   Blood Pressure   Blood Sugar Level   Habit       Class
Heavy         Tall          High             3                   Smoker      P
Heavy         Short         High             1                   Nonsmoker   P
Normal        Tall          Normal           3                   Nonsmoker   N
Heavy         Tall          Normal           2                   Smoker      N
Low           Medium        Normal           2                   Nonsmoker   N
Low           Tall          Normal           1                   Nonsmoker   P
Normal        Medium        High             3                   Smoker      P
Low           Short         High             2                   Smoker      P
Heavy         Tall          High             2                   Nonsmoker   P
Low           Medium        Normal           3                   Smoker      P
Heavy         Medium        Normal           3                   Smoker      N
 
 Use the data together with the Naïve Bayes classifier to perform a new classification for 
the following new instance. Create and use the classifier by hand, not with Weka, and 
show all your working. 
Body Weight   Body Height   Blood Pressure   Blood Sugar Level   Habit       Class
Low           Tall          High             2                   Smoker      ?
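
Hand working for this question can be cross-checked with a short script. The sketch below is a minimal illustration, not a replacement for the working the question asks for. It assumes Blood Sugar Level is treated as a categorical attribute, and it multiplies the class prior by the per-attribute relative frequencies. The function name naive_bayes_scores and its optional alpha parameter (add-alpha smoothing, off by default) are hypothetical and not part of the assignment.

```python
from collections import Counter
from fractions import Fraction

# Training records from the table above:
# (weight, height, pressure, sugar level, habit, class).
ROWS = [
    ("Heavy", "Tall", "High", "3", "Smoker", "P"),
    ("Heavy", "Short", "High", "1", "Nonsmoker", "P"),
    ("Normal", "Tall", "Normal", "3", "Nonsmoker", "N"),
    ("Heavy", "Tall", "Normal", "2", "Smoker", "N"),
    ("Low", "Medium", "Normal", "2", "Nonsmoker", "N"),
    ("Low", "Tall", "Normal", "1", "Nonsmoker", "P"),
    ("Normal", "Medium", "High", "3", "Smoker", "P"),
    ("Low", "Short", "High", "2", "Smoker", "P"),
    ("Heavy", "Tall", "High", "2", "Nonsmoker", "P"),
    ("Low", "Medium", "Normal", "3", "Smoker", "P"),
    ("Heavy", "Medium", "Normal", "3", "Smoker", "N"),
]

# The new instance to classify (class unknown).
QUERY = ("Low", "Tall", "High", "2", "Smoker")


def naive_bayes_scores(rows, query, alpha=0):
    """Return the unnormalised score P(class) * prod P(value | class) per class.

    alpha=0 uses raw relative frequencies; alpha=1 applies Laplace smoothing
    to the conditional probabilities only (an assumption of this sketch).
    """
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(rows))  # class prior
        for col, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == label and r[col] == value)
            n_values = len({r[col] for r in rows})  # distinct values of this attribute
            score *= Fraction(matches + alpha, n_label + alpha * n_values)
        scores[label] = score
    return scores


for label, score in naive_bayes_scores(ROWS, QUERY).items():
    print(label, score, float(score))
```

Exact fractions are used so that each factor can be compared directly with the hand-computed counts. If any conditional frequency is zero, the whole product for that class becomes zero; calling the function with alpha=1 shows how smoothing changes this.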
 
 4. [Association Rules Mining] [20 marks] 
The following table shows the film-watching histories of several viewers of an on-demand service. 
 
User Id Items 
001 Airplane!, Downfall, Evita, Idiocracy, Jurassic Park 
002 Casablanca, Downfall, Evita, Flubber, Jurassic Park 
003 Airplane!, Downfall, Half Baked, Jurassic Park 
004 Airplane!, Downfall 
005 Casablanca, Downfall, Flubber, Jurassic Park, Zoolander 
006 Casablanca, Downfall, Half Baked, Idiocracy, Zoolander 
007 Evita, Idiocracy, Jurassic Park 
008 Downfall, Jurassic Park, Zoolander 
009 Casablanca, Downfall, Evita, Half Baked, Jurassic Park, Zoolander 
 
a) Follow the steps outlined in Practical 07 and conduct a mining task for Boolean 
association rules using the Apriori algorithm in Weka. 
b) Set different parameters and observe the association rules discovered. 
c) Weka provides association evaluation parameters other than support and 
confidence. For some example rules, note the values given by those other 
evaluation parameters. 
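
The measures reported for a rule can be sanity-checked by hand or with a few lines of code. The sketch below uses the common textbook definitions of support, confidence, lift and leverage; it does not call Weka, and Weka's own output may be rounded or presented differently. The names TRANSACTIONS, support and rule_metrics are hypothetical helpers introduced here.

```python
from fractions import Fraction

# Viewing histories from the table above, one set of films per user.
TRANSACTIONS = {
    "001": {"Airplane!", "Downfall", "Evita", "Idiocracy", "Jurassic Park"},
    "002": {"Casablanca", "Downfall", "Evita", "Flubber", "Jurassic Park"},
    "003": {"Airplane!", "Downfall", "Half Baked", "Jurassic Park"},
    "004": {"Airplane!", "Downfall"},
    "005": {"Casablanca", "Downfall", "Flubber", "Jurassic Park", "Zoolander"},
    "006": {"Casablanca", "Downfall", "Half Baked", "Idiocracy", "Zoolander"},
    "007": {"Evita", "Idiocracy", "Jurassic Park"},
    "008": {"Downfall", "Jurassic Park", "Zoolander"},
    "009": {"Casablanca", "Downfall", "Evita", "Half Baked", "Jurassic Park",
            "Zoolander"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in TRANSACTIONS.values() if itemset <= items)
    return Fraction(hits, len(TRANSACTIONS))


def rule_metrics(antecedent, consequent):
    """Textbook measures for the rule antecedent -> consequent.

    Raises ZeroDivisionError if either side never occurs in the data.
    """
    s_a = support(antecedent)
    s_c = support(consequent)
    s_both = support(set(antecedent) | set(consequent))
    return {
        "support": s_both,
        "confidence": s_both / s_a,
        "lift": s_both / (s_a * s_c),
        "leverage": s_both - s_a * s_c,
    }


for name, value in rule_metrics({"Downfall"}, {"Jurassic Park"}).items():
    print(f"{name}: {value} ({float(value):.3f})")
```

A lift below 1 or a negative leverage indicates that the two sides of a rule co-occur less often than they would if they were independent, even when the rule's confidence looks high.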
 
5. [Clustering] [20 marks] 
Consider the following 2-dimensional point data set presented in (x,y) coordinates: 
 P1(1,1), P2(1,3), P3(4,3), P4(5,4), P5(9,4), P6(9,6). 
Apply the hierarchical clustering method by hand (using the agglomerative algorithm) 
until two final clusters remain. Use the Manhattan distance function to measure the 
distance between points and use the single-linkage scheme to merge clusters. Show all 
your working. 
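
The merge sequence obtained by hand can be verified with a small script. The sketch below is one straightforward way to code agglomerative clustering with Manhattan distance and single linkage; the names POINTS, manhattan, single_link and agglomerate are hypothetical, and ties between equally close pairs are broken by whichever pair is encountered first, which is an assumption of this sketch.

```python
from itertools import combinations

# The six points from the question.
POINTS = {
    "P1": (1, 1), "P2": (1, 3), "P3": (4, 3),
    "P4": (5, 4), "P5": (9, 4), "P6": (9, 6),
}


def manhattan(a, b):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def single_link(c1, c2):
    """Single linkage: distance between the closest pair across two clusters."""
    return min(manhattan(POINTS[p], POINTS[q]) for p in c1 for q in c2)


def agglomerate(target=2):
    """Merge the two closest clusters repeatedly until `target` clusters remain."""
    clusters = [frozenset([name]) for name in POINTS]
    history = []  # (cluster A, cluster B, merge distance) for each step
    while len(clusters) > target:
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        history.append((sorted(a), sorted(b), single_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, history


final_clusters, merges = agglomerate(target=2)
for left, right, dist in merges:
    print(f"merge {left} + {right} at distance {dist}")
print("final:", [sorted(c) for c in final_clusters])
```

The printed merge history lists each step with its linkage distance, so it can be compared line by line against the distance matrices written out by hand.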
 
Rubric 

Each question is marked against the following bands. 

Exemplary (90-100%) 
    Answer demonstrates excellent knowledge of machine learning and data 
    science, is well-written, and very well-justified. 

Good (70-80%) 
    Exhibits aspects of exemplary (above) and satisfactory (below). 

Satisfactory (50-60%) 
    Answer demonstrates sound knowledge of machine learning and data science 
    and provides justification. 

Limited (30-40%) 
    Exhibits aspects of satisfactory (above) and very limited (below). 

Very Limited (0-20%) 
    Answer demonstrates flawed knowledge of machine learning and/or provides 
    incoherent justification, or the answer is absent or negligible. 