Week 1 Practical
Introduction to WEKA
What are we doing?
• Download the open-source machine learning tool “WEKA” and explore its main
features.
• Understand and practice the basic data pre-processing operations that can be
performed using WEKA.

Submission:
You are required to submit one .arff file (after completing the practical task as
instructed in this prac document) via the weekly-practical submission box.

What is WEKA?
WEKA (Waikato Environment for Knowledge Analysis) is a machine learning
toolkit developed at the University of Waikato in Hamilton, New Zealand. The
software provides many machine learning algorithms and other data mining tools
for various types of data mining task, such as classification, cluster detection,
association rule discovery and attribute selection. It also includes
data pre-processing, post-processing and visualisation tools, so that
complete data mining projects can be conducted through several different styles of
user interface. The toolkit is written in Java and can therefore run on various
platforms, such as Linux, Windows and Macintosh. It is open-source software
distributed under the terms and conditions of the GNU General Public License.

Launching and Starting WEKA
You can find instructions for installing WEKA at

https://waikato.github.io/weka-wiki/downloading_weka/

When you open WEKA you should see a screen like the one below (Figure 1).
[Figure 1]

Select the Explorer option below Applications.

Data Pre-Processing using WEKA
This example illustrates some of the basic data preprocessing operations that can be
performed using WEKA. The sample data set used for this example, unless otherwise
indicated, is the "bank data" set, stored in the file Bank Data.csv.
The data contains the following fields:

Field         Description
id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city/rural/suburban/town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a saving account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last
              mailing (YES/NO)

Loading the Data
In addition to its native ARFF data file format, WEKA can read ".csv" files.
This is fortunate, since many database and spreadsheet applications can save or
export data into flat files in this format. An ordinary Microsoft Excel worksheet
can be saved as a CSV file and opened by WEKA. The first row of the spreadsheet is
used to name the attributes, and the data types of the attributes are inferred
automatically, though not always accurately. Once the file is opened, you can save
the data set as an ARFF file in WEKA (by clicking “Save” in the Preprocess tab).
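To make the relationship between the two file formats concrete, the conversion can be sketched in a few lines of Python. This is only an illustration of the ARFF layout, not WEKA's own loader: the function name `csv_to_arff`, the rule "a column is numeric if every value parses as a number", and the two sample rows are all assumptions made for this example.

```python
import csv
import io


def csv_to_arff(csv_text, relation="bank-data"):
    """Convert CSV text with a header row into ARFF text (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_numeric(values):
        # Assumed rule: numeric only if every value parses as a float.
        try:
            for v in values:
                float(v)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if is_numeric(values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct labels found in the data.
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute " + name + " {" + labels + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up rows for illustration; they are not taken from the bank data file.
sample = """id,age,sex,income
ID001,48,FEMALE,17546.0
ID002,40,MALE,30085.1
"""
print(csv_to_arff(sample))
```

A type-guessing rule like this one shows why automatic detection is "not always accurate": a column of numeric codes would be treated as numeric even when it really denotes categories.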
In this example, we load the data set into WEKA and perform a series of operations using
WEKA's attribute and discretization filters. While all of these operations can be
performed from the command line, we use the graphical interface of the WEKA Explorer.
First, in the Preprocess tab, click "Open file..." and navigate to the directory containing
the data file (which is named something like bank-data.csv). This is shown in [Figure 2].
Once the data is loaded, WEKA recognizes the attributes and, while scanning the
data, computes some basic statistics on each attribute. The left panel in [Figure 3]
shows the list of recognized attributes, while the top panels indicate the names of the
base relation (or table) and the current working relation (which are the same initially).
Note: Recent versions of WEKA have an “Edit...” button in the Preprocess tab for
viewing the current contents of the working dataset.
Whenever you apply a filter in WEKA, you can see the updated contents in this
viewer. (Alternatively, you can use the “Arff Viewer” tool included in WEKA.
Refer to the WEKA manual for further details.)
[Figure 2]

[Figure 3]
Clicking on any attribute in the left panel will show the basic statistics for that
attribute. For categorical attributes, the frequency of each attribute value is shown,
while for continuous attributes we can obtain the min, max, mean, standard deviation,
etc. As an example, see [Figure 4] below, which shows the results of selecting the
“age” attribute.

[Figure 4]
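The kind of per-attribute summary described above can be sketched as follows. This is a minimal illustration, not WEKA's implementation: the function name `summarize` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption of this sketch.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summarise one attribute (column) given its raw string values."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one value is not a number: treat the attribute as
        # categorical and report the frequency of each label.
        return dict(Counter(values))
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        # Sample standard deviation; needs at least two values.
        "stdev": statistics.stdev(numbers),
    }


# Made-up values for illustration only.
print(summarize(["18", "40", "62"]))
print(summarize(["YES", "NO", "YES"]))
```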

Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id"
attribute). We need to remove this attribute before the data mining step, because a
value that is unique to each record carries no pattern that can be generalised to
other records. We can do this by using the Attribute filters in WEKA.

In the "Filter" panel, click on the "Choose" button.

This will show a popup window with a list of available filters. Scroll down the list and
select the "weka.filters.unsupervised.attribute.Remove" filter as shown in [Figure 5].

Next, click on the text box immediately to the right of the "Choose" button.

In the resulting dialog box, enter the index of the attribute to be filtered out (this can
be a range or a list separated by commas). In this case, we enter 1, which is the index
of the "id" attribute (see the left panel). Make sure that the "invertSelection" option
is set to false (otherwise everything except attribute 1 will be filtered out). Then click
"OK" (see [Figure 6]). Now, in the filter box you will see "Remove -R 1" (see [Figure 7]).

[Figure 5]
[Figure 6]

[Figure 7]

Click the "Apply" button to apply this filter to the data. This will remove the "id"
attribute and create a new working relation (whose name now includes the details of
the filter that was applied). The result is depicted in [Figure 8].
[Figure 8]
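The effect of the Remove filter, including the "invertSelection" option, can be sketched as below. This is an illustration of the behaviour described above rather than WEKA's code; the function name `remove_attributes` and its parameters are hypothetical, and indices are 1-based to match the dialog box.

```python
def remove_attributes(header, rows, indices, invert_selection=False):
    """Drop the columns with the given 1-based indices.

    With invert_selection=True the listed columns are the only ones kept.
    """
    selected = set(indices)
    keep = [
        i for i in range(len(header))
        if ((i + 1) in selected) == invert_selection
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Made-up record for illustration only.
header = ["id", "age", "sex"]
rows = [["ID001", "48", "FEMALE"]]
print(remove_attributes(header, rows, [1]))
print(remove_attributes(header, rows, [1], invert_selection=True))
```

The second call shows why "invertSelection" should stay false here: with it set to true, only the "id" column would survive.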

Discretization
Some techniques, such as association rule mining, can only be performed on
categorical data. This requires performing discretization on numeric or continuous
attributes. (There are 3 such attributes in this data set: "age", "income", and
"children".) Click on the “age” attribute. Again we activate the Filter dialog box, but
this time we select the "Discretize" filter from the list (see [Figure 9]).
[Figure 9]

Next, to change the defaults for this filter, click on the box to the right of the "Choose"
button. This will open the Discretize filter dialog box.
We enter the index of the attribute to be discretized. In this case we enter 1,
corresponding to attribute "age" (which became the first attribute once "id" was
removed). We also enter 3 as the number of bins. (Note that it is possible to
discretize more than one attribute at the same time by using a list of
attribute indexes.) Since we are doing simple binning, all of the other available options
are set to "false". The dialog box is shown in [Figure 10].

Click "Apply" in the Filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins (shown in [Figure 11]).
Finally, save the file as something like "bank-data-final.arff".
Submit this final filtered ARFF file as evidence of your work for this weekly
practical.

[Figure 10]
[Figure 11]
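The binning step can be sketched as follows, assuming that "simple binning" means equal-width binning: the range between the smallest and largest value is split into intervals of equal width. The function name `equal_width_bins` is hypothetical, and treating each interval as closed on the right (a value equal to a cut point falls into the lower bin) is an assumption of this sketch.

```python
def equal_width_bins(values, bins=3):
    """Return (cut_points, bin_index_per_value) for equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # bins - 1 interior cut points split [lo, hi] into equal-width intervals.
    cut_points = [lo + width * k for k in range(1, bins)]
    # A value's bin index is the number of cut points strictly below it.
    labels = [sum(1 for c in cut_points if v > c) for v in values]
    return cut_points, labels


# Made-up ages for illustration; they are not taken from the bank data.
cuts, labels = equal_width_bins([0, 10, 20, 30], bins=3)
print(cuts)    # [10.0, 20.0]
print(labels)  # [0, 0, 1, 2]
```

Note that equal-width bins can hold very different numbers of records when the values are unevenly spread.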

Other Useful Filters in WEKA
WEKA provides many more useful preprocessing filters in addition to the ones we
tried in this exercise. Some of them are briefly described below. You are
encouraged to refer to the WEKA manual for further details and to try applying
some of them to the bank data as an exercise of your own.

In WEKA, data pre-processing is done using attribute or instance filters, each of which
can be supervised or unsupervised. Attribute filters are applied to attributes
(columns) and instance filters are applied to data objects (rows). Supervised filters
take the class attribute into account, whereas unsupervised filters do not.
(Many unsupervised filters have a supervised counterpart. Supervised filters must be
used with care for classification tasks; test examples must be pre-processed in the
same way as the training examples.)

The many other filters for data pre-processing have not been described here due to
limitations of space. Filters in WEKA are continuously developed and new filters are
constantly added in new versions.

Add attribute filter
Using the “Add” filter, we can create a new attribute (with missing values by default)
and specify the location, name and labels of the new attribute. Once created,
the value of the new attribute can be entered manually in the viewer window
for each data object.
New numeric features can be added with the “AddExpression” filter, which
applies a mathematical expression based on the values of other attributes.

Numeric transformation attribute filters
The “MathExpression” filter allows transformation with a valid mathematical
expression that uses arithmetic operators and built-in functions, such as
absolute (abs), logarithm (log), square root (sqrt), etc.
The “NumericTransform” filter only allows transformations by methods
supported by the Java math library. Unlike AddExpression, these filters do not
create new attributes but replace the current values with the transformed
values.
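The difference between adding a derived attribute and transforming an existing one in place can be sketched as below. Both function names, the "income_per_head" attribute and the sample rows are hypothetical, and the expressions are ordinary Python callables rather than WEKA's own expression syntax.

```python
import math


def add_expression(header, rows, name, expression):
    """Append a new attribute computed from each row's existing values."""
    new_rows = []
    for row in rows:
        record = dict(zip(header, row))
        new_rows.append(row + [expression(record)])
    return header + [name], new_rows


def transform_attribute(header, rows, name, function):
    """Replace the values of one attribute with function(value)."""
    col = header.index(name)
    return [row[:col] + [function(row[col])] + row[col + 1:] for row in rows]


# Made-up rows for illustration only.
header = ["income", "children"]
rows = [[30000.0, 2], [12000.0, 0]]

# A new column is added; the originals are untouched.
print(add_expression(header, rows, "income_per_head",
                     lambda r: r["income"] / (r["children"] + 1)))

# The "income" column itself is replaced by its natural logarithm.
print(transform_attribute(header, rows, "income", math.log))
```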

Transformation attribute filters
The “Normalize” filter rescales the values of all numeric attributes in the
loaded data set so that they fall within a common range. The default range is [0, 1].
The user can change the target range if needed.
The “Standardize” filter standardizes all numeric attributes to have zero mean
and unit variance.
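The two transformations can be sketched for a single numeric attribute as follows. This is a minimal illustration rather than WEKA's implementation: the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) in `standardize` is an assumption of this sketch.

```python
import statistics


def normalize(values, new_min=0.0, new_max=1.0):
    """Min-max rescaling of values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Made-up values for illustration only.
print(normalize([10.0, 20.0, 30.0]))   # [0.0, 0.5, 1.0]
print(standardize([2.0, 4.0, 6.0]))    # [-1.0, 0.0, 1.0]
```

When a model is later evaluated on separate test data, the minimum, maximum, mean and standard deviation should be taken from the training data and reused, so that both sets are transformed in the same way.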

ReplaceMissingValues filter
This rudimentary filter fills in missing values; numeric values are replaced with
the sample mean and nominal values are replaced with the sample mode. The
user can also fill in missing values manually in the viewer window (using the “Edit”
menu). For numeric attributes, the user may enter any value. For nominal
attributes, the user can only select one of the nominal labels that already exist
in the attribute domain. If the label does not exist (for instance, it is a special
code indicating unknown), the label can be added to the attribute domain by
using the “AddValues” filter.
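Mean and mode imputation for a single attribute can be sketched as follows. The function name `replace_missing` and its `numeric` flag are hypothetical; the "?" marker is the symbol ARFF files use for a missing value.

```python
import statistics

MISSING = "?"  # ARFF marks a missing value with a question mark


def replace_missing(values, numeric):
    """Fill missing entries with the mean (numeric) or mode (nominal)."""
    present = [v for v in values if v != MISSING]
    if numeric:
        fill = statistics.mean(float(v) for v in present)
        return [fill if v == MISSING else float(v) for v in values]
    fill = statistics.mode(present)
    return [fill if v == MISSING else v for v in values]


# Made-up values for illustration only.
print(replace_missing(["10", "?", "30"], numeric=True))
print(replace_missing(["YES", "?", "YES", "NO"], numeric=False))
```

Because every gap in an attribute receives the same value, this kind of imputation reduces the attribute's variance, which is one reason the filter is described as rudimentary.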

Resample instance filter
This filter selects a random sample of a certain percentage (sampleSizePercent
parameter) of the loaded data set, with or without replacement (to sample
without replacement, set the noReplacement parameter to True). The
unsupervised Resample filter draws the sample from the entire data set,
reflecting the real distribution of attribute values including class values; the
supervised Resample filter draws samples according to either the real
distribution of classes (set the biasToUniformClass parameter to 0) or a
uniform distribution of classes (set the biasToUniformClass parameter to 1).
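
The unsupervised case can be sketched as below. This illustrates sampling with and without replacement only; it does not reproduce WEKA's random number generation or the class-biased sampling of the supervised filter. The function name `resample` and its `seed` parameter are hypothetical, with argument names chosen to echo the filter's parameters.

```python
import random


def resample(rows, sample_size_percent=100.0, no_replacement=False, seed=1):
    """Draw a random sample containing the given percentage of the rows."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    n = int(len(rows) * sample_size_percent / 100.0)
    if no_replacement:
        # Each row can be chosen at most once (percentage must be <= 100).
        return rng.sample(rows, n)
    # With replacement: the same row may be drawn several times.
    return [rng.choice(rows) for _ in range(n)]


rows = list(range(100))  # stand-in for 100 data objects
half = resample(rows, 50.0, no_replacement=True)
print(len(half), len(set(half)))  # 50 50
```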