Unistat Statistics Software | Hierarchical Cluster Analysis

8.1.1. Hierarchical Cluster Analysis

First, select the data columns to be analysed by clicking on [Variable] in the Variable Selection Dialogue. If the data is not a proximity matrix (that is, if it is not square and symmetric), another dialogue will appear allowing you to choose one of six distance measures. This dialogue will not be available when you input a proximity matrix.

8.1.1.1. Distance Measures

Euclid:

Squared Euclid:

Cosine:

Chebychev:

Block:

Power:

where the power terms p and r are supplied by the user.
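
The six measures listed above can be sketched in Python as follows. This is a minimal illustration that assumes the conventional textbook definitions rather than Unistat's exact formulas. In particular, the function names are hypothetical, the cosine measure is shown as a plain similarity (how the program converts it into a distance is not stated here), and the Power measure is assumed to take the p-th power of the absolute differences and the r-th root of their sum.

```python
import math

# x and y are two cases (rows), given as equal-length sequences of numbers.

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cosine(x, y):
    # Assumption: returned as a similarity in [-1, 1], not as a distance.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def power(x, y, p, r):
    # Assumption: p is the exponent on each difference, r is the root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)
```

Under these definitions, Power with p = r = 2 reduces to Euclid, and with p = r = 1 it reduces to Block.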

8.1.1.2. Distance Matrix

After the distance matrix is computed, a dialogue containing nine hierarchical clustering methods and a Distance Matrix option will appear.

It is possible to select one of the methods and proceed immediately with the analysis, or select the last option to view or save the generated distance matrix. The Distance Matrix option will not be available when you input a proximity matrix for analysis.

8.1.1.3. Hierarchical Methods

All hierarchical methods apply the same algorithm. However, they differ in the way they compute the distance between two clusters.

First, the n(n – 1)/2 elements of the proximity matrix are sorted in ascending order. The nearest two points are joined to form the first cluster. At the i-th step the remaining points and the existing clusters are considered. Either the next two nearest points, or a cluster and a point, or two clusters are formed into a new cluster. This process is repeated until the number of clusters is reduced to one.
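
The merging process described above can be sketched in Python as follows. This is a naive illustration, not the program's implementation: it rescans all cluster pairs at every step instead of working from a sorted list, the function name agglomerate is hypothetical, and the distance between two clusters is taken as the minimum (or, optionally, the maximum) over all pairs of their members. The four points used are made-up values on a line.

```python
def agglomerate(dist, linkage=min):
    """Merge clusters step by step until one is left.

    dist is a square symmetric matrix (list of lists) of distances.
    linkage reduces the pairwise distances between two clusters to one
    number: min gives single linkage, max gives complete linkage.
    Returns rows of (step, left label, right label, cases, distance),
    with 1-based labels; the merged cluster keeps the left label.
    """
    n = len(dist)
    clusters = {label: [label] for label in range(n)}
    history = []
    while len(clusters) > 1:
        labels = sorted(clusters)
        best = None
        for pos, a in enumerate(labels):
            for b in labels[pos + 1:]:
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        history.append((len(history) + 1, a + 1, b + 1, len(clusters[a]), d))
    return history

# Illustrative data only: four points on a line.
points = [0, 1, 3, 7]
dist = [[abs(p - q) for q in points] for p in points]
for row in agglomerate(dist):
    print(row)
```

Starting from n points, the loop always performs n – 1 merges, so the returned table has n – 1 rows.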

One of the following nine hierarchical clustering methods can be selected, where d_ij is the dissimilarity between clusters i and j; n_i = 1, i = 1, …, n, is a vector of ones; S_i = 0, i = 1, …, n, is a zero vector; and the indices t and r represent a new cluster and all other clusters respectively.

Average Between Groups:

Compute an unweighted average distance between pairs belonging to two clusters. Update:

and select the minimum of:

Average Within Groups:

Update:

Single Linkage:

Select the smallest distance between pairs of elements in each cluster. Update:

Complete Linkage:

Select the largest distance between pairs of elements in each cluster. Update:

Centroid:

A cluster’s location is represented by the centroid of all points within the cluster. Update:

This method should be used only with squared Euclid distance.

Median:

Compute the weighted average distance between pairs belonging to two clusters. Update:

This method should be used only with squared Euclid distance.

Ward:

This method is also known as incremental sum of squares. Unlike other methods, which minimise the distance between two clusters, Ward’s method minimises the increase in the total within-cluster sum of squares caused by the newly formed cluster. The distance between the two clusters is given as:

where n_r is the number of observations within the current cluster. This method should be used only with squared Euclid distance.

McQuitty:

The distance between the two clusters is calculated as:

Flexible:

Update:

where β is a constant supplied by the user. The default value for β is -0.25.
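
Most of the methods above are commonly written as special cases of the Lance-Williams recurrence, which gives the distance from a newly formed cluster t (the union of clusters i and j) to any other cluster r. The Python sketch below uses the standard textbook coefficients; it is an assumption that the program's own update formulas are equivalent, and the function name and method keys are hypothetical. Average Within Groups is omitted because it does not fit this recurrence.

```python
def lance_williams(method, d_ir, d_jr, d_ij, n_i=1, n_j=1, n_r=1, beta=-0.25):
    """Distance d_tr between the new cluster t = i + j and cluster r.

    d_tr = a_i*d_ir + a_j*d_jr + b*d_ij + g*|d_ir - d_jr|
    n_i, n_j and n_r are the numbers of cases in clusters i, j and r.
    beta is used only by the "flexible" method.
    """
    n_t = n_i + n_j
    if method == "single":
        a_i, a_j, b, g = 0.5, 0.5, 0.0, -0.5
    elif method == "complete":
        a_i, a_j, b, g = 0.5, 0.5, 0.0, 0.5
    elif method == "average_between":
        a_i, a_j, b, g = n_i / n_t, n_j / n_t, 0.0, 0.0
    elif method == "mcquitty":
        a_i, a_j, b, g = 0.5, 0.5, 0.0, 0.0
    elif method == "centroid":   # intended for squared Euclid distances
        a_i, a_j, b, g = n_i / n_t, n_j / n_t, -n_i * n_j / n_t ** 2, 0.0
    elif method == "median":     # intended for squared Euclid distances
        a_i, a_j, b, g = 0.5, 0.5, -0.25, 0.0
    elif method == "ward":       # intended for squared Euclid distances
        total = n_t + n_r
        a_i, a_j, b, g = (n_i + n_r) / total, (n_j + n_r) / total, -n_r / total, 0.0
    elif method == "flexible":
        a_i, a_j, b, g = (1 - beta) / 2, (1 - beta) / 2, beta, 0.0
    else:
        raise ValueError("unknown method: " + method)
    return a_i * d_ir + a_j * d_jr + b * d_ij + g * abs(d_ir - d_jr)
```

With these coefficients, Single Linkage returns the smaller of d_ir and d_jr and Complete Linkage returns the larger, which matches the verbal descriptions given for those two methods.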

8.1.1.4. Hierarchical Cluster Output Options

If there are n valid cases in the data, the program will start with n clusters and combine them one by one until only one cluster is left. The History output option will summarise the clustering steps and the two dendrogram diagrams will show this entire process. The remaining output options depend on the Number of Clusters parameter defined by the user.

Number of Clusters: By entering a number between 1 and n, it is possible to display clustering results for any number of clusters. This number can also be changed from within the Cluster Graph, by selecting the Edit → XY Points dialogue.

History: This table shows the two clusters combined at each step, the number of cases in the new cluster and the distance between the two combined clusters. The newly formed cluster is given the label of the cluster in the left-hand column.

Character Dendrogram: A dendrogram displays a visual summary of the clustering process, providing you with an understanding of the groups and proximities inherent in data. The order in which clusters are combined does not necessarily coincide with the order they are drawn on a dendrogram. The dendrogram procedure first rearranges the History table to produce an uncluttered tree diagram. The same tree structure can also be output in the form of a graph.

The advantage of this form of output is its ability to display all Row Labels without any cluttering. However, due to its low resolution on the (horizontal) distance axis, clusters that are very close to each other may not be distinguishable.

Cluster Table: The number of cases and their percentages are displayed for the number of clusters defined by the user. The within cluster sum of squares, average, minimum and maximum distance of individual cases from their cluster’s centroid are also displayed.

Cluster Centroids: For each cluster, the coordinates of the cluster centroid are displayed.

Distance Between Centroids: Distance between each pair of cluster centroids is displayed in a square-symmetric table.

Cluster Membership: A table containing all cases displays which case belongs to which cluster. As in the Cluster Table option, the number of clusters to be formed can be selected by the user.

Hi-res Dendrogram: The high-resolution dendrogram is convenient when the number of rows in the data set does not exceed 100. The vertical axis represents the distance and the horizontal axis represents the clusters combined.

The Edit → XY Points dialogue for the Hi-Res Dendrogram procedure enables you to change the colour and thickness of lines, as well as positions of the stems (the vertical lines representing the newly formed clusters). Stems can be started from the midpoint (the default), the right or the left corner of the line connecting the two old clusters.

By default, the row numbers are displayed as the X-axis labels. It is also possible to display the Row Labels as X-axis labels, from the Edit → Axes dialogue. If the Row Labels are too long, you can display them up and down or rotate the text by 90° or 270°.

Cluster Graph: Two and three-dimensional scatter diagrams can be displayed showing which data point belongs to which cluster. If you select two variables, a 2D graph is displayed; if you select three, a 3D graph is displayed. Different clusters are represented by different letters in different colours.

You can change the number of clusters to be displayed from the Edit → XY Points dialogue, without having to go back to the Output Options Dialogue. It is possible to select the font and the size of the letters and display point labels for them. You can also display cluster centroids in capital letters.

If the Cluster No field is zero, all groups will be displayed simultaneously. If this field is set to any other number less than or equal to the Number of Clusters, then only the cases belonging to that cluster will be displayed.

For 2D cluster plots it is also possible to draw ellipse intervals around the cluster centres.

When the Ellipse box is clicked on, the ellipse interval dialogue pops up.

Here you can select multiple confidence levels, which will be drawn for all clusters selected. For further details see Ellipse Confidence and Prediction Intervals in 4.1.1.1.1. Line.

Estimated parameters (correlation coefficient, angle of inclination, two radii and the area enclosed) for all 10 ellipses are saved in the following file:

..\Documents\Unistat10\Work\EllipseInfo.txt

8.1.1.5. Hierarchical Cluster Example

Open MULTIVAR, select Statistics 2 → Cluster Analysis → Hierarchical Cluster Analysis and select Perf, Info, Verbexp and Age (C1 to C4) as [Variable]s. Select distance measure as Euclid and linking method as Average Between Groups. Select number of clusters as 3 and all the output options to obtain the following results:

Hierarchical Cluster Analysis

Variables Selected: Perf, Info, Verbexp, Age

Measure: Euclid, Method: Average Between Groups

History

Step	Combined1	Combined2	Cases	Distance
1	1	8	2	4.6915
2	2	9	2	9.4345
3	1	5	3	9.5967
4	1	6	4	10.8672
5	3	4	2	12.5714
6	1	2	6	16.1606
7	3	7	3	19.2953
8	1	3	9	26.9553

Character Dendrogram

             1+----------+                                                      
             8+----------+-----------+                                          
             5+----------------------+--+                                       
             6+-------------------------+-----------+                           
             2+---------------------+               |                           
             9+---------------------+---------------+-------------------------+ 
             3+-----------------------------+                                 | 
             4+-----------------------------+---------------+                 | 
             7+---------------------------------------------+-----------------+

Cluster Table

	Cases	Percentage	Within SSQ	Average Distance	Minimum	Maximum
Cluster 1	6	66.7%	479.6467	8.2992	3.4567	12.2459
Cluster 2	2	22.2%	79.0200	6.2857	6.2857	6.2857
Cluster 3	1	11.1%	0.0000	0.0000	0.0000	0.0000

Cluster Centroids

	Cluster 1	Cluster 2	Cluster 3	Overall
Perf	88.1667	107.0000	120.0000	95.8889
Info	8.0000	12.5000	12.0000	9.4444
Verbexp	32.0000	43.5000	30.0000	34.3333
Age	7.1667	7.1000	8.4000	7.2889

Distance Between Centroids

	Cluster 1	Cluster 2	Cluster 3
Cluster 1	0.0000	22.5211	32.1696
Cluster 2	22.5211	0.0000	18.7933
Cluster 3	32.1696	18.7933	0.0000
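
The centroid tables can be checked by hand. The Python sketch below, which assumes that the distance between centroids is the plain Euclid distance (the measure selected in this example), recomputes the pairwise distances from the rounded values in the Cluster Centroids table and the Overall column as the size-weighted mean of the three centroids. The variable and function names are hypothetical, and the results agree with the tables only up to the rounding of the printed inputs.

```python
import math

# Values copied from the Cluster Centroids table (Perf, Info, Verbexp, Age).
centroids = {
    1: (88.1667, 8.0, 32.0, 7.1667),
    2: (107.0, 12.5, 43.5, 7.1),
    3: (120.0, 12.0, 30.0, 8.4),
}
# Number of cases per cluster, copied from the Cluster Table.
sizes = {1: 6, 2: 2, 3: 1}

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Pairwise distances between cluster centroids.
between = {(a, b): euclid(centroids[a], centroids[b])
           for a in centroids for b in centroids if a < b}

# Overall centroid as the mean of the cluster centroids weighted by size.
total = sum(sizes.values())
overall = tuple(sum(sizes[k] * centroids[k][v] for k in centroids) / total
                for v in range(4))

for pair, d in sorted(between.items()):
    print(pair, round(d, 3))
print([round(v, 3) for v in overall])
```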

Cluster Membership

Observation	Cluster
1	1
2	1
3	2
4	2
5	1
6	1
7	3
8	1
9	1
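
The membership table follows directly from the History table: with n cases and k clusters requested, only the first n – k merges are applied. The Python sketch below replays the History rows of this example. It assumes, as stated for the History option, that a merged cluster keeps the label in the left-hand column, and additionally assumes that clusters are numbered in order of their lowest observation; the function name membership is hypothetical.

```python
# (Combined1, Combined2) pairs copied from the History table, in step order.
history = [(1, 8), (2, 9), (1, 5), (1, 6), (3, 4), (1, 2), (3, 7), (1, 3)]

def membership(history, n, k):
    """Return the cluster number (1..k) of observations 1..n."""
    # Each observation starts as its own cluster, labelled by its number.
    label = {obs: obs for obs in range(1, n + 1)}
    # Apply the first n - k merges; the merged cluster keeps the left label.
    for left, right in history[: n - k]:
        label = {obs: (left if lab == right else lab)
                 for obs, lab in label.items()}
    # Renumber the surviving labels 1..k in order of first appearance.
    number = {}
    for obs in range(1, n + 1):
        number.setdefault(label[obs], len(number) + 1)
    return [number[label[obs]] for obs in range(1, n + 1)]

print(membership(history, n=9, k=3))
```

For k = 3 this reproduces the Cluster column above: observations 1, 2, 5, 6, 8 and 9 fall in cluster 1, observations 3 and 4 in cluster 2, and observation 7 in cluster 3.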
