iMADE Workshops

Workshop #8 Unsupervised Learning: Principal Component Analysis

Posted: 2021-04-27


Principal component analysis is one technique for taking a large list of interconnected variables and choosing the ones that best suit a model. This process of focusing on only a few variables is called dimensionality reduction, and it helps reduce the complexity of our dataset. At its root, principal component analysis summarizes data.




Principal component analysis is extremely useful for deriving an overall, linearly independent trend from a dataset with many variables. It allows you to extract important relationships from variables that may or may not be related. Another application of principal component analysis is visualization: instead of plotting a large number of different variables, you can compute just a few principal components and plot those.

Dimensionality Reduction

There are two types of dimensionality reduction: feature elimination and feature extraction.

Feature elimination simply involves pruning features from a dataset we deem unnecessary. A downside of feature elimination is that we lose any potential information gained from the dropped features.

Feature extraction, however, creates new variables by combining existing features. At the cost of some simplicity or interpretability, feature extraction allows you to maintain all important information held within features.

Principal component analysis deals with feature extraction (rather than elimination) by creating a set of independent variables called principal components.
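To make the distinction concrete, the sketch below contrasts the two approaches on a made-up dataset (the data, shapes, and column choices are purely illustrative; scikit-learn's PCA performs the extraction step):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy dataset: 100 samples, 4 features, but only 2 independent trends
# (the last two columns are linear combinations of the first two)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])

# Feature elimination: drop the last two columns outright
# (any information unique to them would be lost)
X_elim = X[:, :2]

# Feature extraction: PCA combines all 4 features into 2 components
pca = PCA(n_components=2)
X_extr = pca.fit_transform(X)

print(X_elim.shape, X_extr.shape)           # (100, 2) (100, 2)
print(pca.explained_variance_ratio_.sum())  # ~1.0: nothing important lost
```

Both results have two columns, but only the extracted components retain the variance contributed by all four original features.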

PCA Example

Principal component analysis is performed by considering all of our variables and calculating a set of direction and magnitude pairs (vectors) to represent them. For example, let’s consider a small example dataset plotted below:



Here we can see two direction-and-magnitude pairs, represented by the red and green lines. In this scenario, the red vector has the greater magnitude, since the points are spread across a greater distance along its direction than along the green one. Principal component analysis will use the vector with the greater magnitude to transform the data into a smaller feature space, reducing dimensionality. For example, the above graph would be transformed into the following:



By transforming our data in this way, we've discarded the direction that is less important to our model; that is, variation along the red dimension has a greater impact on our results than variation along the green.
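The red/green example above can be sketched with synthetic data (the directions here are simulated for illustration, not taken from the figures):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Points spread widely along one direction (the "red" line) with only a
# small spread in the perpendicular ("green") direction
t = rng.normal(size=200)           # position along the dominant direction
noise = 0.1 * rng.normal(size=200)
X = np.column_stack([t + noise, t - noise])

# Keep only the single strongest direction: 2D -> 1D
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(X.shape, "->", X_1d.shape)      # (200, 2) -> (200, 1)
print(pca.explained_variance_ratio_)  # close to 1: little variance lost
```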

The mathematics behind principal component analysis are left out of this discussion for brevity, but if you’re interested in learning about them we highly recommend visiting the references listed at the bottom of this page.

Number of Components

In the example above, we took a two-dimensional feature space and reduced it to a single dimension. In most scenarios, though, you will be working with far more than two variables. Principal component analysis can be used to remove just a single feature, but it is often useful to reduce several. There are several strategies you can employ to decide how many components to keep:

  1. Arbitrarily

    This simply involves picking a number of features to keep for your given model. This method is highly dependent on your dataset and what you want to convey. For instance, it may be beneficial to represent your higher-order data on a 2D space for visualization. In this case, you would perform feature reduction until you have two features.

  2. Percent of cumulative variability

    Part of the principal component analysis calculation involves computing the proportion of variance explained by each component; the cumulative proportion approaches 1 as more components are included. This method of choosing the number of components involves selecting a target percentage of cumulative variance. For instance, let’s look at a graph of cumulative variance at each level of PCA for a theoretical dataset:

    The above image is called a scree plot, and it shows the individual and cumulative proportion of variance for each principal component. If we wanted at least 80% cumulative variance, we would use at least 6 principal components based on this scree plot. Aiming for 100% cumulative variance is generally not recommended: hitting 100% before using every component means your dataset contains redundant data.

  3. Percent of individual variability

    Instead of adding principal components until we reach a cumulative percentage of variability, we can add them until a new component wouldn’t contribute much variability. In the plot above, we might choose to use 3 principal components, since the components after that add comparatively little variability.
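Strategy 2 above can be sketched with scikit-learn on synthetic data (the dataset and the 80% threshold are illustrative; the float form of n_components is described in the parameter list further down):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic dataset with correlated features
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

# Fit with all components, then inspect cumulative explained variance
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 80% cumulative variance
k = int(np.searchsorted(cumvar, 0.80)) + 1
print("components for >= 80% variance:", k)

# Shortcut: pass the target fraction directly as n_components
pca80 = PCA(n_components=0.80, svd_solver='full').fit(X)
print(pca80.n_components_)  # same k
```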


Principal component analysis is a technique for summarizing data, and it is highly flexible depending on your use case. It can be valuable both for displaying and for analyzing a large number of possibly dependent variables. Approaches range from arbitrarily selecting the number of principal components to adding components until a target variance is reached.

PCA Assignment

1) Implement a Java-based PCA() function with a specific interface and parameters;
2) For the iMADE implementation, build a multiagent-based PCA application.
To ensure a well-defined multiagent system (on top of the JADE platform), the application must consist of at least two iMADE agents, e.g. a data-miner agent to fetch the data from the iMADE server and a PCA agent to perform the PCA operations.

PCA Parameters

class PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)

For details, refer to the scikit-learn documentation: sklearn.decomposition.PCA — scikit-learn 0.24.1 documentation

1) n_components: int, float or ‘mle’, default=None

Number of components to keep. If n_components is not set, all components are kept:

n_components == min(n_samples, n_features) 

If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.

Hence, for svd_solver == 'arpack', the None case results in:

n_components == min(n_samples, n_features) - 1

2) copy: bool, default=True

If False, data passed to fit are overwritten, and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.

3) whiten: bool, default=False

When True, the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values, to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions.
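A small sketch of the effect (synthetic data; the feature scales are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Three features on very different scales
X = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 0.5])

# With whiten=True, the transformed components are uncorrelated and each
# has (sample) variance 1 -- the relative scales are discarded
Xw = PCA(whiten=True).fit_transform(X)
print(np.var(Xw, axis=0, ddof=1))  # approximately [1. 1. 1.]
```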

4) svd_solver: {‘auto’, ‘full’, ‘arpack’, ‘randomized’}, default=’auto’

If ‘auto’:

The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

If ‘full’:

run exact full SVD, calling the standard LAPACK solver via scipy.linalg.svd, and select the components by postprocessing.

If ‘arpack’:

run SVD truncated to n_components, calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape).

If ‘randomized’:

run randomized SVD by the method of Halko et al.

5) tol: float, default=0.0

Tolerance for singular values computed by svd_solver == ‘arpack’. Must be in the range [0.0, infinity).

6) iterated_power: int or ‘auto’, default=’auto’

Number of iterations for the power method computed by svd_solver == ‘randomized’. Must be in the range [0, infinity).

7) random_state: int, RandomState instance or None, default=None

Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int for reproducible results across multiple function calls. See the scikit-learn Glossary.
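Putting a few of these parameters together, a minimal usage sketch (synthetic data; parameter values chosen only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))  # 200 samples, 6 features

# Keep 3 components using the exact LAPACK-based solver
pca = PCA(n_components=3, svd_solver='full')
Z = pca.fit_transform(X)
print(Z.shape)                        # (200, 3)
print(pca.explained_variance_ratio_)  # variance captured per component

# 'arpack' needs strictly 0 < n_components < min(X.shape);
# random_state makes its results reproducible
pca_arpack = PCA(n_components=3, svd_solver='arpack', random_state=7)
Z2 = pca_arpack.fit_transform(X)
print(Z2.shape)                       # (200, 3)
```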

PCA Task


The Covid-19 pandemic obliged people around the world to stay home and self-isolate, with a number of negative psychological consequences. This study focuses on the protective role of character strengths in sustaining mental health and self-efficacy during lockdown. Data were collected from 944 Italian respondents (mean age = 37.24 years, SD = 14.50) by means of an online survey investigating character strengths, psychological distress and Covid-19-related self-efficacy one month after lockdown began. Using principal component analysis, four strengths factors were extracted, namely transcendence, interpersonal, openness and restraint. Regression models with second-order factors showed that transcendence strengths had a strong inverse association with psychological distress, and a positive association with self-efficacy. Regression models with single strengths identified hope, zest, prudence, love and forgiveness as the strengths most associated with distress, love and zest as the most related to self-efficacy and zest to general mental health. Openness factor and appreciation of beauty showed an unexpected direct relation with psychological distress. These results provide original evidence of the association of character strengths, and transcendence strengths in particular, with mental health and self-efficacy in a pandemic and are discussed within the field of positive psychology.