
Unsupervised Learning

What is Unsupervised Learning?

Unsupervised learning is the setting where you only have input data (X) and no corresponding output variables; the goal is to model the underlying structure or distribution of the data.

What is clustering?

It's a machine learning technique that partitions data points into groups called clusters, such that entities within a group have more similar traits to each other than to entities in other groups.

Top 5 Clustering Algorithms to know

  • K-Means Clustering
  • Agglomerative Hierarchical Clustering
  • Mean-Shift Clustering
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

What is K-Means Clustering?

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
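A minimal sketch of this partitioning with scikit-learn's KMeans (the two-blob synthetic data and all parameter values here are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Partition the n=100 observations into k=2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each observation
centers = km.cluster_centers_  # the learned cluster means
```

Each point ends up in the cluster whose mean (center) is nearest to it, which is exactly the objective stated above.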

K-Means Clustering Algorithm

Pros

  • K-Means has the advantage that it's pretty fast, as all we're really doing is computing the distances between points and group centers; very few computations! It thus has linear complexity O(n) per iteration.

Cons

  • You have to select how many groups/classes there are.
  • K-means also starts with a random choice of cluster centers and therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and lack consistency.

Determining the Optimal Number of Clusters: 3 Must-Know Methods

  • Elbow method
  • Average silhouette method
  • Gap statistic method
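The elbow method from the list above can be sketched as follows: run K-Means for a range of k and look at the inertia (within-cluster sum of squares); the "elbow" is the k after which the decrease levels off. The three-cluster synthetic data and the range of k are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic clusters of 40 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in (0.0, 4.0, 8.0)])

# Inertia (within-cluster sum of squares) for k = 1..6
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```

For this data the drop in inertia from k=2 to k=3 is large, while from k=3 onward it is small, so the elbow sits at k=3, matching the three true clusters.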

Hierarchical Clustering

It is a type of connectivity-based clustering, built on the idea that data points lying closer to each other in the data space are more similar than data points lying far apart.

As the name suggests, hierarchical clustering builds a hierarchy of clusters that can be studied by visualising a dendrogram.
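A minimal sketch of agglomerative hierarchical clustering with scipy (the two-blob data, Ward linkage, and the cut at 2 clusters are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Z encodes the full merge tree; scipy.cluster.hierarchy.dendrogram(Z)
# would draw the dendrogram for it
Z = linkage(X, method="ward")

# Cut the tree to obtain 2 flat clusters (chosen after inspecting the tree)
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the whole tree is built first, the number of clusters is chosen afterwards by cutting it, which is the "no need to specify k up front" advantage listed below.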

Pros

  • Hierarchical clustering does not require us to specify the number of clusters and we can even select which number of clusters looks best since we are building a tree.

Cons

  • Lower efficiency, as it has a time complexity of O(n³)

Principal Component Analysis

  • Principal Component Analysis (PCA) is a dimension reduction technique that projects the data into a lower-dimensional space
  • PCA is commonly computed via Singular Value Decomposition (SVD), a matrix factorization method that decomposes a matrix into three smaller matrices
  • PCA finds the top N principal components, which are the directions along which the data vary (spread out) the most. Intuitively, the more spread out the data along a specific direction, the more information it contains, and thus the more important that direction is for recognising patterns in the dataset
  • PCA can be used as a pre-step for data visualization: reducing high-dimensional data into 2D or 3D. An alternative dimensionality reduction technique is t-SNE
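The points above can be sketched with scikit-learn's PCA; the 5-D synthetic data, where most variance deliberately lies along two directions, is an illustrative assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 points in 5-D: two high-variance directions plus three
# low-variance noise dimensions
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 2))
X = np.hstack([base * [5.0, 3.0],
               rng.normal(scale=0.1, size=(100, 3))])

# Project onto the top-2 principal components (2D, e.g. for visualization)
pca = PCA(n_components=2)
X2d = pca.fit_transform(X)

# Fraction of total variance captured by the two components
explained = pca.explained_variance_ratio_.sum()
```

Since nearly all of the spread lies along the first two directions, the two components capture almost all of the variance, illustrating why PCA works as a pre-step for 2D/3D visualization.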