Q&A 29 How do you explore complex patterns in high-dimensional data using a UMAP plot?

29.1 Explanation

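UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality-reduction technique. It builds a nearest-neighbor graph of the data and lays that graph out in two dimensions, so it preserves local neighborhoods well and usually retains some of the global layout too. PCA, by contrast, can only capture linear directions of maximum variance. UMAP is well suited to: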

  • Revealing clusters, manifolds, or nonlinear groupings
  • Visualizing high-dimensional feature behavior in 2D
  • Exploring potential for classification or clustering
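One way to check how well a 2D projection keeps local neighborhoods intact is scikit-learn's `trustworthiness` score, which ranges from 0 to 1 (higher means the nearest neighbors in the plot were also neighbors in the original space). The sketch below is only illustrative: it uses the bundled digits dataset as a stand-in for high-dimensional data (an assumption, not the diamonds file used in this Q&A), scores a PCA projection as a baseline, and scores UMAP only if `umap-learn` is installed.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): first 500 rows of the 64-feature digits dataset
X_raw = load_digits().data[:500]
X = StandardScaler().fit_transform(X_raw)

# Linear baseline: project onto the first two principal components
pca_embedding = PCA(n_components=2, random_state=0).fit_transform(X)
scores = {"PCA": trustworthiness(X, pca_embedding, n_neighbors=15)}

# UMAP is scored only when the umap-learn package is available
try:
    import umap
    umap_embedding = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)
    scores["UMAP"] = trustworthiness(X, umap_embedding, n_neighbors=15)
except ImportError:
    pass

for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

The score measures neighborhood preservation only. It says nothing about whether distances between well-separated clusters in the plot are meaningful, so treat it as one diagnostic among several.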

29.2 Python Code

import pandas as pd
from sklearn.preprocessing import StandardScaler
import umap
import seaborn as sns
import matplotlib.pyplot as plt

# Load and sample data
diamonds = pd.read_csv("data/diamonds_sample.csv")
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(500, random_state=1)

# Normalize numeric features
X = StandardScaler().fit_transform(subset.drop("cut", axis=1))

# Run UMAP
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X)

# Plot
umap_df = pd.DataFrame(embedding, columns=["UMAP1", "UMAP2"])
umap_df["cut"] = subset["cut"].values

plt.figure(figsize=(8, 6))
sns.scatterplot(data=umap_df, x="UMAP1", y="UMAP2", hue="cut", palette="Set2", alpha=0.7)
plt.title("UMAP Projection of Diamond Features by Cut")
plt.tight_layout()
plt.show()

29.3 R Code (UMAP via uwot)

library(readr)
library(dplyr)
library(ggplot2)
library(uwot)

# Load and sample (seed set first so the 500-row sample is reproducible)
diamonds <- read_csv("data/diamonds_sample.csv")
set.seed(1)
subset <- diamonds %>% select(carat, depth, table, price, x, y, z, cut) %>% sample_n(500)

# Standardize features
X <- scale(subset %>% select(-cut))

# Apply UMAP
set.seed(42)
embedding <- umap(X, n_neighbors = 15, min_dist = 0.1)

# Combine with labels
umap_df <- data.frame(embedding, cut = subset$cut)
colnames(umap_df)[1:2] <- c("UMAP1", "UMAP2")

# Plot
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = cut)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "UMAP Projection of Diamond Features by Cut")


✅ UMAP captures nonlinear patterns in complex datasets, helping you visualize hidden structure, group separations, and feature interactions that PCA may miss.