Q&A 27 How do you uncover structure in high-dimensional data using a PCA plot?

27.1 Explanation

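Principal Component Analysis (PCA) projects high-dimensional data onto a small set of orthogonal axes (principal components), ordered by how much variance each one captures. Keeping only the first 2 or 3 components gives a plot that preserves as much of the total variance as possible. It helps: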

  • Reveal clusters or overlaps in feature space
  • Understand group separation
  • Prepare for clustering or modeling

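PCA requires numeric features, which should be standardized when they are measured on different scales. Points in the resulting plot can be colored by a categorical group (e.g., cut).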
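To make the projection described above concrete, here is a minimal sketch of PCA computed by hand with NumPy's singular value decomposition. The helper name `pca_2d` and the synthetic data are assumptions made for illustration; they are not part of the diamonds example.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    # Standardize each column to mean 0 and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T              # coordinates on PC1 and PC2
    explained = S**2 / np.sum(S**2)    # share of total variance per component
    return scores, explained[:2]

# Synthetic data (hypothetical): 5 features driven by 2 hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

scores, explained = pca_2d(X)
print(scores.shape)      # one row per observation, two columns (PC1, PC2)
print(explained.sum())   # fraction of variance kept by the 2D projection
```

Because the five synthetic features are built from only two underlying factors, the first two components should retain almost all of the variance. Real data rarely compresses this cleanly.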


27.2 Python Code

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Load and prepare data
diamonds = pd.read_csv("data/diamonds_sample.csv")
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(500, random_state=1)

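# Standardize features (mean 0, variance 1) so that large-scale columns
# such as price do not dominate the principal components
X = StandardScaler().fit_transform(subset.drop("cut", axis=1))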

# PCA transformation
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)

# Combine with labels
pca_df = pd.DataFrame(pca_result, columns=["PC1", "PC2"])
pca_df["cut"] = subset["cut"].values

# Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue="cut", palette="Set2", alpha=0.7)
plt.title("PCA Plot: Diamond Features Colored by Cut")
plt.tight_layout()
plt.show()
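Before reading structure into a 2D PCA plot, it is worth checking how much variance the two components retain and which features drive them. The sketch below uses scikit-learn's `explained_variance_ratio_` and `components_` attributes on a small synthetic table. The column values are made up for illustration and are not the diamonds data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a few numeric diamond columns (hypothetical values)
rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 2.0, size=300)
features = pd.DataFrame({
    "carat": carat,
    "x": 6.5 * carat ** (1 / 3) + rng.normal(0, 0.05, size=300),
    "price": 4000 * carat + rng.normal(0, 500, size=300),
    "depth": rng.normal(61.7, 1.4, size=300),  # generated independently of size
})

# Standardize, then fit a two-component PCA
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=2).fit(X)

# Share of total variance captured by PC1 and PC2
explained = pca.explained_variance_ratio_
print(explained.round(3))

# Loadings: the weight of each original feature in each component
loadings = pd.DataFrame(pca.components_.T,
                        index=features.columns,
                        columns=["PC1", "PC2"])
print(loadings.round(2))
```

If the two ratios sum to a small fraction, the scatter plot hides most of the variation and apparent overlap between groups should be read with caution. Features with large absolute loadings on a component are the ones that axis mainly reflects.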

27.3 R Code

library(readr)
library(ggplot2)
library(dplyr)

# Load and sample
diamonds <- read_csv("data/diamonds_sample.csv")
set.seed(1)  # make the random sample reproducible
subset <- diamonds %>%
  select(carat, depth, table, price, x, y, z, cut) %>%
  sample_n(500)

# PCA on standardized features (scale() centers each column and divides by its SD)
features <- subset %>% select(-cut)
features_scaled <- scale(features)
pca_result <- prcomp(features_scaled)

# Combine for plotting
pca_df <- data.frame(pca_result$x[,1:2], cut = subset$cut)

# Plot
ggplot(pca_df, aes(x = PC1, y = PC2, color = cut)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "PCA Plot: Diamond Features by Cut")


✅ PCA (Principal Component Analysis) reduces complexity while preserving patterns. When plotted in 2D, it can reveal clustering, separation, or overlap between groups.