Q&A 5 How do you summarize numerical and categorical variables?
5.1 Explanation
Summarizing variables helps you quickly understand data distribution, central tendency, and variation — essential before any visualization.
- Numerical variables: We summarize using measures like mean, median, standard deviation, min, max, and percentiles.
- Categorical variables: We summarize by counting the frequency of each category.
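Here we use a random sample of 500 diamonds for fast, clear summaries. The Python and R samples are drawn independently, so the two sets of results differ slightly.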
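To make the measures listed above concrete, here is a minimal sketch that computes them by hand with Python's standard library. The `prices` and `cuts` values are hypothetical toy data, not taken from the diamonds dataset. The `inclusive` quantile method is chosen here because it interpolates linearly between data points, which is also how pandas computes percentiles by default.

```python
import statistics
from collections import Counter

# Hypothetical toy values for illustration only (not from the diamonds data)
prices = [326, 334, 400, 554, 2757, 2759, 5000, 18000]
cuts = ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Fair", "Premium", "Ideal"]

# Numerical summary: centre, spread, range, and quartiles
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
numeric_summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "std": statistics.stdev(prices),  # sample standard deviation (divides by n - 1)
    "min": min(prices),
    "25%": q1,
    "75%": q3,
    "max": max(prices),
}

# Categorical summary: frequency of each category, most frequent first
cut_counts = Counter(cuts).most_common()

for name, value in numeric_summary.items():
    print(f"{name:>6}: {value:.2f}")
print(cut_counts)
```

In this toy data the mean (3766.25) sits well above the median (1655.5) because one large value pulls it upward, which is the kind of pattern a summary is meant to surface.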
5.2 Python Code
import seaborn as sns
import pandas as pd
# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)
# Summary of numerical variables
print("📊 Summary of Numerical Variables:\n")
print(df.describe())
# Frequency count of categorical variables
print("\n🔠 Frequency of Categorical Variables:\n")
for col in ["cut", "color", "clarity"]:
    print(f"\n{col}:\n", df[col].value_counts())
📊 Summary of Numerical Variables:
            carat       depth       table         price           x  \
count  500.000000  500.000000  500.000000    500.000000  500.000000
mean     0.834520   61.753800   57.255600   4243.838000    5.807500
std      0.504862    1.395428    2.185484   4321.650221    1.173615
min      0.230000   55.200000   53.000000    373.000000    3.900000
25%      0.400000   61.100000   56.000000    955.250000    4.730000
50%      0.720000   61.900000   57.000000   2665.000000    5.770000
75%      1.090000   62.600000   59.000000   5508.500000    6.605000
max      2.750000   67.100000   66.000000  18803.000000    9.040000

                y           z
count  500.000000  500.000000
mean     5.806940    3.584380
std      1.167344    0.719462
min      3.940000    2.430000
25%      4.730000    2.930000
50%      5.780000    3.555000
75%      6.595000    4.070000
max      8.980000    5.490000
🔠 Frequency of Categorical Variables:
cut:
cut
Ideal 200
Premium 131
Very Good 115
Good 41
Fair 13
Name: count, dtype: int64
color:
color
E 109
F 103
G 98
H 75
D 48
I 43
J 24
Name: count, dtype: int64
clarity:
clarity
SI1 123
VS2 107
SI2 93
VS1 66
VVS2 52
VVS1 33
IF 17
I1 9
Name: count, dtype: int64
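Raw counts are easier to compare as shares of the sample. The sketch below converts the `cut` counts printed above into proportions; the counts are copied from that output, and the variable names are illustrative.

```python
# Counts copied from the `cut` frequency output above (sample of 500 diamonds)
cut_counts = {"Ideal": 200, "Premium": 131, "Very Good": 115, "Good": 41, "Fair": 13}

total = sum(cut_counts.values())

# Share of the sample that falls in each category
proportions = {cut: n / total for cut, n in cut_counts.items()}

for cut, share in proportions.items():
    print(f"{cut:<10} {share:6.1%}")
```

With the sampled DataFrame itself, `df[col].value_counts(normalize=True)` returns the same proportions directly.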
5.3 R Code
library(ggplot2)
library(dplyr)
# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)
# Summary of numerical variables
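cat("📊 Summary of Numerical Variables:\n")
📊 Summary of Numerical Variables:
# Assumed call (not shown in the source): summary() of the numeric columns
df %>% select(where(is.numeric)) %>% summary()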
     carat            depth           table           price
 Min.   :0.2300   Min.   :56.10   Min.   :52.00   Min.   :  345.0
 1st Qu.:0.4000   1st Qu.:61.10   1st Qu.:56.00   1st Qu.:  971.8
 Median :0.7200   Median :61.95   Median :57.00   Median : 2652.5
 Mean   :0.8133   Mean   :61.85   Mean   :57.41   Mean   : 3951.3
 3rd Qu.:1.0400   3rd Qu.:62.60   3rd Qu.:59.00   3rd Qu.: 5134.5
 Max.   :3.0000   Max.   :68.80   Max.   :66.00   Max.   :18493.0
       x               y               z
 Min.   :3.920   Min.   :3.940   Min.   :2.410
 1st Qu.:4.725   1st Qu.:4.710   1st Qu.:2.917
 Median :5.780   Median :5.790   Median :3.570
 Mean   :5.766   Mean   :5.765   Mean   :3.566
 3rd Qu.:6.520   3rd Qu.:6.490   3rd Qu.:4.032
 Max.   :9.320   Max.   :9.190   Max.   :5.500
🔠 Frequency of Categorical Variables:
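# A tibble: 1 × 3
  cut         color       clarity
  <list>      <list>      <list>
1 <table [5]> <table [7]> <table [8]>
Note that this tibble stores each frequency table in a list-column, so the counts themselves are not printed. Calling count(df, cut), and likewise for color and clarity, prints one row per category with its frequency.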
✅ Summarizing your variables helps reveal patterns, detect outliers, and identify potential problems — all before you create your first plot.
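As one illustration of how a summary can flag outliers, the sketch below applies Tukey's 1.5 × IQR rule to the `price` quartiles from the Python `describe()` output. This rule is not part of the original walkthrough; it is a common convention, used here as an assumption, and other thresholds are equally defensible.

```python
# Quartiles and extremes of `price`, copied from the Python describe() output
q1, q3 = 955.25, 5508.50
price_min, price_max = 373.0, 18803.0

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("High-price outliers present:", price_max > upper_fence)
print("Low-price outliers present:", price_min < lower_fence)
```

Under this rule the maximum price of 18803 lies above the upper fence of about 12338, so the sample contains at least one unusually expensive diamond, while the lower fence is negative and flags nothing.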