Q&A 5 How do you summarize numerical and categorical variables?

5.1 Explanation

Summarizing variables gives you a quick read on each variable's distribution, central tendency, and variation, which is essential before any visualization.

  • Numerical variables: We summarize using measures like mean, median, standard deviation, min, max, and percentiles.
  • Categorical variables: We summarize by counting the frequency of each category.
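
To make the measures in the two bullets above concrete, here is a minimal sketch that computes each one explicitly. The `toy` DataFrame and its values are hypothetical, chosen small enough to check by hand; it is not the diamonds sample.

```python
import pandas as pd

# Hypothetical toy data (not the diamonds sample), small enough to verify by hand
toy = pd.DataFrame({
    "price": [300, 500, 700, 900, 1100],
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Ideal"],
})

# Numerical variable: compute each summary measure explicitly
price = toy["price"]
numeric_summary = {
    "mean": price.mean(),
    "median": price.median(),
    "std": price.std(),            # sample standard deviation (ddof=1)
    "min": price.min(),
    "max": price.max(),
    "q25": price.quantile(0.25),   # 25th percentile
    "q75": price.quantile(0.75),   # 75th percentile
}
print(numeric_summary)

# Categorical variable: count how often each category occurs
cut_counts = toy["cut"].value_counts()
print(cut_counts)
```

These are the same quantities that `describe()` and `value_counts()` report in one call each; spelling them out only shows where each number comes from.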

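Here we use a random sample of 500 diamonds for fast, clear summaries. The Python and R examples draw their samples independently, so their outputs differ slightly.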

5.2 Python Code

import seaborn as sns
import pandas as pd

# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)

# Summary of numerical variables
print("📊 Summary of Numerical Variables:\n")
print(df.describe())

# Frequency count of categorical variables
print("\n🔠 Frequency of Categorical Variables:\n")
for col in ["cut", "color", "clarity"]:
    print(f"\n{col}:\n", df[col].value_counts())
📊 Summary of Numerical Variables:

            carat       depth       table         price           x  \
count  500.000000  500.000000  500.000000    500.000000  500.000000   
mean     0.834520   61.753800   57.255600   4243.838000    5.807500   
std      0.504862    1.395428    2.185484   4321.650221    1.173615   
min      0.230000   55.200000   53.000000    373.000000    3.900000   
25%      0.400000   61.100000   56.000000    955.250000    4.730000   
50%      0.720000   61.900000   57.000000   2665.000000    5.770000   
75%      1.090000   62.600000   59.000000   5508.500000    6.605000   
max      2.750000   67.100000   66.000000  18803.000000    9.040000   

                y           z  
count  500.000000  500.000000  
mean     5.806940    3.584380  
std      1.167344    0.719462  
min      3.940000    2.430000  
25%      4.730000    2.930000  
50%      5.780000    3.555000  
75%      6.595000    4.070000  
max      8.980000    5.490000  

🔠 Frequency of Categorical Variables:


cut:
 cut
Ideal        200
Premium      131
Very Good    115
Good          41
Fair          13
Name: count, dtype: int64

color:
 color
E    109
F    103
G     98
H     75
D     48
I     43
J     24
Name: count, dtype: int64

clarity:
 clarity
SI1     123
VS2     107
SI2      93
VS1      66
VVS2     52
VVS1     33
IF       17
I1        9
Name: count, dtype: int64
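
Raw counts like those above are often easier to compare as proportions. The sketch below shows one way to get them with `value_counts(normalize=True)`, plus the compact summary that `describe()` gives for a non-numeric column. The `cut` Series here is hypothetical toy data standing in for `df["cut"]`.

```python
import pandas as pd

# Hypothetical toy column standing in for df["cut"]
cut = pd.Series(
    ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium", "Fair", "Ideal"],
    name="cut",
)

counts = cut.value_counts()                 # absolute frequencies
shares = cut.value_counts(normalize=True)   # relative frequencies (sum to 1)

# Put counts and shares side by side, aligned on the category labels
freq_table = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(freq_table)

# For a non-numeric column, describe() reports the number of values,
# the number of distinct categories, the most frequent one, and its frequency
cat_summary = cut.describe()
print(cat_summary)
```

On the diamonds sample, the same calls would apply to `df["cut"]`, `df["color"]`, and `df["clarity"]`.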

5.3 R Code

library(ggplot2)
library(dplyr)

# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)

# Summary of numerical variables
# (summary() reports min, quartiles, mean, and max, but not the standard deviation)
cat("📊 Summary of Numerical Variables:\n")
📊 Summary of Numerical Variables:
summary(select(df, where(is.numeric)))
     carat            depth           table           price        
 Min.   :0.2300   Min.   :56.10   Min.   :52.00   Min.   :  345.0  
 1st Qu.:0.4000   1st Qu.:61.10   1st Qu.:56.00   1st Qu.:  971.8  
 Median :0.7200   Median :61.95   Median :57.00   Median : 2652.5  
 Mean   :0.8133   Mean   :61.85   Mean   :57.41   Mean   : 3951.3  
 3rd Qu.:1.0400   3rd Qu.:62.60   3rd Qu.:59.00   3rd Qu.: 5134.5  
 Max.   :3.0000   Max.   :68.80   Max.   :66.00   Max.   :18493.0  
       x               y               z        
 Min.   :3.920   Min.   :3.940   Min.   :2.410  
 1st Qu.:4.725   1st Qu.:4.710   1st Qu.:2.917  
 Median :5.780   Median :5.790   Median :3.570  
 Mean   :5.766   Mean   :5.765   Mean   :3.566  
 3rd Qu.:6.520   3rd Qu.:6.490   3rd Qu.:4.032  
 Max.   :9.320   Max.   :9.190   Max.   :5.500  
# Frequency count of categorical variables
cat("\n🔠 Frequency of Categorical Variables:\n")

🔠 Frequency of Categorical Variables:
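# Note: this stores each frequency table in a list-column, so the printed
# tibble shows only each table's size (5, 7, and 8 categories), not the counts.
# To print the counts themselves: lapply(df[c("cut", "color", "clarity")], table)
df %>%
  select(cut, color, clarity) %>%
  summarise(across(everything(), ~ list(table(.))))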
# A tibble: 1 × 3
  cut         color       clarity    
  <list>      <list>      <list>     
1 <table [5]> <table [7]> <table [8]>

✅ Summarizing your variables helps reveal patterns, detect outliers, and identify potential problems, all before you create your first plot.
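
As one illustration of how a summary can point to outliers, the sketch below applies the common 1.5 × IQR rule to the quartiles. This rule is an assumption brought in for illustration, not something the examples above use, and the `price` values are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy prices with one suspiciously large value
price = pd.Series([300, 500, 700, 900, 1100, 9000], name="price")

# Quartiles are the same 25% and 75% rows that describe() reports
q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR fences: values outside them are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = price[(price < lower) | (price > upper)]

print(f"fences: [{lower}, {upper}]")
print(outliers)

# A mean far above the median is another hint of a skewed distribution
print(f"mean={price.mean():.1f}, median={price.median():.1f}")
```

Flagged values are candidates for a closer look, not automatic errors; in the diamonds data, very high prices can be perfectly legitimate.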