Q&A 3 How do you convert variable types in a dataset?

3.1 Explanation

Before visualizing or modeling your data, it’s important to ensure that each variable has the correct type. For example:

  • Categorical variables (like cut, color, clarity) should be treated as factors or categories
  • Numerical variables accidentally stored as strings should be converted to numeric types
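
The second case can be sketched with `pd.to_numeric`. This is a minimal illustration on a hypothetical toy data frame (the `raw` table and its values are assumptions, not part of the diamonds data), showing one way to turn string-encoded numbers into a numeric column while flagging entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical toy data: prices were read in as strings, one entry is malformed
raw = pd.DataFrame({"price": ["326", "327", "n/a", "334"]})

# errors="coerce" turns unparseable entries into NaN instead of raising an error
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

print(raw.dtypes)
print(raw["price"].isna().sum(), "value(s) could not be parsed")
```

Coercing silently hides bad entries, so it is worth counting the resulting missing values before moving on.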

In this example, we’ll use a sample of 500 diamonds to demonstrate how to inspect and convert variable types where needed, a crucial step for grouped plots and modeling accuracy.

3.2 Python Code

import seaborn as sns
import pandas as pd

# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)

# Convert selected columns to categorical
categorical_cols = ["cut", "color", "clarity"]
for col in categorical_cols:
    df[col] = df[col].astype("category")

# Confirm data types
print("🔍 Updated Variable Types:\n")
print(df.dtypes)
🔍 Updated Variable Types:

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object
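
A plain `astype("category")` does not by itself declare an ordering among the levels. If the levels have a natural order (as the quality grades of `cut` do), one possible approach is to pass an explicit ordered `CategoricalDtype`. The sketch below uses a small hypothetical data frame (`toy`) rather than the diamonds sample, and the level order is written out by hand as an assumption.

```python
import pandas as pd

# Hypothetical toy column holding the five cut grades as plain strings
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium", "Good", "Very Good"]})

# Assumed quality order, from worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_dtype = pd.CategoricalDtype(categories=cut_order, ordered=True)
toy["cut"] = toy["cut"].astype(cut_dtype)

# Sorting and comparisons now follow the declared order, not the alphabet
sorted_cuts = toy.sort_values("cut")["cut"].tolist()
n_premium_or_better = int((toy["cut"] >= "Premium").sum())

print(toy["cut"].cat.ordered)
print(sorted_cuts)
print(n_premium_or_better)
```

With an ordered dtype, sorted tables and plot legends follow the quality scale instead of alphabetical order.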

3.3 R Code

library(ggplot2)
library(dplyr)

# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)

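# Convert selected columns to factor (in ggplot2::diamonds these are already
# ordered factors, so as.factor() leaves them unchanged; the same pattern
# applies to character columns)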
df <- df %>%
  mutate(
    cut = as.factor(cut),
    color = as.factor(color),
    clarity = as.factor(clarity)
  )

# Confirm structure
cat("🔍 Updated Variable Types:\n")
🔍 Updated Variable Types:
str(df)
tibble [500 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:500] 0.39 1.12 0.51 0.52 0.28 1.01 0.4 0.9 0.33 0.71 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 3 3 3 3 1 3 5 5 4 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 6 4 4 1 2 3 1 1 4 4 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 6 2 6 5 6 3 5 3 7 4 ...
 $ depth  : num [1:500] 60.8 63.3 62.9 62.5 61.4 67.2 60.8 62.1 62 62.1 ...
 $ table  : num [1:500] 56 58 57 57 55 60 59 57 55 62 ...
 $ price  : int [1:500] 849 4478 1750 1829 612 4276 954 4523 838 2623 ...
 $ x      : num [1:500] 4.74 6.7 5.06 5.11 4.22 6.06 4.74 6.18 4.45 5.71 ...
 $ y      : num [1:500] 4.76 6.63 5.12 5.16 4.25 6 4.76 6.25 4.49 5.65 ...
 $ z      : num [1:500] 2.89 4.22 3.2 3.21 2.6 4.05 2.89 3.86 2.77 3.53 ...

✅ Ensuring correct variable types improves how your data is visualized, summarized, and modeled, especially when working with grouped plots or categorical aesthetics.
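
As a small illustration of the effect on summaries, the sketch below groups a hypothetical toy table (`toy`, with made-up prices) by a categorical column. Because the column carries its full set of levels, a level that never occurs can still appear in the grouped result with a count of zero.

```python
import pandas as pd

# Hypothetical toy data: "Good" is a declared level that never occurs
toy = pd.DataFrame({
    "cut": pd.Categorical(
        ["Ideal", "Fair", "Ideal"],
        categories=["Fair", "Good", "Ideal"],
    ),
    "price": [500, 300, 700],
})

# observed=False keeps declared-but-unseen categories in the result
counts = toy.groupby("cut", observed=False).size()
print(counts)
```

Passing `observed` explicitly avoids depending on the default, which differs across pandas versions.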