Q&A 3 How do you convert variable types in a dataset?
3.1 Explanation
Before visualizing or modeling your data, it's important to ensure that each variable has the correct type. For example:
- Categorical variables (like cut, color, and clarity) should be treated as factors or categories
- Numerical variables accidentally stored as strings should be converted to numeric types
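The second case does not arise in the diamonds data, whose numeric columns already have numeric types. As a minimal sketch, assuming a hypothetical toy frame (the name raw and its values are made up for illustration) in which prices arrived as text, pd.to_numeric can handle the conversion:

```python
import pandas as pd

# Hypothetical toy data: prices were read in as text, with one unparseable entry
raw = pd.DataFrame({
    "cut": ["Ideal", "Good", "Premium", "Good"],
    "price": ["326", "1,250", "n/a", "4478"],
})

# Strip thousands separators first, then convert.
# errors="coerce" turns anything that still cannot be parsed into NaN
# instead of raising an exception.
cleaned = raw["price"].str.replace(",", "", regex=False)
raw["price"] = pd.to_numeric(cleaned, errors="coerce")

# Inspect the result: price is now numeric, and one value became missing
print(raw.dtypes)
print(raw["price"].isna().sum(), "value(s) could not be converted")
```

Coercing to NaN keeps the conversion from failing on a single bad entry, but it also hides problems, so it is worth counting the missing values afterwards as the last line does.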
The code in the following sections uses a random sample of 500 diamonds to demonstrate how to inspect and convert variable types where needed, a crucial step for grouped plots and modeling accuracy.
3.2 Python Code
import seaborn as sns
import pandas as pd
# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)
# Convert selected columns to categorical
categorical_cols = ["cut", "color", "clarity"]
for col in categorical_cols:
    df[col] = df[col].astype("category")
# Confirm data types
print("Updated Variable Types:\n")
print(df.dtypes)
Updated Variable Types:
carat float64
cut category
color category
clarity category
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object
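The R output for this same dataset reports cut, color, and clarity as ordered factors. When astype("category") is applied to plain strings, pandas creates unordered categories, so if the ordering matters (for sorting or comparisons) it has to be stated explicitly. The sketch below assumes a small hypothetical frame named toy standing in for the sampled data, and uses the cut levels from ggplot2's diamonds dataset:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Small hypothetical frame standing in for the sampled diamonds data
toy = pd.DataFrame(
    {"cut": ["Ideal", "Fair", "Premium", "Good", "Very Good", "Fair"]}
)

# Quality levels from worst to best, as in ggplot2::diamonds
cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_dtype = CategoricalDtype(categories=cut_levels, ordered=True)
toy["cut"] = toy["cut"].astype(cut_dtype)

# Ordered categories support comparisons and sort by level, not alphabetically
print(toy["cut"].cat.ordered)
print((toy["cut"] >= "Premium").sum())
print(toy["cut"].sort_values().tolist())
```

The same pattern would apply to color and clarity, given their level orders.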
3.3 R Code
library(ggplot2)
library(dplyr)
# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)
# Convert selected columns to factor
df <- df %>%
  mutate(
    cut = as.factor(cut),
    color = as.factor(color),
    clarity = as.factor(clarity)
  )
# Confirm structure
cat("Updated Variable Types:\n")
str(df)
Updated Variable Types:
tibble [500 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:500] 0.39 1.12 0.51 0.52 0.28 1.01 0.4 0.9 0.33 0.71 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 3 3 3 3 1 3 5 5 4 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 6 4 4 1 2 3 1 1 4 4 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 6 2 6 5 6 3 5 3 7 4 ...
$ depth : num [1:500] 60.8 63.3 62.9 62.5 61.4 67.2 60.8 62.1 62 62.1 ...
$ table : num [1:500] 56 58 57 57 55 60 59 57 55 62 ...
$ price : int [1:500] 849 4478 1750 1829 612 4276 954 4523 838 2623 ...
$ x : num [1:500] 4.74 6.7 5.06 5.11 4.22 6.06 4.74 6.18 4.45 5.71 ...
$ y : num [1:500] 4.76 6.63 5.12 5.16 4.25 6 4.76 6.25 4.49 5.65 ...
$ z : num [1:500] 2.89 4.22 3.2 3.21 2.6 4.05 2.89 3.86 2.77 3.53 ...
Ensuring correct variable types improves how your data is visualized, summarized, and modeled, especially when working with grouped plots or categorical aesthetics.