# Import required libraries
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
06 | Data Visualization
Data Science with Polars and Plotly: EDA and Communication
1 Exploratory Data Analysis with Polars and Plotly
Exploratory Data Analysis (EDA) is a critical first step in any data analysis project. It involves examining the data to identify patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. This module adapts the EDA concepts from R for Data Science (2e) to the Python ecosystem, specifically using Polars for data manipulation and Plotly Express for visualization.
1.1 The Dataset
For this module, we’ll use the “diamonds” dataset, similar to the one used in R4DS. This dataset contains information about ~54,000 diamonds, including their prices and various attributes.
# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")
# Display first few rows
diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
f64 | str | str | str | f64 | f64 | i64 | f64 | f64 | f64 |
0.23 | "Ideal" | "E" | "SI2" | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | "Premium" | "E" | "SI1" | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | "Good" | "E" | "VS1" | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | "Premium" | "I" | "VS2" | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
0.31 | "Good" | "J" | "SI2" | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
Let’s examine the structure of our dataset:
# Get a quick summary of the dataset
diamonds.describe()
statistic | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|
str | f64 | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 |
"count" | 53940.0 | "53940" | "53940" | "53940" | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
"null_count" | 0.0 | "0" | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"mean" | 0.79794 | null | null | null | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
"std" | 0.474011 | null | null | null | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
… | … | … | … | … | … | … | … | … | … | … |
"25%" | 0.4 | null | null | null | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
"50%" | 0.7 | null | null | null | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
"75%" | 1.04 | null | null | null | 62.5 | 59.0 | 5324.0 | 6.54 | 6.54 | 4.04 |
"max" | 5.01 | "Very Good" | "J" | "VVS2" | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |
diamonds.glimpse()
Rows: 53940
Columns: 10
$ carat <f64> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23
$ cut <str> 'Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good', 'Very Good', 'Very Good', 'Fair', 'Very Good'
$ color <str> 'E', 'E', 'E', 'I', 'J', 'J', 'I', 'H', 'E', 'H'
$ clarity <str> 'SI2', 'SI1', 'VS1', 'VS2', 'SI2', 'VVS2', 'VVS1', 'SI1', 'VS2', 'VS1'
$ depth <f64> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4
$ table <f64> 55.0, 61.0, 65.0, 58.0, 58.0, 57.0, 57.0, 55.0, 61.0, 61.0
$ price <i64> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338
$ x <f64> 3.95, 3.89, 4.05, 4.2, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0
$ y <f64> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05
$ z <f64> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39
1.2 Data Visualization for EDA
1.2.1 Creating a Basic Plot
Let’s start with a simple visualization to examine the relationship between carat (weight) and price. Plotly Express creates plots quickly and has a built-in way of updating the labels, but we can also fine-tune a plot with fig.update_layout(), where fig is the figure object created with Plotly Express:
# Create a basic scatter plot
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    opacity=0.5,  # Make points semi-transparent
    title="Diamond Price vs Carat"
)
fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)
fig.show()
The plot reveals a positive relationship between a diamond’s weight and its price, but the relationship isn’t perfectly linear, and there’s significant variation.
1.2.2 Visualizing Distributions
Understanding the distribution of individual variables is an important part of EDA.
# Histogram of diamond prices
fig = px.histogram(
    diamonds,
    x="price",
    nbins=50,
    title="Distribution of Diamond Prices"
)
fig.update_layout(
    xaxis_title="Price (USD)",
    yaxis_title="Count"
)
fig.show()
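A histogram counts how many values fall into each equal-width bin. The plain-Python sketch below shows that idea on a handful of made-up prices. The bin width of 5000 is an assumption chosen for the example, and Plotly picks its own bin edges (it treats nbins as a target rather than an exact count), so its counts need not match these.

```python
from collections import Counter

# Made-up prices for illustration
prices = [326, 340, 950, 1200, 2401, 2500, 5324, 7000, 18823]

bin_width = 5000  # assumed width, chosen only for this example

# Map each price to the lower edge of its bin, then count per bin
bin_counts = Counter((price // bin_width) * bin_width for price in prices)

for lower_edge in sorted(bin_counts):
    upper_edge = lower_edge + bin_width
    print(f"[{lower_edge}, {upper_edge}): {bin_counts[lower_edge]}")
```

Bins with no values simply do not appear in the counter, which is why the output skips the 10000 to 15000 range.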
We can also look at the distribution of categorical variables:
# Count of diamonds by cut quality
# pl.len() replaces the deprecated pl.count(); the alias keeps the column name "count"
fig = px.bar(
    diamonds.group_by("cut").agg(pl.len().alias("count")).sort("cut"),
    x="cut",
    y="count",
    title="Number of Diamonds by Cut Quality"
)
fig.update_layout(
    xaxis_title="Cut Quality",
    yaxis_title="Count"
)
fig.show()
1.2.3 Visualizing Relationships
Let’s explore the relationship between multiple variables. For instance, how does the price-to-carat relationship change based on the diamond’s cut quality?
# Create a scatter plot with color showing cut quality
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    color="cut",
    opacity=0.6,
    title="Diamond Price vs Carat by Cut Quality"
)
fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)
fig.show()
We can also look at average prices by cut and color:
# Calculate average price by cut and color
avg_price_by_cut_color = (
    diamonds
    .group_by(["cut", "color"])
    .agg(pl.mean("price").alias("avg_price"))
    .sort(["cut", "color"])
)
# Create a heatmap
fig = px.density_heatmap(
    avg_price_by_cut_color,
    x="color",
    y="cut",
    z="avg_price",
    title="Average Diamond Price by Cut and Color"
)
fig.update_layout(
    xaxis_title="Color (D is best)",
    yaxis_title="Cut Quality"
)
fig.show()
1.2.4 Handling Overplotting
When working with large datasets like the diamonds dataset, overplotting can be an issue. Here are techniques to address this:
- Using transparency
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    opacity=0.2,
    title="Using Transparency to Handle Overplotting"
)
fig.show()
- Using smaller points
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    title="Using Smaller Points to Handle Overplotting"
)
fig.update_traces(marker=dict(size=2))
fig.show()
- Using a 2D density plot (binned heatmap)
fig = px.density_heatmap(
    diamonds,
    x="carat",
    y="price",
    nbinsx=30,
    nbinsy=30,
    title="Using a 2D Density Plot to Handle Overplotting"
)
fig.show()
- Adding jitter for categorical variables
fig = px.strip(
    diamonds.sample(n=1000),  # sample here is important
    x="cut",
    y="price",
    title="Using Jitter for Categorical Variables"
)
fig.show()
1.3 Patterns and Models
Visualizations help us identify patterns in the data, which can then inform our modeling approach. Let’s visualize a few more relationships:
# Create a boxplot of price by cut
fig = px.box(
    diamonds,
    x="cut",
    y="price",
    title="Diamond Price Distribution by Cut"
)
fig.show()
# Create a violin plot of price by cut
fig = px.violin(
    diamonds,
    x="cut",
    y="price",
    box=True,
    title="Diamond Price Violin Plot by Cut"
)
fig.show()
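A box plot summarizes each group with its quartiles: the box spans Q1 to Q3, the line inside is the median, and points beyond the whiskers are drawn individually. The sketch below computes those quantities in plain Python on made-up numbers, using the common rule of 1.5 times the IQR for the whisker limits. Plotly's quartile method may differ slightly from the "inclusive" method assumed here, so treat the numbers as illustrative.

```python
import statistics

# Made-up prices with one unusually large value
prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles (Q1, median, Q3); "inclusive" interpolates within the observed data
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [p for p in prices if p < lower_fence or p > upper_fence]
print(q1, median, q3)
print(outliers)
```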
1.4 Typical EDA Workflow
A typical EDA workflow with Polars and Plotly might look like this:
Load and inspect the data
# Load data
data = pl.read_csv("your_dataset.csv")
# Inspect structure
data.head()
data.schema
data.shape
# Check for missing values
data.null_count()
Compute summary statistics
# Basic summary statistics
data.describe()
# Custom aggregations by group
(
    data
    .group_by("category_column")
    .agg([
        pl.mean("numeric_column1").alias("mean_value"),
        pl.median("numeric_column1").alias("median_value"),
        pl.std("numeric_column1").alias("std_value"),
        pl.len().alias("count")  # pl.len() replaces the deprecated pl.count()
    ])
)
Visualize univariate distributions
# For numeric variables
fig = px.histogram(data, x="numeric_column")
fig.show()
# For categorical variables
counts = (
    data
    .group_by("category_column")
    .agg(pl.len().alias("count"))  # pl.len() replaces the deprecated pl.count()
    .sort("count", descending=True)
)
fig = px.bar(counts, x="category_column", y="count")
fig.show()
Explore relationships between variables
# Scatter plot for two numeric variables
fig = px.scatter(data, x="numeric_column1", y="numeric_column2")
fig.show()
# Add a third variable using color
fig = px.scatter(data, x="numeric_column1", y="numeric_column2", color="category_column")
fig.show()
# Boxplots for numeric vs categorical
fig = px.box(data, x="category_column", y="numeric_column")
fig.show()
Identify and investigate unusual observations
# Filter to outliers (e.g., beyond 3 standard deviations)
mean_expr = pl.col("numeric_column").mean()
std_expr = pl.col("numeric_column").std()
outliers = data.filter(
    (pl.col("numeric_column") > mean_expr + 3 * std_expr)
    | (pl.col("numeric_column") < mean_expr - 3 * std_expr)
)
print(outliers)
# Visualize with outliers highlighted
data = data.with_columns(
    pl.when(
        (pl.col("numeric_column") > mean_expr + 3 * std_expr)
        | (pl.col("numeric_column") < mean_expr - 3 * std_expr)
    )
    .then(True)
    .otherwise(False)
    .alias("is_outlier")
)
fig = px.scatter(
    data,
    x="numeric_column1",
    y="numeric_column2",
    color="is_outlier",
    color_discrete_map={True: "red", False: "blue"},
    title="Outlier Identification"
)
fig.show()
Transform variables if needed
# Log transformation for skewed data (natural log; values must be positive)
data = data.with_columns(
    pl.col("skewed_column").log().alias("log_skewed_column")
)
# Before and after histograms
fig1 = px.histogram(data, x="skewed_column", title="Original Distribution")
fig2 = px.histogram(data, x="log_skewed_column", title="Log-Transformed Distribution")
fig1.show()
fig2.show()
1.5 Practical EDA Questions
When performing EDA, it’s helpful to have some guiding questions:
- What type of variation occurs within my variables?
  - What values are common? What values are rare?
  - Are there any unexpected values or outliers?
  - What’s the shape of the distribution?
- What type of covariation occurs between my variables?
  - How do variables relate to each other?
  - Are there any clear patterns or relationships?
  - Do these relationships make sense given the domain?
- Are there interesting subgroups in the data?
  - Do patterns change when you filter or group the data?
  - Are there clusters or segments that behave differently?
- What might explain the observed patterns?
  - Can domain knowledge explain the relationships?
  - What additional data might help understanding?
  - What hypotheses can you form for further analysis?
2 Communicating with Data
Once you’ve explored and analyzed your data, the next crucial step is effectively communicating your findings. The goal is to help others understand what you’ve discovered without requiring them to go through the entire analysis process themselves. This section adapts the concepts from the Communication chapter of R for Data Science (2e) to Polars and Plotly.
2.1 Creating Effective Visualizations
The key to effective visualization is clarity and purpose. Each visualization should answer a specific question or highlight a particular insight.
2.1.1 Improving Basic Plots
Let’s start with a simple scatter plot of diamond price vs. carat, and progressively improve it for better communication:
# Basic scatter plot
fig = px.scatter(
    diamonds.sample(n=1000),
    x="carat",
    y="price",
    title="Diamond Price vs Carat"
)
fig.show()
Improved version with better labels and context:
fig = px.scatter(
    diamonds.sample(n=1000),
    x="carat",
    y="price",
    color="cut",
    title="Diamond Price vs Weight by Cut Quality",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    },
    hover_data=["clarity", "color", "depth", "table"]
)
fig.update_layout(
    title_x=0.5,  # Center the title
    legend_title_text="Cut Quality",
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    yaxis=dict(
        tickprefix="$",
        rangemode="tozero"
    )
)
fig.show()
2.1.2 Using Annotations
Annotations can help draw attention to important aspects of your visualization:
# Create a scatter plot with an annotation
fig = px.scatter(
    diamonds.sample(n=1000),
    x="carat",
    y="price",
    color="cut",
    title="Diamond Price vs Weight with Annotation",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    }
)
# Add annotation highlighting an interesting pattern
fig.add_annotation(
    x=2,
    y=15000,
    text="Premium cut diamonds maintain<br>higher value at larger sizes",  # <br> is an HTML element
    showarrow=True,
    arrowhead=1,
    ax=50,
    ay=-50
)
fig.show()
2.1.3 Multiple Views with Facets
Faceting allows you to create multiple views of the same data, split by categories:
# Create a faceted scatter plot
fig = px.scatter(
    diamonds.sample(n=2000),
    x="carat",
    y="price",
    color="color",
    facet_col="cut",
    title="Diamond Price vs Weight by Cut and Color",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "color": "Color (D is best)"
    }
)
# Update layout
fig.update_layout(
    title_x=0.5,
    legend_title_text="Diamond Color"
)
# Make facet column titles more readable
# this removes "Cut Quality" from "Cut Quality=Ideal" in the facet
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.show()
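The lambda above indexes the split result with [1], which raises an IndexError for any annotation whose text contains no "=" (for example one added with add_annotation()). A slightly more defensive variant takes the last piece instead. strip_facet_prefix is a hypothetical helper name used only here.

```python
def strip_facet_prefix(text: str) -> str:
    """Return the part after the last '=', or the text unchanged if there is none."""
    return text.split("=")[-1]


# Facet titles produced by Plotly Express look like "label=value"
print(strip_facet_prefix("Cut Quality=Ideal"))

# Text without "=" is returned as-is instead of raising IndexError
print(strip_facet_prefix("Overall trend"))
```

It could then be applied with fig.for_each_annotation(lambda a: a.update(text=strip_facet_prefix(a.text))).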
2.1.4 Creating Small Multiples
Small multiples are a powerful way to compare patterns across different subgroups:
# Create small multiples with box plots
fig = px.box(
    diamonds,
    x="cut",
    y="price",
    color="cut",
    facet_col="clarity",
    facet_col_wrap=4,
    title="Diamond Price Distribution by Cut and Clarity",
    labels={
        "cut": "Cut Quality",
        "price": "Price (USD)",
        "clarity": "Clarity"
    }
)
# Update layout
fig.update_layout(title_x=0.5)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.update_xaxes(tickangle=45)
fig.show()
2.2 Scales and Guides
The choice of scales and guides (legends, axes, etc.) can significantly impact how a visualization is interpreted.
2.2.1 Adjusting Scales
# Scatter plot with adjusted scales
fig = px.scatter(
    diamonds.sample(n=1000),
    x="carat",
    y="price",
    color="depth",
    title="Diamond Price vs Weight with Color Scale",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "depth": "Depth Percentage"
    }
)
# Update color scale
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Depth %",
        tickvals=[55, 60, 65, 70],
        ticktext=["55%", "60%", "65%", "70%"]
    )
)
# Log scale for price
fig.update_yaxes(type="log")
fig.show()
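A log axis places values by their logarithm, so equal ratios (not equal differences) take up equal space. That is why it suits a skewed variable like price. A short plain-Python illustration with round, made-up prices:

```python
import math

# Prices that each grow by a factor of 10
prices = [100, 1_000, 10_000]

# On a log axis, position is proportional to log10(price)
positions = [math.log10(p) for p in prices]
print(positions)

# Equal ratios give equal spacing between neighbouring positions
gaps = [b - a for a, b in zip(positions, positions[1:])]
print(gaps)
```

On a linear axis the second gap would be ten times the first; on the log axis the two gaps are the same size.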
2.2.2 Customizing Legends
# Plot with custom legend
fig = px.scatter(
    diamonds.sample(n=1000),
    x="carat",
    y="price",
    color="cut",
    symbol="clarity",
    title="Diamond Price vs Weight with Custom Legend",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "clarity": "Clarity"
    }
)
# Update legend
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title=""
    )
)
fig.show()
2.3 Themes and Typography
Consistent themes and typography help create a cohesive look across all your visualizations:
# Define WVU color theme
wvu_theme = {
    "layout": {
        "font": {"family": "Arial, sans-serif", "size": 12, "color": "#333333"},
        "title": {"font": {"size": 18, "color": "#002855"}},  # WVU Blue for title
        "plot_bgcolor": "#FFFFFF",
        "paper_bgcolor": "#FFFFFF",
        "colorway": ["#002855", "#EAAA00", "#0033A0", "#9AABBA", "#4F5B66"],  # WVU Blue, Gold, and complementary colors
        "xaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "yaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "legend": {"bgcolor": "#FFFFFF", "bordercolor": "#E5E5E5"},
        "margin": {"t": 60, "b": 60, "l": 50, "r": 50}
    }
}
cut_summary = diamonds.group_by('cut').agg(
    pl.mean('price').alias('avg_price'),
    pl.count('price').alias('count')
).sort('avg_price', descending=True)
# Apply theme to a plot
fig = px.bar(
    cut_summary,
    x='cut',
    y='avg_price',
    color='cut',
    title='Average Diamond Price by Cut (WVU Theme)',
    labels={'avg_price': 'Average Price ($)', 'cut': 'Cut Quality'},
    color_discrete_sequence=wvu_theme['layout']['colorway'],
    custom_data=['count']  # attach each bar's own count for the hover text
)
# Apply WVU theme
fig.update_layout(**wvu_theme['layout'])
# Add values on top of bars
fig.update_traces(
    texttemplate='$%{y:.0f}',
    textposition='outside',
    textfont=dict(size=12, color='#333333'),
    marker_line_width=1,
    marker_line_color='#FFFFFF'
)
# Add count as a hover info
# color='cut' creates one trace per cut, so the count comes from custom_data
# rather than from a single array applied to every trace
fig.update_traces(
    hovertemplate='<b>%{x}</b><br>Average Price: $%{y:.2f}<br>Count: %{customdata[0]}<extra></extra>'
)
# Show the figure
fig.show()
2.4 Tables for Communication
While visualizations are powerful, sometimes a well-designed table is the best way to communicate specific values or detailed information. The Great Tables package can help us create print-ready tables:
from great_tables import GT
# Create a summary table
summary_data = (
    diamonds
    .group_by(["cut", "color"])
    .agg([
        pl.mean("price").round(2).alias("avg_price"),
        pl.len().alias("count"),
        pl.median("carat").round(3).alias("median_carat")
    ])
    .sort(["cut", "color"])
)
# Create the GT table
gt_table = (
    GT(summary_data)
    # Set title and subtitle
    .tab_header(
        title="Diamond Summary Statistics",
        subtitle="Aggregated by Cut and Color"
    )
    # Format columns
    .fmt_currency(
        columns=["avg_price"],
        currency="USD"
    )
    .fmt_number(
        columns=["count"],
        use_seps=True
    )
    .fmt_number(
        columns=["median_carat"],
        decimals=3
    )
    # Rename columns for display
    .cols_label(
        cut="Cut Quality",
        color="Color Grade",
        avg_price="Average Price",
        count="Count",
        median_carat="Median Carat"
    )
    # Add source note
    .tab_source_note(
        source_note="Data summarized from diamonds dataset"
    )
)
# Display the table
gt_table
Diamond Summary Statistics
Aggregated by Cut and Color
Cut Quality | Color Grade | Average Price | Count | Median Carat |
---|---|---|---|---|
Fair | D | $4,291.06 | 163.00 | 0.900 |
Fair | E | $3,682.31 | 224.00 | 0.900 |
Fair | F | $3,827.00 | 312.00 | 0.900 |
Fair | G | $4,239.25 | 314.00 | 0.980 |
Fair | H | $5,135.68 | 303.00 | 1.010 |
Fair | I | $4,685.45 | 175.00 | 1.010 |
Fair | J | $4,975.66 | 119.00 | 1.030 |
Good | D | $3,405.38 | 662.00 | 0.700 |
Good | E | $3,423.64 | 933.00 | 0.700 |
Good | F | $3,495.75 | 909.00 | 0.710 |
Good | G | $4,123.48 | 871.00 | 0.900 |
Good | H | $4,276.25 | 702.00 | 0.900 |
Good | I | $5,078.53 | 522.00 | 1.000 |
Good | J | $4,574.17 | 307.00 | 1.020 |
Ideal | D | $2,629.09 | 2,834.00 | 0.500 |
Ideal | E | $2,597.55 | 3,903.00 | 0.500 |
Ideal | F | $3,374.94 | 3,826.00 | 0.530 |
Ideal | G | $3,720.71 | 4,884.00 | 0.540 |
Ideal | H | $3,889.33 | 3,115.00 | 0.700 |
Ideal | I | $4,451.97 | 2,093.00 | 0.740 |
Ideal | J | $4,918.19 | 896.00 | 1.030 |
Premium | D | $3,631.29 | 1,603.00 | 0.580 |
Premium | E | $3,538.91 | 2,337.00 | 0.580 |
Premium | F | $4,324.89 | 2,331.00 | 0.760 |
Premium | G | $4,500.74 | 2,924.00 | 0.755 |
Premium | H | $5,216.71 | 2,360.00 | 1.010 |
Premium | I | $5,946.18 | 1,428.00 | 1.140 |
Premium | J | $6,294.59 | 808.00 | 1.250 |
Very Good | D | $3,470.47 | 1,513.00 | 0.610 |
Very Good | E | $3,214.65 | 2,400.00 | 0.570 |
Very Good | F | $3,778.82 | 2,164.00 | 0.700 |
Very Good | G | $3,872.75 | 2,299.00 | 0.700 |
Very Good | H | $4,535.39 | 1,824.00 | 0.900 |
Very Good | I | $5,255.88 | 1,204.00 | 1.005 |
Very Good | J | $5,103.51 | 678.00 | 1.060 |
Data summarized from diamonds dataset
2.5 Combining Text and Visualizations
In a Quarto document, you can combine text explanations with your visualizations to create a coherent narrative:
## Diamond Price Analysis
Our analysis of the diamond dataset reveals several interesting patterns:
```python
# Load necessary libraries
import polars as pl
import plotly.express as px
# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")
# Create a visualization
fig = px.scatter(
diamonds.sample(n=1000),
x="carat",
y="price",
color="cut",
title="Diamond Price vs Weight by Cut Quality"
)
fig.show()
```
As shown in the plot above, there is a strong positive relationship between a diamond's weight (carat) and its price. However, this relationship is moderated by the quality of the cut, with higher quality cuts generally commanding premium prices across the weight spectrum.
2.6 Principles of Effective Communication
Here are key principles to follow when communicating with data:
Know your audience: Adapt your visualization complexity and terminology to match your audience’s expertise.
Tell a story: Structure your communication as a narrative with a beginning, middle, and end.
Focus on the message: Every element in your visualization should support your main message.
Simplify: Remove chart junk and unnecessary elements that don’t contribute to understanding.
Choose appropriate visualizations: Select chart types that best represent your data and answer your specific questions.
Iterate: Create drafts, get feedback, and refine your visualizations before finalizing them.
2.7 Practical Communication Workflow
A practical workflow for communicating with data might look like this:
Identify your key findings: What are the 2-3 most important insights from your analysis?
Determine your audience: Who will be consuming your visualization? What do they already know? What do they need to learn?
Select appropriate visualization types: Choose the charts that best communicate your findings.
Create draft visualizations: Build initial versions of your plots with Plotly Express.
Refine and polish: Add appropriate titles, labels, colors, and annotations.
Integrate with narrative: Combine your visualizations with explanatory text in your Quarto document.
Review and revise: Get feedback and iterate as needed.
3 Conclusion
Effective exploratory data analysis and communication are essential skills for any data scientist. With Polars for data manipulation and Plotly Express for visualization, you have powerful tools to explore, understand, and communicate insights from your data.
This module has provided a Python-focused adaptation of the concepts from R4DS, demonstrating how to perform EDA and create effective visualizations using modern Python data science tools. By mastering these techniques, you’ll be better equipped to extract meaningful insights from data and effectively share those insights with others.
4 Further Resources
5 Exercises
These exercises will help you practice the concepts of Exploratory Data Analysis and Data Communication using Polars and Plotly Express. We’ll work with two datasets throughout these exercises: the diamonds dataset from our module and the Titanic dataset, which is widely used in data science education.
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
# Diamonds Dataset
diamonds = pl.read_csv("data/diamonds.csv")
# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pl.read_csv(titanic_url)
5.1 Basic EDA with Diamonds Dataset
Perform a basic exploratory data analysis on the diamonds dataset:
- Create a summary of the dataset showing the count, mean, standard deviation, min, and max for all numeric columns
- Calculate the correlation between price and all other numeric variables
- Create a histogram of diamond prices with an appropriate number of bins
- Create a box plot showing price distribution by cut quality
- Identify the top 10 most expensive diamonds and display their attributes
5.2 Visualizing Relationships in the Diamonds Dataset
Explore relationships between variables in the diamonds dataset:
- Create a scatter plot of carat vs. price with color representing cut quality
- Implement at least two different techniques to handle overplotting in this visualization
- Create a visualization showing how the price-to-carat relationship varies across different clarity categories
- Create a heatmap showing the average price by cut and color
- Design a visualization that effectively communicates which combination of diamond attributes tends to yield the highest value (price relative to carat weight)
5.3 Advanced Visualization Techniques for Diamonds
Apply more advanced visualization techniques to the diamonds dataset:
- Create a small multiples (faceted) visualization showing the price-to-carat relationship across different cut and clarity combinations
- Create a violin plot comparing the price distributions across cut categories
- Design an interactive visualization that allows users to explore how different combinations of attributes affect diamond prices
- Add appropriate annotations to highlight key insights in one of your visualizations
- Create a custom theme for your plots that could be used consistently across a presentation or report
5.4 Initial EDA with Titanic Dataset
Perform an initial exploratory data analysis on the Titanic dataset:
- Create a summary of the dataset, including the count of missing values for each column
- Calculate the overall survival rate and visualize survival counts
- Create a visualization showing survival rates by passenger class (Pclass)
- Create a visualization showing survival rates by sex
- Create a visualization showing the age distribution of passengers, with color indicating survival status
5.5 Investigating Survival Factors in the Titanic Dataset
Dig deeper into what factors influenced survival on the Titanic.
- Create a visualization showing survival rates by passenger class (Pclass) and sex
- Investigate if fare amount was related to survival chances
- Explore if traveling with family members (SibSp + Parch > 0) affected survival rates
- Create a visualization showing survival rates by age groups (e.g., children, adults, elderly)
- Design a composite visualization that effectively communicates the most important factors that influenced survival
5.6 Effective Communication with the Titanic Dataset
Create presentation-quality visualizations that tell a story about the Titanic disaster.
- Create a visualization that effectively communicates the “women and children first” policy
- Design a visualization that shows how social class (indicated by passenger class) affected survival chances
- Create a visualization that communicates how survival rates varied by the deck/location of the cabin (extract the deck from the cabin column)