06 | Data Visualization

Data Science with Polars and Plotly: EDA and Communication

Author: Mr. Ozan Ozbeker

Affiliation: Industrial and Management Systems Engineering

1 Exploratory Data Analysis with Polars and Plotly

Exploratory Data Analysis (EDA) is a critical first step in any data analysis project. It involves examining the data to identify patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. This module adapts the EDA concepts from R for Data Science (2e) to the Python ecosystem, specifically using Polars for data manipulation and Plotly Express for visualization.

# Import required libraries
import polars as pl
import plotly.express as px
import plotly.graph_objects as go

1.1 The Dataset

For this module, we’ll use the “diamonds” dataset, similar to the one used in R4DS. This dataset contains information about ~54,000 diamonds, including their prices and various attributes.

# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Display first few rows
diamonds.head()

shape: (5, 10)

carat	cut	color	clarity	depth	table	price	x	y	z
f64	str	str	str	f64	f64	i64	f64	f64	f64
0.23	"Ideal"	"E"	"SI2"	61.5	55.0	326	3.95	3.98	2.43
0.21	"Premium"	"E"	"SI1"	59.8	61.0	326	3.89	3.84	2.31
0.23	"Good"	"E"	"VS1"	56.9	65.0	327	4.05	4.07	2.31
0.29	"Premium"	"I"	"VS2"	62.4	58.0	334	4.2	4.23	2.63
0.31	"Good"	"J"	"SI2"	63.3	58.0	335	4.34	4.35	2.75

Let’s examine the structure of our dataset:

# Get a quick summary of the dataset
diamonds.describe()

shape: (9, 11)

statistic	carat	cut	color	clarity	depth	table	price	x	y	z
str	f64	str	str	str	f64	f64	f64	f64	f64	f64
"count"	53940.0	"53940"	"53940"	"53940"	53940.0	53940.0	53940.0	53940.0	53940.0	53940.0
"null_count"	0.0	"0"	"0"	"0"	0.0	0.0	0.0	0.0	0.0	0.0
"mean"	0.79794	null	null	null	61.749405	57.457184	3932.799722	5.731157	5.734526	3.538734
"std"	0.474011	null	null	null	1.432621	2.234491	3989.439738	1.121761	1.142135	0.705699
…	…	…	…	…	…	…	…	…	…	…
"25%"	0.4	null	null	null	61.0	56.0	950.0	4.71	4.72	2.91
"50%"	0.7	null	null	null	61.8	57.0	2401.0	5.7	5.71	3.53
"75%"	1.04	null	null	null	62.5	59.0	5324.0	6.54	6.54	4.04
"max"	5.01	"Very Good"	"J"	"VVS2"	79.0	95.0	18823.0	10.74	58.9	31.8

diamonds.glimpse()

Rows: 53940
Columns: 10
$ carat   <f64> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23
$ cut     <str> 'Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good', 'Very Good', 'Very Good', 'Fair', 'Very Good'
$ color   <str> 'E', 'E', 'E', 'I', 'J', 'J', 'I', 'H', 'E', 'H'
$ clarity <str> 'SI2', 'SI1', 'VS1', 'VS2', 'SI2', 'VVS2', 'VVS1', 'SI1', 'VS2', 'VS1'
$ depth   <f64> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4
$ table   <f64> 55.0, 61.0, 65.0, 58.0, 58.0, 57.0, 57.0, 55.0, 61.0, 61.0
$ price   <i64> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338
$ x       <f64> 3.95, 3.89, 4.05, 4.2, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0
$ y       <f64> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05
$ z       <f64> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39

1.2 Data Visualization for EDA

1.2.1 Creating a Basic Plot

Let’s start with a simple visualization of the relationship between carat (weight) and price. Plotly Express creates plots quickly and has a built-in labels argument for renaming axis and legend labels, but we can also fine-tune a plot with fig.update_layout(), where fig is the figure object returned by Plotly Express:

# Create a basic scatter plot
fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price",
    opacity=0.5,  # Make points semi-transparent
    title="Diamond Price vs Carat"
)

fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)

fig.show()

The plot reveals a positive relationship between a diamond’s weight and its price, but the relationship isn’t perfectly linear, and there’s significant variation.

1.2.2 Visualizing Distributions

Understanding the distribution of individual variables is an important part of EDA.

# Histogram of diamond prices
fig = px.histogram(
    diamonds, 
    x="price", 
    nbins=50,
    title="Distribution of Diamond Prices"
)

fig.update_layout(
    xaxis_title="Price (USD)",
    yaxis_title="Count"
)

fig.show()

We can also look at the distribution of categorical variables:

# Count of diamonds by cut quality
fig = px.bar(
    # pl.len() counts the rows in each group (it replaces the deprecated pl.count())
    diamonds.group_by("cut").agg(pl.len().alias("count")).sort("cut"),
    x="cut", 
    y="count",
    title="Number of Diamonds by Cut Quality"
)

fig.update_layout(
    xaxis_title="Cut Quality",
    yaxis_title="Count"
)

fig.show()

1.2.3 Visualizing Relationships

Let’s explore the relationship between multiple variables. For instance, how does the price-to-carat relationship change based on the diamond’s cut quality?

# Create a scatter plot with color showing cut quality
fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price", 
    color="cut",
    opacity=0.6,
    title="Diamond Price vs Carat by Cut Quality"
)

fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)

fig.show()

We can also look at average prices by cut and color:

# Calculate average price by cut and color
avg_price_by_cut_color = (
    diamonds
    .group_by(["cut", "color"])
    .agg(pl.mean("price").alias("avg_price"))
    .sort(["cut", "color"])
)

# Create a heatmap
fig = px.density_heatmap(
    avg_price_by_cut_color,
    x="color", 
    y="cut", 
    z="avg_price",
    title="Average Diamond Price by Cut and Color"
)

fig.update_layout(
    xaxis_title="Color (D is best)",
    yaxis_title="Cut Quality"
)

fig.show()

1.2.4 Handling Overplotting

When working with large datasets like the diamonds dataset, overplotting can be an issue. Here are techniques to address this:

Using transparency

fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price", 
    opacity=0.2,
    title="Using Transparency to Handle Overplotting"
)
fig.show()

Using smaller points

fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price", 
    title="Using Smaller Points to Handle Overplotting"
)
fig.update_traces(marker=dict(size=2))
fig.show()

Using a 2D density plot (rectangular bins)

fig = px.density_heatmap(
    diamonds, 
    x="carat", 
    y="price", 
    nbinsx=30, 
    nbinsy=30,
    title="Using a 2D Density Plot to Handle Overplotting"
)
fig.show()

Adding jitter for categorical variables

fig = px.strip(
    diamonds.sample(n=1000),  # plot a random subset: a strip plot of all ~54,000 points is slow and too dense to read
    x="cut", 
    y="price", 
    title="Using Jitter for Categorical Variables"
)
fig.show()

1.3 Patterns and Models

Visualizations help us identify patterns in the data, which can then inform our modeling approach. Let’s visualize a few more relationships:

# Create a boxplot of price by cut
fig = px.box(
    diamonds, 
    x="cut", 
    y="price",
    title="Diamond Price Distribution by Cut"
)
fig.show()

# Create a violin plot of price by cut
fig = px.violin(
    diamonds, 
    x="cut", 
    y="price", 
    box=True,
    title="Diamond Price Violin Plot by Cut"
)
fig.show()

1.4 Typical EDA Workflow

A typical EDA workflow with Polars and Plotly might look like this:

Load and inspect the data

# Load data
data = pl.read_csv("your_dataset.csv")

# Inspect structure
data.head()
data.schema
data.shape

# Check for missing values
data.null_count()

Compute summary statistics

# Basic summary statistics
data.describe()

# Custom aggregations by group
(data
.group_by("category_column")
.agg([
    pl.mean("numeric_column1").alias("mean_value"),
    pl.median("numeric_column1").alias("median_value"),
    pl.std("numeric_column1").alias("std_value"),
    pl.len().alias("count")  # pl.len() replaces the deprecated pl.count()
]))

Visualize univariate distributions

# For numeric variables
fig = px.histogram(data, x="numeric_column")
fig.show()

# For categorical variables
counts = data.group_by("category_column").agg(pl.len().alias("count")).sort("count", descending=True)
fig = px.bar(counts, x="category_column", y="count")
fig.show()

Explore relationships between variables

# Scatter plot for two numeric variables
fig = px.scatter(data, x="numeric_column1", y="numeric_column2")
fig.show()

# Add a third variable using color
fig = px.scatter(data, x="numeric_column1", y="numeric_column2", color="category_column")
fig.show()

# Boxplots for numeric vs categorical
fig = px.box(data, x="category_column", y="numeric_column")
fig.show()

Identify and investigate unusual observations

# Filter to outliers (e.g., beyond 3 standard deviations)
mean_expr = pl.col("numeric_column").mean()
std_expr = pl.col("numeric_column").std()

# Boolean expression: True where a value lies more than 3 standard deviations from the mean
is_outlier = (
    (pl.col("numeric_column") > mean_expr + 3 * std_expr)
    | (pl.col("numeric_column") < mean_expr - 3 * std_expr)
)

outliers = data.filter(is_outlier)
print(outliers)

# Visualize with outliers highlighted
data = data.with_columns(is_outlier.alias("is_outlier"))

fig = px.scatter(
    data, 
    x="numeric_column1", 
    y="numeric_column2", 
    color="is_outlier",
    color_discrete_map={True: "red", False: "blue"},
    title="Outlier Identification"
)
fig.show()

Transform variables if needed

# Log transformation for skewed data
data = data.with_columns(
    pl.col("skewed_column").log().alias("log_skewed_column")
)

# Before and after histograms
fig1 = px.histogram(data, x="skewed_column", title="Original Distribution")
fig2 = px.histogram(data, x="log_skewed_column", title="Log-Transformed Distribution")

fig1.show()
fig2.show()

1.5 Practical EDA Questions

When performing EDA, it’s helpful to have some guiding questions:

What type of variation occurs within my variables?
- What values are common? What values are rare?
- Are there any unexpected values or outliers?
- What’s the shape of the distribution?
What type of covariation occurs between my variables?
- How do variables relate to each other?
- Are there any clear patterns or relationships?
- Do these relationships make sense given the domain?
Are there interesting subgroups in the data?
- Do patterns change when you filter or group the data?
- Are there clusters or segments that behave differently?
What might explain the observed patterns?
- Can domain knowledge explain the relationships?
- What additional data might help understanding?
- What hypotheses can you form for further analysis?

2 Communicating with Data

Once you’ve explored and analyzed your data, the next crucial step is communicating your findings effectively. The goal is to help others understand what you’ve discovered without requiring them to repeat the entire analysis themselves. This section adapts concepts from the Communication chapter of R for Data Science (2e) to Polars and Plotly.

2.1 Creating Effective Visualizations

The key to effective visualization is clarity and purpose. Each visualization should answer a specific question or highlight a particular insight.

2.1.1 Improving Basic Plots

Let’s start with a simple scatter plot of diamond price vs. carat, and progressively improve it for better communication:

# Basic scatter plot
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    title="Diamond Price vs Carat"
)
fig.show()

Improved version with better labels and context:

fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight by Cut Quality",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    },
    hover_data=["clarity", "color", "depth", "table"]
)

fig.update_layout(
    title_x=0.5,  # Center the title
    legend_title_text="Cut Quality",
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    yaxis=dict(
        tickprefix="$",
        rangemode="tozero"
    )
)

fig.show()

2.1.2 Using Annotations

Annotations can help draw attention to important aspects of your visualization:

# Create a scatter plot with an annotation
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight with Annotation",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    }
)

# Add annotation highlighting an interesting pattern
fig.add_annotation(
    x=2,
    y=15000,
    text="Premium cut diamonds maintain<br>higher value at larger sizes", # <br> is an HTML element
    showarrow=True,
    arrowhead=1,
    ax=50,
    ay=-50
)

fig.show()

2.1.3 Multiple Views with Facets

Faceting allows you to create multiple views of the same data, split by categories:

# Create a faceted scatter plot
fig = px.scatter(
    diamonds.sample(n=2000), 
    x="carat", 
    y="price",
    color="color",
    facet_col="cut",
    title="Diamond Price vs Weight by Cut and Color",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "color": "Color (D is best)"
    }
)

# Update layout
fig.update_layout(
    title_x=0.5,
    legend_title_text="Diamond Color"
)

# Make facet column titles more readable
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
# this removes "Cut Quality" from "Cut Quality=Ideal" in the facet

fig.show()

2.1.4 Creating Small Multiples

Small multiples are a powerful way to compare patterns across different subgroups:

# Create small multiples with box plots
fig = px.box(
    diamonds, 
    x="cut", 
    y="price",
    color="cut",
    facet_col="clarity", 
    facet_col_wrap=4,
    title="Diamond Price Distribution by Cut and Clarity",
    labels={
        "cut": "Cut Quality",
        "price": "Price (USD)",
        "clarity": "Clarity"
    }
)

# Update layout
fig.update_layout(title_x=0.5)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.update_xaxes(tickangle=45)

fig.show()

2.2 Scales and Guides

The choice of scales and guides (legends, axes, etc.) can significantly impact how a visualization is interpreted.

2.2.1 Adjusting Scales

# Scatter plot with adjusted scales
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="depth",
    title="Diamond Price vs Weight with Color Scale",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "depth": "Depth Percentage"
    }
)

# Update color scale
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Depth %",
        tickvals=[55, 60, 65, 70],
        ticktext=["55%", "60%", "65%", "70%"]
    )
)

# Log scale for price
fig.update_yaxes(type="log")

fig.show()

2.2.2 Customizing Legends

# Plot with custom legend
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    symbol="clarity",
    title="Diamond Price vs Weight with Custom Legend",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "clarity": "Clarity"
    }
)

# Update legend
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title=""
    )
)

fig.show()

2.3 Themes and Typography

Consistent themes and typography help create a cohesive look across all your visualizations:

# Define WVU color theme
wvu_theme = {
    "layout": {
        "font": {"family": "Arial, sans-serif", "size": 12, "color": "#333333"},
        "title": {"font": {"size": 18, "color": "#002855"}},  # WVU Blue for title
        "plot_bgcolor": "#FFFFFF",
        "paper_bgcolor": "#FFFFFF",
        "colorway": ["#002855", "#EAAA00", "#0033A0", "#9AABBA", "#4F5B66"],  # WVU Blue, Gold, and complementary colors
        "xaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "yaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "legend": {"bgcolor": "#FFFFFF", "bordercolor": "#E5E5E5"},
        "margin": {"t": 60, "b": 60, "l": 50, "r": 50}
    }
}

cut_summary = diamonds.group_by('cut').agg(
    pl.mean('price').alias('avg_price'),
    pl.count('price').alias('count')
).sort('avg_price', descending=True)

# Apply theme to a plot
fig = px.bar(
    cut_summary,
    x='cut',
    y='avg_price',
    color='cut',
    custom_data=['count'],  # attach each bar's count to its own trace for the hover text
    title='Average Diamond Price by Cut (WVU Theme)',
    labels={'avg_price': 'Average Price ($)', 'cut': 'Cut Quality'},
    color_discrete_sequence=wvu_theme['layout']['colorway']
)

# Apply WVU theme
fig.update_layout(**wvu_theme['layout'])

# Add values on top of bars
fig.update_traces(
    texttemplate='$%{y:.0f}',
    textposition='outside',
    textfont=dict(size=12, color='#333333'),
    marker_line_width=1,
    marker_line_color='#FFFFFF'
)

# Add count as hover info
# color='cut' creates one trace per cut, so the counts come from custom_data above;
# assigning one shared array to every trace would show the first row's count on all bars
fig.update_traces(
    hovertemplate='<b>%{x}</b><br>Average Price: $%{y:.2f}<br>Count: %{customdata[0]}<extra></extra>'
)

# Show the figure
fig.show()

2.4 Tables for Communication

While visualizations are powerful, sometimes a well-designed table is the best way to communicate specific values or detailed information. The Great Tables package can help us create print-ready tables:

from great_tables import GT

# Create a summary table
summary_data = (
    diamonds
    .group_by(["cut", "color"])
    .agg([
        pl.mean("price").round(2).alias("avg_price"),
        pl.len().alias("count"),
        pl.median("carat").round(3).alias("median_carat")
    ])
    .sort(["cut", "color"])
)

# Create the GT table
gt_table = (
    GT(summary_data)
    # Set title and subtitle
    .tab_header(
        title="Diamond Summary Statistics",
        subtitle="Aggregated by Cut and Color"
    )
    # Format columns
    .fmt_currency(
        columns=["avg_price"],
        currency="USD"
    )
    .fmt_number(
        columns=["count"],
        use_seps=True
    )
    .fmt_number(
        columns=["median_carat"],
        decimals=3
    )
    # Rename columns for display
    .cols_label(
        cut="Cut Quality",
        color="Color Grade",
        avg_price="Average Price",
        count="Count",
        median_carat="Median Carat"
    )
    # Add source note
    .tab_source_note(
        source_note="Data summarized from diamonds dataset"
    )
)

# Display the table
gt_table

Diamond Summary Statistics
Aggregated by Cut and Color
Cut Quality	Color Grade	Average Price	Count	Median Carat
Fair	D	$4,291.06	163.00	0.900
Fair	E	$3,682.31	224.00	0.900
Fair	F	$3,827.00	312.00	0.900
Fair	G	$4,239.25	314.00	0.980
Fair	H	$5,135.68	303.00	1.010
Fair	I	$4,685.45	175.00	1.010
Fair	J	$4,975.66	119.00	1.030
Good	D	$3,405.38	662.00	0.700
Good	E	$3,423.64	933.00	0.700
Good	F	$3,495.75	909.00	0.710
Good	G	$4,123.48	871.00	0.900
Good	H	$4,276.25	702.00	0.900
Good	I	$5,078.53	522.00	1.000
Good	J	$4,574.17	307.00	1.020
Ideal	D	$2,629.09	2,834.00	0.500
Ideal	E	$2,597.55	3,903.00	0.500
Ideal	F	$3,374.94	3,826.00	0.530
Ideal	G	$3,720.71	4,884.00	0.540
Ideal	H	$3,889.33	3,115.00	0.700
Ideal	I	$4,451.97	2,093.00	0.740
Ideal	J	$4,918.19	896.00	1.030
Premium	D	$3,631.29	1,603.00	0.580
Premium	E	$3,538.91	2,337.00	0.580
Premium	F	$4,324.89	2,331.00	0.760
Premium	G	$4,500.74	2,924.00	0.755
Premium	H	$5,216.71	2,360.00	1.010
Premium	I	$5,946.18	1,428.00	1.140
Premium	J	$6,294.59	808.00	1.250
Very Good	D	$3,470.47	1,513.00	0.610
Very Good	E	$3,214.65	2,400.00	0.570
Very Good	F	$3,778.82	2,164.00	0.700
Very Good	G	$3,872.75	2,299.00	0.700
Very Good	H	$4,535.39	1,824.00	0.900
Very Good	I	$5,255.88	1,204.00	1.005
Very Good	J	$5,103.51	678.00	1.060
Data summarized from diamonds dataset

2.5 Combining Text and Visualizations

In a Quarto document, you can combine text explanations with your visualizations to create a coherent narrative:

## Diamond Price Analysis

Our analysis of the diamond dataset reveals several interesting patterns:

```{python}
# Load necessary libraries
import polars as pl
import plotly.express as px

# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Create a visualization
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight by Cut Quality"
)
fig.show()
```

As shown in the plot above, there is a strong positive relationship between a diamond's weight (carat) and its price. However, this relationship is moderated by the quality of the cut, with higher quality cuts generally commanding premium prices across the weight spectrum.

2.6 Principles of Effective Communication

Here are key principles to follow when communicating with data:

Know your audience: Adapt your visualization complexity and terminology to match your audience’s expertise.
Tell a story: Structure your communication as a narrative with a beginning, middle, and end.
Focus on the message: Every element in your visualization should support your main message.
Simplify: Remove chart junk and unnecessary elements that don’t contribute to understanding.
Choose appropriate visualizations: Select chart types that best represent your data and answer your specific questions.
Iterate: Create drafts, get feedback, and refine your visualizations before finalizing them.

2.7 Practical Communication Workflow

A practical workflow for communicating with data might look like this:

Identify your key findings: What are the 2-3 most important insights from your analysis?
Determine your audience: Who will be consuming your visualization? What do they already know? What do they need to learn?
Select appropriate visualization types: Choose the charts that best communicate your findings.
Create draft visualizations: Build initial versions of your plots with Plotly Express.
Refine and polish: Add appropriate titles, labels, colors, and annotations.
Integrate with narrative: Combine your visualizations with explanatory text in your Quarto document.
Review and revise: Get feedback and iterate as needed.

3 Conclusion

Effective exploratory data analysis and communication are essential skills for any data scientist. With Polars for data manipulation and Plotly Express for visualization, you have powerful tools to explore, understand, and communicate insights from your data.

This module has provided a Python-focused adaptation of the concepts from R4DS, demonstrating how to perform EDA and create effective visualizations using modern Python data science tools. By mastering these techniques, you’ll be better equipped to extract meaningful insights from data and effectively share those insights with others.

4 Further Resources

5 Exercises

These exercises will help you practice the concepts of Exploratory Data Analysis and Data Communication using Polars and Plotly Express. We’ll work with two datasets throughout these exercises: the diamonds dataset from our module and the Titanic dataset, which is widely used in data science education.

Learn more about diamonds here
Learn more about titanic here

import polars as pl
import plotly.express as px
import plotly.graph_objects as go

# Diamonds Dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pl.read_csv(titanic_url)

5.1 Basic EDA with Diamonds Dataset

Perform a basic exploratory data analysis on the diamonds dataset:

Create a summary of the dataset showing the count, mean, standard deviation, min, and max for all numeric columns
Calculate the correlation between price and all other numeric variables
Create a histogram of diamond prices with an appropriate number of bins
Create a box plot showing price distribution by cut quality
Identify the top 10 most expensive diamonds and display their attributes

5.2 Visualizing Relationships in the Diamonds Dataset

Explore relationships between variables in the diamonds dataset:

Create a scatter plot of carat vs. price with color representing cut quality
Implement at least two different techniques to handle overplotting in this visualization
Create a visualization showing how the price-to-carat relationship varies across different clarity categories
Create a heatmap showing the average price by cut and color
Design a visualization that effectively communicates which combination of diamond attributes tends to yield the highest value (price relative to carat weight)

5.3 Advanced Visualization Techniques for Diamonds

Apply more advanced visualization techniques to the diamonds dataset:

Create a small multiples (faceted) visualization showing the price-to-carat relationship across different cut and clarity combinations
Create a violin plot comparing the price distributions across cut categories
Design an interactive visualization that allows users to explore how different combinations of attributes affect diamond prices
Add appropriate annotations to highlight key insights in one of your visualizations
Create a custom theme for your plots that could be used consistently across a presentation or report

5.4 Initial EDA with Titanic Dataset

Perform an initial exploratory data analysis on the Titanic dataset:

Create a summary of the dataset, including the count of missing values for each column
Calculate the overall survival rate and visualize survival counts
Create a visualization showing survival rates by passenger class (Pclass)
Create a visualization showing survival rates by sex
Create a visualization showing the age distribution of passengers, with color indicating survival status

5.5 Investigating Survival Factors in the Titanic Dataset

Dig deeper into what factors influenced survival on the Titanic.

Create a visualization showing survival rates by passenger class (Pclass) and sex
Investigate if fare amount was related to survival chances
Explore if traveling with family members (SibSp + Parch > 0) affected survival rates
Create a visualization showing survival rates by age groups (e.g., children, adults, elderly)
Design a composite visualization that effectively communicates the most important factors that influenced survival

5.6 Effective Communication with the Titanic Dataset

Create presentation-quality visualizations that tell a story about the Titanic disaster.

Create a visualization that effectively communicates the “women and children first” policy
Design a visualization that shows how social class (indicated by passenger class) affected survival chances
Create a visualization that communicates how survival rates varied by the deck/location of the cabin (extract the deck from the cabin column)

Reuse

CC BY-NC-ND 4.0