06 | Data Visualization

Data Science with Polars and Plotly: EDA and Communication

Author: Mr. Ozan Ozbeker

1 Exploratory Data Analysis with Polars and Plotly

Exploratory Data Analysis (EDA) is a critical first step in any data analysis project. It involves examining the data to identify patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. This module adapts the EDA concepts from R for Data Science (2e) to the Python ecosystem, specifically using Polars for data manipulation and Plotly Express for visualization.

# Import required libraries
import polars as pl
import plotly.express as px
import plotly.graph_objects as go

1.1 The Dataset

For this module, we’ll use the “diamonds” dataset, similar to the one used in R4DS. This dataset contains information about ~54,000 diamonds, including their prices and various attributes.

# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Display first few rows
diamonds.head()
shape: (5, 10)
carat  cut        color  clarity  depth  table  price  x     y     z
f64    str        str    str      f64    f64    i64    f64   f64   f64
0.23   "Ideal"    "E"    "SI2"    61.5   55.0   326    3.95  3.98  2.43
0.21   "Premium"  "E"    "SI1"    59.8   61.0   326    3.89  3.84  2.31
0.23   "Good"     "E"    "VS1"    56.9   65.0   327    4.05  4.07  2.31
0.29   "Premium"  "I"    "VS2"    62.4   58.0   334    4.2   4.23  2.63
0.31   "Good"     "J"    "SI2"    63.3   58.0   335    4.34  4.35  2.75

Let’s examine the structure of our dataset:

# Get a quick summary of the dataset
diamonds.describe()
shape: (9, 11)
statistic     carat     cut          color    clarity  depth      table      price        x         y         z
str           f64       str          str      str      f64        f64        f64          f64       f64       f64
"count"       53940.0   "53940"      "53940"  "53940"  53940.0    53940.0    53940.0      53940.0   53940.0   53940.0
"null_count"  0.0       "0"          "0"      "0"      0.0        0.0        0.0          0.0       0.0       0.0
"mean"        0.79794   null         null     null     61.749405  57.457184  3932.799722  5.731157  5.734526  3.538734
"std"         0.474011  null         null     null     1.432621   2.234491   3989.439738  1.121761  1.142135  0.705699
"25%"         0.4       null         null     null     61.0       56.0       950.0        4.71      4.72      2.91
"50%"         0.7       null         null     null     61.8       57.0       2401.0       5.7       5.71      3.53
"75%"         1.04      null         null     null     62.5       59.0       5324.0       6.54      6.54      4.04
"max"         5.01      "Very Good"  "J"      "VVS2"   79.0       95.0       18823.0      10.74     58.9      31.8
diamonds.glimpse()
Rows: 53940
Columns: 10
$ carat   <f64> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23
$ cut     <str> 'Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good', 'Very Good', 'Very Good', 'Fair', 'Very Good'
$ color   <str> 'E', 'E', 'E', 'I', 'J', 'J', 'I', 'H', 'E', 'H'
$ clarity <str> 'SI2', 'SI1', 'VS1', 'VS2', 'SI2', 'VVS2', 'VVS1', 'SI1', 'VS2', 'VS1'
$ depth   <f64> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4
$ table   <f64> 55.0, 61.0, 65.0, 58.0, 58.0, 57.0, 57.0, 55.0, 61.0, 61.0
$ price   <i64> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338
$ x       <f64> 3.95, 3.89, 4.05, 4.2, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0
$ y       <f64> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05
$ z       <f64> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39

1.2 Data Visualization for EDA

1.2.1 Creating a Basic Plot

Let’s start with a simple visualization to examine the relationship between carat (weight) and price. Plotly Express creates plots quickly and has a built-in way of setting axis labels, but we can also fine-tune a plot with fig.update_layout(), where fig is the figure object returned by Plotly Express:

# Create a basic scatter plot
fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price",
    opacity=0.5,  # Make points semi-transparent
    title="Diamond Price vs Carat"
)

fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)

fig.show()

The plot reveals a positive relationship between a diamond’s weight and its price, but the relationship isn’t perfectly linear, and there’s significant variation.

1.2.2 Visualizing Distributions

Understanding the distribution of individual variables is an important part of EDA.

# Histogram of diamond prices
fig = px.histogram(
    diamonds, 
    x="price", 
    nbins=50,
    title="Distribution of Diamond Prices"
)

fig.update_layout(
    xaxis_title="Price (USD)",
    yaxis_title="Count"
)

fig.show()

We can also look at the distribution of categorical variables:

# Count of diamonds by cut quality
cut_counts = (
    diamonds
    .group_by("cut")
    .agg(pl.len().alias("count"))  # pl.len() replaces the deprecated pl.count()
    .sort("cut")  # alphabetical order, not quality order
)

fig = px.bar(
    cut_counts,
    x="cut",
    y="count",
    title="Number of Diamonds by Cut Quality"
)

fig.update_layout(
    xaxis_title="Cut Quality",
    yaxis_title="Count"
)

fig.show()

1.2.3 Visualizing Relationships

Let’s explore the relationship between multiple variables. For instance, how does the price-to-carat relationship change based on the diamond’s cut quality?

# Create a scatter plot with color showing cut quality
fig = px.scatter(
    diamonds, 
    x="carat", 
    y="price", 
    color="cut",
    opacity=0.6,
    title="Diamond Price vs Carat by Cut Quality"
)

fig.update_layout(
    xaxis_title="Weight (carats)",
    yaxis_title="Price (USD)"
)

fig.show()

We can also look at average prices by cut and color:

# Calculate average price by cut and color
avg_price_by_cut_color = (
    diamonds
    .group_by(["cut", "color"])
    .agg(pl.mean("price").alias("avg_price"))
    .sort(["cut", "color"])
)

# Create a heatmap
fig = px.density_heatmap(
    avg_price_by_cut_color,
    x="color", 
    y="cut", 
    z="avg_price",
    histfunc="avg",  # each cut/color cell has one row, so this shows avg_price as-is
    title="Average Diamond Price by Cut and Color"
)

fig.update_layout(
    xaxis_title="Color (D is best)",
    yaxis_title="Cut Quality"
)

fig.show()

1.2.4 Handling Overplotting

When working with large datasets like the diamonds dataset, overplotting can be an issue. Here are techniques to address this:

  1. Using transparency
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    opacity=0.2,
    title="Using Transparency to Handle Overplotting"
)
fig.show()
  2. Using smaller points
fig = px.scatter(
    diamonds,
    x="carat",
    y="price",
    title="Using Smaller Points to Handle Overplotting"
)
fig.update_traces(marker=dict(size=2))
fig.show()
  3. Using a 2D density plot (rectangular bins)
fig = px.density_heatmap(
    diamonds,
    x="carat",
    y="price",
    nbinsx=30,
    nbinsy=30,
    title="Using a 2D Density Plot to Handle Overplotting"
)
fig.show()
  4. Adding jitter for categorical variables
fig = px.strip(
    diamonds.sample(n=1000),  # plot a random subset; all ~54,000 points would overlap heavily
    x="cut",
    y="price",
    title="Using Jitter for Categorical Variables"
)
fig.show()

1.3 Patterns and Models

Visualizations help us identify patterns in the data, which can then inform our modeling approach. Let’s visualize a few more relationships:

# Create a boxplot of price by cut
fig = px.box(
    diamonds, 
    x="cut", 
    y="price",
    title="Diamond Price Distribution by Cut"
)
fig.show()
# Create a violin plot of price by cut
fig = px.violin(
    diamonds, 
    x="cut", 
    y="price", 
    box=True,
    title="Diamond Price Violin Plot by Cut"
)
fig.show()

1.4 Typical EDA Workflow

A typical EDA workflow with Polars and Plotly might look like this:

  1. Load and inspect the data

    # Load data
    data = pl.read_csv("your_dataset.csv")
    
    # Inspect structure
    data.head()
    data.schema
    data.shape
    
    # Check for missing values
    data.null_count()
  2. Compute summary statistics

    # Basic summary statistics
    data.describe()
    
    # Custom aggregations by group
    (data
    .group_by("category_column")
    .agg([
        pl.mean("numeric_column1").alias("mean_value"),
        pl.median("numeric_column1").alias("median_value"),
        pl.std("numeric_column1").alias("std_value"),
        pl.len().alias("count")
    ]))
  3. Visualize univariate distributions

    # For numeric variables
    fig = px.histogram(data, x="numeric_column")
    fig.show()
    
    # For categorical variables
    counts = data.group_by("category_column").agg(pl.len().alias("count")).sort("count", descending=True)
    fig = px.bar(counts, x="category_column", y="count")
    fig.show()
  4. Explore relationships between variables

    # Scatter plot for two numeric variables
    fig = px.scatter(data, x="numeric_column1", y="numeric_column2")
    fig.show()
    
    # Add a third variable using color
    fig = px.scatter(data, x="numeric_column1", y="numeric_column2", color="category_column")
    fig.show()
    
    # Boxplots for numeric vs categorical
    fig = px.box(data, x="category_column", y="numeric_column")
    fig.show()
  5. Identify and investigate unusual observations

    # Flag outliers (e.g., beyond 3 standard deviations from the mean)
    mean_expr = pl.col("numeric_column").mean()
    std_expr = pl.col("numeric_column").std()
    is_outlier = (
        (pl.col("numeric_column") > mean_expr + 3 * std_expr)
        | (pl.col("numeric_column") < mean_expr - 3 * std_expr)
    )
    
    outliers = data.filter(is_outlier)
    print(outliers)
    
    # Visualize with outliers highlighted
    data = data.with_columns(is_outlier.alias("is_outlier"))
    
    fig = px.scatter(
        data, 
        x="numeric_column1", 
        y="numeric_column2", 
        color="is_outlier",
        color_discrete_map={True: "red", False: "blue"},
        title="Outlier Identification"
    )
    fig.show()
  6. Transform variables if needed

    # Log transformation for skewed data
    data = data.with_columns(
        pl.col("skewed_column").log().alias("log_skewed_column")
    )
    
    # Before and after histograms
    fig1 = px.histogram(data, x="skewed_column", title="Original Distribution")
    fig2 = px.histogram(data, x="log_skewed_column", title="Log-Transformed Distribution")
    
    fig1.show()
    fig2.show()

1.5 Practical EDA Questions

When performing EDA, it’s helpful to have some guiding questions:

  1. What type of variation occurs within my variables?
    • What values are common? What values are rare?
    • Are there any unexpected values or outliers?
    • What’s the shape of the distribution?
  2. What type of covariation occurs between my variables?
    • How do variables relate to each other?
    • Are there any clear patterns or relationships?
    • Do these relationships make sense given the domain?
  3. Are there interesting subgroups in the data?
    • Do patterns change when you filter or group the data?
    • Are there clusters or segments that behave differently?
  4. What might explain the observed patterns?
    • Can domain knowledge explain the relationships?
    • What additional data might help understanding?
    • What hypotheses can you form for further analysis?

2 Communicating with Data

Once you’ve explored and analyzed your data, the next crucial step is communicating your findings effectively. The goal is to help others understand what you’ve discovered without requiring them to repeat the entire analysis themselves. This section adapts the concepts from the Communication chapter of R for Data Science (2e) to Polars and Plotly.

2.1 Creating Effective Visualizations

The key to effective visualization is clarity and purpose. Each visualization should answer a specific question or highlight a particular insight.

2.1.1 Improving Basic Plots

Let’s start with a simple scatter plot of diamond price vs. carat, and progressively improve it for better communication:

# Basic scatter plot
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    title="Diamond Price vs Carat"
)
fig.show()

Improved version with better labels and context:

fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight by Cut Quality",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    },
    hover_data=["clarity", "color", "depth", "table"]
)

fig.update_layout(
    title_x=0.5,  # Center the title
    legend_title_text="Cut Quality",
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    yaxis=dict(
        tickprefix="$",
        rangemode="tozero"
    )
)

fig.show()

2.1.2 Using Annotations

Annotations can help draw attention to important aspects of your visualization:

# Create a scatter plot with an annotation
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight with Annotation",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality"
    }
)

# Add annotation highlighting an interesting pattern
fig.add_annotation(
    x=2,
    y=15000,
    text="Premium cut diamonds maintain<br>higher value at larger sizes", # <br> is an HTML element
    showarrow=True,
    arrowhead=1,
    ax=50,
    ay=-50
)

fig.show()

2.1.3 Multiple Views with Facets

Faceting allows you to create multiple views of the same data, split by categories:

# Create a faceted scatter plot
fig = px.scatter(
    diamonds.sample(n=2000), 
    x="carat", 
    y="price",
    color="color",
    facet_col="cut",
    title="Diamond Price vs Weight by Cut and Color",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "color": "Color (D is best)"
    }
)

# Update layout
fig.update_layout(
    title_x=0.5,
    legend_title_text="Diamond Color"
)

# Make facet column titles more readable
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
# this removes "Cut Quality" from "Cut Quality=Ideal" in the facet

fig.show()

2.1.4 Creating Small Multiples

Small multiples are a powerful way to compare patterns across different subgroups:

# Create small multiples with box plots
fig = px.box(
    diamonds, 
    x="cut", 
    y="price",
    color="cut",
    facet_col="clarity", 
    facet_col_wrap=4,
    title="Diamond Price Distribution by Cut and Clarity",
    labels={
        "cut": "Cut Quality",
        "price": "Price (USD)",
        "clarity": "Clarity"
    }
)

# Update layout
fig.update_layout(title_x=0.5)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.update_xaxes(tickangle=45)

fig.show()

2.2 Scales and Guides

The choice of scales and guides (legends, axes, etc.) can significantly impact how a visualization is interpreted.

2.2.1 Adjusting Scales

# Scatter plot with adjusted scales
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="depth",
    title="Diamond Price vs Weight with Color Scale",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "depth": "Depth Percentage"
    }
)

# Update color scale
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Depth %",
        tickvals=[55, 60, 65, 70],
        ticktext=["55%", "60%", "65%", "70%"]
    )
)

# Log scale for price
fig.update_yaxes(type="log")

fig.show()

2.2.2 Customizing Legends

# Plot with custom legend
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    symbol="clarity",
    title="Diamond Price vs Weight with Custom Legend",
    labels={
        "carat": "Weight (carats)",
        "price": "Price (USD)",
        "cut": "Cut Quality",
        "clarity": "Clarity"
    }
)

# Update legend
fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        title=""
    )
)

fig.show()

2.3 Themes and Typography

Consistent themes and typography help create a cohesive look across all your visualizations:

# Define WVU color theme
wvu_theme = {
    "layout": {
        "font": {"family": "Arial, sans-serif", "size": 12, "color": "#333333"},
        "title": {"font": {"size": 18, "color": "#002855"}},  # WVU Blue for title
        "plot_bgcolor": "#FFFFFF",
        "paper_bgcolor": "#FFFFFF",
        "colorway": ["#002855", "#EAAA00", "#0033A0", "#9AABBA", "#4F5B66"],  # WVU Blue, Gold, and complementary colors
        "xaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "yaxis": {"gridcolor": "#E5E5E5", "zerolinecolor": "#E5E5E5"},
        "legend": {"bgcolor": "#FFFFFF", "bordercolor": "#E5E5E5"},
        "margin": {"t": 60, "b": 60, "l": 50, "r": 50}
    }
}

cut_summary = diamonds.group_by('cut').agg(
    pl.mean('price').alias('avg_price'),
    pl.count('price').alias('count')
).sort('avg_price', descending=True)

# Apply theme to a plot
fig = px.bar(
    cut_summary,
    x='cut',
    y='avg_price',
    color='cut',
    title='Average Diamond Price by Cut (WVU Theme)',
    labels={'avg_price': 'Average Price ($)', 'cut': 'Cut Quality'},
    color_discrete_sequence=wvu_theme['layout']['colorway']
)

# Apply WVU theme
fig.update_layout(**wvu_theme['layout'])

# Add values on top of bars
fig.update_traces(
    texttemplate='$%{y:.0f}',
    textposition='outside',
    textfont=dict(size=12, color='#333333'),
    marker_line_width=1,
    marker_line_color='#FFFFFF'
)

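# Add count as hover info. color='cut' creates one trace per cut,
# so attach each trace's own count rather than the full array.
count_by_cut = dict(zip(cut_summary['cut'], cut_summary['count']))
fig.for_each_trace(lambda trace: trace.update(
    customdata=[[count_by_cut[trace.name]]],
    hovertemplate='<b>%{x}</b><br>Average Price: $%{y:.2f}<br>Count: %{customdata[0]}<extra></extra>'
))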

# Show the figure
fig.show()

2.4 Tables for Communication

While visualizations are powerful, sometimes a well-designed table is the best way to communicate specific values or detailed information. The Great Tables package can help us create print-ready tables:

from great_tables import GT

# Create a summary table
summary_data = (
    diamonds
    .group_by(["cut", "color"])
    .agg([
        pl.mean("price").round(2).alias("avg_price"),
        pl.len().alias("count"),
        pl.median("carat").round(3).alias("median_carat")
    ])
    .sort(["cut", "color"])
)

# Create the GT table
gt_table = (
    GT(summary_data)
    # Set title and subtitle
    .tab_header(
        title="Diamond Summary Statistics",
        subtitle="Aggregated by Cut and Color"
    )
    # Format columns
    .fmt_currency(
        columns=["avg_price"],
        currency="USD"
    )
    .fmt_number(
        columns=["count"],
        decimals=0,  # counts are whole numbers
        use_seps=True
    )
    .fmt_number(
        columns=["median_carat"],
        decimals=3
    )
    # Rename columns for display
    .cols_label(
        cut="Cut Quality",
        color="Color Grade",
        avg_price="Average Price",
        count="Count",
        median_carat="Median Carat"
    )
    # Add source note
    .tab_source_note(
        source_note="Data summarized from diamonds dataset"
    )
)

# Display the table
gt_table
Diamond Summary Statistics
Aggregated by Cut and Color
Cut Quality  Color Grade  Average Price  Count  Median Carat
Fair         D            $4,291.06      163    0.900
Fair         E            $3,682.31      224    0.900
Fair         F            $3,827.00      312    0.900
Fair         G            $4,239.25      314    0.980
Fair         H            $5,135.68      303    1.010
Fair         I            $4,685.45      175    1.010
Fair         J            $4,975.66      119    1.030
Good         D            $3,405.38      662    0.700
Good         E            $3,423.64      933    0.700
Good         F            $3,495.75      909    0.710
Good         G            $4,123.48      871    0.900
Good         H            $4,276.25      702    0.900
Good         I            $5,078.53      522    1.000
Good         J            $4,574.17      307    1.020
Ideal        D            $2,629.09      2,834  0.500
Ideal        E            $2,597.55      3,903  0.500
Ideal        F            $3,374.94      3,826  0.530
Ideal        G            $3,720.71      4,884  0.540
Ideal        H            $3,889.33      3,115  0.700
Ideal        I            $4,451.97      2,093  0.740
Ideal        J            $4,918.19      896    1.030
Premium      D            $3,631.29      1,603  0.580
Premium      E            $3,538.91      2,337  0.580
Premium      F            $4,324.89      2,331  0.760
Premium      G            $4,500.74      2,924  0.755
Premium      H            $5,216.71      2,360  1.010
Premium      I            $5,946.18      1,428  1.140
Premium      J            $6,294.59      808    1.250
Very Good    D            $3,470.47      1,513  0.610
Very Good    E            $3,214.65      2,400  0.570
Very Good    F            $3,778.82      2,164  0.700
Very Good    G            $3,872.75      2,299  0.700
Very Good    H            $4,535.39      1,824  0.900
Very Good    I            $5,255.88      1,204  1.005
Very Good    J            $5,103.51      678    1.060
Data summarized from diamonds dataset

2.5 Combining Text and Visualizations

In a Quarto document, you can combine text explanations with your visualizations to create a coherent narrative:

## Diamond Price Analysis

Our analysis of the diamond dataset reveals several interesting patterns:

```{python}
# Load necessary libraries
import polars as pl
import plotly.express as px

# Load the diamonds dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Create a visualization
fig = px.scatter(
    diamonds.sample(n=1000), 
    x="carat", 
    y="price",
    color="cut",
    title="Diamond Price vs Weight by Cut Quality"
)
fig.show()
```

As shown in the plot above, there is a strong positive relationship between a diamond's weight (carat) and its price. However, this relationship is moderated by the quality of the cut, with higher quality cuts generally commanding premium prices across the weight spectrum.

2.6 Principles of Effective Communication

Here are key principles to follow when communicating with data:

  1. Know your audience: Adapt your visualization complexity and terminology to match your audience’s expertise.

  2. Tell a story: Structure your communication as a narrative with a beginning, middle, and end.

  3. Focus on the message: Every element in your visualization should support your main message.

  4. Simplify: Remove chart junk and unnecessary elements that don’t contribute to understanding.

  5. Choose appropriate visualizations: Select chart types that best represent your data and answer your specific questions.

  6. Iterate: Create drafts, get feedback, and refine your visualizations before finalizing them.

2.7 Practical Communication Workflow

A practical workflow for communicating with data might look like this:

  1. Identify your key findings: What are the 2-3 most important insights from your analysis?

  2. Determine your audience: Who will be consuming your visualization? What do they already know? What do they need to learn?

  3. Select appropriate visualization types: Choose the charts that best communicate your findings.

  4. Create draft visualizations: Build initial versions of your plots with Plotly Express.

  5. Refine and polish: Add appropriate titles, labels, colors, and annotations.

  6. Integrate with narrative: Combine your visualizations with explanatory text in your Quarto document.

  7. Review and revise: Get feedback and iterate as needed.

3 Conclusion

Effective exploratory data analysis and communication are essential skills for any data scientist. With Polars for data manipulation and Plotly Express for visualization, you have powerful tools to explore, understand, and communicate insights from your data.

This module has provided a Python-focused adaptation of the concepts from R4DS, demonstrating how to perform EDA and create effective visualizations using modern Python data science tools. By mastering these techniques, you’ll be better equipped to extract meaningful insights from data and effectively share those insights with others.

4 Further Resources

5 Exercises

These exercises will help you practice the concepts of Exploratory Data Analysis and Data Communication using Polars and Plotly Express. We’ll work with two datasets throughout these exercises: the diamonds dataset from our module and the Titanic dataset, which is widely used in data science education.

  • Learn more about diamonds here
  • Learn more about titanic here
import polars as pl
import plotly.express as px
import plotly.graph_objects as go

# Diamonds Dataset
diamonds = pl.read_csv("data/diamonds.csv")

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pl.read_csv(titanic_url)

5.1 Basic EDA with Diamonds Dataset

Perform a basic exploratory data analysis on the diamonds dataset:

  1. Create a summary of the dataset showing the count, mean, standard deviation, min, and max for all numeric columns
  2. Calculate the correlation between price and all other numeric variables
  3. Create a histogram of diamond prices with an appropriate number of bins
  4. Create a box plot showing price distribution by cut quality
  5. Identify the top 10 most expensive diamonds and display their attributes

5.2 Visualizing Relationships in the Diamonds Dataset

Explore relationships between variables in the diamonds dataset:

  1. Create a scatter plot of carat vs. price with color representing cut quality
  2. Implement at least two different techniques to handle overplotting in this visualization
  3. Create a visualization showing how the price-to-carat relationship varies across different clarity categories
  4. Create a heatmap showing the average price by cut and color
  5. Design a visualization that effectively communicates which combination of diamond attributes tends to yield the highest value (price relative to carat weight)

5.3 Advanced Visualization Techniques for Diamonds

Apply more advanced visualization techniques to the diamonds dataset:

  1. Create a small multiples (faceted) visualization showing the price-to-carat relationship across different cut and clarity combinations
  2. Create a violin plot comparing the price distributions across cut categories
  3. Design an interactive visualization that allows users to explore how different combinations of attributes affect diamond prices
  4. Add appropriate annotations to highlight key insights in one of your visualizations
  5. Create a custom theme for your plots that could be used consistently across a presentation or report

5.4 Initial EDA with Titanic Dataset

Perform an initial exploratory data analysis on the Titanic dataset:

  1. Create a summary of the dataset, including the count of missing values for each column
  2. Calculate the overall survival rate and visualize survival counts
  3. Create a visualization showing survival rates by passenger class (Pclass)
  4. Create a visualization showing survival rates by sex
  5. Create a visualization showing the age distribution of passengers, with color indicating survival status

5.5 Investigating Survival Factors in the Titanic Dataset

Dig deeper into what factors influenced survival on the Titanic.

  1. Create a visualization showing survival rates by passenger class (Pclass) and sex
  2. Investigate if fare amount was related to survival chances
  3. Explore if traveling with family members (SibSp + Parch > 0) affected survival rates
  4. Create a visualization showing survival rates by age groups (e.g., children, adults, elderly)
  5. Design a composite visualization that effectively communicates the most important factors that influenced survival

5.6 Effective Communication with the Titanic Dataset

Create presentation-quality visualizations that tell a story about the Titanic disaster.

  1. Create a visualization that effectively communicates the “women and children first” policy
  2. Design a visualization that shows how social class (indicated by passenger class) affected survival chances
  3. Create a visualization that communicates how survival rates varied by the deck/location of the cabin (extract the deck from the cabin column)