07 | Data Wrangling

Data Science with Polars

Author

Mr. Ozan Ozbeker

1 Overview

Data wrangling is the process of transforming and mapping data from one “raw” form into another format to make it more valuable and appropriate for analysis. This module introduces you to the core concepts of the Polars API that will help you efficiently wrangle your data.

In your data journey, you’ll likely spend a significant amount of time preparing data - selecting, filtering, transforming, joining, and reshaping it before analysis. Polars provides a powerful interface for these tasks with its expressive API built around the concept of expressions and contexts. By understanding these fundamental concepts, you’ll be able to write efficient and readable code that makes the most of Polars’ optimization capabilities.

This module will guide you through Polars data structures, expression-based operations, and various transformation techniques. You’ll learn how to manipulate strings, handle missing data, join datasets, and concatenate dataframes - essential skills for any data science workflow.

2 Polars data types and structures

2.1 Data types

Polars supports a variety of data types that fall broadly under the following categories:

  • Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
  • Nested data types: lists, structs, and arrays.
  • Temporal: dates, datetimes, times, and time deltas.
  • Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.

All types support missing values represented by the special value null. You can also find a full table with all data types supported in the Polars documentation.
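
As a quick illustration, here is a minimal sketch (the column names are made up for this example) that builds a dataframe touching several of these categories and inspects the inferred schema:

import polars as pl
from datetime import date

pl.DataFrame(
    {
        "ints": [1, 2],                     # numeric (Int64)
        "lists": [[1, 2], [3]],             # nested (List[Int64])
        "dates": [date(2024, 1, 1), None],  # temporal, with a null
        "flags": [True, False],             # miscellaneous (Boolean)
    }
).schema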

2.2 Series

The core data structures provided by Polars are series and dataframes. A series is a 1-dimensional homogeneous data structure. “Homogeneous” means that all elements inside a series have the same data type. The snippet below shows how to create a named series:

import polars as pl

s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)
shape: (5,)
Series: 'ints' [i64]
[
    1
    2
    3
    4
    5
]

When creating a series, Polars will infer the data type from the values you provide. You can specify a concrete data type to override the inference mechanism:

s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1.dtype, s2.dtype)
Int64 UInt64

2.3 Dataframe

A dataframe is a 2-dimensional heterogeneous data structure that contains uniquely named series. By holding your data in a dataframe you will be able to use the Polars API to write queries that manipulate your data.

from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

df
shape: (4, 4)
name birthdate weight height
str date f64 f64
"Alice Archer" 1997-01-10 57.9 1.56
"Ben Brown" 1985-02-15 72.5 1.77
"Chloe Cooper" 1983-03-22 53.6 1.65
"Daniel Donovan" 1981-04-30 83.1 1.75

2.4 Expressions and contexts

Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts, which will be introduced here, are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.

2.4.1 Expressions

In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks to build more complex expressions. Here is an example of a Polars expression:

pl.col("weight") / (pl.col("height") ** 2)
[(col("weight")) / (col("height").pow([dyn int: 2]))]

As you might be able to guess, this expression takes a column named “weight” and divides its values by the square of the values in a column “height”, computing a person’s BMI.

The code above expresses an abstract computation that we can save in a variable, manipulate further, or just print:

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]

Because expressions are lazy, no computations have taken place yet. That’s what we need contexts for.

2.4.2 Contexts

Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. The four most common contexts are:

  • select: Choose columns or create new columns based on expressions
  • with_columns: Add new columns or replace existing ones
  • filter: Keep rows that match specific conditions
  • group_by: Group data by common values for aggregation
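
To make the difference concrete, here is a minimal sketch that runs the BMI expression from the previous section in each of the four contexts, assuming the df with name, birthdate, weight, and height defined earlier:

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)

df.select(bmi_expr.alias("bmi"))        # only the computed column
df.with_columns(bmi_expr.alias("bmi"))  # original columns plus "bmi"
df.filter(bmi_expr > 23)                # keep rows whose BMI exceeds 23
df.group_by(pl.col("height") < 1.7).agg(bmi_expr.mean())  # mean BMI per group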

3 Expression expansion

In the following example, there are two grouping expressions and three aggregating expressions but the resulting dataframe will have six columns instead of five:

result = (
    df.group_by(
        (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
        (pl.col("height") < 1.7).alias("short?"),
    )
    .agg(
        pl.len(),
        pl.col("height").max().alias("tallest"),
        pl.col("weight", "height").mean().name.prefix("avg_"),
    )
)

result
shape: (3, 6)
decade short? len tallest avg_weight avg_height
i32 bool u32 f64 f64 f64
1980 true 1 1.65 53.6 1.65
1980 false 2 1.77 77.8 1.76
1990 true 1 1.56 57.9 1.56

If we look closely, the last aggregating expression mentioned two different columns: “weight” and “height”.

Polars expressions support a feature called expression expansion. Expression expansion is like a shorthand notation for when you want to apply the same transformation to multiple columns. As we have seen, the expression

pl.col("weight", "height").mean().name.prefix("avg_")
.rename_alias(cols(["weight", "height"]).mean())

will compute the mean value of the columns “weight” and “height” and will rename them as “avg_weight” and “avg_height”, respectively. In fact, the expression above is equivalent to using the two following expressions:

[
    pl.col("weight").mean().alias("avg_weight"),
    pl.col("height").mean().alias("avg_height"),
]
[<Expr ['col("weight").mean().alias("av…'] at 0x19201400F80>,
 <Expr ['col("height").mean().alias("av…'] at 0x19201401010>]

In this case, this expression expands into two independent expressions that Polars can execute in parallel. In other cases, we may not be able to know in advance how many independent expressions an expression will unfold into.

Consider this simple but illustrative example:

(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
.rename_alias([(dtype_columns([Float64])) * (dyn float: 1.1)])

This expression will multiply all columns with data type Float64 by 1.1. The number of columns this applies to depends on the schema of each dataframe. In the case of the dataframe we have been using, it applies to two columns:

expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
result
shape: (4, 2)
weight*1.1 height*1.1
f64 f64
63.69 1.716
79.75 1.947
58.96 1.815
91.41 1.925

In the case of the dataframe df2 below, the same expression expands to 0 columns because no column has the data type Float64:

df2 = pl.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "letters": ["A", "B", "C", "D"],
    }
)
result = df2.select(expr)
result
shape: (0, 0)

It is equally easy to imagine a scenario where the same expression would expand to dozens of columns.

This feature isn’t just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions, which means you can write one single expression and reuse the computation that it represents.

3.1 The function col

The function col is the most common way of making use of expression expansion features in Polars.

3.1.1 Explicit expansion by column name

The simplest form of expression expansion happens when you provide multiple column names to the function col.

The example below uses a single function col with multiple column names to convert the values in USD to EUR:

df = pl.DataFrame(
    {  # As of 14th October 2024, ~3pm UTC
        "ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
        "company_name": ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
        "price": [229.9, 138.93, 420.56, 166.41, 188.4],
        "day_high": [231.31, 139.6, 424.04, 167.62, 189.83],
        "day_low": [228.6, 136.3, 417.52, 164.78, 188.44],
        "year_high": [237.23, 140.76, 468.35, 193.31, 201.2],
        "year_low": [164.08, 39.23, 324.39, 121.46, 118.35],
    }
)

eur_usd_rate = 1.09  # As of 14th October 2024

result = df.with_columns(
    (
        pl.col(
            "price",
            "day_high",
            "day_low",
            "year_high",
            "year_low",
        )
        / eur_usd_rate
    ).round(2)
)

result
shape: (5, 7)
ticker company_name price day_high day_low year_high year_low
str str f64 f64 f64 f64 f64
"AAPL" "Apple" 210.92 212.21 209.72 217.64 150.53
"NVDA" "NVIDIA" 127.46 128.07 125.05 129.14 35.99
"MSFT" "Microsoft" 385.83 389.03 383.05 429.68 297.61
"GOOG" "Alphabet (Google)" 152.67 153.78 151.17 177.35 111.43
"AMZN" "Amazon" 172.84 174.16 172.88 184.59 108.58

When you list the column names explicitly, you can predict what the expression will expand to. In this case, the expression that does the currency conversion is expanded to a list of five expressions:

exprs = [
    (pl.col("price") / eur_usd_rate).round(2),
    (pl.col("day_high") / eur_usd_rate).round(2),
    (pl.col("day_low") / eur_usd_rate).round(2),
    (pl.col("year_high") / eur_usd_rate).round(2),
    (pl.col("year_low") / eur_usd_rate).round(2),
]

result2 = df.with_columns(exprs)
print(result.equals(result2))
True

3.1.2 Expansion by data type

We had to type five column names in the previous example but the function col can also conveniently accept one or more data types. If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.

The example below performs the exact same computation as before:

result = df.with_columns((pl.col(pl.Float64) / eur_usd_rate).round(2))
result
shape: (5, 7)
ticker company_name price day_high day_low year_high year_low
str str f64 f64 f64 f64 f64
"AAPL" "Apple" 210.92 212.21 209.72 217.64 150.53
"NVDA" "NVIDIA" 127.46 128.07 125.05 129.14 35.99
"MSFT" "Microsoft" 385.83 389.03 383.05 429.68 297.61
"GOOG" "Alphabet (Google)" 152.67 153.78 151.17 177.35 111.43
"AMZN" "Amazon" 172.84 174.16 172.88 184.59 108.58

When we use a data type with expression expansion we cannot know, beforehand, how many columns a single expression will expand to. We need the schema of the input dataframe to determine the final list of expressions that will be applied.
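
Since the schema drives the expansion, one way to predict it is to inspect the schema directly. A minimal sketch:

# df.schema maps column names to data types, so we can list the
# columns that pl.col(pl.Float64) would expand to.
float_cols = [name for name, dtype in df.schema.items() if dtype == pl.Float64]
print(float_cols)  # ['price', 'day_high', 'day_low', 'year_high', 'year_low']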

If we weren’t sure about whether the price columns were of the type Float64 or Float32, we could specify both data types:

result2 = df.with_columns(
    (
        pl.col(
            pl.Float32,
            pl.Float64,
        )
        / eur_usd_rate
    ).round(2)
)
print(result.equals(result2))
True

3.1.3 Expansion by pattern matching

You can also use regular expressions to specify patterns that are used to match the column names. To distinguish between a regular column name and expansion by pattern matching, regular expressions start and end with ^ and $, respectively. This also means that the pattern must match against the whole column name string.

Regular expressions can be mixed with regular column names:

result = df.select(pl.col("ticker", "^.*_high$", "^.*_low$"))

result
shape: (5, 5)
ticker day_high year_high day_low year_low
str f64 f64 f64 f64
"AAPL" 231.31 237.23 228.6 164.08
"NVDA" 139.6 140.76 136.3 39.23
"MSFT" 424.04 468.35 417.52 324.39
"GOOG" 167.62 193.31 164.78 121.46
"AMZN" 189.83 201.2 188.44 118.35

3.1.4 Arguments cannot be of mixed types

In Python, the function col accepts an arbitrary number of strings (as column names or as regular expressions) or an arbitrary number of data types, but you cannot mix both in the same function call:

df.select(pl.col("ticker", pl.Float64))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 df.select(pl.col("ticker", pl.Float64))

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\functions\col.py:329, in Col.__call__(self, name, *more_names)
    176 def __call__(
    177     self,
    178     name: (
   (...)    185     *more_names: str | PolarsDataType | PythonDataType,
    186 ) -> Expr:
    187     """
    188     Create one or more expressions representing columns in a DataFrame.
    189 
   (...)    327     └─────┴───────────┴─────┘
    328     """
--> 329     return _create_col(name, *more_names)

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\functions\col.py:47, in _create_col(name, *more_names)
     45     names_str = [name]
     46     names_str.extend(more_names)  # type: ignore[arg-type]
---> 47     return wrap_expr(plr.cols(names_str))
     48 elif is_polars_dtype(name):
     49     dtypes = [name]

TypeError: argument 'names': 'DataTypeClass' object cannot be converted to 'PyString'

3.2 Selecting all columns

Polars provides the function all as shorthand notation to refer to all columns of a dataframe:

result = df.select(pl.all())
print(result.equals(df))
True

The function all is syntactic sugar for col("*"), but since the argument "*" is a special case and all reads more like English, the usage of all is preferred.

3.3 Excluding columns

Polars also provides a mechanism to exclude certain columns from expression expansion. For that, you use the function exclude, which accepts exactly the same types of arguments as col:

result = df.select(pl.all().exclude("^day_.*$"))
result
shape: (5, 5)
ticker company_name price year_high year_low
str str f64 f64 f64
"AAPL" "Apple" 229.9 237.23 164.08
"NVDA" "NVIDIA" 138.93 140.76 39.23
"MSFT" "Microsoft" 420.56 468.35 324.39
"GOOG" "Alphabet (Google)" 166.41 193.31 121.46
"AMZN" "Amazon" 188.4 201.2 118.35

3.4 Column renaming

By default, when you apply an expression to a column, the result keeps the same name as the original column.

Preserving the column name can be semantically wrong and in certain cases Polars may even raise an error if duplicate names occur:

gbp_usd_rate = 1.31  # As of 14th October 2024

df.select(
    pl.col("price") / gbp_usd_rate,  # This would be named "price"...
    pl.col("price") / eur_usd_rate,  # And so would this.
)
---------------------------------------------------------------------------
DuplicateError                            Traceback (most recent call last)
Cell In[21], line 3
      1 gbp_usd_rate = 1.31  # As of 14th October 2024
----> 3 df.select(
      4     pl.col("price") / gbp_usd_rate,  # This would be named "price"...
      5     pl.col("price") / eur_usd_rate,  # And so would this.
      6 )

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\dataframe\frame.py:9633, in DataFrame.select(self, *exprs, **named_exprs)
   9533 def select(
   9534     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   9535 ) -> DataFrame:
   9536     """
   9537     Select columns from this DataFrame.
   9538 
   (...)   9631     └──────────────┘
   9632     """
-> 9633     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\_utils\deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     89         kwargs["engine"] = "in-memory"
     91     del kwargs["streaming"]
---> 93 return function(*args, **kwargs)

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\lazyframe\frame.py:2188, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs)
   2186 # Only for testing purposes
   2187 callback = _kwargs.get("post_opt_callback", callback)
-> 2188 return wrap_df(ldf.collect(engine, callback))

DuplicateError: the name 'price' is duplicate

It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.

To prevent errors like this, and to allow users to rename their columns when appropriate, Polars provides a series of functions that let you change the name of a column or a group of columns.

3.4.1 Renaming a single column with alias

The function alias has already appeared throughout this module; it lets you rename a single column:

result = df.select(
    (pl.col("price") / gbp_usd_rate).alias("price (GBP)"),
    (pl.col("price") / eur_usd_rate).alias("price (EUR)"),
)

3.4.2 Prefixing and suffixing column names

When using expression expansion you cannot use the function alias because it is designed to rename a single column.

When it suffices to add a static prefix or a static suffix to the existing names, we can use the functions prefix and suffix from the namespace name:

result = df.select(
    (pl.col("^year_.*$") / eur_usd_rate).name.prefix("in_eur_"),
    (pl.col("day_high", "day_low") / gbp_usd_rate).name.suffix("_gbp"),
)
result
shape: (5, 4)
in_eur_year_high in_eur_year_low day_high_gbp day_low_gbp
f64 f64 f64 f64
217.642202 150.53211 176.572519 174.503817
129.137615 35.990826 106.564885 104.045802
429.678899 297.605505 323.694656 318.717557
177.348624 111.431193 127.954198 125.78626
184.587156 108.577982 144.908397 143.847328

3.4.3 Dynamic name replacement

If a static prefix/suffix is not enough, the namespace name also provides the function map, which takes a callable that receives the old column names and produces the new ones:

# There is also `.name.to_uppercase`, so this usage of `.map` is moot.
result = df.select(pl.all().name.map(str.upper))
result
shape: (5, 7)
TICKER COMPANY_NAME PRICE DAY_HIGH DAY_LOW YEAR_HIGH YEAR_LOW
str str f64 f64 f64 f64 f64
"AAPL" "Apple" 229.9 231.31 228.6 237.23 164.08
"NVDA" "NVIDIA" 138.93 139.6 136.3 140.76 39.23
"MSFT" "Microsoft" 420.56 424.04 417.52 468.35 324.39
"GOOG" "Alphabet (Google)" 166.41 167.62 164.78 193.31 121.46
"AMZN" "Amazon" 188.4 189.83 188.44 201.2 118.35

4 Expressions Deep Dive

In this section we will focus on exploring the types of expressions that Polars offers. Each section gives an overview of what they do and provides additional examples.

4.1 Basic Operations

This section shows how to do basic operations on dataframe columns, such as basic arithmetic, comparisons, and other general-purpose operations. We will use the following dataframe for the examples that follow:

import numpy as np

np.random.seed(42)  # For reproducibility.

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "A", "B"],
    }
)

df
shape: (5, 4)
nrs names random groups
i64 str f64 str
1 "foo" 0.37454 "A"
2 "ham" 0.950714 "A"
3 "spam" 0.731994 "B"
null "egg" 0.598658 "A"
5 "spam" 0.156019 "B"

4.1.1 Basic arithmetic

Polars supports basic arithmetic between series of the same length, or between series and literals. When literals are mixed with series, the literals are broadcast to match the length of the series they are being used with.

result = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
    (pl.col("nrs") ** 2).alias("nrs ** 2"),
    (pl.col("nrs") % 3).alias("nrs % 3"),
)

result
shape: (5, 6)
nrs + 5 nrs - 5 nrs * random nrs / random nrs ** 2 nrs % 3
i64 i64 f64 f64 i64 i64
6 -4 0.37454 2.669941 1 1
7 -3 1.901429 2.103681 4 2
8 -2 2.195982 4.098395 9 0
null null null null null null
10 0 0.780093 32.047453 25 2

The example above shows that when an arithmetic operation takes null as one of its operands, the result is null.

Polars uses operator overloading to allow you to use your language’s native arithmetic operators within your expressions. If you prefer, in Python you can use the corresponding named functions, as the snippet below demonstrates:

# Python only:
result_named_operators = df.select(
    (pl.col("nrs").add(5)).alias("nrs + 5"),
    (pl.col("nrs").sub(5)).alias("nrs - 5"),
    (pl.col("nrs").mul(pl.col("random"))).alias("nrs * random"),
    (pl.col("nrs").truediv(pl.col("random"))).alias("nrs / random"),
    (pl.col("nrs").pow(2)).alias("nrs ** 2"),
    (pl.col("nrs").mod(3)).alias("nrs % 3"),
)

print(result.equals(result_named_operators))
True

4.1.2 Comparisons

Like with arithmetic operations, Polars supports comparisons via the overloaded operators or named functions:

result = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),  # .gt
    (pl.col("nrs") >= 3).alias("nrs >= 3"),  # ge
    (pl.col("random") < 0.2).alias("random < .2"),  # .lt
    (pl.col("random") <= 0.5).alias("random <= .5"),  # .le
    (pl.col("nrs") != 1).alias("nrs != 1"),  # .ne
    (pl.col("nrs") == 1).alias("nrs == 1"),  # .eq
)

result
shape: (5, 6)
nrs > 1 nrs >= 3 random < .2 random <= .5 nrs != 1 nrs == 1
bool bool bool bool bool bool
false false false true false true
true false false false true false
true true false false true false
null null false false null null
true true true true true false

4.1.3 Boolean operations

In Python, you may use the operators &, |, and ~, for the Boolean operations “and”, “or”, and “not”, respectively, or the functions of the same name:

# Boolean operators & | ~
result = df.select(
    ((~pl.col("nrs").is_null()) & (pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5) | (pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)

# Corresponding named functions `and_`, `or_`, and `not_`.
result2 = df.select(
    (pl.col("nrs").is_null().not_().and_(pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5).or_(pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result.equals(result2))
True

4.1.4 Counting (unique) values

Polars has two functions to count the number of unique values in a series. The function n_unique can be used to count the exact number of unique values in a series. However, for very large data sets, this operation can be quite slow. In those cases, if an approximation is good enough, you can use the function approx_n_unique that uses the algorithm HyperLogLog++ to estimate the result.

The example below shows a series where the approx_n_unique estimate is off by roughly 1.5%:

long_df = pl.DataFrame({"numbers": np.random.randint(0, 100_000, 100_000)})

result = long_df.select(
    pl.col("numbers").n_unique().alias("n_unique"),
    pl.col("numbers").approx_n_unique().alias("approx_n_unique"),
)

result
shape: (1, 2)
n_unique approx_n_unique
u32 u32
63218 64141

You can get more information about the unique values and their counts with the function value_counts, which Polars also provides:

result = df.select(
    pl.col("names").value_counts().alias("value_counts"),
)

result
shape: (4, 1)
value_counts
struct[2]
{"foo",1}
{"ham",1}
{"spam",2}
{"egg",1}

4.2 Casting

Casting converts the underlying data type of a column to a new one. Casting is available through the function cast.

Let’s take a look at the following dataframe which contains both integers and floating point numbers:

df = pl.DataFrame(
    {
        "integers": [1, 2, 3],
        "big_integers": [10000002, 2, 30000003],
        "floats": [4.0, 5.8, -6.3],
    }
)

df
shape: (3, 3)
integers big_integers floats
i64 i64 f64
1 10000002 4.0
2 2 5.8
3 30000003 -6.3

To perform casting operations between floats and integers, or vice versa, we use the function cast:

result = df.select(
    pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
    pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
)

result
shape: (3, 2)
integers_as_floats floats_as_integers
f32 i32
1.0 4
2.0 5
3.0 -6

Strings that represent numbers can be converted to the appropriate data types via casting. The opposite conversion is also possible:

df = pl.DataFrame(
    {
        "integers_as_strings": ["1", "2", "3"],
        "floats_as_strings": ["4.0", "5.8", "-6.3"],
        "floats": [4.0, 5.8, -6.3],
    }
)

result = df.select(
    pl.col("integers_as_strings").cast(pl.Int32),
    pl.col("floats_as_strings").cast(pl.Float64),
    pl.col("floats").cast(pl.String),
)

result
shape: (3, 3)
integers_as_strings floats_as_strings floats
i32 f64 str
1 4.0 "4.0"
2 5.8 "5.8"
3 -6.3 "-6.3"

If the column contains a non-numerical value, or a poorly formatted one, Polars will throw an error with details on the conversion failure. You can set strict=False to circumvent the error and get a null value instead.

df = pl.DataFrame(
    {
        "floats": ["4.0", "5.8", "- 6 . 3"],
    }
)

df.select(pl.col("floats").cast(pl.Float64))
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[35], line 7
      1 df = pl.DataFrame(
      2     {
      3         "floats": ["4.0", "5.8", "- 6 . 3"],
      4     }
      5 )
----> 7 df.select(pl.col("floats").cast(pl.Float64))

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\dataframe\frame.py:9633, in DataFrame.select(self, *exprs, **named_exprs)
   9533 def select(
   9534     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   9535 ) -> DataFrame:
   9536     """
   9537     Select columns from this DataFrame.
   9538 
   (...)   9631     └──────────────┘
   9632     """
-> 9633     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\_utils\deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     89         kwargs["engine"] = "in-memory"
     91     del kwargs["streaming"]
---> 93 return function(*args, **kwargs)

File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\lazyframe\frame.py:2188, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs)
   2186 # Only for testing purposes
   2187 callback = _kwargs.get("post_opt_callback", callback)
-> 2188 return wrap_df(ldf.collect(engine, callback))

InvalidOperationError: conversion from `str` to `f64` failed in column 'floats' for 1 out of 3 values: ["- 6 . 3"]
df.select(pl.col("floats").cast(pl.Float64, strict=False))
shape: (3, 1)
floats
f64
4.0
5.8
null

4.2.1 Parsing / formatting temporal data types

All temporal data types are represented internally as the number of time units elapsed since a reference moment, usually referred to as the epoch. For example, values of the data type Date are stored as the number of days since the epoch. For the data type Datetime the time unit is the microsecond (us) and for Time the time unit is the nanosecond (ns).

Casting between numerical types and temporal data types is allowed and exposes this relationship:

from datetime import date, datetime, time

df = pl.DataFrame(
    {
        "date": [
            date(1970, 1, 1),  # epoch
            date(1970, 1, 10),  # 9 days later
        ],
        "datetime": [
            datetime(1970, 1, 1, 0, 0, 0),  # epoch
            datetime(1970, 1, 1, 0, 1, 0),  # 1 minute later
        ],
        "time": [
            time(0, 0, 0),  # reference time
            time(0, 0, 1),  # 1 second later
        ],
    }
)

result = df.select(
    pl.col("date").cast(pl.Int64).alias("days_since_epoch"),
    pl.col("datetime").cast(pl.Int64).alias("us_since_epoch"),
    pl.col("time").cast(pl.Int64).alias("ns_since_midnight"),
)

result
shape: (2, 3)
days_since_epoch us_since_epoch ns_since_midnight
i64 i64 i64
0 0 0
9 60000000 1000000000

To format temporal data types as strings we can use the function dt.to_string and to parse temporal data types from strings we can use the function str.to_datetime. Both functions adopt the chrono format syntax for formatting.

df = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 1, 2)],
        "string": ["2022-01-01", "2022-01-02"],
    }
)

result = df.select(
    pl.col("date").dt.to_string("%Y-%m-%d"),
    pl.col("string").str.to_datetime("%Y-%m-%d"),
)

result
shape: (2, 2)
date string
str datetime[μs]
"2022-01-01" 2022-01-01 00:00:00
"2022-01-02" 2022-01-02 00:00:00

It’s worth noting that str.to_datetime features additional options that support timezone functionality. Refer to the API documentation for further information.
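
For instance, here is a hedged sketch of parsing directly into a time-zone-aware Datetime column, with the time_zone parameter as described in the API docs:

result = df.select(
    pl.col("string").str.to_datetime("%Y-%m-%d", time_zone="UTC"),
)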

4.3 Strings

The following section discusses operations performed on string data, which is a frequently used data type when working with dataframes. String processing functions are available in the namespace str.

4.3.1 The string namespace

When working with string data you will likely need to access the namespace str, which aggregates 40+ functions that let you work with strings. As an example of how to access functions from within that namespace, the snippet below shows how to compute the length of the strings in a column in terms of the number of bytes and the number of characters:

df = pl.DataFrame(
    {
        "language": ["English", "Dutch", "Portuguese", "Finish"],
        "fruit": ["pear", "peer", "pêra", "päärynä"],
    }
)

result = df.with_columns(
    pl.col("fruit").str.len_bytes().alias("byte_count"),
    pl.col("fruit").str.len_chars().alias("letter_count"),
)

result
shape: (4, 4)
language fruit byte_count letter_count
str str u32 u32
"English" "pear" 4 4
"Dutch" "peer" 4 4
"Portuguese" "pêra" 5 4
"Finish" "päärynä" 10 7

4.3.2 Parsing strings

Polars offers multiple methods for checking and parsing elements of a string column, namely checking for the existence of given substrings or patterns, and counting, extracting, or replacing them. We will demonstrate some of these operations in the upcoming examples.

4.3.2.1 Check for the existence of a pattern

We can use the function contains to check for the presence of a pattern within a string. By default, the argument to the function contains is interpreted as a regular expression. If you want to specify a literal substring, set the parameter literal to True.

For the special cases where you want to check if the strings start or end with a fixed substring, you can use the functions starts_with or ends_with, respectively.

result = df.select(
    pl.col("fruit"),
    pl.col("fruit").str.starts_with("p").alias("starts_with_p"),
    pl.col("fruit").str.contains("p..r").alias("p..r"),
    pl.col("fruit").str.contains("e+").alias("e+"),
    pl.col("fruit").str.ends_with("r").alias("ends_with_r"),
)

result
shape: (4, 5)
fruit starts_with_p p..r e+ ends_with_r
str bool bool bool bool
"pear" true true true true
"peer" true true true true
"pêra" true false false false
"päärynä" true true false false

4.3.2.2 Extract a pattern

The function extract allows us to extract patterns from the string values in a column. The function extract accepts a regex pattern with one or more capture groups and extracts the capture group specified as the second argument.

df = pl.DataFrame(
    {
        "urls": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
result = df.select(
    pl.col("urls").str.extract(r"candidate=(\w+)", group_index=1),
)

result
shape: (3, 1)
urls
str
"messi"
null
"ronaldo"

To extract all occurrences of a pattern within a string, we can use the function extract_all. In the example below, we extract all numbers from a string using the regex pattern (\d+), which matches one or more digits. The resulting output of the function extract_all is a list containing all instances of the matched pattern within the string.

df = pl.DataFrame({"text": ["123 bla 45 asd", "xyz 678 910t"]})
result = df.select(
    pl.col("text").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)

result
shape: (2, 1)
extracted_nrs
list[str]
["123", "45"]
["678", "910"]

4.3.2.3 Replace a pattern

Akin to the functions extract and extract_all, Polars provides the functions replace and replace_all. These accept a regex pattern or a literal substring (if the parameter literal is set to True) and perform the replacements specified. The function replace will make at most one replacement whereas the function replace_all will make all the non-overlapping replacements it finds.

df = pl.DataFrame({"text": ["123abc", "abc456"]})
result = df.with_columns(
    pl.col("text").str.replace(r"\d", "-"),
    pl.col("text").str.replace_all(r"\d", "-").alias("text_replace_all"),
)

result
shape: (2, 2)
text text_replace_all
str str
"-23abc" "---abc"
"abc-56" "abc---"

4.3.3 Modifying strings

4.3.3.1 Case conversion

Converting the casing of a string is a common operation and Polars supports it out of the box with the functions to_lowercase, to_titlecase, and to_uppercase:

addresses = pl.DataFrame(
    {
        "addresses": [
            "128 PERF st",
            "Rust blVD, 158",
            "PoLaRs Av, 12",
            "1042 Query sq",
        ]
    }
)

addresses = addresses.select(
    pl.col("addresses").alias("originals"),
    pl.col("addresses").str.to_titlecase(),
    pl.col("addresses").str.to_lowercase().alias("lower"),
    pl.col("addresses").str.to_uppercase().alias("upper"),
)

addresses
shape: (4, 4)
originals addresses lower upper
str str str str
"128 PERF st" "128 Perf St" "128 perf st" "128 PERF ST"
"Rust blVD, 158" "Rust Blvd, 158" "rust blvd, 158" "RUST BLVD, 158"
"PoLaRs Av, 12" "Polars Av, 12" "polars av, 12" "POLARS AV, 12"
"1042 Query sq" "1042 Query Sq" "1042 query sq" "1042 QUERY SQ"

4.3.3.2 Stripping characters from the ends

Polars provides five functions in the namespace str that let you strip characters from the ends of the string; a short sketch follows the list:

  • strip_chars: removes leading and trailing occurrences of the characters specified.
  • strip_chars_end: removes trailing occurrences of the characters specified.
  • strip_chars_start: removes leading occurrences of the characters specified.
  • strip_prefix: removes an exact substring prefix if present.
  • strip_suffix: removes an exact substring suffix if present.
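
Here is a hedged sketch of these functions on illustrative data:

strip_df = pl.DataFrame({"text": ["  padded  ", "xx_value_xx"]})
result = strip_df.select(
    pl.col("text").str.strip_chars().alias("no_ws"),           # whitespace by default
    pl.col("text").str.strip_chars("x").alias("no_x"),         # leading/trailing "x"
    pl.col("text").str.strip_prefix("xx").alias("no_prefix"),  # exact prefix only
    pl.col("text").str.strip_suffix("xx").alias("no_suffix"),  # exact suffix only
)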

4.3.3.3 Slicing

Besides extracting substrings as specified by patterns, you can also slice strings at specified offsets to produce substrings. The general-purpose function for slicing is slice and it takes the starting offset and the optional length of the slice. If the length of the slice is not specified or if it’s past the end of the string, Polars slices the string all the way to the end.

The functions head and tail are specialised versions used for slicing the beginning and end of a string, respectively.

df = pl.DataFrame(
    {
        "fruits": ["pear", "mango", "dragonfruit", "passionfruit"],
        "n": [1, -1, 4, -4],
    }
)

result = df.with_columns(
    pl.col("fruits").str.slice(pl.col("n")).alias("slice"),
    pl.col("fruits").str.head(pl.col("n")).alias("head"),
    pl.col("fruits").str.tail(pl.col("n")).alias("tail"),
)

result
shape: (4, 5)
fruits n slice head tail
str i64 str str str
"pear" 1 "ear" "p" "r"
"mango" -1 "o" "mang" "ango"
"dragonfruit" 4 "onfruit" "drag" "ruit"
"passionfruit" -4 "ruit" "passionf" "ionfruit"

4.4 Missing data

In Polars, missing data is represented by the value null. This missing value null is used for all data types, including numerical types. Polars also supports the value NaN (“Not a Number”) for columns with floating point numbers. The value NaN is considered to be a valid floating point value, which is different from missing data.

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)

df
shape: (2, 1)
value
i64
1
null

Missing data in a series can be filled with the function fill_null. You can specify how missing data is effectively filled in several different ways:

  • a literal of the correct data type;
  • a Polars expression, such as replacing with values computed from another column;
  • a strategy based on neighbouring values, such as filling forwards or backwards; and
  • interpolation.

To illustrate how each of these methods work we start by defining a simple dataframe with two missing values in the second column:

df = pl.DataFrame(
    {
        "col1": [0.5, 1, 1.5, 2, 2.5],
        "col2": [1, None, 3, None, 5],
    },
)

df
shape: (5, 2)
col1 col2
f64 i64
0.5 1
1.0 null
1.5 3
2.0 null
2.5 5

4.4.1 Fill with a specified literal value

You can fill the missing data with a specified literal value. This literal value will replace all of the occurrences of the value null:

fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(3),
)

fill_literal_df
shape: (5, 2)
col1 col2
f64 i64
0.5 1
1.0 3
1.5 3
2.0 3
2.5 5

4.4.2 Fill with an expression

In the general case, the missing data can be filled by extracting the corresponding values from the result of a general Polars expression. For example, we can fill the second column with values taken from the double of the first column:

fill_expression_df = df.with_columns(
    pl.col("col2").fill_null((2 * pl.col("col1")).cast(pl.Int64)),
)

fill_expression_df
shape: (5, 2)
col1 col2
f64 i64
0.5 1
1.0 2
1.5 3
2.0 4
2.5 5

4.4.3 Fill with a strategy based on neighbouring values

You can also fill the missing data by following a fill strategy based on the neighbouring values. The two simplest strategies look for the first non-null value that comes immediately before or immediately after the value null that is being filled:

fill_forward_df = df.with_columns(
    pl.col("col2").fill_null(strategy="forward").alias("forward"),
    pl.col("col2").fill_null(strategy="backward").alias("backward"),
)

fill_forward_df
shape: (5, 4)
col1 col2 forward backward
f64 i64 i64 i64
0.5 1 1 1
1.0 null 1 3
1.5 3 3 3
2.0 null 3 5
2.5 5 5 5

You can find other fill strategies in the API docs.
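
For instance, a hedged sketch of two more strategies named in the API docs:

fill_more_df = df.with_columns(
    pl.col("col2").fill_null(strategy="mean").alias("mean_fill"),
    pl.col("col2").fill_null(strategy="zero").alias("zero_fill"),
)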

4.4.4 Fill with interpolation

Additionally, you can fill missing data with interpolation by using the function interpolate instead of the function fill_null:

fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),
)

fill_interpolation_df
shape: (5, 2)
col1 col2
f64 f64
0.5 1.0
1.0 2.0
1.5 3.0
2.0 4.0
2.5 5.0

4.4.5 Not a Number, or NaN values

Missing data in a series is only ever represented by the value null, regardless of the data type of the series. Columns with a floating point data type can sometimes have the value NaN, which might be confused with null.

The special value NaN can be created directly:

import numpy as np

nan_df = pl.DataFrame(
    {
        "value": [1.0, np.nan, float("nan"), 3.0],
    },
)

nan_df
shape: (4, 1)
value
f64
1.0
NaN
NaN
3.0

And it might also arise as the result of a computation:

df = pl.DataFrame(
    {
        "dividend": [1, 0, -1],
        "divisor": [1, 0, -1],
    }
)
result = df.select(pl.col("dividend") / pl.col("divisor"))

result
shape: (3, 1)
dividend
f64
1.0
NaN
1.0

NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:

  • NaN values are not counted with the function null_count; and
  • NaN values are filled when you use the specialised function fill_nan but are not filled with the function fill_null.

Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null. Unlike with missing data, Polars does not hold any metadata regarding the NaN values, so the function is_nan entails actual computation.
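
A minimal sketch of that difference, reusing nan_df from above:

result = nan_df.select(
    pl.col("value").null_count().alias("null_count"),   # 0: NaN is not null
    pl.col("value").is_nan().sum().alias("nan_count"),  # 2
)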

One further difference between the values null and NaN is that numerical aggregating functions, like mean and sum, skip the missing values when computing the result, whereas the value NaN is considered for the computation and typically propagates into the result. If desirable, this behavior can be avoided by replacing the occurrences of the value NaN with the value null:

mean_nan_df = nan_df.with_columns(
    pl.col("value").fill_nan(None).alias("replaced"),
).select(
    pl.all().mean().name.suffix("_mean"),
    pl.all().sum().name.suffix("_sum"),
)

mean_nan_df
shape: (1, 4)
value_mean replaced_mean value_sum replaced_sum
f64 f64 f64 f64
NaN 2.0 NaN 4.0

5 Transformations

The focus of this section is to describe different types of data transformations and provide some examples on how to use them.

5.1 Joins

A join operation combines columns from one or more dataframes into a new dataframe. The different “joining strategies” and matching criteria used by the different types of joins influence how columns are combined and also what rows are included in the result of the join operation.

In a join, rows are matched by checking equality of a key expression. You can perform a join with the function join by specifying the name of the column to be used as the key. For the examples, we will be loading some (modified) Monopoly property data.

First, we load a dataframe that contains property names and their colour group in the game:

props_groups = pl.read_csv("data/monopoly_props_groups.csv")
props_groups
shape: (29, 2)
property_name group
str str
"Old Ken Road" "brown"
"Whitechapel Road" "brown"
"The Shire" "fantasy"
"Kings Cross Station" "stations"
"Bond Street" "green"
"Liverpool Street Station" "stations"
"Park Lane" "dark_blue"
"Mayfair" "dark_blue"

Next, we load a dataframe that contains property names and their price in the game:

props_prices = pl.read_csv("data/monopoly_props_prices.csv")
props_prices
shape: (29, 2)
property_name cost
str i64
"Old Ken Road" 60
"Whitechapel Road" 60
"Sesame Street" 100
"Kings Cross Station" 200
"Bond Street" 320
"Liverpool Street Station" 200
"Park Lane" 350
"Mayfair" 400

Now, we join both dataframes to create a dataframe that contains property names, colour groups, and prices:

result = props_groups.join(props_prices, on="property_name")
result
shape: (28, 3)
property_name group cost
str str i64
"Old Ken Road" "brown" 60
"Whitechapel Road" "brown" 60
"Kings Cross Station" "stations" 200
"The Angel, Islington" "light_blue" 100
"Bond Street" "green" 320
"Liverpool Street Station" "stations" 200
"Park Lane" "dark_blue" 350
"Mayfair" "dark_blue" 400

The result has 28 rows but both dataframes used in the operation had 29 rows. Polars uses a joining strategy to determine what happens with rows that have multiple matches or with rows that have no match at all. By default, Polars computes an “inner join” but there are other join strategies that we show next.

In the example above, the two dataframes conveniently had the column we wish to use as key with the same name and with the values in the exact same format. Suppose, for the sake of argument, that one of the dataframes had a differently named column and the other had the property names in lower case:

props_groups2 = props_groups.with_columns(
    pl.col("property_name").str.to_lowercase(),
)

props_groups2
shape: (29, 2)
property_name group
str str
"old ken road" "brown"
"whitechapel road" "brown"
"the shire" "fantasy"
"kings cross station" "stations"
"bond street" "green"
"liverpool street station" "stations"
"park lane" "dark_blue"
"mayfair" "dark_blue"

props_prices2 = props_prices.select(
    pl.col("property_name").alias("name"), pl.col("cost")
)

props_prices2
shape: (29, 2)
name cost
str i64
"Old Ken Road" 60
"Whitechapel Road" 60
"Sesame Street" 100
"Kings Cross Station" 200
"Bond Street" 320
"Liverpool Street Station" 200
"Park Lane" 350
"Mayfair" 400

In a situation like this, where we may want to perform the same join as before, we can leverage join’s flexibility and specify arbitrary expressions to compute the joining key on the left and on the right, computing row keys dynamically:

result = props_groups2.join(
    props_prices2,
    left_on="property_name",
    right_on=pl.col("name").str.to_lowercase(),
)

result
shape: (28, 4)
property_name group name cost
str str str i64
"old ken road" "brown" "Old Ken Road" 60
"whitechapel road" "brown" "Whitechapel Road" 60
"kings cross station" "stations" "Kings Cross Station" 200
"the angel, islington" "light_blue" "The Angel, Islington" 100
"bond street" "green" "Bond Street" 320
"liverpool street station" "stations" "Liverpool Street Station" 200
"park lane" "dark_blue" "Park Lane" 350
"mayfair" "dark_blue" "Mayfair" 400

Because we are joining on the right with an expression, Polars preserves the column “property_name” from the left and the column “name” from the right so we can have access to the original values that the key expressions were applied to.

5.1.1 Join strategies

When computing a join with df1.join(df2, ...), we can specify one of many different join strategies. A join strategy specifies what rows to keep from each dataframe based on whether they match rows from the other dataframe.

5.1.1.1 Inner join

In an inner join the resulting dataframe only contains the rows from the left and right dataframes that matched. This is the default strategy used by join, and we saw an example of it above. We repeat the example here and explicitly specify the join strategy:

result = props_groups.join(props_prices, on="property_name", how="inner")
result
shape: (28, 3)
property_name group cost
str str i64
"Old Ken Road" "brown" 60
"Whitechapel Road" "brown" 60
"Kings Cross Station" "stations" 200
"The Angel, Islington" "light_blue" 100
"Bond Street" "green" 320
"Liverpool Street Station" "stations" 200
"Park Lane" "dark_blue" 350
"Mayfair" "dark_blue" 400

The result does not include the row from props_groups that contains “The Shire” and the result also does not include the row from props_prices that contains “Sesame Street”.

5.1.1.2 Left join

A left outer join is a join where the result contains all the rows from the left dataframe and the rows of the right dataframe that matched any rows from the left dataframe.

result = props_groups.join(props_prices, on="property_name", how="left")
result
shape: (29, 3)
property_name group cost
str str i64
"Old Ken Road" "brown" 60
"Whitechapel Road" "brown" 60
"The Shire" "fantasy" null
"Kings Cross Station" "stations" 200
"Bond Street" "green" 320
"Liverpool Street Station" "stations" 200
"Park Lane" "dark_blue" 350
"Mayfair" "dark_blue" 400

If there are any rows from the left dataframe that have no matching rows on the right dataframe, they get the value null on the new columns.

5.1.1.3 Right join

Computationally speaking, a right outer join is exactly the same as a left outer join, but with the arguments swapped. Here is an example:

result = props_groups.join(props_prices, on="property_name", how="right")
result
shape: (29, 3)
group property_name cost
str str i64
"brown" "Old Ken Road" 60
"brown" "Whitechapel Road" 60
null "Sesame Street" 100
"stations" "Kings Cross Station" 200
"green" "Bond Street" 320
"stations" "Liverpool Street Station" 200
"dark_blue" "Park Lane" 350
"dark_blue" "Mayfair" 400

We show that df1.join(df2, how="right", ...) is the same as df2.join(df1, how="left", ...), up to the order of the columns of the result, with the computation below:

print(
    result.equals(
        props_prices.join(
            props_groups,
            on="property_name",
            how="left",
            # Reorder the columns to match the order from above.
        ).select(pl.col("group"), pl.col("property_name"), pl.col("cost"))
    )
)
True

5.1.1.4 Full join

A full outer join will keep all of the rows from the left and right dataframes, even if they don’t have matching rows in the other dataframe:

result = props_groups.join(props_prices, on="property_name", how="full")
result
shape: (30, 4)
property_name group property_name_right cost
str str str i64
"Old Ken Road" "brown" "Old Ken Road" 60
"Whitechapel Road" "brown" "Whitechapel Road" 60
null null "Sesame Street" 100
"Kings Cross Station" "stations" "Kings Cross Station" 200
"Liverpool Street Station" "stations" "Liverpool Street Station" 200
"Park Lane" "dark_blue" "Park Lane" 350
"Mayfair" "dark_blue" "Mayfair" 400
"The Shire" "fantasy" null null

In this case, we see that we get two columns property_name and property_name_right to make up for the fact that we are matching on the column property_name of both dataframes and there are some names for which there are no matches. The two columns help differentiate the source of each row’s data. If we wanted to force join to coalesce the two columns property_name into a single column, we could set coalesce=True explicitly:

result = props_groups.join(
    props_prices,
    on="property_name",
    how="full",
    coalesce=True,
)

result
shape: (30, 3)
property_name group cost
str str i64
"Old Ken Road" "brown" 60
"Whitechapel Road" "brown" 60
"Sesame Street" null 100
"Kings Cross Station" "stations" 200
"Liverpool Street Station" "stations" 200
"Park Lane" "dark_blue" 350
"Mayfair" "dark_blue" 400
"The Shire" "fantasy" null

When not set, the parameter coalesce is determined automatically from the join strategy and the key(s) specified, which is why the inner, left, and right joins acted as if coalesce=True, even though we didn’t set it.

5.1.1.5 Semi join

A semi join will return the rows of the left dataframe that have a match in the right dataframe, but we do not actually join the matching rows:

result = props_groups.join(props_prices, on="property_name", how="semi")
result
shape: (28, 2)
property_name group
str str
"Old Ken Road" "brown"
"Whitechapel Road" "brown"
"Kings Cross Station" "stations"
"The Angel, Islington" "light_blue"
"Bond Street" "green"
"Liverpool Street Station" "stations"
"Park Lane" "dark_blue"
"Mayfair" "dark_blue"

A semi join acts as a sort of row filter based on a second dataframe.

5.1.1.6 Anti join

Conversely, an anti join will return the rows of the left dataframe that do not have a match in the right dataframe:

result = props_groups.join(props_prices, on="property_name", how="anti")
result
shape: (1, 2)
property_name group
str str
"The Shire" "fantasy"

5.2 Concatenation

There are a number of ways to concatenate data from separate DataFrames:

  • two dataframes with the same columns can be vertically concatenated to make a longer dataframe
  • two dataframes with non-overlapping columns can be horizontally concatenated to make a wider dataframe
  • two dataframes with different numbers of rows and columns can be diagonally concatenated to make a dataframe which might be longer and/or wider. Where column names overlap, values will be vertically concatenated. Where column names do not overlap, new rows and columns will be added. Missing values will be set as null

5.2.1 Vertical concatenation - getting longer

In a vertical concatenation you combine all of the rows from a list of DataFrames into a single longer DataFrame.

df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)

df_vertical_concat
shape: (2, 2)
a b
i64 i64
1 3
2 4

Vertical concatenation fails when the dataframes do not have the same column names.
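
Vertical concatenation also fails when the column names match but the data types do not. Here is a hedged sketch of the "vertical_relaxed" strategy, which casts to a shared supertype instead:

df_i = pl.DataFrame({"a": [1, 2]})  # Int64
df_f = pl.DataFrame({"a": [3.5]})   # Float64
df_relaxed = pl.concat([df_i, df_f], how="vertical_relaxed")  # "a" becomes Float64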

5.2.2 Horizontal concatenation - getting wider

In a horizontal concatenation you combine all of the columns from a list of DataFrames into a single wider DataFrame.

df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6],
        "r2": [7, 8],
        "r3": [9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
)

df_horizontal_concat
shape: (2, 5)
l1 l2 r1 r2 r3
i64 i64 i64 i64 i64
1 3 5 7 9
2 4 6 8 10

Horizontal concatenation fails when dataframes have overlapping columns.

When dataframes have different numbers of rows, columns will be padded with null values at the end up to the maximum length.

df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6, 7],
        "r2": [8, 9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
)

df_horizontal_concat
shape: (3, 4)
l1 l2 r1 r2
i64 i64 i64 i64
1 3 5 8
2 4 6 9
null null 7 10

5.2.3 Diagonal concatenation - getting longer, wider and nullier

In a diagonal concatenation you combine all of the rows and columns from a list of DataFrames into a single longer and/or wider DataFrame.

df_d1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_d2 = pl.DataFrame(
    {
        "a": [2],
        "d": [4],
    }
)

df_diagonal_concat = pl.concat(
    [
        df_d1,
        df_d2,
    ],
    how="diagonal",
)

df_diagonal_concat
shape: (2, 3)
a b d
i64 i64 i64
1 3 null
2 null 4

Diagonal concatenation generates nulls when the column names do not overlap.

6 Exercises

Now let’s practice with some hands-on exercises using the Northwind dataset. We’ll work with several CSVs including customers, orders, products, and more to demonstrate data wrangling concepts with Polars.

6.1 Basic Data Loading and Exploration

  1. Load the customers.csv, orders.csv, and order_details.csv files.
  2. For each dataframe, examine the schema and print the number of rows.
  3. Create a new expression to calculate the total value of each order detail line (unitPrice * quantity * (1-discount)).

6.2 Missing Data

  1. Check each of the dataframes for missing values.
  2. For the orders.csv file, identify orders where the shippedDate is missing.
  3. Create a new column that identifies whether an order has been shipped or not.

6.3 String Operations

  1. In the customers.csv file, create a new column that extracts the domain name from the contact’s email address.
  2. Convert company names to uppercase.
  3. Create a column that indicates whether the company name contains “restaurant” or “café” (case insensitive).

6.4 Expression Expansion

  1. Load the products.csv file.
  2. Use expression expansion to apply a 10% price increase to all products.
  3. Create columns that show the unit price in both EUR and GBP (use conversion rates of your choice).
  4. Use pattern matching to select all columns related to pricing.

6.5 Joining Dataframes

  1. Join the orders dataframe with the customers dataframe to see customer information with each order.
  2. Join the orders dataframe with the order_details dataframe to get a complete view of each order.
  3. Create a three-way join of orders, order_details, and products to show what products were ordered.
  4. Find orders that don’t have any order details (if any) using an anti-join.

6.6 Aggregations

  1. Calculate the total revenue by customer.
  2. Identify the top 5 customers by order count.
  3. For each product, calculate the total quantity sold, average unit price, and total revenue generated.
  4. Find the average time between order date and shipped date for each shipper.