Data Science with Polars
Data wrangling is the process of transforming and mapping data from one “raw” form into another format to make it more valuable and appropriate for analysis. This module introduces you to the core concepts of the Polars API that will help you efficiently wrangle your data.
In your data journey, you’ll likely spend a significant amount of time preparing data - selecting, filtering, transforming, joining, and reshaping it before analysis. Polars provides a powerful interface for these tasks with its expressive API built around the concept of expressions and contexts. By understanding these fundamental concepts, you’ll be able to write efficient and readable code that makes the most of Polars’ optimization capabilities.
This module will guide you through Polars data structures, expression-based operations, and various transformation techniques. You’ll learn how to manipulate strings, handle missing data, join datasets, and concatenate dataframes - essential skills for any data science workflow.
Polars supports a variety of data types that fall broadly under a few categories, such as numeric, string, boolean, temporal, and nested types.
All types support missing values represented by the special value null. You can also find a full table with all data types supported in the Polars documentation.
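As a quick sketch of that point (using the Series type introduced just below), null can appear in a column of any data type; the values here are made-up examples:
import polars as pl

# null marks a missing value regardless of the underlying data type.
ints_with_null = pl.Series("ints", [1, None, 3])
words_with_null = pl.Series("words", ["a", "b", None])
print(ints_with_null.null_count(), words_with_null.null_count())  # 1 1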
The core base data structures provided by Polars are series and dataframes. A series is a 1-dimensional homogeneous data structure. “Homogeneous” means that all elements inside a series have the same data type. The snippet below shows how to create a named series:
import polars as pl

s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)
shape: (5,)
Series: 'ints' [i64]
[
1
2
3
4
5
]
When creating a series, Polars will infer the data type from the values you provide. You can specify a concrete data type to override the inference mechanism:
s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1.dtype, s2.dtype)
Int64 UInt64
A dataframe is a 2-dimensional heterogeneous data structure that contains uniquely named series. By holding your data in a dataframe you will be able to use the Polars API to write queries that manipulate your data.
from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
df
name | birthdate | weight | height |
---|---|---|---|
str | date | f64 | f64 |
"Alice Archer" | 1997-01-10 | 57.9 | 1.56 |
"Ben Brown" | 1985-02-15 | 72.5 | 1.77 |
"Chloe Cooper" | 1983-03-22 | 53.6 | 1.65 |
"Daniel Donovan" | 1981-04-30 | 83.1 | 1.75 |
Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts, which will be introduced here, are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.
In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks to build more complex expressions. Here is an example of a Polars expression:
pl.col("weight") / (pl.col("height") ** 2)
As you might be able to guess, this expression takes a column named “weight” and divides its values by the square of the values in a column “height”, computing a person’s BMI.
The code above expresses an abstract computation that we can save in a variable, manipulate further, or just print:
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]
Because expressions are lazy, no computations have taken place yet. That’s what we need contexts for.
Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. The four most common contexts are:
- select: Choose columns or create new columns based on expressions
- with_columns: Add new columns or replace existing ones
- filter: Keep rows that match specific conditions
- group_by: Group data by common values for aggregation
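For instance, the short sketch below reuses the dataframe df and the expression bmi_expr from above and evaluates the same expression in three of these contexts; the BMI threshold of 23 is just an arbitrary cut-off chosen for the example:
# One expression, three different contexts.
print(df.select(bmi_expr.alias("bmi")))  # just the computed column
print(df.with_columns(bmi=bmi_expr))     # original columns plus a new "bmi" column
print(df.filter(bmi_expr > 23))          # only the rows whose BMI exceeds 23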
In the following example, there are two grouping expressions and three aggregating expressions, but the resulting dataframe will have six columns instead of five:
result = (
    df.group_by(
        (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
        (pl.col("height") < 1.7).alias("short?"),
    )
    .agg(
        pl.len(),
        pl.col("height").max().alias("tallest"),
        pl.col("weight", "height").mean().name.prefix("avg_"),
    )
)
result
decade | short? | len | tallest | avg_weight | avg_height |
---|---|---|---|---|---|
i32 | bool | u32 | f64 | f64 | f64 |
1980 | true | 1 | 1.65 | 53.6 | 1.65 |
1980 | false | 2 | 1.77 | 77.8 | 1.76 |
1990 | true | 1 | 1.56 | 57.9 | 1.56 |
If we look closely, the last aggregating expression mentioned two different columns: “weight” and “height”.
Polars expressions support a feature called expression expansion. Expression expansion is like a shorthand notation for when you want to apply the same transformation to multiple columns. As we have seen, the expression
pl.col("weight", "height").mean().name.prefix("avg_")
will compute the mean value of the columns “weight” and “height” and will rename them as “avg_weight” and “avg_height”, respectively. In fact, the expression above is equivalent to using the two following expressions:
[
    pl.col("weight").mean().alias("avg_weight"),
    pl.col("height").mean().alias("avg_height"),
]
[<Expr ['col("weight").mean().alias("av…'] at 0x19201400F80>,
<Expr ['col("height").mean().alias("av…'] at 0x19201401010>]
In this case, this expression expands into two independent expressions that Polars can execute in parallel. In other cases, we may not be able to know in advance how many independent expressions an expression will unfold into.
Consider this simple but elucidative example:
(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
This expression will multiply all columns with data type Float64 by 1.1. The number of columns this applies to depends on the schema of each dataframe. In the case of the dataframe we have been using, it applies to two columns:
expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
result
weight*1.1 | height*1.1 |
---|---|
f64 | f64 |
63.69 | 1.716 |
79.75 | 1.947 |
58.96 | 1.815 |
91.41 | 1.925 |
In the case of the dataframe df2 below, the same expression expands to 0 columns because no column has the data type Float64:
df2 = pl.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "letters": ["A", "B", "C", "D"],
    }
)
result = df2.select(expr)
result
It is equally easy to imagine a scenario where the same expression would expand to dozens of columns.
This feature isn’t just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions, which means you can write one single expression and reuse the computation that it represents.
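As a small sketch of that reuse (using the dataframes df and df2 defined above), the expression below is written once and applied to both; it expands to the two float columns of df and to no columns at all for df2:
# A single reusable expression; how far it expands depends on each schema.
mean_of_floats = pl.col(pl.Float64).mean().name.suffix("_mean")

print(df.select(mean_of_floats))   # expands to "weight_mean" and "height_mean"
print(df2.select(mean_of_floats))  # expands to no columns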
col
The function col is the most common way of making use of expression expansion features in Polars.
The simplest form of expression expansion happens when you provide multiple column names to the function col.
The example below uses a single function col with multiple column names to convert the values in USD to EUR:
df = pl.DataFrame(
    {  # As of 14th October 2024, ~3pm UTC
        "ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
        "company_name": ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
        "price": [229.9, 138.93, 420.56, 166.41, 188.4],
        "day_high": [231.31, 139.6, 424.04, 167.62, 189.83],
        "day_low": [228.6, 136.3, 417.52, 164.78, 188.44],
        "year_high": [237.23, 140.76, 468.35, 193.31, 201.2],
        "year_low": [164.08, 39.23, 324.39, 121.46, 118.35],
    }
)

eur_usd_rate = 1.09  # As of 14th October 2024

result = df.with_columns(
    (
        pl.col(
            "price",
            "day_high",
            "day_low",
            "year_high",
            "year_low",
        )
        / eur_usd_rate
    ).round(2)
)
result
ticker | company_name | price | day_high | day_low | year_high | year_low |
---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 |
"AAPL" | "Apple" | 210.92 | 212.21 | 209.72 | 217.64 | 150.53 |
"NVDA" | "NVIDIA" | 127.46 | 128.07 | 125.05 | 129.14 | 35.99 |
"MSFT" | "Microsoft" | 385.83 | 389.03 | 383.05 | 429.68 | 297.61 |
"GOOG" | "Alphabet (Google)" | 152.67 | 153.78 | 151.17 | 177.35 | 111.43 |
"AMZN" | "Amazon" | 172.84 | 174.16 | 172.88 | 184.59 | 108.58 |
When you list the column names you want the expression to expand to, you can predict what the expression will expand to. In this case, the expression that does the currency conversion is expanded to a list of five expressions:
exprs = [
    (pl.col("price") / eur_usd_rate).round(2),
    (pl.col("day_high") / eur_usd_rate).round(2),
    (pl.col("day_low") / eur_usd_rate).round(2),
    (pl.col("year_high") / eur_usd_rate).round(2),
    (pl.col("year_low") / eur_usd_rate).round(2),
]

result2 = df.with_columns(exprs)
print(result.equals(result2))
True
We had to type five column names in the previous example but the function col can also conveniently accept one or more data types. If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.
The example below performs the exact same computation as before:
result = df.with_columns((pl.col(pl.Float64) / eur_usd_rate).round(2))
result
ticker | company_name | price | day_high | day_low | year_high | year_low |
---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 |
"AAPL" | "Apple" | 210.92 | 212.21 | 209.72 | 217.64 | 150.53 |
"NVDA" | "NVIDIA" | 127.46 | 128.07 | 125.05 | 129.14 | 35.99 |
"MSFT" | "Microsoft" | 385.83 | 389.03 | 383.05 | 429.68 | 297.61 |
"GOOG" | "Alphabet (Google)" | 152.67 | 153.78 | 151.17 | 177.35 | 111.43 |
"AMZN" | "Amazon" | 172.84 | 174.16 | 172.88 | 184.59 | 108.58 |
When we use a data type with expression expansion we cannot know, beforehand, how many columns a single expression will expand to. We need the schema of the input dataframe if we want to determine the final list of expressions that will be applied.
If we weren’t sure whether the price columns were of the type Float64 or Float32, we could specify both data types:
result2 = df.with_columns(
    (
        pl.col(
            pl.Float32,
            pl.Float64,
        )
        / eur_usd_rate
    ).round(2)
)
print(result.equals(result2))
True
You can also use regular expressions to specify patterns that are used to match the column names. To distinguish between a regular column name and expansion by pattern matching, regular expressions start and end with ^ and $, respectively. This also means that the pattern must match against the whole column name string.
Regular expressions can be mixed with regular column names:
result = df.select(pl.col("ticker", "^.*_high$", "^.*_low$"))
result
ticker | day_high | year_high | day_low | year_low |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"AAPL" | 231.31 | 237.23 | 228.6 | 164.08 |
"NVDA" | 139.6 | 140.76 | 136.3 | 39.23 |
"MSFT" | 424.04 | 468.35 | 417.52 | 324.39 |
"GOOG" | 167.62 | 193.31 | 164.78 | 121.46 |
"AMZN" | 189.83 | 201.2 | 188.44 | 118.35 |
In Python, the function col accepts an arbitrary number of strings (as column names or as regular expressions) or an arbitrary number of data types, but you cannot mix both in the same function call:
df.select(pl.col("ticker", pl.Float64))
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[18], line 1 ----> 1 df.select(pl.col("ticker", pl.Float64)) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\functions\col.py:329, in Col.__call__(self, name, *more_names) 176 def __call__( 177 self, 178 name: ( (...) 185 *more_names: str | PolarsDataType | PythonDataType, 186 ) -> Expr: 187 """ 188 Create one or more expressions representing columns in a DataFrame. 189 (...) 327 └─────┴───────────┴─────┘ 328 """ --> 329 return _create_col(name, *more_names) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\functions\col.py:47, in _create_col(name, *more_names) 45 names_str = [name] 46 names_str.extend(more_names) # type: ignore[arg-type] ---> 47 return wrap_expr(plr.cols(names_str)) 48 elif is_polars_dtype(name): 49 dtypes = [name] TypeError: argument 'names': 'DataTypeClass' object cannot be converted to 'PyString'
Polars provides the function all as shorthand notation to refer to all columns of a dataframe:
result = df.select(pl.all())
print(result.equals(df))
True
The function all is syntactic sugar for col("*"), but since the argument "*" is a special case and all reads more like English, the usage of all is preferred.
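The equivalence is easy to check directly:
print(df.select(pl.all()).equals(df.select(pl.col("*"))))  # True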
Polars also provides a mechanism to exclude certain columns from expression expansion. For that, you use the function exclude, which accepts exactly the same types of arguments as col:
result = df.select(pl.all().exclude("^day_.*$"))
result
ticker | company_name | price | year_high | year_low |
---|---|---|---|---|
str | str | f64 | f64 | f64 |
"AAPL" | "Apple" | 229.9 | 237.23 | 164.08 |
"NVDA" | "NVIDIA" | 138.93 | 140.76 | 39.23 |
"MSFT" | "Microsoft" | 420.56 | 468.35 | 324.39 |
"GOOG" | "Alphabet (Google)" | 166.41 | 193.31 | 121.46 |
"AMZN" | "Amazon" | 188.4 | 201.2 | 118.35 |
By default, when you apply an expression to a column, the result keeps the same name as the original column.
Preserving the column name can be semantically wrong and in certain cases Polars may even raise an error if duplicate names occur:
gbp_usd_rate = 1.31  # As of 14th October 2024

df.select(
    pl.col("price") / gbp_usd_rate,  # This would be named "price"...
    pl.col("price") / eur_usd_rate,  # And so would this.
)
--------------------------------------------------------------------------- DuplicateError Traceback (most recent call last) Cell In[21], line 3 1 gbp_usd_rate = 1.31 # As of 14th October 2024 ----> 3 df.select( 4 pl.col("price") / gbp_usd_rate, # This would be named "price"... 5 pl.col("price") / eur_usd_rate, # And so would this. 6 ) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\dataframe\frame.py:9633, in DataFrame.select(self, *exprs, **named_exprs) 9533 def select( 9534 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr 9535 ) -> DataFrame: 9536 """ 9537 Select columns from this DataFrame. 9538 (...) 9631 └──────────────┘ 9632 """ -> 9633 return self.lazy().select(*exprs, **named_exprs).collect(_eager=True) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\_utils\deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs) 89 kwargs["engine"] = "in-memory" 91 del kwargs["streaming"] ---> 93 return function(*args, **kwargs) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\lazyframe\frame.py:2188, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs) 2186 # Only for testing purposes 2187 callback = _kwargs.get("post_opt_callback", callback) -> 2188 return wrap_df(ldf.collect(engine, callback)) DuplicateError: the name 'price' is duplicate It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.
To prevent errors like this, and to allow users to rename their columns when appropriate, Polars provides a series of functions that let you change the name of a column or a group of columns.
alias
The function alias has already been used repeatedly and it lets you rename a single column:
result = df.select(
    (pl.col("price") / gbp_usd_rate).alias("price (GBP)"),
    (pl.col("price") / eur_usd_rate).alias("price (EUR)"),
)
When using expression expansion you cannot use the function alias because it is designed specifically to rename a single column.
When it suffices to add a static prefix or a static suffix to the existing names, we can use the functions prefix and suffix from the namespace name:
result = df.select(
    (pl.col("^year_.*$") / eur_usd_rate).name.prefix("in_eur_"),
    (pl.col("day_high", "day_low") / gbp_usd_rate).name.suffix("_gbp"),
)
result
in_eur_year_high | in_eur_year_low | day_high_gbp | day_low_gbp |
---|---|---|---|
f64 | f64 | f64 | f64 |
217.642202 | 150.53211 | 176.572519 | 174.503817 |
129.137615 | 35.990826 | 106.564885 | 104.045802 |
429.678899 | 297.605505 | 323.694656 | 318.717557 |
177.348624 | 111.431193 | 127.954198 | 125.78626 |
184.587156 | 108.577982 | 144.908397 | 143.847328 |
If a static prefix/suffix is not enough, the namespace name also provides the function map, which accepts a callable that takes the old column names and produces the new ones:
# There is also `.name.to_uppercase`, so this usage of `.map` is moot.
result = df.select(pl.all().name.map(str.upper))
result
TICKER | COMPANY_NAME | PRICE | DAY_HIGH | DAY_LOW | YEAR_HIGH | YEAR_LOW |
---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 |
"AAPL" | "Apple" | 229.9 | 231.31 | 228.6 | 237.23 | 164.08 |
"NVDA" | "NVIDIA" | 138.93 | 139.6 | 136.3 | 140.76 | 39.23 |
"MSFT" | "Microsoft" | 420.56 | 424.04 | 417.52 | 468.35 | 324.39 |
"GOOG" | "Alphabet (Google)" | 166.41 | 167.62 | 164.78 | 193.31 | 121.46 |
"AMZN" | "Amazon" | 188.4 | 189.83 | 188.44 | 201.2 | 118.35 |
In this section we will focus on exploring the types of expressions that Polars offers. Each section gives an overview of what they do and provides additional examples.
This section shows how to do basic operations on dataframe columns, like arithmetic calculations, comparisons, and other general-purpose operations. We will use the following dataframe for the examples that follow:
import numpy as np

np.random.seed(42)  # For reproducibility.

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "A", "B"],
    }
)
df
nrs | names | random | groups |
---|---|---|---|
i64 | str | f64 | str |
1 | "foo" | 0.37454 | "A" |
2 | "ham" | 0.950714 | "A" |
3 | "spam" | 0.731994 | "B" |
null | "egg" | 0.598658 | "A" |
5 | "spam" | 0.156019 | "B" |
Polars supports basic arithmetic between series of the same length, or between series and literals. When literals are mixed with series, the literals are broadcast to match the length of the series they are being used with.
result = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
    (pl.col("nrs") ** 2).alias("nrs ** 2"),
    (pl.col("nrs") % 3).alias("nrs % 3"),
)
result
nrs + 5 | nrs - 5 | nrs * random | nrs / random | nrs ** 2 | nrs % 3 |
---|---|---|---|---|---|
i64 | i64 | f64 | f64 | i64 | i64 |
6 | -4 | 0.37454 | 2.669941 | 1 | 1 |
7 | -3 | 1.901429 | 2.103681 | 4 | 2 |
8 | -2 | 2.195982 | 4.098395 | 9 | 0 |
null | null | null | null | null | null |
10 | 0 | 0.780093 | 32.047453 | 25 | 2 |
The example above shows that when an arithmetic operation takes null as one of its operands, the result is null.
Polars uses operator overloading to allow you to use your language’s native arithmetic operators within your expressions. If you prefer, in Python you can use the corresponding named functions, as the snippet below demonstrates:
# Python only:
result_named_operators = df.select(
    (pl.col("nrs").add(5)).alias("nrs + 5"),
    (pl.col("nrs").sub(5)).alias("nrs - 5"),
    (pl.col("nrs").mul(pl.col("random"))).alias("nrs * random"),
    (pl.col("nrs").truediv(pl.col("random"))).alias("nrs / random"),
    (pl.col("nrs").pow(2)).alias("nrs ** 2"),
    (pl.col("nrs").mod(3)).alias("nrs % 3"),
)

print(result.equals(result_named_operators))
True
Like with arithmetic operations, Polars supports comparisons via the overloaded operators or named functions:
result = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),  # .gt
    (pl.col("nrs") >= 3).alias("nrs >= 3"),  # .ge
    (pl.col("random") < 0.2).alias("random < .2"),  # .lt
    (pl.col("random") <= 0.5).alias("random <= .5"),  # .le
    (pl.col("nrs") != 1).alias("nrs != 1"),  # .ne
    (pl.col("nrs") == 1).alias("nrs == 1"),  # .eq
)
result
nrs > 1 | nrs >= 3 | random < .2 | random <= .5 | nrs != 1 | nrs == 1 |
---|---|---|---|---|---|
bool | bool | bool | bool | bool | bool |
false | false | false | true | false | true |
true | false | false | false | true | false |
true | true | false | false | true | false |
null | null | false | false | null | null |
true | true | true | true | true | false |
In Python, you may use the operators &, |, and ~ for the Boolean operations “and”, “or”, and “not”, respectively, or the functions of the same name:
# Boolean operators & | ~
result = df.select(
    ((~pl.col("nrs").is_null()) & (pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5) | (pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)

# Corresponding named functions `and_`, `or_`, and `not_`.
result2 = df.select(
    (pl.col("nrs").is_null().not_().and_(pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5).or_(pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result.equals(result2))
True
Polars has two functions to count the number of unique values in a series. The function n_unique can be used to count the exact number of unique values in a series. However, for very large data sets, this operation can be quite slow. In those cases, if an approximation is good enough, you can use the function approx_n_unique that uses the algorithm HyperLogLog++ to estimate the result.
The example below shows a series where the approx_n_unique estimate is off by about 1.5%:
long_df = pl.DataFrame({"numbers": np.random.randint(0, 100_000, 100_000)})

result = long_df.select(
    pl.col("numbers").n_unique().alias("n_unique"),
    pl.col("numbers").approx_n_unique().alias("approx_n_unique"),
)
result
n_unique | approx_n_unique |
---|---|
u32 | u32 |
63218 | 64141 |
You can get more information about the unique values and their counts with the function value_counts, which Polars also provides:
result = df.select(
    pl.col("names").value_counts().alias("value_counts"),
)
result
value_counts |
---|
struct[2] |
{"foo",1} |
{"ham",1} |
{"spam",2} |
{"egg",1} |
Casting converts the underlying data type of a column to a new one.
Casting is available through the function cast.
Let’s take a look at the following dataframe which contains both integers and floating point numbers:
df = pl.DataFrame(
    {
        "integers": [1, 2, 3],
        "big_integers": [10000002, 2, 30000003],
        "floats": [4.0, 5.8, -6.3],
    }
)
df
integers | big_integers | floats |
---|---|---|
i64 | i64 | f64 |
1 | 10000002 | 4.0 |
2 | 2 | 5.8 |
3 | 30000003 | -6.3 |
To perform casting operations between floats and integers, or vice versa, we use the function cast:
result = df.select(
    pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
    pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
)
result
integers_as_floats | floats_as_integers |
---|---|
f32 | i32 |
1.0 | 4 |
2.0 | 5 |
3.0 | -6 |
Strings that represent numbers can be converted to the appropriate data types via casting. The opposite conversion is also possible:
df = pl.DataFrame(
    {
        "integers_as_strings": ["1", "2", "3"],
        "floats_as_strings": ["4.0", "5.8", "-6.3"],
        "floats": [4.0, 5.8, -6.3],
    }
)

result = df.select(
    pl.col("integers_as_strings").cast(pl.Int32),
    pl.col("floats_as_strings").cast(pl.Float64),
    pl.col("floats").cast(pl.String),
)
result
integers_as_strings | floats_as_strings | floats |
---|---|---|
i32 | f64 | str |
1 | 4.0 | "4.0" |
2 | 5.8 | "5.8" |
3 | -6.3 | "-6.3" |
In case the column contains a non-numerical value, or a poorly formatted one, Polars will throw an error with details on the conversion error. You can set strict=False to circumvent the error and get a null value instead.
df = pl.DataFrame(
    {
        "floats": ["4.0", "5.8", "- 6 . 3"],
    }
)

df.select(pl.col("floats").cast(pl.Float64))
--------------------------------------------------------------------------- InvalidOperationError Traceback (most recent call last) Cell In[35], line 7 1 df = pl.DataFrame( 2 { 3 "floats": ["4.0", "5.8", "- 6 . 3"], 4 } 5 ) ----> 7 df.select(pl.col("floats").cast(pl.Float64)) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\dataframe\frame.py:9633, in DataFrame.select(self, *exprs, **named_exprs) 9533 def select( 9534 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr 9535 ) -> DataFrame: 9536 """ 9537 Select columns from this DataFrame. 9538 (...) 9631 └──────────────┘ 9632 """ -> 9633 return self.lazy().select(*exprs, **named_exprs).collect(_eager=True) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\_utils\deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs) 89 kwargs["engine"] = "in-memory" 91 del kwargs["streaming"] ---> 93 return function(*args, **kwargs) File c:\Users\WVU\Classes\wvu-ieng-331\wvu-ieng-331.github.io\.venv\Lib\site-packages\polars\lazyframe\frame.py:2188, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs) 2186 # Only for testing purposes 2187 callback = _kwargs.get("post_opt_callback", callback) -> 2188 return wrap_df(ldf.collect(engine, callback)) InvalidOperationError: conversion from `str` to `f64` failed in column 'floats' for 1 out of 3 values: ["- 6 . 3"]
"floats").cast(pl.Float64, strict=False)) df.select(pl.col(
floats |
---|
f64 |
4.0 |
5.8 |
null |
All temporal data types are represented internally as the number of time units elapsed since a reference moment, usually referred to as the epoch. For example, values of the data type Date are stored as the number of days since the epoch. For the data type Datetime the time unit is the microsecond (us) and for Time the time unit is the nanosecond (ns).
Casting between numerical types and temporal data types is allowed and exposes this relationship:
from datetime import date, datetime, time

df = pl.DataFrame(
    {
        "date": [
            date(1970, 1, 1),  # epoch
            date(1970, 1, 10),  # 9 days later
        ],
        "datetime": [
            datetime(1970, 1, 1, 0, 0, 0),  # epoch
            datetime(1970, 1, 1, 0, 1, 0),  # 1 minute later
        ],
        "time": [
            time(0, 0, 0),  # reference time
            time(0, 0, 1),  # 1 second later
        ],
    }
)

result = df.select(
    pl.col("date").cast(pl.Int64).alias("days_since_epoch"),
    pl.col("datetime").cast(pl.Int64).alias("us_since_epoch"),
    pl.col("time").cast(pl.Int64).alias("ns_since_midnight"),
)
result
days_since_epoch | us_since_epoch | ns_since_midnight |
---|---|---|
i64 | i64 | i64 |
0 | 0 | 0 |
9 | 60000000 | 1000000000 |
To format temporal data types as strings we can use the function dt.to_string and to parse temporal data types from strings we can use the function str.to_datetime. Both functions adopt the chrono format syntax for formatting.
df = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 1, 2)],
        "string": ["2022-01-01", "2022-01-02"],
    }
)

result = df.select(
    pl.col("date").dt.to_string("%Y-%m-%d"),
    pl.col("string").str.to_datetime("%Y-%m-%d"),
)
result
date | string |
---|---|
str | datetime[μs] |
"2022-01-01" | 2022-01-01 00:00:00 |
"2022-01-02" | 2022-01-02 00:00:00 |
It’s worth noting that str.to_datetime features additional options that support timezone functionality. Refer to the API documentation for further information.
The following section discusses operations performed on string data, which is a frequently used data type when working with dataframes. String processing functions are available in the namespace str.
When working with string data you will likely need to access the namespace str, which aggregates 40+ functions that let you work with strings. As an example of how to access functions from within that namespace, the snippet below shows how to compute the length of the strings in a column in terms of the number of bytes and the number of characters:
df = pl.DataFrame(
    {
        "language": ["English", "Dutch", "Portuguese", "Finish"],
        "fruit": ["pear", "peer", "pêra", "päärynä"],
    }
)

result = df.with_columns(
    pl.col("fruit").str.len_bytes().alias("byte_count"),
    pl.col("fruit").str.len_chars().alias("letter_count"),
)
result
language | fruit | byte_count | letter_count |
---|---|---|---|
str | str | u32 | u32 |
"English" | "pear" | 4 | 4 |
"Dutch" | "peer" | 4 | 4 |
"Portuguese" | "pêra" | 5 | 4 |
"Finish" | "päärynä" | 10 | 7 |
Polars offers multiple methods for checking and parsing elements of a string column, namely checking for the existence of given substrings or patterns, and counting, extracting, or replacing them. We will demonstrate some of these operations in the upcoming examples.
We can use the function contains to check for the presence of a pattern within a string. By default, the argument to the function contains is interpreted as a regular expression. If you want to specify a literal substring, set the parameter literal to True.
For the special cases where you want to check if the strings start or end with a fixed substring, you can use the functions starts_with or ends_with, respectively.
result = df.select(
    pl.col("fruit"),
    pl.col("fruit").str.starts_with("p").alias("starts_with_p"),
    pl.col("fruit").str.contains("p..r").alias("p..r"),
    pl.col("fruit").str.contains("e+").alias("e+"),
    pl.col("fruit").str.ends_with("r").alias("ends_with_r"),
)
result
fruit | starts_with_p | p..r | e+ | ends_with_r |
---|---|---|---|---|
str | bool | bool | bool | bool |
"pear" | true | true | true | true |
"peer" | true | true | true | true |
"pêra" | true | false | false | false |
"päärynä" | true | true | false | false |
The function extract allows us to extract patterns from the string values in a column. The function extract accepts a regex pattern with one or more capture groups and extracts the capture group specified as the second argument.
df = pl.DataFrame(
    {
        "urls": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
result = df.select(
    pl.col("urls").str.extract(r"candidate=(\w+)", group_index=1),
)
result
urls |
---|
str |
"messi" |
null |
"ronaldo" |
To extract all occurrences of a pattern within a string, we can use the function extract_all. In the example below, we extract all numbers from a string using the regex pattern (\d+), which matches one or more digits. The resulting output of the function extract_all is a list containing all instances of the matched pattern within the string.
df = pl.DataFrame({"text": ["123 bla 45 asd", "xyz 678 910t"]})
result = df.select(
    pl.col("text").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)
result
extracted_nrs |
---|
list[str] |
["123", "45"] |
["678", "910"] |
Akin to the functions extract and extract_all, Polars provides the functions replace and replace_all. These accept a regex pattern or a literal substring (if the parameter literal is set to True) and perform the replacements specified. The function replace will make at most one replacement whereas the function replace_all will make all the non-overlapping replacements it finds.
df = pl.DataFrame({"text": ["123abc", "abc456"]})
result = df.with_columns(
    pl.col("text").str.replace(r"\d", "-"),
    pl.col("text").str.replace_all(r"\d", "-").alias("text_replace_all"),
)
result
text | text_replace_all |
---|---|
str | str |
"-23abc" | "---abc" |
"abc-56" | "abc---" |
Converting the casing of a string is a common operation and Polars supports it out of the box with the functions to_lowercase, to_titlecase, and to_uppercase:
addresses = pl.DataFrame(
    {
        "addresses": [
            "128 PERF st",
            "Rust blVD, 158",
            "PoLaRs Av, 12",
            "1042 Query sq",
        ]
    }
)

addresses = addresses.select(
    pl.col("addresses").alias("originals"),
    pl.col("addresses").str.to_titlecase(),
    pl.col("addresses").str.to_lowercase().alias("lower"),
    pl.col("addresses").str.to_uppercase().alias("upper"),
)
addresses
originals | addresses | lower | upper |
---|---|---|---|
str | str | str | str |
"128 PERF st" | "128 Perf St" | "128 perf st" | "128 PERF ST" |
"Rust blVD, 158" | "Rust Blvd, 158" | "rust blvd, 158" | "RUST BLVD, 158" |
"PoLaRs Av, 12" | "Polars Av, 12" | "polars av, 12" | "POLARS AV, 12" |
"1042 Query sq" | "1042 Query Sq" | "1042 query sq" | "1042 QUERY SQ" |
Polars provides five functions in the namespace str that let you strip characters from the ends of the string:
- strip_chars: removes leading and trailing occurrences of the characters specified.
- strip_chars_end: removes trailing occurrences of the characters specified.
- strip_chars_start: removes leading occurrences of the characters specified.
- strip_prefix: removes an exact substring prefix if present.
- strip_suffix: removes an exact substring suffix if present.
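The short sketch below illustrates a few of these functions; the sample strings are made up for this example:
df_strip = pl.DataFrame({"text": ["  polars  ", "##polars##"]})

result = df_strip.select(
    pl.col("text").str.strip_chars().alias("no_whitespace"),         # strips whitespace by default
    pl.col("text").str.strip_chars("#").alias("no_hashes"),          # strips leading/trailing "#"
    pl.col("text").str.strip_chars_start("# ").alias("start_only"),  # strips only from the start
    pl.col("text").str.strip_prefix("##").alias("no_prefix"),        # removes the exact prefix "##"
)
result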
Besides extracting substrings as specified by patterns, you can also slice strings at specified offsets to produce substrings. The general-purpose function for slicing is slice and it takes the starting offset and the optional length of the slice. If the length of the slice is not specified or if it’s past the end of the string, Polars slices the string all the way to the end.
The functions head and tail are specialised versions used for slicing the beginning and end of a string, respectively.
df = pl.DataFrame(
    {
        "fruits": ["pear", "mango", "dragonfruit", "passionfruit"],
        "n": [1, -1, 4, -4],
    }
)

result = df.with_columns(
    pl.col("fruits").str.slice(pl.col("n")).alias("slice"),
    pl.col("fruits").str.head(pl.col("n")).alias("head"),
    pl.col("fruits").str.tail(pl.col("n")).alias("tail"),
)
result
fruits | n | slice | head | tail |
---|---|---|---|---|
str | i64 | str | str | str |
"pear" | 1 | "ear" | "p" | "r" |
"mango" | -1 | "o" | "mang" | "ango" |
"dragonfruit" | 4 | "onfruit" | "drag" | "ruit" |
"passionfruit" | -4 | "ruit" | "passionf" | "ionfruit" |
In Polars, missing data is represented by the value null. This missing value null is used for all data types, including numerical types. Polars also supports the value NaN (“Not a Number”) for columns with floating point numbers. The value NaN is considered to be a valid floating point value, which is different from missing data.
df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
df
value |
---|
i64 |
1 |
null |
Missing data in a series can be filled with the function fill_null. You can specify how missing data is effectively filled in a couple of different ways:
- with a literal value of your choosing;
- with a Polars expression; or
- based on a fill strategy, such as filling forwards or backwards.
To illustrate how each of these methods works we start by defining a simple dataframe with two missing values in the second column:
df = pl.DataFrame(
    {
        "col1": [0.5, 1, 1.5, 2, 2.5],
        "col2": [1, None, 3, None, 5],
    },
)
df
col1 | col2 |
---|---|
f64 | i64 |
0.5 | 1 |
1.0 | null |
1.5 | 3 |
2.0 | null |
2.5 | 5 |
You can fill the missing data with a specified literal value. This literal value will replace all of the occurrences of the value null:
fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(3),
)
fill_literal_df
col1 | col2 |
---|---|
f64 | i64 |
0.5 | 1 |
1.0 | 3 |
1.5 | 3 |
2.0 | 3 |
2.5 | 5 |
In the general case, the missing data can be filled by extracting the corresponding values from the result of a general Polars expression. For example, we can fill the second column with values taken from the double of the first column:
fill_expression_df = df.with_columns(
    pl.col("col2").fill_null((2 * pl.col("col1")).cast(pl.Int64)),
)
fill_expression_df
col1 | col2 |
---|---|
f64 | i64 |
0.5 | 1 |
1.0 | 2 |
1.5 | 3 |
2.0 | 4 |
2.5 | 5 |
You can also fill the missing data by following a fill strategy based on the neighbouring values. The two simpler strategies look for the first non-null value that comes immediately before or immediately after the value null that is being filled:
fill_forward_df = df.with_columns(
    pl.col("col2").fill_null(strategy="forward").alias("forward"),
    pl.col("col2").fill_null(strategy="backward").alias("backward"),
)
fill_forward_df
col1 | col2 | forward | backward |
---|---|---|---|
f64 | i64 | i64 | i64 |
0.5 | 1 | 1 | 1 |
1.0 | null | 1 | 3 |
1.5 | 3 | 3 | 3 |
2.0 | null | 3 | 5 |
2.5 | 5 | 5 | 5 |
You can find other fill strategies in the API docs.
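For instance, the sketch below applies two of those strategies to the same df; the "mean" and "zero" strategies fill missing entries with the column's mean and with zero, respectively:
fill_strategy_df = df.with_columns(
    pl.col("col2").fill_null(strategy="mean").alias("mean_fill"),
    pl.col("col2").fill_null(strategy="zero").alias("zero_fill"),
)
fill_strategy_df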
Additionally, you can fill missing data with interpolation by using the function interpolate instead of the function fill_null:
fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),
)
fill_interpolation_df
col1 | col2 |
---|---|
f64 | f64 |
0.5 | 1.0 |
1.0 | 2.0 |
1.5 | 3.0 |
2.0 | 4.0 |
2.5 | 5.0 |
Missing data in a series is only ever represented by the value null, regardless of the data type of the series. Columns with a floating point data type can sometimes have the value NaN, which might be confused with null.
The special value NaN can be created directly:
import numpy as np

nan_df = pl.DataFrame(
    {
        "value": [1.0, np.nan, float("nan"), 3.0],
    },
)
nan_df
value |
---|
f64 |
1.0 |
NaN |
NaN |
3.0 |
And it might also arise as the result of a computation:
df = pl.DataFrame(
    {
        "dividend": [1, 0, -1],
        "divisor": [1, 0, -1],
    }
)
result = df.select(pl.col("dividend") / pl.col("divisor"))
result
dividend |
---|
f64 |
1.0 |
NaN |
1.0 |
NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:
- NaN values are not counted with the function null_count; and
- NaN values are filled when you use the specialised function fill_nan but are not filled with the function fill_null.
Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null. Unlike with missing data, Polars does not hold any metadata regarding the NaN values, so the function is_nan entails actual computation.
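The sketch below contrasts the two checks on the nan_df dataframe defined above; only the NaN entries are flagged by is_nan and none of them are flagged by is_null:
result = nan_df.select(
    pl.col("value"),
    pl.col("value").is_nan().alias("is_nan"),
    pl.col("value").is_null().alias("is_null"),
)
result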
One further difference between the values null and NaN is that numerical aggregating functions, like mean and sum, skip the missing values when computing the result, whereas the value NaN is considered for the computation and typically propagates into the result. If desirable, this behavior can be avoided by replacing the occurrences of the value NaN with the value null:
mean_nan_df = nan_df.with_columns(
    pl.col("value").fill_nan(None).alias("replaced"),
).select(
    pl.all().mean().name.suffix("_mean"),
    pl.all().sum().name.suffix("_sum"),
)
mean_nan_df
value_mean | replaced_mean | value_sum | replaced_sum |
---|---|---|---|
f64 | f64 | f64 | f64 |
NaN | 2.0 | NaN | 4.0 |
The focus of this section is to describe different types of data transformations and provide some examples on how to use them.
A join operation combines columns from one or more dataframes into a new dataframe. The different “joining strategies” and matching criteria used by the different types of joins influence how columns are combined and also what rows are included in the result of the join operation.
In a join, rows are matched by checking equality of a key expression. You can do a join with the function join by specifying the name of the column to be used as key. For the examples, we will be loading some (modified) Monopoly property data.
First, we load a dataframe that contains property names and their colour group in the game:
props_groups = pl.read_csv("data/monopoly_props_groups.csv")
props_groups
property_name | group |
---|---|
str | str |
"Old Ken Road" | "brown" |
"Whitechapel Road" | "brown" |
"The Shire" | "fantasy" |
"Kings Cross Station" | "stations" |
… | … |
"Bond Street" | "green" |
"Liverpool Street Station" | "stations" |
"Park Lane" | "dark_blue" |
"Mayfair" | "dark_blue" |
Next, we load a dataframe that contains property names and their price in the game:
props_prices = pl.read_csv("data/monopoly_props_prices.csv")
props_prices
property_name | cost |
---|---|
str | i64 |
"Old Ken Road" | 60 |
"Whitechapel Road" | 60 |
"Sesame Street" | 100 |
"Kings Cross Station" | 200 |
… | … |
"Bond Street" | 320 |
"Liverpool Street Station" | 200 |
"Park Lane" | 350 |
"Mayfair" | 400 |
Now, we join both dataframes to create a dataframe that contains property names, colour groups, and prices:
result = props_groups.join(props_prices, on="property_name")
result
property_name | group | cost |
---|---|---|
str | str | i64 |
"Old Ken Road" | "brown" | 60 |
"Whitechapel Road" | "brown" | 60 |
"Kings Cross Station" | "stations" | 200 |
"The Angel, Islington" | "light_blue" | 100 |
… | … | … |
"Bond Street" | "green" | 320 |
"Liverpool Street Station" | "stations" | 200 |
"Park Lane" | "dark_blue" | 350 |
"Mayfair" | "dark_blue" | 400 |
The result has four rows but both dataframes used in the operation had five rows. Polars uses a joining strategy to determine what happens with rows that have multiple matches or with rows that have no match at all. By default, Polars computes an “inner join” but there are other join strategies that we show next.
In the example above, the two dataframes conveniently had the column we wish to use as key with the same name and with the values in the exact same format. Suppose, for the sake of argument, that one of the dataframes had a differently named column and the other had the property names in lower case:
props_groups2 = props_groups.with_columns(
    pl.col("property_name").str.to_lowercase(),
)
props_groups2
property_name | group |
---|---|
str | str |
"old ken road" | "brown" |
"whitechapel road" | "brown" |
"the shire" | "fantasy" |
"kings cross station" | "stations" |
… | … |
"bond street" | "green" |
"liverpool street station" | "stations" |
"park lane" | "dark_blue" |
"mayfair" | "dark_blue" |
props_prices2 = props_prices.select(
    pl.col("property_name").alias("name"), pl.col("cost")
)
props_prices2
name | cost |
---|---|
str | i64 |
"Old Ken Road" | 60 |
"Whitechapel Road" | 60 |
"Sesame Street" | 100 |
"Kings Cross Station" | 200 |
… | … |
"Bond Street" | 320 |
"Liverpool Street Station" | 200 |
"Park Lane" | 350 |
"Mayfair" | 400 |
In a situation like this, where we may want to perform the same join as before, we can leverage join’s flexibility and specify arbitrary expressions to compute the joining key on the left and on the right, allowing one to compute row keys dynamically:
result = props_groups2.join(
    props_prices2,
    left_on="property_name",
    right_on=pl.col("name").str.to_lowercase(),
)
result
property_name | group | name | cost |
---|---|---|---|
str | str | str | i64 |
"old ken road" | "brown" | "Old Ken Road" | 60 |
"whitechapel road" | "brown" | "Whitechapel Road" | 60 |
"kings cross station" | "stations" | "Kings Cross Station" | 200 |
"the angel, islington" | "light_blue" | "The Angel, Islington" | 100 |
… | … | … | … |
"bond street" | "green" | "Bond Street" | 320 |
"liverpool street station" | "stations" | "Liverpool Street Station" | 200 |
"park lane" | "dark_blue" | "Park Lane" | 350 |
"mayfair" | "dark_blue" | "Mayfair" | 400 |
Because we are joining on the right with an expression, Polars preserves the column “property_name” from the left and the column “name” from the right so we can have access to the original values that the key expressions were applied to.
When computing a join with df1.join(df2, ...), we can specify one of many different join strategies. A join strategy specifies what rows to keep from each dataframe based on whether they match rows from the other dataframe.
In an inner join the resulting dataframe only contains the rows from the left and right dataframes that matched. That is the default strategy used by join and above we can see an example of that. We repeat the example here and explicitly specify the join strategy:
result = props_groups.join(props_prices, on="property_name", how="inner")
result
property_name | group | cost |
---|---|---|
str | str | i64 |
"Old Ken Road" | "brown" | 60 |
"Whitechapel Road" | "brown" | 60 |
"Kings Cross Station" | "stations" | 200 |
"The Angel, Islington" | "light_blue" | 100 |
… | … | … |
"Bond Street" | "green" | 320 |
"Liverpool Street Station" | "stations" | 200 |
"Park Lane" | "dark_blue" | 350 |
"Mayfair" | "dark_blue" | 400 |
The result does not include the row from props_groups that contains “The Shire” and the result also does not include the row from props_prices that contains “Sesame Street”.
A left outer join is a join where the result contains all the rows from the left dataframe and the rows of the right dataframe that matched any rows from the left dataframe.
result = props_groups.join(props_prices, on="property_name", how="left")
result
property_name | group | cost |
---|---|---|
str | str | i64 |
"Old Ken Road" | "brown" | 60 |
"Whitechapel Road" | "brown" | 60 |
"The Shire" | "fantasy" | null |
"Kings Cross Station" | "stations" | 200 |
… | … | … |
"Bond Street" | "green" | 320 |
"Liverpool Street Station" | "stations" | 200 |
"Park Lane" | "dark_blue" | 350 |
"Mayfair" | "dark_blue" | 400 |
If there are any rows from the left dataframe that have no matching rows on the right dataframe, they get the value null on the new columns.
Computationally speaking, a right outer join is exactly the same as a left outer join, but with the arguments swapped. Here is an example:
result = props_groups.join(props_prices, on="property_name", how="right")
result
group | property_name | cost |
---|---|---|
str | str | i64 |
"brown" | "Old Ken Road" | 60 |
"brown" | "Whitechapel Road" | 60 |
null | "Sesame Street" | 100 |
"stations" | "Kings Cross Station" | 200 |
… | … | … |
"green" | "Bond Street" | 320 |
"stations" | "Liverpool Street Station" | 200 |
"dark_blue" | "Park Lane" | 350 |
"dark_blue" | "Mayfair" | 400 |
We show that df1.join(df2, how="right", ...) is the same as df2.join(df1, how="left", ...), up to the order of the columns of the result, with the computation below:
print(
    result.equals(
        props_prices.join(
            props_groups,
            on="property_name",
            how="left",
            # Reorder the columns to match the order from above.
        ).select(pl.col("group"), pl.col("property_name"), pl.col("cost"))
    )
)
True
A full outer join will keep all of the rows from the left and right dataframes, even if they don’t have matching rows in the other dataframe:
result = props_groups.join(props_prices, on="property_name", how="full")
result
property_name | group | property_name_right | cost |
---|---|---|---|
str | str | str | i64 |
"Old Ken Road" | "brown" | "Old Ken Road" | 60 |
"Whitechapel Road" | "brown" | "Whitechapel Road" | 60 |
null | null | "Sesame Street" | 100 |
"Kings Cross Station" | "stations" | "Kings Cross Station" | 200 |
… | … | … | … |
"Liverpool Street Station" | "stations" | "Liverpool Street Station" | 200 |
"Park Lane" | "dark_blue" | "Park Lane" | 350 |
"Mayfair" | "dark_blue" | "Mayfair" | 400 |
"The Shire" | "fantasy" | null | null |
In this case, we see that we get two columns property_name and property_name_right to make up for the fact that we are matching on the column property_name of both dataframes and there are some names for which there are no matches. The two columns help differentiate the source of each row of data. If we wanted to force join to coalesce the two columns property_name into a single column, we could set coalesce=True explicitly:
result = props_groups.join(
    props_prices,
    on="property_name",
    how="full",
    coalesce=True,
)
result
property_name | group | cost |
---|---|---|
str | str | i64 |
"Old Ken Road" | "brown" | 60 |
"Whitechapel Road" | "brown" | 60 |
"Sesame Street" | null | 100 |
"Kings Cross Station" | "stations" | 200 |
… | … | … |
"Liverpool Street Station" | "stations" | 200 |
"Park Lane" | "dark_blue" | 350 |
"Mayfair" | "dark_blue" | 400 |
"The Shire" | "fantasy" | null |
When not set, the parameter coalesce is determined automatically from the join strategy and the key(s) specified, which is why the inner, left, and right joins acted as if coalesce=True, even though we didn’t set it.
A semi join will return the rows of the left dataframe that have a match in the right dataframe, but we do not actually join the matching rows:
result = props_groups.join(props_prices, on="property_name", how="semi")
result
property_name | group |
---|---|
str | str |
"Old Ken Road" | "brown" |
"Whitechapel Road" | "brown" |
"Kings Cross Station" | "stations" |
"The Angel, Islington" | "light_blue" |
… | … |
"Bond Street" | "green" |
"Liverpool Street Station" | "stations" |
"Park Lane" | "dark_blue" |
"Mayfair" | "dark_blue" |
A semi join acts as a sort of row filter based on a second dataframe.
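As a sketch of that idea, the same set of rows can be obtained by filtering props_groups on membership of its key in props_prices:
# Keep only the properties whose name also appears in props_prices.
filtered = props_groups.filter(
    pl.col("property_name").is_in(props_prices.get_column("property_name"))
)
filtered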
Conversely, an anti join will return the rows of the left dataframe that do not have a match in the right dataframe:
result = props_groups.join(props_prices, on="property_name", how="anti")
result
property_name | group |
---|---|
str | str |
"The Shire" | "fantasy" |
There are a number of ways to concatenate data from separate DataFrames: vertically, horizontally, and diagonally.
In a vertical concatenation you combine all of the rows from a list of DataFrames into a single longer DataFrame.
df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)
df_vertical_concat
a | b |
---|---|
i64 | i64 |
1 | 3 |
2 | 4 |
Vertical concatenation fails when the dataframes do not have the same column names.
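A minimal sketch of that failure (the dataframes and the generic exception handling here are just for illustration):
df_bad1 = pl.DataFrame({"a": [1]})
df_bad2 = pl.DataFrame({"b": [2]})

try:
    pl.concat([df_bad1, df_bad2], how="vertical")
except Exception as exc:
    print(type(exc).__name__)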
In a horizontal concatenation you combine all of the columns from a list of DataFrames into a single wider DataFrame.
df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6],
        "r2": [7, 8],
        "r3": [9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
)
df_horizontal_concat
l1 | l2 | r1 | r2 | r3 |
---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 |
1 | 3 | 5 | 7 | 9 |
2 | 4 | 6 | 8 | 10 |
Horizontal concatenation fails when dataframes have overlapping columns.
When dataframes have different numbers of rows, columns will be padded with null values at the end up to the maximum length.
df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6, 7],
        "r2": [8, 9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
)
df_horizontal_concat
l1 | l2 | r1 | r2 |
---|---|---|---|
i64 | i64 | i64 | i64 |
1 | 3 | 5 | 8 |
2 | 4 | 6 | 9 |
null | null | 7 | 10 |
In a diagonal concatenation you combine all of the rows and columns from a list of DataFrames into a single longer and/or wider DataFrame.
df_d1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_d2 = pl.DataFrame(
    {
        "a": [2],
        "d": [4],
    }
)

df_diagonal_concat = pl.concat(
    [
        df_d1,
        df_d2,
    ],
    how="diagonal",
)
df_diagonal_concat
a | b | d |
---|---|---|
i64 | i64 | i64 |
1 | 3 | null |
2 | null | 4 |
Diagonal concatenation generates nulls when the column names do not overlap.
Now let’s practice with some hands-on exercises using the Northwind dataset. We’ll work with several CSVs including customers, orders, products, and more to demonstrate data wrangling concepts with Polars.
1. Load the customers.csv, orders.csv, and order_details.csv files.
2. Calculate the total for each order line (unitPrice * quantity * (1-discount)).
3. In the orders.csv file, identify orders where the shippedDate is missing.
4. In the customers.csv file, create a new column that extracts the domain name from the contact’s email address.
5. … the products.csv file.
6. Join the orders dataframe with the customers dataframe to see customer information with each order.
7. Join the orders dataframe with the order_details dataframe to get a complete view of each order.
8. Join orders, order_details, and products to show what products were ordered.