For those who use Python's pandas module daily, one of the first things you notice is that there is often more than one way to do almost everything.
The purpose of this article is to demonstrate how we can limit this by drawing inspiration from R's dplyr and tidyverse libraries.
Tidying up pandas?
For academics, the go-to lingua franca for data science is often R, especially if you're coming from Computational Biology/Bioinformatics or Statistics.
And likely you'll be hooked on the famous tidyverse meta-package, which includes dplyr (previously plyr, for ply(e)r), lubridate (time series) and tidyr.
PS. As I am writing this article I realised it isn't just tidyverse, but the whole R ecosystem, which I've come to love whilst doing metagenomics and computational biology in general.
For the benefit of those who started from R: pandas is the dataframe library for Python. Several other packages like datatable exist as well, and that one is heavily inspired by R's own data.table.
Now back to how the tidyverse, specifically dplyr, organises dataframe manipulation.
In his talk, Hadley Wickham mentioned that what we really need for table manipulation is just a handful of functions:
- filter
- select
- arrange
- mutate
- group_by
- summarise
- merge
Although I would argue you need just a bit more. For example, knowing R's family of apply functions will help tonnes, as will a couple of summary statistics functions like summary or str, although nowadays I use skimr::skim a lot.
skim(iris)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁
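pandas itself doesn't ship a skim equivalent, but as a rough analogue (a minimal sketch for any DataFrame df) you can get per-column summaries with describe:
# Python
# include='all' also summarises non-numeric (e.g. string/categorical) columns
df.describe(include="all")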
In fact, Google’s Facets behaves somewhat like this as well (see image below).
Thus, in this post I'll try my best to demonstrate 1-to-1 mappings of the tidyverse vocabulary to pandas methods.
For demonstration, we will be using the famous Iris flower dataset.
# python
import seaborn as sns
iris = sns.load_dataset("iris")
I've chosen to import the iris data using seaborn rather than sklearn's datasets, which are numpy arrays.
The first thing I usually do when I import a table is to run the str function on it.
# R (iris is already loaded by default)
str(iris)
# python
iris.info(null_counts=True)
# if the number of rows is too large, pandas skips the null counts by default,
# so I have to forcibly set `null_counts` to `True`.
Filter
The closest pandas method to R's filter is pd.DataFrame.query.
# R
cutoff = 30
iris %>%
    filter(Sepal.Width > cutoff)
There are two ways to do this in Python. The first is probably what you'll find most Python users using:
# python
cutoff = 30
iris[iris.sepal_width > cutoff]
However, pd.DataFrame.query() maps more closely to dplyr::filter().
# Python
iris. \
    query("sepal_width > @cutoff")  # this uses a SQL-like expression syntax
One downside of this approach is that linters which follow the PEP 8 convention, like flake8, will complain about the cutoff variable not being used even though it has been declared. This is because the linters are unable to recognise the use of cutoff inside the quoted query string.
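If the warning bothers you, one workaround (just a sketch; F841 is the pyflakes/flake8 code for a local variable that is assigned but never used, which typically only fires inside a function body) is to silence that specific line:
# Python
def wide_sepals(iris, width=30):
    cutoff = width  # noqa: F841 -- referenced inside the query string below
    return iris.query("sepal_width > @cutoff")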
Surprisingly, filter makes a comeback in PySpark. :)
# python (pyspark)
type(flights)
# pyspark.sql.dataframe.DataFrame
# filters flights which are > 1000 miles long
flights.filter('distance > 1000')
Select
This is reminiscent of SQL's select keyword, which allows you to choose columns.
# R
iris %>%
    select(Sepal.Width, Sepal.Length)
# Python
iris \
    .loc[:5, ["sepal_width", "sepal_length"]]  # .loc slicing is label-based and inclusive, so this selects rows 0-5
Initially, I thought the df[['col1', 'col2']] pattern would be a good mapping, but I quickly realised we cannot take slices of columns with it the way select allows.
# R
iris %>% select(Sepal.Length:Petal.Width)
# Python
iris.loc[:, "sepal_length":"petal_width"]
A thing to note about the loc method is that it returns a Series instead of a DataFrame when the selection is just one row, so you'll have to pass the label in a list in order to get a DataFrame back.
# Python
iris.loc[1, :] # returns a Series
iris.loc[[1],:] # returns a dataframe
But the really awesome thing about select is its ability to unselect columns, which is missing from the loc method.
# R
df %>% select(-col1)
In pandas you have to use the .drop() method instead.
# Python
df.drop(columns=["col1"])
Note I had to add the columns parameter, because drop can be used not only to drop columns but also to drop rows based on their index.
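To make the distinction explicit, a minimal sketch of both usages (the column and row labels here are just placeholders):
# Python
df.drop(columns=["col1"])  # drop a column by name
df.drop(index=[0, 1])      # drop rows by their index labels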
Like filter, select is also used in PySpark!
# python (pySpark)
df.select("xyz").show() # shows the column xyz of the spark dataframe.
# alternative
df.select(df.xyz)
Arrange
The arrange function lets one sort the table by a particular column.
# R
df %>% arrange(desc(col1))  # desc() sorts in descending order
# Python
df.sort_values(by="col1", ascending=False) # everything is reversed in python fml.
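And if you need the equivalent of arrange(col1, desc(col2)), sort_values accepts lists (a sketch with placeholder column names):
# Python
df.sort_values(by=["col1", "col2"], ascending=[True, False])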
Mutate
dplyr's mutate was really an upgrade over base R's apply.
NOTE: other apply variants which are useful in R include mapply and lapply.
# R
df %>% mutate(
    new = something / col2,
    newcol = col + 1
)
# Python
iris.assign(
    new = iris.sepal_width / iris.sepal_length,
    newcol = lambda x: x["sepal_width"] + 1
)
tidyverse's mutate function by default takes the whole column and does vectorised operations on it. If you want to apply the function row by row, you'll have to couple rowwise with mutate.
# R
# my_function does not take vectorised input of the entire column
# this will fail
iris %>%
    mutate(new_column = my_function(Sepal.Width, Sepal.Length))

# this will force mutate to be applied row by row
iris %>%
    rowwise() %>%
    mutate(new_column = my_function(Sepal.Width, Sepal.Length))
To achieve the same using the .assign method, you can nest an apply inside the assigned function.
# Python
import re

def do_something_string(col):
    if re.search(r".*(osa)$", col):
        value = "is_setosa"
    else:
        value = "not_setosa"
    return value

iris = iris.assign(
    transformed_species = lambda df: df["species"]
        .apply(do_something_string)
)
If you're lazy, you could just collapse this into a one-liner with an anonymous function.
# Python
iris = iris.assign(
    transformed_species = lambda df: df.species.apply(do_something_string))
Apply
From R's apply help docs:
apply(X, MARGIN, FUN, ...)
where MARGIN takes either 1 or 2 for (rows, columns); i.e. if you want to apply the function to each row, you set MARGIN to 1.
However, in pandas, axis refers to which labels (the index or the columns) become the index of the Series passed into the applied function: with axis=0 the function receives each column (indexed by the DataFrame's index), and with axis=1 it receives each row (indexed by the column names). In R, a row-wise operation looks like this:
# R
df %>%
    apply(1, function(row){
        ...
        do some compute
        ...
    })
I rarely do that now, thanks to plyr and later dplyr. However, there is no plyr in pandas, so we have to go back to using apply for row-wise operations, and here the axis is 1. I initially found this very confusing. The reason it makes sense is that each row is really just a pandas.Series whose index is the parent pandas.DataFrame's columns; the axis argument refers to which axis is used as the index of the Series handed to the function.
# python
iris.apply(lambda row: do_something(row), axis=1)
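To make the axis distinction concrete, here's a minimal sketch (assuming iris is loaded as above) showing what the function receives in each case:
# python
numeric = iris.drop(columns="species")         # keep only the numeric columns
numeric.apply(lambda col: col.mean(), axis=0)  # each `col` is a column, indexed by the row labels
numeric.apply(lambda row: row.sum(), axis=1)   # each `row` is a row, indexed by the column names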
An interesting pattern which I do not use in R is to use apply on a single column, in this case a pandas.Series object.
# python
iris.sepal_width.apply(lambda x: x**2)

# if you want a fancy progress bar, you can use tqdm's progress_apply
from tqdm import tqdm
tqdm.pandas()
iris.sepal_width.progress_apply(lambda x: x**2)

# if you need a parallel apply (this works with dask underneath)
import swifter
iris.sepal_width.swifter.apply(lambda x: x**2)
In R, one of the common idioms which I keep going back to for a parallel version of groupby is as follows:
# R
unique_list %>%
    lapply(function(x){
        ...
        df %>% filter(col == x) %>%
            do_something() # do something to the subset
        ...
    }) %>% do.call(rbind, .)
If you want a parallel version, you just have to change the lapply to mclapply, which comes from R's parallel/snow libraries.
# R
ncores = 10 # the number of cores
unique_list %>%
    mclapply(function(x){
        ...
        df %>% filter(col == x) %>%
            do_something() # do something to the subset
        ...
    }, mc.cores=ncores) %>% do.call(rbind, .)
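There is no mclapply in pandas, but a rough equivalent of this idiom (just a sketch; do_something is the same placeholder per-group function as in the R version, and it has to be a top-level function so it can be pickled) uses multiprocessing over the groups:
# Python
import pandas as pd
from multiprocessing import Pool

def process_group(key_and_subset):
    key, subset = key_and_subset  # iterating a groupby yields (group key, sub-DataFrame) pairs
    return do_something(subset)   # placeholder, as in the R sketch

with Pool(processes=10) as pool:
    pieces = pool.map(process_group, df.groupby("col"))

result = pd.concat(pieces)        # the do.call(rbind, .) step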
Separately, in PySpark you can split the whole table into partitions and do the manipulations on each partition in parallel. To achieve the same in pandas-land we can use dask, or a higher-level wrapper like the swifter library. Below is the dask version (nearest_street here is a placeholder row-wise function):
# Python (dask)
import dask.dataframe as dd

dd.from_pandas(my_df, npartitions=nCores). \
    map_partitions(
        lambda df: df.apply(
            lambda x: nearest_street(x.lat, x.lon), axis=1)). \
    compute(scheduler="processes")  # run the partitions on a multiprocessing scheduler
# Python
# you can speed up the earlier apply by adding the `swifter` accessor before `.apply`
import swifter
series.swifter.apply(lambda x: x**2)
Group by
The .groupby method in pandas is the equivalent of the R function dplyr::group_by, returning a DataFrameGroupBy object.
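For example (assuming iris is loaded as above):
# Python
by_species = iris.groupby("species")
type(by_species)  # a pandas DataFrameGroupBy object, analogous to a grouped tibble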
In the tidyverse there's the ungroup function to ungroup grouped dataframes. There is no 1-to-1 mappable function in pandas; one way to achieve the same is to complete the groupby -> apply two-step process, feeding apply an identity function, apply(lambda x: x).
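A minimal sketch of that identity-apply trick:
# Python
ungrouped = iris.groupby("species").apply(lambda x: x)  # back to a plain DataFrame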
Summarise
In pandas the equivalent of the summarise function is aggregate, abbreviated as the agg function, and you will have to couple this with groupby, so again it's a two-step groupby -> agg transformation.
# R
r_mt = mtcars %>%
    mutate(model = rownames(mtcars)) %>%
    select(cyl, model, hp, drat) %>%
    filter(cyl < 8) %>%
    group_by(cyl) %>%
    summarise(
        hp_mean = mean(hp),
        drat_mean = mean(drat),
        drat_std = sd(drat),
        diff = max(drat) - min(drat)
    ) %>%
    arrange(drat_mean) %>%
    as.data.frame
The same series of transformations written in Python would be:
# Python
def transform1(x):
    return max(x) - min(x)

def transform2(x):
    return max(x) + 5

py_mt = (
    mtcars
    .loc[:, ["cyl", "model", "hp", "drat"]]  # select
    .query("cyl < 8")                        # filter
    .groupby("cyl")                          # group_by
    .agg({                                   # summarise; agg is short for aggregate
        'hp': 'mean',
        'drat': ['mean', 'std', transform1, transform2]  # R wins... this sux for pandas
    })
    .sort_values(by=[("drat", "mean")])      # multi-index sort (unique to pandas)
)
py_mt
In general, the R pattern is:
# R
df %>%
    group_by(col) %>%
    summarise(my_new_column = do_something(some_col))
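A rough pandas counterpart, as a sketch assuming pandas >= 0.25 where named aggregation is available (do_something and the column names are the same placeholders as above):
# Python
df.groupby("col").agg(my_new_column=("some_col", do_something))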
Join
Natively, R supports the merge function, and similarly in pandas there's the pd.merge function. Alongside merge, dplyr adds the family of join functions: left_join, right_join, inner_join and anti_join.
Inplace
In R there's the compound assignment pipe operator %<>%, which is similar to the inplace=True argument available in some pandas methods, but not all. :(
Apparently Pandas is going to remove inplace altogether…
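For what it's worth, a minimal sketch of the two styles in pandas (the chain-friendly reassignment is the safer bet if inplace does go away):
# Python
df.sort_values("col1", inplace=True)  # mutates df and returns None, so it can't be chained
df = df.sort_values("col1")           # reassignment: the chain-friendly idiom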
Debugging
In R, we have the browser() function.
# R
unique(iris$Species) %>%
    lapply(function(s){
        browser()
        iris %>% filter(Species == s)
        ....
    })
It lets you step into the function, which is extremely useful if you want to do some debugging.
In Python, there's the set_trace function.
# Python
from IPython.core.debugger import set_trace
(
    iris
    .groupby("species")
    .apply(lambda groupedDF: set_trace())
)
Last but not least, if you really need to use some R function you can always rely on the rpy2 package. I rely on this a lot for plotting. ggplot2 ftw!
# python (inside a Jupyter/IPython session)
import rpy2  # imports the library
# load the R cell/line magics
%load_ext rpy2.ipython
Sometimes there are issues installing R packages from within R. Instead, you can install them via conda:
conda install -c r r-tidyverse r-ggplot2
Thereafter you can use R and Python interchangeably in the same Jupyter notebook.
%%R -i python_df -o transformed_df
transformed_df = python_df %>%
    select(-some_columns) %>%
    mutate(newcol = somecol * 2)
NOTE: %%R is a cell magic and %R is a line magic.
If you need outputs to be printed like a normal pandas DataFrame, you can use the single-percent line magic:
%R some_dataFrame %>% skim
Ellipsis
In R, one nifty trick is that you can pass arguments to inner functions without ever having to define them in the outer function's signature.
# R
#' Simple function which takes two parameters `one` and `two`, plus the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four, ...){  # the `...` here absorbs any extra arguments
        x + four
    }
    sometwo(three, ...) # four exists within the ellipsis
}
# because of the ellipsis, we can pass as many parameters as we want; the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
In Python, **kwargs takes the place of R's ... . Below is an explanation of how exactly it works.
Explanation
Firstly, the double asterisk ** is called the unpacking operator; it's placed before a parameter name, e.g. kwargs, so together it looks like **kwargs. The convention is to name that parameter kwargs (which stands for keyword arguments), but it could be named anything.
Most articles which describe the unpacking operator start off with this use case: a dictionary is unpacked to pass a function its parameters.
# Python
adictionary = {
    'first' : 1,
    'second': 2
}

def some_function(first, second): return first + second

some_function(**adictionary)
# which gives 3
But you could also twist this around and put **kwargs in the function's signature. Doing this lets you pass an arbitrary number of keyword arguments when calling the function. The keyword-value pairs are wrapped into a dictionary named kwargs, which is accessible inside the function.
# Python
# dummy function which prints `kwargs`
def some_function (**kwargs): print(kwargs)
some_function(first=1, second=2)
The previous two cases are not exclusive; you could actually mix them together, i.e. have named parameters as well as **kwargs.
# Python
adictionary = {
    'first' : 1,
    'second': 2,
    'useless_value' : "wesley"
}

def some_function(first, second, **kwargs):
    print(kwargs)
    return first + second

print(some_function(**adictionary))
The output will be {'useless_value': 'wesley'}, followed by the returned value 3.
It allows a Python function to accept as many keyword arguments as you supply it. Those which are named in the function's definition are bound directly, and those which are not can be accessed from kwargs.
By passing **kwargs to the inner function, you're basically unpacking the dictionary into that function's parameters.
# Python
def somefunction(one, two, **kwargs):
    print(f"outer function:\n\t{kwargs}")
    three = one + two
    def sometwo(x, four, **kwargs):  # the inner **kwargs absorbs any extra keyword arguments
        print(f"inner function:\n\t{kwargs}")
        return x + four
    return sometwo(three, **kwargs)
somefunction(one=2, two=3, four=5, name="wesley")

outer function:
    {'four': 5, 'name': 'wesley'}
inner function:
    {'name': 'wesley'}
Let's now compare this with the original R ellipsis example.
# R
#' Simple function which takes two parameters `one` and `two`, plus the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four, ...){  # the `...` here absorbs any extra arguments
        x + four
    }
    sometwo(three, ...) # four exists within the ellipsis
}
# because of the ellipsis, we can pass as many parameters as we want; the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
Conclusion
There are many ways to do things in pandas, more so than the tidyverse way, and I wish this was clearer.
Additionally, something which caught me off guard after coming to Honestbee was the amount of SQL I needed.
For example, PostgreSQL to query RDS and its dialect for querying Redshift, KSQL for querying data streams via Kafka, and Athena's query language, built on top of Presto, for querying S3, where most of the data used to live in Parquet files.
This shows one big deviation from academia: in a company, data is usually stored in a database / data lake / data stream, whereas in academia it's usually just one big flat file.
We've come to the end of this attempt at mapping tidyverse vocabulary to pandas. I hope you've found this informative and useful! See you soon!