I had started learning Python long before I ever used R. Then I started a job where all the legacy code was in R, and I was more or less forced to learn it, albeit reluctantly. But once I got going, I found the language surprisingly good. Before learning Python I had mostly used VBA and SAS. What I came to like most about R was the method chaining in dplyr. A task that could easily take a few dozen lines of code in SAS, I could do in a few lines of R. Let’s look at an example.
Sample code in SAS
/* Step 1: Filter rows where mpg > 20 */
data filtered_mtcars;
    set mtcars;
    if mpg > 20;
run;

/* Step 2: Select specific columns */
data selected_columns;
    set filtered_mtcars(keep=mpg cyl gear wt);
run;

/* Step 3: Create a new column 'gear_ratio' */
data mutated_data;
    set selected_columns;
    gear_ratio = mpg / cyl;
run;

/* Step 4: Arrange rows by 'gear_ratio' in descending order */
proc sort data=mutated_data out=sorted_data;
    by descending gear_ratio;
run;

/* Step 5: Group by 'gear' and summarize to get mean of 'mpg' and 'wt' by 'gear' */
proc sql;
    create table result as
    select gear,
           mean(mpg) as mean_mpg,
           mean(wt) as mean_wt
    from sorted_data
    group by gear;
quit;

/* View the result */
proc print data=result;
run;
Implementation in R
library(dplyr)
# Example data manipulation in R with dplyr
result <- mtcars %>%
  filter(mpg > 20) %>%                 # Filter rows where mpg > 20
  select(mpg, cyl, gear, wt) %>%       # Select specific columns
  mutate(gear_ratio = mpg / cyl) %>%   # Create a new column 'gear_ratio'
  arrange(desc(gear_ratio)) %>%        # Arrange rows by 'gear_ratio' in descending order
  group_by(gear) %>%                   # Group by 'gear'
  summarize(mean_mpg = mean(mpg), mean_wt = mean(wt)) %>%  # Mean of 'mpg' and 'wt' by 'gear'
  ungroup()                            # Remove grouping information
result
# A tibble: 3 × 3
   gear mean_mpg mean_wt
  <dbl>    <dbl>   <dbl>
      3    21.45  2.8400
      4    25.74  2.4520
      5    28.20  1.8265
We can see how concise the R code is compared to its SAS equivalent. Not sure why anyone would want to pay for SAS when open-source alternatives can do so much more and are free! Anyway, coming back to the point: thanks to method chaining, the R code is not only concise but also very easy to follow.
Now let’s see how we can do the same thing in Python.
Implementation in Python: without method chaining
import pandas as pd

mtcars = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/898a40b035f7c951579041aecbfb2149331fa9f6/mtcars.csv')
# Step 1: Filter rows where mpg > 20
filtered_mtcars = mtcars[mtcars['mpg'] > 20]
# Step 2: Select specific columns
selected_columns = filtered_mtcars[['mpg', 'cyl', 'gear', 'wt']].copy()  # .copy() makes an independent DataFrame, which is what avoids SettingWithCopyWarning in the next step
# Step 3: Create a new column 'gear_ratio'
selected_columns['gear_ratio'] = selected_columns['mpg'] / selected_columns['cyl']
# Step 4: Arrange rows by 'gear_ratio' in descending order
sorted_data = selected_columns.sort_values(by='gear_ratio', ascending=False)
# Step 5: Group by 'gear' and summarize to get mean of 'mpg' and 'wt' by 'gear'
result = sorted_data.groupby('gear').agg(mean_mpg=('mpg', 'mean'), mean_wt=('wt', 'mean')).reset_index()
# Display the result
print(result)
   gear  mean_mpg  mean_wt
0     3     21.45   2.8400
1     4     25.74   2.4520
2     5     28.20   1.8265
OK, at first glance the Python code is also quite concise, but it is not as readable as the R version. So what are we missing here? The answer is that most tutorials don’t show method chaining, because it is not as much the norm in Python as it is in R. That does not make Python less useful, but it is certainly a bit inconvenient for people coming from R who are used to method chaining.
There are some Python packages that try to replicate R-style piping using symbols like “>>”. But I believe that is unnecessary, and making a cheap copy of R in Python may even be counterproductive. At best it will just be a copy. The two languages are inherently different: R is more specific to data analysis, while Python is more of a general-purpose language. It is more advisable to leverage the method chaining that comes naturally in Python.
The other issue is that in Python there is more than one way to skin the cat. This makes things a little more challenging for beginners coming from R, where there is generally one well-established way of doing things. I don’t mean to say that in R you cannot do the same thing in multiple ways. But take dplyr verbs like filter and select: each of them is powerful and intuitive. For example, filter in R always means filtering rows by a condition, whereas filter in pandas selects rows or columns by their labels, and filtering rows by value is done with query or boolean indexing instead. I hope you get the idea.
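To make the filter point concrete, here is a small sketch. The `cars` table and its values are made up for illustration (it is not the real mtcars data), and the variable names are my own.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not the real mtcars values)
cars = pd.DataFrame(
    {"mpg": [21.0, 18.7, 30.4, 15.2], "cyl": [6, 8, 4, 8], "gear": [4, 3, 4, 3]},
    index=["car_a", "car_b", "car_c", "car_d"],
)

# DataFrame.filter works on labels: here it keeps columns by name ...
by_column_label = cars.filter(items=["mpg", "gear"])

# ... and here it keeps rows whose index label matches a regular expression
by_row_label = cars.filter(regex="^car_[ab]$", axis=0)

# Filtering rows by value, which is what dplyr's filter() does, is done with query
by_value = cars.query("mpg > 20")

print(by_column_label)
print(by_row_label)
print(by_value)
```

So for someone coming from dplyr, `query` (or boolean indexing) is the closer match for `filter`, while pandas' own `filter` behaves more like a label-based `select`.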
With that said, below I have tried to show how we can achieve in Python the same kind of method chaining that we get in R.
Implementation in Python: with method chaining
result = (
    mtcars
    .query("mpg > 20")  # equivalent of filter in R
    .reindex(columns=['mpg', 'cyl', 'gear', 'wt'])  # equivalent of select in R (note: reindex adds NaN columns for missing names instead of raising)
    .assign(gear_ratio=lambda x: x['mpg'] / x['cyl'])  # equivalent of mutate in R; the lambda is called once on the whole DataFrame
    .sort_values(by='gear_ratio', ascending=False)  # equivalent of arrange in R
    .groupby('gear')  # equivalent of group_by in R
    .agg(mean_mpg=('mpg', 'mean'), mean_wt=('wt', 'mean'))  # equivalent of summarize in R
    .reset_index()  # moves 'gear' from the index back into a column; not quite ungroup(), since agg already returns an ungrouped DataFrame
)
print(result)
   gear  mean_mpg  mean_wt
0     3     21.45   2.8400
1     4     25.74   2.4520
2     5     28.20   1.8265
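The comment on reset_index() deserves a quick illustration. As far as I understand it, agg already hands back an ordinary DataFrame with the group keys in the index, so reset_index() only moves those keys back into a column. The sketch below uses a made-up toy table (not the real mtcars data) and also shows `as_index=False` as an alternative that skips the extra step.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {"gear": [3, 3, 4, 4, 5], "mpg": [21.4, 21.5, 22.8, 24.4, 26.0]}
)

# After agg, 'gear' lives in the index, not in a column
summary_indexed = toy.groupby("gear").agg(mean_mpg=("mpg", "mean"))

# reset_index() turns the 'gear' index back into a regular column
summary_reset = summary_indexed.reset_index()

# Alternative: ask groupby to keep 'gear' as a column from the start
summary_flat = toy.groupby("gear", as_index=False).agg(mean_mpg=("mpg", "mean"))

print(summary_indexed)
print(summary_reset)
print(summary_flat)
```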
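Just a word of caution: method chaining can sometimes have performance issues when the data set is large, mainly because each step in the chain can create an intermediate DataFrame. The lambda passed to assign is not the problem in itself, since it is called once on the whole DataFrame and the arithmetic inside it is still vectorized; row-by-row functions (for example, apply with axis=1) are the usual slow spot. For the majority of use cases, however, the difference is immaterial, and it is better to have more readable code than slightly faster code. If performance really is an issue, you can always rewrite the slow step as a vectorized function and plug it into the chain with .pipe, or break the one long chain into two or more smaller chains.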
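Here is a rough sketch of the .pipe idea. The helper name `add_gear_ratio` and the toy data are made up for illustration; the point is only that a plain function taking and returning a DataFrame can be slotted into the chain.

```python
import pandas as pd


def add_gear_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with a vectorized mpg / cyl column."""
    return df.assign(gear_ratio=df["mpg"] / df["cyl"])


# Hypothetical toy data, only for illustration (not the real mtcars values)
toy = pd.DataFrame({"mpg": [21.0, 18.7, 30.4], "cyl": [6, 8, 4], "gear": [4, 3, 4]})

result = (
    toy
    .query("mpg > 20")                              # keep rows with mpg above 20
    .pipe(add_gear_ratio)                           # custom step plugged into the chain
    .sort_values(by="gear_ratio", ascending=False)  # largest ratio first
)

print(result)
```

Keeping the custom step in a named function also makes it easy to time or test that step on its own if the chain turns out to be slow.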