Python vs. R: Syntactic Sugar Magic

Python and R empower data scientists to solve problems using elegant syntactic sugar, simplifying coding and solution exploration. Each language brings its unique capabilities and approach to bear.

My development palate has expanded since I learned to appreciate the sweetness found in Python and R. Data science is an art that can be approached from multiple angles but requires a careful balance of language, libraries, and expertise. The expansive capabilities of Python and R provide syntactic sugar: syntax that eases our work and allows us to address complex problems with short, elegant solutions.

These languages provide us with unique ways to explore our solution space. Each language has its own strengths and weaknesses. The trick to using each effectively is recognizing which problem types benefit from each tool and deciding how we want to communicate our findings. The syntactic sugar in each language allows us to work more efficiently.

R and Python function as interactive interfaces on top of lower-level code, allowing data scientists to use their chosen language for data exploration, visualization, and modeling. This interactivity enables us to avoid the incessant loop of editing and compiling code, which needlessly complicates our job.

These high-level languages allow us to work with minimal friction and do more with less code. Each language’s syntactic sugar enables us to quickly test our ideas in a REPL (read-evaluate-print loop), an interactive interface where code is executed as soon as it is entered. This iterative approach is a key component of the modern data process cycle.

R vs. Python: Expressive and Specialized

The power of R and Python lies in their expressiveness and flexibility. Each language has specific use cases in which it is more powerful than the other. Additionally, each language solves problems along different vectors and produces very different types of output. Each style tends to attract its own developer community, in which one language is preferred. As each community grows organically, its preferred language and feature sets trend toward unique syntactic sugar styles that reduce the code volume required to solve problems. And as the community and language mature, the language’s syntactic sugar often gets even sweeter.

Although each language offers a powerful toolset for solving data problems, we must approach those problems in ways that exploit the particular strengths of the tools. R was born as a statistical computing language and has a wide set of tools available for performing statistical analyses and explaining the data. Python’s machine learning approaches solve similar problems, but only those that can be framed as a machine learning model. Think of statistical computing and machine learning as two schools of data modeling: although these schools are highly interconnected, their origins and paradigms are different.

R Loves Statistics

R has evolved into a rich package offering for statistical analysis, linear modeling, and visualization. Because these packages have been part of the R ecosystem for decades, they are mature, efficient, and well documented. When a problem calls for a statistical computing approach, R is the right tool for the job.

The main reasons R is loved by its community boil down to:

  • Discrete data manipulation, computation, and filtering methods.
  • Flexible chaining operators to connect those methods.
  • A succinct syntactic sugar that allows developers to solve complex problems using comfortable statistical and visualization methods.

A Simple Linear Model With R

To see just how succinct R can be, let’s create an example that predicts diamond prices. First, we need data. We will use the diamonds dataset, which ships with the ggplot2 package (part of the tidyverse) and contains attributes such as color and cut.

We will also demonstrate R’s pipe operator (%>%), the equivalent of the Unix command-line pipe (|) operator. This popular piece of syntactic sugar comes from the magrittr package and is made available by the tidyverse package suite. The operator and the resulting code style are a game changer in R because they allow the chaining of R verbs (i.e., R functions) to divide and conquer a breadth of problems.
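For readers who think in Python, the same chaining idea can be sketched with pandas and its .pipe() method. This is only an illustration of the pattern, not part of the R example: the toy data frame and the two helper functions (fill_numeric_with_median and fill_text_with_mode) are hypothetical names invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a real dataset.
df = pd.DataFrame({
    "carat": [0.3, np.nan, 1.1, 0.7],
    "cut": ["Ideal", "Good", None, "Ideal"],
})


def fill_numeric_with_median(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace missing numeric values with each column's median."""
    num_cols = frame.select_dtypes(include="number").columns
    return frame.assign(**{c: frame[c].fillna(frame[c].median()) for c in num_cols})


def fill_text_with_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace missing non-numeric values with each column's most frequent value."""
    other_cols = frame.select_dtypes(exclude="number").columns
    return frame.assign(**{c: frame[c].fillna(frame[c].mode().iloc[0]) for c in other_cols})


# Each .pipe() call hands the result of the previous step to the next function,
# much like %>% passes the left-hand value into the next R verb.
clean = df.pipe(fill_numeric_with_median).pipe(fill_text_with_mode)
print(clean)
```

Because each helper returns a new data frame, the original df is left untouched and the steps can be reordered or tested one at a time.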

The following code loads the required libraries, processes our data, and generates a linear model:

library(tidyverse)
library(ggplot2)

mode <- function(data) {
  freq <- unique(data)
  freq[which.max(tabulate(match(data, freq)))]
}

data <- diamonds %>% 
        mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE)))) %>% 
        mutate(across(where(is.numeric), scale))  %>%
        mutate(across(where(negate(is.numeric)), ~ replace_na(.x, mode(.x)))) 

model <- lm(price~., data=data)

model <- step(model)
summary(model)

Call:
lm(formula = price ~ carat + cut + color + clarity + depth + 
    table + x + z, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3588 -0.1485 -0.0460  0.0943  2.6806 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -0.140019   0.002461  -56.892  < 2e-16 ***
carat        1.337607   0.005775  231.630  < 2e-16 ***
cut.L        0.146537   0.005634   26.010  < 2e-16 ***
cut.Q       -0.075753   0.004508  -16.805  < 2e-16 ***
cut.C        0.037210   0.003876    9.601  < 2e-16 ***
cut^4       -0.005168   0.003101   -1.667  0.09559 .  
color.L     -0.489337   0.004347 -112.572  < 2e-16 ***
color.Q     -0.168463   0.003955  -42.599  < 2e-16 ***
color.C     -0.041429   0.003691  -11.224  < 2e-16 ***
color^4      0.009574   0.003391    2.824  0.00475 ** 
color^5     -0.024008   0.003202   -7.497 6.64e-14 ***
color^6     -0.012145   0.002911   -4.172 3.02e-05 ***
clarity.L    1.027115   0.007584  135.431  < 2e-16 ***
clarity.Q   -0.482557   0.007075  -68.205  < 2e-16 ***
clarity.C    0.246230   0.006054   40.676  < 2e-16 ***
clarity^4   -0.091485   0.004834  -18.926  < 2e-16 ***
clarity^5    0.058563   0.003948   14.833  < 2e-16 ***
clarity^6    0.001722   0.003438    0.501  0.61640    
clarity^7    0.022716   0.003034    7.487 7.13e-14 ***
depth       -0.022984   0.001622  -14.168  < 2e-16 ***
table       -0.014843   0.001631   -9.103  < 2e-16 ***
x           -0.281282   0.008097  -34.740  < 2e-16 ***
z           -0.008478   0.005872   -1.444  0.14880    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2833 on 53917 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.81e+04 on 22 and 53917 DF,  p-value: < 2.2e-16

R makes this linear equation simple to program and understand with its syntactic sugar. Now, let’s shift our attention to where Python is king.

Python Is Best for Machine Learning

Python is a powerful, general-purpose language, with one of its primary user communities focused on machine learning, leveraging popular libraries like scikit-learn, imbalanced-learn, and Optuna. Many of the most influential machine learning toolkits, such as TensorFlow, PyTorch, and JAX, are written primarily for Python.

Python’s syntactic sugar is the magic that machine learning experts love, including succinct data pipeline syntax, as well as scikit-learn’s fit-transform-predict pattern:

  1. Transform data to prepare it for the model.
  2. Construct a model (implicitly or explicitly).
  3. Fit the model.
  4. Predict new data (supervised model) or transform the data (unsupervised).
    • For supervised models, compute an error metric for the new data points.
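The steps above can be sketched in a few lines. This is a minimal illustration on synthetic data (the data-generating formula, the variable names, and the choice of mean squared error as the metric are all assumptions made for the sketch), not the diamond model itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Transform: learn the scaling parameters from the training data only,
#    then apply the same parameters to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Construct the model.
model = LinearRegression()

# 3. Fit the model.
model.fit(X_train_scaled, y_train)

# 4. Predict new data, then compute an error metric for those points.
predictions = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, predictions)
print(f"Test MSE: {mse:.4f}")
```

Every scikit-learn estimator exposes the same small set of methods (fit, transform, predict), which is what makes the steps interchangeable across models.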

The scikit-learn library encapsulates functionality matching this pattern while simplifying programming for exploration and visualization. There are also many features corresponding to each step of the machine learning cycle, providing cross-validation, hyperparameter tuning, and pipelines.
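As a rough sketch of how those features fit together, the snippet below wraps a scaler and a model in a pipeline, cross-validates it, and tunes one hyperparameter. The synthetic dataset, the Ridge model, and the alpha grid are assumptions chosen for illustration; they do not appear in the diamond example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Pipeline: preprocessing and model travel together as one estimator.
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("ridge", Ridge())])

# Cross-validation: one R-squared score per held-out fold.
cv_scores = cross_val_score(pipe, X, y, cv=5)

# Hyperparameter tuning: "<step name>__<parameter>" addresses a parameter
# of a step inside the pipeline.
search = GridSearchCV(pipe, param_grid={"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

print(f"Mean cross-validated R squared: {cv_scores.mean():.3f}")
print(f"Best parameters: {search.best_params_}")
```

Because the whole pipeline is passed to cross_val_score and GridSearchCV, the scaler is refit on each training fold rather than on the full dataset.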

A Diamond Machine Learning Model

We’ll now focus on a simple machine learning example using Python, one that has no direct counterpart in R. We’ll use the same dataset and highlight the fit-transform-predict pattern in a very tight piece of code.

Following a machine learning approach, we’ll split the data into training and testing partitions. We’ll apply the same transformations on each partition and chain the contained operations with a pipeline. The methods (fit and score) are key examples of powerful machine learning methods contained in scikit-learn:

import numpy as np
import pandas as pd
import seaborn as sns  # provides load_dataset("diamonds")
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from pandas.api.types import is_numeric_dtype

diamonds = sns.load_dataset('diamonds')
diamonds = diamonds.dropna()

x_train,x_test,y_train,y_test = train_test_split(diamonds.drop("price", axis=1), diamonds["price"], test_size=0.2, random_state=0)

num_idx = x_train.apply(lambda x: is_numeric_dtype(x)).values
num_cols = x_train.columns[num_idx].values
cat_cols = x_train.columns[~num_idx].values

num_pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
# sparse_output replaced the older sparse argument in scikit-learn 1.2
cat_steps = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")), ("onehot", OneHotEncoder(drop="first", sparse_output=False))])

# data transformation and model constructor
preprocessor = ColumnTransformer(transformers=[("num", num_pipeline, num_cols), ("cat", cat_steps, cat_cols)])

mod = Pipeline(steps=[("preprocessor", preprocessor), ("linear", LinearRegression())])

# .fit() calls .fit_transform() in turn
mod.fit(x_train, y_train)

# .predict() calls .transform() in turn
mod.predict(x_test)

print(f"R squared score: {mod.score(x_test, y_test):.3f}")

We can see how streamlined the machine learning process is in Python. Additionally, scikit-learn’s Pipeline and ColumnTransformer classes help developers avoid data leakage and other problems related to passing data through a model, while also producing structured, production-level code.
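One way to see the leakage protection at work is to inspect what a fitted pipeline actually learned. The sketch below uses synthetic data (an assumption for illustration) and checks that the scaler inside the pipeline stored the mean of the training rows only, so the test rows never influenced the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("linear", LinearRegression())])
pipe.fit(X_train, y_train)

# The fitted scaler stores the mean of the rows it was fit on.
learned_mean = pipe.named_steps["scaler"].mean_
matches_train = np.allclose(learned_mean, X_train.mean(axis=0))
print(f"Scaler mean equals training mean: {matches_train}")

# Scoring reuses the stored training statistics to transform the test rows.
test_r2 = pipe.score(X_test, y_test)
print(f"R squared on held-out data: {test_r2:.3f}")
```

Scaling the full dataset before splitting would instead fold test-set statistics into the preprocessing step, which is the kind of leak the pipeline structure prevents.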

What Else Can R and Python Do?

Aside from performing statistical analyses and creating machine learning models, R and Python excel at reporting, APIs, interactive dashboards, and simple inclusion of external low-level code libraries.

Developers can generate interactive reports in both R and Python, but it’s far simpler to develop them in R. R also supports exporting those reports to PDF and HTML.

Both languages allow data scientists to create interactive data applications. R and Python use the libraries Shiny and Streamlit, respectively, to create these applications.

Lastly, R and Python both support external bindings to low-level code. This is typically used to inject highly performant operations into a library and then call those functions from within the language of choice. R uses the Rcpp package, while Python uses the pybind11 package to accomplish this.

Python and R: Getting Sweeter Every Day

In my work as a data scientist, I use both R and Python regularly. The key is to understand where each language is strongest and then adjust a problem to fit within an elegantly coded solution.

When communicating with clients, data scientists want to do so in the language that is most easily understood. Therefore, we must weigh whether a statistical or machine learning presentation is more effective and then use the most suitable programming language.

Python and R each provide an ever-growing collection of syntactic sugar, which both simplifies our work as data scientists and makes that work easier for others to understand. The more refined our syntax, the easier it is to automate and interact with our preferred languages. I like my data science language sweet, and the elegant solutions that result are even sweeter.

Understanding the basics

R is better than Python for statistical analysis, both in terms of code and presentation.

R is easier to use when exploring a problem focused on statistical analysis, linear modeling, or visualization.

R is most used for exploring a solution space using statistical methods and linear models.

Python shines when implementing a machine learning solution. Having access to many advanced libraries and efficient syntactic sugar eases coding and reduces development time.

Python is a popular general programming language. It is easy to learn, flexible, and has strong library support.