


First, I’ll say what this talk is not:
It won’t be a deep dive into Rust
And it won’t be a detailed comparison between Rust and other languages
But what it will be:
A taste of what makes Rust great
Some reasons for why you might want to include it in your toolbox
And a chance to show you where you might already be using Rust
Python (probably the most popular)
R
SQL
But you also find some others
JVM languages like Scala and Java
Array languages like K/Q (kdb+)
And for high-performance code you might find
Languages like Python and R are popular because they’re easy to use
You don’t really have to think about memory
You don’t need to explicitly define types
They give you access to great libraries that are useful for data science
But they also have their downsides
They tend to be slow
They’re not great for light-weight parallel code
And you don’t need to explicitly define types!
One way to make Python/R faster is to write performance-critical code in C/C++ or Fortran
Many popular libraries actually call these languages under the hood
numpy includes a lot of C, scipy includes a lot of C and Fortran
dplyr relies on some C++ code
But writing C/C++ code usually comes with big costs
Development times can be slower because C/C++ is harder
You can spend hours hunting segfaults (memory bugs)
linters and LSPs aren’t always great
build systems are difficult to use
lacks easy tools / repositories for package management
they’re not well-designed for writing code in a high-level (e.g., functional) style
Even though writing code is a critical part of data-science, most data-scientists aren’t professional programmers
So putting up with the slowness is usually a better choice than trying to optimize some code in C/C++
Enter RUST…
…so what is Rust?
Started by Graydon Hoare in ~2006, and sponsored by Mozilla from 2009
Designed with an emphasis on security, performance, and usability
Sometimes pigeonholed as a “systems programming” language, but has some high-level “functional programming”-like features
Incredibly flexible macro system, which makes it possible to easily construct DSLs
Emphasis on “zero-cost abstractions”, which allow you to write high-level style code (generic functions, collections and iterators, macros) that compiles down to the same assembly code you’d get if you wrote low-level C-style code
In languages like C/C++ you have to manage memory manually
This can cause problems because it’s difficult to get right
Anybody who’s spent any time writing C code would’ve come across dreaded segfaults!
In languages like Python/R you don’t have to manage memory manually
Garbage collection makes life easier, but you take a performance hit while the garbage collector does its job
Rust doesn’t have a garbage collector, but memory management is made easier through its ownership system
Each value must have an owner
It can only have one owner at a time
When the owner goes out of scope, the value is dropped (the memory is freed)
Although each value can only have one owner, functions can “borrow” a value (with a reference), and then give it back to the owner when they’re done
fn main() {
    let s1 = "Hello".to_string();
    // Pass a reference, so calc_len only borrows s1
    let length = calc_len(&s1);
    print!("The length of {s1} is {length}");
}

// Borrows the String and returns its length in bytes
fn calc_len(s: &String) -> usize {
    s.len()
}
s1 is passed to calc_len as a reference (a borrow), so the main function still owns it. If it were passed by value instead, calc_len would take ownership, the value would be dropped when calc_len returns, and the later use of s1 in main wouldn’t compile
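To make the difference between moving and borrowing concrete, here is a minimal sketch. The helper names (push_and_return, total) are made up for illustration and are not from the talk.

```rust
// Hypothetical helper: takes ownership of the vector, then hands it back.
fn push_and_return(mut values: Vec<i32>, extra: i32) -> Vec<i32> {
    values.push(extra);
    values // ownership moves back to the caller
}

// Hypothetical helper: only borrows the data, so the caller keeps ownership.
fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let v1 = vec![1, 2, 3];
    let sum_before = total(&v1); // borrow: v1 is still usable afterwards
    let v2 = push_and_return(v1, 4); // move: v1 can no longer be used
    // print!("{:?}", v1); // would not compile: borrow of moved value
    println!("{sum_before} then {}", total(&v2));
}
```

The commented-out line is the kind of mistake the compiler rejects at build time rather than at run time.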
Data is immutable by default, so multiple functions (across threads) can borrow values without danger
Because of the memory safety guarantees, you don’t have to worry about data races
If you want to apply functions to vectors in parallel, then it’s as easy as swapping out iter() for par_iter() (from the rayon crate)
It’s also trivial to spin up light-weight threads or write async code (with similar syntax to JS async/await)
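As a rough sketch of sharing borrowed data across threads, the example below uses only the standard library's scoped threads rather than par_iter(). The function name parallel_sum and the chunking strategy are assumptions made for illustration.

```rust
use std::thread;

// Hypothetical helper: sums each chunk on its own thread, then adds the partial sums.
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let n = n_threads.max(1);
    // Ceiling division; at least 1 because chunks(0) would panic.
    let chunk_size = ((data.len() + n - 1) / n).max(1);
    thread::scope(|scope| {
        // Every thread only reads its own immutable slice of `data`.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    println!("sum = {}", parallel_sum(&data, 4));
}
```

Because the threads only hold shared, immutable borrows, the compiler accepts this without any locks.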
“Zero-cost” abstraction means that you can write high-level style code and still get the performance of low-level style code
But if you want to write low-level code, then you can do that too
Even generic functions (i.e., functions that can take any type that has certain properties) don’t take a performance hit
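A small sketch of the same computation written in both styles, plus a generic function. The function names are made up for illustration. Whether the two versions produce identical assembly depends on the compiler and optimization level, so treat that as the typical outcome rather than a guarantee.

```rust
use std::ops::Add;

// High-level style: an iterator chain.
fn sum_of_even_squares(values: &[i64]) -> i64 {
    values
        .iter()
        .filter(|&&v| v % 2 == 0)
        .map(|&v| v * v)
        .sum()
}

// Low-level style: an index-based loop doing the same work.
fn sum_of_even_squares_loop(values: &[i64]) -> i64 {
    let mut total = 0;
    let mut i = 0;
    while i < values.len() {
        if values[i] % 2 == 0 {
            total += values[i] * values[i];
        }
        i += 1;
    }
    total
}

// Generic function: works for any copyable type that supports `+`.
// The compiler generates a specialized copy for each concrete type used.
fn add_all<T: Add<Output = T> + Copy>(start: T, values: &[T]) -> T {
    values.iter().fold(start, |acc, &v| acc + v)
}

fn main() {
    let values = [1, 2, 3, 4];
    println!("{}", sum_of_even_squares(&values));
    println!("{}", sum_of_even_squares_loop(&values));
    println!("{} {}", add_all(0, &[1, 2, 3]), add_all(0.5, &[1.0, 2.5]));
}
```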
Python and R have null types (None [Python]; NULL, NA, NaN [R])
Working with these types will often produce errors, so you have to remember to check for them!
Rust instead has an Option type with Some and None variants
In Python and R you have to handle errors by catching exceptions (try/except [Python], tryCatch [R])
Rust instead has a Result type with Ok and Err variants
The Option and Result types are examples of enumerated types (enums)
With enums you can do pattern matching and the compiler checks that you’re matching against all possible variants
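A minimal sketch of Option and Result in use. The helpers mean, parse_count, and describe are made up for illustration.

```rust
use std::num::ParseIntError;

// Hypothetical helper: the mean of a slice, or None when there is no data.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}

// Hypothetical helper: parsing returns the error as a value instead of raising it.
fn parse_count(text: &str) -> Result<u32, ParseIntError> {
    text.trim().parse::<u32>()
}

// The match must cover both variants; leaving one out is a compile error.
fn describe(values: &[f64]) -> String {
    match mean(values) {
        Some(m) => format!("mean = {m}"),
        None => String::from("no data"),
    }
}

fn main() {
    println!("{}", describe(&[1.0, 2.0, 3.0]));
    println!("{}", describe(&[]));
    match parse_count("forty-two") {
        Ok(n) => println!("parsed {n}"),
        Err(e) => println!("could not parse: {e}"),
    }
}
```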
enums allow you to wrap more complex types (structs) and do OO-style inheritance without all the problems of OO-style inheritance
The trait system (similar to interfaces), and generics also give you a tonne of features that can make writing Rust feel like writing code in a high-level language rather than a “systems” language
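Here is one way enums, structs, traits, and generics can fit together. It is a sketch with made-up types (Normal, Uniform, Distribution) and a made-up trait (Summary), not code from any particular crate.

```rust
struct Normal {
    mean: f64,
    sd: f64,
}

struct Uniform {
    low: f64,
    high: f64,
}

// The enum wraps the structs, so one type covers both kinds of distribution.
enum Distribution {
    Normal(Normal),
    Uniform(Uniform),
}

// A trait plays a role similar to an interface.
trait Summary {
    fn mean(&self) -> f64;
    fn variance(&self) -> f64;
}

impl Summary for Distribution {
    fn mean(&self) -> f64 {
        match self {
            Distribution::Normal(n) => n.mean,
            Distribution::Uniform(u) => (u.low + u.high) / 2.0,
        }
    }

    fn variance(&self) -> f64 {
        match self {
            Distribution::Normal(n) => n.sd * n.sd,
            Distribution::Uniform(u) => (u.high - u.low).powi(2) / 12.0,
        }
    }
}

// Generic over anything that implements Summary.
fn total_variance<T: Summary>(items: &[T]) -> f64 {
    items.iter().map(|item| item.variance()).sum()
}

fn main() {
    let items = [
        Distribution::Normal(Normal { mean: 0.0, sd: 2.0 }),
        Distribution::Uniform(Uniform { low: 0.0, high: 1.0 }),
    ];
    println!("first mean = {}", items[0].mean());
    println!("total variance = {}", total_variance(&items));
}
```

Adding a third variant to Distribution makes both match expressions fail to compile until the new case is handled.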
Rust lets you write powerful compile-time macros that let you create simple (or complex) DSLs
Rust functions don’t support default or named arguments (unlike, e.g., Python and R), but you can just write a macro that makes it possible
You can run the pt() function in R with named arguments, and any you leave off just get the default value…
But I can do exactly the same in Rust…
I can even change the argument order around, just like I could do in R
With macros I don’t have to change languages, I can just make rust work how I want it to work
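The macro from the slides isn't reproduced here, so the following is only a sketch of one way to get named and default arguments with macro_rules!. It uses a made-up scale! macro (standardizing a value) rather than pt(), and relies on a struct with a Default implementation.

```rust
// Hypothetical argument struct; the field names double as the argument names.
#[derive(Debug, Clone, Copy)]
struct ScaleArgs {
    x: f64,
    mean: f64,
    sd: f64,
}

// Default values for any argument the caller leaves off.
impl Default for ScaleArgs {
    fn default() -> Self {
        ScaleArgs { x: 0.0, mean: 0.0, sd: 1.0 }
    }
}

fn scale_impl(args: ScaleArgs) -> f64 {
    (args.x - args.mean) / args.sd
}

// Accepts `name = value` pairs in any order and fills in the rest from Default.
macro_rules! scale {
    ($($name:ident = $value:expr),* $(,)?) => {
        scale_impl(ScaleArgs { $($name: $value,)* ..ScaleArgs::default() })
    };
}

fn main() {
    // Only x is given, so mean = 0.0 and sd = 1.0 come from the defaults.
    println!("{}", scale!(x = 3.0));
    // All arguments given, in a different order.
    println!("{}", scale!(sd = 2.0, x = 3.0, mean = 1.0));
}
```

A misspelled argument name becomes a compile error, because it has to match a field of the struct.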
The previous example is just a silly little toy example, but other macros let you do things like write HTML in Rust!
html! {
<div>
<div data-key="abc"></div>
<div class="parent">
<span class="child" value="anything"></span>
<label for="first-name">{ "First Name" }</label>
<input type="text" id="first-name" value="placeholder" />
<input type="checkbox" checked=true />
<textarea value="write a story" />
<select name="status">
<option selected=true disabled=false value="">{ "Selected" }</option>
<option selected=false disabled=true value="">{ "Unselected" }</option>
</select>
</div>
</div>
};
A pipe operator
Or infix function notation
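The code for these two slides isn't included here. As one possible sketch of a pipe-style macro, assuming a made-up pipe! macro and `=>` as the pipe symbol:

```rust
// Hypothetical pipe macro: pipe!(x => f => g) expands to g(f(x)).
macro_rules! pipe {
    // Base case: nothing left to apply.
    ($value:expr) => { $value };
    // Apply the first function, then recurse on the remaining ones.
    ($value:expr => $func:expr $(=> $rest:expr)*) => {
        pipe!($func($value) $(=> $rest)*)
    };
}

fn add_one(x: i32) -> i32 {
    x + 1
}

fn double(x: i32) -> i32 {
    x * 2
}

fn main() {
    // Reads left to right: start with 3, add one, then double.
    let result = pipe!(3 => add_one => double);
    println!("{result}");
}
```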
Error[E0382]: borrow of moved value: `v1`
--> test.rs:9:17
|
3 | let v1 = vec![1, 2, 3, 4];
| -- move occurs because `v1` has type `Vec<i32>`, which does not implement the `Copy` trait
...
6 | let v2 = v1;
| -- value moved here
...
9 | print!("{:#?}", v1);
| ^^ value borrowed here after move
|
= note: this error originates in the macro `$crate::format_args` which comes from the expansion of the macro `print` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider cloning the value if the performance cost is acceptable
|
6 | let v2 = v1.clone();
| ++++++++
The Rust tooling is amazing!
Rust includes a build-system and a package manager (Cargo)
It also ships with a built-in linter (Clippy)
It includes an LSP (rust-analyzer)
And it has a built-in formatter (rustfmt)
crates.io also serves as a centralized package repository (i.e., like npm, CRAN, pypi etc)
One python library that is ubiquitous in data science is pandas, the data frame library for python
pandas is slow despite being partly written in Cython and C
An alternative to pandas is pola-rs
pola-rs is a data frame library for Python, Rust, NodeJS, R (or anywhere that has a C FFI) written in Rust.
pola-rs is faster than pandas, dask, dplyr, data.table, DataFrames.jl
You can start using pola-rs today without writing any Rust
import {viewof query_string, viewof table, viewof example } from "@colling-lab/rust-for-data-science"
Note that it’s perfectly safe for me to expose a raw SQL interface to the internet, because there isn’t actually a SQL server running anywhere
It’s just a .csv file and some client-side magic with the power of Rust
The ndarray crate gives you all the functionality of something like numpy from Rust
This includes all the basic functionality you’d expect, but also powerful linear algebra tools
ndarray makes it a lot easier to build machine learning and statistics systems in Rust
linfa is a Rust crate (which makes use of ndarray) that provides a lot of functionality of sklearn
This includes e.g., PCA, ICA, t-SNE, LASSO, SVM, Decision Trees, Logistic Regression, PLS
The ecosystem is expanding with crates for Gaussian Process Regression, automatic differentiation, deep learning, and wrappers for, e.g., the GSL and libtorch
A popular library for data validation is pydantic
pydantic V2 (due for release later this year) has undergone a Rust rewrite of its core, giving a 5-50x speed-up
In large codebases, python linters can be slow
ruff is a new Python linter written in Rust
juliaup, a Julia installer and manager written in Rust
rye, a Python package management system written in Rust
As a compiled language, Rust isn’t great for interactive computing (even though there’s a Rust Jupyter Kernel!)
But it’s easy to call Rust from C, so you can call Rust from any language that speaks C, including Python, R, Matlab, and Julia
For some of these languages you have to write the C interface yourself
PyO3 lets you call Rust from Python, and pass Python types and classes back and forth between Python and Rust
extendr lets you call Rust from R, and pass R types and classes back and forth between R and Rust. You can even call R functions from Rust, or access the R C API (using the Rust crate libR-sys)
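For the case where you write the C interface yourself, a function with a C-compatible signature might look like the sketch below. The name mean_f64 is made up. To actually export it from a shared library you would also add the no_mangle attribute and build the crate as a cdylib; that is left out here so the example runs as an ordinary program.

```rust
use std::slice;

/// Hypothetical C-callable function: the mean of `len` doubles starting at `ptr`.
///
/// # Safety
/// The caller must pass a pointer to at least `len` valid f64 values.
pub unsafe extern "C" fn mean_f64(ptr: *const f64, len: usize) -> f64 {
    if ptr.is_null() || len == 0 {
        return f64::NAN;
    }
    // SAFETY: the caller promises `ptr` points to `len` valid f64 values.
    let values = unsafe { slice::from_raw_parts(ptr, len) };
    values.iter().sum::<f64>() / len as f64
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    // Called here from Rust the way C code would call it: pointer plus length.
    let m = unsafe { mean_f64(data.as_ptr(), data.len()) };
    println!("mean = {m}");
}
```

Libraries like PyO3 and extendr generate this kind of glue for you, along with the type conversions.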
There’s an R package called BFDA that helps you do sample size planning for analyses using Bayes Factors
BFDA uses simulations to do this, and it is incredibly slow even when using multiple cores
A single simulation takes over 6 minutes!
Writing it in a more functional style gets it down to just over 1 minute, but even that is still slow
import pyby
import time
import pandas as pd

tic = time.perf_counter()
df = pyby.bf_sim(
    0.5,
    "t.paired",
    {"family": "Cauchy", "params": [0, 0.707], "alternative": "two.sided"},
    sampling_rule={"n_min": 50, "n_max": 300, "step_size": 5},
    alternative="two.sided",
    reps=1000,
    seed=1600,
)
df = pd.DataFrame.from_records(
    df, columns=["id", "true.ES", "n", "logBF", "emp.ES", "statistic", "p.value"]
)
As an added bonus, I can easily use the C libraries that ship with R for probability distributions (this is also the default in Julia) rather than relying on scipy
I’ve mainly talked about why I love Rust, but where and when should you be using it?
The primary place is high-performance library code (as the polars, pydantic etc examples show)
But I think it’s also great if you find functional-style programming works better for data (which I certainly do!)
And it’s great for multithreading and parallel code
Finally, if you’re parsing data (from strings to JSON to XML), Rust’s type system and ecosystem (e.g., the nom crate) make it a breeze (even when dealing with streaming data)
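A small sketch of how the type system helps with parsing, using only the standard library's FromStr trait rather than nom. The Observation and ParseError types and the "id,value" line format are made up for illustration.

```rust
use std::str::FromStr;

// Hypothetical record type for a line like "1, 2.5".
#[derive(Debug, PartialEq)]
struct Observation {
    id: u32,
    value: f64,
}

// Every way parsing can fail is spelled out as a variant.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField,
    BadId,
    BadValue,
}

impl FromStr for Observation {
    type Err = ParseError;

    fn from_str(line: &str) -> Result<Self, Self::Err> {
        let mut fields = line.split(',');
        let id = fields
            .next()
            .ok_or(ParseError::MissingField)?
            .trim()
            .parse::<u32>()
            .map_err(|_| ParseError::BadId)?;
        let value = fields
            .next()
            .ok_or(ParseError::MissingField)?
            .trim()
            .parse::<f64>()
            .map_err(|_| ParseError::BadValue)?;
        Ok(Observation { id, value })
    }
}

fn main() {
    // Lines are handled one at a time, so the same code works on a stream of lines.
    for line in ["1, 2.5", "7", "x, 1.0"] {
        match line.parse::<Observation>() {
            Ok(obs) => println!("ok: {obs:?}"),
            Err(err) => println!("error in {line:?}: {err:?}"),
        }
    }
}
```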
So if you find any of this appealing, then check out Rust!
Ferris will thank you for it!
You can find the slides at https://talks.colling.net.nz/sds