Sussex Data Science

High performance code for Data Science

Does Rust have a place in the data scientist’s toolbox?




Lincoln Colling

University of Sussex

Software Sustainability Institute

Before we start

First, I’ll say what this talk is not:

  • It won’t be a deep dive into Rust

  • And it won’t be a detailed comparison between Rust and other languages

But what it will be:

  • A taste of what makes Rust great

  • Some reasons for why you might want to include it in your toolbox

  • And a chance to show you where you might already be using Rust

Typical tools in the data science toolbox

  • Python (probably the most popular)

  • R

  • SQL

But you also find some others

  • JVM languages like Scala and Java

  • Array languages like K/Q (kdb+)

And for high-performance code you might find

  • C / C++ (and Fortran)

The pros and cons of Python and R

Languages like Python and R are popular because they’re easy to use

  • You don’t really have to think about memory

  • You don’t need to explicitly define types

  • They give you access to great libraries that are useful for data science

But they also have their downsides

  • They tend to be slow

  • They’re not great for light-weight parallel code

  • And you don’t need to explicitly define types!

Getting around the slowness of Python/R

  • One way to make Python/R faster is to write performance-critical code in C/C++ or Fortran

    • Many popular libraries actually call these languages under the hood

    • numpy includes a lot of C, scipy includes a lot of C and Fortran

    • dplyr relies on some C++ code

But writing C/C++ code usually comes with big costs

Costs of writing C/C++

  • Development times can be slower because C/C++ is harder

    • You can spend hours hunting segfaults (memory bugs)

    • linters and LSPs aren’t always great

    • build systems are difficult to use

    • there are no easy tools / repositories for package management

    • the languages aren’t well designed for writing code in a high-level (e.g., functional) style

Even though writing code is a critical part of data science, most data scientists aren’t professional programmers

So putting up with the slowness is usually a better choice than trying to optimize some code in C/C++

An alternative to C/C++

Enter RUST…

…so what is Rust?

The Rust Programming Language

  • Started in 2006 as a personal project by Graydon Hoare at Mozilla, which began sponsoring it in 2009

  • Designed with an emphasis on security, performance, and usability

  • Sometimes pigeonholed as a “systems programming” language, but has some high-level “functional programming”-like features

  • Incredibly flexible macro system, which makes it possible to easily construct DSLs

  • Emphasis on “zero-cost abstractions”, which allow you to write high-level style code (generic functions, collections and iterators, macros) that compiles down to the same assembly code you’d get if you wrote low-level C-style code
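To make the "zero-cost abstractions" point concrete, here is a minimal sketch of the same computation written in both styles. The function names are made up for illustration. The example only shows that the two versions give the same answer; whether the generated assembly is identical depends on the compiler and optimisation level.

```rust
// Sum of the squares of the even numbers, written two ways.

// High-level style: an iterator chain with closures.
fn sum_even_squares_iter(xs: &[i64]) -> i64 {
    xs.iter().filter(|&&x| x % 2 == 0).map(|x| x * x).sum()
}

// Low-level style: an explicit index loop with an accumulator.
fn sum_even_squares_loop(xs: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..xs.len() {
        if xs[i] % 2 == 0 {
            total += xs[i] * xs[i];
        }
    }
    total
}

fn main() {
    let xs = [1, 2, 3, 4, 5, 6];
    println!("{}", sum_even_squares_iter(&xs));
    println!("{}", sum_even_squares_loop(&xs));
}
```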

What makes Rust great: Memory safety

In languages like C/C++ you have to manage memory manually

  • This can cause problems because it’s difficult to get right

  • Anybody who’s spent any time writing C code would’ve come across dreaded segfaults!

In languages like Python/R you don’t have to manage memory manually

  • The runtime manages it for you with a garbage collector which checks and frees memory that is no longer being used

Garbage collection makes life easier, but you take a performance hit while the garbage collector does its job

What makes Rust great: Memory safety

Rust doesn’t have a garbage collector, but memory management is made easier through its ownership system

  • Each value must have an owner

  • It can only have one owner at a time

  • When the owner goes out of scope, the value is dropped (the memory is freed)

// create a vector
let v1 = vec![1, 2, 3, 4];

// move the value in v1 into v2 (making v2 the new owner)
let v2 = v1;

// try to print out the contents of v1
// won't compile because v1 is no longer accessible
print!("{:#?}", v1);
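For comparison, here is a sketch of three ways the snippet above could be changed so that it compiles. Which one is right depends on what you actually want to do with the data.

```rust
fn main() {
    let v1 = vec![1, 2, 3, 4];

    // Option 1: borrow instead of moving, so v1 stays the owner.
    let v2 = &v1;
    println!("{:?} {:?}", v1, v2);

    // Option 2: clone, so each variable owns its own copy of the data.
    let v3 = v1.clone();
    println!("{:?} {:?}", v1, v3);

    // Option 3: move, and only use the new owner from here on.
    let v4 = v1;
    println!("{:?}", v4);
}
```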

What makes Rust great: Memory safety

Although each value can only have one owner, functions can “borrow” a value (with a reference), and then give it back to the owner when they’re done

  • The borrow checker checks that the rules around borrowing are followed
fn main() {

    let s1 = "Hello".to_string();

    let length = calc_len(&s1);
    print!("The length of {s1} is {length}");
}

fn calc_len(s: &String) -> usize {
    return s.len();
}

s1 is passed to calc_len as a reference (a borrow), so the main function still owns it. If it were passed by value instead, ownership would move into calc_len, the value would be dropped when calc_len returns, and the later use of s1 in main wouldn’t compile
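Borrows come in two flavours: shared (read-only) and mutable (exclusive). The sketch below adds a hypothetical shout function to the example to show a mutable borrow next to a shared one.

```rust
// Mutable borrow: the function may change the value, but does not own it.
fn shout(s: &mut String) {
    s.push('!');
}

// Shared borrow: read-only access.
fn calc_len(s: &str) -> usize {
    s.len()
}

fn main() {
    let mut s1 = String::from("Hello");

    shout(&mut s1); // the exclusive borrow ends when shout returns
    let length = calc_len(&s1); // so a shared borrow is fine here

    println!("The length of {s1} is {length}");
}
```

The borrow checker allows either any number of shared borrows or exactly one mutable borrow at a time, never both.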

What makes Rust great: Performance

  • Data is immutable by default, so multiple functions (across threads) can borrow values without danger

  • Because of the memory safety guarantees, safe Rust rules out data races at compile time

  • If you want to apply functions to vectors in parallel, then it’s as easy as swapping out iter() for par_iter() (from the rayon crate)

  • It’s also trivial to spin up light-weight threads or write async code (with similar syntax to JS async/await)

// create a vector and multiply each value by 10
let v1 = vec![1, 2, 3];
let v2: Vec<i32> = v1.iter().map(|x| x * 10).collect();

// do the same but in parallel
// (par_iter() comes from the rayon crate: `use rayon::prelude::*;`)
let v1 = vec![1, 2, 3];
let v2: Vec<i32> = v1.par_iter().map(|x| x * 10).collect();
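If you would rather not pull in an extra crate, the standard library's scoped threads can do something similar by hand. This is only a sketch: times_ten_parallel and its chunking scheme are made up for illustration, and a library like rayon will balance the work far better.

```rust
use std::thread;

// Multiply every element by 10, splitting the work across up to `n_chunks` threads.
fn times_ten_parallel(data: &[i32], n_chunks: usize) -> Vec<i32> {
    let n = n_chunks.max(1);
    // Round up so every element lands in a chunk; chunk size must be at least 1.
    let chunk_size = ((data.len() + n - 1) / n).max(1);

    thread::scope(|scope| {
        // Each thread borrows its own read-only chunk of `data`.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * 10).collect::<Vec<i32>>()))
            .collect();

        // Join in order, so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect::<Vec<i32>>()
    })
}

fn main() {
    let v1 = vec![1, 2, 3];
    let v2 = times_ten_parallel(&v1, 2);
    println!("{:?}", v2);
}
```

The threads only hold shared borrows of the data, so the compiler can check that nothing is mutated while they run.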

What makes Rust great: Performance

  • “Zero-cost” abstraction means that you can write high-level style code with the performance of low-level style code

  • But if you want to write low-level code, then you can do that too

    • You’re using a low-level language, so if you want to write SIMD vectorized code, then you can! And Rust makes it easy!

  • Even generic functions (i.e., functions that can take any type that has certain properties) don’t take a performance hit
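As a rough illustration of a generic function, the sketch below defines a hypothetical total function that works for any type that can be copied, added to itself, and has a default ("zero") value. The compiler generates a specialised copy for each concrete type that is used (monomorphization), which is why there is no runtime dispatch cost.

```rust
use std::ops::Add;

// Generic over any T with the listed properties (trait bounds).
fn total<T>(xs: &[T]) -> T
where
    T: Copy + Add<Output = T> + Default,
{
    xs.iter().fold(T::default(), |acc, &x| acc + x)
}

fn main() {
    println!("{}", total(&[1, 2, 3])); // instantiated for i32
    println!("{}", total(&[0.5, 1.5, 2.0])); // instantiated for f64
}
```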

What makes Rust great: Its magical type system

  • Python and R have null types (None [Python]; NULL, NA, NaN [R])

  • Working with these types will often produce errors, so you have to remember to check for them!

  • Rust instead has an Option type with Some and None variants

  • In Python and R you handle errors by catching exceptions (try/except [Python], tryCatch [R])

  • Rust instead has a Result type with Ok and Err variants
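Here is a small sketch of Option and Result working together. The parse_field function and its convention of treating "NA" as missing are assumptions made for the example, not something from a particular library.

```rust
use std::num::ParseFloatError;

// Parse a text field that may be missing ("NA" or empty) or malformed.
// Ok(Some(x)) = a value, Ok(None) = missing, Err(e) = not a number.
fn parse_field(field: &str) -> Result<Option<f64>, ParseFloatError> {
    let field = field.trim();
    if field.is_empty() || field == "NA" {
        return Ok(None);
    }
    // `?` returns early with the error if parsing fails.
    let value = field.parse::<f64>()?;
    Ok(Some(value))
}

fn main() {
    for raw in ["3.14", "NA", "abc"] {
        match parse_field(raw) {
            Ok(Some(x)) => println!("{raw}: value {x}"),
            Ok(None) => println!("{raw}: missing"),
            Err(e) => println!("{raw}: error ({e})"),
        }
    }
}
```

The caller can't forget the missing or error cases, because the value is not reachable without unwrapping both layers.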

What makes Rust great: Its magical type system

  • The Option and Result types are examples of enumerated types (enums)

  • With enums you can do pattern matching and the compiler checks that you’re matching against all possible variants

  • enums allow you to wrap more complex types (structs) and do OO-style inheritance without all the problems of OO-style inheritance

  • The trait system (similar to interfaces), and generics also give you a tonne of features that can make writing Rust feel like writing code in a high-level language rather than a “systems” language

let s = v1.get(3); // returns an Option which might be None
let s = match s {
    Some(v) => v, // if it's Some give me the value inside
    None => &0,  // if it's None give me 0
};
println!("{}", s);
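You can define your own enums too, and each variant can carry different data. The Column type below is hypothetical, a sketch of how a tiny data frame column might be modelled. If a variant were left out of the match, the code would not compile.

```rust
// A hypothetical column type: each variant wraps different data.
enum Column {
    Numeric(Vec<f64>),
    Text(Vec<String>),
    Flag { values: Vec<bool>, label: String },
}

// The match must cover every variant, or the compiler complains.
fn describe(col: &Column) -> String {
    match col {
        Column::Numeric(xs) => {
            let mean = xs.iter().sum::<f64>() / xs.len() as f64;
            format!("numeric, mean {:.2}", mean)
        }
        Column::Text(xs) => format!("text, {} rows", xs.len()),
        Column::Flag { values, label } => {
            let n_true = values.iter().filter(|v| **v).count();
            format!("{label}: {n_true} of {} true", values.len())
        }
    }
}

fn main() {
    let cols = vec![
        Column::Numeric(vec![1.0, 2.0, 3.0]),
        Column::Text(vec!["a".to_string(), "b".to_string()]),
        Column::Flag { values: vec![true, false, true], label: "passed".to_string() },
    ];
    for col in &cols {
        println!("{}", describe(col));
    }
}
```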

What makes Rust great: The powerful macro system

  • Rust lets you write powerful compile-time macros that let you create simple (or complex) DSLs

  • Rust functions don’t support default or named arguments (unlike, e.g., Python and R), but you can just write a macro that makes it possible

  • You can run the pt() function in R with named arguments, and any you leave off just get the default value…

result <- pt(q = 5.5, df = 1.0)

But I can do exactly the same in Rust…

let result = pt!(q = 5.5, df = 1.0);

I can even change the argument order around, just like I could do in R

let result = pt!(df = 1.0, q = 5.5);
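One way a macro like this could be built is sketched below. This is not the pt! implementation from the talk: scale!, ScaleArgs, and scale_impl are made-up names, and the trick shown (expanding named arguments into a struct literal that falls back to Default) is just one possible approach.

```rust
// All arguments live in a struct, so each one has a name and a default.
struct ScaleArgs {
    x: f64,
    factor: f64,
    offset: f64,
}

impl Default for ScaleArgs {
    fn default() -> Self {
        ScaleArgs { x: 0.0, factor: 1.0, offset: 0.0 }
    }
}

// The ordinary function that does the work.
fn scale_impl(args: ScaleArgs) -> f64 {
    args.x * args.factor + args.offset
}

// `scale!(name = value, ...)` expands to a struct literal; anything
// left out is filled in from ScaleArgs::default().
macro_rules! scale {
    ($($name:ident = $value:expr),* $(,)?) => {
        scale_impl(ScaleArgs {
            $($name: $value,)*
            ..ScaleArgs::default()
        })
    };
}

fn main() {
    println!("{}", scale!(x = 2.0, factor = 3.0));
    println!("{}", scale!(factor = 3.0, x = 2.0)); // order doesn't matter
    println!("{}", scale!(x = 2.0)); // factor and offset use their defaults
}
```

A misspelled argument name becomes an unknown struct field, so it is caught at compile time.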

What makes Rust great: The powerful macro system

  • With macros I don’t have to change languages, I can just make rust work how I want it to work

  • The previous example is just a silly little toy example, but other macros let you do things like write HTML in Rust!

html! {
    <div>
        <div data-key="abc"></div>
        <div class="parent">
            <span class="child" value="anything"></span>
            <label for="first-name">{ "First Name" }</label>
            <input type="text" id="first-name" value="placeholder" />
            <input type="checkbox" checked=true />
            <textarea value="write a story" />
            <select name="status">
                <option selected=true disabled=false value="">{ "Selected" }</option>
                <option selected=false disabled=true value="">{ "Unselected" }</option>
            </select>
        </div>
    </div>
};

What makes Rust great: The powerful macro system

  • Or you can easily add new language features like:

A pipe operator

let res = pipe!(
   4
   => (times(2))
   => {|x| x + 2}
);


Or infix function notation

let res = a | dotprod | b;

What makes Rust great: Usability

  • The compiler explains the error and suggests a fix!
Error[E0382]: borrow of moved value: `v1`
 --> test.rs:9:17
  |
3 | let v1 = vec![1, 2, 3, 4];
  |     -- move occurs because `v1` has type `Vec<i32>`, which does not implement the `Copy` trait
...
6 | let v2 = v1;
  |          -- value moved here
...
9 | print!("{:#?}", v1);
  |                 ^^ value borrowed here after move
  |
  = note: this error originates in the macro `$crate::format_args` which comes from the expansion of the macro `print` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider cloning the value if the performance cost is acceptable
  |
6 | let v2 = v1.clone();
  |            ++++++++
  • And if I need more help:
cargo --explain E0382

What makes Rust great: Usability

  • The Rust tooling is amazing!

  • Rust includes a build-system and a package manager (Cargo)

  • It also ships with a built-in linter (Clippy)

  • It includes an LSP (rust-analyzer)

  • And it has a built-in formatter (rustfmt)

  • crates.io also serves as a centralized package repository (i.e., like npm, CRAN, PyPI, etc.)

But Rust for data science?

High-performance data frames

  • One python library that is ubiquitous in data science is pandas, the data frame library for python

    • But pandas is slow despite being partly written in Cython and C
  • An alternative to pandas is pola-rs

    • pola-rs is a data frame library for Python, Rust, NodeJS, R (or anywhere that has a C FFI) written in Rust.

    • pola-rs is faster than pandas, dask, dplyr, data.table, DataFrames.jl

  • You can start using pola-rs today without writing any Rust

A lazy dashboard with Polars and WASM

  • Note that it’s perfectly safe for me to expose a raw SQL interface to the internet, because there isn’t actually a SQL server running anywhere

  • It’s just a .csv file and some client-side magic with the power of rust

Array programming: A numpy alternative

  • The ndarray crate gives you all the functionality of something like numpy from Rust

  • This includes all the basic functionality you’d expect, but also powerful linear algebra tools

  • ndarray makes it a lot easier to build machine learning and statistics systems in Rust

    • linfa is a Rust crate (which makes use of ndarray) that provides a lot of functionality of sklearn

    • This includes e.g., PCA, ICA, t-SNE, LASSO, SVM, Decision Trees, Logistic Regression, PLS

The ecosystem is expanding with crates for Gaussian Process Regression, automatic differentiation, deep learning, and wrappers for, e.g., the GSL and libtorch

Other Rust tools you can use now

High-performance data validation

  • A popular library for data validation is pydantic

  • pydantic V2 (due for release later this year) has undergone a Rust rewrite of its core, giving a 5-50x speed-up

High-performance tokenizers

  • Hugging Face’s tokenizers library for Python is written in Rust

High-performance linting

  • In large codebases, python linters can be slow

  • ruff is a new, high-performance Python linter written in Rust

Package management tools

  • juliaup, a Julia installer and manager written in Rust

  • rye, a Python package management system written in Rust

It’s not Rust or Python/R, it’s Rust and Python/R

  • As a compiled language, Rust isn’t great for interactive computing (even though there’s a Rust Jupyter Kernel!)

  • But it’s easy to call Rust from C, so you can call Rust from any language that speaks C, including Python, R, Matlab, and Julia

  • For some of these languages you have to write the C interface yourself

  • PyO3 lets you call Rust from Python, and pass Python types and classes back and forth between Python and Rust

  • extendr lets you call Rust from R, and pass R types and classes back and forth between R and Rust. You can even call R functions from Rust, or access the R C API (using the Rust crate libR-sys)
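To give a feel for what "writing the C interface yourself" involves, here is a minimal sketch of a Rust function exposed with a C-compatible signature. The name rs_mean is made up for illustration. The `#[unsafe(no_mangle)]` spelling assumes a recent toolchain (Rust 1.82 or later); older compilers spell it `#[no_mangle]`. To call it from another language you would also build the crate as a shared library (crate-type "cdylib"), which isn't shown here.

```rust
use std::slice;

/// Mean of a C array of doubles (hypothetical example function).
///
/// # Safety
/// `ptr` must point to `len` valid, initialised `f64` values.
#[unsafe(no_mangle)] // keep the symbol name as-is so C callers can find it
pub unsafe extern "C" fn rs_mean(ptr: *const f64, len: usize) -> f64 {
    if ptr.is_null() || len == 0 {
        return f64::NAN;
    }
    // View the raw pointer as a Rust slice; the caller guarantees validity.
    let xs = unsafe { slice::from_raw_parts(ptr, len) };
    xs.iter().sum::<f64>() / len as f64
}

fn main() {
    // Calling it from Rust the same way a C caller would.
    let data = [1.0, 2.0, 3.0, 4.0];
    let m = unsafe { rs_mean(data.as_ptr(), data.len()) };
    println!("{m}");
}
```

Libraries like PyO3 and extendr generate this kind of glue for you, along with the type conversions.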

It’s not Rust or Python/R, it’s Rust and Python/R: An Example

  • There’s an R package called BFDA that helps you do sample size planning for analyses using Bayes Factors

  • BFDA uses simulations to do this, and it is incredibly slow even when using multiple cores

    • A single simulation takes over 6 minutes!

    • Writing it in a more functional style gets it down to just over 1 minute, but even that is still slow

It’s not Rust or Python/R, it’s Rust and Python/R: An Example

  • I rewrote it in Rust and now the simulations take ~1.5 seconds!
import pyby
import time
import pandas as pd

tic = time.perf_counter()

df = pyby.bf_sim(
    0.5,
    "t.paired",
    {"family": "Cauchy", "params": [0, 0.707], "alternative": "two.sided"},
    sampling_rule={"n_min": 50, "n_max": 300, "step_size": 5},
    alternative="two.sided",
    reps=1000,
    seed=1600,
)

df = pd.DataFrame.from_records(
    df, columns=["id", "true.ES", "n", "logBF", "emp.ES", "statistic", "p.value"]
)

As an added bonus, I can easily use the C libraries that ship with R for probability distributions (this is also the default in Julia) rather than relying on scipy

Where does Rust fit in?

  • I’ve mainly talked about why I love Rust, but where and when should you be using it?

  • The primary place is high-performance library code (as the polars, pydantic etc examples show)

  • But I think it’s also great if you find functional-style programming works better for data (which I certainly do!)

  • And it’s great for multithreading and parallel code

  • Finally, if you’re parsing data (from strings to JSON to XML), Rust’s type system and ecosystem (e.g., the nom crate) make it a breeze (even when dealing with streaming data)

So if you find any of this appealing, then check out Rust!

Ferris will thank you for it!

You can find the slides at https://talks.colling.net.nz/sds