DISCnet has been funded by STFC (Science and Technology Facilities Council) since 2017. It was graded “A” in an STFC review of all its Centres for Doctoral Training.

We have trained 80 PhD students.

These students are very successful and have won national prizes. Heidi Thiemann won the Sir Arthur Clarke Award for Student Space Achievement for her work on SpaceCareers.uk. Lorenzo Zanisi won a Silver award at the STEM for Britain research showcase at the House of Commons by applying the same methods that he used to model galaxy populations to assess the effectiveness of current clinical treatment strategies for hypertension.

Most PhD students are doing research projects in particle physics or astrophysics, and the institutions involved in DISCnet cover about 10% of all UK activity in these areas.

Each student is expected to undertake two 3-month research placements in non-academic environments such as companies and public sector organisations. These help businesses with immediate research challenges, but also help identify skills and talent for the future and develop partnerships for universities. This slide is from one example of using ML to identify “interesting” images in a photo stream for Brighton startup Deckchair.com. Our students have completed over 50 placements in businesses.

Case studies are available on our website, discnet.co.uk

DISCnet is currently looking for

New placement projects

Enthusiastic partners to work with us on a new proposal for Centre for Doctoral Training funding for another eight years.

Contact

Sussex Data Science

High performance code for Data Science

Does Rust have a place in the data scientist’s toolbox?

Lincoln Colling

University of Sussex

Software Sustainability Institute

Before we start

First, I’ll say what this talk is not:

It won’t be a deep dive into Rust
And it won’t be a detailed comparison between Rust and other languages

But what it will be:

A taste of what makes Rust great
Some reasons for why you might want to include it in your toolbox
And a chance to show you where you might already be using Rust

Typical tools in the data science toolbox

Python (probably the most popular)
R
SQL

But you also find some others

JVM languages like Scala and Java
Array languages like K/Q (kdb+)

And for high-performance code you might find

C / C++ (and Fortran)

The pros and cons of Python and R

Languages like Python and R are popular because they’re easy to use

You don’t really have to think about memory
You don’t need to explicitly define types
They give you access to great libraries that are useful for data science

But they also have their downsides

They tend to be slow
They’re not great for light-weight parallel code
And you don’t need to explicitly define types!

Getting around the slowness of Python/R

One way to make Python/R faster is to write performance critical code in C/C++ or Fortran¹
- Many popular libraries actually call these languages under the hood
- numpy includes a lot of C, scipy includes a lot of C and Fortran
- dplyr relies on some C++ code

But writing C/C++ code usually comes with big costs

Costs of writing C/C++

Development times can be slower because C/C++ is harder
- You can spend hours hunting segfaults (memory bugs)
- linters and LSPs aren’t always great
- build systems are difficult to use
- lacks easy tools / repositories for package management
- they’re not well-designed for writing code in a high-level (e.g., functional) style

Even though writing code is a critical part of data science, most data scientists aren’t professional programmers

So putting up with the slowness is usually a better choice than trying to optimize some code in C/C++

An alternative to C/C++

Enter RUST…

…so what is Rust?

The Rust Programming Language

Started in ~2006 by Graydon Hoare as a personal project while he was working at Mozilla, which began sponsoring it in 2009
Designed with an emphasis on security, performance, and usability
Sometimes pigeonholed as a “systems programming” language, but has some high-level “functional programming”-like features
Incredibly flexible macro system, which makes it possible to easily construct DSLs¹
Emphasis on “zero-cost abstractions”, which allow you to write high-level style code (generic functions, collections and iterators, macros) that compiles down to the same assembly code you’d get if you wrote low-level C-style code

What makes Rust great: Memory safety

In languages like C/C++ you have to manage memory manually

This can cause problems because it’s difficult to get right
Anybody who’s spent any time writing C code would’ve come across dreaded segfaults!

In languages like Python/R you don’t have to manage memory manually

The runtime manages it for you with a garbage collector which checks and frees memory that is no longer being used

Garbage collection makes life easier, but you take a performance hit while the garbage collector does its job

What makes Rust great: Memory safety

Rust doesn’t have a garbage collector, but memory management is made easier through its ownership system

Each value must have an owner
It can only have one owner at a time
When the owner goes out of scope, the value is dropped (the memory is freed)

// create a vector
let v1 = vec![1, 2, 3, 4];

// move the value in v1 into v2 (making v2 the new owner)
let v2 = v1;

// try to print out the contents of v1
// won't compile because v1 is no longer accessible
print!("{:#?}", v1);
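
As a rough sketch, here are two ways the snippet above could be changed so that it compiles: borrow the vector instead of moving it, or clone it. The function names are made up for illustration and are not part of the talk.

```rust
// Hypothetical helper names, for illustration only.

/// Borrow instead of move: `v2` is a reference, so `v1` keeps ownership.
fn borrow_instead_of_move() -> (usize, usize) {
    let v1 = vec![1, 2, 3, 4];
    let v2 = &v1; // borrow: no ownership transfer
    (v1.len(), v2.len()) // both names are still usable
}

/// Clone instead of move: `v2` owns an independent copy of the data.
fn clone_instead_of_move() -> (Vec<i32>, Vec<i32>) {
    let v1 = vec![1, 2, 3, 4];
    let mut v2 = v1.clone(); // copies the data, which costs an allocation
    v2.push(5); // changing the copy leaves `v1` untouched
    (v1, v2)
}
```

Borrowing is usually the cheaper option; cloning only makes sense when two independent copies are really needed.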

What makes Rust great: Memory safety

Although each value can only have one owner, functions can “borrow” a value (with a reference), and then give it back to the owner when they’re done

The borrow checker checks that the rules around borrowing are followed

fn main() {

    let s1 = "Hello".to_string();

    let length = calc_len(&s1);
    print!("The length of {s1} is {length}");
}

fn calc_len(s: &String) -> usize {
    return s.len();
}

s1 is passed to calc_len as a reference (a borrow), so the main function still owns it. If it were passed by value instead, calc_len would take ownership, the value would be dropped when calc_len returns, and the later use of s1 in main wouldn’t compile
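
The example above only shows a shared (read-only) borrow. A small sketch of the other kind, a mutable borrow, might look like this. The function names add_world and borrow_demo are invented for this illustration.

```rust
/// Shared borrow: reads the string without taking ownership.
/// Taking `&str` rather than `&String` is the more common idiom.
fn calc_len(s: &str) -> usize {
    s.len()
}

/// Mutable borrow: the function may change the caller's value in place.
fn add_world(s: &mut String) {
    s.push_str(", world");
}

/// Uses both kinds of borrow on a value that this function keeps owning.
fn borrow_demo() -> (usize, String) {
    let mut s1 = String::from("Hello");
    let before = calc_len(&s1); // the shared borrow ends when the call returns
    add_world(&mut s1); // exclusive borrow: allowed because no other borrow is live
    (before, s1) // `s1` is still owned here, so it can be returned
}
```

The borrow checker allows either many shared borrows or one mutable borrow at a time, which is what rules out a whole class of memory bugs at compile time.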

What makes Rust great: Performance

Data is immutable by default, so multiple functions (across threads) can borrow values without danger
Because of the memory safety guarantees, you don’t have to worry about data races
If you want to apply functions to vectors in parallel, then it’s as easy as swapping out iter() for par_iter() (from the rayon crate)
It’s also trivial to spin up light-weight threads or write async code (with similar syntax to JS async/await)

// create a vector and multiply each value by 10
let v1 = vec![1, 2, 3];
let v2: Vec<i32> = v1.iter().map(|x| x * 10).collect();

// do the same but in parallel (par_iter() comes from the rayon crate)
use rayon::prelude::*;
let v1 = vec![1, 2, 3];
let v2: Vec<i32> = v1.par_iter().map(|x| x * 10).collect();
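
The par_iter() call above needs the rayon crate. For comparison, here is a rough sketch of the same idea using only the standard library's scoped threads. The function name times_ten_parallel and the n_threads parameter are assumptions made for this example.

```rust
use std::thread;

/// Multiply every value by 10, splitting the work across scoped threads.
/// `n_threads` is a hypothetical tuning parameter for this sketch.
fn times_ten_parallel(data: &[i32], n_threads: usize) -> Vec<i32> {
    // Round up so every element lands in a chunk; never use a chunk size of 0.
    let n = n_threads.max(1);
    let chunk_size = ((data.len() + n - 1) / n).max(1);

    thread::scope(|scope| {
        // Each thread borrows its own chunk of `data`: no copying, no locks.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * 10).collect::<Vec<i32>>()))
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For a three-element vector the threading overhead far outweighs the work, so this only pays off on large inputs. A library like rayon handles the chunking and scheduling for you.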

What makes Rust great: Performance

“Zero-cost” abstraction means that you can write high-level style code with the performance of low-level style code
But if you want to write low-level code, then you can do that too
- You’re using a low-level language, so if you want to write SIMD vectorized code, then you can! And Rust makes it easy!
Even generic functions (i.e., functions that can take any type that has certain properties) don’t take a performance hit, because the compiler generates a specialised version for each concrete type
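
To make the idea concrete, here is a small sketch with the same computation written in a high-level style and in a low-level style, plus one generic function. All function names are invented for this example. In optimised builds the two styles would typically compile to comparable machine code, though that is worth checking for any particular case.

```rust
/// High-level style: an iterator chain.
fn sum_even_squares_iter(values: &[i64]) -> i64 {
    values.iter().filter(|&&x| x % 2 == 0).map(|x| x * x).sum()
}

/// Low-level style: an explicit index loop doing the same work.
fn sum_even_squares_loop(values: &[i64]) -> i64 {
    let mut total = 0;
    let mut i = 0;
    while i < values.len() {
        if values[i] % 2 == 0 {
            total += values[i] * values[i];
        }
        i += 1;
    }
    total
}

/// Generic function: works for any type that can be compared and copied.
/// The compiler generates one specialised copy per concrete type it is
/// used with, so there is no run-time dispatch.
fn largest<T: PartialOrd + Copy>(values: &[T]) -> Option<T> {
    let mut iter = values.iter().copied();
    let first = iter.next()?; // an empty slice has no largest element
    Some(iter.fold(first, |best, x| if x > best { x } else { best }))
}
```

The same largest function can be called on a slice of integers or a slice of floats without any change to the code.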

What makes Rust great: Its magical type system

Python and R have null-like values (None [Python]; NULL, NA, NaN [R])
Working with these values will often produce errors, so you have to remember to check for them!
Rust instead has an Option type¹ with Some and None variants
In Python and R you handle errors by catching exceptions (try/except [Python], tryCatch [R])
Rust instead has a Result type² with Err and Ok variants
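
As a rough illustration of how Option and Result show up in everyday code, the sketch below computes a mean from a comma-separated string. The function names are made up for this example.

```rust
use std::num::ParseFloatError;

/// Mean of a slice, or `None` when the slice is empty (no NaN, no exception).
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}

/// Parse comma-separated numbers; the first bad field is returned as an `Err`.
fn parse_numbers(input: &str) -> Result<Vec<f64>, ParseFloatError> {
    input
        .split(',')
        .map(|field| field.trim().parse::<f64>())
        .collect()
}

/// Combine both: `?` hands an error back to the caller instead of throwing.
fn mean_of_str(input: &str) -> Result<Option<f64>, ParseFloatError> {
    let numbers = parse_numbers(input)?;
    Ok(mean(&numbers))
}
```

The return type tells the caller up front that the parse can fail and that the mean can be missing, and the compiler will not let either case be ignored silently.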

What makes Rust great: Its magical type system

The Option and Result types are examples of enumerated types (enums)
With enums you can do pattern matching and the compiler checks that you’re matching against all possible variants
enums allow you to wrap more complex types (structs) and do OO-style inheritance without all the problems of OO-style inheritance
The trait system (similar to interfaces), and generics also give you a tonne of features that can make writing Rust feel like writing code in a high-level language rather than a “systems” language

let v1 = vec![1, 2, 3];
let s = v1.get(3); // returns an Option, which is None here (index out of bounds)
let s = match s {
    Some(v) => v, // if it's Some, give me the value inside
    None => &0,   // if it's None, give me 0
};
println!("{}", s);
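
Enums are not limited to Option and Result. The sketch below defines a hypothetical Cell enum (invented for this example) whose variants wrap different kinds of data, and uses exhaustive pattern matching on it.

```rust
/// A hypothetical enum for this sketch: each variant wraps different data.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Number(f64),
    Text(String),
    Missing,
}

/// The compiler rejects this `match` if any variant is left unhandled.
fn describe(cell: &Cell) -> String {
    match cell {
        Cell::Number(n) if *n < 0.0 => format!("negative number {n}"),
        Cell::Number(n) => format!("number {n}"),
        Cell::Text(s) => format!("text of length {}", s.len()),
        Cell::Missing => String::from("missing"),
    }
}

/// Keep only the numeric cells, skipping everything else.
fn numbers_only(cells: &[Cell]) -> Vec<f64> {
    cells
        .iter()
        .filter_map(|cell| match cell {
            Cell::Number(n) => Some(*n),
            _ => None,
        })
        .collect()
}
```

If a new variant were added to Cell later, describe would stop compiling until the new case was handled, which is exactly the check the slide describes.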

What makes Rust great: The powerful macro system

Rust lets you write powerful compile-time macros that let you create simple (or complex) DSLs
Rust functions don’t support default or named arguments (unlike, e.g., Python and R), but you can just write a macro that makes it possible
You can run the pt() function in R with named arguments, and any you leave off just get the default value…

result <- pt(q = 5.5, df = 1.0)

But I can do exactly the same in Rust…

let result = pt!(q = 5.5, df = 1.0);

I can even change the argument order around, just like I could do in R

let result = pt!(df = 1.0, q = 5.5);
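
The talk does not show how pt! is implemented. As a rough sketch of one way such a macro could work, here is a different, much simpler function with named arguments and defaults. Everything in it (ZScoreArgs, z_score, and the z_score! macro) is hypothetical and only illustrates the technique.

```rust
/// Arguments for a hypothetical `z_score` function, with R-style defaults.
#[derive(Debug, Clone, Copy)]
struct ZScoreArgs {
    x: f64,
    mean: f64,
    sd: f64,
}

impl Default for ZScoreArgs {
    fn default() -> Self {
        ZScoreArgs { x: 0.0, mean: 0.0, sd: 1.0 }
    }
}

/// The plain function that does the work.
fn z_score(args: ZScoreArgs) -> f64 {
    (args.x - args.mean) / args.sd
}

/// Accepts `name = value` pairs in any order; omitted names keep their defaults.
macro_rules! z_score {
    ($($name:ident = $value:expr),* $(,)?) => {{
        let mut args = ZScoreArgs::default();
        $( args.$name = $value; )* // one field assignment per named argument
        z_score(args)
    }};
}
```

With this in place, z_score!(x = 5.5, mean = 1.0) and z_score!(mean = 1.0, x = 5.5) give the same result, and a misspelled argument name is a compile-time error because the struct has no such field.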

What makes Rust great: The powerful macro system

With macros I don’t have to change languages, I can just make Rust work how I want it to work
The previous example is just a silly little toy example, but other macros let you do things like write HTML in Rust!

html! {
    <div>
        <div data-key="abc"></div>
        <div class="parent">
            <span class="child" value="anything"></span>
            <label for="first-name">{ "First Name" }</label>
            <input type="text" id="first-name" value="placeholder" />
            <input type="checkbox" checked=true />
            <textarea value="write a story" />
            <select name="status">
                <option selected=true disabled=false value="">{ "Selected" }</option>
                <option selected=false disabled=true value="">{ "Unselected" }</option>
            </select>
        </div>
    </div>
};

What makes Rust great: The powerful macro system

Or you can easily add new language features like:

A pipe operator¹

let res = pipe!(
   4
   => (times(2))
   => {|x| x + 2}
);

Or infix function notation

let res = a | dotprod | b;
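
A pipe operator like the one above comes from a crate, but a stripped-down version is short enough to sketch with macro_rules!. The macro below is an illustration written for this example, with a simpler syntax than the one shown on the slide, and is not the implementation of any particular crate.

```rust
/// A minimal pipe macro: feeds a value through each function, left to right.
macro_rules! pipe {
    ($value:expr $(=> $func:expr)+) => {{
        let piped = $value;
        // Each step shadows `piped` with the result of the next function.
        $( let piped = ($func)(piped); )+
        piped
    }};
}

/// A named function to use as a pipeline step.
fn double(x: i32) -> i32 {
    x * 2
}

/// Named functions and closures can be mixed freely in one pipeline.
fn pipeline_demo() -> i32 {
    pipe!(4 => double => |x: i32| x + 2)
}
```

Because the macro expands at compile time into ordinary nested let bindings, the pipeline costs nothing extra at run time.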

What makes Rust great: Usability

The compiler explains the error and suggests a fix!

error[E0382]: borrow of moved value: `v1`
 --> test.rs:9:17
  |
3 | let v1 = vec![1, 2, 3, 4];
  |     -- move occurs because `v1` has type `Vec<i32>`, which does not implement the `Copy` trait
...
6 | let v2 = v1;
  |          -- value moved here
...
9 | print!("{:#?}", v1);
  |                 ^^ value borrowed here after move
  |
  = note: this error originates in the macro `$crate::format_args` which comes from the expansion of the macro `print` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider cloning the value if the performance cost is acceptable
  |
6 | let v2 = v1.clone();
  |            ++++++++

And if I need more help:

cargo --explain E0382

What makes Rust great: Usability

The Rust tooling is amazing!
Rust includes a build-system and a package manager (Cargo)
It also ships with a built-in linter (Clippy)
It includes an LSP server (rust-analyzer)
And it has a built-in formatter (rustfmt)
crates.io also serves as a centralized package repository (i.e., like npm, CRAN, PyPI etc)

But Rust for data science?

High-performance data frames

One Python library that is ubiquitous in data science is pandas, the data frame library for Python
- But pandas is slow despite being partly written in Cython and C
An alternative to pandas is pola-rs
- pola-rs is a data frame library for Python, Rust, NodeJS, R (or anywhere that has a C FFI) written in Rust.
- pola-rs is faster than pandas, dask, dplyr, data.table, DataFrames.jl¹
You can start using pola-rs today without writing any Rust

A lazy dashboard with Polars and WASM

import {viewof query_string, viewof table, viewof example } from "@colling-lab/rust-for-data-science"

viewof query_string
viewof example

viewof table

Note that it’s perfectly safe for me to expose a raw SQL interface to the internet, because there isn’t actually a SQL server running anywhere
It’s just a .csv file and some client-side magic with the power of Rust

Array programming: A numpy alternative

The ndarray crate gives you much of the functionality of something like numpy from Rust
This includes all the basic functionality you’d expect, but also powerful linear algebra tools
ndarray makes it a lot easier to build machine learning and statistics systems in Rust
- linfa is a Rust crate (which makes use of ndarray) that provides a lot of functionality of sklearn
- This includes e.g., PCA, ICA, t-SNE, LASSO, SVM, Decision Trees, Logistic Regression, PLS

The ecosystem is expanding with crates for Gaussian Process Regression, automatic differentiation, deep learning, and wrappers for, e.g., the GSL and libtorch

Other Rust tools you can use now

High-performance data validation

A popular library for data validation is pydantic
pydantic V2 (due for release later this year) has undergone a Rust rewrite of its core, giving a 5-50x speed-up

High-performance tokenizers

Hugging Face’s tokenizers library for Python is written in Rust

High-performance linting

In large codebases, Python linters can be slow
ruff is a new Python linter, written in Rust, that is designed to be fast

Package management tools

juliaup, a Julia installer and manager written in Rust
rye, a Python package management system written in Rust

It’s not Rust or Python/R, it’s Rust and Python/R

As a compiled language, Rust isn’t great for interactive computing (even though there’s a Rust Jupyter Kernel!)
But it’s easy to call Rust from C, so you can call Rust from any language that speaks C, including Python, R, Matlab, and Julia
For some of these languages you have to write the C interface yourself¹
PyO3 lets you call Rust from Python, and pass Python types and classes back and forth between Python and Rust
extendr lets you call Rust from R, and pass R types and classes back and forth between R and Rust. You can even call R functions from Rust, or access the R C API (using the Rust crate libR-sys)²

It’s not Rust or Python/R, it’s Rust and Python/R: An Example

There’s an R package called BFDA that helps you do sample size planning for analyses using Bayes Factors
BFDA uses simulations to do this, and it is incredibly slow even when using multiple cores
- A single simulation takes over 6 minutes!
- Rewriting it in a more functional style gets it down to just over 1 minute, but even that is still slow

It’s not Rust or Python/R, it’s Rust and Python/R: An Example

I rewrote it in Rust and now the simulations take ~1.5 seconds!

import pyby
import time
import pandas as pd

tic = time.perf_counter()

df = pyby.bf_sim(
    0.5,
    "t.paired",
    {"family": "Cauchy", "params": [0, 0.707], "alternative": "two.sided"},
    sampling_rule={"n_min": 50, "n_max": 300, "step_size": 5},
    alternative="two.sided",
    reps=1000,
    seed=1600,
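)

# report how long the simulation took
toc = time.perf_counter()
print(f"Simulation took {toc - tic:.2f} seconds")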

df = pd.DataFrame.from_records(
    df, columns=["id", "true.ES", "n", "logBF", "emp.ES", "statistic", "p.value"]
)

As an added bonus, I can easily use the C libraries that ship with R for probability distributions (this is also the default in Julia) rather than relying on scipy

Where does Rust fit in?

I’ve mainly talked about why I love Rust, but where and when should you be using it?
The primary place is high-performance library code (as the polars, pydantic etc examples show)
But I think it’s also great if you find functional-style programming works better for data (which I certainly do!)
And it’s great for multithreading and parallel code
Finally, if you’re parsing data (from plain strings, to JSON, to XML), Rust’s type system and ecosystem (e.g., the nom crate) make it a breeze (even when dealing with streaming data)
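
As a small sketch of what type-driven parsing can look like, the example below parses lines of a made-up "id,condition,score" format using only the standard library (no nom). The Trial, Condition, and ParseError types and the file format are all assumptions made for this illustration.

```rust
use std::str::FromStr;

/// Experimental condition in a hypothetical data file.
#[derive(Debug, PartialEq)]
enum Condition {
    Control,
    Treatment,
}

/// One row of a hypothetical "id,condition,score" text file.
#[derive(Debug, PartialEq)]
struct Trial {
    id: u32,
    condition: Condition,
    score: f64,
}

/// Every way a line can be malformed gets its own variant.
#[derive(Debug, PartialEq)]
enum ParseError {
    WrongFieldCount(usize),
    BadId(String),
    BadCondition(String),
    BadScore(String),
}

impl FromStr for Trial {
    type Err = ParseError;

    fn from_str(line: &str) -> Result<Self, Self::Err> {
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        if fields.len() != 3 {
            return Err(ParseError::WrongFieldCount(fields.len()));
        }
        let id = fields[0]
            .parse::<u32>()
            .map_err(|_| ParseError::BadId(fields[0].to_string()))?;
        let condition = match fields[1] {
            "control" => Condition::Control,
            "treatment" => Condition::Treatment,
            other => return Err(ParseError::BadCondition(other.to_string())),
        };
        let score = fields[2]
            .parse::<f64>()
            .map_err(|_| ParseError::BadScore(fields[2].to_string()))?;
        Ok(Trial { id, condition, score })
    }
}

/// Parse every line of a text block, keeping the per-line errors.
fn parse_trials(text: &str) -> Vec<Result<Trial, ParseError>> {
    text.lines().map(|line| line.parse::<Trial>()).collect()
}
```

Once a line has been parsed into a Trial, the rest of the program never has to re-check that the id is a number or that the condition is valid, because the types already guarantee it.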

So if you find any of this appealing, then check out Rust!

Ferris will thank you for it!

You can find the slides at https://talks.colling.net.nz/sds