
Building Drivel

Drivel

Over the past few weeks, I built a small CLI tool called drivel. The purpose of drivel is twofold:

  1. I work with a lot of divergent data sets in my day job, and having a tool like drivel makes it easy to explore input data or to generate large test datasets for things like performance testing.
  2. More importantly, I was craving a bit of side project work, and I have been meaning to write more Rust for ages.

In this post, I will gather a few thoughts on building this project.

Schema inference

Recursive structures and enums

Perhaps the most challenging part of drivel was figuring out how to infer a schema from JSON input data. For JSON scalars (numbers, strings, booleans, null), things are simple: if the input is a string, we infer a string schema; if it is a number, we infer a number schema. However, most JSON data consists of arrays or objects of scalars, often nested.

This makes JSON schemata recursive. One might find an array at the root of the JSON structure. This array has elements, which themselves have a schema that can be inferred. These elements might in turn consist of arrays or objects, which also have elements, ... and so on.

In Rust, this translates neatly to an enum with some recursive variants. Note that these recursive variants require indirection; otherwise the compiler doesn't know what size they have. This is why we Box them.

pub enum SchemaState {
    Initial,
    Null,
    Nullable(Box<SchemaState>),
    String(StringType),
    Number(NumberType),
    Boolean,
    Array {
        min_length: usize,
        max_length: usize,
        schema: Box<SchemaState>,
    },
    Object {
        required: std::collections::HashMap<String, SchemaState>,
        optional: std::collections::HashMap<String, SchemaState>,
    },
    Indefinite,
}

Merging states

As noted above, scalars map easily to their corresponding SchemaState variant. The same applies to objects; we just need to map each field's value to a SchemaState variant.
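To make that concrete, here is a minimal sketch of what scalar and object inference could look like, assuming serde_json::Value as the input type. infer_string and infer_number are hypothetical helpers standing in for drivel's actual string and number classification, and arrays are deferred to the discussion below.

use serde_json::Value;

// A sketch only, not drivel's actual implementation. infer_string and
// infer_number are hypothetical helpers that would produce the StringType
// and NumberType details.
fn infer_schema(value: Value) -> SchemaState {
    match value {
        Value::Null => SchemaState::Null,
        Value::Bool(_) => SchemaState::Boolean,
        Value::String(s) => SchemaState::String(infer_string(&s)),
        Value::Number(n) => SchemaState::Number(infer_number(&n)),
        Value::Object(map) => SchemaState::Object {
            // On a single object, every field we see is required; optional
            // fields only appear when merging several objects.
            required: map
                .into_iter()
                .map(|(key, value)| (key, infer_schema(value)))
                .collect(),
            optional: std::collections::HashMap::new(),
        },
        // Arrays need their element schemas merged into one; see below.
        Value::Array(_) => todo!("arrays are handled by merging element schemas"),
    }
}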

Arrays are trickier, however, because in order to infer the SchemaState for an array's elements, we need to infer a schema that applies to all elements of the array, even though these elements might be disparate. For instance, an array might include nulls and numbers. Or multiple objects, each with different keys. Somehow we need to be able to merge these different values into a single SchemaState.

Thankfully, our SchemaState enum has some nice properties:

  - Merging any two SchemaStates yields another SchemaState, so the result of a merge can itself be merged again.
  - There is a "neutral" state, SchemaState::Initial, which leaves any other state unchanged when merged with it.
  - Merging is associative: it doesn't matter how consecutive merges are grouped, the end result is the same.

Let's look at an example.

Imagine wanting to infer the schema of the following array: [1, 3, null, 5]. All we need to do is infer the schema of each of its elements and merge these together. Inferring the schema for each element gives us: [SchemaState::Number, SchemaState::Number, SchemaState::Null, SchemaState::Number]

Now let's assume that we have a merge function:

fn merge(left: SchemaState, right: SchemaState) -> SchemaState {
    match (left, right) {
        // a bunch of match arms
    }
}

The shape of this function allows us to fold a list of SchemaStates with it, collapsing the list into a single, merged SchemaState.

let state = inferred_states.into_iter().fold(SchemaState::Initial, merge);

If we run through our example array, the following merge steps would be taken:

  1. SchemaState::Initial merged with SchemaState::Number yields SchemaState::Number
  2. SchemaState::Number merged with SchemaState::Number yields SchemaState::Number
  3. SchemaState::Number merged with SchemaState::Null yields SchemaState::Nullable(Box::new(SchemaState::Number))
  4. SchemaState::Nullable(Box::new(SchemaState::Number)) merged with SchemaState::Number yields SchemaState::Nullable(Box::new(SchemaState::Number))

In drivel, these merging rules are encoded in a large match expression that lives in the merge function.
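To give a flavour of those rules, a handful of the arms might look like the sketch below. This is illustrative rather than drivel's actual match expression: the NumberType details are glossed over, and a fuller implementation would, for instance, avoid double-wrapping states that are already Nullable.

fn merge(left: SchemaState, right: SchemaState) -> SchemaState {
    match (left, right) {
        // Initial is the identity element: merging with it is a no-op.
        (SchemaState::Initial, other) | (other, SchemaState::Initial) => other,
        // Identical scalar states stay as they are.
        (SchemaState::Null, SchemaState::Null) => SchemaState::Null,
        (SchemaState::Boolean, SchemaState::Boolean) => SchemaState::Boolean,
        // Two numbers stay a number; a real implementation would also merge
        // the NumberType details.
        (SchemaState::Number(n), SchemaState::Number(_)) => SchemaState::Number(n),
        // A null next to any other state makes that state nullable.
        (SchemaState::Null, other) | (other, SchemaState::Null) => {
            SchemaState::Nullable(Box::new(other))
        }
        // ... many more arms for strings, arrays, objects, nullables, ...
        // Anything this sketch cannot reconcile collapses to Indefinite.
        _ => SchemaState::Indefinite,
    }
}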

SchemaState as monoid

Some readers may have noticed that the "nice properties" of SchemaState that I described are an informal way of saying that this data type is a monoid under the merge operation.

That is:

  - merge is a binary operation that is closed over SchemaState: merging two states always produces another state;
  - merge is associative: merge(merge(a, b), c) equals merge(a, merge(b, c)) for any states a, b and c;
  - SchemaState::Initial is the identity element: merging it with any state, from either side, yields that state unchanged.
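Written out as code, these laws might look like the following test; it assumes SchemaState derives Clone and PartialEq, which isn't shown in the definition above.

#[test]
fn schema_state_behaves_like_a_monoid() {
    // Assumes SchemaState derives Clone and PartialEq (not shown above).
    let a = SchemaState::Boolean;
    let b = SchemaState::Null;
    let c = SchemaState::Boolean;

    // Identity: merging with Initial is a no-op, from either side.
    assert_eq!(merge(SchemaState::Initial, a.clone()), a);
    assert_eq!(merge(a.clone(), SchemaState::Initial), a);

    // Associativity: how consecutive merges are grouped doesn't matter.
    assert_eq!(
        merge(merge(a.clone(), b.clone()), c.clone()),
        merge(a, merge(b, c))
    );
}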

Parallelisation

drivel has a few key code paths where a lot of iteration happens:

  - inferring a schema for every element of a JSON array;
  - merging those inferred schemas into a single schema;
  - producing synthetic data based on an inferred schema.

For most use-cases, doing this on a single thread and in sequence will be plenty fast. But when working with large data sets, or when producing large quantities of synthetic data, it would be much nicer if we could parallelise this work and make use of all available CPU cores.

Rayon

To parallelise this work, I opted to use the excellent rayon crate. This crate is specifically meant for parallelising computation-heavy, data-oriented workloads. It provides an ergonomic API that allows us to turn regular iterators into their parallel equivalents, with the library taking care of deciding how exactly to parallelise based on the available cores and the expected workload.

Parallel workloads in drivel

Inferring the schema of all elements in an array was extremely straightforward to parallelise. All that was required was to turn a normal iterator over a vector into a parallel iterator.

// before (values built with serde_json's json! macro)
let inferred_states: Vec<_> = vec![json!(1), json!(3), json!(null), json!(5)]
    .into_iter()
    .map(infer_schema)
    .collect();

// after (requires `use rayon::prelude::*;`)
let inferred_states: Vec<_> = vec![json!(1), json!(3), json!(null), json!(5)]
    .into_par_iter()
    .map(infer_schema)
    .collect();

Merging inferred schemata, by contrast, appears more difficult to parallelise. In the map above, none of the elements depend on each other. Each is mapped separately and in isolation. This is different for merging; here, the elements do depend on one another because every merging operation folds two elements into one.

Before parallelising, the code looked something like this:

let state = inferred_states
    .into_iter()
    .fold(SchemaState::Initial, |left, right| merge(left, right));

This fold starts with the initial state and iterates over all inferred states to fold them into a single, merged state.

Perhaps surprisingly, parallelising this logic with rayon was in fact incredibly easy:

let state = inferred_states
    .into_par_iter()
    .reduce(|| SchemaState::Initial, |left, right| merge(left, right));

We use the reduce method on the parallel iterator and provide it with a closure that returns the initial state, as well as the merging function. Recall the properties of our SchemaState enum: there is an "identity" element that, when merged with any other element, acts as a no-op, yielding the other element.

This means that rayon can divide the list of states into chunks, running the reduce operation on each of those chunks, inserting the identity element at the start. Because the merge operation is associative, merging the chunks will always yield the same final result irrespective of how the chunks are split.
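The sketch below spells out that equivalence; it assumes SchemaState implements Clone, PartialEq and Send, and that rayon's prelude is in scope.

use rayon::prelude::*;

// Sketch: because Initial is an identity and merge is associative, reducing
// in parallel gives the same result as folding sequentially.
// Assumes SchemaState: Clone + PartialEq + Send.
fn parallel_matches_sequential(states: Vec<SchemaState>) {
    let sequential = states.clone().into_iter().fold(SchemaState::Initial, merge);
    let parallel = states.into_par_iter().reduce(|| SchemaState::Initial, merge);
    assert_eq!(sequential, parallel);
}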

Allocation woes

Parallelising these operations with rayon led to about a 3x speedup on my MacBook Pro. This is especially useful for large files: inferring the schema for 500MB of JSON data went from taking about 18 seconds to about 6 seconds. I'm sure there's more optimisation to be done, but this is fast enough for my use case.

On my Linux laptop, however, the story wasn't quite as rosy. Rather than a 3x speedup, I observed a 2x slowdown, especially for larger files. This was highly unexpected; surely parallelisation should have sped things up?

After doing some profiling, I realised that it was the overhead of parallelisation that was causing this slowdown. In particular, a lot of time seemed to be spent doing memory allocations. Doing a bit more research, I decided to try using jemallocator instead, which is billed as an allocator that provides scalable concurrency.

Swapping out the global allocator for a Rust project is as easy as adding a few lines of code to the entrypoint:

// in main.rs
// (with jemallocator added as a dependency in Cargo.toml)
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

Doing this dramatically improved the performance on Linux. Now I was seeing the same 3x speedup as on macOS. Hurrah.

Further development

drivel now provides the core functionality that I wanted from the tool, and its performance is at an acceptable level.

That said, there is ample scope for further work. For instance, drivel could be much smarter about inferring string types. At the moment, it detects things like datetimes, uuids, emails and hostnames. If it doesn't detect one of these types, it keeps track of the character distribution and lengths of the strings so that when it generates data, it looks roughly similar to its input data. This is a reasonable approach but it doesn't work well when there is a fixed set of possible values in the data. I'd love for drivel to be able to infer when a string is an enum with fixed values, but I'm not quite sure yet about how to model this.

In addition, I'd also like it if a user of drivel could "correct" the inferred schema before producing data. I imagine a TUI-like interface that shows the inferred schema, asking the user if this is right and offering them the possibility to amend the schema if need be.

So, if you feel like contributing... here is the repo.

Tags: #rust #open source #drivel #json #data