Drivel
Over the past few weeks, I built a small CLI tool called drivel.
The purpose of drivel is twofold:
- to infer a schema from a bunch of JSON data
- to produce synthetic data (drivel) based on the inferred schema
I work with a lot of divergent data sets in my day job, and having a tool like drivel makes it easy to explore input data or generate large test datasets for things like performance testing. Also (more importantly), I was craving a bit of side project work, and have been meaning to write more Rust for ages.
In this blog post I will gather a few thoughts on building this project.
Schema inference
Recursive structures and enums
Perhaps the most challenging part of drivel was figuring out how to infer a schema from JSON input data. For JSON scalars (numbers, strings, booleans, null), things are simple.
If your JSON input is a string, we infer a string schema. If it is a number, we infer a number.
However, most JSON data consists of arrays or objects of scalars, often nested.
This makes JSON schemata recursive. One might find an array at the root of the JSON structure. This array has elements, which themselves have a schema that can be inferred. These elements might in turn consist of arrays or objects, which also have elements, ... and so on.
In Rust, this translates neatly to an enum with some recursive variants. Note that these recursive variants require indirection; otherwise the compiler doesn't know what size they have. This is why we Box them.
pub enum SchemaState {
    Initial,
    Null,
    Nullable(Box<SchemaState>),
    String(StringType),
    Number(NumberType),
    Boolean,
    Array {
        min_length: usize,
        max_length: usize,
        schema: Box<SchemaState>,
    },
    Object {
        required: std::collections::HashMap<String, SchemaState>,
        optional: std::collections::HashMap<String, SchemaState>,
    },
    Indefinite,
}
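To make the recursion concrete, here is a sketch of the state one might infer for an array of objects. The StringType and NumberType stubs are hypothetical stand-ins for drivel's real types, just to keep the snippet self-contained:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for drivel's StringType and NumberType,
// only here so the sketch compiles on its own.
#[derive(Debug)]
pub struct StringType;
#[derive(Debug)]
pub struct NumberType;

#[derive(Debug)]
pub enum SchemaState {
    Initial,
    Null,
    Nullable(Box<SchemaState>),
    String(StringType),
    Number(NumberType),
    Boolean,
    Array {
        min_length: usize,
        max_length: usize,
        schema: Box<SchemaState>,
    },
    Object {
        required: HashMap<String, SchemaState>,
        optional: HashMap<String, SchemaState>,
    },
    Indefinite,
}

// A state one might infer for: [{"name": "a", "age": 1}, {"name": "b"}]
pub fn example_state() -> SchemaState {
    SchemaState::Array {
        min_length: 2,
        max_length: 2,
        // The element schema is itself a SchemaState, hence the Box.
        schema: Box::new(SchemaState::Object {
            required: HashMap::from([("name".to_string(), SchemaState::String(StringType))]),
            // "age" is missing from one element, so it lands in `optional`.
            optional: HashMap::from([("age".to_string(), SchemaState::Number(NumberType))]),
        }),
    }
}

fn main() {
    println!("{:#?}", example_state());
}
```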
Merging states
As noted above, scalars are easily mapped to their corresponding SchemaState variant. The same applies to an object; we just need to map each field to a SchemaState variant.
Arrays are trickier, however, because in order to infer the SchemaState for an array's
elements, we need to infer a schema that applies to all elements of the array, even though
these elements might be disparate. For instance, an array might include nulls
and numbers.
Or multiple objects, each with different keys. Somehow we need to be able to merge these
different values into a single SchemaState.
Thankfully, our SchemaState enum has some nice properties:
- for every pair of states, there is a way to combine them
- there is a state (SchemaState::Initial) that acts as a "no-op"; combining this state with any other state will yield the other state
Let's look at an example.
Imagine wanting to infer the schema of the following array: [1, 3, null, 5]. All we need to do is infer the schema of each of its elements and merge these together. Inferring the schema for each element gives us:
[SchemaState::Number, SchemaState::Number, SchemaState::Null, SchemaState::Number]
Now let's assume that we have a merge function:
fn merge(left: SchemaState, right: SchemaState) -> SchemaState {
    match (left, right) {
        // a bunch of match arms
    }
}
The shape of this function allows us to fold it over a list of SchemaStates in order to collapse it into a single, merged SchemaState.
let state = inferred_states.into_iter().fold(SchemaState::Initial, merge);
If we run through our example array, the following merge steps would be taken:
- SchemaState::Initial merged with SchemaState::Number yields SchemaState::Number
- SchemaState::Number merged with SchemaState::Number yields SchemaState::Number
- SchemaState::Number merged with SchemaState::Null yields SchemaState::Nullable(Box::new(SchemaState::Number))
- SchemaState::Nullable(Box::new(SchemaState::Number)) merged with SchemaState::Number yields SchemaState::Nullable(Box::new(SchemaState::Number))
In drivel, these merging rules are encoded in a large match expression that lives in the merge function.
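For illustration, here is a minimal sketch of what some of those match arms might look like, over a pared-down version of the enum. This is an assumption about the shape of the rules, not drivel's actual code; the real merge function handles many more cases, such as widening string and number types.

```rust
// A pared-down SchemaState, just enough to show the shape of the merge rules.
#[derive(Debug, PartialEq, Clone)]
pub enum SchemaState {
    Initial,
    Null,
    Number,
    Nullable(Box<SchemaState>),
    Indefinite,
}

pub fn merge(left: SchemaState, right: SchemaState) -> SchemaState {
    use SchemaState::*;
    match (left, right) {
        // Initial is the no-op: merging it with any state yields that state.
        (Initial, other) | (other, Initial) => other,
        // Merging equal scalar states changes nothing.
        (Number, Number) => Number,
        (Null, Null) => Null,
        // Two nullable states merge their inner schemas.
        (Nullable(a), Nullable(b)) => Nullable(Box::new(merge(*a, *b))),
        // Null adds nothing to an already-nullable state.
        (Nullable(inner), Null) | (Null, Nullable(inner)) => Nullable(inner),
        // A nullable state absorbs a plain state into its inner schema.
        (Nullable(inner), other) | (other, Nullable(inner)) => {
            Nullable(Box::new(merge(*inner, other)))
        }
        // Null plus any other state makes that state nullable.
        (Null, other) | (other, Null) => Nullable(Box::new(other)),
        // States that cannot be reconciled collapse to Indefinite.
        _ => Indefinite,
    }
}

fn main() {
    use SchemaState::*;
    // The [1, 3, null, 5] example from above ends up as a nullable number.
    let states = vec![Number, Number, Null, Number];
    let merged = states.into_iter().fold(Initial, merge);
    assert_eq!(merged, Nullable(Box::new(Number)));
    println!("{:?}", merged);
}
```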
SchemaState as monoid
Some readers may have noticed that the "nice properties" of the SchemaState that I described are an informal way of saying that this data type is a monoid under the merge operation. That is:
- there is a binary operation to combine two SchemaStates (the merge operation)
- there is an "identity" element that, when combined with another element, yields that other element (the SchemaState::Initial variant)
- the binary operation is associative:
merge(merge(state_one, state_two), state_three) == merge(state_one, merge(state_two, state_three))
Parallelisation
drivel has a few key code paths where a lot of iteration happens:
- inferring the schema for all of the elements in an array
- merging inferred schemata
- producing arrays of synthetic data
For most use-cases, doing this on a single thread and in sequence will be plenty fast. But when working with large data sets, or when producing large quantities of synthetic data, it would be much nicer if we could parallelise this work and make use of all available CPU cores.
Rayon
To parallelise this work, I opted to use the excellent rayon crate. This is a crate specifically meant for parallelising computation-heavy, data-oriented workloads. It provides an ergonomic API that allows us to turn regular iterators into a parallelised equivalent, with the library taking care of deciding how exactly to parallelise based on the available cores and the expected workload.
Parallel workloads in drivel
Inferring the schema of all elements in an array was extremely straightforward to parallelise. All that was required was to turn a normal iterator over a vector into a parallel iterator.
// before (elements shown as serde_json values)
let inferred_states: Vec<_> = vec![json!(1), json!(3), json!(null), json!(5)]
    .into_iter()
    .map(infer_schema)
    .collect();

// after
let inferred_states: Vec<_> = vec![json!(1), json!(3), json!(null), json!(5)]
    .into_par_iter()
    .map(infer_schema)
    .collect();
Merging inferred schemata, by contrast, appears more difficult to parallelise. In the map
above, none of the elements depend on each other. Each is mapped separately and in isolation.
This is different for merging; here, the elements do depend on one another because every
merging operation folds two elements into one.
Before parallelising, the code looked something like this:
let state = inferred_states
    .into_iter()
    .fold(SchemaState::Initial, |left, right| merge(left, right));
This fold starts with the initial state and iterates over all inferred states to fold them into a single, merged state.
Perhaps surprisingly, parallelising this logic with rayon was in fact incredibly easy:
let state = inferred_states
    .into_par_iter()
    .reduce(|| SchemaState::Initial, |left, right| merge(left, right));
We use the reduce method on the parallel iterator and provide it with a closure that returns the initial state, as well as the merging function. Recall the properties of our SchemaState enum: there is an "identity" element that, when merged with any other element, acts as a no-op, yielding the other element.
This means that rayon can divide the list of states into chunks, running the reduce operation on each of those chunks, inserting the identity element at the start. Because the merge operation is associative, merging the chunks will always yield the same final result irrespective of how the chunks are split.
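One way to convince yourself of this is to simulate the chunking by hand: fold each chunk starting from the identity, then merge the per-chunk results, and compare against a single sequential fold. The sketch below uses a pared-down SchemaState and merge function (an assumption for illustration; drivel's real enum has more variants):

```rust
// A pared-down SchemaState and merge, just to demonstrate the reduce semantics.
#[derive(Debug, PartialEq, Clone)]
pub enum SchemaState {
    Initial,
    Null,
    Number,
    Nullable(Box<SchemaState>),
}

pub fn merge(left: SchemaState, right: SchemaState) -> SchemaState {
    use SchemaState::*;
    match (left, right) {
        // Initial is the identity element.
        (Initial, other) | (other, Initial) => other,
        (Number, Number) => Number,
        (Null, Null) => Null,
        (Nullable(a), Nullable(b)) => Nullable(Box::new(merge(*a, *b))),
        (Nullable(inner), Null) | (Null, Nullable(inner)) => Nullable(inner),
        (Nullable(inner), other) | (other, Nullable(inner)) => {
            Nullable(Box::new(merge(*inner, other)))
        }
        (Null, other) | (other, Null) => Nullable(Box::new(other)),
    }
}

fn main() {
    use SchemaState::*;
    let states = vec![Number, Null, Number, Number, Null, Number];

    // One sequential fold over everything.
    let sequential = states.iter().cloned().fold(Initial, merge);

    // Simulated parallel reduce: fold each chunk from the identity,
    // then merge the per-chunk results.
    let chunked = states
        .chunks(2)
        .map(|chunk| chunk.iter().cloned().fold(Initial, merge))
        .fold(Initial, merge);

    // Identity plus associativity guarantee the two agree.
    assert_eq!(sequential, chunked);
    println!("{:?}", sequential);
}
```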
Allocation woes
Parallelising these operations with rayon led to about a 3x speedup on my MacBook Pro. This is especially useful for large files. Inferring the schema for 500MB of JSON data went from taking about 18 seconds to taking 6 seconds. I'm sure there's more optimisation to be done, but this is fast enough for my use case.
On my Linux laptop, however, the story wasn't quite as rosy. Rather than a 3x speedup, I observed a 2x slowdown, especially for larger files. This was highly unexpected; surely parallelisation should have sped things up?
After doing some profiling, I realised that it was the overhead of parallelisation that was causing this slowdown. In particular, a lot of time seemed to be spent doing memory allocations. Doing a bit more research, I decided to try using jemallocator instead, which is billed as an allocator that provides scalable concurrency.
Swapping out the global allocator for a Rust project is as easy as adding a few lines of code to the entrypoint:
// in main.rs
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
Doing this dramatically improved the performance on Linux. Now I was seeing the same 3x speedup as on macOS. Hurrah.
Further development
drivel now provides the core functionality that I wanted from the tool, and its performance is at an acceptable level.
That said, there is ample scope for further work. For instance, drivel could be much smarter about inferring string types. At the moment, it detects things like datetimes, UUIDs, emails and hostnames. If it doesn't detect one of these types, it keeps track of the character distribution and lengths of the strings so that when it generates data, it looks roughly similar to its input data. This is a reasonable approach, but it doesn't work well when there is a fixed set of possible values in the data. I'd love for drivel to be able to infer when a string is an enum with fixed values, but I'm not quite sure yet how to model this.
In addition, I'd also like it if a user of drivel could "correct" the inferred schema before producing data. I imagine a TUI-like interface that shows the inferred schema, asking the user if this is right and offering them the possibility to amend the schema if need be.
So, if you feel like contributing... here is the repo.