Duplik and learning Zig

Introduction

The ascent of AI-assisted (or, rather, AI-led) coding has made me feel very jaded about programming as a practice. In my dayjob, engineers are being pushed fairly hard to lean heavily on tools like Claude Code; speed is of the essence, quality less so, understanding even less. This is not necessarily a damning judgment of my employer specifically, as the same dynamics are at play in pretty much all of the industry right now.

In a previous blog article, I have written about how I've been grappling with AI and am limiting my personal use of it. In this article, I want to share my experience reclaiming the "old way" of building software and learning about new languages. Yes, I might be a dinosaur. But programming and learning is a lot of fun - and we are not yet extinct.

The project: duplik

Note: this project is available on codeberg; please feel free to fork it, contribute to it, give me feedback on it (at [email protected], or ask me questions about it.

I have been wanting to / trying to learn Zig for a while now. Until this point, this has consisted mainly in an aborted attempt to finish all of last year's advent of code, the remnants of which are still available on my codeberg.

But what to build? A friend of mine wanted to use a CLI tool to deduplicate files on his NAS, which felt like an ideal well-scoped Zig project that can make use of its character as a systems language. There is a bunch of prior art here, too, with classic tools like fdupes or more modern, performance-focused versions like fclones.

Many such tools have a fairly sophisticated feature set. For instance, they allow you to act based on the found duplicates by removing duplicate files, or symlinking them to the original; or they let you configure the hashing algorithm for detecting duplicates; etc.

Duplik has a smaller scope. It has a single way of deciding whether two files are duplicates, and it does not act on any of the duplicates that it finds. Instead, it outputs duplicate files grouped by their hash and size as JSON, meant to be composed with jq and any further CLI tools to act on them if that is what the user wants to do.

My programming practice

I wanted to be deliberate about how to build duplik. I didn't just want to build this specific tool but I also, perhaps mostly, wanted to learn. But learning is hard. LAs is the case for many of my colleagues, my brain too has been well and truly fried by feelings of tiredness and burnout, the incessant demand to be increasingly productive (read: fast) in my outputs at work, and the wholesale abdication of responsibility for our own cognitive work pushed by CEOs and AI grifters alike... OK, I'll stop. What I mean is: learning is hard, and I want to give myself a good chance at making progress.

I know that personally (and I think this applies more generally), I often learn best through productive struggle: running into problems and limitations, reading about approaches, making decisions, failing again, and changing course. The key is that I need to be engaged rather than passive. So I went old school. I wrote all code in vim and did not install a language server. I did not use any LLM assistance (I tend to avoid this in the first place). I used the zig compiler in watch mode, and used the mostly-excellent-sometimes-sparse Zig documentation (the language reference and stdlib docs) to guide me. I have to say that it was extremely satisfying to go from very little knowledge and very many compiler errors to a working tool. I do feel like it has really helped new concepts stick in my mind. Would recommend.

Iterations of the tool

The various iterations of duplik tracked my growing understanding of Zig and its standard library as well as the specific problem space of duplicate detection.

0. Naive approach

The first iteration of the tool was a single-threaded pass over all files in the hierarchy. For each file, I'd memory map it and then use the Blake3 hasher from the standard library to hash its contents. I kept track of seen hashes in a HashMap, and then for every file whose hash had already been seen I'd print out its path to stdout. This "worked" in that it streamed a list of duplicates, but it wasn't very useful yet. For instance, it wasn't aware of file sizes or the entire group of duplicate file paths. It also assumed that the first-encountered version of a file was the original, and classified subsequent equivalent hashes as duplicates - but we don't know whether that is what the user intends at all!

1. More composable and useful output

I wanted to improve the output of duplik. Rather than just printing a line for every duplicate file path, I wanted to make it more composable. I opted for JSON-based output that can be piped into jq in order to build a transformation pipeline:

[
  {
    "hash": "18d1af8d48726adde098078c26107358e0f3af5c609804ac5592c2ee1e7e4b23",
    "bytes": 451592,
    "paths": [
      ".zig-cache/o/91cbe7e9e527f5f6363b5565027d7d98/duplik",
      "zig-out/bin/duplik"
    ]
  },
  {
    "hash": "c09afee0c9f361fb61e5ff28a7739893de766fb470c5fa82b4e5e31de27fbad4",
    "bytes": 9,
    "paths": [
      ".zig-cache/tmp/6eF86OTTwYx7SkkC/one.test",
      ".zig-cache/tmp/rcysl3VUWmjMS1G7/two.test",
      ".zig-cache/tmp/rcysl3VUWmjMS1G7/one.test",
      ".zig-cache/tmp/9aONBAv5sGdqnu_Z/two.test",
      ".zig-cache/tmp/9aONBAv5sGdqnu_Z/one.test",
      ".zig-cache/tmp/mqCXxbRJhZ6zzHSo/two.test",
      ".zig-cache/tmp/mqCXxbRJhZ6zzHSo/one.test",
      ".zig-cache/tmp/L8nDmQLRwWZ-IVc3/two.test",
      ".zig-cache/tmp/L8nDmQLRwWZ-IVc3/one.test"
    ]
  }
]

This output allows things like filtering by size, and for users to specify a strategy for deciding which of the files should be considered the original based on their path, e.g. to make decisions about which files to keep and which to delete. For example, from the recipes in the project's readme, the following pipe recurses through the directory, selecting only duplicate files that are more than 1MB in size, and considers the shortest path to be the original, only outputting the others as duplicates.

duplik -A -D 99 ~ | jq -r '.[] | select(.bytes >= 1048576) | .paths | sort_by(length) | .[1:] | .[]'

In order to support this output, two things needed to change.

First, we needed to collect file sizes in addition to hashes. This actually led nicely into an optimisation: if two files don't share the same size, we know that they aren't duplicates of each other. This means that we could first group all files by their sizes, and that any files that don't share their size with at least one other file in the tree are guaranteed to be unique. Getting a file's size is significantly cheaper than hashing its contents, so it is worth doing a first pass just based on their sizes to filter out known unique files before we even get to the hashing stage. As an aside, I first implemented this by opening each file using std.Dir.openFile, but then realised that this adds unnecessary syscall overhead: first we have to open the file, then stat it, then close it again. The Dir struct itself has a statFile function that does not require this overhead and only takes one syscall (on non-Windows systems).

// one syscall
const stat = try entry.dir.statFile(io, entry.basename, .{});

// three syscalls
const file = try entry.dir.openFile(io, entry.basename, .{ .mode = .read_only });
const stat = try file.stat(io);
defer file.close(io);

Second, we needed to keep track of the group of file paths for each hash, using HashMaps. The tool was getting a bit more allocation-heavy as a result. But I think that's OK; given its nature as a CLI tool with a limited lifetime, I am using Zig's arena allocator and am deliberately not worrying about keeping track of memory allocations so that we can free them accordingly. Combined with the fact that all data is read-only, this has the nice property that we can let all allocations live until the end of the program so that we can reference them from multiple steps in the pipeline. Avoiding free-ing during the runtime of the program also avoids having to keep track of the ownership of allocations and therefore avoids a lot of complexity. That said, I do need to learn about more sophisticated memory management in Zig, given how big an aspect of the language this is. Oh well, another time.

2. Concurrency

Up until this point, the program was a single threaded sequential pipeline. It first iterated through all directories to stat the files for their sizes, then did a second pass where it iterated through all files that are potential duplicates (because they share their size with at least one other file) to hash them and decide if they are indeed duplicates. Especially with modern SSDs and multi-core machines, this leaves a lot of performance on the table. And, more importantly, I wanted to learn about asynchrony and concurrency in Zig, especially with the new Io approach in 0.16.

So I redesigned the approach to be more concurrency-friendly. I wanted to use a channel-based design where n workers read files and hash their contents based on jobs they pull off a queue, then write the files and their hashes back out to a results queue. A single reader pulls from that results queue and groups the results by hash to find duplicates. This parallelises the most resource-intensive work of reading files from disk and hashing their contents, while avoiding race conditions. The std.Io.Queue primitive was really nice to use for this exact use case.

As you can see from the implementation, there are a few gotchas that you need to wrap your head around. One major nuance is the difference in Zig between async and concurrent. When you dispatch to a function using io.async, you are expressing that this function can run concurrently, but not that it must run concurrently to be correct. If you want that latter guarantee, you must use io.concurrent instead. In the case of duplik, this matters a lot. We need to do the following in sequence: spin up a worker pool, spin up the worker to receive the hashed files, fill up the work queue, then close the work queue to mark completion, wait for all of the hash workers to complete and then also close the results queue to mark completion, and then wait for the combined results to come in. If we do not successfully spin up the workers and fill up the queue concurrently, but if one of those steps blocks, then we never get to process the work and we will never finish it - we end up in a deadlock. So I needed to reach for concurrent; I only found this out after duplik was hanging for minutes, ha.

// this deadlocks
var worker_group = std.Io.Group.init;
for (0..n_workers) |_| {
    worker_group.async(io, hashWorker, .{ io, root, &work_queue, &results_queue });
}

// this works
var worker_group = std.Io.Group.init;
for (0..n_workers) |_| {
    _ = try worker_group.concurrent(io, hashWorker, .{ io, root, &work_queue, &results_queue });
}

As a result of the change to a concurrent execution model, the runtimes got 5-6x faster depending on the workload, based on my fairly unscientific benchmarks. duplik is now on par with highly performant tools like fclones for this specific use case, which is very cool and a testament to the power of a language like Zig, despite my novice status. I did add a flag to override the number of workers (which defaults to the number of CPU cores available). If you have a spinning disk, for instance, you probably want to limit concurrent reads instead of maximise them.

3. ???

There are a lot of things that I am thinking of implementing, though it's not clear whether I'll get to it. In terms of performance optimisation, there is more work to be done. Instead of going from checking file sizes to hashing the full file contents, we can sample the first or last 4kb of the potential duplicates, tossing out paths of files that can be proven to be unique with this cheaper approach. We could save the full hashing for a third pass over a smaller subset of the data.

Unlike some of its peers, duplik considers hard links (i.e., multiple paths pointing at the same underlying inode) to be duplicates. This might be unexpected behaviour for users and as such I should consider changing the default or, at least, introduce a flag to control the behaviour.

Argument parsing is a bit janky and brittle. I'm looping over the arguments and making assumptions about how to parse options that are easy to break. I could depend on a package for this but I'm trying to avoid taking dependencies, so I'll have to think about this a bit more.

There is likely potential memory usage optimisation. I'm sure there are places where I can use references instead of duplicating and allocating data. This is probably not that important in terms of runtime, compared to the file I/O and CPU-bound hashing, but it'll be nice to drive down the memory footprint of the tool. It's not crazy (<100MB for many tens of thousands of files), but it's not optimal either.

TL;DR

Building stuff is fun, especially in combination with learning a new language and new concepts. I really enjoyed programming using a more low tech environment. No LLMs, no language server, just offline-accessible docs, a running compiler in the background, and a text editor. Recommend everyone to give that a try if, like me, you are trying to rekindle your love for programming given the state of the current tech industry.