I'm coming back home from Rust Week 2026 in Utrecht 🦀. It was two days of interesting and
thought-provoking talks, followed by one day of coding together on Rust-related themes.
The venue was intelligently chosen: a cinema! No meetup can beat these comfortable human holders:
Talking to the sponsors, I learned that they use Rust for all sorts of projects, like developing microchips (Espressif),
self-hosted clouds (0xide), a platform for EV chargers in Holland (TandemDrive), data analysis (Polars), GPUs
(Vectorware), networking infrastructure (NLNetLabs), and an editor (Zed).
I felt shy around the big crowd (I'm working on it...), but I was happy with myself because I managed to chat a
little with people at different points in their Rust journey.
Today I'll share some points that I've learned while playing with Spark, Parquet and the Delta format. Even if you don't
use these technologies, I hope you can spot some neat ideas to reuse somewhere else later.
I like to picture in my head that the most important architectural distinction between a data lake table and a typical
database (like Postgres) is that compute and storage are handled very separately: the table's data is stored in one
distributed system (typically a cloud object storage), while another distributed system (or even multiple ones) reads
and writes the files that compose the table.
There are competing formats to represent these tables, with distinct trade-offs of course: Delta, Iceberg, Hudi. But
from my research, I don't think there is anything fundamentally different between them. This post
will focus on Delta, but most of it should be easily transposable to the others.
I like to understand technical solutions by framing the fundamental problems that they aim to solve best, so I'll
present it like that. Just remember that, although I have read the specification and have used Spark with Delta
tables for years, I didn't invent any of this: I'm just an outside observer who
can be wrong. If you spot a misconception, please tell me in the comments!
This is my last week at my current job, and I was reflecting on some of the things that I've learned in the past 7
years there. There's a lot of course! As a staff data engineer, I worked with many teams and codebases and learned a
couple of tricks at scale. So I'm starting a new blog series "performance advice nugget" (or PAN for short), in which
I'll share some insights into what worked quite well in practice.
So welcome to PAN 01: use flat representations.
I'll try to make these posts quite short. As usual with everything related to "performance", you should
always measure and benchmark with your real workload, and always balance whether additional complexity is worth the
performance gains.
Nice, the foreword is said and out of the way. Let's focus on the matter: imagine that you are handling data that has
multiple levels, for example, a paragraph, that is made of sentences, each made of words, each made of characters:
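As a rough sketch of that shape (not the original code from this post), here is a nested layout next to one possible flat layout. Every name here (`FlatParagraph`, `word_ends`, `sentence_ends`, `parse`) is hypothetical and chosen only for illustration. The flat version assumes that all characters live in a single buffer and that offset tables mark where each word and each sentence ends.

```rust
// Nested layout: each word and each sentence owns its own heap allocation.
type Word = Vec<char>;
type Sentence = Vec<Word>;
type Paragraph = Vec<Sentence>;

/// Hypothetical flat layout: one buffer of characters plus two offset tables.
#[derive(Debug, Default, PartialEq)]
struct FlatParagraph {
    /// All characters of all words, back to back.
    chars: Vec<char>,
    /// `word_ends[i]` is the exclusive end of word `i` inside `chars`.
    word_ends: Vec<usize>,
    /// `sentence_ends[j]` is the exclusive end of sentence `j` inside `word_ends`.
    sentence_ends: Vec<usize>,
}

impl FlatParagraph {
    /// Converts the nested layout into the flat one.
    fn from_nested(paragraph: &Paragraph) -> Self {
        let mut flat = FlatParagraph::default();
        for sentence in paragraph {
            for word in sentence {
                flat.chars.extend_from_slice(word);
                flat.word_ends.push(flat.chars.len());
            }
            flat.sentence_ends.push(flat.word_ends.len());
        }
        flat
    }

    /// Returns the characters of the word at `index`, counted across the whole paragraph.
    fn word(&self, index: usize) -> &[char] {
        let start = if index == 0 { 0 } else { self.word_ends[index - 1] };
        &self.chars[start..self.word_ends[index]]
    }

    fn word_count(&self) -> usize {
        self.word_ends.len()
    }
}

/// Naive parser, for illustration only: sentences end with '.', words are split on whitespace.
fn parse(text: &str) -> Paragraph {
    text.split('.')
        .map(|s| s.split_whitespace().map(|w| w.chars().collect()).collect::<Sentence>())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    let nested = parse("Rust is fun. Flat data is nice.");
    let flat = FlatParagraph::from_nested(&nested);
    println!("sentences: {}", flat.sentence_ends.len());
    println!("words: {}", flat.word_count());
    println!("fourth word: {}", flat.word(3).iter().collect::<String>());
}
```

In this sketch the nested layout needs one allocation per word and per sentence, while the flat one needs three vectors in total, whatever the size of the paragraph. Whether that matters for your workload is something to measure, not assume.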
In Rust, the tokio ecosystem has a fundamental crate called bytes that abstracts
and helps deal with bytes (you don't say!). I've indirectly used it a billion times and I thought that I had a good
mental model of how it worked.
So, in the spirit of the "decrusting" series
by the excellent Jon Gjengset, I've decided to peek behind the curtains to understand more what axum, tokio, hyper and
the like do to them bytes! The code is well written, but surprisingly complex. I understand now what it does, but I
still don't fully grasp why it does some things in a certain way.
I'm ready to share my discoveries with you. I hope that you are sitting, lying or squatting comfortably. This is the
first post in a small series. I'm legally required by my marketing department to remind you that you can subscribe to
my low-traffic newsletter, so that you'll know when new posts are up!
A quick note before we start: this post is based on the current bytes version 1.11.1.
This blog is written in Rust, and I wanted a way to reload the web pages automatically while I change the posts'
contents, styles, etc. This is commonplace with JavaScript frameworks, but not automatic in Rust land. So I've
embarked on a side quest to achieve just that: the "type and auto-reload" experience. In the end, I was surprised to
learn a bit more about sockets and processes in Linux.
This post is a note to myself about these nuggets that I've learned and to share the solution. It may be helpful for
future me and, I hope, for someone else out there.