In Rust, the tokio ecosystem has a fundamental crate called bytes that abstracts and helps with handling bytes (you don't say!). I've indirectly used it a billion times and I thought that I had a good mental model of how it worked.

So, in the spirit of the "decrusting" series by the excellent Jon Gjengset, I've decided to peek behind the curtains to better understand what axum, tokio, hyper and the like do to them bytes! The code is well written, but surprisingly complex. I now understand what it does, but I still don't fully grasp why it does some things in a certain way.

I'm ready to share with you my discoveries. I hope that you are sitting, lying or squatting comfortably. This is the first post in a small series. I'm legally required by my marketing department to remind you that you can subscribe to my low-traffic newsletter, so that you'll know when new posts are up!

A quick note before we start: this post is based on the current bytes version 1.11.1.

What the Bytes?

The bytes crate does many things, but its public API is quite small: it basically has two structs and two traits. Quoting from the documentation:

  • struct Bytes: a cheaply cloneable and sliceable chunk of contiguous memory
  • struct BytesMut: a unique reference to a contiguous slice of memory
  • trait Buf: read bytes from a buffer
  • trait BufMut: a trait for values that provide sequential write access to bytes

In this series, I'll focus on the Bytes struct, because yeah, it's not as simple as one may be led to believe!

A compelling example

I would like to start concrete, with an example of the problem that Bytes helps solve: zero-copy parsing. Imagine that you're parsing HTTP requests and your code receives these bytes from the network:

GET /post/2026/some-post HTTP/2
Host: sitegui.dev
Accept: text/html
Referer: https://sitegui.dev/
Connection: keep-alive

You would like to efficiently parse it into a struct like this:

struct Request {
    method: Something,
    path: Something,
    version: Something,
    /// Note: headers can repeat, so I'm using a
    /// `Vec` not a `HashMap` to represent them
    headers: Vec<(Something, Something)>,
}

This post is not about the parsing bit; instead, it's about the bytes themselves. What should we choose as Something above?

If you chose a type like Vec<u8>, your parser would need to allocate and copy the information into many separate instances of Vec<u8>. With Bytes, all these instances are cheap to produce because they share the same storage (your initial bytes are stored only once) and each instance holds a "window" into this buffer. The underlying buffer is only deallocated when all the instances sharing it are dropped.

As usual in computing, using Bytes (or any "shared" construction) vs Vec<u8> (or any "owned" construction) is a matter of trade-offs. For example, if your parser code only produces output that references a small part of the input, it's probably better to copy just that part over. Otherwise, the Bytes instances will keep the whole input in memory.

An initial mental model

As I said before, I had a useful mental model before embarking on this exploration: a Bytes instance simply holds a reference-counted shared pointer to the actual bytes, plus some information about where in the buffer those bytes are. The fundamental operations are:

trait BytesStorage {
    /// Create a new instance with some data
    /// It should "adopt" the data, not copy it
    fn new(data: Vec<u8>) -> Self;
    /// Create a new instance that references some part of this data
    /// Constant time, and it should not copy the actual data
    fn slice(&self, range: Range<usize>) -> Self;
    /// Return the bytes slice
    /// Very cheap
    fn as_ref(&self) -> &[u8];
}

We could draft a similar-in-spirit implementation using Arc<Vec<u8>>:

pub struct SharedBytes {
    range: Range<usize>,
    data: Arc<Vec<u8>>,
}

impl BytesStorage for SharedBytes {
    fn new(data: Vec<u8>) -> Self {
        Self {
            range: 0..data.len(),
            data: Arc::new(data),
        }
    }

    fn slice(&self, range: Range<usize>) -> Self {
        // Translate the "local" range to a "global" range
        let root_range = range.start + self.range.start..range.end + self.range.start;
        Self {
            range: root_range,
            data: self.data.clone(),
        }
    }

    fn as_ref(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

Note how new() does not clone the data; instead, the Vec is adopted into the SharedBytes instance. Also, slice() clones the Arc<_>, which uses reference counting and avoids cloning the actual data in the Vec.
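To see the sharing in action, here's a self-contained usage sketch (repeating the definitions above, with inherent methods instead of the trait for brevity):

```rust
use std::ops::Range;
use std::sync::Arc;

pub struct SharedBytes {
    range: Range<usize>,
    data: Arc<Vec<u8>>,
}

impl SharedBytes {
    fn new(data: Vec<u8>) -> Self {
        Self { range: 0..data.len(), data: Arc::new(data) }
    }

    fn slice(&self, range: Range<usize>) -> Self {
        // Translate the "local" range to a "global" range
        let root = range.start + self.range.start..range.end + self.range.start;
        Self { range: root, data: self.data.clone() }
    }

    fn as_ref(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

fn main() {
    let data = SharedBytes::new(b"GET /index.html HTTP/1.1".to_vec());
    let path = data.slice(4..15);
    // Slicing a slice: the local range 1..6 maps to the global range 5..10
    let sub = path.slice(1..6);

    assert_eq!(path.as_ref(), b"/index.html");
    assert_eq!(sub.as_ref(), b"index");
    // All three handles share a single buffer
    assert_eq!(Arc::strong_count(&data.data), 3);
}
```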

However, as_ref() is not great for two reasons: Rust will execute bounds checks at every access to ensure that the range is valid, and the actual buffer sits behind two pointer indirections: Arc (in SharedBytes) -> Vec -> buffer.

The actual Bytes implementation avoids both problems by storing the slice components (pointer to the first byte and length) directly as two fields in the struct, trading some extra struct size and a small performance hit in the slice() operation for a simpler as_slice():

pub struct Bytes {
    ptr: *const u8,
    len: usize,
    // ... more fields ...
}

impl Bytes {
    #[inline]
    fn as_slice(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}

Let's test our mental model

In Rust, you can set a different global memory allocator like this:

use std::alloc::{GlobalAlloc, Layout, System};

/// A global allocator that can track all allocations made
struct TrackingAllocator;

// Ask Rust to use our allocator as the global allocator
#[global_allocator]
static ALLOCATOR: TrackingAllocator = TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Do stuff
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Do stuff
        unsafe { System.dealloc(ptr, layout) };
    }
}

There is a Rust working group working towards finer-grained control over allocators, which would allow overriding the allocator for only some parts of the code. But for this reduced example, it's okay to capture all allocations.
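As a sketch, the "Do stuff" placeholders can be as simple as bumping an atomic counter (atomics are safe to use inside the allocator because they don't themselves allocate; my full tracking code also records where the allocation happened):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// A minimal tracking allocator: counts the total bytes ever allocated
struct TrackingAllocator;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

#[global_allocator]
static ALLOCATOR: TrackingAllocator = TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
    }
}

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let data: Vec<u8> = Vec::with_capacity(100);
    let after = ALLOCATED.load(Ordering::Relaxed);
    // At least the 100 bytes of capacity were requested
    assert!(after - before >= 100);
    drop(data);
}
```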

I used this mechanism to track exactly which allocations and deallocations the code was doing for Vec<u8>, our own SharedBytes and bytes::Bytes and where in the source code they happened. You can check the full code here, if you are curious or want to reproduce my work.

To simulate the HTTP parser, I'm using this simple code:

fn parse_request<B: BytesStorage>(data: Vec<u8>) -> Request<B> {
    let data = B::new(data);
    Request {
        method: data.slice(0..3),
        path: data.slice(4..24),
        version: data.slice(25..31),
        headers: vec![
            (data.slice(33..37), data.slice(39..50)),
            (data.slice(52..58), data.slice(60..69)),
            (data.slice(71..78), data.slice(80..100)),
            (data.slice(102..112), data.slice(114..124)),
        ],
    }
}

The following table shows what each line of the code above allocates, in bytes, for Vec<u8>, SharedBytes and bytes::Bytes.

But before looking at the results, take a moment to run this code against your mental model and try to predict how many allocations will occur for each implementation, and where.

Code source                    Vec<u8>   SharedBytes   bytes::Bytes
data = B::new(data)                              40
method: data.slice(0..3)             3                           24
path: data.slice(4..24)             20
version: data.slice(25..31)          6
vec![...]                          192           192            256
8x data.slice(...)                  77
Total                              298           232            280

As I expected, Vec<u8> allocates a new independent buffer in every call to .slice(), with just enough capacity for the data piece, and then copies it over. SharedBytes allocates the Arc<Vec<u8>> once at the start, and later the Vec<(B, B)> with 8 × 24 bytes.

I want to explore in a separate blog post the whys and hows of Arc<Vec<u8>> taking 40 bytes, and also compare it against Arc<[u8]>. Stay tuned!

The behavior of Bytes was interesting to me for two reasons:

  • it somehow defers the first allocation: B::new() doesn't allocate (in this case), but the first call to .slice() does. Spoiler alert: it's pretty neat but complex. I'll explore it in detail later on.
  • it's big! 32 bytes per instance, so Vec<(B, B)> takes 8 × 32 bytes

No Arc in sight

Armed with the initial mental model above, it's no surprise that the documentation and comments in the file implementing Bytes mention Arc 6 times. What is a bit eyebrow-raising is that the code itself does not use Arc at all!

Maybe it's related to the deferred allocation we saw before? Maybe it's related to the 32 bytes needed for each instance? Well, kind of...

I'll use memory diagrams like this one below to represent what the structs (in grey) are made of. Simple numerical fields are in yellow and pointers are in green. In parentheses, I've put the memory size in bytes, assuming a 64-bit system.

For example, in Rust you can represent a growable owned sequence with Vec<T> and a fixed owned sequence with Box<[T]>. The physical difference is that a Vec may have allocated more space than it is currently using, to amortize buffer growth on inserts, so it uses an additional capacity field. A Box uses exactly the buffer's size, so length and capacity are the same. Converting from a Vec that is "full" (that is, whose capacity equals its length) is cheap because the underlying buffer is reused:

Converting a Vec to Box
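We can verify this buffer reuse empirically by comparing pointers; this sketch relies on into_boxed_slice's documented behavior of only reallocating when there is excess capacity:

```rust
fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(5);
    v.extend_from_slice(b"hello");
    assert_eq!(v.len(), v.capacity()); // a "full" Vec
    let ptr_before = v.as_ptr();

    // No excess capacity, so the conversion reuses the buffer as-is
    let boxed: Box<[u8]> = v.into_boxed_slice();
    assert_eq!(boxed.as_ptr(), ptr_before);
}
```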

Here's a Box. Give Me Bytes

The implementation of Bytes::from(Vec<u8>) will first check if it's a "full" Vec, convert it into a Box<[u8]> and build a Bytes instance like the one below. When the Vec<u8> is not "full", a different logic is used, but let's put that aside for now.

Converting a Box to Bytes

That's a lot to unpack. I'll explain what I've understood of the design and the trade-offs of this representation:

First, note that the buffer is reused, as expected.

The start and length fields simply represent the slice information. They seem a bit redundant with the data field, which also points to the buffer. However, this distinction will be useful when we slice this data: the start of our "window" will be different from the buffer's start, and we need to keep the buffer's start around so that we can deallocate it when the moment comes.

The vtable field implements the "dynamic dispatch" pattern. Rust has "trait objects" to implement this pattern with &dyn SomeTrait or Box<dyn SomeTrait>, but I guess the designers of the crate didn't use them because this pattern has restrictions that are a deal-breaker for this use case. If you know more about why, please tell me!

Note that vtable points to a statically declared instance of VTable called "promotable". Aha! This name is related to how Bytes defers allocation, as we observed earlier.

The data field is funny-looking (no judgement!): it's an AtomicPtr<_>, which, unlike a normal pointer, uses atomic operations for reads and writes. But why do we need atomic operations here? Spoiler alert: it's the deferred allocation again. Somehow (we'll see next), this pointer will point to something else in the future.

Continuing on the data field: it's not just a pointer. It uses the "tagged pointer" pattern to store one extra bit of information in it. I think you can already guess: it's the deferred allocation again. This trick avoids a separate field, but it requires the code to ensure and check that the stored pointer is always even, so that the least significant bit carries no useful information and can be co-opted as a tag.
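The trick itself is simple. Here's a minimal sketch (the names are mine, not the crate's):

```rust
const TAG_MASK: usize = 0b1;

/// Pack one extra bit into the least significant bit of an even pointer
fn tag(ptr: *const u8, flag: bool) -> *const u8 {
    debug_assert_eq!(ptr as usize & TAG_MASK, 0, "pointer must be even");
    (ptr as usize | flag as usize) as *const u8
}

/// Recover the original pointer and the extra bit
fn untag(tagged: *const u8) -> (*const u8, bool) {
    let addr = tagged as usize;
    ((addr & !TAG_MASK) as *const u8, addr & TAG_MASK == 1)
}

fn main() {
    // u16 has alignment 2, so its address is always even
    let value: u16 = 42;
    let ptr = &value as *const u16 as *const u8;

    let (restored, flag) = untag(tag(ptr, true));
    assert_eq!(restored, ptr);
    assert!(flag);
}
```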

The other side of the river

Let's spring straight into action: what happens when we slice our newly-created Bytes? I've tried my best to convey everything that happens in the diagram below without making it too busy. Take some time to navigate it first.

Slicing a promotable Bytes

First, note that calling .slice(&self) the first time actually modifies the original Bytes: the previous state of the instance is drawn with a dashed contour, and the new state uses red on the changed field: data. Naturally, only this field can be modified, because .slice(&self) takes a shared reference, so only interior mutability is possible, through the AtomicPtr<_>.

As we observed earlier, a new piece of data is allocated: an instance of the Shared struct, with 24 bytes. It carries the information about the original buffer needed to deallocate it (pointer and capacity). I was surprised to learn that in Rust the original size of the buffer is actually required to deallocate it; I was used to C's free(ptr), which clearly doesn't need it.

The Shared struct also carries an atomic counter with the number of Bytes instances referencing this buffer. The implementation doesn't use Rust's native Arc. Instead, the designers preferred to re-implement it. I could not find a definitive answer as to why, but my guess is that they wanted to avoid the overhead in Arc related to "weak references", which are not necessary for Bytes' use case. To implement them, Arc uses an additional atomic counter and executes some extra atomic instructions when cloning and dropping.
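A strong-only reference count can be sketched like this (my own simplified version, not the crate's actual code; the memory orderings follow the classic pattern also used by Arc):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Shared {
    buf: Vec<u8>,
    ref_count: AtomicUsize,
}

fn clone_shared(ptr: *mut Shared) -> *mut Shared {
    // Relaxed is enough: the caller already holds a reference,
    // so the count cannot reach zero concurrently
    unsafe { (*ptr).ref_count.fetch_add(1, Ordering::Relaxed) };
    ptr
}

fn release_shared(ptr: *mut Shared) {
    // Release publishes our writes to whichever thread frees last;
    // the Acquire fence synchronizes with all previous decrements
    // before the memory is actually freed
    unsafe {
        if (*ptr).ref_count.fetch_sub(1, Ordering::Release) == 1 {
            fence(Ordering::Acquire);
            drop(Box::from_raw(ptr));
        }
    }
}

fn main() {
    let shared = Box::into_raw(Box::new(Shared {
        buf: b"hello".to_vec(),
        ref_count: AtomicUsize::new(1),
    }));
    let other = clone_shared(shared);
    release_shared(shared);
    // The buffer is still alive: `other` keeps it around
    assert_eq!(unsafe { (*other).buf.as_slice() }, b"hello");
    release_shared(other); // last reference: Shared and its Vec are freed
}
```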

Finally, note how the vtable pointer does not change: it's still the same "promotable" method list. The code uses the tag in the data pointer to distinguish not-yet-promoted data (tag = 1) from promoted data (tag = 0).

The deferred allocation feature is a trade-off: the code is more complex because we need an AtomicPtr and a tag to implement it. However, for code that creates a Bytes without actually slicing and sharing it, it avoids the Shared allocation entirely.

It's a lie

Let's step back a bit: the behavior above only happens when we create a Bytes from a Box<[u8]> or a "full" Vec<u8>. When we start with a non-full Vec<u8>, the code does not use the deferred mechanism and instead allocates a Shared struct right away. In this case, there's no tagging of the data field, and a dedicated vtable that doesn't check for the tag is used:

Converting a Vec to Bytes

This is very similar to how the simpler SharedBytes works, except that Bytes has 8 more bytes for the vtable and Shared has 8 fewer bytes than Arc<Vec<u8>> due to the lack of weak references.

Bytes are versatile

The dynamic dispatch implemented by Bytes with the vtable and data fields effectively allows it to use a custom implementation for each different backing storage of the data.

We saw what happens to Box and Vec. Here's what happens when you create a Bytes from an existing statically allocated slice of bytes (in Rust represented as &'static [u8]):

Converting static slice to Bytes

Note that there is no reference counting, because in this mode Bytes does not own the data: instead, it borrows data that outlives it, as indicated by the 'static lifetime.
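In our simplified mental model, this versatility could be sketched with an enum (hypothetical names, not the crate's code); Bytes prefers the vtable, which lets new storage kinds be added without changing the struct layout:

```rust
use std::sync::Arc;

// Two storage kinds behind a single type
enum Storage {
    Static(&'static [u8]),
    Shared(Arc<Vec<u8>>),
}

struct SimpleBytes {
    start: usize,
    len: usize,
    storage: Storage,
}

impl SimpleBytes {
    fn from_static(data: &'static [u8]) -> Self {
        // No allocation and no reference counting needed:
        // the data outlives every SimpleBytes instance
        Self { start: 0, len: data.len(), storage: Storage::Static(data) }
    }

    fn as_slice(&self) -> &[u8] {
        let buf: &[u8] = match &self.storage {
            Storage::Static(data) => data,
            Storage::Shared(data) => data.as_slice(),
        };
        &buf[self.start..self.start + self.len]
    }
}

fn main() {
    let static_bytes = SimpleBytes::from_static(b"hello, world");
    assert_eq!(static_bytes.as_slice(), b"hello, world");

    let shared_bytes = SimpleBytes {
        start: 7,
        len: 5,
        storage: Storage::Shared(Arc::new(b"hello, world".to_vec())),
    };
    assert_eq!(shared_bytes.as_slice(), b"world");
}
```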

Another interesting usage of the dynamic dispatch is to create Bytes from anything else that somehow owns a buffer. For example, from a memmap2::Mmap instance that represents a memory-mapped region:

Converting owned struct to Bytes

Note that the owned struct is copied over into Owned which, just like Shared, acts like an Arc<_> without the weak-reference feature. The major distinction between Owned and Shared is that Bytes does not know how to take ownership of the buffer: it only requires the given type to implement AsRef<[u8]> to produce a borrowed slice of bytes.

Micro benchmarks

Never trust a benchmark you see online. So here's one more for you to ignore:

fn test<B: BytesStorage>() {
    let data = black_box(example_request_package());
    let request = parse_request::<B>(data);
    assert_example_request(&request);
}
                      Vec<u8>   SharedBytes   bytes::Bytes
Allocated (bytes)         298           232            280
Execution time (ns)       115           115            150

So Bytes seems both chunkier and slower than my half-baked SharedBytes on this synthetic benchmark 🫣. But of course, my implementation does not offer the same versatility.

I searched for public benchmarks that justified some design decisions on the bytes crate, but could not find them. If you know any, please tell me!