Systems Deep Dive · Serialization & Deserialization

How We Came to SerDes

From raw bytes and endianness to Thrift's binary protocol and Frozen/2 — a complete walkthrough.

What is Serialization & Deserialization?

At its core, serialization is the process of transforming an in-memory data structure into a flat sequence of bytes that can be stored or transmitted. Deserialization is the reverse — rebuilding the original structure from those bytes.

Consider a simple Date struct with year, month, and day. If we pack 2026-03-15 into a 32-bit word using bit-fields — 23 bits for year, 4 bits for month, 5 bits for day — and emit the result in big-endian byte order, we get exactly four bytes:

serialize_date.cpp
// Pack date into 32 bits: [31..9]=year [8..5]=month [4..0]=day
uint32_t packed = ((static_cast<uint32_t>(year) & 0x7FFFFF) << 9)
                | ((month & 0x0F) << 5)
                | (day & 0x1F);

// Extract bytes in big-endian order (MSB first)
uint8_t b[4];
b[0] = (packed >> 24) & 0xFF;  // bits [31..24]
b[1] = (packed >> 16) & 0xFF;  // bits [23..16]
b[2] = (packed >> 8)  & 0xFF;  // bits [15..8]
b[3] = (packed >> 0)  & 0xFF;  // bits [7..0]

Result for 2026 / 03 / 15:

packed = 0x000FD46F

b[0] = 0x00  (MSB)
b[1] = 0x0F
b[2] = 0xD4
b[3] = 0x6F  (LSB)
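
The reverse direction follows the same schema. The sketch below pairs the packing step with its inverse; the `Date` struct and the function names `serialize_date` and `deserialize_date` are illustrative assumptions, not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical Date type for illustration; field widths follow the text.
struct Date {
    uint32_t year;   // 23 bits on the wire
    uint32_t month;  // 4 bits
    uint32_t day;    // 5 bits
};

// Serialize: pack into [31..9]=year [8..5]=month [4..0]=day, emit MSB first.
std::array<uint8_t, 4> serialize_date(const Date& d) {
    assert(d.year <= 0x7FFFFF && d.month <= 0x0F && d.day <= 0x1F);
    const uint32_t packed = (d.year << 9) | (d.month << 5) | d.day;
    return {static_cast<uint8_t>(packed >> 24),
            static_cast<uint8_t>(packed >> 16),
            static_cast<uint8_t>(packed >> 8),
            static_cast<uint8_t>(packed)};
}

// Deserialize: reassemble the big-endian word, then mask out each bit-field.
Date deserialize_date(const std::array<uint8_t, 4>& b) {
    const uint32_t packed = (static_cast<uint32_t>(b[0]) << 24) |
                            (static_cast<uint32_t>(b[1]) << 16) |
                            (static_cast<uint32_t>(b[2]) << 8) |
                            static_cast<uint32_t>(b[3]);
    return Date{(packed >> 9) & 0x7FFFFF, (packed >> 5) & 0x0F, packed & 0x1F};
}
```

Both functions work on the value through shifts rather than on raw memory, so they behave the same on little-endian and big-endian hosts.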

That's it. Serialization is just bytes. The hard part is agreeing on which order those bytes live in memory — and that's where endianness enters the picture.

Endianness

📖 The term was coined by Danny Cohen, inspired by Jonathan Swift's Gulliver's Travels, where the Lilliputians and Blefuscudians fought wars over which end of a boiled egg to crack: the big end or the little end. Cohen applied the metaphor to the two camps in computer byte ordering: big-endian (most significant byte first) and little-endian (least significant byte first).

Think of it like how we write numbers in natural language. 8621 is 8×10³ + 6×10² + 2×10¹ + 1×10⁰ — big end first. If we reversed that ordering, we'd write 1268 for the same value. Different convention, identical information, total incompatibility.

For a 32-bit value 0x12345678 stored at address 0x1000:

uint32_t x = 0x12345678

arch            0x1000       0x1001   0x1002   0x1003
big-endian      0x12 (MSB)   0x34     0x56     0x78 (LSB)
little-endian   0x78 (LSB)   0x56     0x34     0x12 (MSB)

Each cell = 1 byte · bytes are identical, only their order changes · x86/ARM = little-endian · network/SPARC/PowerPC = big-endian

endian_demo.cpp
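#include <cstdio>

// Interpret two bytes with the most significant byte first.
unsigned short readBigEndian(const unsigned char* data) {
    return static_cast<unsigned short>((data[0] << 8) | data[1]);
}

// Interpret the same two bytes with the least significant byte first.
unsigned short readLittleEndian(const unsigned char* data) {
    return static_cast<unsigned short>((data[1] << 8) | data[0]);
}

int main() {
    const unsigned char data[] = {0x01, 0x10};
    // big-endian:    0x0110 = 272
    std::printf("big-endian:    %u\n", static_cast<unsigned>(readBigEndian(data)));
    // little-endian: 0x1001 = 4097
    std::printf("little-endian: %u\n", static_cast<unsigned>(readLittleEndian(data)));
}
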
Key takeaway: most host machines (x86, ARM) are little-endian, but TCP/IP mandates big-endian ("network byte order"). Serialization formats must pick a side and document it.
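
One way to convert explicitly at the boundary is to write each byte with shifts, which gives the same wire bytes on any host. The sketch below is a minimal illustration; `host_is_little_endian`, `store_be32`, and `load_be32` are hypothetical helper names. On POSIX systems, `htonl` and `ntohl` serve the same purpose for 32-bit values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns true when the least significant byte sits at the lowest address.
bool host_is_little_endian() {
    const uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);  // inspect the byte at the lowest address
    return first == 1;
}

// Store v MSB-first (network order). Shifts operate on the value, not on
// memory, so the output does not depend on host byte order.
void store_be32(uint32_t v, unsigned char* out) {
    assert(out != nullptr);
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>(v >> 16);
    out[2] = static_cast<unsigned char>(v >> 8);
    out[3] = static_cast<unsigned char>(v);
}

// Load a 32-bit value that was written MSB-first.
uint32_t load_be32(const unsigned char* in) {
    assert(in != nullptr);
    return (static_cast<uint32_t>(in[0]) << 24) |
           (static_cast<uint32_t>(in[1]) << 16) |
           (static_cast<uint32_t>(in[2]) << 8) |
           static_cast<uint32_t>(in[3]);
}
```

Applied to the earlier example, `store_be32(0x12345678, buf)` fills `buf` with `12 34 56 78` whether or not `host_is_little_endian()` returns true.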

SerDes in One Slide

📦 Serialization: Write in-memory structs into a flat bytestream. The order of bytes must be agreed upon by both sides.
📤 Deserialization: Reconstruct in-memory structs from a bytestream by reading fields in the exact order they were written.
🖥️ Host order: Most host CPUs (x86 / ARM) use little-endian. LSB lives at the lowest address.
🌐 Network order: TCP/IP mandates big-endian. MSB first. Network protocols have used big-endian since the 1970s.

Serialization in Thrift's Binary Protocol

The hand-rolled bit-packing above works, but production systems need something more principled. Apache Thrift's TBinaryProtocol uses a self-describing format: every field carries its own type tag and field ID before its value, so the decoder never has to guess.

The wire layout for a single field looks like this:

part             size      notes
type code        1 byte
field id         2 bytes   big-endian i16
value encoding   N bytes   type-dependent
···
STOP             1 byte    0x00

The type codes are fixed constants (e.g. 0x08 = I32, 0x02 = BOOL, 0x0B = STRING, 0x00 = STOP). Multi-byte integers are always big-endian on the wire regardless of host byte order.

Serializing a Thrift Struct, Byte by Byte

Take this Thrift definition and instance:

user.thrift
struct User {
  1: i32    id,      // = 42
  2: bool   active,  // = true
  3: string name,    // = "Bob"
}

Here is the complete field-by-field serialization:

Field 1 · 1: i32 id = 42
  type (I32)                  08
  field id = 1                00 01
  value 42 (big-endian i32)   00 00 00 2A
  wire bytes →                08 00 01 00 00 00 2A

Field 2 · 2: bool active = true
  type (BOOL)                 02
  field id = 2                00 02
  value true                  01
  wire bytes →                02 00 02 01

Field 3 · 3: string name = "Bob"
  type (STRING)               0B
  field id = 3                00 03
  length = 3 (i32)            00 00 00 03
  "Bob" ASCII                 42 6F 62
  wire bytes →                0B 00 03 00 00 00 03 42 6F 62

Concatenating all three fields plus the terminal STOP byte (0x00):

08 00 01 00 00 00 2A             ← field 1 (i32=42)
02 00 02 01                      ← field 2 (bool=true)
0B 00 03 00 00 00 03 42 6F 62    ← field 3 (string="Bob")
00                               ← STOP

22 bytes total. Every byte is accounted for by the schema.
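
The walkthrough above can be reproduced with a small hand-rolled encoder. This is a sketch of the byte layout only and does not use the Thrift library; `TypeCode`, `put_i16`, `put_i32`, `put_field_header`, and `encode_user` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Type codes quoted in the text; the enumerator names are illustrative.
enum TypeCode : uint8_t { kStop = 0x00, kBool = 0x02, kI32 = 0x08, kString = 0x0B };

using Bytes = std::vector<uint8_t>;

// Append a 16-bit integer in big-endian order.
void put_i16(Bytes& out, int16_t v) {
    const uint16_t u = static_cast<uint16_t>(v);
    out.push_back(static_cast<uint8_t>(u >> 8));
    out.push_back(static_cast<uint8_t>(u));
}

// Append a 32-bit integer in big-endian order.
void put_i32(Bytes& out, int32_t v) {
    const uint32_t u = static_cast<uint32_t>(v);
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<uint8_t>(u >> shift));
}

// Field header: 1-byte type code followed by a 2-byte field id.
void put_field_header(Bytes& out, TypeCode type, int16_t id) {
    out.push_back(type);
    put_i16(out, id);
}

struct User {
    int32_t id;
    bool active;
    std::string name;
};

Bytes encode_user(const User& u) {
    assert(u.name.size() <= 0x7FFFFFFFu);  // length must fit in an i32
    Bytes out;
    put_field_header(out, kI32, 1);
    put_i32(out, u.id);
    put_field_header(out, kBool, 2);
    out.push_back(u.active ? 1 : 0);
    put_field_header(out, kString, 3);
    put_i32(out, static_cast<int32_t>(u.name.size()));  // length prefix
    out.insert(out.end(), u.name.begin(), u.name.end());
    out.push_back(kStop);
    return out;
}
```

Calling `encode_user(User{42, true, "Bob"})` yields the same 22 bytes listed above. The writer never branches on input data: it emits header, value, header, value in a fixed order.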

Why Deserialization Is More Expensive Than Serialization

The fundamental constraint: you don't know what to read next until you've finished reading what came before. Deserialization is inherently sequential and cannot be easily parallelized at the field level with TBinaryProtocol.

During serialization, the writer knows the full in-memory layout of every field upfront. It can write type code → field ID → value in a tight loop, with no lookups required.

During deserialization, the reader must parse each field header (type + field ID) before it knows how many bytes to consume for the value. For variable-length types like STRING, it must first read the 4-byte length prefix, then read that many payload bytes. Each field depends on parsing the previous one — creating a strict sequential dependency chain.

This is the fundamental motivation for formats like Frozen/2.
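
The dependency chain is easy to see in a hand-rolled reader. The sketch below handles only the three types used in the `User` example and is not the Thrift library's decoder; `Reader`, `skip_value`, and `decode_user` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct User {
    int32_t id = 0;
    bool active = false;
    std::string name;
};

// Forward-only cursor: each read advances pos_, so the start of field N+1 is
// unknown until field N has been consumed.
class Reader {
public:
    explicit Reader(const std::vector<uint8_t>& buf) : buf_(buf) {}

    uint8_t u8() {
        if (pos_ >= buf_.size()) throw std::runtime_error("truncated input");
        return buf_[pos_++];
    }
    int16_t i16() {
        const uint16_t hi = u8();
        const uint16_t lo = u8();
        return static_cast<int16_t>((hi << 8) | lo);
    }
    int32_t i32() {
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v = (v << 8) | u8();
        return static_cast<int32_t>(v);
    }
    std::string str() {
        const int32_t len = i32();  // the length must be read before the payload
        if (len < 0 || static_cast<std::size_t>(len) > buf_.size() - pos_)
            throw std::runtime_error("bad string length");
        const char* start = reinterpret_cast<const char*>(buf_.data()) + pos_;
        pos_ += static_cast<std::size_t>(len);
        assert(pos_ <= buf_.size());
        return std::string(start, static_cast<std::size_t>(len));
    }

private:
    const std::vector<uint8_t>& buf_;
    std::size_t pos_ = 0;
};

// Even an unwanted field has to be parsed far enough to learn its size.
void skip_value(Reader& r, uint8_t type) {
    switch (type) {
        case 0x02: r.u8(); break;   // BOOL
        case 0x08: r.i32(); break;  // I32
        case 0x0B: r.str(); break;  // STRING
        default: throw std::runtime_error("type not handled in this sketch");
    }
}

User decode_user(const std::vector<uint8_t>& bytes) {
    Reader r(bytes);
    User u;
    for (;;) {
        const uint8_t type = r.u8();
        if (type == 0x00) break;  // STOP
        const int16_t id = r.i16();
        if (id == 1 && type == 0x08) u.id = r.i32();
        else if (id == 2 && type == 0x02) u.active = (r.u8() != 0);
        else if (id == 3 && type == 0x0B) u.name = r.str();
        else skip_value(r, type);  // unknown field: still costs a parse
    }
    return u;
}
```

Reaching `name` requires walking past `id` and `active` first, and skipping an unknown string still means reading its length prefix.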

Frozen/2: Zero-Copy Random Field Access

Frozen/2 is Meta's extension to Thrift that solves the sequential parse problem by pre-computing a compact layout descriptor at freeze time. Instead of scanning the bytestream from the beginning for every field access, Frozen/2 uses three components:

📐 Layout<T>: A compile-time-derived descriptor capturing each field's bit offset, bit width, and mask within the frozen buffer.
📍 Field Position: Stores the byte offset of each field so the reader can seek directly to any field without sequential parsing.
🧊 Frozen Buffer: A contiguous, packed representation of the struct's data that can be memory-mapped and read zero-copy.

For example, a Person{ id=5, age=32 } struct frozen into 2 bytes works as follows:

Layout<Person> descriptors

field   offset   width    mask   value
id      0        3 bits   0x07   5 = 0b101 → fits in 3 bits
age     1        6 bits   0x3F   32 = 0b100000 → fits in 6 bits

Frozen buffer (2 bytes)

address   bytes      contents
base+0    0x05       id
base+1    0x20       age
(after)   0x00 × 8   padding

view.id()  → *(base+0) & 0x07 = 5
view.age() → *(base+1) & 0x3F = 32

Field access is now a single pointer dereference plus a bitwise AND — O(1) regardless of struct size, with no sequential parsing required.
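
The access pattern in the Person example can be sketched in a few lines. This is a simplified illustration of the idea, not the Frozen/2 API: `FieldLayout`, `PersonLayout`, and `PersonView` are hypothetical types, and the offsets are whole bytes as in the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-field descriptor; real Frozen/2 layouts carry more detail.
struct FieldLayout {
    std::size_t byte_offset;  // where the field starts, relative to the base
    uint8_t mask;             // keeps only the field's bits
};

// Descriptors matching the Person example: id in 3 bits, age in 6 bits.
struct PersonLayout {
    FieldLayout id{0, 0x07};
    FieldLayout age{1, 0x3F};
};

// A view reads fields straight out of the frozen buffer without copying it
// or building an intermediate Person object.
class PersonView {
public:
    PersonView(const uint8_t* base, const PersonLayout& layout)
        : base_(base), layout_(layout) {
        assert(base != nullptr);
    }

    uint32_t id() const { return read(layout_.id); }
    uint32_t age() const { return read(layout_.age); }

private:
    // One load plus one AND, independent of how many fields the struct has.
    uint32_t read(const FieldLayout& f) const {
        return static_cast<uint32_t>(base_[f.byte_offset] & f.mask);
    }

    const uint8_t* base_;
    PersonLayout layout_;
};
```

With a buffer holding `0x05 0x20` followed by zero padding, `PersonView(buffer, PersonLayout{}).age()` reads `age` without touching `id`.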

What About cpp.lazy?

Another optimization in Thrift's C++ runtime is cpp.lazy deserialization. With cpp.lazy, a field marked lazy is not deserialized when the parent struct is read from the wire. Instead, the raw bytes for that field are retained in a buffer, and the field is only parsed on first access.

This is a direct answer to the sequential parsing cost: if you have a large struct but only ever access two or three fields, cpp.lazy lets you pay the deserialization cost only for the fields you actually touch — deferring or entirely skipping the rest.

The serialized wire format for a cpp.lazy struct is identical to a normal Thrift binary-encoded struct. The optimization lives entirely in the reader path.
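
The deferral idea can be illustrated with a holder that keeps a field's raw bytes and decodes them on first access. This is a conceptual sketch only; `LazyString` is a hypothetical class and does not reflect how Thrift's C++ runtime implements cpp.lazy internally.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical holder for one lazily decoded string field.
class LazyString {
public:
    // raw holds the field's value encoding: 4-byte big-endian length + payload.
    explicit LazyString(std::vector<uint8_t> raw) : raw_(std::move(raw)) {}

    // Decode on first access, then serve the cached value.
    const std::string& get() {
        if (!parsed_) {
            assert(raw_.size() >= 4);
            uint32_t len = 0;
            for (int i = 0; i < 4; ++i) len = (len << 8) | raw_[i];
            assert(raw_.size() - 4 >= len);
            value_.assign(raw_.begin() + 4, raw_.begin() + 4 + len);
            raw_.clear();  // the raw bytes are no longer needed
            parsed_ = true;
            ++parse_count_;
        }
        return value_;
    }

    bool parsed() const { return parsed_; }
    int parse_count() const { return parse_count_; }

private:
    std::vector<uint8_t> raw_;
    std::string value_;
    bool parsed_ = false;
    int parse_count_ = 0;
};
```

A field that is never read stays as raw bytes for the lifetime of the object, and a field that is read many times is decoded once.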

The Full Picture

🔢 Bytes first: Serialization reduces structured data to a flat byte sequence. The bit-packing schema must be documented and shared.
↔️ Byte order matters: Little-endian dominates on host machines. Big-endian dominates on the wire. Convert explicitly at boundaries.
🏷️ Thrift binary: Self-describing: every field carries type + field ID before its value. Terminated by a 0x00 STOP byte.
❄️ Frozen/2: Pre-computed bit offsets enable O(1) random field access via direct pointer arithmetic, with no sequential scan.