Systems Deep Dive · Serialization & Deserialization
From raw bytes and endianness to Thrift's binary protocol and Frozen/2 — a complete walkthrough.
01 — Fundamentals
At its core, serialization is the process of transforming an in-memory data structure into a flat sequence of bytes that can be stored or transmitted. Deserialization is the reverse — rebuilding the original structure from those bytes.
Consider a simple Date struct with year, month, and day. If we pack 2026-03-15 into a 32-bit word using bit-fields — 23 bits for year, 4 bits for month, 5 bits for day — and emit the result in big-endian byte order, we get exactly four bytes:
// Pack date into 32 bits: [31..9]=year [8..5]=month [4..0]=day
uint32_t packed = ((static_cast<uint32_t>(year) & 0x7FFFFF) << 9)
                | ((month & 0x0F) << 5)
                | (day & 0x1F);

// Extract bytes in big-endian order (MSB first)
uint8_t b[4];
b[0] = (packed >> 24) & 0xFF;  // bits [31..24]
b[1] = (packed >> 16) & 0xFF;  // bits [23..16]
b[2] = (packed >> 8) & 0xFF;   // bits [15..8]
b[3] = (packed >> 0) & 0xFF;   // bits [7..0]
Result for 2026 / 03 / 15: the packed word is 0x000FD46F, emitted as the four bytes 00 0F D4 6F.
That's it. Serialization is just bytes. The hard part is agreeing on which order those bytes live in memory — and that's where endianness enters the picture.
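The reverse direction can be sketched in the same style. This is a minimal illustration that assumes the bit layout described above; the `Date` struct and the `unpackDate` name are hypothetical and do not come from the original code.

```cpp
#include <cstdint>

// Hypothetical struct for illustration; field names are assumptions.
struct Date {
    uint32_t year;
    uint32_t month;
    uint32_t day;
};

// Rebuild the 32-bit word from four big-endian bytes, then split it
// back into the three bit-fields: [31..9]=year [8..5]=month [4..0]=day.
Date unpackDate(const uint8_t b[4]) {
    const uint32_t packed = (static_cast<uint32_t>(b[0]) << 24) |
                            (static_cast<uint32_t>(b[1]) << 16) |
                            (static_cast<uint32_t>(b[2]) << 8)  |
                             static_cast<uint32_t>(b[3]);
    Date d;
    d.year  = (packed >> 9) & 0x7FFFFF;  // 23 bits
    d.month = (packed >> 5) & 0x0F;      // 4 bits
    d.day   =  packed       & 0x1F;      // 5 bits
    return d;
}
```

Feeding it the bytes 00 0F D4 6F gives back year 2026, month 3, day 15. Because the word is assembled from individual bytes with shifts, the function behaves the same on any host, whatever its native byte order.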
02 — Byte Ordering
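Endianness is the order in which the bytes of a multi-byte value are laid out: big-endian puts the most significant byte first, little-endian puts the least significant byte first. Think of it like how we write numbers in natural language. 8621 is 8×10³ + 6×10² + 2×10¹ + 1×10⁰, with the big end first. If we reversed that ordering, we'd write 1268 for the same value. Different convention, identical information, total incompatibility.
For a 32-bit value 0x12345678 stored at address 0x1000:
Address         0x1000  0x1001  0x1002  0x1003
Big-endian      0x12    0x34    0x56    0x78
Little-endian   0x78    0x56    0x34    0x12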
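#include <cstdio>

// Interpret two bytes as a 16-bit value, most significant byte first.
unsigned short readBigEndian(const unsigned char* data) {
    return static_cast<unsigned short>((data[0] << 8) | data[1]);
}

// Interpret the same two bytes, least significant byte first.
unsigned short readLittleEndian(const unsigned char* data) {
    return static_cast<unsigned short>((data[1] << 8) | data[0]);
}

int main() {
    unsigned char data[] = {0x01, 0x10};
    std::printf("%u\n", static_cast<unsigned>(readBigEndian(data)));     // big-endian: 0x0110 = 272
    std::printf("%u\n", static_cast<unsigned>(readLittleEndian(data)));  // little-endian: 0x1001 = 4097
    return 0;
}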
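The same idea extends to the 32-bit value 0x12345678 used above. The following sketch writes a value in each byte order; the function names `storeBigEndian32` and `storeLittleEndian32` are made up for this illustration.

```cpp
#include <cstdint>

// Write a 32-bit value most significant byte first (big-endian).
void storeBigEndian32(uint32_t value, uint8_t out[4]) {
    out[0] = static_cast<uint8_t>(value >> 24);
    out[1] = static_cast<uint8_t>(value >> 16);
    out[2] = static_cast<uint8_t>(value >> 8);
    out[3] = static_cast<uint8_t>(value);
}

// Write a 32-bit value least significant byte first (little-endian).
void storeLittleEndian32(uint32_t value, uint8_t out[4]) {
    out[0] = static_cast<uint8_t>(value);
    out[1] = static_cast<uint8_t>(value >> 8);
    out[2] = static_cast<uint8_t>(value >> 16);
    out[3] = static_cast<uint8_t>(value >> 24);
}
```

For 0x12345678 the first function produces 12 34 56 78 and the second produces 78 56 34 12. Using shifts rather than copying the integer's memory keeps the output independent of the host's own byte order.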
03 — Refresh
04 — Apache Thrift
The hand-rolled bit-packing above works, but production systems need something more principled. Apache Thrift's TBinaryProtocol uses a self-describing format: every field carries its own type tag and field ID before its value, so the decoder never has to guess.
The wire layout for a single field looks like this:
[ type code: 1 byte ][ field ID: 2 bytes ][ value: size depends on the type ]
The type codes are fixed constants (e.g. 0x08 = I32, 0x02 = BOOL, 0x0B = STRING, 0x00 = STOP). Multi-byte integers are always big-endian on the wire regardless of host byte order.
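As a rough sketch of that layout, the code below encodes a single I32 field by hand. It is not the Thrift library's API: the enum and the helpers `putI16`, `putI32`, `putFieldHeader`, and `encodeI32Field` are hypothetical names, and only the type codes quoted above are taken from the text.

```cpp
#include <cstdint>
#include <vector>

// Type codes quoted in the text; the enum itself is illustrative.
enum WireType : uint8_t {
    WIRE_STOP = 0x00,
    WIRE_BOOL = 0x02,
    WIRE_I32 = 0x08,
    WIRE_STRING = 0x0B
};

// Append a 16-bit value in big-endian order.
void putI16(std::vector<uint8_t>& out, int16_t v) {
    const uint16_t u = static_cast<uint16_t>(v);
    out.push_back(static_cast<uint8_t>(u >> 8));
    out.push_back(static_cast<uint8_t>(u));
}

// Append a 32-bit value in big-endian order.
void putI32(std::vector<uint8_t>& out, int32_t v) {
    const uint32_t u = static_cast<uint32_t>(v);
    out.push_back(static_cast<uint8_t>(u >> 24));
    out.push_back(static_cast<uint8_t>(u >> 16));
    out.push_back(static_cast<uint8_t>(u >> 8));
    out.push_back(static_cast<uint8_t>(u));
}

// Field header: 1-byte type code followed by a 2-byte field ID.
void putFieldHeader(std::vector<uint8_t>& out, WireType type, int16_t fieldId) {
    out.push_back(static_cast<uint8_t>(type));
    putI16(out, fieldId);
}

// One complete I32 field: header plus 4-byte value.
std::vector<uint8_t> encodeI32Field(int16_t fieldId, int32_t value) {
    std::vector<uint8_t> out;
    putFieldHeader(out, WIRE_I32, fieldId);
    putI32(out, value);
    return out;
}
```

Calling `encodeI32Field(1, 42)` yields seven bytes: 08 00 01 00 00 00 2A.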
05 — Worked Example
Take this Thrift definition and instance:
struct User {
  1: i32 id,       // = 42
  2: bool active,  // = true
  3: string name,  // = "Bob"
}
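Here is the complete field-by-field serialization:
Field       Type code     Field ID   Value                    Bytes
1: id       08 (I32)      00 01      00 00 00 2A              7
2: active   02 (BOOL)     00 02      01                       4
3: name     0B (STRING)   00 03      00 00 00 03 42 6F 62     10
STOP        00                                                1
Concatenating all three fields plus the terminal STOP byte (0x00):
08 00 01 00 00 00 2A 02 00 02 01 0B 00 03 00 00 00 03 42 6F 62 00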
22 bytes total. Every byte is accounted for by the schema.
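The byte count can be checked with a small hand-written encoder. This is a sketch of the layout described in this article, not code generated by the Thrift compiler; the C++ `User` struct and the helpers `appendBigEndian` and `serializeUser` are assumptions made for the example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hand-written mirror of the Thrift definition above (illustrative only).
struct User {
    int32_t id;
    bool active;
    std::string name;
};

// Append the low `bytes` bytes of v, most significant byte first.
void appendBigEndian(std::vector<uint8_t>& out, uint32_t v, int bytes) {
    for (int shift = 8 * (bytes - 1); shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(v >> shift));
    }
}

std::vector<uint8_t> serializeUser(const User& u) {
    std::vector<uint8_t> out;

    // Field 1: I32 id -> type 0x08, field ID 1, 4-byte value.
    out.push_back(0x08);
    appendBigEndian(out, 1, 2);
    appendBigEndian(out, static_cast<uint32_t>(u.id), 4);

    // Field 2: BOOL active -> type 0x02, field ID 2, 1-byte value.
    out.push_back(0x02);
    appendBigEndian(out, 2, 2);
    out.push_back(u.active ? 0x01 : 0x00);

    // Field 3: STRING name -> type 0x0B, field ID 3, 4-byte length, payload.
    out.push_back(0x0B);
    appendBigEndian(out, 3, 2);
    appendBigEndian(out, static_cast<uint32_t>(u.name.size()), 4);
    out.insert(out.end(), u.name.begin(), u.name.end());

    // Terminal STOP byte.
    out.push_back(0x00);
    return out;
}
```

For `User{42, true, "Bob"}` the result is 7 + 4 + 10 + 1 = 22 bytes, matching the total above.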
06 — Performance
During serialization, the writer knows the full in-memory layout of every field upfront. It can write type code → field ID → value in a tight loop, with no lookups required.
During deserialization, the reader must parse each field header (type + field ID) before it knows how many bytes to consume for the value. For variable-length types like STRING, it must first read the 4-byte length prefix, then read that many payload bytes. Each field depends on parsing the previous one — creating a strict sequential dependency chain.
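The dependency chain is easy to see in a reader that looks up one field by ID. The sketch below handles only the three value types used in this article (BOOL, I32, STRING); `readBigEndian` and `findFieldValueOffset` are hypothetical names, and returning -1 for "not found" is a simplification chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read `bytes` bytes starting at pos as a big-endian unsigned value.
// vector::at throws std::out_of_range on truncated input.
uint32_t readBigEndian(const std::vector<uint8_t>& buf, size_t pos, size_t bytes) {
    uint32_t v = 0;
    for (size_t i = 0; i < bytes; ++i) {
        v = (v << 8) | buf.at(pos + i);
    }
    return v;
}

// Return the offset of the value of the field with ID wantedId,
// or -1 if STOP (or an unsupported type) is reached first.
long findFieldValueOffset(const std::vector<uint8_t>& buf, int16_t wantedId) {
    size_t pos = 0;
    while (pos < buf.size()) {
        const uint8_t type = buf.at(pos);
        if (type == 0x00) {
            break;  // STOP: no more fields
        }
        const int16_t id = static_cast<int16_t>(readBigEndian(buf, pos + 1, 2));
        pos += 3;  // skip the 3-byte header
        if (id == wantedId) {
            return static_cast<long>(pos);
        }
        // The size of this value must be known before the next header
        // can even be located.
        if (type == 0x02) {
            pos += 1;  // BOOL
        } else if (type == 0x08) {
            pos += 4;  // I32
        } else if (type == 0x0B) {
            const uint32_t len = readBigEndian(buf, pos, 4);
            pos += 4 + len;  // STRING: length prefix plus payload
        } else {
            return -1;  // type not covered by this sketch
        }
    }
    return -1;
}
```

On the 22-byte User message, reaching field 3 means first decoding the headers of fields 1 and 2 and stepping over their values; there is no way to jump straight to offset 14 without that walk.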
This is the fundamental motivation for formats like Frozen/2.
07 — Frozen/2
Frozen/2 is Meta's extension to Thrift that solves the sequential parse problem by pre-computing a compact layout descriptor at freeze time. Instead of scanning the bytestream from the beginning for every field access, Frozen/2 uses three components:
For example, a Person{ id=5, age=32 } struct frozen into 2 bytes works as follows:
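A toy sketch of the idea, not Frozen/2's actual API or exact byte format: assume the layout descriptor records a bit offset and a bit width for each field, and that each integer gets only as many bits as its value needs (3 bits for id=5 and 6 bits for age=32, 9 bits in total, which rounds up to 2 bytes). The `FieldLayout` struct and the `storeBits` and `loadBits` helpers are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-field layout entry: where the field lives, in bits.
struct FieldLayout {
    size_t bitOffset;
    size_t bitWidth;
};

// Write the low bitWidth bits of value at the field's bit offset.
// The buffer is assumed to be zero-initialized and large enough.
void storeBits(std::vector<uint8_t>& buf, FieldLayout f, uint32_t value) {
    for (size_t i = 0; i < f.bitWidth; ++i) {
        const size_t bit = f.bitOffset + i;
        if ((value >> i) & 1u) {
            buf[bit / 8] |= static_cast<uint8_t>(1u << (bit % 8));
        }
    }
}

// Read a field directly from its known position; no other field is parsed.
uint32_t loadBits(const std::vector<uint8_t>& buf, FieldLayout f) {
    uint32_t value = 0;
    for (size_t i = 0; i < f.bitWidth; ++i) {
        const size_t bit = f.bitOffset + i;
        if ((buf[bit / 8] >> (bit % 8)) & 1u) {
            value |= (1u << i);
        }
    }
    return value;
}
```

With `id` at `{0, 3}` and `age` at `{3, 6}`, both values fit in a 2-byte buffer, and reading `age` touches only its own bits. That fixed-position access is the contrast with the sequential walk that TBinaryProtocol requires.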
08 — Optimization
Another optimization in Thrift's C++ runtime is cpp.lazy deserialization. A field marked cpp.lazy is not deserialized when the parent struct is read from the wire. Instead, the raw bytes for that field are retained in a buffer, and the field is only parsed on first access.
This is a direct answer to the sequential parsing cost: if you have a large struct but only ever access two or three fields, cpp.lazy lets you pay the deserialization cost only for the fields you actually touch — deferring or entirely skipping the rest.
The wire format of a cpp.lazy struct is identical to a normal Thrift binary-encoded struct. The optimization lives entirely in the reader path.
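The parse-on-first-access behavior can be illustrated with a small wrapper. This is a conceptual sketch only and does not reflect how Thrift's C++ runtime implements cpp.lazy; the `LazyStringField` class is hypothetical and assumes the retained bytes are a STRING value (4-byte big-endian length followed by the payload).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lazy field: keeps the raw wire bytes and decodes them
// only when the value is first requested.
class LazyStringField {
public:
    explicit LazyStringField(std::vector<uint8_t> raw) : raw_(std::move(raw)) {}

    bool isParsed() const { return parsed_; }

    const std::string& get() {
        if (!parsed_) {
            if (raw_.size() < 4) {
                throw std::runtime_error("truncated length prefix");
            }
            const uint32_t len = (static_cast<uint32_t>(raw_[0]) << 24) |
                                 (static_cast<uint32_t>(raw_[1]) << 16) |
                                 (static_cast<uint32_t>(raw_[2]) << 8)  |
                                  static_cast<uint32_t>(raw_[3]);
            if (raw_.size() - 4 < len) {
                throw std::runtime_error("truncated payload");
            }
            const auto first = raw_.begin() + 4;
            value_.assign(first, first + static_cast<std::ptrdiff_t>(len));
            parsed_ = true;  // later calls reuse the decoded value
        }
        return value_;
    }

private:
    std::vector<uint8_t> raw_;
    std::string value_;
    bool parsed_ = false;
};
```

Constructing the wrapper costs only a buffer move. If `get()` is never called, the decode work never happens, which is the saving described above for structs where only a few fields are read.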
Summary