§01 The Problem Nobody Tells You About
Every engineer learns RPC early. Define a service, generate a client, call a method — data magically arrives on another machine. Clean. Simple. Abstracted.
But here is what they do not teach you: underneath that abstraction is a genuinely hard problem. Two machines, potentially built by different manufacturers, running different operating systems, with different CPU architectures, need to agree on what a sequence of bytes means. Not approximately agree. Exactly agree. One wrong bit and your integer becomes a garbage value — silently, with no exception thrown.
This article is about that problem. By the end, you will understand why serialization exists, what endianness actually is at the hardware level, how it breaks cross-machine communication, how protocols like Thrift solve it, and — the part most articles skip — how you can legitimately avoid serialization entirely, even across a network, when the conditions are right.
This article is for engineers who use RPC daily but want to understand what is happening under the hood. It assumes comfort with networking basics and at least one RPC framework. No assembly knowledge required.
§02 What a Number Actually Is in Memory
Before we can talk about endianness, we need to establish one fundamental truth: a number is not just a number when it lives in memory. It is a sequence of bytes, and the order of those bytes is a choice.
Consider the integer 1000. In hexadecimal, that is 0x000003E8. It requires 4 bytes to store. Which byte goes first?
| Convention | Used by | Bytes of 0x000003E8 on disk/wire |
|---|---|---|
| Little-endian | x86 / AMD64 | LSB stored first: E8 03 00 00 |
| Big-endian | network byte order / SPARC | MSB stored first: 00 00 03 E8 |
Both representations store the same value. Both are correct. The problem arises entirely when one machine writes bytes using one convention and another machine reads those same bytes using the other convention.
x86 and x86-64 (Intel, AMD) — the CPUs in virtually all servers — are little-endian. Most ARM processors default to little-endian in modern deployments. SPARC, IBM POWER in big-endian mode, and network hardware are traditionally big-endian. In practice, most of your machines are little-endian, which is exactly why endianness bugs are so insidious — they do not appear in local testing.
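You can see both conventions directly from Python by forcing each byte order with `struct`. A minimal sketch — the hex output shows the same value 1000 laid out both ways:

```python
import struct

value = 1000  # 0x000003E8

little = struct.pack('<i', value)  # '<' forces little-endian
big    = struct.pack('>i', value)  # '>' forces big-endian

print(little.hex(' '))  # e8 03 00 00
print(big.hex(' '))     # 00 00 03 e8
```

Packing with the native format `'i'` (no prefix) would give the same bytes as `'<i'` on any x86 machine, which is exactly why the bug stays invisible in local testing.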
§03 The Silent Corruption Bug
Here is what endianness mismatch looks like in code. This is not a contrived example — this exact bug has caused real production incidents.
```python
import struct

# Machine A: x86 little-endian server
value = 1000
# Writing raw bytes WITHOUT specifying byte order —
# Python defaults to native (little-endian on x86)
raw = struct.pack('i', value)   # produces: E8 03 00 00
socket.send(raw)

# ─────────────────────────────────────────
# Machine B: big-endian system
# Receives the SAME bytes: E8 03 00 00
raw = socket.recv(4)
# Interprets them as big-endian (its native format)
received = struct.unpack('i', raw)[0]
print(received)   # -402456576 ← silent corruption, no error
```
The terrifying part is the silence. There is no CorruptedByteOrderException. The receiving machine simply interprets the bytes it received according to its own convention. The result is a valid integer — just the wrong one.
| What you sent | What they received | Consequence |
|---|---|---|
| 1000 | -402,456,576 | Wrong computation result |
| Port 8080 | Port 36,895 | Connects to wrong service |
| User ID: 42 | User ID: 704,643,072 | Fetches wrong record |
| Price: 9999 | Price: 254,214,144 | Charges wrong amount |
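Each row of the table can be reproduced in a few lines: write with one convention, read with the other. A sketch for the first row:

```python
import struct

raw = struct.pack('<i', 1000)        # what a little-endian writer sends: E8 03 00 00
wrong = struct.unpack('>i', raw)[0]  # what a big-endian reader sees
print(wrong)  # -402456576 — a perfectly valid int, just not yours
```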
§04 Why Serialization Exists
Now we can precisely define what serialization is for. At the most fundamental level, serialization exists to solve the byte order problem.
Serialization is the process of converting in-memory data into a canonical byte representation agreed upon by all parties, regardless of the hardware they run on.
Deserialization is the reverse: converting that canonical representation back into the local machine's native format.
The key word is canonical. Both sides agree on exactly one byte order for the wire. Historically, networking protocols chose big-endian — which is why it is also called network byte order.
```python
# '>' = big-endian (network byte order)
# '!' = equivalent alias meaning "network"

# Machine A (little-endian x86) — SERIALIZES:
value = 1000
raw = struct.pack('>i', value)   # forces big-endian: 00 00 03 E8
socket.send(raw)

# Machine B (any architecture) — DESERIALIZES:
raw = socket.recv(4)             # receives: 00 00 03 E8
value = struct.unpack('>i', raw)[0]
print(value)                     # 1000 — correct on ANY machine
```
Raw memory copies across machines work only when: same CPU architecture AND same byte order AND same struct alignment AND no independent versioning. The moment you cross a network boundary without matching all these conditions, canonical byte representation is not optional — it is required for correctness.
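The "same struct alignment" condition is easy to underestimate. A hedged illustration with ctypes (the struct names are mine): the same two fields produce different sizes depending on packing, so two binaries built with different alignment settings disagree about where every field lives.

```python
import ctypes

class Padded(ctypes.Structure):
    # default alignment: c_uint32 aligns to 4 bytes,
    # so 3 padding bytes follow `flag`
    _fields_ = [("flag", ctypes.c_uint8), ("count", ctypes.c_uint32)]

class Packed(ctypes.Structure):
    _pack_ = 1  # no padding at all
    _fields_ = [("flag", ctypes.c_uint8), ("count", ctypes.c_uint32)]

print(ctypes.sizeof(Padded))  # 8 on typical x86-64
print(ctypes.sizeof(Packed))  # 5
```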
§05 What Thrift Actually Does Under the Hood
When you call a Thrift RPC method, you are not sending a Python object across the wire. You are sending a carefully constructed sequence of bytes that both sides have agreed to interpret identically. The agreement happens at three levels — none of which involve a runtime handshake.
Level 1 — The spec decides (design time)
The Thrift Binary Protocol spec simply declares: "All integers on the wire will be big-endian. Always." No negotiation. It's baked in forever, like HTTP headers being ASCII.
Level 2 — Generated code enforces it (compile time)
```python
# Generated by thrift --gen py sensor.thrift
# You never write this — but this is what runs
def write(self, oprot):
    oprot.writeI32(self.sensor_id)        # → struct.pack('!i', ...)
    oprot.writeDouble(self.value)         # → struct.pack('!d', ...)
    oprot.writeI64(self.timestamp_ms)     # → struct.pack('!q', ...)

def read(self, iprot):
    self.sensor_id = iprot.readI32()      # → struct.unpack('!i', ...)
    self.value = iprot.readDouble()       # → struct.unpack('!d', ...)
    self.timestamp_ms = iprot.readI64()   # → struct.unpack('!q', ...)
```
Level 3 — The protocol class executes it (runtime)
```python
class TBinaryProtocol:
    def writeI32(self, i32):
        # '!' = network byte order = big-endian, unconditional
        # Runs identically on x86, ARM, SPARC — no if-statements
        self.trans.write(struct.pack('!i', i32))

    def readI32(self):
        buff = self.trans.readAll(4)
        val, = struct.unpack('!i', buff)   # always big-endian
        return val
```
The full wire journey of one integer
```
Step 1  Your code
        request = SensorReading(sensor_id=1000, value=98.6)
        client.processReading(request)

Step 2  Generated code calls write()
        request.write(oprot)

Step 3  TBinaryProtocol.writeI32(1000)
        struct.pack('!i', 1000)
        Native x86 memory : E8 03 00 00
        After '!' pack    : 00 00 03 E8  ← these 4 bytes hit the wire

Step 4  TCP transmits bytes
        Wire contains     : 00 00 03 E8

Step 5  Server (any arch) calls readI32()
        struct.unpack('!i', b'\x00\x00\x03\xe8')
        Returns           : 1000 ✓
```
§06 TCompactProtocol: When Endianness Disappears
TBinaryProtocol is straightforward but not the most efficient. TCompactProtocol uses variable-length integer encoding (varint) combined with zigzag encoding, which largely eliminates the endianness problem by changing how numbers are represented on the wire.
```python
# Varint: values stored 7 bits at a time
# High bit of each byte = continuation flag
# Byte order defined by the encoding itself — no endian decision needed

# Value 1000 = 0b1111101000
# Split into 7-bit groups from LSB: 1101000  0000111
# Add continuation bits:            11101000 00000111
# Wire bytes: 0xE8 0x07 → just 2 bytes!

# Zigzag: handles signed ints
def zigzag_encode(n):
    return (n << 1) ^ (n >> 31)
    # 0→0, -1→1, 1→2, -2→3 (small negatives stay small)
```
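The encoding above can be turned into a runnable sketch (the function names are mine, not Thrift's) that reproduces the claimed wire bytes:

```python
def varint_encode(n: int) -> bytes:
    # emit 7 bits at a time, least-significant group first;
    # high bit set means "more bytes follow"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def zigzag_encode(n: int, bits: int = 32) -> int:
    # maps signed to unsigned so small magnitudes stay small:
    # 0→0, -1→1, 1→2, -2→3, ...
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)

print(varint_encode(1000).hex(' '))       # e8 07 — matches the 2 bytes above
print(varint_encode(zigzag_encode(-1)))   # b'\x01' — a small negative stays 1 byte
```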
| Protocol | Integer 1000 | Integer 1 | Approach |
|---|---|---|---|
| TBinaryProtocol | 4 bytes (fixed) | 4 bytes (fixed) | Always big-endian |
| TCompactProtocol | 2 bytes (varint) | 1 byte (varint) | Zigzag+varint, no endian concept |
| JSON (baseline) | 7 bytes ("1000") | 3 bytes ("1") | ASCII text |
| Raw ctypes struct | 4 bytes | 4 bytes | Native — unsafe cross-machine |
§07 Zero-Copy RPC Over a Network — Same Architecture
This is the section most articles skip. When both machines share identical CPU architecture and byte order, you can transmit raw struct bytes across a network socket and reconstruct them on the other side without any serialization or deserialization step. The bytes mean the same thing on both ends.
This is used in production in HFT systems, game servers, real-time telemetry pipelines, and any system where microsecond latency matters more than portability.
1. Both machines are the same CPU architecture (e.g., both x86-64).
2. Both machines use the same byte order (guaranteed by same arch).
3. Both sides use C-compatible struct layout with identical padding.
4. Both binaries are compiled with the same struct alignment settings.
5. You control both ends and deploy them together.
The implementation
The key is using ctypes.Structure with explicit padding control, a length-prefix framing protocol so the reader knows how many bytes to wait for, and raw bytes(struct_instance) / StructType.from_buffer_copy(raw) on each side.
```python
import ctypes

# _pack_ = 1 disables all compiler padding —
# every field lands at the exact offset you expect.
# CRITICAL: both sides must use the same _pack_ value.

class SensorReading(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("sensor_id", ctypes.c_uint32),   # 4 bytes at offset 0
        ("timestamp", ctypes.c_double),   # 8 bytes at offset 4
        ("value",     ctypes.c_double),   # 8 bytes at offset 12
        ("is_valid",  ctypes.c_bool),     # 1 byte  at offset 20
    ]
    # Total: 21 bytes, no padding, deterministic layout

class RpcHeader(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("rpc_id",      ctypes.c_uint16),  # which function
        ("payload_len", ctypes.c_uint32),  # bytes that follow
    ]
    # Total: 6 bytes, fixed, always read first

RPC_GET_READING = 1
RPC_SUBMIT = 2
```
```python
import socket, ctypes, threading
from shared_structs import RpcHeader, SensorReading, RPC_GET_READING

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise RuntimeError("disconnected")
        buf += chunk
    return buf

def handle(conn):
    while True:
        # 1. Read fixed 6-byte header — no parsing, just a cast
        raw_hdr = recv_exact(conn, ctypes.sizeof(RpcHeader))
        hdr = RpcHeader.from_buffer_copy(raw_hdr)  # one memcpy, no field-by-field decoding

        # 2. Read payload
        raw_payload = recv_exact(conn, hdr.payload_len)

        if hdr.rpc_id == RPC_GET_READING:
            # 3. Cast payload directly to struct — no deserialization
            req = SensorReading.from_buffer_copy(raw_payload)
            print(f"sensor={req.sensor_id} value={req.value:.2f}")

            # 4. Build response and send raw bytes — no serialization
            resp = SensorReading(sensor_id=req.sensor_id,
                                 value=req.value * 2,
                                 is_valid=True)
            conn.sendall(bytes(resp))  # raw struct bytes, no encoding

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("", 9000))
server.listen(16)
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```
```python
import socket, ctypes
from shared_structs import RpcHeader, SensorReading, RPC_GET_READING

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("10.0.0.2", 9000))  # ← across the network

# Build payload
reading = SensorReading(sensor_id=42, value=98.6, is_valid=True)
payload = bytes(reading)  # raw struct bytes, no encoding

# Send header + payload
hdr = RpcHeader(rpc_id=RPC_GET_READING, payload_len=len(payload))
sock.sendall(bytes(hdr) + payload)  # total: 6 + 21 = 27 bytes

# Receive response — cast directly, no parsing
# (production code should loop like recv_exact; recv may return fewer bytes)
raw = sock.recv(ctypes.sizeof(SensorReading))
resp = SensorReading.from_buffer_copy(raw)
print(f"response value: {resp.value}")  # 197.2
```
Why this works over the network
TCP is a byte-stream transport. It does not care what the bytes mean — it delivers them in order, intact. Since both machines are x86-64, the bytes that represent sensor_id = 42 in memory on Machine A are identical to the bytes that represent sensor_id = 42 in memory on Machine B. There is nothing to translate.
The only thing you need that you would not need with shared memory is framing — the RpcHeader that tells the reader how many bytes to wait for. TCP can deliver your 21-byte struct in 3 chunks of 7. recv_exact() handles this.
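You can demonstrate the fragmentation problem locally with a socketpair: deliver the 21 bytes in chunks of 7 and let `recv_exact()` reassemble them. A minimal sketch:

```python
import socket

def recv_exact(sock, n):
    # keep reading until exactly n bytes have arrived
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise RuntimeError("disconnected")
        buf += chunk
    return buf

a, b = socket.socketpair()
payload = bytes(range(21))        # stand-in for the 21-byte struct

# deliver in 3 chunks of 7, as TCP is allowed to do
for i in range(0, 21, 7):
    a.sendall(payload[i:i + 7])

assert recv_exact(b, 21) == payload  # reassembled intact
```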
Measured difference
```
# Sending SensorReading (21 bytes payload) over loopback TCP
# 100,000 iterations

Raw ctypes (zero-copy)   : ~1.1 µs per round-trip
Thrift TCompactProtocol  : ~4.8 µs per round-trip
Thrift TBinaryProtocol   : ~6.2 µs per round-trip
JSON over HTTP           : ~38 µs per round-trip

# Raw is ~4x faster than Thrift Compact for this payload size
# The gap widens with payload size
# The gap narrows when network latency dominates (cross-DC calls)
```
When NOT to do this
| Scenario | Use raw structs? | Reason |
|---|---|---|
| Both machines x86-64, same binary | ✓ Yes | Identical layout guaranteed |
| x86-64 client → ARM server | ✗ No | Different default alignment rules |
| Same arch, different languages | ✗ No | Python object ≠ C struct layout |
| Independent deployment schedules | ✗ No | Field added to one side breaks other |
| Need to log / inspect wire bytes | ✗ No | Raw bytes are opaque without schema |
| Public API with external consumers | ✗ No | You cannot control client architecture |
Raw struct RPC over TCP is legitimate for internal, tightly-coupled services where both ends are compiled from the same codebase, deployed atomically, and you have benchmarked that Thrift's overhead is actually your bottleneck. For everything else, Thrift with TCompactProtocol over Unix sockets is the right answer — you get safety, versioning, and multi-language support at ~5µs per call.
§08 Schema Evolution: The Other Half
Endianness is a correctness problem. Schema evolution is a compatibility problem. A production RPC system must handle both. This is the second thing serialization gives you that raw structs cannot.
```thrift
// Version 1
struct SensorReading {
  1: required i32 sensor_id,
  2: required double value,
}

// Version 2 — safely add fields
struct SensorReading {
  1: required i32 sensor_id,
  2: required double value,
  3: optional string unit,        // NEW — always optional
  // 4: DELETED 2024-01 — was legacy_calibration.
  //    Field ID 4 is permanently retired. Never reuse it.
  5: optional i64 recorded_at,    // NEW — gap in IDs is fine
}
```
Never reuse a field ID. Even after deletion, a field ID is permanently retired. If you reuse it for a new field of a different type, old clients will misinterpret your new data — and you are back to silent corruption, except now it is a schema mismatch rather than an endianness issue. The symptoms look identical.
§09 Practical Recommendations
Use TCompactProtocol in production
Unless you have a specific reason (debugging, legacy compatibility), TCompactProtocol is strictly better than TBinaryProtocol: smaller on the wire, faster to encode, and varint encoding sidesteps endianness entirely.
```python
from thrift.protocol import TCompactProtocol
protocol = TCompactProtocol.TCompactProtocol(transport)
```
Always use TFramedTransport
Without framing, a large message can arrive in multiple TCP segments and your read blocks. With framing, each message is preceded by its 4-byte length. Reads become atomic. This eliminates an entire class of subtle bugs.
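The framing idea itself is tiny: a 4-byte big-endian length precedes each message, which is what TFramedTransport puts on the wire. A sketch of just the framing logic:

```python
import struct

def frame(payload: bytes) -> bytes:
    # framed message = 4-byte big-endian length, then the payload
    return struct.pack('!I', len(payload)) + payload

def unframe(buf: bytes) -> bytes:
    length, = struct.unpack('!I', buf[:4])
    return buf[4:4 + length]

msg = b'\x00\x00\x03\xe8'
framed = frame(msg)
print(framed.hex(' '))        # 00 00 00 04 00 00 03 e8
assert unframe(framed) == msg
```

The reader always knows how many bytes to wait for: read 4 bytes, decode the length, then read exactly that many more.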
For same-machine IPC, use Unix sockets
Switch from TCP to Unix domain sockets. You keep all of Thrift's serialization benefits and eliminate the TCP stack overhead entirely. One line change on both client and server.
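At the raw socket level, the change is from AF_INET to AF_UNIX: byte-stream semantics are identical, but the kernel skips the TCP/IP stack. A self-contained echo sketch (the socket path is made up); with Thrift's Python library the equivalent change is typically passing a `unix_socket` path to TSocket instead of host and port — verify against your Thrift version:

```python
import os, socket, tempfile, threading

path = os.path.join(tempfile.mkdtemp(), "rpc.sock")

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)      # a filesystem path instead of (host, port)
srv.listen(1)

def serve():
    conn, _ = srv.accept()
    conn.sendall(conn.recv(64))  # echo back whatever arrives
    conn.close()

threading.Thread(target=serve, daemon=True).start()

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(path)
cli.sendall(b"ping")
print(cli.recv(64))  # b'ping'
```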
Treat field IDs as permanent history
Your .thrift files are not just API definitions — they are permanent records of your wire protocol history. Comment out deleted fields with a date and reason. Treat the ID as a tombstone, not a vacancy.
The complete mental model
- Endianness is a hardware property: the order in which the bytes of a multi-byte value are stored in memory
- Different CPUs use different conventions — x86 is little-endian, network hardware is big-endian
- Sending raw bytes cross-architecture produces silent, valid-looking garbage values
- Serialization solves this by converting to a canonical byte order before sending
- Thrift hardcodes big-endian in its spec and enforces it unconditionally in generated code
- TCompactProtocol uses varint encoding, making byte order an irrelevant concept
- Same-arch machines can skip serialization even over TCP — raw ctypes structs are safe and ~4x faster
- Schema evolution (field IDs, optional fields, retired IDs) is the second problem serialization solves
- Zero-copy raw structs are only safe when both ends are the same arch, same binary, deployed atomically