§01 The Problem Nobody Tells You About
Every engineer learns RPC early. Define a service, generate a client, call a method — data magically arrives on another machine. Clean. Simple. Abstracted.
But here is what they do not teach you: underneath that abstraction is a genuinely hard problem. Two machines, potentially built by different manufacturers, running different operating systems, with different CPU architectures, need to agree on what a sequence of bytes means. Not approximately agree. Exactly agree. One wrong bit and your integer becomes a garbage value — silently, with no exception thrown.
This article is about that problem. By the end, you will understand why serialization exists, what endianness actually is at the hardware level, how it breaks cross-machine communication, how protocols like Thrift solve it, and — the part most articles skip — how you can legitimately avoid serialization entirely, even across a network, when the conditions are right.
This article is for engineers who use RPC daily but want to understand what is happening under the hood. It assumes comfort with networking basics and at least one RPC framework. No assembly knowledge required.
§02 What a Number Actually Is in Memory
Before we can talk about endianness, we need to establish one fundamental truth: a number is not just a number when it lives in memory. It is a sequence of bytes, and the order of those bytes is a choice.
Consider the integer 1000. In hexadecimal, that is 0x000003E8. It requires 4 bytes to store. Which byte goes first?
| Convention | Used by | Bytes of 0x000003E8 on disk/wire |
|---|---|---|
| Little-endian | x86 / AMD64 | LSB stored first: E8 03 00 00 |
| Big-endian | network byte order / SPARC | MSB stored first: 00 00 03 E8 |
Both representations store the same value. Both are correct. The problem arises entirely when one machine writes bytes using one convention and another machine reads those same bytes using the other convention.
x86 and x86-64 (Intel, AMD) — the CPUs in virtually all servers — are little-endian. Most ARM processors default to little-endian in modern deployments. SPARC, IBM POWER in big-endian mode, and network hardware are traditionally big-endian. In practice, most of your machines are little-endian, which is exactly why endianness bugs are so insidious — they do not appear in local testing.
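You can see both conventions directly from Python by forcing each byte order with `struct`. A minimal sketch — the hex output shows the same value 1000 laid out both ways:

```python
import struct

value = 1000  # 0x000003E8

little = struct.pack('<i', value)  # '<' forces little-endian
big    = struct.pack('>i', value)  # '>' forces big-endian

print(little.hex(' '))  # e8 03 00 00
print(big.hex(' '))     # 00 00 03 e8
```

Packing with the native format `'i'` (no prefix) would give the same bytes as `'<i'` on any x86 machine, which is exactly why the bug stays invisible in local testing.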
§03 The Silent Corruption Bug
Here is what endianness mismatch looks like in code. This is not a contrived example — this exact bug has caused real production incidents.
```python
import struct

# Machine A: x86 little-endian server
value = 1000
# Writing raw bytes WITHOUT specifying byte order —
# Python defaults to native (little-endian on x86)
raw = struct.pack('i', value)   # produces: E8 03 00 00
socket.send(raw)

# ─────────────────────────────────────────
# Machine B: big-endian system
# Receives the SAME bytes: E8 03 00 00
raw = socket.recv(4)
# Interprets them as big-endian (its native format)
received = struct.unpack('i', raw)[0]
print(received)   # -402456576 ← silent corruption, no error
```
The terrifying part is the silence. There is no CorruptedByteOrderException. The receiving machine simply interprets the bytes it received according to its own convention. The result is a valid integer — just the wrong one.
| What you sent | What they received | Consequence |
|---|---|---|
| 1000 | -402,456,576 | Wrong computation result |
| Port 8080 | Port 36,895 | Connects to wrong service |
| User ID: 42 | User ID: 704,643,072 | Fetches wrong record |
| Price: 9999 | Price: 254,214,144 | Charges wrong amount |
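Each row of the table can be reproduced in a few lines: write with one convention, read with the other. A sketch for the first row:

```python
import struct

raw = struct.pack('<i', 1000)        # what a little-endian writer sends: E8 03 00 00
wrong = struct.unpack('>i', raw)[0]  # what a big-endian reader sees
print(wrong)  # -402456576 — a perfectly valid int, just not yours
```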
§04 Why Serialization Exists
Now we can precisely define what serialization is for. At the most fundamental level, serialization exists to solve the byte order problem.
Serialization is the process of converting in-memory data into a canonical byte representation agreed upon by all parties, regardless of the hardware they run on.
Deserialization is the reverse: converting that canonical representation back into the local machine's native format.
The key word is canonical. Both sides agree on exactly one byte order for the wire. Historically, networking protocols chose big-endian — which is why it is also called network byte order.
```python
# '>' = big-endian (network byte order)
# '!' = equivalent alias meaning "network"

# Machine A (little-endian x86) — SERIALIZES:
value = 1000
raw = struct.pack('>i', value)   # forces big-endian: 00 00 03 E8
socket.send(raw)

# Machine B (any architecture) — DESERIALIZES:
raw = socket.recv(4)             # receives: 00 00 03 E8
value = struct.unpack('>i', raw)[0]
print(value)                     # 1000 — correct on ANY machine
```
Raw memory copies across machines work only when: same CPU architecture AND same byte order AND same struct alignment AND no independent versioning. The moment you cross a network boundary without matching all these conditions, canonical byte representation is not optional — it is required for correctness.
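The "same struct alignment" condition is easy to underestimate. A hedged illustration with ctypes (the struct names are mine): the same two fields produce different sizes depending on packing, so two binaries built with different alignment settings disagree about where every field lives.

```python
import ctypes

class Padded(ctypes.Structure):
    # default alignment: c_uint32 aligns to 4 bytes,
    # so 3 padding bytes follow `flag`
    _fields_ = [("flag", ctypes.c_uint8), ("count", ctypes.c_uint32)]

class Packed(ctypes.Structure):
    _pack_ = 1  # no padding at all
    _fields_ = [("flag", ctypes.c_uint8), ("count", ctypes.c_uint32)]

print(ctypes.sizeof(Padded))  # 8 on typical x86-64
print(ctypes.sizeof(Packed))  # 5
```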
§05 What Thrift Actually Does Under the Hood
When you call a Thrift RPC method, you are not sending a Python object across the wire. You are sending a carefully constructed sequence of bytes that both sides have agreed to interpret identically. The agreement happens at three levels — none of which involve a runtime handshake.
Level 1 — The spec decides (design time)
The Thrift Binary Protocol spec simply declares: "All integers on the wire will be big-endian. Always." No negotiation. It's baked in forever, like HTTP headers being ASCII.
Level 2 — Generated code enforces it (compile time)
```python
# Generated by thrift --gen py sensor.thrift
# You never write this — but this is what runs
def write(self, oprot):
    oprot.writeI32(self.sensor_id)        # → struct.pack('!i', ...)
    oprot.writeDouble(self.value)         # → struct.pack('!d', ...)
    oprot.writeI64(self.timestamp_ms)     # → struct.pack('!q', ...)

def read(self, iprot):
    self.sensor_id = iprot.readI32()      # → struct.unpack('!i', ...)
    self.value = iprot.readDouble()       # → struct.unpack('!d', ...)
    self.timestamp_ms = iprot.readI64()   # → struct.unpack('!q', ...)
```
Level 3 — The protocol class executes it (runtime)
```python
class TBinaryProtocol:
    def writeI32(self, i32):
        # '!' = network byte order = big-endian, unconditional
        # Runs identically on x86, ARM, SPARC — no if-statements
        self.trans.write(struct.pack('!i', i32))

    def readI32(self):
        buff = self.trans.readAll(4)
        val, = struct.unpack('!i', buff)   # always big-endian
        return val
```
The full wire journey of one integer
```
Step 1  Your code
        request = SensorReading(sensor_id=1000, value=98.6)
        client.processReading(request)

Step 2  Generated code calls write()
        request.write(oprot)

Step 3  TBinaryProtocol.writeI32(1000)
        struct.pack('!i', 1000)
        Native x86 memory : E8 03 00 00
        After '!' pack    : 00 00 03 E8  ← these 4 bytes hit the wire

Step 4  TCP transmits bytes
        Wire contains     : 00 00 03 E8

Step 5  Server (any arch) calls readI32()
        struct.unpack('!i', b'\x00\x00\x03\xe8')
        Returns           : 1000 ✓
```
§06 TCompactProtocol: When Endianness Disappears
TBinaryProtocol is straightforward but not the most efficient. TCompactProtocol uses variable-length integer encoding (varint) combined with zigzag encoding, which largely eliminates the endianness problem by changing how numbers are represented on the wire.
```python
# Varint: values stored 7 bits at a time
# High bit of each byte = continuation flag
# Byte order defined by the encoding itself — no endian decision needed

# Value 1000 = 0b1111101000
# Split into 7-bit groups from LSB: 1101000  0000111
# Add continuation bits:            11101000 00000111
# Wire bytes: 0xE8 0x07 → just 2 bytes!

# Zigzag: handles signed ints
def zigzag_encode(n):
    return (n << 1) ^ (n >> 31)
    # 0→0, -1→1, 1→2, -2→3 (small negatives stay small)
```
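The encoding above can be turned into a runnable sketch (the function names are mine, not Thrift's) that reproduces the claimed wire bytes:

```python
def varint_encode(n: int) -> bytes:
    # emit 7 bits at a time, least-significant group first;
    # high bit set means "more bytes follow"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def zigzag_encode(n: int, bits: int = 32) -> int:
    # maps signed to unsigned so small magnitudes stay small:
    # 0→0, -1→1, 1→2, -2→3, ...
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)

print(varint_encode(1000).hex(' '))       # e8 07 — matches the 2 bytes above
print(varint_encode(zigzag_encode(-1)))   # b'\x01' — a small negative stays 1 byte
```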
| Protocol | Integer 1000 | Integer 1 | Approach |
|---|---|---|---|
| TBinaryProtocol | 4 bytes (fixed) | 4 bytes (fixed) | Always big-endian |
| TCompactProtocol | 2 bytes (varint) | 1 byte (varint) | Zigzag+varint, no endian concept |
| JSON (baseline) | 7 bytes ("1000") | 3 bytes ("1") | ASCII text |
| Raw ctypes struct | 4 bytes | 4 bytes | Native — unsafe cross-machine |
§07 Zero-Copy RPC Over a Network — Same Architecture
This is the section most articles skip. When both machines share identical CPU architecture and byte order, you can transmit raw struct bytes across a network socket and reconstruct them on the other side without any serialization or deserialization step. The bytes mean the same thing on both ends.
This is used in production in HFT systems, game servers, real-time telemetry pipelines, and any system where microsecond latency matters more than portability.
1. Both machines are the same CPU architecture (e.g., both x86-64).
2. Both machines use the same byte order (guaranteed by same arch).
3. Both sides use C-compatible struct layout with identical padding.
4. Both binaries are compiled with the same struct alignment settings.
5. You control both ends and deploy them together.
The implementation
The key is using ctypes.Structure with explicit padding control, a length-prefix framing protocol so the reader knows how many bytes to wait for, and raw bytes(struct_instance) / StructType.from_buffer_copy(raw) on each side.
```python
import ctypes

# _pack_ = 1 disables all compiler padding —
# every field lands at the exact offset you expect.
# CRITICAL: both sides must use the same _pack_ value.

class SensorReading(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("sensor_id", ctypes.c_uint32),   # 4 bytes at offset 0
        ("timestamp", ctypes.c_double),   # 8 bytes at offset 4
        ("value",     ctypes.c_double),   # 8 bytes at offset 12
        ("is_valid",  ctypes.c_bool),     # 1 byte  at offset 20
    ]
    # Total: 21 bytes, no padding, deterministic layout

class RpcHeader(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("rpc_id",      ctypes.c_uint16),  # which function
        ("payload_len", ctypes.c_uint32),  # bytes that follow
    ]
    # Total: 6 bytes, fixed, always read first

RPC_GET_READING = 1
RPC_SUBMIT = 2
```
```python
import socket, ctypes, threading
from shared_structs import RpcHeader, SensorReading, RPC_GET_READING

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise RuntimeError("disconnected")
        buf += chunk
    return buf

def handle(conn):
    while True:
        # 1. Read fixed 6-byte header — no parsing, just a cast
        raw_hdr = recv_exact(conn, ctypes.sizeof(RpcHeader))
        hdr = RpcHeader.from_buffer_copy(raw_hdr)  # one memcpy, no field-by-field decoding

        # 2. Read payload
        raw_payload = recv_exact(conn, hdr.payload_len)

        if hdr.rpc_id == RPC_GET_READING:
            # 3. Cast payload directly to struct — no deserialization
            req = SensorReading.from_buffer_copy(raw_payload)
            print(f"sensor={req.sensor_id} value={req.value:.2f}")

            # 4. Build response and send raw bytes — no serialization
            resp = SensorReading(sensor_id=req.sensor_id,
                                 value=req.value * 2,
                                 is_valid=True)
            conn.sendall(bytes(resp))  # raw struct bytes, no encoding

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("", 9000))
server.listen(16)
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```
```python
import socket, ctypes
from shared_structs import RpcHeader, SensorReading, RPC_GET_READING

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("10.0.0.2", 9000))  # ← across the network

# Build payload
reading = SensorReading(sensor_id=42, value=98.6, is_valid=True)
payload = bytes(reading)  # raw struct bytes, no encoding

# Send header + payload
hdr = RpcHeader(rpc_id=RPC_GET_READING, payload_len=len(payload))
sock.sendall(bytes(hdr) + payload)  # total: 6 + 21 = 27 bytes

# Receive response — cast directly, no parsing
# (production code should loop like recv_exact; recv may return fewer bytes)
raw = sock.recv(ctypes.sizeof(SensorReading))
resp = SensorReading.from_buffer_copy(raw)
print(f"response value: {resp.value}")  # 197.2
```
Why this works over the network
TCP is a byte-stream transport. It does not care what the bytes mean — it delivers them in order, intact. Since both machines are x86-64, the bytes that represent sensor_id = 42 in memory on Machine A are identical to the bytes that represent sensor_id = 42 in memory on Machine B. There is nothing to translate.
The only thing you need that you would not need with shared memory is framing — the RpcHeader that tells the reader how many bytes to wait for. TCP can deliver your 21-byte struct in 3 chunks of 7. recv_exact() handles this.
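You can demonstrate the fragmentation problem locally with a socketpair: deliver the 21 bytes in chunks of 7 and let `recv_exact()` reassemble them. A minimal sketch:

```python
import socket

def recv_exact(sock, n):
    # keep reading until exactly n bytes have arrived
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise RuntimeError("disconnected")
        buf += chunk
    return buf

a, b = socket.socketpair()
payload = bytes(range(21))        # stand-in for the 21-byte struct

# deliver in 3 chunks of 7, as TCP is allowed to do
for i in range(0, 21, 7):
    a.sendall(payload[i:i + 7])

assert recv_exact(b, 21) == payload  # reassembled intact
```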
Measured difference
```
# Sending SensorReading (21 bytes payload) over loopback TCP
# 100,000 iterations

Raw ctypes (zero-copy)   : ~1.1 µs per round-trip
Thrift TCompactProtocol  : ~4.8 µs per round-trip
Thrift TBinaryProtocol   : ~6.2 µs per round-trip
JSON over HTTP           : ~38 µs per round-trip

# Raw is ~4x faster than Thrift Compact for this payload size
# The gap widens with payload size
# The gap narrows when network latency dominates (cross-DC calls)
```
When NOT to do this
| Scenario | Use raw structs? | Reason |
|---|---|---|
| Both machines x86-64, same binary | ✓ Yes | Identical layout guaranteed |
| x86-64 client → ARM server | ✗ No | Different default alignment rules |
| Same arch, different languages | ✗ No | Python object ≠ C struct layout |
| Independent deployment schedules | ✗ No | Field added to one side breaks other |
| Need to log / inspect wire bytes | ✗ No | Raw bytes are opaque without schema |
| Public API with external consumers | ✗ No | You cannot control client architecture |
Raw struct RPC over TCP is legitimate for internal, tightly-coupled services where both ends are compiled from the same codebase, deployed atomically, and you have benchmarked that Thrift's overhead is actually your bottleneck. For everything else, Thrift with TCompactProtocol over Unix sockets is the right answer — you get safety, versioning, and multi-language support at ~5µs per call.
§08 Schema Evolution: The Other Half
Endianness is a correctness problem. Schema evolution is a compatibility problem. A production RPC system must handle both. This is the second thing serialization gives you that raw structs cannot.
```thrift
// Version 1
struct SensorReading {
  1: required i32 sensor_id,
  2: required double value,
}

// Version 2 — safely add fields
struct SensorReading {
  1: required i32 sensor_id,
  2: required double value,
  3: optional string unit,        // NEW — always optional
  // 4: DELETED 2024-01 — was legacy_calibration.
  //    Field ID 4 is permanently retired. Never reuse it.
  5: optional i64 recorded_at,    // NEW — gap in IDs is fine
}
```
Never reuse a field ID. Even after deletion, a field ID is permanently retired. If you reuse it for a new field of a different type, old clients will misinterpret your new data — and you are back to silent corruption, except now it is a schema mismatch rather than an endianness issue. The symptoms look identical.
§09 Practical Recommendations
Use TCompactProtocol in production
Unless you have a specific reason (debugging, legacy compatibility), TCompactProtocol is strictly better than TBinaryProtocol: smaller on the wire, faster to encode, and varint encoding sidesteps endianness entirely.
```python
from thrift.protocol import TCompactProtocol
protocol = TCompactProtocol.TCompactProtocol(transport)
```
Always use TFramedTransport
Without framing, a large message can arrive in multiple TCP segments and your read blocks. With framing, each message is preceded by its 4-byte length. Reads become atomic. This eliminates an entire class of subtle bugs.
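The framing idea itself is tiny: a 4-byte big-endian length precedes each message, which is what TFramedTransport puts on the wire. A sketch of just the framing logic:

```python
import struct

def frame(payload: bytes) -> bytes:
    # framed message = 4-byte big-endian length, then the payload
    return struct.pack('!I', len(payload)) + payload

def unframe(buf: bytes) -> bytes:
    length, = struct.unpack('!I', buf[:4])
    return buf[4:4 + length]

msg = b'\x00\x00\x03\xe8'
framed = frame(msg)
print(framed.hex(' '))        # 00 00 00 04 00 00 03 e8
assert unframe(framed) == msg
```

The reader always knows how many bytes to wait for: read 4 bytes, decode the length, then read exactly that many more.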
For same-machine IPC, use Unix sockets
Switch from TCP to Unix domain sockets. You keep all of Thrift's serialization benefits and eliminate the TCP stack overhead entirely. One line change on both client and server.
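At the raw socket level, the change is from AF_INET to AF_UNIX: byte-stream semantics are identical, but the kernel skips the TCP/IP stack. A self-contained echo sketch (the socket path is made up); with Thrift's Python library the equivalent change is typically passing a `unix_socket` path to TSocket instead of host and port — verify against your Thrift version:

```python
import os, socket, tempfile, threading

path = os.path.join(tempfile.mkdtemp(), "rpc.sock")

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)      # a filesystem path instead of (host, port)
srv.listen(1)

def serve():
    conn, _ = srv.accept()
    conn.sendall(conn.recv(64))  # echo back whatever arrives
    conn.close()

threading.Thread(target=serve, daemon=True).start()

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(path)
cli.sendall(b"ping")
print(cli.recv(64))  # b'ping'
```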
Treat field IDs as permanent history
Your .thrift files are not just API definitions — they are permanent records of your wire protocol history. Comment out deleted fields with a date and reason. Treat the ID as a tombstone, not a vacancy.
The complete mental model
- Endianness is a hardware property: the order in which the bytes of a multi-byte value are stored in memory
- Different CPUs use different conventions — x86 is little-endian, network hardware is big-endian
- Sending raw bytes cross-architecture produces silent, valid-looking garbage values
- Serialization solves this by converting to a canonical byte order before sending
- Thrift hardcodes big-endian in its spec and enforces it unconditionally in generated code
- TCompactProtocol uses varint encoding, making byte order an irrelevant concept
- Same-arch machines can skip serialization even over TCP — raw ctypes structs are safe and ~4x faster
- Schema evolution (field IDs, optional fields, retired IDs) is the second problem serialization solves
- Zero-copy raw structs are only safe when both ends are the same arch, same binary, deployed atomically