Skip to content

VOM Specification

Razvan Musaloiu-E edited this page Jul 18, 2016 · 1 revision

Vanadium Object Marshalling (VOM) is a data format capable of serializing all values representable in VDL. It is the underlying serialization format used by the Vanadium RPC system.

VOM is a binary format that represents a stream of typed values. The sending side sequentially encodes values into a stream of bytes, and the receiving side decodes the stream of bytes back into the original sequence of values.

The format is self-describing. Given all data generated by an encoder, a generic decoder with no a priori knowledge of the contents can recover the types and values that were originally encoded. This enables support for reflection, and allows generic tools to be written to process the data. E.g., a storage system might use VOM to encode user-provided values into its log files. Specific fields of the data may be indexed for fast lookups, and the log files may be decoded even if the original encoding program no longer exists.

Binary format

VOM is a byte-based binary format, consisting of a sequence of messages. Each message is either a type definition (TypeMsg) or a value definition (ValueMsg). Types must be defined before they are instantiated by values.

The format starts with a single version byte, which must be 0x80 for the initial version of VOM.

Grammar

The format is described by the following grammar:

// VOM starts with a version byte, followed by a sequence of messages.
VOM:         version Message*
// Each message defines either a type or a value.  Both kinds of messages
// start with a typeid, and are distinguished by whether the typeid is
// positive or negative.
Message:     TypeMsg | ValueMsg

// Type messages start with a negative ID that identifies the type,
// followed by the WireType that defines the type, encoded just like
// any other main value.
TypeMsg:     -typeid MainValue(WireType)

// Value messages start with a positive ID that references a previously
// defined type, which describes the static type of the value,
// followed by the main value encoding.
ValueMsg:    +typeid MainValue

// If the main value is a primitive, its encoding is based on its type.
// Otherwise it's a composite, which starts with the unsigned byte length
// of the rest of the encoding, followed by the composite encoding.
MainValue:   primitive | bytelen CompositeV
// Other values (e.g. elements of an list) are encoded like the main value,
// without the byte length prefix for composites.
Value:       primitive | CompositeV

// Composite values have various encodings based on the kind of type.
CompositeV:  SeqV | SeqPairV | SparseSeqV | ElemV | OptV | AnyV
// Arrays, lists and sets are encoded as a dense sequence, starting with
// the unsigned number of items, followed by that many elements.
SeqV:        num Value*{num}
// Maps are encoded as a dense sequence of pairs, starting with the
// unsigned number of entries, followed by that many (key, elem) pairs.
SeqPairV:    num (Value Value)*{num}
// Structs are encoded as a sparse sequence of (index, field) pairs.
// The index is the 0-based unsigned index of the struct field.
// Zero-valued fields are skipped, and fields may be encoded in any order.
// The fields are terminated by an END control byte.
SparseSeqV:  (index Value)* END
// Unions are encoded as a single (index, field) pair.
// The index is the 0-based unsigned index of the union field.
ElemV:       index Value
// Optional is encoded as either a NIL control byte representing
// non-existence, or the value.
OptV:        NIL | Value
// Any is encoded as either a NIL control byte representing
// non-existence, or the unsigned type ID followed by the value.
AnyV:        NIL | typeid Value

All rules in the above grammar terminate in primitive values; the entire VOM encoding comprises combinations of primitives. The encoding of primitive values is in turn based on a single concept called var128.

Var128

The basis for VOM encoding is var128, a variable-length unsigned integer with a maximum size of 128 bits (16 bytes). This is a byte-based encoding. Values in the range [0, 127] are encoded verbatim in a single byte. Larger values use the first byte to encode the byte length of the value, with subsequent bytes encoding the value in its minimal big-endian representation.

The first byte of the encoding is the crux of the encoding. Of the 256 possible states for the first byte, 128 states are used to represent the values [0, 127], and another 16 states are used to represent the multi-byte lengths [1, 16]. That leaves 112 states, which are reserved for control entries.

var128 first byte 7 6 5 4 3 2 1 0
0x00..0x7F
Single byte value
0 value in range [0, 127]
0x80..0xBF
Control1
(64 entries)
1 0 - - - - - -
0xC0..0xDF
Control2
(32 entries)
1 1 0 - - - - -
0xE0..0xEF
Control3
(16 entries)
1 1 1 0 - - - -
0xF0..0xFF
Multi-byte length
(FF=1, FE=2, ...)
1 1 1 1 len in range [1, 16]

The encoding of the value and control entries are disjoint from each other; each var128 can either hold a single value of up to 128 bits, or a control entry with 112 possible states, grouped in consecutive 4, 5 and 6 bit chunks. The encoding favors small values; values less than 128 and all control entries are encoded in a single byte.

This is based on the fundamental format for the Go gob encoding, with the addition of the control entries. This format strikes a balance between size and speed; we try not to be wasteful of space, but still keep the format simple and fast to encode and decode.

Control entries

The var128 control entries are used to represent special properties in the VOM encoding. We take advantage of the fact that control entries may appear at the same position where regular values are expected, and distinguished appropriately.

Most control entries are currently unused, but reserved for future expansion.

0xE0 NIL Represents a non-existent value
0xE1 END Represents the end of a sequence

Primitives

Primitives types are encoded using either var128 or raw bytes.

  • Bool: Encoded directly as a single raw byte; 0 means false and 1 means true.
  • Byte: Encoded directly as a single raw byte.
  • Unsigned integer: Encoded directly as a var128.
  • Signed integer: Encoded as a var128, with bits 1 and up containing the value, and bit 0 indicating whether to complement the other bits to recover the signed integer.
  • Floating point: Encoded as a var128 containing a byte-reversed IEEE 754 64-bit float.
  • Complex number: Encoded as two consecutive floats, the first representing the real component and the second representing the imaginary component.
  • String: Encoded as a var128 containing the byte length, followed by that many raw bytes.

Built-in types

A fixed set of built-in types are used to bootstrap the encoding format, and are assumed to be known by both the encoder and decoder. Each built-in type has a fixed type ID. The semantics of the built-in types are defined by the VDL type system.

ID Built-in type
Primitive types
1 bool
2 byte
3 string
4 uint16
5 uint32
6 uint64
7 int16
8 int32
9 int64
10 float32
11 float64
12 complex64
13 complex128
14 typeobject
15 any
Composite types
39 []byte
40 []string
User defined types
41 first user defined type id

Wire types

User defined types are described in values of type WireType, and encoded into TypeMsg just like regular values. Unlike the regular encoding process, WireType cannot result in the generation of further type messages; there must be an end to the recursion. Similar to the built-in types, WireType and its sub types are assumed to be known by both the encoder and decoder, for bootstrapping purposes.

The space of all types consists of the built-in types, and user-defined types as described by WireType.

WireType and associated sub types are defined as follows. Since these types are used to bootstrap the system, the ordering of fields within each of these types is fixed and cannot be changed.

// WireType represents a user-defined type, and is encoded into the
// payload portion of each type message.
type WireType union {
  NamedT    WireNamed    // INDEX = 0
  EnumT     WireEnum     // INDEX = 1
  ArrayT    WireArray    // INDEX = 2
  ListT     WireList     // INDEX = 3
  SetT      WireSet      // INDEX = 4
  MapT      WireMap      // INDEX = 5
  StructT   WireStruct   // INDEX = 6
  UnionT    WireUnion    // INDEX = 7
  OptionalT WireOptional // INDEX = 8
}

// TypeID uniquely identifies a type definition within a VOM stream.
type TypeID uint64

// WireNamed represents the definition of named primitive types.
type WireNamed struct {
  Name string // INDEX = 0
  Base TypeID // INDEX = 1
}

// WireEnum represents the definition of enum types.
type WireEnum struct {
  Name   string   // INDEX = 0
  Labels []string // INDEX = 1
}

// WireArray represents the definition of array types.
type WireArray struct {
  Name string // INDEX = 0
  Elem TypeID // INDEX = 1
  Len  uint64 // INDEX = 2
}

// WireList represents the definition of list types.
type WireList struct {
  Name string // INDEX = 0
  Elem TypeID // INDEX = 1
}

// WireSet represents the definition of set types.
type WireSet struct {
  Name string // INDEX = 0
  Key  TypeID // INDEX = 1
}

// WireMap represents the definition of map types.
type WireMap struct {
  Name string // INDEX = 0
  Key  TypeID // INDEX = 1
  Elem TypeID // INDEX = 2
}

// WireField represents a field in a struct or union type.
type WireField struct {
  Name string // INDEX = 0
  Type TypeID // INDEX = 1
}

// WireStruct represents the definition of struct types.
type WireStruct struct {
  Name   string      // INDEX = 0
  Fields []WireField // INDEX = 1
}

// WireUnion represents the definition of union types.
type WireUnion struct {
  Name   string      // INDEX = 0
  Fields []WireField // INDEX = 1
}

// WireOptional represents the definition of optional types.
type WireOptional struct {
  Name string // INDEX = 0
  Elem TypeID // INDEX = 1
}