-
Notifications
You must be signed in to change notification settings - Fork 28
VOM Specification
Vanadium Object Marshalling (VOM) is a data format capable of serializing all values representable in VDL. It is the underlying serialization format used by the Vanadium RPC system.
VOM is a binary format that represents a stream of typed values. The sending side sequentially encodes values into a stream of bytes, and the receiving side decodes the stream of bytes back into the original sequence of values.
The format is self-describing. Given all data generated by an encoder, a generic decoder with no a priori knowledge of the contents can recover the types and values that were originally encoded. This enables support for reflection, and allows generic tools to be written to process the data. E.g., a storage system might use VOM to encode user-provided values into its log files. Specific fields of the data may be indexed for fast lookups, and the log files may be decoded even if the original encoding program no longer exists.
VOM is a byte-based binary format, consisting of a sequence of messages. Each
message is either a type definition (TypeMsg
) or a value definition
(ValueMsg
). Types must be defined before they are instantiated by values.
The format starts with a single version byte, which must be 0x80
for the
initial version of VOM.
The format is described by the following grammar:
// VOM starts with a version byte, followed by a sequence of messages.
VOM: version Message*
// Each message defines either a type or a value. Both kinds of messages
// start with a typeid, and are distinguished by whether the typeid is
// positive or negative.
Message: TypeMsg | ValueMsg
// Type messages start with a negative ID that identifies the type,
// followed by the WireType that defines the type, encoded just like
// any other main value.
TypeMsg: -typeid MainValue(WireType)
// Value messages start with a positive ID that references a previously
// defined type, which describes the static type of the value,
// followed by the main value encoding.
ValueMsg: +typeid MainValue
// If the main value is a primitive, its encoding is based on its type.
// Otherwise it's a composite, which starts with the unsigned byte length
// of the rest of the encoding, followed by the composite encoding.
MainValue: primitive | bytelen CompositeV
// Other values (e.g. elements of an list) are encoded like the main value,
// without the byte length prefix for composites.
Value: primitive | CompositeV
// Composite values have various encodings based on the kind of type.
CompositeV: SeqV | SeqPairV | SparseSeqV | ElemV | OptV | AnyV
// Arrays, lists and sets are encoded as a dense sequence, starting with
// the unsigned number of items, followed by that many elements.
SeqV: num Value*{num}
// Maps are encoded as a dense sequence of pairs, starting with the
// unsigned number of entries, followed by that many (key, elem) pairs.
SeqPairV: num (Value Value)*{num}
// Structs are encoded as a sparse sequence of (index, field) pairs.
// The index is the 0-based unsigned index of the struct field.
// Zero-valued fields are skipped, and fields may be encoded in any order.
// The fields are terminated by an END control byte.
SparseSeqV: (index Value)* END
// Unions are encoded as a single (index, field) pair.
// The index is the 0-based unsigned index of the union field.
ElemV: index Value
// Optional is encoded as either a NIL control byte representing
// non-existence, or the value.
OptV: NIL | Value
// Any is encoded as either a NIL control byte representing
// non-existence, or the unsigned type ID followed by the value.
AnyV: NIL | typeid Value
All rules in the above grammar terminate in primitive values; the entire VOM encoding comprises combinations of primitives. The encoding of primitive values is in turn based on a single concept called var128.
The basis for VOM encoding is var128, a variable-length unsigned integer with a maximum size of 128 bits (16 bytes). This is a byte-based encoding. Values in the range [0, 127] are encoded verbatim in a single byte. Larger values use the first byte to encode the byte length of the value, with subsequent bytes encoding the value in its minimal big-endian representation.
The first byte of the encoding is the crux of the encoding. Of the 256 possible states for the first byte, 128 states are used to represent the values [0, 127], and another 16 states are used to represent the multi-byte lengths [1, 16]. That leaves 112 states, which are reserved for control entries.
var128 first byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
0x00..0x7F Single byte value |
0 | value in range [0, 127] | ||||||
0x80..0xBF Control1 (64 entries) |
1 | 0 | - | - | - | - | - | - |
0xC0..0xDF Control2 (32 entries) |
1 | 1 | 0 | - | - | - | - | - |
0xE0..0xEF Control3 (16 entries) |
1 | 1 | 1 | 0 | - | - | - | - |
0xF0..0xFF Multi-byte length (FF=1, FE=2, ...) |
1 | 1 | 1 | 1 | len in range [1, 16] | |||
The encoding of the value and control entries are disjoint from each other; each var128 can either hold a single value of up to 128 bits, or a control entry with 112 possible states, grouped in consecutive 4, 5 and 6 bit chunks. The encoding favors small values; values less than 128 and all control entries are encoded in a single byte.
This is based on the fundamental format for the Go gob encoding, with the addition of the control entries. This format strikes a balance between size and speed; we try not to be wasteful of space, but still keep the format simple and fast to encode and decode.
The var128 control entries are used to represent special properties in the VOM encoding. We take advantage of the fact that control entries may appear at the same position where regular values are expected, and distinguished appropriately.
Most control entries are currently unused, but reserved for future expansion.
0xE0 | NIL | Represents a non-existent value |
---|---|---|
0xE1 | END | Represents the end of a sequence |
Primitives types are encoded using either var128 or raw bytes.
- Bool: Encoded directly as a single raw byte; 0 means false and 1 means true.
- Byte: Encoded directly as a single raw byte.
- Unsigned integer: Encoded directly as a var128.
- Signed integer: Encoded as a var128, with bits 1 and up containing the value, and bit 0 indicating whether to complement the other bits to recover the signed integer.
- Floating point: Encoded as a var128 containing a byte-reversed IEEE 754 64-bit float.
- Complex number: Encoded as two consecutive floats, the first representing the real component and the second representing the imaginary component.
- String: Encoded as a var128 containing the byte length, followed by that many raw bytes.
A fixed set of built-in types are used to bootstrap the encoding format, and are assumed to be known by both the encoder and decoder. Each built-in type has a fixed type ID. The semantics of the built-in types are defined by the VDL type system.
ID | Built-in type |
---|---|
Primitive types | |
1 | bool |
2 | byte |
3 | string |
4 | uint16 |
5 | uint32 |
6 | uint64 |
7 | int16 |
8 | int32 |
9 | int64 |
10 | float32 |
11 | float64 |
12 | complex64 |
13 | complex128 |
14 | typeobject |
15 | any |
Composite types | |
39 | []byte |
40 | []string |
User defined types | |
41 | first user defined type id |
User defined types are described in values of type WireType
, and encoded into
TypeMsg
just like regular values. Unlike the regular encoding process,
WireType
cannot result in the generation of further type messages; there must
be an end to the recursion. Similar to the built-in types, WireType
and its
sub types are assumed to be known by both the encoder and decoder, for
bootstrapping purposes.
The space of all types consists of the built-in types, and user-defined types
as described by WireType
.
WireType
and associated sub types are defined as follows. Since these types
are used to bootstrap the system, the ordering of fields within each of these
types is fixed and cannot be changed.
// WireType represents a user-defined type, and is encoded into the
// payload portion of each type message.
type WireType union {
NamedT WireNamed // INDEX = 0
EnumT WireEnum // INDEX = 1
ArrayT WireArray // INDEX = 2
ListT WireList // INDEX = 3
SetT WireSet // INDEX = 4
MapT WireMap // INDEX = 5
StructT WireStruct // INDEX = 6
UnionT WireUnion // INDEX = 7
OptionalT WireOptional // INDEX = 8
}
// TypeID uniquely identifies a type definition within a VOM stream.
type TypeID uint64
// WireNamed represents the definition of named primitive types.
type WireNamed struct {
Name string // INDEX = 0
Base TypeID // INDEX = 1
}
// WireEnum represents the definition of enum types.
type WireEnum struct {
Name string // INDEX = 0
Labels []string // INDEX = 1
}
// WireArray represents the definition of array types.
type WireArray struct {
Name string // INDEX = 0
Elem TypeID // INDEX = 1
Len uint64 // INDEX = 2
}
// WireList represents the definition of list types.
type WireList struct {
Name string // INDEX = 0
Elem TypeID // INDEX = 1
}
// WireSet represents the definition of set types.
type WireSet struct {
Name string // INDEX = 0
Key TypeID // INDEX = 1
}
// WireMap represents the definition of map types.
type WireMap struct {
Name string // INDEX = 0
Key TypeID // INDEX = 1
Elem TypeID // INDEX = 2
}
// WireField represents a field in a struct or union type.
type WireField struct {
Name string // INDEX = 0
Type TypeID // INDEX = 1
}
// WireStruct represents the definition of struct types.
type WireStruct struct {
Name string // INDEX = 0
Fields []WireField // INDEX = 1
}
// WireUnion represents the definition of union types.
type WireUnion struct {
Name string // INDEX = 0
Fields []WireField // INDEX = 1
}
// WireOptional represents the definition of optional types.
type WireOptional struct {
Name string // INDEX = 0
Elem TypeID // INDEX = 1
}