SPL is a simple format for representing structured data.
It is heavily inspired by Ron Rivest's S-Expressions (http://people.csail.mit.edu/rivest/sexp.html).
An SPL file is a sequence of SPL objects.
An SPL object is one of the following:
- STRING: a text string (an ordered sequence of unicode code points)
- INTEGER: a signed integer of arbitrary magnitude
- BLOB: a byte array (an ordered sequence of unsigned 8-bit integers)
- LIST: a list of zero or more SPL objects
There are no references. SPL objects are always finite, non-recursive, tree-structured, and have a well-defined tree depth. Also, the STRING type must not contain the NUL character, i.e. the code point with value 0.
typedef some_integer_representation SPL_int;
typedef enum {
SPL_STRING, SPL_INTEGER, SPL_BLOB, SPL_LIST
} SPL_kind;
typedef struct SPL_object {
SPL_kind tag;
union {
char *string;
SPL_int integer;
uint8_t *blob;
struct SPL_object *list;
};
} SPL_object;
When an SPL object needs to be stored in a human-readable text (e.g. as a configuration file), the following format is to be used:
-
STRING
The string is encoded by escaping it and enclosing it in double quotes. Before enclosing, non-printable characters, double quotes and backslashes ale all transformed into escape sequences. Escape sequences supported are:\"
,\\
,\t
,\n
= 0x0a,\r
= 0x0d,\xHH
,\uHHHH
,\UHHHHHHHH
. All in their traditional meaning.\xHH
escape sequence is interpreted as a byte in UTF-8 format. A sequence of adjacent bytes is decoded as a single UTF-8 string. It is permitted to encode a single unicode code point as a sequence of its constituent UTF-8 bytes. -
INTEGER
The integer value is encoded as a conventional decimal number string, with no spaces or other separators, and no enclosing characters. Example: The number -12458 would be represented using the character sequence-
,1
,2
,4
,5
,8
. -
BLOB
The byte array is represented by the character#
, followed by the length of the array in decimal notation, followed by the character:
followed by the bytes of the array, each encoded as two hexadecimal digits in lowercase. Example: The array { 0, 1, 26, 87, 128, 13 } would be represented as#6:00011a57800d
. -
LIST
The object list is represented as a sequence of representations of its constituent objects separated by whitespace, the entire sequence enclosed in a pair of regular parentheses. Example: A list of string "hello", string "world", integer 1337, an empty list, and a blob {0,1,1,2,3,5,8,13}, is represented as("hello" "world" 1337 () #8:000101020305080d)
.
For streaming over a network, a different representation is defined. This representation is designed for efficient transfer of data from one application to another, with as little transformation as possible. If you do not care for efficiency, you can just use the text representation. However, do not use the binary representation anywhere, where you expect a human to read it.
Before any other data, the stream starts with a LIST of up to 112 key STRINGs, as the first SPL object. The set of key strings may be empty. It depends on the application protocol or a statistical analysis of data, and serves to assign a single-byte identifiers to most frequent strings.
Control bytes:
-
0x80
--0xEF
Key STRINGs. -
0xF0
reserved -
0xF1
reserved -
0xF2
reserved -
0xF3
reserved -
0xF4
reserved -
0xF5
reserved -
0xF6
reserved -
0xF7
reserved -
0xF8
reserved -
0xF9
reserved -
0xFA
Start of a LIST. -
0xFB
End of a LIST. -
0xFC
Start of a STRING. -
0xFD
Start of a BLOB. -
0xFE
Start of a non-negative INTEGER. -
0xFF
Start of a negative INTEGER. -
7-bit integer encoding (INT7):
INT7 a variable-length encoding of an unsigned integer. It is a sequence of bytes. The most significant bit of each byte is set to zero, i.e. each byte contains 7 bits of the integer. Bytes are in little-endian order, i.e. the first byte contains the 7 least significant bits of a number. There must be no trailing zero bytes.
Each SPL object is optionally prefixed with the length of the entire object in INT7 encoding, not including the length of the INT7 length itself, but including any control bytes. For objects of type BLOB or INTEGER, this length-prefix is mandatory. After the length, object identification and data follows, as described bellow.
-
STRING:
A byte in range0x80
--0xEF
, or a byte of value0xFC
. In the former case, the byte's value minus 128 identifies one of the key strings. No further data is included. In the latter case, string data encoded as UTF-8 follow the control byte, terminated by a byte of value 0. -
INTEGER: Either
0xFE
(nonnegative) or0xFF
(negative), followed by little-endian sequence of bytes, with no trailing zero bytes. Each byte contains full 8 bits of the absolute value of the number. Zero is nonnegative, i.e.01FE
, not01FF
, nor02FE00
. -
BLOB:
0xFD
followed by bytes of the blob. Example for {1, 2, 3}:04FD010203
. Example for {}:01FD
. -
LIST:
0xFA
, followed by encodings of the contained objects, followed by0xFB
.
Q: What character encoding is the printable text representation?
A: Unimportant, but you are bound to make someone angry if you don't use UTF-8.
Q: How do you encode date/time/datetime information?
A: String (https://tools.ietf.org/html/rfc3339), or integer (https://en.wikipedia.org/wiki/Unix_time) are both good choices. Make your pick and make it stick.
Q: How do you encode anything else?
A: Any way you want. If you think something deserves a guideline here, let me know.
Q: Is there a "canonical" encoding?
A: The binary stream representation without any key strings, and with no length specifications except the mandatory ones. To the best of my knowledge, such restricted encoding is unique. (And you can readily compute this from any other encoding.)