RProtoBufUtils

This package provides tools and utilities for serializing R objects with protocol buffers. It builds on the RProtoBuf package, which interfaces with Google's official protocol buffers C++ library.

The main advantage of serializing an object to a protocol buffer message, as opposed to native R serialization, is that protocol buffers are an interoperable format that other programming languages can read and write. The main disadvantage is that some R-specific object types are not supported and are lost in the process (with a warning).

How it works

Designing an R object to protobuf serialization requires defining three parts:

The class of R objects that is supported in the serialization.
The .proto file which defines a schema for the structure of the message.
The mapping between this class of R objects and the proto message.
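
The three parts can be illustrated with a small base R sketch. This is not the package's actual mapper: the helper names df_to_fields and fields_to_df are hypothetical, and a plain R list stands in for the proto message. It only shows the shape of the design, assuming the supported class is "data frame with plain atomic columns".

```r
# Hypothetical sketch in base R; a nested list stands in for the message.

# Parts 1 and 3: accept only the supported class, then map each column
# to a field holding its name, storage type, and values (the "schema").
df_to_fields <- function(df) {
  stopifnot(is.data.frame(df))
  # Supported class: plain atomic columns without extra attributes.
  plain <- vapply(df, function(col) is.atomic(col) && is.null(attributes(col)),
                  logical(1))
  stopifnot(all(plain))
  lapply(names(df), function(nm) {
    list(name = nm, type = typeof(df[[nm]]), values = df[[nm]])
  })
}

# Reverse mapping: rebuild the data frame from the list of fields.
fields_to_df <- function(fields) {
  cols <- lapply(fields, function(f) f$values)
  names(cols) <- vapply(fields, function(f) f$name, character(1))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

df <- data.frame(id = 1:3, score = c(0.5, NA, 2), label = c("a", "b", NA),
                 stringsAsFactors = FALSE)
fields <- df_to_fields(df)
restored <- fields_to_df(fields)
```

Restricting the supported class up front keeps the reverse mapping simple: anything the schema cannot represent is rejected before it is written rather than silently dropped.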

This package contains example .proto files designed for serializing R objects, as well as R code that helps convert R data and objects to this format. Note that to unserialize a message, a third party needs both the serialized data and the specific .proto file.

Example

The serialize_pb function mimics native serialization and writes an R object to a file or connection in protobuf format. By default it uses the rexp.proto schema.

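library(RProtoBufUtils)
# Round-trip iris through a temporary file in protobuf format.
msg <- tempfile()
serialize_pb(iris, msg)
obj <- unserialize_pb(msg)
identical(iris, obj)  # TRUE if nothing was lost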

Proto schemas

By default, serialize_pb uses rexp.proto, which is also used by the RHIPE project to serialize R objects for use with Hadoop. This proto is designed to be as general as possible and supports all standard S3 objects, such as vectors, factors, lists, data frames, and any combination thereof. It also stores attributes and missing values. However, it does not support some R-specific constructs, such as functions, environments, and S4 classes.

The rexp.proto message definition is very general but also quite verbose. An application that only needs to serialize a certain class of objects may be better served by a proto definition and mapper written specifically for that class. For example, the RProtoBufUtils package includes a dataframe.proto specifically for data frames. This proto is less general and may be simpler for third parties to use when communicating a dataset.

# Same round trip, using the dataframe.proto schema.
msg <- tempfile()
serialize_pb(iris, msg, proto = "dataframe")
obj <- unserialize_pb(msg, proto = "dataframe")
identical(iris, obj)

Note, again, that the consumer of the message must be told clearly which .proto was used to serialize the object. The serialized data cannot be interpreted without the proper .proto file.

Unit Test

The RProtoBuf package ships with a data frame named testdata which contains all of the common vector types, including some missing values for each:

numeric (double)
integer
factor
character
Date
POSIXct (timestamp)
complex
logical

This dataset is used to test whether objects serialize and unserialize without loss of information or precision. It is also useful for testing unserialization in another language.

# load data
data(testdata)

# test rexp.proto
msg <- tempfile()
serialize_pb(testdata, msg, proto = "rexp")
obj <- unserialize_pb(msg, proto = "rexp")
identical(testdata, obj)

# test dataframe.proto
msg <- tempfile()
serialize_pb(testdata, msg, proto = "dataframe")
obj <- unserialize_pb(msg, proto = "dataframe")
identical(testdata, obj)
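
When identical() returns FALSE, it helps to know which column was altered. The following is a minimal sketch of such a check, not part of the package: check_roundtrip is a hypothetical helper, the small data frame is made up for illustration, and native saveRDS/readRDS serve as stand-ins for the protobuf functions so that the example runs with base R alone.

```r
# Illustrative data frame with several vector types, each with an NA.
df <- data.frame(
  num  = c(1.5, NA, 3),
  int  = c(1L, NA, 3L),
  fac  = factor(c("a", NA, "b")),
  chr  = c("x", NA, "z"),
  date = as.Date(c("2013-01-01", NA, "2013-12-31")),
  lgl  = c(TRUE, NA, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: write x with write_fun, read it back with read_fun,
# and report whether the result is identical and which columns changed.
check_roundtrip <- function(x, write_fun, read_fun) {
  path <- tempfile()
  on.exit(unlink(path))
  write_fun(x, path)
  y <- read_fun(path)
  same <- vapply(names(x), function(nm) identical(x[[nm]], y[[nm]]),
                 logical(1))
  list(identical = identical(x, y), changed_columns = names(x)[!same])
}

# Stand-in round trip using native R serialization.
res <- check_roundtrip(df, saveRDS, readRDS)
```

To apply the same check to a proto schema, one could pass wrappers around the package functions instead, for example function(x, p) serialize_pb(x, p, proto = "dataframe") and function(p) unserialize_pb(p, proto = "dataframe").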

Limitations

For now, the RProtoBuf package does not support any form of message validation. For example, when the wrong .proto file is specified during unserialization, no warning or error is given, but the output is useless. So for now it is up to the application or protocol designer to make sure the .proto and the message are communicated and validated somehow (e.g. with a checksum).
As mentioned above, because protocol buffers are a general-purpose format, there is no straightforward way to serialize object types specific to R (e.g. functions, environments). However, these types of objects usually have little meaning outside of R in the first place. If you really want to serialize them, you can use something like dput(obj) or serialize(obj, NULL) to turn the object into a character string or raw vector, both of which are supported by protocol buffers.
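
Both workarounds can be sketched in base R. This is an illustration under assumptions, not part of the package: the variable names are made up, deparse() is used because it returns the text that dput() prints, and an MD5 checksum from tools::md5sum is just one possible way to validate a transferred file.

```r
# An R-specific object that a general-purpose schema cannot represent.
f <- function(x) x + 1

# Option 1: native serialization to a raw vector (fits a bytes field).
f_raw <- serialize(f, NULL)
f_from_raw <- unserialize(f_raw)

# Option 2: source text as a single character string (fits a string field).
f_txt <- paste(deparse(f), collapse = "\n")
f_from_txt <- eval(parse(text = f_txt))

# Checksum sketch: the sender publishes the checksum alongside the file,
# and the receiver recomputes it before trying to unserialize.
msg <- tempfile()
writeBin(f_raw, msg)
checksum_sent <- unname(tools::md5sum(msg))
checksum_received <- unname(tools::md5sum(msg))
file_ok <- identical(checksum_sent, checksum_received)
```

Note that both options only make sense when the consumer is also R; a checksum on the message file detects corruption in transit, and the same approach could be applied to the .proto file to confirm both sides use the same schema.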
