YAML is everywhere: that is probably your first impression when trying GNES. Understanding the YAML config is therefore essential for using GNES.
Essentially, GNES requires two types of YAML config:
- GNES-compose YAML
- Component-wise YAML
All other YAML files, including the docker-compose YAML config and the Kubernetes config generated by the GNES Board or the gnes compose command, are not part of this tutorial. Interested readers are welcome to consult their respective YAML specifications.
- Component-wise YAML specification
  - !CLS specification
  - parameters specification
  - gnes_config specification
- Every component can be described with YAML in GNES
- Stack multiple encoders into a PipelineEncoder
- What's Next?
Preprocessor, encoder, indexer and router are the fundamental components of GNES. They share the same YAML specification. The component-wise YAML defines how a component behaves. At the top level, it contains three fields:
Argument | Type | Description |
---|---|---|
!CLS | str | choose from all class names registered in GNES |
parameters | map/dict | a list of key-value pairs that CLS.__init__() accepts |
gnes_config | map/dict | a list of key-value pairs for GNES |
Let's take a look at an example:
!TorchvisionEncoder
parameters:
  model_dir: ${VGG_MODEL}
  model_name: vgg16
  layers:
    - features
    - avgpool
    - x.view(x.size(0), -1)
    - classifier[0]
gnes_config:
  is_trained: true
  name: my-awesome-vgg
In this example, we define a TorchvisionEncoder that loads a pretrained VGG16 model from the path ${VGG_MODEL}. We then label this component as trained via is_trained: true and set its name to my-awesome-vgg.
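The ${VGG_MODEL} placeholder suggests that the value comes from an environment variable. The sketch below shows one way such a placeholder could be expanded before the YAML text is parsed. It uses only Python's standard library and is an assumption about the mechanism, not a description of GNES's actual loader; the path /data/models/vgg is made up for illustration.

```python
import os

# Hypothetical value; in practice VGG_MODEL would be set in your shell or container.
os.environ["VGG_MODEL"] = "/data/models/vgg"

raw_line = "model_dir: ${VGG_MODEL}"

# os.path.expandvars substitutes $NAME and ${NAME} with the variable's value
# and leaves references to undefined variables unchanged.
expanded = os.path.expandvars(raw_line)
print(expanded)
```

Expanding the text first keeps the YAML file itself free of machine-specific paths.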
!CLS is a name tag chosen from all class names registered in GNES. Currently, the following names are available:
!CLS | Component Type |
---|---|
!BasePreprocessor | Preprocessor |
!SentSplitPreprocessor | Preprocessor |
!BaseImagePreprocessor | Preprocessor |
!BaseTextPreprocessor | Preprocessor |
!VanillaSlidingPreprocessor | Preprocessor |
!WeightedSlidingPreprocessor | Preprocessor |
!SegmentPreprocessor | Preprocessor |
!UnaryPreprocessor | Preprocessor |
!BaseVideoPreprocessor | Preprocessor |
!FFmpegPreprocessor | Preprocessor |
!ShotDetectPreprocessor | Preprocessor |
!BertEncoder | Encoder |
!BertEncoderWithServer | Encoder |
!BertEncoderServer | Encoder |
!ElmoEncoder | Encoder |
!FlairEncoder | Encoder |
!GPTEncoder | Encoder |
!GPT2Encoder | Encoder |
!PCALocalEncoder | Encoder |
!PQEncoder | Encoder |
!TFPQEncoder | Encoder |
!Word2VecEncoder | Encoder |
!BaseEncoder | Encoder |
!BaseBinaryEncoder | Encoder |
!BaseTextEncoder | Encoder |
!BaseNumericEncoder | Encoder |
!CompositionalEncoder | Encoder |
!PipelineEncoder | Encoder |
!HashEncoder | Encoder |
!TorchvisionEncoder | Encoder |
!TFInceptionEncoder | Encoder |
!CVAEEncoder | Encoder |
!FaissIndexer | Indexer |
!LVDBIndexer | Indexer |
!AsyncLVDBIndexer | Indexer |
!NumpyIndexer | Indexer |
!BIndexer | Indexer |
!HBIndexer | Indexer |
!JointIndexer | Indexer |
!BaseIndexer | Indexer |
!BaseTextIndexer | Indexer |
!AnnoyIndexer | Indexer |
!BaseRouter | Router |
!BaseMapRouter | Router |
!BaseReduceRouter | Router |
!ChunkToDocRouter | Router |
!DocFillRouter | Router |
!ConcatEmbedRouter | Router |
!PublishRouter | Router |
!DocBatchRouter | Router |
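To make the idea of "class names registered in GNES" concrete, here is a toy registry that maps a tag such as !ToyEncoder to a Python class. It is only a sketch under stated assumptions: the Component base class, the REGISTRY dict and the resolve() helper are hypothetical names invented for this example, and GNES's real registration mechanism may work differently.

```python
from typing import Dict, Type

# Toy registry: maps a YAML tag such as "!ToyEncoder" to a Python class.
REGISTRY: Dict[str, Type] = {}


class Component:
    """Hypothetical base class; every subclass registers itself under its class name."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY["!" + cls.__name__] = cls


class ToyEncoder(Component):
    pass


class ToyIndexer(Component):
    pass


def resolve(tag: str) -> Type:
    """Look up a tag and fail loudly if no class was registered under it."""
    try:
        return REGISTRY[tag]
    except KeyError:
        raise ValueError(f"unknown component tag: {tag}") from None


print(sorted(REGISTRY))
```

With a scheme like this, a misspelled tag is rejected at load time instead of silently producing the wrong component.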
The key-value pairs defined in parameters are essentially a map of the arguments accepted by the constructor of !CLS. Let's look at the constructor signature of TorchvisionEncoder as an example:
__init__():

def __init__(self, model_name: str,
             layers: List[str],
             model_dir: str,
             batch_size: int = 64,
             *args, **kwargs):
    # do model init...
    # ...

YAML config:

!TorchvisionEncoder
parameters:
  model_dir: ${VGG_MODEL}
  model_name: vgg16
  layers:
    - features
    - avgpool
    - x.view(x.size(0), -1)
    - classifier[0]
Note that if an argument is defined in __init__() but not in the YAML, its default value is used; see batch_size and use_cuda as examples.
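The mapping from parameters to constructor arguments can be sketched in plain Python. ToyTorchvisionEncoder below is a hypothetical stand-in that copies the constructor signature shown above but loads no model, and the parameters dict is what the YAML map would plausibly look like after parsing.

```python
from typing import List


class ToyTorchvisionEncoder:
    """Stand-in with the same constructor signature as above; it loads no model."""

    def __init__(self, model_name: str,
                 layers: List[str],
                 model_dir: str,
                 batch_size: int = 64,
                 *args, **kwargs):
        self.model_name = model_name
        self.layers = layers
        self.model_dir = model_dir
        self.batch_size = batch_size


# The `parameters` map as a dict, assuming the YAML has already been parsed.
parameters = {
    "model_dir": "${VGG_MODEL}",
    "model_name": "vgg16",
    "layers": ["features", "avgpool", "x.view(x.size(0), -1)", "classifier[0]"],
}

# Each key is passed as a keyword argument; key order does not matter.
encoder = ToyTorchvisionEncoder(**parameters)
print(encoder.batch_size)  # batch_size is absent from the map, so the default 64 applies
```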
When you port an external package or module to GNES, the original implementation sometimes takes too many arguments. It doesn't make sense to write a very long __init__ such as:
def __init__(self, arg1, arg2, arg3, arg4, arg5, ...):
    self.arg1 = arg1
    ext_module.cool_model(arg2, arg3, arg4, arg5, ...)
We provide a convenient way to handle this. Let's take BertEncoder as an example, which invokes BertClient from the bert-as-service module. In this case, BertClient accepts 10 arguments.
__init__():

class BertEncoder(BaseTextEncoder):
    store_args_kwargs = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bert_client = BertClient(*args, **kwargs)

YAML config:

!BertEncoder
parameters:
  kwargs:
    port: $BERT_CI_PORT
    port_out: $BERT_CI_PORT_OUT
    ignore_all_checks: true
gnes_config:
  is_trained: true
Note how we define a map under kwargs to describe the arguments; they are forwarded to the constructor of BertClient. Similarly, one can define a list under args to pass unnamed (positional) arguments.
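The forwarding of args and kwargs can be illustrated without bert-as-service installed. In the sketch below, FakeClient is a hypothetical stand-in for an external client with several keyword arguments, ToyEncoder is a made-up wrapper, and the port numbers are invented for illustration. How GNES itself unpacks the two keys is not shown in this tutorial, so treat the last line as an assumption.

```python
class FakeClient:
    """Hypothetical stand-in for an external client that takes many arguments."""

    def __init__(self, ip="localhost", port=5555, port_out=5556,
                 ignore_all_checks=False):
        self.ip = ip
        self.port = port
        self.port_out = port_out
        self.ignore_all_checks = ignore_all_checks


class ToyEncoder:
    def __init__(self, *args, **kwargs):
        # Forward everything untouched to the wrapped client.
        self.client = FakeClient(*args, **kwargs)


# The `parameters` map as a dict, assuming the YAML has already been parsed.
parameters = {
    "kwargs": {"port": 7125, "port_out": 7126, "ignore_all_checks": True},
}

# `args` becomes positional arguments, `kwargs` becomes keyword arguments.
encoder = ToyEncoder(*parameters.get("args", []), **parameters.get("kwargs", {}))
print(encoder.client.port, encoder.client.ip)
```

Arguments that are not listed, such as ip here, keep the defaults of the wrapped client.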
gnes_config defines some meta-information about the component. It accepts the following arguments:
Argument | Type | Description |
---|---|---|
name | str | the name of the component, default None |
is_trained | bool | choose from [True, False]; indicates whether the model has been trained |
batch_size | int | the batch size, often used in encode(), train() and index(); default None, meaning everything is done in one shot |
work_dir | str | the working directory of this component, default $GNES_VOLUME or the current directory |
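As a rough illustration of what batch_size implies, the helper below splits an input into consecutive batches and treats None as "one shot". The function name iter_batches is hypothetical, and this is a sketch of the general technique rather than the code GNES uses internally.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(data: Sequence[T], batch_size: Optional[int] = None) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items.

    A batch_size of None yields the whole input as a single batch.
    """
    if batch_size is None:
        yield list(data)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


print(list(iter_batches([1, 2, 3, 4, 5], batch_size=2)))  # [[1, 2], [3, 4], [5]]
print(list(iter_batches([1, 2, 3, 4, 5])))                # [[1, 2, 3, 4, 5]]
```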
name is important because, together with work_dir, it determines the I/O path used to serialize and deserialize the component. If you start a component without a name, it is assigned a random name with its class name as the prefix.
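The interplay of name and work_dir can be sketched as follows. The function resolve_dump_path, the .bin suffix and the exact format of the random name are assumptions made for this example; only the fallback order ($GNES_VOLUME, then the current directory) and the class-name prefix come from the description above.

```python
import os
import uuid
from typing import Optional


def resolve_dump_path(cls_name: str,
                      name: Optional[str] = None,
                      work_dir: Optional[str] = None) -> str:
    """Return the file path a component might be serialized to (hypothetical helper)."""
    # Fall back to $GNES_VOLUME, then to the current directory.
    work_dir = work_dir or os.environ.get("GNES_VOLUME") or os.getcwd()
    # Unnamed components get a random name prefixed with the class name.
    name = name or f"{cls_name}-{uuid.uuid4().hex[:8]}"
    # The ".bin" suffix is an assumption made for this sketch.
    return os.path.join(work_dir, name + ".bin")


print(resolve_dump_path("PQEncoder", name="my-pq", work_dir="/tmp/gnes"))
print(resolve_dump_path("PQEncoder", work_dir="/tmp/gnes"))
```

Because an unnamed component gets a different random name on every start, it cannot find a previously saved state, which is why setting name explicitly matters.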
The examples above are all about encoders. In fact, every component, including encoders, preprocessors, routers and indexers, can be described with YAML and loaded into GNES. For example:
!SentSplitPreprocessor
parameters:
  start_doc_id: 0
  random_doc_id: True
  deliminator: "[.。!?!?]+"
gnes_config:
  is_trained: true
Sometimes it can be quite simple, e.g.
!PublishRouter
parameters:
  num_part: 2
Or even a one-liner, e.g.
!ConcatEmbedRouter {}
You can find many more examples in the unit tests.
For many real-world applications, a single encoder is often not enough. For example, the output of a BertEncoder is 768-dimensional, and one may want to follow it with a dimensionality reduction or quantization model. Of course, one can spawn every encoder as an independent container and then connect them via GNES Board/gnes compose. But if you don't need them to be elastic, why bother? This is where PipelineEncoder can be very useful: it stacks multiple BaseEncoder instances together, simplifying the data flow in all runtimes (i.e. training, indexing and querying).
To define a PipelineEncoder, you just need to arrange the encoders in the right order and put them in a list under the components field. Let's look at the following example:
!PipelineEncoder
components:
  - !TorchvisionEncoder
    parameters:
      model_dir: /ext_data/image_encoder
      model_name: resnet50
      layers:
        - conv1
        - bn1
        - relu
        - maxpool
        - layer1
        - layer2
        - layer3
        - layer4
        - avgpool
        - x.reshape(x.size(0), -1)
    gnes_config:
      is_trained: true
  - !PCALocalEncoder
    parameters:
      output_dim: 200
      num_locals: 10
    gnes_config:
      batch_size: 2048
  - !PQEncoder
    parameters:
      cluster_per_byte: 20
      num_bytes: 10
gnes_config:
  name: my-pipeline
Note how gnes_config is defined for each component and also globally at the very end.
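The stacking idea itself is simple: the output of one encoder becomes the input of the next. The sketch below demonstrates this with two toy encoders. ToyPipeline, LengthEncoder and HalveEncoder are hypothetical classes written for this example and say nothing about how PipelineEncoder is implemented in GNES.

```python
from typing import Any, List


class LengthEncoder:
    """Toy encoder: maps each string to a one-element vector holding its length."""

    def encode(self, data: List[str]) -> List[List[float]]:
        return [[float(len(s))] for s in data]


class HalveEncoder:
    """Toy encoder: halves every value, standing in for a compression step."""

    def encode(self, data: List[List[float]]) -> List[List[float]]:
        return [[v / 2 for v in vec] for vec in data]


class ToyPipeline:
    """Runs the components in list order, feeding each output to the next one."""

    def __init__(self, components: List[Any]):
        self.components = list(components)

    def encode(self, data: Any) -> Any:
        for component in self.components:
            data = component.encode(data)
        return data


pipeline = ToyPipeline([LengthEncoder(), HalveEncoder()])
print(pipeline.encode(["gnes", "yaml!!"]))  # [[2.0], [3.0]]
```

The order of the list matters: swapping the two components here would fail, because HalveEncoder expects vectors rather than strings.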
Now that you have learned how to configure a complete GNES app, it is time to run GNES in Shell/Docker/Docker Swarm/Kubernetes!