New API for Piping Transformations #29
Following the discussion with @quentinf00, we decided on a good API: No API! Basically, we will leave it up to @quentinf00, who will come up with a nice way to do it in a relatively clean way and will give a short tutorial on how to do it.

WARNING: NOT FOR THE FAINT OF HEART!

Demo Hydra config:

```yaml
# initialize
scaler: {_target_: sklearn.preprocessing.StandardScaler}
fwd_pipeline:
  - _target_: torch.transformer
    kwarg_name: my_params1
    _partial_: true
  - _target_: ${call_method: scaler, fit_transform}
    _partial_: true
  - _target_: torch.transformer2
    p23: my_params2
    _partial_: true
inv_pipeline:
  - _target_: ${call_method: scaler, inverse_transform}
```
```python
from functools import partial
from typing import Callable, List

from sklearn.preprocessing import StandardScaler

# =========================
# Step-By-Step (In Hydra)
# =========================
def no_config(state):
    # hydra initializes the standard scaler
    # scaler: {_target_: sklearn.preprocessing.StandardScaler}
    scaler = StandardScaler()

    transforms = list()
    for istuff in hydra_config_list:

        # ITERATION I - EASY CASE (stateless transformation)
        function, kwargs = istuff  # e.g. torch.transformer, {"kwarg_name": my_params1}
        # partially initialize the function
        itransform: Callable = partial(function, **kwargs)
        # add it to the list
        transforms.append(itransform)

        # ITERATION II - HARD CASE (stateful transformation)
        function, kwargs = istuff
        # what hydra does -> partial(StandardScaler.fit_transform, self=scaler)
        itransform: Callable = partial(StandardScaler.fit_transform, self=scaler, **kwargs)
        # what we get is the bound method: scaler.fit_transform
        # add it to the list
        transforms.append(itransform)
```
```python
# =========================
# Step-By-Step (For users)
# =========================
def main(cfg):
    # hydra initializes all of the functions (forwards and backwards)
    # (pseudo-code: hydra_call stands for hydra.utils.call / instantiate)
    fwd_pipeline: List[Callable] = hydra_call(cfg.fwd_pipeline)
    inv_pipeline: List[Callable] = hydra_call(cfg.inv_pipeline)

    # we can loop through them!
    for fn in fwd_pipeline:
        state = fn(state)

    # train and pred
    ...

    for ifn in inv_pipeline:
        state = ifn(state)

# -----
# you loop through the configs
for one_init in hydra_initializations:
    # pass the previous state
    state = one_init(state)
    # e.g. torch.transformer1(state, kwarg_name=my_params1)
```

As for the users, we have the following rules:

- **Keep everything functional.** We will write functions for everything, with independent parameters. For useful/simple stuff, try to write it in numpy, e.g. a longitude coordinate change. But in general, for more complicated stuff, try to operate on data arrays, e.g. coordinate fixes, filtering, regridding, etc.
- **All functions operate on DataArrays.** When we do write functions, try to have them operate on a single `xr.DataArray`:

```python
params = ...

# do this!
data: xr.DataArray = my_transformation(data, params)

# do NOT do this!!
data: xr.Dataset = my_transformation(data, params)
```

What about operations on datasets? If we want to operate on a dataset, we will need to implement a wrapper or something that will allow us to apply the same function to the dataset. We discussed a few ways to do this, all of them complicated, but we'll see what we come up with. For example:

```python
params = ...
variables = ["ssh", "sst"]

# loop through all of the variables in the dataset
for ivariable in ds.data_vars:
    # only transform the selected variables
    if ivariable not in variables:
        continue
    ds[ivariable] = my_transformation(ds[ivariable], params)
```
Motivation
We should have independent functions for all transformations. That way the user can pick and choose which parts of the transformation they want for their purposes. But we should also have some convenience wrappers which only transform datasets to datasets. This will make it much easier to define custom configs where users can chain their appropriate transformations together without too much nitty-gritty customization.
Example Functional Form:
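A minimal sketch of what such a functional form could look like, assuming a longitude-convention change applied to a single `xr.DataArray` (the function name, the `lon_name` argument, and the coordinate convention are illustrative, not the project's actual code):

```python
import xarray as xr


def shift_lon_to_180(da: xr.DataArray, lon_name: str = "lon") -> xr.DataArray:
    """Functional form: DataArray in, DataArray out, with explicit parameters."""
    lon = da[lon_name]
    # move longitudes from [0, 360] to [-180, 180] and keep them sorted
    da = da.assign_coords({lon_name: ((lon + 180) % 360) - 180})
    return da.sortby(lon_name)
```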
Example Convenience Wrapper:
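A sketch of a dataset-to-dataset convenience wrapper that applies a DataArray-to-DataArray function to a chosen set of variables (the function and argument names are hypothetical):

```python
from typing import Callable, Iterable

import xarray as xr


def apply_to_dataset(
    ds: xr.Dataset,
    fn: Callable[..., xr.DataArray],
    variables: Iterable[str],
    **params,
) -> xr.Dataset:
    """Convenience wrapper: Dataset in, Dataset out."""
    ds = ds.copy()
    for name in variables:
        ds[name] = fn(ds[name], **params)
    return ds


# e.g. rescale only the chosen variables
# ds = apply_to_dataset(ds, lambda da, factor: da * factor, ["ssh", "sst"], factor=2.0)
```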
Problem
Assuming we are ok with that, how do we implement an API that can chain transformations together with partially initialized functions?
(Pseudo-) Proposal
I have included an example set of pseudo-steps below.
- Step 1: I partially initialize a bunch of `xr.Dataset` transformations.
- Step 2: I create a pipeline which will call each transformation sequentially.
- Step 3: I apply this chained transformation to my `xr.Dataset`.
- Step 4 (Optional): I call the `.compute()` method and let Dask do its parallelization magic. This might be problematic because I don't know that all of the transformations we want to do are parallelizable, but we could try.
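A minimal sketch of what those steps could look like, assuming plain `Dataset -> Dataset` functions and a dask-backed dataset (the transformation names, parameters, and file path below are placeholders):

```python
from functools import partial, reduce
from typing import Callable, List

import xarray as xr


# placeholder transformations (stand-ins for the real ones)
def add_offset(ds: xr.Dataset, offset: float = 0.0) -> xr.Dataset:
    return ds + offset


def rescale(ds: xr.Dataset, factor: float = 1.0) -> xr.Dataset:
    return ds * factor


# Step 1: partially initialize a bunch of xr.Dataset transformations
transforms: List[Callable] = [
    partial(add_offset, offset=-273.15),
    partial(rescale, factor=0.01),
]


# Step 2: a pipeline that calls each transformation sequentially
def pipeline(ds: xr.Dataset, fns: List[Callable]) -> xr.Dataset:
    return reduce(lambda out, fn: fn(out), fns, ds)


# Step 3: apply the chained transformation to my xr.Dataset
ds = xr.open_dataset("data.nc", chunks={"time": 100})  # hypothetical file, dask-backed
ds = pipeline(ds, transforms)

# Step 4 (optional): trigger Dask's parallel computation
ds = ds.compute()
```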
Proposal: Scikit-Learn API
I am fond of the scikit-learn `pipeline` functionality with composite transformers (scikit-learn API). Perhaps we could do something similar?

Step 1: Define a custom transformation
We could have agnostic transformations that don't need any parameters to initialize. Examples of custom transformers from a blog:
Example Lon Coord Transform
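A minimal sketch of a stateless, parameter-free transformer in the scikit-learn style, assuming a longitude-convention change (the class and argument names are hypothetical):

```python
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin


class LonCoordTransformer(BaseEstimator, TransformerMixin):
    """Shift longitudes from [0, 360] to [-180, 180]; stateless, nothing to fit."""

    def __init__(self, lon_name: str = "lon"):
        self.lon_name = lon_name

    def fit(self, X, y=None):
        # nothing to learn from the data
        return self

    def transform(self, X: xr.DataArray) -> xr.DataArray:
        lon = X[self.lon_name]
        X = X.assign_coords({self.lon_name: ((lon + 180) % 360) - 180})
        return X.sortby(self.lon_name)
```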
We could also have transformations that are data-dependent (e.g. scaling, normalization) and require some initialization.
Example Scaling Transform
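A sketch of a data-dependent transformer that has to be fit before it can be applied (the class name and the `dim` argument are hypothetical):

```python
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin


class DataArrayStandardScaler(BaseEstimator, TransformerMixin):
    """Standardize a DataArray with statistics learned during fit."""

    def __init__(self, dim: str = "time"):
        self.dim = dim

    def fit(self, X: xr.DataArray, y=None):
        # data-dependent step: learn the statistics
        self.mean_ = X.mean(dim=self.dim)
        self.std_ = X.std(dim=self.dim)
        return self

    def transform(self, X: xr.DataArray) -> xr.DataArray:
        return (X - self.mean_) / self.std_

    def inverse_transform(self, X: xr.DataArray) -> xr.DataArray:
        return X * self.std_ + self.mean_
```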
This has the following functionality: a data-dependent fit step (compute the statistics), a transform step (apply them), and an inverse transform (undo the scaling).
Step 2: Define a pipeline (similar API to the above)
Step 3: Apply Pipeline + Dask Magic
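A sketch of Steps 2 and 3 using the two transformers above with a scikit-learn `Pipeline` and a dask-backed dataset (the variable names and file path are hypothetical):

```python
import xarray as xr
from sklearn.pipeline import Pipeline

# Step 2: define a pipeline, same API as sklearn composite transformers
pipe = Pipeline(
    steps=[
        ("lon", LonCoordTransformer(lon_name="lon")),
        ("scale", DataArrayStandardScaler(dim="time")),
    ]
)

# Step 3: apply the pipeline; with a dask-backed DataArray the operations stay lazy
ds = xr.open_dataset("data.nc", chunks={"time": 100})  # hypothetical file
ssh = pipe.fit_transform(ds["ssh"])

# Dask magic: trigger the actual computation
ssh = ssh.compute()
```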
Why so complicated?
Why do we have an API on top of an API? Why not something simple like `torchvision.transforms.Compose` (example)?

- Reason I: We may want data-dependent transforms like the scaling transformation above. Those require a fit first and then apply.
- Reason II: scikit-learn is a mature API, so it should be relatively familiar to the community.
- Reason III: I've seen an example (Blog | Code) that works with Hydra, which is exactly what we want.
- Reason IV: I've also seen an example that works with `dask` using bob.pipelines. There may be some issues with some of the transformations (dask incompatibilities), but that can come later and we can go on a case-by-case basis. Should we need more advanced functionality (e.g. checkpointing, lazy operations), we can use something like that package.