New API for Piping Transformations #29
Following the discussion with @quentinf00, we decided on a good API: No API! Basically, we will leave it up to @quentinf00, who will come up with a nice way to do it in a relatively clean way and will give a short tutorial on how to do it.

WARNING: NOT FOR THE FAINT OF HEART!

Demo Hydra config:

```yaml
# initialize
scaler: {_target_: sklearn.preprocessing.StandardScaler}
fwd_pipeline:
  - _target_: torch.transformer
    kwarg_name: my_params1
    _partial_: true
  - _target_: ${call_method: scaler, fit_transform}
    _partial_: true
  - _target_: torch.transformer2
    p23: my_params2
    _partial_: true
inv_pipeline:
  - _target_: ${call_method: scaler, inverse_transform}
```
```python
from functools import partial
from typing import Callable, List

from sklearn.preprocessing import StandardScaler

# =========================
# Step-By-Step (In Hydra)
# =========================
def no_config(state):
    # hydra initializes the standard scaler
    # scaler: {_target_: sklearn.preprocessing.StandardScaler}
    scaler = StandardScaler()

    transforms = list()
    for istuff in hydra_config_list:

        # ITERATION I - EASY CASE (stateless transformation)
        function, kwargs = istuff  # e.g. torch.transformer, {"kwarg_name": my_params1}
        # partially initialize the function
        itransform: Callable = partial(function, **kwargs)
        # add it to the list
        transforms.append(itransform)

        # ITERATION II - HARD CASE (stateful transformation)
        function, kwargs = istuff
        # what hydra does -> partial(StandardScaler.fit_transform, self=scaler)
        itransform: Callable = partial(StandardScaler.fit_transform, self=scaler, **kwargs)
        # what we get is the bound method: scaler.fit_transform
        # add it to the list
        transforms.append(itransform)
```
```python
# =========================
# Step-By-Step (For users)
# =========================
def main(cfg):
    # hydra initializes all of the functions (forwards and backwards)
    # (pseudo-code: hydra_call stands for hydra.utils.call / instantiate)
    fwd_pipeline: List[Callable] = hydra_call(cfg.fwd_pipeline)
    inv_pipeline: List[Callable] = hydra_call(cfg.inv_pipeline)

    # we can loop through them!
    for fn in fwd_pipeline:
        state = fn(state)

    # train and pred
    ...

    for ifn in inv_pipeline:
        state = ifn(state)

# -----
# you loop through the configs
for one_init in hydra_initializations:
    # pass the previous state
    state = one_init(state)
    # e.g. torch.transformer1(state, kwarg_name=my_params1)
```

As for the users, we have the following rules:

- **Keep everything functional.** We will write functions for everything, with independent parameters. For useful/simple stuff, try to write it in numpy, e.g. a longitude coordinate change. But in general, for more complicated stuff, try to operate on data arrays, e.g. coordinate fixes, filtering, regridding, etc.
- **All functions operate on DataArrays.** When we do write functions, try to have them operate on a single `xr.DataArray`:

```python
params = ...

# do this!
data: xr.DataArray = my_transformation(data, params)

# do NOT do this!!
data: xr.Dataset = my_transformation(data, params)
```

What about operations on datasets? If we want to operate on a dataset, we will need to implement a wrapper or something that will allow us to apply the same function to the dataset. We discussed a few ways to do this, all of them complicated, but we'll see what we come up with. For example:

```python
params = ...
variables = ["ssh", "sst"]

# loop through all of the variables in the dataset
for ivariable in ds.data_vars:
    # only transform the selected variables
    if ivariable not in variables:
        continue
    ds[ivariable] = my_transformation(ds[ivariable], params)
```
Motivation
We should have independent functions for all transformations. That way the user can pick and choose which parts of the transformation they want for their purposes. But we should also have some convenience wrappers which only transform datasets to datasets. This will make it much easier to define custom configs where users can chain their appropriate transformations together without too much nitty-gritty customization.
Example Functional Form:
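A minimal sketch of what such a functional form could look like, assuming a longitude-convention change applied to a single `xr.DataArray` (the function name, the `lon_name` argument, and the coordinate convention are illustrative, not the project's actual code):

```python
import xarray as xr


def shift_lon_to_180(da: xr.DataArray, lon_name: str = "lon") -> xr.DataArray:
    """Functional form: DataArray in, DataArray out, with explicit parameters."""
    lon = da[lon_name]
    # move longitudes from [0, 360] to [-180, 180] and keep them sorted
    da = da.assign_coords({lon_name: ((lon + 180) % 360) - 180})
    return da.sortby(lon_name)
```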
Example Convenience Wrapper:
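A sketch of a dataset-to-dataset convenience wrapper that applies a DataArray-to-DataArray function to a chosen set of variables (the function and argument names are hypothetical):

```python
from typing import Callable, Iterable

import xarray as xr


def apply_to_dataset(
    ds: xr.Dataset,
    fn: Callable[..., xr.DataArray],
    variables: Iterable[str],
    **params,
) -> xr.Dataset:
    """Convenience wrapper: Dataset in, Dataset out."""
    ds = ds.copy()
    for name in variables:
        ds[name] = fn(ds[name], **params)
    return ds


# e.g. rescale only the chosen variables
# ds = apply_to_dataset(ds, lambda da, factor: da * factor, ["ssh", "sst"], factor=2.0)
```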
Problem
Assuming we are ok with that, how do we implement an API that can chain transformations together with partially initialized functions?
(Pseudo-) Proposal
I have included an example set of pseudo-steps below.
- Step 1: I partially initialize a bunch of `xr.Dataset` transformations.
- Step 2: I create a pipeline which will call each transformation sequentially.
- Step 3: I apply this chained transformation to my `xr.Dataset`.
- Step 4 (Optional): I call the `.compute()` method and let Dask do its parallelization magic. This might be problematic because I don't know that all of the transformations we want to do are parallelizable, but we could try.
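A minimal sketch of what those steps could look like, assuming plain `Dataset -> Dataset` functions and a dask-backed dataset (the transformation names, parameters, and file path below are placeholders):

```python
from functools import partial, reduce
from typing import Callable, List

import xarray as xr


# placeholder transformations (stand-ins for the real ones)
def add_offset(ds: xr.Dataset, offset: float = 0.0) -> xr.Dataset:
    return ds + offset


def rescale(ds: xr.Dataset, factor: float = 1.0) -> xr.Dataset:
    return ds * factor


# Step 1: partially initialize a bunch of xr.Dataset transformations
transforms: List[Callable] = [
    partial(add_offset, offset=-273.15),
    partial(rescale, factor=0.01),
]


# Step 2: a pipeline that calls each transformation sequentially
def pipeline(ds: xr.Dataset, fns: List[Callable]) -> xr.Dataset:
    return reduce(lambda out, fn: fn(out), fns, ds)


# Step 3: apply the chained transformation to my xr.Dataset
ds = xr.open_dataset("data.nc", chunks={"time": 100})  # hypothetical file, dask-backed
ds = pipeline(ds, transforms)

# Step 4 (optional): trigger Dask's parallel computation
ds = ds.compute()
```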
Proposal: Scikit-Learn API
I am fond of the scikit-learn `pipeline` functionality with composite transformers (scikit-learn API). Perhaps we could do something similar?

Step 1: Define a custom transformation
We could have agnostic transformations that don't need any parameters to initialize. Examples of custom transformers from a blog:
Example Lon Coord Transform
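A minimal sketch of a stateless, parameter-free transformer in the scikit-learn style, assuming a longitude-convention change (the class and argument names are hypothetical):

```python
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin


class LonCoordTransformer(BaseEstimator, TransformerMixin):
    """Shift longitudes from [0, 360] to [-180, 180]; stateless, nothing to fit."""

    def __init__(self, lon_name: str = "lon"):
        self.lon_name = lon_name

    def fit(self, X, y=None):
        # nothing to learn from the data
        return self

    def transform(self, X: xr.DataArray) -> xr.DataArray:
        lon = X[self.lon_name]
        X = X.assign_coords({self.lon_name: ((lon + 180) % 360) - 180})
        return X.sortby(self.lon_name)
```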
We could also have transformations that are data-dependent (e.g. scaling, normalization) and require some initialization.
Example Scaling Transform
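A sketch of a data-dependent transformer that has to be fit before it can be applied (the class name and the `dim` argument are hypothetical):

```python
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin


class DataArrayStandardScaler(BaseEstimator, TransformerMixin):
    """Standardize a DataArray with statistics learned during fit."""

    def __init__(self, dim: str = "time"):
        self.dim = dim

    def fit(self, X: xr.DataArray, y=None):
        # data-dependent step: learn the statistics
        self.mean_ = X.mean(dim=self.dim)
        self.std_ = X.std(dim=self.dim)
        return self

    def transform(self, X: xr.DataArray) -> xr.DataArray:
        return (X - self.mean_) / self.std_

    def inverse_transform(self, X: xr.DataArray) -> xr.DataArray:
        return X * self.std_ + self.mean_
```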
This has the following functionality: a data-dependent fit step (compute the statistics), a transform step (apply them), and an inverse transform (undo the scaling).
Step 2: Define a pipeline (similar API to the above)
Step 3: Apply Pipeline + Dask Magic
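A sketch of Steps 2 and 3 using the two transformers above with a scikit-learn `Pipeline` and a dask-backed dataset (the variable names and file path are hypothetical):

```python
import xarray as xr
from sklearn.pipeline import Pipeline

# Step 2: define a pipeline, same API as sklearn composite transformers
pipe = Pipeline(
    steps=[
        ("lon", LonCoordTransformer(lon_name="lon")),
        ("scale", DataArrayStandardScaler(dim="time")),
    ]
)

# Step 3: apply the pipeline; with a dask-backed DataArray the operations stay lazy
ds = xr.open_dataset("data.nc", chunks={"time": 100})  # hypothetical file
ssh = pipe.fit_transform(ds["ssh"])

# Dask magic: trigger the actual computation
ssh = ssh.compute()
```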
Why so complicated?
Why do we have an API on top of an API? Why not something simple like `torchvision.transforms.Compose` (example)?

- Reason I: We may want data-dependent transforms like the scaling transformation above. Those require a fit first and then apply.
- Reason II: scikit-learn is a mature API, so it should be relatively familiar to the community.
- Reason III: I've seen an example (Blog | Code) that works with Hydra, which is exactly what we want.
- Reason IV: I've also seen an example that works with `dask` using bob.pipelines. There may be some issues with some of the transformations (dask incompatibilities), but that can come later and we can go on a case-by-case basis. Should we need more advanced functionality (e.g. checkpointing, lazy operations), we can use something like that package.