Split registering multiple configs into a seperate function #804

jsfreischuetz · 2024-07-17T23:37:23Z

Currently, I am working on condensing the return values of suggest, and the arguments of register into a class, which in many ways acts as a named tuple. Because of this, I have been looking into the structure of these classes as per comments on #771.

With the current implementation, it is possible to register multiple configurations simultaneously using the register function, however, this feature does not seem to be utilized in the test cases nor in mlos_bench. Additionally, it causes our observations to be lists of variable-length data frames.

I would propose that instead of supporting bulk registrations directly in the register function, we split this functionality out into an additional function (potentially called bulk_register) that calls into register. This would have the added benefit that it would remove the variable length list of dataframes in our observations as seen in opimizer.py.

self._observations: List[Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]] = []

This would then allow for a representation of singular observations through an Observation object` turning the type on this object into:

self._observations: List[Observation] = []

Observation would be a implemented as such:

class Observation:
    def __init__(
        self,
        *,
        config: pd.Series,
        performance: Optional[pd.Series] = None,
        context: Optional[pd.Series] = None,
        metadata: Optional[pd.Series] = None,
    ):
        self.config = config
        self.performance = performance
        self.context = context
        self.metadata = metadata

...

The text was updated successfully, but these errors were encountered:

jsfreischuetz · 2024-07-17T23:37:31Z

@bpkroth @motus

bpkroth · 2024-07-18T18:00:29Z

Looks to me like we are using this functionality already:

MLOS/mlos_bench/mlos_bench/optimizers/mlos_core_optimizer.py

Line 120 in bb6c6f2

    
           self._opt.register(configs=df_configs, scores=df_scores[opt_targets].astype(float))

It's especially useful for resuming an experiment and re-warming the optimizer with prior results.
I think the rough idea was that for backend optimizers that supported it, we could amortize the cost of retraining the model.
That said, I don't think this is being done well atm, but let's leave that topic for a future PR.
(for instance, for SMAC, I think we probably want to throw out the existing backend optimizer and bulk register everything in one shot at its initialization whenever we do that)

Back to this change, which was really only meant for code readability improvements,
in part what you're proposing here is to change the API to return NamedTuples of individual (config, score, context, metadata) instances instead of DataFrames of the same. That is a much wider reaching change than I was originally envisioning, I think.

Your proposal of having a class called an Observation is nice, but what I might do instead is something more like the following (quickly hacked together with some copilot help).

Note that we could pretty easily extend some of these helper functions for conversion to other data structure types in the future as well.

Some caveats:

The __iter__ method returns the wrong type info for scores and configs due to needing to support optional values for contexts and metadatas.

An alternative would be to explicitly change the received return type in all callsites, but that might be a bunch of work too.

There's probably some other refinements that could happen as well. As I said, this is just a quick stub idea.

"""Simple dataclass for storing observations from a hyperparameter optimization run."""

import itertools
from dataclasses import dataclass
from typing import List, Optional, Iterator

import pandas as pd


@dataclass(frozen=True)
class Observation:
    """Simple dataclass for storing a single Observation."""

    config: pd.Series
    score: pd.Series
    context: Optional[pd.Series] = None
    metadata: Optional[pd.Series] = None

    def __iter__(self) -> Iterator[Optional[pd.Series]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more effecient return type than using astuple()
        # which makes deepcopies.
        return iter((self.config, self.score, self.context, self.metadata))


@dataclass(frozen=True)
class Observations:
    """Simple dataclass for storing observations from a hyperparameter optimization
    run.
    """

    configs: pd.DataFrame
    scores: pd.DataFrame
    context: Optional[pd.DataFrame] = None
    metadata: Optional[pd.DataFrame] = None

    def __post_init__(self) -> None:
        assert len(self.configs) == len(self.scores)
        if self.context is not None:
            assert len(self.configs) == len(self.context)
        if self.metadata is not None:
            assert len(self.configs) == len(self.metadata)

    def __iter__(self) -> Iterator[Optional[pd.DataFrame]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more effecient return type than using astuple()
        # which makes deepcopies.
        return iter((self.configs, self.scores, self.context, self.metadata))

    def to_observation_list(self) -> List[Observation]:
        """Convert the Observations object to a list of Observation objects."""
        return [
            Observation(
                config=config,
                score=score,
                context=context,
                metadata=metadata,
            )
            for config, score, context, metadata in zip(
                self.configs.iterrows(),
                self.scores.iterrows(),
                self.context.iterrows() if self.context is not None else itertools.repeat(None),
                self.metadata.iterrows() if self.metadata is not None else itertools.repeat(None),
            )
        ]


def get_observations() -> Observations:
    """Get some dummy observations."""
    # Create some dummy data
    configs = pd.DataFrame(
        {
            "x": [1, 2, 3],
            "y": [4, 5, 6],
        }
    )
    scores = pd.DataFrame(
        {
            "score": [0.1, 0.2, 0.3],
        }
    )

    # Create an Observations object
    return Observations(configs=configs, scores=scores)


def test_observations() -> None:
    """Test the Observations dataclass."""
    # Create an Observations object
    observations = get_observations()
    observation = observations.to_observation_list()[0]

    # Print the Observations object
    print(observations)
    print(observations.configs, observations.scores)
    print(observation)
    print(observation.config, observation.score)

    # Or in tuple form using the __iter__ method:
    configs, scores, contexts, metadatas = get_observations()
    print(configs)
    print(scores)
    print(contexts)
    print(metadatas)

    config, score, context, metadata = get_observations().to_observation_list()[0]
    print(config)
    print(score)
    print(context)
    print(metadata)


if __name__ == "__main__":
    test_observations()

bpkroth · 2024-07-18T18:52:37Z

And tbh, that __iter__ hack kinda defeats the purpose a bit of what we're trying to do with this change - avoid mistakes of reordering metadata, context in return(ed) values by making them more explicit.

bpkroth · 2024-11-22T20:44:38Z

Gonna claim that #852 handles this, or at least lays most of the ground work to handle bulk registering more easily.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split registering multiple configs into a seperate function #804

Split registering multiple configs into a seperate function #804

jsfreischuetz commented Jul 17, 2024 •

edited

Loading

jsfreischuetz commented Jul 17, 2024

bpkroth commented Jul 18, 2024 •

edited

Loading

bpkroth commented Jul 18, 2024

bpkroth commented Nov 22, 2024

Split registering multiple configs into a seperate function #804

Split registering multiple configs into a seperate function #804

Comments

jsfreischuetz commented Jul 17, 2024 • edited Loading

jsfreischuetz commented Jul 17, 2024

bpkroth commented Jul 18, 2024 • edited Loading

bpkroth commented Jul 18, 2024

bpkroth commented Nov 22, 2024

jsfreischuetz commented Jul 17, 2024 •

edited

Loading

bpkroth commented Jul 18, 2024 •

edited

Loading