
Split registering multiple configs into a separate function #804

Open
jsfreischuetz opened this issue Jul 17, 2024 · 4 comments

@jsfreischuetz
Contributor

jsfreischuetz commented Jul 17, 2024

Currently, I am working on condensing the return values of suggest and the arguments of register into a class, which in many ways acts as a named tuple. Because of this, I have been looking into the structure of these classes, as per the comments on #771.

With the current implementation, it is possible to register multiple configurations at once via the register function; however, this feature does not appear to be used in the test cases or in mlos_bench. It also forces our observations to be lists of variable-length data frames.

I would propose that instead of supporting bulk registration directly in the register function, we split this functionality out into an additional function (potentially called bulk_register) that calls into register. This would have the added benefit of removing the variable-length list of dataframes in our observations, as seen in optimizer.py:

self._observations: List[Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]] = []

This would then allow for a representation of singular observations through an Observation object, turning the type of this attribute into:

self._observations: List[Observation] = []

Observation would be implemented as follows:

class Observation:
    def __init__(
        self,
        *,
        config: pd.Series,
        performance: Optional[pd.Series] = None,
        context: Optional[pd.Series] = None,
        metadata: Optional[pd.Series] = None,
    ):
        self.config = config
        self.performance = performance
        self.context = context
        self.metadata = metadata

...
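To make the proposal concrete, here is a minimal, hypothetical sketch of how a bulk_register helper could delegate to register. The Optimizer class and the exact method signatures below are illustrative assumptions, not the actual mlos_core API:

```python
from dataclasses import dataclass
from typing import List, Optional

import pandas as pd


@dataclass
class Observation:
    """A single observation, as proposed above."""

    config: pd.Series
    performance: Optional[pd.Series] = None
    context: Optional[pd.Series] = None
    metadata: Optional[pd.Series] = None


class Optimizer:
    """Illustrative stand-in for the optimizer (not the real class)."""

    def __init__(self) -> None:
        # Each entry is a single observation, not a variable-length frame.
        self._observations: List[Observation] = []

    def register(self, observation: Observation) -> None:
        """Register exactly one observation."""
        self._observations.append(observation)

    def bulk_register(self, observations: List[Observation]) -> None:
        """Register many observations by delegating to register()."""
        for observation in observations:
            self.register(observation)


opt = Optimizer()
opt.bulk_register([
    Observation(config=pd.Series({"x": 1})),
    Observation(config=pd.Series({"x": 2})),
])
print(len(opt._observations))  # → 2
```

With this split, register keeps a single-observation contract while bulk_register remains a thin convenience wrapper on top of it.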
@jsfreischuetz
Contributor Author

@bpkroth @motus

@bpkroth
Contributor

bpkroth commented Jul 18, 2024

Looks to me like we are using this functionality already:

self._opt.register(configs=df_configs, scores=df_scores[opt_targets].astype(float))

It's especially useful for resuming an experiment and re-warming the optimizer with prior results.
I think the rough idea was that for backend optimizers that supported it, we could amortize the cost of retraining the model.
That said, I don't think this is being done well atm, but let's leave that topic for a future PR.
(for instance, for SMAC, I think we probably want to throw out the existing backend optimizer and bulk register everything in one shot at its initialization whenever we do that)

Back to this change, which was really only meant as a code readability improvement:
part of what you're proposing here is to change the API to return NamedTuples of individual (config, score, context, metadata) instances instead of DataFrames of the same. That is a much wider-reaching change than I was originally envisioning, I think.

Your proposal of having a class called an Observation is nice, but what I might do instead is something more like the following (quickly hacked together with some copilot help).

Note that we could pretty easily extend some of these helper functions for conversion to other data structure types in the future as well.

Some caveats:

The __iter__ method returns the wrong type info for scores and configs due to needing to support optional values for contexts and metadatas.

An alternative would be to explicitly change the received return type in all callsites, but that might be a bunch of work too.

There are probably other refinements that could happen as well. As I said, this is just a quick stub idea.

"""Simple dataclass for storing observations from a hyperparameter optimization run."""

import itertools
from dataclasses import dataclass
from typing import List, Optional, Iterator

import pandas as pd


@dataclass(frozen=True)
class Observation:
    """Simple dataclass for storing a single Observation."""

    config: pd.Series
    score: pd.Series
    context: Optional[pd.Series] = None
    metadata: Optional[pd.Series] = None

    def __iter__(self) -> Iterator[Optional[pd.Series]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more efficient return type than using astuple(),
        # which makes deepcopies.
        return iter((self.config, self.score, self.context, self.metadata))


@dataclass(frozen=True)
class Observations:
    """Simple dataclass for storing observations from a hyperparameter optimization
    run.
    """

    configs: pd.DataFrame
    scores: pd.DataFrame
    context: Optional[pd.DataFrame] = None
    metadata: Optional[pd.DataFrame] = None

    def __post_init__(self) -> None:
        assert len(self.configs) == len(self.scores)
        if self.context is not None:
            assert len(self.configs) == len(self.context)
        if self.metadata is not None:
            assert len(self.configs) == len(self.metadata)

    def __iter__(self) -> Iterator[Optional[pd.DataFrame]]:
        """A not quite type correct hack to allow existing code to use the Tuple style
        return values.
        """
        # Note: this should be a more efficient return type than using astuple(),
        # which makes deepcopies.
        return iter((self.configs, self.scores, self.context, self.metadata))

    def to_observation_list(self) -> List[Observation]:
        """Convert the Observations object to a list of Observation objects."""
        # Note: iterrows() yields (index, Series) pairs rather than plain Series,
        # so we index by position instead to get a Series for each field.
        return [
            Observation(
                config=self.configs.iloc[idx],
                score=self.scores.iloc[idx],
                context=self.context.iloc[idx] if self.context is not None else None,
                metadata=self.metadata.iloc[idx] if self.metadata is not None else None,
            )
            for idx in range(len(self.configs))
        ]


def get_observations() -> Observations:
    """Get some dummy observations."""
    # Create some dummy data
    configs = pd.DataFrame(
        {
            "x": [1, 2, 3],
            "y": [4, 5, 6],
        }
    )
    scores = pd.DataFrame(
        {
            "score": [0.1, 0.2, 0.3],
        }
    )

    # Create an Observations object
    return Observations(configs=configs, scores=scores)


def test_observations() -> None:
    """Test the Observations dataclass."""
    # Create an Observations object
    observations = get_observations()
    observation = observations.to_observation_list()[0]

    # Print the Observations object
    print(observations)
    print(observations.configs, observations.scores)
    print(observation)
    print(observation.config, observation.score)

    # Or in tuple form using the __iter__ method:
    configs, scores, contexts, metadatas = get_observations()
    print(configs)
    print(scores)
    print(contexts)
    print(metadatas)

    config, score, context, metadata = get_observations().to_observation_list()[0]
    print(config)
    print(score)
    print(context)
    print(metadata)


if __name__ == "__main__":
    test_observations()

@bpkroth
Contributor

bpkroth commented Jul 18, 2024

And tbh, that __iter__ hack kinda defeats part of the purpose of this change: avoiding mistakes of reordering metadata and context in returned values by making them more explicit.
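As an illustration of that failure mode (a hedged sketch reusing the Observation dataclass sketched above, not the actual API): positional unpacking via __iter__ lets a caller silently transpose fields, while attribute access cannot be reordered by accident.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

import pandas as pd


@dataclass(frozen=True)
class Observation:
    config: pd.Series
    score: pd.Series
    context: Optional[pd.Series] = None
    metadata: Optional[pd.Series] = None

    def __iter__(self) -> Iterator[Optional[pd.Series]]:
        # Tuple-style return order: (config, score, context, metadata)
        return iter((self.config, self.score, self.context, self.metadata))


obs = Observation(
    config=pd.Series({"x": 1}),
    score=pd.Series({"score": 0.1}),
    context=pd.Series({"host": "a"}),
    metadata=pd.Series({"trial": 7}),
)

# Positional unpacking: writing the names in the wrong order raises no
# error, and the fields are silently swapped.
config, score, metadata, context = obs  # bug: metadata/context transposed!
assert metadata is obs.context  # the mistake goes unnoticed

# Explicit attribute access has no ordering to get wrong:
assert obs.metadata["trial"] == 7
```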

@bpkroth
Contributor

bpkroth commented Nov 22, 2024

Gonna claim that #852 handles this, or at least lays most of the ground work to handle bulk registering more easily.
