add dataset link for estores #5062

Merged: ritch merged 8 commits into release/v1.1.0 from feat/estore-dataset-id on Nov 8, 2024
Conversation


@ritch ritch commented Nov 7, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced support for managing execution stores tied to specific datasets.
    • Added a cleanup function to remove execution data when datasets are deleted.
    • Enhanced the ExecutionStore and ExecutionStoreService to accept and utilize dataset IDs.
  • Bug Fixes

    • Improved key management in execution stores during dataset deletions.
  • Tests

    • Expanded unit tests to cover new dataset ID functionality in execution store operations.


coderabbitai bot commented Nov 7, 2024

Walkthrough

The changes introduce new functionalities related to managing execution stores for datasets in the FiftyOne framework. A new cleanup function is added to ensure that execution data is removed when a dataset is deleted. Additionally, several classes and methods are updated to accept a dataset_id, allowing operations to be scoped to specific datasets. This includes modifications across various files, enhancing the overall functionality and resource management of execution stores associated with datasets.

Changes

  • fiftyone/core/dataset.py: Added a _cleanup_execution_store_for_dataset(dataset_id) function; called in the delete method
  • fiftyone/factory/repo_factory.py: Updated the execution_store_repo() method to accept an optional dataset_id
  • fiftyone/factory/repos/execution_store.py: Modified ExecutionStoreRepo and MongoExecutionStoreRepo to handle dataset_id; added cleanup_for_dataset()
  • fiftyone/operators/executor.py: Updated the create_store method to include dataset_id when creating stores
  • fiftyone/operators/store/models.py: Added an optional dataset_id field to the KeyDocument class; updated the to_mongo_dict method
  • fiftyone/operators/store/service.py: Modified the ExecutionStoreService constructor to accept dataset_id; added a cleanup_for_dataset() method
  • fiftyone/operators/store/store.py: Updated the create method to accept dataset_id
  • tests/unittests/dataset_tests.py: Enhanced test_delete_dataset to include ExecutionStoreService logic for key management
  • tests/unittests/execution_store_tests.py: Added an ExecutionStoreServiceDatasetIdTests class to test dataset_id functionality
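
Taken together, these changes thread a dataset_id through the execution store stack. Below is a minimal sketch of the intended flow, assuming the entry points behave as summarized above; the store name and dataset ID are hypothetical, and the dataset deletion path is expected to trigger the same cleanup automatically via the fiftyone/core/dataset.py change.

import bson

from fiftyone.operators.store import ExecutionStore, ExecutionStoreService

dataset_id = bson.ObjectId()  # hypothetical dataset ID for illustration

# Stores created with a dataset_id are scoped to that dataset
store = ExecutionStore.create("my_store", dataset_id)

# The service can also be constructed directly with the same scoping
svc = ExecutionStoreService(dataset_id=dataset_id)

# Remove all execution store documents scoped to this dataset
# (this is what dataset deletion triggers)
svc.cleanup_for_dataset()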

Possibly related PRs

  • Execution Store #4827: The changes in this PR introduce the ExecutionStore component, which is directly related to the new _cleanup_execution_store_for_dataset function added in the main PR, as both involve managing execution stores and their associated cleanup operations.
  • Fix COCO category IDs #4884: This PR addresses issues with COCO category IDs, which may relate to the cleanup functionality in the main PR if the execution store is used to manage datasets that include COCO annotations.
  • Release/v1.0.1 #4914: The release notes for version 1.0.2 may include updates related to the execution store and its management, which connects to the changes made in the main PR regarding dataset cleanup.
  • Release v1.0.2 cherry picks #4991: This cherry-pick PR includes various updates that may indirectly relate to the execution store functionality and its integration with dataset management, aligning with the changes in the main PR.

Suggested labels

feature

Suggested reviewers

  • benjaminpkane
  • sashankaryal

🐰 In the meadow where datasets play,
A cleanup function saves the day!
With keys and stores all in line,
Each dataset's state will brightly shine.
So hop along, let’s celebrate,
Resource management feels just great! 🌼




@brimoor brimoor left a comment


Looking good!

Here's where to clean up store docs upon dataset.delete():

def _delete(self):
    self._sample_collection.drop()
    fos.Sample._reset_docs(self._sample_collection_name)

    # Clips datasets directly inherit frames from source dataset
    if self._frame_collection_name is not None and not self._is_clips:
        self._frame_collection.drop()
        fofr.Frame._reset_docs(self._frame_collection_name)

    # Update singleton
    self._instances.pop(self._doc.name, None)

    _delete_dataset_doc(self._doc)
    self._deleted = True
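
Based on the helper this PR ends up adding (see the later review of fiftyone/core/dataset.py), the hook reduces to a single call inside this method. A minimal sketch follows, with the exact placement assumed rather than taken from the final diff:

# Inside Dataset._delete(), after the sample/frame collections are dropped
# ("foos" is assumed to alias fiftyone.operators.store, matching the later change)
foos.cleanup_store_for_dataset(self._doc.id)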

@ritch ritch changed the title from [WIP] add dataset link for estores to add dataset link for estores on Nov 8, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)
fiftyone/operators/store/models.py (1)

34-35: Add a comment explaining the dataset_id serialization behavior.

The serialization logic is correct, but it would be helpful to document why we exclude null dataset_ids.

+    # Exclude dataset_id when None to maintain clean documents and backward compatibility
     if self.dataset_id is None:
         data.pop("dataset_id", None)
fiftyone/factory/repo_factory.py (1)

Line range hint 54-69: Singleton pattern may cause issues with dataset-specific repos

The current implementation uses a static dictionary to store repository instances, but the key is only based on the collection name, not the dataset_id. This means all datasets will share the same repository instance, which could lead to unexpected behavior when different dataset_ids are used.

Consider using a composite key that includes both collection name and dataset_id:

-        if (
-            MongoExecutionStoreRepo.COLLECTION_NAME
-            not in RepositoryFactory.repos
-        ):
-            RepositoryFactory.repos[
-                MongoExecutionStoreRepo.COLLECTION_NAME
-            ] = MongoExecutionStoreRepo(
-                collection=_get_db()[MongoExecutionStoreRepo.COLLECTION_NAME],
-                dataset_id=dataset_id,
-            )
-
-        return RepositoryFactory.repos[MongoExecutionStoreRepo.COLLECTION_NAME]
+        repo_key = f"{MongoExecutionStoreRepo.COLLECTION_NAME}:{dataset_id}"
+        if repo_key not in RepositoryFactory.repos:
+            RepositoryFactory.repos[repo_key] = MongoExecutionStoreRepo(
+                collection=_get_db()[MongoExecutionStoreRepo.COLLECTION_NAME],
+                dataset_id=dataset_id,
+            )
+
+        return RepositoryFactory.repos[repo_key]
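
To illustrate the hazard, consider a hypothetical call sequence against the factory (dataset IDs invented for illustration, and execution_store_repo() assumed to be exposed as a static factory method, as the diff suggests). With the collection-name-only cache key, the second call would receive the repo already cached for the first dataset and silently operate with the wrong scope; the composite key above makes the assertion below hold.

import bson

from fiftyone.factory.repo_factory import RepositoryFactory

dataset_a = bson.ObjectId()  # hypothetical
dataset_b = bson.ObjectId()  # hypothetical

repo_a = RepositoryFactory.execution_store_repo(dataset_id=dataset_a)
repo_b = RepositoryFactory.execution_store_repo(dataset_id=dataset_b)

# Distinct datasets should get distinct, correctly scoped repo instances
assert repo_a is not repo_b
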
fiftyone/operators/store/store.py (1)

19-24: Good architectural design for dataset scoping

The approach of using an optional dataset_id parameter is well-designed:

  • Maintains backward compatibility for global stores
  • Cleanly separates dataset-scoped stores
  • Follows the principle of least surprise by making the scoping explicit
  • Delegates implementation details to the service layer

Consider documenting this dataset scoping behavior in the class and method docstrings to make the functionality more discoverable.
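
For instance, the create() docstring could spell out the scoping along these lines. This is a sketch only: the constructor shape and the delegation shown in the body are assumptions, not the PR's actual implementation.

from typing import Optional

import bson

from fiftyone.operators.store import ExecutionStoreService


class ExecutionStore(object):
    def __init__(self, store_name: str, svc: ExecutionStoreService):
        # Assumed constructor shape, for illustration only
        self.store_name = store_name
        self._svc = svc

    @staticmethod
    def create(
        store_name: str, dataset_id: Optional[bson.ObjectId] = None
    ) -> "ExecutionStore":
        """Creates an execution store.

        Args:
            store_name: the name of the store
            dataset_id (None): an optional dataset ID. When provided, the store
                and its keys are scoped to that dataset and are removed when
                the dataset is deleted; when omitted, the store is global

        Returns:
            an ExecutionStore
        """
        # Construction details are assumed; the point is the documented contract
        return ExecutionStore(store_name, ExecutionStoreService(dataset_id=dataset_id))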

fiftyone/operators/executor.py (1)

889-890: Consider using a public API to access the dataset ID.

The implementation correctly associates the execution store with a dataset, which is good. However, it uses the private attribute _doc.id to access the dataset ID. Consider using a public API if available, as private attributes may change without notice.

If a public API exists, consider this alternative:

-        dataset_id = self.dataset._doc.id
+        dataset_id = self.dataset.id  # Assuming dataset.id exists as a public API
         return ExecutionStore.create(store_name, dataset_id)
fiftyone/core/dataset.py (1)

10182-10187: LGTM! Clean implementation of execution store cleanup.

The function has a clear purpose and follows the single responsibility principle. Consider adding debug logging to help track cleanup operations.

 def _cleanup_execution_store_for_dataset(dataset_id):
     """Cleans up the execution store for the given dataset."""
     from fiftyone.operators.store import ExecutionStoreService
+    logger.debug("Cleaning up execution store for dataset %s", dataset_id)
     svc = ExecutionStoreService(dataset_id=dataset_id)
     svc.cleanup_for_dataset()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7f73a78 and 1f6733d.

📒 Files selected for processing (9)
  • fiftyone/core/dataset.py (2 hunks)
  • fiftyone/factory/repo_factory.py (3 hunks)
  • fiftyone/factory/repos/execution_store.py (4 hunks)
  • fiftyone/operators/executor.py (1 hunks)
  • fiftyone/operators/store/models.py (3 hunks)
  • fiftyone/operators/store/service.py (2 hunks)
  • fiftyone/operators/store/store.py (2 hunks)
  • tests/unittests/dataset_tests.py (2 hunks)
  • tests/unittests/execution_store_tests.py (2 hunks)
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/service.py

21-21: Undefined name ExecutionStoreRepo

(F821)

🔇 Additional comments (27)
fiftyone/operators/store/models.py (1)

20-20: LGTM! Verify dataset_id usage across the codebase.

The field addition looks good and aligns with MongoDB's document model.

Let's verify the consistent usage of dataset_id across the codebase:

✅ Verification successful

Dataset ID field implementation is properly integrated and tested

The dataset_id field is consistently implemented across the codebase:

  • Properly propagated through the store creation chain in store.py
  • Correctly stored and used in execution_store.py
  • Comprehensive test coverage in execution_store_tests.py including store creation, key operations, and TTL updates with dataset_id
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent dataset_id handling in related files
# Expected: Find related classes that should also handle dataset_id

# Look for related store classes that might need dataset_id handling
rg -l "class.*Store" --type py

# Check for existing dataset_id usage patterns
rg "dataset_id.*bson\.ObjectId" --type py

Length of output: 726


Script:

#!/bin/bash
# Let's check the implementation details of dataset_id handling in the found files

# Check store.py implementation
echo "=== store.py implementation ==="
rg -A 10 "dataset_id.*bson\.ObjectId" fiftyone/operators/store/store.py

# Check execution_store.py implementation
echo -e "\n=== execution_store.py implementation ==="
rg -A 10 "dataset_id.*bson\.ObjectId" fiftyone/factory/repos/execution_store.py

# Check for any dataset_id usage in tests
echo -e "\n=== Test implementations ==="
rg "dataset_id" tests/unittests/execution_store_tests.py

Length of output: 2638

fiftyone/operators/store/store.py (2)

9-9: LGTM: Import addition is appropriate

The bson import is correctly added to support the new dataset_id parameter type annotation.


19-24: Implementation looks good, verify API usage

The changes to the create() method are well-implemented:

  • Optional parameter maintains backward compatibility
  • Type annotation is correct
  • Parameter is properly forwarded to the service layer

Let's verify the usage of this API across the codebase to ensure all callers are updated appropriately:

✅ Verification successful

Let me check the implementation of the caller to ensure it's handling the parameter correctly.


API usage is correctly updated

The only caller of ExecutionStore.create() is in fiftyone/operators/executor.py, and it correctly passes both the store_name and dataset_id parameters. The dataset_id is obtained from the dataset document's ID (self.dataset._doc.id), which is exactly what we expect for dataset-specific operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for all calls to ExecutionStore.create to verify they handle the new parameter correctly
ast-grep --pattern 'ExecutionStore.create($$$)'

Length of output: 145


Script:

#!/bin/bash
# Get context around the ExecutionStore.create call
rg -B 10 -A 5 "ExecutionStore.create\(" fiftyone/operators/executor.py

Length of output: 479

tests/unittests/dataset_tests.py (2)

126-127: LGTM: Clean import addition

The import of ExecutionStoreService is appropriately placed and follows the existing import structure.


143-153: LGTM: Proper execution store cleanup verification

The test additions effectively verify that:

  1. Execution store keys can be set for a dataset
  2. Keys are properly retrieved before deletion
  3. Keys are cleaned up when the dataset is deleted

The test structure follows good practices:

  • Gets dataset ID before deletion
  • Creates store service with correct dataset ID
  • Uses a test store name
  • Verifies keys before and after deletion
fiftyone/core/dataset.py (1)

5188-5189: LGTM! Good integration of execution store cleanup.

The execution store cleanup is called at the right place in the deletion flow, ensuring proper cleanup of associated data.

fiftyone/factory/repos/execution_store.py (19)

6-6: Importing 'bson' module is necessary for ObjectId usage

The import of bson is appropriate as bson.ObjectId is used in the code, ensuring proper type annotations and functionality.


11-16: Including 'dataset_id' in query construction enhances data scoping

Adding dataset_id to the _where function allows queries to be scoped to specific datasets, promoting data isolation and integrity. The implementation correctly includes dataset_id in the query when it is not None.
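
For reference, a query helper with the behavior described here might look like the following; this is a sketch consistent with these comments, not necessarily the PR's exact code.

def _where(store_name, key=None, dataset_id=None):
    """Builds a MongoDB filter scoped to a store, and optionally to a key
    and/or dataset."""
    query = {"store_name": store_name}
    if key is not None:
        query["key"] = key
    if dataset_id is not None:
        # Only scope by dataset when one is provided; global stores omit it
        query["dataset_id"] = dataset_id
    return query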


25-29: Adding 'dataset_id' parameter to the initializer

Including dataset_id as an optional parameter in the __init__ method allows the repository to operate within the context of a specific dataset. Storing it as self._dataset_id is appropriate for use in other methods.


33-37: Associating stores with 'dataset_id' in 'create_store'

By including dataset_id in the StoreDocument, stores are correctly associated with the specified dataset, ensuring that store operations are scoped appropriately.


56-56: Including 'dataset_id' in 'KeyDocument' for key association

Adding dataset_id to the KeyDocument in the set_key method ensures that keys are associated with the correct dataset, enhancing data integrity.


59-65: Populating 'on_insert_fields' with 'dataset_id'

Including dataset_id in on_insert_fields ensures that the dataset association is maintained during insert operations, which is crucial for data scoping.


67-69: Conditionally removing 'dataset_id' when it is None

Properly handling the case when self._dataset_id is None by removing dataset_id from on_insert_fields avoids inserting null or undefined dataset_ids, preventing potential query issues.


76-88: Excluding non-updatable fields during upsert operations

When performing the upsert operation, excluding fields like dataset_id (along with other non-updatable fields) from $set ensures that these fields remain consistent and are not altered inadvertently.
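
A sketch of the upsert pattern being described, using pymongo's $set/$setOnInsert split; field names such as value, created_at, and updated_at are assumptions for illustration.

import datetime


def set_key(collection, store_name, key, value, dataset_id=None):
    now = datetime.datetime.utcnow()

    query = {"store_name": store_name, "key": key}
    if dataset_id is not None:
        query["dataset_id"] = dataset_id

    # Fields written only when the document is first inserted
    on_insert_fields = {"dataset_id": dataset_id, "created_at": now}
    if dataset_id is None:
        on_insert_fields.pop("dataset_id")

    collection.update_one(
        query,
        {
            # Mutable fields are updated on every call...
            "$set": {"value": value, "updated_at": now},
            # ...while dataset_id and creation time are never overwritten
            "$setOnInsert": on_insert_fields,
        },
        upsert=True,
    )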


104-106: Using '_where' function with 'dataset_id' in 'get_key'

The use of _where(store_name, key, self._dataset_id) ensures that the get_key method retrieves keys specific to the dataset, maintaining data isolation.


112-114: Including 'dataset_id' in 'list_keys' method

By passing dataset_id=self._dataset_id to _where, the list_keys method correctly lists keys scoped to the specific dataset.


121-124: Updating TTL with dataset-specific scoping

The update_ttl method correctly includes self._dataset_id in the query, ensuring that the TTL is updated only for keys within the specific dataset.


128-130: Deleting keys within the correct dataset scope

Including self._dataset_id in the delete_one query ensures that only the intended key within the specific dataset is deleted.


135-139: Deleting stores with dataset scoping in 'delete_store'

The delete_store method correctly scopes the deletion to the specified dataset by including self._dataset_id in the query.


140-143: Implementing 'cleanup_for_dataset' method for dataset-specific cleanup

The cleanup_for_dataset method provides functionality to delete all keys associated with a specific dataset, which is essential for proper resource management upon dataset deletion.


149-153: Passing 'dataset_id' to the parent class in 'MongoExecutionStoreRepo'

The __init__ method of MongoExecutionStoreRepo correctly passes dataset_id to the parent class, ensuring consistent dataset scoping across inherited methods.


156-161: Defining index names and including 'dataset_id'

Defining dataset_id_name and including it in index definitions ensures that indexes are correctly applied to the dataset_id field, improving query performance.


168-171: Updating unique index to include 'dataset_id'

Including dataset_id in the unique index on ('store_name', 'key', 'dataset_id') ensures that keys are unique within each dataset, preventing key collisions across datasets.


172-174: Creating individual indexes for efficient queries

Creating indexes on store_name, key, and dataset_id individually can improve query performance when filtering on these fields.
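
A sketch of what these index definitions could look like with pymongo, using the fields named in this review; index names and any options beyond uniqueness are assumptions.

import pymongo


def _create_indexes(collection):
    # Keys are unique per (store_name, key, dataset_id), so the same key name
    # can exist independently in different datasets (or globally)
    collection.create_index(
        [
            ("store_name", pymongo.ASCENDING),
            ("key", pymongo.ASCENDING),
            ("dataset_id", pymongo.ASCENDING),
        ],
        unique=True,
    )

    # Individual indexes to speed up lookups that filter on a single field
    for field in ("store_name", "key", "dataset_id"):
        collection.create_index(field)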


90-94: Verify the impact of queries when 'dataset_id' is None

When dataset_id is None, the queries do not include dataset_id, potentially affecting documents across all datasets. Please verify that this behavior is intended and that operations are correctly scoped to prevent unintended access or modification of data from other datasets.

Run the following script to check if there are any unintended documents being accessed when dataset_id is None:

✅ Verification successful

Behavior of dataset_id=None is intentionally designed and safe

The code analysis reveals that the handling of dataset_id is properly implemented with built-in safeguards:

  • The _where function explicitly builds queries that scope operations to a specific dataset only when dataset_id is provided
  • When initializing documents, there's specific logic to handle None dataset_id cases:
    if self._dataset_id is None:
        on_insert_fields.pop("dataset_id")
  • The class maintains unique indexes on (store_name, key, dataset_id) combination, preventing any accidental cross-dataset conflicts
  • All operations consistently use the same _where function, ensuring uniform scoping behavior
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: List all documents where 'dataset_id' does not exist.
# Expected result: Only documents that are meant to be global (not associated with any dataset).

# Find documents without 'dataset_id' field
mongo --eval 'db.execution_store.find({"dataset_id": {"$exists": false}}).pretty()'

Length of output: 158


Script:

#!/bin/bash
# Let's analyze the codebase to understand the usage of dataset_id in execution_store
# First, let's find where _where function is defined and how it handles dataset_id
rg -A 5 "_where\s*=?\s*\(" fiftyone/factory/repos/

# Then check for any other usages of dataset_id in execution store
rg "dataset_id" fiftyone/factory/repos/execution_store.py

# Look for any tests that might verify this behavior
fd "test.*execution.*store.*py" -t f

Length of output: 4537

tests/unittests/execution_store_tests.py (2)

229-362: Tests for dataset_id functionality are comprehensive and well-structured

The newly added test methods effectively verify that the dataset_id is correctly incorporated in all ExecutionStoreService operations. The assertions confirm that database queries and updates include the dataset_id, ensuring proper scoping and data integrity.


313-326: ⚠️ Potential issue

Fix incorrect indentation in test_create_store_with_dataset_id method

The method test_create_store_with_dataset_id is indented too far, causing it to be nested within the previous method test_delete_key_with_dataset_id. This will result in a syntax error, and the method will not be recognized as part of the ExecutionStoreServiceDatasetIdTests class.

Apply this diff to correct the indentation:

-            def test_create_store_with_dataset_id(self):
+    def test_create_store_with_dataset_id(self):

Ensure all lines within the method are adjusted accordingly.

Likely invalid or redundant comment.

ritch and others added 2 commits November 7, 2024 17:28
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1f6733d and e12a841.

📒 Files selected for processing (2)
  • fiftyone/factory/repo_factory.py (3 hunks)
  • fiftyone/operators/store/service.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • fiftyone/factory/repo_factory.py
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/service.py

20-20: Undefined name ExecutionStoreRepo

(F821)

🔇 Additional comments (3)
fiftyone/operators/store/service.py (3)

9-9: LGTM: Import addition is appropriate

The addition of the bson import is necessary for typing the new dataset_id parameter.


21-37: LGTM: Dataset scoping implementation is correct

The dataset_id parameter is:

  • Properly typed as Optional[bson.ObjectId]
  • Well-documented in the docstring
  • Correctly passed to the repository factory

141-143: LGTM: Clean implementation of dataset cleanup

The cleanup_for_dataset method:

  • Has clear documentation
  • Correctly delegates to the repository layer
  • Follows single responsibility principle
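
Putting these pieces together, the dataset-scoping surface of the service presumably looks roughly like the following; this is reconstructed from the review comments, and the signatures and internals are assumptions.

from typing import Optional

import bson

from fiftyone.factory.repo_factory import RepositoryFactory


class ExecutionStoreService(object):
    """Service for managing execution store operations.

    Args:
        repo (None): an optional execution store repository instance
        dataset_id (None): an optional dataset ID to scope all operations to
    """

    def __init__(self, repo=None, dataset_id: Optional[bson.ObjectId] = None):
        if repo is None:
            # The factory is assumed to hand back a repo keyed on
            # (collection name, dataset_id), per the earlier review comment
            repo = RepositoryFactory.execution_store_repo(dataset_id=dataset_id)

        self._repo = repo

    def cleanup_for_dataset(self) -> None:
        """Deletes all store documents associated with this service's dataset."""
        self._repo.cleanup_for_dataset()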

@ritch ritch enabled auto-merge (rebase) November 8, 2024 01:00
@brimoor brimoor changed the base branch from develop to release/v1.1.0 November 8, 2024 06:15
@brimoor brimoor disabled auto-merge November 8, 2024 19:40

@brimoor brimoor left a comment


LGTM


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (6)
fiftyone/operators/store/__init__.py (2)

8-12: LGTM! Consider documenting the re-exported entities.

The imports look good for an __init__.py file that re-exports these entities. However, it would be helpful to add a docstring listing the public API that this module exposes.

Add a module-level docstring section like this:

"""
FiftyOne execution store module.

| Copyright 2017-2024, Voxel51, Inc.
| `voxel51.com <https://voxel51.com/>`_
|

Public API
----------
- ExecutionStore: Main class for managing execution stores
- ExecutionStoreService: Service for managing execution store operations
- cleanup_store_for_dataset: Function to cleanup dataset-specific store data
- StoreDocument, KeyDocument: Data models for store entries
"""
🧰 Tools
🪛 Ruff

10-10: .service.cleanup_store_for_dataset imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


10-10: .service.ExecutionStoreService imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


11-11: .store.ExecutionStore imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


12-12: .models.StoreDocument imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


12-12: .models.KeyDocument imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


14-19: Consider using an explicit __all__ list for better API control.

While dynamic __all__ generation is convenient, it may accidentally expose internal implementation details. For a public API, it's often better to be explicit about what should be exposed.

Consider replacing the dynamic generation with an explicit list:

-__all__ = [
-    k
-    for k, v in globals().items()
-    if not k.startswith("_") and not isinstance(v, types.ModuleType)
-]
+__all__ = [
+    "ExecutionStore",
+    "ExecutionStoreService",
+    "cleanup_store_for_dataset",
+    "StoreDocument",
+    "KeyDocument",
+]

This makes the public API more intentional and easier to maintain, as you'll need to explicitly add new entities that should be public.

fiftyone/operators/store/service.py (2)

19-27: Add type hint for dataset_id parameter

The function should include type annotation for better code maintainability and IDE support.

-def cleanup_store_for_dataset(dataset_id):
+def cleanup_store_for_dataset(dataset_id: bson.ObjectId):

38-41: Standardize docstring format

The docstring parameter format is inconsistent. Parameters should use the standard format: param_name: description.

         Args:
-            repo (None): execution store repository instance
-            dataset_id (None): dataset ID to scope operations to
+            repo: Optional execution store repository instance
+            dataset_id: Optional dataset ID to scope operations to
fiftyone/core/odm/database.py (1)

1130-1139: Add error handling for store operations.

The implementation correctly integrates execution store cleanup into the dataset deletion flow. However, consider adding error handling for the store operations to ensure graceful failure.

Apply this diff to add error handling:

    dataset_id = dataset_dict["_id"]
-   svc = foos.ExecutionStoreService(dataset_id=dataset_id)
-   num_docs = sum(
-       len(svc.list_keys(store_name)) for store_name in svc.list_stores()
-   )
-   if num_docs > 0:
-       _logger.info("Deleting %d store doc(s)", num_docs)
-       if not dry_run:
-           foos.cleanup_store_for_dataset(dataset_id)
+   try:
+       svc = foos.ExecutionStoreService(dataset_id=dataset_id)
+       num_docs = sum(
+           len(svc.list_keys(store_name)) for store_name in svc.list_stores()
+       )
+       if num_docs > 0:
+           _logger.info("Deleting %d store doc(s)", num_docs)
+           if not dry_run:
+               foos.cleanup_store_for_dataset(dataset_id)
+   except Exception as e:
+       _logger.warning("Failed to cleanup execution store: %s", str(e))
fiftyone/core/dataset.py (1)

5184-5185: LGTM! Consider adding error handling for cleanup.

The addition of execution store cleanup when deleting a dataset is a good practice to prevent orphaned data. However, consider wrapping the cleanup in a try-except to ensure dataset deletion completes even if cleanup fails.

-        foos.cleanup_store_for_dataset(self._doc.id)
+        try:
+            foos.cleanup_store_for_dataset(self._doc.id)
+        except Exception as e:
+            logger.warning(
+                "Failed to cleanup execution store for dataset %s: %s",
+                self._doc.id,
+                str(e)
+            )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e12a841 and e7e7c26.

📒 Files selected for processing (6)
  • fiftyone/core/dataset.py (2 hunks)
  • fiftyone/core/odm/database.py (2 hunks)
  • fiftyone/operators/store/__init__.py (1 hunks)
  • fiftyone/operators/store/service.py (2 hunks)
  • fiftyone/operators/store/store.py (4 hunks)
  • tests/unittests/dataset_tests.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • fiftyone/operators/store/store.py
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/__init__.py

10-10: .service.cleanup_store_for_dataset imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


10-10: .service.ExecutionStoreService imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


11-11: .store.ExecutionStore imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


12-12: .models.StoreDocument imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


12-12: .models.KeyDocument imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

fiftyone/operators/store/service.py

33-33: Undefined name ExecutionStoreRepo

(F821)

🔇 Additional comments (5)
fiftyone/operators/store/service.py (2)

9-12: LGTM: Required imports added correctly

The new imports are properly added to support the dataset ID functionality.


154-158: LGTM: Clean implementation of dataset cleanup

The method correctly delegates the cleanup operation to the repository layer while maintaining clear documentation.

fiftyone/core/odm/database.py (1)

1130-1139: Verify execution store service integration.

Let's verify the integration with the execution store service.

✅ Verification successful

Based on the shell script results, I can generate the final response as the verification is complete. The results show:

  1. ExecutionStoreService is properly defined in fiftyone/operators/store/service.py
  2. cleanup_store_for_dataset is properly defined in the same file
  3. Both components are exported via fiftyone/operators/store/__init__.py
  4. The integration is well-tested with dedicated test classes
  5. The usage in database.py matches the established patterns in other files

Integration with execution store service is properly implemented

The code correctly uses the ExecutionStoreService and cleanup_store_for_dataset functions from the store module. The implementation follows the established patterns seen across the codebase, particularly in core/dataset.py. The integration is also well-tested with dedicated test classes covering both general integration and dataset-specific scenarios.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the ExecutionStoreService import and cleanup function.

# Test 1: Check if the ExecutionStoreService is properly imported
rg -A 2 "class ExecutionStoreService"

# Test 2: Check if cleanup_store_for_dataset function exists
rg -A 2 "def cleanup_store_for_dataset"

# Test 3: Check for any existing usages of these components
rg "ExecutionStoreService|cleanup_store_for_dataset"

Length of output: 3030

tests/unittests/dataset_tests.py (2)

29-29: LGTM: Import ExecutionStoreService for execution store tests

The import is correctly placed and follows the file's import organization pattern.


142-157: LGTM: Comprehensive test coverage for execution store cleanup

The test properly verifies that:

  1. Execution store keys can be set and retrieved
  2. Keys are properly cleaned up when the dataset is deleted
  3. The test maintains backwards compatibility with existing dataset deletion checks

The test structure follows good practices with clear setup, action, and verification steps.
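
Based on that description, the execution-store portion of the test presumably reads something like the following; this is a reconstruction, with method names such as set_key assumed from the repository-level API discussed above, not the literal test code.

import fiftyone as fo

from fiftyone.operators.store import ExecutionStoreService


def test_delete_dataset_cleans_up_execution_store():
    dataset = fo.Dataset()
    dataset_id = dataset._doc.id  # captured before deletion

    svc = ExecutionStoreService(dataset_id=dataset_id)
    svc.set_key("test_store", "test_key", "test_value")
    assert len(svc.list_keys("test_store")) == 1

    dataset.delete()

    # Deleting the dataset should remove its execution store keys
    assert len(svc.list_keys("test_store")) == 0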

@ritch ritch merged commit 172e700 into release/v1.1.0 Nov 8, 2024
11 checks passed
@ritch ritch deleted the feat/estore-dataset-id branch November 8, 2024 19:49