add dataset link for estores #5062
Conversation
Walkthrough
The changes introduce new functionality for managing execution stores for datasets in the FiftyOne framework. A new cleanup function is added to ensure that execution data is removed when a dataset is deleted. Additionally, several classes and methods are updated to accept an optional dataset ID, scoping store operations to specific datasets.
Looking good!
Here's where to clean up store docs upon dataset.delete():
fiftyone/fiftyone/core/dataset.py
Lines 5174 to 5187 in 7f73a78
def _delete(self):
    self._sample_collection.drop()
    fos.Sample._reset_docs(self._sample_collection_name)

    # Clips datasets directly inherit frames from source dataset
    if self._frame_collection_name is not None and not self._is_clips:
        self._frame_collection.drop()
        fofr.Frame._reset_docs(self._frame_collection_name)

    # Update singleton
    self._instances.pop(self._doc.name, None)

    _delete_dataset_doc(self._doc)
    self._deleted = True
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (5)
fiftyone/operators/store/models.py (1)
34-35: Add a comment explaining the dataset_id serialization behavior.
The serialization logic is correct, but it would be helpful to document why we exclude null dataset_ids.

+ # Exclude dataset_id when None to maintain clean documents and backward compatibility
  if self.dataset_id is None:
      data.pop("dataset_id", None)
fiftyone/factory/repo_factory.py (1)
Line range hint 54-69: Singleton pattern may cause issues with dataset-specific repos
The current implementation uses a static dictionary to store repository instances, but the key is only based on the collection name, not the dataset_id. This means all datasets will share the same repository instance, which could lead to unexpected behavior when different dataset_ids are used.
Consider using a composite key that includes both collection name and dataset_id:

- if (
-     MongoExecutionStoreRepo.COLLECTION_NAME
-     not in RepositoryFactory.repos
- ):
-     RepositoryFactory.repos[
-         MongoExecutionStoreRepo.COLLECTION_NAME
-     ] = MongoExecutionStoreRepo(
-         collection=_get_db()[MongoExecutionStoreRepo.COLLECTION_NAME],
-         dataset_id=dataset_id,
-     )
-
- return RepositoryFactory.repos[MongoExecutionStoreRepo.COLLECTION_NAME]
+ repo_key = f"{MongoExecutionStoreRepo.COLLECTION_NAME}:{dataset_id}"
+ if repo_key not in RepositoryFactory.repos:
+     RepositoryFactory.repos[repo_key] = MongoExecutionStoreRepo(
+         collection=_get_db()[MongoExecutionStoreRepo.COLLECTION_NAME],
+         dataset_id=dataset_id,
+     )
+
+ return RepositoryFactory.repos[repo_key]

fiftyone/operators/store/store.py (1)
19-24: Good architectural design for dataset scoping
The approach of using an optional dataset_id parameter is well-designed:
- Maintains backward compatibility for global stores
- Cleanly separates dataset-scoped stores
- Follows the principle of least surprise by making the scoping explicit
- Delegates implementation details to the service layer
Consider documenting this dataset scoping behavior in the class and method docstrings to make the functionality more discoverable.
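For illustration, a hedged usage sketch of this scoped factory (the store name and dataset below are made up, and the keyword form of dataset_id is an assumption):

import fiftyone as fo
from fiftyone.operators.store import ExecutionStore

dataset = fo.Dataset()

# Global store: omitting dataset_id preserves the existing behavior
global_store = ExecutionStore.create("my_plugin_store")

# Dataset-scoped store: keys are associated with this dataset and are
# cleaned up when the dataset is deleted
scoped_store = ExecutionStore.create(
    "my_plugin_store", dataset_id=dataset._doc.id
)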
fiftyone/operators/executor.py (1)
889-890: Consider using a public API to access the dataset ID.
The implementation correctly associates the execution store with a dataset, which is good. However, it uses the private attribute _doc.id to access the dataset ID. Consider using a public API if available, as private attributes may change without notice.
If a public API exists, consider this alternative:

- dataset_id = self.dataset._doc.id
+ dataset_id = self.dataset.id  # Assuming dataset.id exists as a public API
  return ExecutionStore.create(store_name, dataset_id)
10182-10187: LGTM! Clean implementation of execution store cleanup.
The function has a clear purpose and follows the single responsibility principle. Consider adding debug logging to help track cleanup operations.

  def _cleanup_execution_store_for_dataset(dataset_id):
      """Cleans up the execution store for the given dataset."""
      from fiftyone.operators.store import ExecutionStoreService

+     logger.debug("Cleaning up execution store for dataset %s", dataset_id)
      svc = ExecutionStoreService(dataset_id=dataset_id)
      svc.cleanup_for_dataset()
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (9)
- fiftyone/core/dataset.py (2 hunks)
- fiftyone/factory/repo_factory.py (3 hunks)
- fiftyone/factory/repos/execution_store.py (4 hunks)
- fiftyone/operators/executor.py (1 hunks)
- fiftyone/operators/store/models.py (3 hunks)
- fiftyone/operators/store/service.py (2 hunks)
- fiftyone/operators/store/store.py (2 hunks)
- tests/unittests/dataset_tests.py (2 hunks)
- tests/unittests/execution_store_tests.py (2 hunks)
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/service.py
21-21: Undefined name ExecutionStoreRepo (F821)
🔇 Additional comments (27)
fiftyone/operators/store/models.py (1)
20-20: LGTM! Verify dataset_id usage across the codebase.
The field addition looks good and aligns with MongoDB's document model.
Let's verify the consistent usage of dataset_id across the codebase:
✅ Verification successful
Dataset ID field implementation is properly integrated and tested
The dataset_id field is consistently implemented across the codebase:
- Properly propagated through the store creation chain in store.py
- Correctly stored and used in execution_store.py
- Comprehensive test coverage in execution_store_tests.py, including store creation, key operations, and TTL updates with dataset_id
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistent dataset_id handling in related files
# Expected: Find related classes that should also handle dataset_id
# Look for related store classes that might need dataset_id handling
rg -l "class.*Store" --type py
# Check for existing dataset_id usage patterns
rg "dataset_id.*bson\.ObjectId" --type py
Length of output: 726
Script:
#!/bin/bash
# Let's check the implementation details of dataset_id handling in the found files
# Check store.py implementation
echo "=== store.py implementation ==="
rg -A 10 "dataset_id.*bson\.ObjectId" fiftyone/operators/store/store.py
# Check execution_store.py implementation
echo -e "\n=== execution_store.py implementation ==="
rg -A 10 "dataset_id.*bson\.ObjectId" fiftyone/factory/repos/execution_store.py
# Check for any dataset_id usage in tests
echo -e "\n=== Test implementations ==="
rg "dataset_id" tests/unittests/execution_store_tests.py
Length of output: 2638
fiftyone/operators/store/store.py (2)
9-9: LGTM: Import addition is appropriate
The bson import is correctly added to support the new dataset_id parameter type annotation.
19-24: Implementation looks good, verify API usage
The changes to the create() method are well-implemented:
- Optional parameter maintains backward compatibility
- Type annotation is correct
- Parameter is properly forwarded to the service layer
Let's verify the usage of this API across the codebase to ensure all callers are updated appropriately:
✅ Verification successful
Let me check the implementation of the caller to ensure it's handling the parameter correctly.
API usage is correctly updated
The only caller of ExecutionStore.create() is in fiftyone/operators/executor.py, and it correctly passes both the store_name and dataset_id parameters. The dataset_id is obtained from the dataset document's ID (self.dataset._doc.id), which is exactly what we expect for dataset-specific operations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for all calls to ExecutionStore.create to verify they handle the new parameter correctly
ast-grep --pattern 'ExecutionStore.create($$$)'
Length of output: 145
Script:
#!/bin/bash
# Get context around the ExecutionStore.create call
rg -B 10 -A 5 "ExecutionStore.create\(" fiftyone/operators/executor.py
Length of output: 479
tests/unittests/dataset_tests.py (2)
126-127: LGTM: Clean import addition
The import of ExecutionStoreService is appropriately placed and follows the existing import structure.
143-153: LGTM: Proper execution store cleanup verification
The test additions effectively verify that:
- Execution store keys can be set for a dataset
- Keys are properly retrieved before deletion
- Keys are cleaned up when the dataset is deleted
The test structure follows good practices:
- Gets dataset ID before deletion
- Creates store service with correct dataset ID
- Uses a test store name
- Verifies keys before and after deletion
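A hedged sketch of what such a test could look like (store/key names are illustrative, and the set_key/list_keys signatures are assumed from the service API referenced in this review rather than copied from the PR):

import unittest

import fiftyone as fo
from fiftyone.operators.store import ExecutionStoreService


class ExecutionStoreCleanupTests(unittest.TestCase):
    def test_keys_removed_on_dataset_delete(self):
        dataset = fo.Dataset()
        dataset_id = dataset._doc.id

        # Write a key into a store scoped to this dataset
        svc = ExecutionStoreService(dataset_id=dataset_id)
        svc.set_key("test_store", "test_key", "test_value")
        self.assertEqual(len(svc.list_keys("test_store")), 1)

        # Deleting the dataset should also remove its store keys
        dataset.delete()
        self.assertEqual(len(svc.list_keys("test_store")), 0)


if __name__ == "__main__":
    unittest.main()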
fiftyone/core/dataset.py (1)
5188-5189: LGTM! Good integration of execution store cleanup.
The execution store cleanup is called at the right place in the deletion flow, ensuring proper cleanup of associated data.
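Roughly, the _delete() flow quoted at the top of this thread would gain the cleanup hook as sketched below (the helper name follows the later revision of this PR, foos.cleanup_store_for_dataset, and the exact placement shown here is illustrative):

def _delete(self):
    self._sample_collection.drop()
    fos.Sample._reset_docs(self._sample_collection_name)

    # Clips datasets directly inherit frames from source dataset
    if self._frame_collection_name is not None and not self._is_clips:
        self._frame_collection.drop()
        fofr.Frame._reset_docs(self._frame_collection_name)

    # New: remove any execution store documents scoped to this dataset
    foos.cleanup_store_for_dataset(self._doc.id)

    # Update singleton
    self._instances.pop(self._doc.name, None)

    _delete_dataset_doc(self._doc)
    self._deleted = True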
fiftyone/factory/repos/execution_store.py (19)
6-6: Importing 'bson' module is necessary for ObjectId usage
The import of bson is appropriate as bson.ObjectId is used in the code, ensuring proper type annotations and functionality.
11-16: Including 'dataset_id' in query construction enhances data scoping
Adding dataset_id to the _where function allows queries to be scoped to specific datasets, promoting data isolation and integrity. The implementation correctly includes dataset_id in the query when it is not None.
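A minimal sketch of how such a query helper might build the filter (a simplified reconstruction, not the exact source):

from typing import Optional

import bson


def _where(
    store_name: str,
    key: Optional[str] = None,
    dataset_id: Optional[bson.ObjectId] = None,
) -> dict:
    # Always filter by store; add key/dataset_id only when provided so that
    # global (unscoped) stores keep working unchanged
    query = {"store_name": store_name}
    if key is not None:
        query["key"] = key
    if dataset_id is not None:
        query["dataset_id"] = dataset_id
    return query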
25-29: Adding 'dataset_id' parameter to the initializer
Including dataset_id as an optional parameter in the __init__ method allows the repository to operate within the context of a specific dataset. Storing it as self._dataset_id is appropriate for use in other methods.
33-37: Associating stores with 'dataset_id' in 'create_store'
By including dataset_id in the StoreDocument, stores are correctly associated with the specified dataset, ensuring that store operations are scoped appropriately.
56-56: Including 'dataset_id' in 'KeyDocument' for key association
Adding dataset_id to the KeyDocument in the set_key method ensures that keys are associated with the correct dataset, enhancing data integrity.
59-65: Populating 'on_insert_fields' with 'dataset_id'
Including dataset_id in on_insert_fields ensures that the dataset association is maintained during insert operations, which is crucial for data scoping.
67-69: Conditionally removing 'dataset_id' when it is None
Properly handling the case when self._dataset_id is None by removing dataset_id from on_insert_fields avoids inserting null or undefined dataset_ids, preventing potential query issues.
76-88: Excluding non-updatable fields during upsert operations
When performing the upsert operation, excluding fields like dataset_id (along with other non-updatable fields) from $set ensures that these fields remain consistent and are not altered inadvertently.
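A simplified, self-contained sketch of the upsert pattern these two comments describe, using pymongo directly (the function and field names here are assumptions for illustration, not the repo's actual code):

import datetime
from typing import Any, Optional

import bson
from pymongo.collection import Collection


def set_key_scoped(
    collection: Collection,
    store_name: str,
    key: str,
    value: Any,
    dataset_id: Optional[bson.ObjectId] = None,
) -> None:
    now = datetime.datetime.utcnow()

    # Fields written only when the document is first inserted
    on_insert_fields = {"created_at": now, "dataset_id": dataset_id}
    if dataset_id is None:
        # Avoid persisting a null dataset_id for global (unscoped) stores
        on_insert_fields.pop("dataset_id")

    query = {"store_name": store_name, "key": key}
    if dataset_id is not None:
        query["dataset_id"] = dataset_id

    # Mutable fields go in $set; immutable/scoping fields only on insert
    collection.update_one(
        query,
        {
            "$set": {"value": value, "updated_at": now},
            "$setOnInsert": on_insert_fields,
        },
        upsert=True,
    )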
104-106: Using '_where' function with 'dataset_id' in 'get_key'
The use of _where(store_name, key, self._dataset_id) ensures that the get_key method retrieves keys specific to the dataset, maintaining data isolation.
112-114: Including 'dataset_id' in 'list_keys' method
By passing dataset_id=self._dataset_id to _where, the list_keys method correctly lists keys scoped to the specific dataset.
121-124: Updating TTL with dataset-specific scoping
The update_ttl method correctly includes self._dataset_id in the query, ensuring that the TTL is updated only for keys within the specific dataset.
128-130: Deleting keys within the correct dataset scope
Including self._dataset_id in the delete_one query ensures that only the intended key within the specific dataset is deleted.
135-139: Deleting stores with dataset scoping in 'delete_store'
The delete_store method correctly scopes the deletion to the specified dataset by including self._dataset_id in the query.
140-143: Implementing 'cleanup_for_dataset' method for dataset-specific cleanup
The cleanup_for_dataset method provides functionality to delete all keys associated with a specific dataset, which is essential for proper resource management upon dataset deletion.
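A minimal sketch of what this could look like at the repository level (the guard and the delete_many call are assumptions, not the exact source):

def cleanup_for_dataset(self):
    """Deletes all execution store documents associated with this repo's dataset."""
    # A global (unscoped) repo has nothing dataset-specific to remove
    if self._dataset_id is None:
        return

    self._collection.delete_many({"dataset_id": self._dataset_id})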
149-153: Passing 'dataset_id' to the parent class in 'MongoExecutionStoreRepo'
The __init__ method of MongoExecutionStoreRepo correctly passes dataset_id to the parent class, ensuring consistent dataset scoping across inherited methods.
156-161: Defining index names and including 'dataset_id'
Defining dataset_id_name and including it in index definitions ensures that indexes are correctly applied to the dataset_id field, improving query performance.
168-171: Updating unique index to include 'dataset_id'
Including dataset_id in the unique index on ('store_name', 'key', 'dataset_id') ensures that keys are unique within each dataset, preventing key collisions across datasets.
172-174: Creating individual indexes for efficient queries
Creating indexes on store_name, key, and dataset_id individually can improve query performance when filtering on these fields.
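For reference, a hedged pymongo sketch of the index layout described in the two comments above (the helper and index options are illustrative):

import pymongo


def create_execution_store_indexes(collection):
    # Compound unique index: the same key name may exist in different
    # datasets, but not twice within the same (store, dataset) pair
    collection.create_index(
        [
            ("store_name", pymongo.ASCENDING),
            ("key", pymongo.ASCENDING),
            ("dataset_id", pymongo.ASCENDING),
        ],
        unique=True,
    )

    # Single-field indexes to speed up queries that filter on one field
    for field in ("store_name", "key", "dataset_id"):
        collection.create_index(field)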
90-94: Verify the impact of queries when 'dataset_id' is None
When dataset_id is None, the queries do not include dataset_id, potentially affecting documents across all datasets. Please verify that this behavior is intended and that operations are correctly scoped to prevent unintended access or modification of data from other datasets.
Run the following script to check if there are any unintended documents being accessed when dataset_id is None:
✅ Verification successful
Behavior of dataset_id=None is intentionally designed and safe
The code analysis reveals that the handling of dataset_id is properly implemented with built-in safeguards:
- The _where function explicitly builds queries that scope operations to a specific dataset only when dataset_id is provided
- When initializing documents, there's specific logic to handle None dataset_id cases: if self._dataset_id is None: on_insert_fields.pop("dataset_id")
- The class maintains unique indexes on the (store_name, key, dataset_id) combination, preventing any accidental cross-dataset conflicts
- All operations consistently use the same _where function, ensuring uniform scoping behavior
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: List all documents where 'dataset_id' does not exist.
# Expected result: Only documents that are meant to be global (not associated with any dataset).
# Find documents without 'dataset_id' field
mongo --eval 'db.execution_store.find({"dataset_id": {"$exists": false}}).pretty()'
Length of output: 158
Script:
#!/bin/bash
# Let's analyze the codebase to understand the usage of dataset_id in execution_store
# First, let's find where _where function is defined and how it handles dataset_id
rg -A 5 "_where\s*=?\s*\(" fiftyone/factory/repos/
# Then check for any other usages of dataset_id in execution store
rg "dataset_id" fiftyone/factory/repos/execution_store.py
# Look for any tests that might verify this behavior
fd "test.*execution.*store.*py" -t f
Length of output: 4537
tests/unittests/execution_store_tests.py (2)
229-362: Tests for dataset_id functionality are comprehensive and well-structured
The newly added test methods effectively verify that the dataset_id is correctly incorporated in all ExecutionStoreService operations. The assertions confirm that database queries and updates include the dataset_id, ensuring proper scoping and data integrity.
313-326: Fix incorrect indentation in test_create_store_with_dataset_id method
The method test_create_store_with_dataset_id is indented too far, causing it to be nested within the previous method test_delete_key_with_dataset_id. This will result in a syntax error, and the method will not be recognized as part of the ExecutionStoreServiceDatasetIdTests class.
Apply this diff to correct the indentation:

-         def test_create_store_with_dataset_id(self):
+     def test_create_store_with_dataset_id(self):

Ensure all lines within the method are adjusted accordingly.
Likely invalid or redundant comment.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
- fiftyone/factory/repo_factory.py (3 hunks)
- fiftyone/operators/store/service.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- fiftyone/factory/repo_factory.py
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/service.py
20-20: Undefined name ExecutionStoreRepo (F821)
🔇 Additional comments (3)
fiftyone/operators/store/service.py (3)
9-9: LGTM: Import addition is appropriate
The addition of the bson import is necessary for typing the new dataset_id parameter.
21-37: LGTM: Dataset scoping implementation is correct
The dataset_id parameter is:
- Properly typed as Optional[bson.ObjectId]
- Well-documented in the docstring
- Correctly passed to the repository factory
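A rough sketch of the constructor shape this describes (the factory method name is an assumption; the string annotation mirrors the ExecutionStoreRepo forward reference flagged by Ruff above):

from typing import Optional

import bson

from fiftyone.factory.repo_factory import RepositoryFactory


class ExecutionStoreService:
    """Sketch only; see fiftyone/operators/store/service.py for the real class."""

    def __init__(
        self,
        repo: Optional["ExecutionStoreRepo"] = None,
        dataset_id: Optional[bson.ObjectId] = None,
    ):
        if repo is None:
            # Delegate construction to the factory, forwarding the optional
            # dataset scope so all repo operations are dataset-aware
            repo = RepositoryFactory.execution_store_repo(dataset_id=dataset_id)

        self._repo = repo
        self._dataset_id = dataset_id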
141-143: LGTM: Clean implementation of dataset cleanup
The cleanup_for_dataset method:
- Has clear documentation
- Correctly delegates to the repository layer
- Follows single responsibility principle
LGTM
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (6)
fiftyone/operators/store/__init__.py (2)
8-12: LGTM! Consider documenting the re-exported entities.
The imports look good for an __init__.py file that re-exports these entities. However, it would be helpful to add a docstring listing the public API that this module exposes.
Add a module-level docstring section like this:

"""
FiftyOne execution store module.

| Copyright 2017-2024, Voxel51, Inc.
| `voxel51.com <https://voxel51.com/>`_
|

Public API
----------
- ExecutionStore: Main class for managing execution stores
- ExecutionStoreService: Service for managing execution store operations
- cleanup_store_for_dataset: Function to cleanup dataset-specific store data
- StoreDocument, KeyDocument: Data models for store entries
"""

🧰 Tools
🪛 Ruff
10-10: .service.cleanup_store_for_dataset imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
10-10: .service.ExecutionStoreService imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
11-11: .store.ExecutionStore imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
12-12: .models.StoreDocument imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
12-12: .models.KeyDocument imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
14-19: Consider using an explicit __all__ list for better API control.
While dynamic __all__ generation is convenient, it may accidentally expose internal implementation details. For a public API, it's often better to be explicit about what should be exposed.
Consider replacing the dynamic generation with an explicit list:

-__all__ = [
-    k
-    for k, v in globals().items()
-    if not k.startswith("_") and not isinstance(v, types.ModuleType)
-]
+__all__ = [
+    "ExecutionStore",
+    "ExecutionStoreService",
+    "cleanup_store_for_dataset",
+    "StoreDocument",
+    "KeyDocument",
+]

This makes the public API more intentional and easier to maintain, as you'll need to explicitly add new entities that should be public.
fiftyone/operators/store/service.py (2)
19-27: Add type hint for dataset_id parameter
The function should include a type annotation for better code maintainability and IDE support.

-def cleanup_store_for_dataset(dataset_id):
+def cleanup_store_for_dataset(dataset_id: bson.ObjectId):
38-41: Standardize docstring format
The docstring parameter format is inconsistent. Parameters should use the standard format: param_name: description.

 Args:
-    repo (None): execution store repository instance
-    dataset_id (None): dataset ID to scope operations to
+    repo: Optional execution store repository instance
+    dataset_id: Optional dataset ID to scope operations to

fiftyone/core/odm/database.py (1)
1130-1139: Add error handling for store operations.
The implementation correctly integrates execution store cleanup into the dataset deletion flow. However, consider adding error handling for the store operations to ensure graceful failure.
Apply this diff to add error handling:

     dataset_id = dataset_dict["_id"]
-    svc = foos.ExecutionStoreService(dataset_id=dataset_id)
-    num_docs = sum(
-        len(svc.list_keys(store_name)) for store_name in svc.list_stores()
-    )
-    if num_docs > 0:
-        _logger.info("Deleting %d store doc(s)", num_docs)
-        if not dry_run:
-            foos.cleanup_store_for_dataset(dataset_id)
+    try:
+        svc = foos.ExecutionStoreService(dataset_id=dataset_id)
+        num_docs = sum(
+            len(svc.list_keys(store_name)) for store_name in svc.list_stores()
+        )
+        if num_docs > 0:
+            _logger.info("Deleting %d store doc(s)", num_docs)
+            if not dry_run:
+                foos.cleanup_store_for_dataset(dataset_id)
+    except Exception as e:
+        _logger.warning("Failed to cleanup execution store: %s", str(e))

fiftyone/core/dataset.py (1)
5184-5185: LGTM! Consider adding error handling for cleanup.
The addition of execution store cleanup when deleting a dataset is a good practice to prevent orphaned data. However, consider wrapping the cleanup in a try-except to ensure dataset deletion completes even if cleanup fails.

-    foos.cleanup_store_for_dataset(self._doc.id)
+    try:
+        foos.cleanup_store_for_dataset(self._doc.id)
+    except Exception as e:
+        logger.warning(
+            "Failed to cleanup execution store for dataset %s: %s",
+            self._doc.id,
+            str(e)
+        )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (6)
- fiftyone/core/dataset.py (2 hunks)
- fiftyone/core/odm/database.py (2 hunks)
- fiftyone/operators/store/__init__.py (1 hunks)
- fiftyone/operators/store/service.py (2 hunks)
- fiftyone/operators/store/store.py (4 hunks)
- tests/unittests/dataset_tests.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- fiftyone/operators/store/store.py
🧰 Additional context used
🪛 Ruff
fiftyone/operators/store/__init__.py
10-10: .service.cleanup_store_for_dataset imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
10-10: .service.ExecutionStoreService imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
11-11: .store.ExecutionStore imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
12-12: .models.StoreDocument imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
12-12: .models.KeyDocument imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)

fiftyone/operators/store/service.py
33-33: Undefined name ExecutionStoreRepo (F821)
🔇 Additional comments (5)
fiftyone/operators/store/service.py (2)
9-12
: LGTM: Required imports added correctly
The new imports are properly added to support the dataset ID functionality.
154-158: LGTM: Clean implementation of dataset cleanup
The method correctly delegates the cleanup operation to the repository layer while maintaining clear documentation.
fiftyone/core/odm/database.py (1)
1130-1139: Verify execution store service integration.
Let's verify the integration with the execution store service.
✅ Verification successful
Based on the shell script results, I can generate the final response as the verification is complete. The results show:
- ExecutionStoreService is properly defined in fiftyone/operators/store/service.py
- cleanup_store_for_dataset is properly defined in the same file
- Both components are exported via fiftyone/operators/store/__init__.py
- The integration is well-tested with dedicated test classes
- The usage in database.py matches the established patterns in other files
Integration with execution store service is properly implemented
The code correctly uses the ExecutionStoreService class and cleanup_store_for_dataset function from the store module. The implementation follows the established patterns seen across the codebase, particularly in core/dataset.py. The integration is also well-tested with dedicated test classes covering both general integration and dataset-specific scenarios.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the ExecutionStoreService import and cleanup function.
# Test 1: Check if the ExecutionStoreService is properly imported
rg -A 2 "class ExecutionStoreService"
# Test 2: Check if cleanup_store_for_dataset function exists
rg -A 2 "def cleanup_store_for_dataset"
# Test 3: Check for any existing usages of these components
rg "ExecutionStoreService|cleanup_store_for_dataset"
Length of output: 3030
tests/unittests/dataset_tests.py (2)
29-29: LGTM: Import ExecutionStoreService for execution store tests
The import is correctly placed and follows the file's import organization pattern.
142-157: LGTM: Comprehensive test coverage for execution store cleanup
The test properly verifies that:
- Execution store keys can be set and retrieved
- Keys are properly cleaned up when the dataset is deleted
- The test maintains backwards compatibility with existing dataset deletion checks
The test structure follows good practices with clear setup, action, and verification steps.
Summary by CodeRabbit
Release Notes
New Features
- Updated ExecutionStore and ExecutionStoreService to accept and utilize dataset IDs.
Bug Fixes
Tests