Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor initialization of storages #85

Closed
vdusek opened this issue Apr 3, 2024 · 11 comments
Closed

Refactor initialization of storages #85

vdusek opened this issue Apr 3, 2024 · 11 comments
Assignees
Labels
debt Code quality improvement or decrease of technical debt. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Copy link
Collaborator

vdusek commented Apr 3, 2024

Description

  • Currently, if you want to initialize Dataset/KVS/RQ you should use open() constructor. And it goes like the following:
    • dataset.open()
    • base_storage.open()
    • dataset.__init__()
    • base_storage.__init__()
  • In the base_storage.open() a specific client is selected (local - MemoryStorageClient or cloud - ApifyClient) using StorageClientManager.
  • Refactor initialization of memory storage resource clients as well.

Desired state

  • Make it more readable, less error-prone (e.g. user uses a wrong constructor), and extensible by supporting other clients.
@vdusek vdusek added t-tooling Issues with this label are in the ownership of the tooling team. debt Code quality improvement or decrease of technical debt. labels Apr 3, 2024
@janbuchar
Copy link
Collaborator

In Python, it's usually better to just make a module instead of a singleton - I'm just noting this with StorageClientManager in mind

@janbuchar
Copy link
Collaborator

The resource collection clients, e.g. src/crawlee/memory_storage/dataset_collection_client.py seem to contain little to no functionality.

@janbuchar
Copy link
Collaborator

BaseStorage consists mostly of class methods that handle "instance caching". I believe this should be pulled out, possibly to StorageClientManager.

@janbuchar

This comment was marked as resolved.

@B4nan

This comment was marked as resolved.

@janbuchar

This comment was marked as resolved.

@B4nan

This comment was marked as resolved.

@vdusek

This comment was marked as resolved.

@janbuchar

This comment was marked as resolved.

@vdusek

This comment has been minimized.

@vdusek vdusek self-assigned this Apr 24, 2024
@vdusek vdusek added this to the 88th sprint - Tooling team milestone Apr 24, 2024
vdusek added a commit that referenced this issue May 10, 2024
### Description

- Refactor initialization of storages & memory resource clients.
- `BaseStorage` and `BaseResourceClients` are only ABC with abstract
methods; all the creation-related code is moved to a separate module.
- I created `_creation_management.py` helper modules in both `storages/`
and `memory_storage_client/`.
- I had to move `crawlee/storages/models.py` to `crawlee/models.py`
because of the import loops, I merged it with `request.py`. I am open to
other ideas in this regard.

### Related issues

- #85

### Testing

- Tests pass
@vdusek
Copy link
Collaborator Author

vdusek commented May 10, 2024

closing as it was resolved in #143

@vdusek vdusek closed this as completed May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt Code quality improvement or decrease of technical debt. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

3 participants