Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the usage of Configuration class #670

Open
vdusek opened this issue Nov 8, 2024 · 4 comments
Open

Fix the usage of Configuration class #670

vdusek opened this issue Nov 8, 2024 · 4 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@vdusek
Copy link
Collaborator

vdusek commented Nov 8, 2024

Our Configuration class currently contains many fields that are unused and/or undocumented. Review each field, decide whether it should be used, removed, or consolidated, and add missing doc strings.

@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Nov 8, 2024
@janbuchar
Copy link
Collaborator

Thanks for bringing this up! If you could write out the attributes in question here, we can discuss the appropriate steps together.

@vdusek
Copy link
Collaborator Author

vdusek commented Nov 8, 2024

The fields that are used:

  • internal_timeout
  • verbose_log (but it just sets the DEBUG level, and we have another field exactly for that)
  • default_dataset_id
  • default_key_value_store_id
  • default_request_queue_id
  • purge_on_start
  • write_metadata
  • persist_storage
  • memory_mbytes
  • storage_dir

The fields that are not used:

  • default_browser_path
  • disable_browser_sandbox
  • persist_state_interval
  • system_info_interval
  • max_used_cpu_ratio
  • available_memory_ratio
  • chrome_executable_path
  • headless
  • xvfb

The inputKey field from JS is missing completely.

@janbuchar
Copy link
Collaborator

The fields that are not used:

  • default_browser_path

In JS, this is used by PlaywrightLauncher

  • disable_browser_sandbox

In JS, this is used by BrowserLauncher to add an argument to playwright/puppeteer

  • chrome_executable_path

In JS, this is used by BrowserLauncher, and set in https://github.com/apify/apify-actor-docker/blob/master/node-playwright/Dockerfile or by the actor developer in environment variables

  • xvfb

This may be set in https://github.com/apify/apify-actor-docker/blob/master/node-playwright/Dockerfile or by the actor developer in environment variables, but I don't think it actually has any effect 🤔 @vladfrangu do you know more?

  • headless
    We should probably implement this - it's convenient for local debugging
  • persist_state_interval
  • system_info_interval

These should probably be consumed by LocalEventManager. Currently, the SDK passes that, which may not be optimal.

  • max_used_cpu_ratio
  • available_memory_ratio

These should probably be handled by LocalEventManager, SystemInfo or Snapshotter.

The inputKey field from JS is missing completely.

I must have missed that one. The only use case I can think of is when you want to debug something locally against multiple inputs and don't want to bother with renaming files... which is valid

@vdusek
Copy link
Collaborator Author

vdusek commented Nov 11, 2024

Agreed, thanks @janbuchar for the investigation. We should also note, that Configuration values should be overridable by component-specific values.

E.g.

config = Configuration(headless=False)
crawler = PlaywrightCrawler(config=config, headless=True)

then True is used for headless.

Also, we should split this issue into many smaller ones.

vdusek added a commit that referenced this issue Nov 11, 2024
@vdusek vdusek self-assigned this Nov 22, 2024
vdusek added a commit that referenced this issue Dec 13, 2024
### Description

- The `service_container` module has been completely refactored. It
introduces changes to its usage, and resulting in many changes across
the code base.
- While it remains possible to pass "services" directly to components as
before, components now rely on the `service_container` internally. They
no longer store service instances themselves.
- A new `force` flag has been added to the `service_container`'s
setters, which is especially useful for testing purposes.
- This also quite simplifies the `memory_storage_client`.
- We now have only `set_storage_client`, the same approach as for the
`event_manager`. This is more flexible (allows more envs than just local
& cloud). And in the SDK Actor, we can them based on the `is_at_home`.
- This is a breaking change, but it affects only the `service_container`
interface.

### Open discussion

- [x] Should we go further and remove the option to pass configuration,
event managers, or storage clients directly to components - requiring
them to be set exclusively via the `service_container`? Thoughts are
welcome.
  - No
- [x] Better name for `service_container`? `service_locator`?
  - `service_locator`

### Issues

- Closes: #699
- Closes: #539 
- Closes: #369 
- It also unlocks the:
  - #670,
- and maybe
apify/apify-sdk-python#324 (comment).

### Testing

- Existing tests, including those covering the `service_container`, have
been updated to reflect these changes.
- New tests covering the `MemoryStorageClient` respects the
`Configuration`.

### Manual reproduction

- This code snippet demonstrates that the `Crawler` and
`MemoryStorageClient` respects the custom `Configuration`.
- Note: Some fields remain non-working for now. These will be addressed
in a future PR, as this refactor is already quite big. However, with the
new architecture, those updates should now be easy.

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.service_container import set_configuration


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    set_configuration(config)  # or Crawler(config=config)
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Checklist

- [x] CI passed
@vdusek vdusek added this to the 105th sprint - Tooling team milestone Dec 13, 2024
@vdusek vdusek added the bug Something isn't working. label Dec 16, 2024
@vdusek vdusek changed the title Improve the Configuration class Fix the usage of Configuration class Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

2 participants