
feat: [WIP] initial add of apify-extra package with dataset functions #81

Closed
wants to merge 7 commits

Conversation

metalwarrior665 (Member):

Build not working yet

B4nan (Member) left a comment:

you will need to run npm i in the root to update the lock file and push that (npm is unfortunately dumb and puts the local packages into the lock file too, for no reason)


Comment on lines 57 to 58
"@types/bluebird": "^3.5.37",
"bluebird": "^3.7.2"
Member: do we really need to use bluebird 😭

metalwarrior665 (Member Author): I will eventually get rid of it but it will require a bit of work

}

// Now we execute all the requests in parallel (with defined concurrency)
await bluebird.map(requestInfoArr, async (requestInfoObj) => {
Member: should be the same as:

Suggested change:
- await bluebird.map(requestInfoArr, async (requestInfoObj) => {
+ await Promise.all(requestInfoArr.map(async (requestInfoObj) => {

and we can let bluebird rest in its graveyard

metalwarrior665 (Member Author): Nope, it is there for the concurrency option. I can replace it later with AutoscaledPool, BasicCrawler, or maybe I will find something more lightweight.
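For context on this thread: bluebird.map(arr, fn, { concurrency }) caps how many promises run at once, which plain Promise.all does not. A minimal, dependency-free sketch of the same behavior — mapWithConcurrency is a hypothetical helper name, not this PR's API:

```ts
// Minimal sketch of bluebird.map's { concurrency } option without bluebird.
// mapWithConcurrency is a hypothetical helper, not this PR's API.
async function mapWithConcurrency<T, R>(
    items: T[],
    fn: (item: T, index: number) => Promise<R>,
    concurrency: number,
): Promise<R[]> {
    const results: R[] = new Array(items.length);
    let nextIndex = 0;

    // Each worker claims the next index synchronously; since JS only yields
    // at awaits, no two workers ever grab the same item.
    const worker = async (): Promise<void> => {
        while (nextIndex < items.length) {
            const i = nextIndex++;
            results[i] = await fn(items[i], i);
        }
    };

    // Start at most `concurrency` workers and wait for all of them to drain.
    const workerCount = Math.min(concurrency, items.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return results;
}

// Usage matching the snippet under review:
// await mapWithConcurrency(requestInfoArr, async (requestInfoObj) => { /* ... */ }, 10);
```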

Contributor: (comment body not captured)

metalwarrior665 (Member Author): Looks good but pls test :) btw: this is also good to export, as we used this bluebird thing in more places.

packages/apify-extra/src/dataset.ts: review thread marked outdated and resolved
metalwarrior665 changed the title from "[WIP] initial add of apify-extra package with dataset functions" to "feat: [WIP] initial add of apify-extra package with dataset functions" on Oct 4, 2022
barjin self-assigned this on Oct 10, 2022
barjin (Contributor) commented on Oct 19, 2022:

Ok, after partially revamping this yesterday, I have a few points/questions (@metalwarrior665):

  • Migration-safe dataset upload - what is the use case? The local "context" is wiped with every migration anyway. This question is just out of curiosity, though; I can see this working.
  • loadDatasetItemsInParallel - if I understand correctly, this:
    • a) loads all the items from the specified datasets, optionally combines them, and returns the items.
      • Use case? We have to wait until the loading ends either way - and when a migration happens somewhere in between, we can start anew.
    • b) if provided with a processFn function, runs that function on all of the items, but then doesn't return anything; in this case, it keeps track of the processed datasets/chunks in case of a migration (this makes sense!).

If I can add my two cents, I like the "parallel" forEach (map?) approach - it reminds me of Hadoop MapReduce and similar Big Data processing systems - and imo it works well in an environment with migrations; for the rest, I am ready to have my mind changed :)
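To make the two modes concrete, here is one possible reading of the interface in TypeScript. This is a hypothetical sketch of the description above, not the PR's actual signature:

```ts
// Hypothetical shape of loadDatasetItemsInParallel as read above; the real
// signature in this PR may differ.
interface LoadOptions<Item> {
    // Mode b): when set, each loaded batch is handed to processFn and nothing
    // is returned; processed datasets/chunks can then be checkpointed so a
    // migration resumes instead of restarting from scratch.
    processFn?: (items: Item[]) => Promise<void>;
}

declare function loadDatasetItemsInParallel<Item>(
    datasetIds: string[],
    options?: LoadOptions<Item>,
): Promise<Item[] | undefined>;

// Mode a): load, combine, and return all items of the given datasets.
// const items = await loadDatasetItemsInParallel(['dataset-1', 'dataset-2']);

// Mode b): process in place; only the progress tracking survives a migration.
// await loadDatasetItemsInParallel(['dataset-1'], { processFn: async (batch) => { /* ... */ } });
```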

B4nan (Member) commented on Oct 19, 2022:

@barjin keep in mind the main issue you should be looking into is the build failure; we can discuss code improvements once we can actually build it.

metalwarrior665 (Member Author) commented on Oct 20, 2022:

@barjin

  1. The use case is when you need to push a lot of items at once. A typical case is dataset transformations: you load 1M items and push 1M, so you need to increase speed via parallelism and make sure you only restart where you ended (loading is much faster than pushing).
  2. This actor is the main consumer, but it has some customizations.
    a) The use case is that it keeps the returned items in order (across datasets and inside each dataset) while loading in parallel, and that it is able to parallelize over multiple datasets and also inside one dataset. You can get up to 30x speed this way. If a migration happens, you have to load again, yeah, but at that speed you have a good chance it doesn't happen. A sketch of this ordering trick follows below.
    b) This part was a bit messy, so it will need good tests.

We could get rid of a) for simplicity and make processFn mandatory, but then the migration tracking needs to be optional, because sometimes you want to start from scratch. And if the user wants the items in order, they would need to handle that themselves by checking which dataset each chunk is from and what the offsets are. Since we already provide that functionality (if you transform a dataset, it is nice that the output stays in the same order), I would keep it, but maybe we can change the interface/implementation. The function is definitely ridiculously complex and needs to be refactored eventually and tested in smaller pieces.
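The ordering idea from 2a) could work like this: enumerate every (dataset, offset) chunk up front in final output order, fetch the chunks concurrently, and write each one into its pre-assigned slot. This is an illustration only; it assumes the Apify SDK's Actor.openDataset / getInfo / getData, reuses the mapWithConcurrency helper sketched in the review thread above, and the batch size is a made-up constant:

```ts
import { Actor } from 'apify';

const BATCH_SIZE = 50_000; // assumption; the PR may chunk differently

// Order-preserving parallel load: chunks are fetched concurrently, but each
// one lands in a slot fixed by (dataset order, offset order), so the
// flattened result matches a sequential read. Relies on mapWithConcurrency
// from the sketch earlier in this discussion.
async function loadItemsInOrder(datasetIds: string[], concurrency = 10): Promise<unknown[]> {
    // Enumerate every chunk up front, already in final output order.
    const chunks: { datasetId: string; offset: number }[] = [];
    for (const datasetId of datasetIds) {
        const dataset = await Actor.openDataset(datasetId);
        const itemCount = (await dataset.getInfo())?.itemCount ?? 0;
        for (let offset = 0; offset < itemCount; offset += BATCH_SIZE) {
            chunks.push({ datasetId, offset });
        }
    }

    // Fetch in parallel; the chunk's index doubles as its output slot.
    const slots: unknown[][] = chunks.map(() => []);
    await mapWithConcurrency(chunks, async ({ datasetId, offset }, slot) => {
        const dataset = await Actor.openDataset(datasetId);
        const { items } = await dataset.getData({ offset, limit: BATCH_SIZE });
        slots[slot] = items;
    }, concurrency);
    return slots.flat();
}
```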

metalwarrior665 (Member Author) commented on Oct 20, 2022:

My plan going forward is to have one more look at the features and the logic (suggestions welcome) and then we would release it in beta so we can start using it internally. Then add proper tests (it is quite battle-tested in production :D but for sure needs tests) before we release it in latest. Sounds good?

metalwarrior665 (Member Author) commented on Oct 20, 2022:

Maybe the way to do this better would be to extend the SDK classes so it is more in line with SDK interfaces rather than having standalone functions with tons of options. But I probably don't have capacity for a major rewrite now :)
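For illustration only, that direction could look like a thin wrapper (or subclass) exposing the helpers as methods. DatasetExtra and pushDataBatched are invented names, not anything committed in this PR:

```ts
import { Actor, Dataset } from 'apify';

// Hypothetical sketch of the "extend the SDK classes" direction: the extra
// functionality lives as methods on a Dataset-like object instead of
// standalone functions with many options. All names here are invented.
class DatasetExtra {
    private constructor(private readonly dataset: Dataset) {}

    static async open(datasetIdOrName?: string): Promise<DatasetExtra> {
        return new DatasetExtra(await Actor.openDataset(datasetIdOrName));
    }

    // Push items in fixed-size batches rather than one huge request.
    async pushDataBatched(items: Record<string, unknown>[], batchSize = 1_000): Promise<void> {
        for (let i = 0; i < items.length; i += batchSize) {
            await this.dataset.pushData(items.slice(i, i + batchSize));
        }
    }
}

// const dataset = await DatasetExtra.open();
// await dataset.pushDataBatched(bigArrayOfItems);
```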

barjin (Contributor) commented on Nov 10, 2022:

Closed in favor of #116; all further development will happen there.

barjin closed this on Nov 10, 2022