
feat: [WIP] initial add of apify-extra package with dataset functions #81

Closed
wants to merge 7 commits

Conversation

metalwarrior665 (Member):

Build not working yet

B4nan (Member) left a comment:

you will need to run npm i in the root to update the lock file and push that (npm is unfortunately dumb and puts the local packages into the lock file too, for no reason)


Comment on lines 57 to 58
"@types/bluebird": "^3.5.37",
"bluebird": "^3.7.2"
Member: do we really need to use bluebird 😭

metalwarrior665 (Member Author): I will eventually get rid of it but it will require a bit of work

}

// Now we execute all the requests in parallel (with defined concurrency)
await bluebird.map(requestInfoArr, async (requestInfoObj) => {
Member: should be the same as:

Suggested change:
- await bluebird.map(requestInfoArr, async (requestInfoObj) => {
+ await Promise.all(requestInfoArr.map(async (requestInfoObj) => {

and we can let bluebird rest in its graveyard

metalwarrior665 (Member Author): Nope, it is there for the concurrency option. I can replace it later with AutoscaledPool, BasicCrawler, or maybe I will find something more lightweight.
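For context on this thread: bluebird.map(arr, fn, { concurrency }) caps how many promises run at once, which plain Promise.all does not. A minimal, dependency-free sketch of the same behavior — mapWithConcurrency is a hypothetical helper name, not this PR's API:

```ts
// Minimal sketch of bluebird.map's { concurrency } option without bluebird.
// mapWithConcurrency is a hypothetical helper, not this PR's API.
async function mapWithConcurrency<T, R>(
    items: T[],
    fn: (item: T, index: number) => Promise<R>,
    concurrency: number,
): Promise<R[]> {
    const results: R[] = new Array(items.length);
    let nextIndex = 0;

    // Each worker claims the next index synchronously; since JS only yields
    // at awaits, no two workers ever grab the same item.
    const worker = async (): Promise<void> => {
        while (nextIndex < items.length) {
            const i = nextIndex++;
            results[i] = await fn(items[i], i);
        }
    };

    // Start at most `concurrency` workers and wait for all of them to drain.
    const workerCount = Math.min(concurrency, items.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return results;
}

// Usage matching the snippet under review:
// await mapWithConcurrency(requestInfoArr, async (requestInfoObj) => { /* ... */ }, 10);
```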

Contributor: (comment body not captured)

metalwarrior665 (Member Author): Looks good but pls test :) btw: this is also good to export, as we used this bluebird thing in more places.

packages/apify-extra/src/dataset.ts: review thread marked outdated and resolved
metalwarrior665 changed the title from "[WIP] initial add of apify-extra package with dataset functions" to "feat: [WIP] initial add of apify-extra package with dataset functions" on Oct 4, 2022
barjin self-assigned this on Oct 10, 2022
barjin (Contributor) commented on Oct 19, 2022:

Ok, after partially revamping this yesterday, I have a few points/questions (@metalwarrior665):

  • Migration-safe dataset upload - what is the use case? The local "context" is wiped with every migration anyway. This question is just out of curiosity, though; I can see this working.
  • loadDatasetItemsInParallel - if I understand correctly, this:
    • a) loads all the items from the specified datasets, optionally combines them, and returns the items.
      • Use case? We have to wait until the loading ends either way - and when a migration happens somewhere in between, we can start anew.
    • b) if provided with a processFn function, runs that function on all of the items, but then doesn't return anything; in this case, it keeps track of the processed datasets/chunks in case of a migration (this makes sense!).

If I can add my two cents, I like the "parallel" forEach (map?) approach - it reminds me of Hadoop MapReduce and similar Big Data processing systems - and imo it works well in an environment with migrations; for the rest, I am ready to have my mind changed :)
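To make the two modes concrete, here is one possible reading of the interface in TypeScript. This is a hypothetical sketch of the description above, not the PR's actual signature:

```ts
// Hypothetical shape of loadDatasetItemsInParallel as read above; the real
// signature in this PR may differ.
interface LoadOptions<Item> {
    // Mode b): when set, each loaded batch is handed to processFn and nothing
    // is returned; processed datasets/chunks can then be checkpointed so a
    // migration resumes instead of restarting from scratch.
    processFn?: (items: Item[]) => Promise<void>;
}

declare function loadDatasetItemsInParallel<Item>(
    datasetIds: string[],
    options?: LoadOptions<Item>,
): Promise<Item[] | undefined>;

// Mode a): load, combine, and return all items of the given datasets.
// const items = await loadDatasetItemsInParallel(['dataset-1', 'dataset-2']);

// Mode b): process in place; only the progress tracking survives a migration.
// await loadDatasetItemsInParallel(['dataset-1'], { processFn: async (batch) => { /* ... */ } });
```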

B4nan (Member) commented on Oct 19, 2022:

@barjin keep in mind the main issue you should be looking into is the build failure; we can discuss code improvements once we can actually build it.

metalwarrior665 (Member Author) commented on Oct 20, 2022:

@barjin

  1. The use case is when you need to push a lot of items at once. A typical case is dataset transformations: you load 1M items and push 1M, so you need to increase speed via parallelism and make sure you only restart where you ended (loading is much faster than pushing).
  2. This actor is the main consumer, but it has some customizations.
    a) The use case is that it keeps the returned items in order (across datasets and inside each dataset) while loading in parallel, and that it is able to parallelize over multiple datasets and also inside one dataset. You can get up to 30x speed this way. If a migration happens, you have to load again, yeah, but at that speed you have a good chance it doesn't happen. A sketch of this ordering trick follows below.
    b) This part was a bit messy, so it will need good tests.

We could get rid of a) for simplicity and make processFn mandatory, but then the migration tracking needs to be optional, because sometimes you want to start from scratch. And if the user wants the items in order, they would need to handle that themselves by checking which dataset each chunk is from and what the offsets are. Since we already provide that functionality (if you transform a dataset, it is nice that the output stays in the same order), I would keep it, but maybe we can change the interface/implementation. The function is definitely ridiculously complex and needs to be refactored eventually and tested in smaller pieces.
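The ordering idea from 2a) could work like this: enumerate every (dataset, offset) chunk up front in final output order, fetch the chunks concurrently, and write each one into its pre-assigned slot. This is an illustration only; it assumes the Apify SDK's Actor.openDataset / getInfo / getData, reuses the mapWithConcurrency helper sketched in the review thread above, and the batch size is a made-up constant:

```ts
import { Actor } from 'apify';

const BATCH_SIZE = 50_000; // assumption; the PR may chunk differently

// Order-preserving parallel load: chunks are fetched concurrently, but each
// one lands in a slot fixed by (dataset order, offset order), so the
// flattened result matches a sequential read. Relies on mapWithConcurrency
// from the sketch earlier in this discussion.
async function loadItemsInOrder(datasetIds: string[], concurrency = 10): Promise<unknown[]> {
    // Enumerate every chunk up front, already in final output order.
    const chunks: { datasetId: string; offset: number }[] = [];
    for (const datasetId of datasetIds) {
        const dataset = await Actor.openDataset(datasetId);
        const itemCount = (await dataset.getInfo())?.itemCount ?? 0;
        for (let offset = 0; offset < itemCount; offset += BATCH_SIZE) {
            chunks.push({ datasetId, offset });
        }
    }

    // Fetch in parallel; the chunk's index doubles as its output slot.
    const slots: unknown[][] = chunks.map(() => []);
    await mapWithConcurrency(chunks, async ({ datasetId, offset }, slot) => {
        const dataset = await Actor.openDataset(datasetId);
        const { items } = await dataset.getData({ offset, limit: BATCH_SIZE });
        slots[slot] = items;
    }, concurrency);
    return slots.flat();
}
```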

metalwarrior665 (Member Author) commented on Oct 20, 2022:

My plan going forward is to have one more look at the features and the logic (suggestions welcome) and then we would release it in beta so we can start using it internally. Then add proper tests (it is quite battle-tested in production :D but for sure needs tests) before we release it in latest. Sounds good?

metalwarrior665 (Member Author) commented on Oct 20, 2022:

Maybe the way to do this better would be to extend the SDK classes so it is more in line with SDK interfaces rather than having standalone functions with tons of options. But I probably don't have capacity for a major rewrite now :)
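For illustration only, that direction could look like a thin wrapper (or subclass) exposing the helpers as methods. DatasetExtra and pushDataBatched are invented names, not anything committed in this PR:

```ts
import { Actor, Dataset } from 'apify';

// Hypothetical sketch of the "extend the SDK classes" direction: the extra
// functionality lives as methods on a Dataset-like object instead of
// standalone functions with many options. All names here are invented.
class DatasetExtra {
    private constructor(private readonly dataset: Dataset) {}

    static async open(datasetIdOrName?: string): Promise<DatasetExtra> {
        return new DatasetExtra(await Actor.openDataset(datasetIdOrName));
    }

    // Push items in fixed-size batches rather than one huge request.
    async pushDataBatched(items: Record<string, unknown>[], batchSize = 1_000): Promise<void> {
        for (let i = 0; i < items.length; i += batchSize) {
            await this.dataset.pushData(items.slice(i, i + batchSize));
        }
    }
}

// const dataset = await DatasetExtra.open();
// await dataset.pushDataBatched(bigArrayOfItems);
```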

barjin (Contributor) commented on Nov 10, 2022:

Closed in favor of #116; all further development will happen there.

barjin closed this on Nov 10, 2022