
Retrieving objects for a set or list of URLs in parallel #22

Open
vikas95 opened this issue Dec 10, 2021 · 3 comments

Comments

vikas95 commented Dec 10, 2021

Hi,

Thanks for sharing the programming example - https://github.com/cocrawler/cdx_toolkit#programming-example
I wanted to ask if there is a way to feed in a list of URLs and retrieve their objects. In the above example we feed URLs one by one, and looping over a few thousand (or even a few hundred) seems a little time-consuming.

Thanks.
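One way to work through a list of URLs is to wrap the README's per-URL loop in a thread pool. The sketch below assumes the `CDXFetcher`/`iter` API shown in the programming example; the `fetch_one` helper, the worker count, and the URL list are illustrative and not part of cdx_toolkit. Note that parallel requests still hit the same index servers, so the Common Crawl side may remain the bottleneck (see the discussion below).

```python
# Sketch only: parallelizes the README's per-URL loop with a thread pool.
# fetch_one(), the worker count, and the url list are illustrative, not cdx_toolkit API.
from concurrent.futures import ThreadPoolExecutor, as_completed

import cdx_toolkit

def fetch_one(url):
    # One fetcher per call, since it isn't documented whether a single
    # CDXFetcher is safe to share across threads.
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    # limit=1 mirrors the README example; each call still queries the CDX index.
    return url, list(cdx.iter(url, limit=1))

urls = ['commoncrawl.org/*', 'example.com/*']  # your list of URLs

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_one, u) for u in urls]
    for fut in as_completed(futures):
        url, records = fut.result()
        results[url] = records
```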

wumpus (Member) commented Dec 13, 2021

The loop could be one iteration... in fact the example you're looking at just loops once (limit=1)

vikas95 (Author) commented Mar 17, 2022

@wumpus - thanks for the response :D
I am trying to retrieve metadata for nearly 10k webpages, feeding the URL of each webpage one by one to the cdx.iter function. I have been timing the retrieval for sets of 20 webpages: some sets take nearly 30 minutes, while others of the same size are retrieved within 5 minutes.

I read your explanation on another issue in this repo (#8). I wanted to ask whether the retrieval time depends on how many requests Common Crawl is handling at a given time? It would also be helpful if you could suggest any changes that could speed up retrieval.

Thanks.

wumpus (Member) commented Mar 26, 2022

Turn up the verbose level and you'll see what's going on -- if you are not limiting your time span, the cdx code has to talk to every Common Crawl index individually, whereas for the Internet Archive there's just one query.
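As a rough illustration of both suggestions, the sketch below turns up logging and restricts the query to a narrow time span. The `from_ts`/`to` keyword arguments are an assumption based on the CLI's `--from`/`--to` flags, and the logging setup is just the standard library default, so check the cdx_toolkit docs for the exact names.

```python
import logging

import cdx_toolkit

# More verbose logging should show each Common Crawl index being queried.
logging.basicConfig(level=logging.INFO)

cdx = cdx_toolkit.CDXFetcher(source='cc')

# Limiting the time span (assumed from_ts/to kwargs, mirroring the CLI's
# --from/--to flags) means fewer monthly indexes have to be contacted.
for obj in cdx.iter('commoncrawl.org/*', from_ts='202002', to='202003', limit=1):
    print(obj)
```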
