
Retrieving objects for a set or list of URLs in parallel #22

Open
vikas95 opened this issue Dec 10, 2021 · 3 comments

Comments

vikas95 commented Dec 10, 2021

Hi,

Thanks for sharing the programming example - https://github.com/cocrawler/cdx_toolkit#programming-example
I wanted to ask if there is a way to feed in a list of URLs and retrieve their objects. In the above example we feed URLs one by one, and looping over a few thousand (or even a few hundred) seems a little time-consuming.

Thanks.
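One way to work through a list of URLs is to wrap the README's per-URL loop in a thread pool. The sketch below assumes the `CDXFetcher`/`iter` API shown in the programming example; the `fetch_one` helper, the worker count, and the URL list are illustrative and not part of cdx_toolkit. Note that parallel requests still hit the same index servers, so the Common Crawl side may remain the bottleneck (see the discussion below).

```python
# Sketch only: parallelizes the README's per-URL loop with a thread pool.
# fetch_one(), the worker count, and the url list are illustrative, not cdx_toolkit API.
from concurrent.futures import ThreadPoolExecutor, as_completed

import cdx_toolkit

def fetch_one(url):
    # One fetcher per call, since it isn't documented whether a single
    # CDXFetcher is safe to share across threads.
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    # limit=1 mirrors the README example; each call still queries the CDX index.
    return url, list(cdx.iter(url, limit=1))

urls = ['commoncrawl.org/*', 'example.com/*']  # your list of URLs

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_one, u) for u in urls]
    for fut in as_completed(futures):
        url, records = fut.result()
        results[url] = records
```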

wumpus (Member) commented Dec 13, 2021

The loop could be one iteration... in fact the example you're looking at just loops once (limit=1)

vikas95 (Author) commented Mar 17, 2022

@wumpus - thanks for the response :D
I am trying to retrieve metadata for nearly 10k webpages, feeding the URL of each webpage one by one to the cdx.iter function. I have been timing the retrieval for sets of 20 webpages: some sets take nearly 30 minutes, while others of the same size are retrieved within 5 minutes.

I read your explanation on another issue in this repo (#8). I wanted to ask whether the retrieval time depends on how many requests Common Crawl is handling at a given time? It would also be helpful if you could suggest any changes that could speed up retrieval.

Thanks.

wumpus (Member) commented Mar 26, 2022

Turn up the verbose level and you'll see what's going on -- if you are not limiting your time span, the cdx code has to talk to every Common Crawl index individually, whereas for the Internet Archive there's just one query.
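As a rough illustration of both suggestions, the sketch below turns up logging and restricts the query to a narrow time span. The `from_ts`/`to` keyword arguments are an assumption based on the CLI's `--from`/`--to` flags, and the logging setup is just the standard library default, so check the cdx_toolkit docs for the exact names.

```python
import logging

import cdx_toolkit

# More verbose logging should show each Common Crawl index being queried.
logging.basicConfig(level=logging.INFO)

cdx = cdx_toolkit.CDXFetcher(source='cc')

# Limiting the time span (assumed from_ts/to kwargs, mirroring the CLI's
# --from/--to flags) means fewer monthly indexes have to be contacted.
for obj in cdx.iter('commoncrawl.org/*', from_ts='202002', to='202003', limit=1):
    print(obj)
```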
