Build a stage 1 worker using search engine results #7
We need to use the 3-stage workflow. The output will be to send all the data to the Postgres database via a SQLAlchemy engine. We should use the CAH format for that: https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/43eec102d3c4f08145a7704d4c65648619677768/ccpp.py#L375 The issue I have is that while we can use private workers, using crowdsourced workers would expose the DB credentials, and I still have no idea how to curate the output before allowing saves to the database. At this point I can only operate our swarm of private workers. While testing, we can use a test table instead of the production one.
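Not the actual worker code, but a minimal sketch of the save step as described: a batch insert into a test table through a SQLAlchemy engine. The table name `cah_test`, the `url`/`caption` columns, and the `CAH_DB_URL` environment variable are placeholders; the real CAH column layout is in the ccpp.py link above. Reading the DB URL from the environment keeps credentials out of the code itself, though it still doesn't make handing them to crowdsourced workers safe.

```python
# Minimal sketch of pushing stage 1 results to Postgres via SQLAlchemy.
# Assumptions: a test table "cah_test" with hypothetical url/caption columns;
# the real CAH format should be taken from ccpp.py linked above.
import os
from sqlalchemy import create_engine, text

# Credentials come from the environment rather than the source code,
# e.g. CAH_DB_URL=postgresql+psycopg2://user:pass@host/db
engine = create_engine(os.environ["CAH_DB_URL"])

def save_pairs(pairs):
    """Insert a batch of (url, caption) pairs into the test table."""
    with engine.begin() as conn:  # one transaction per batch
        conn.execute(
            text("INSERT INTO cah_test (url, caption) VALUES (:url, :caption)"),
            [{"url": u, "caption": c} for u, c in pairs],
        )

save_pairs([("https://example.com/img.jpg", "a red bicycle")])
```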
Table structure is:
The trigger just updates the timestamp for the last modified time.
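Since the actual table structure isn't shown in the thread, here is only a generic sketch of the kind of last-modified trigger described; the table and column names (`cah_test`, `last_modified`) are carried over from the placeholder sketch above, not the real schema.

```python
# Sketch of a "touch last_modified on update" trigger in Postgres,
# created through the same SQLAlchemy engine as above. Table and column
# names are placeholders; the real structure isn't shown in this thread.
import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["CAH_DB_URL"])

TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
    NEW.last_modified = now();  -- stamp the row on every update
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cah_test_touch
BEFORE UPDATE ON cah_test
FOR EACH ROW EXECUTE FUNCTION touch_last_modified();  -- Postgres 11+ syntax
"""

with engine.begin() as conn:
    conn.execute(text(TRIGGER_DDL))
```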
Update from the Bing image search query tests: at first I got about 300 image-text pairs per second with my Colab code. Then, after a few hundred thousand samples, the IP gets blocked and the rate drops to ~10 samples per second. Still not bad, but using Tor could be a good idea; I need to run some more tests.
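For the Tor idea, a minimal sketch of routing requests through a local Tor SOCKS proxy. It assumes a tor daemon on the default port 9050 and `requests[socks]` installed, and it hasn't been tested against Bing's rate limiting; the actual query code lives in the Colab notebook and isn't reproduced here.

```python
# Sketch: route scraper HTTP traffic through a local Tor SOCKS proxy to
# dodge per-IP rate limits. Assumes `tor` is running on the default port
# 9050 and `requests[socks]` is installed.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve DNS via Tor too
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url):
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=30)
    resp.raise_for_status()
    return resp.text

# The exit IP as seen by the outside world; should differ from the real one.
print(fetch("https://check.torproject.org/api/ip"))
```

To rotate to a fresh exit IP once a block hits, one would additionally signal NEWNYM on Tor's control port (e.g. via the stem library); whether that keeps the rate above ~10 samples/sec is exactly what the next round of tests would show.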
Here's the general plan:
(Additionally, we could collect all named entities mentioned at least x times in Wikipedia, The Pile, ...)
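A rough sketch of that named-entity idea, using spaCy's small English model as a stand-in NER over a toy corpus; `min_count` (the "x times" threshold) is an invented parameter name, not something from the plan.

```python
# Sketch: collect named entities mentioned at least `min_count` times in a
# text corpus (e.g. a Wikipedia or Pile dump), to reuse as search queries.
# Assumes spaCy and its en_core_web_sm model are installed.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def frequent_entities(texts, min_count=5):
    counts = Counter()
    for doc in nlp.pipe(texts, batch_size=64):  # stream for large corpora
        counts.update(ent.text for ent in doc.ents)
    return [ent for ent, n in counts.items() if n >= min_count]

print(frequent_entities(["Berlin is the capital of Germany.",
                         "Berlin hosts the Berlinale."], min_count=2))
```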
I expect that if this starts working at scale, the search engines will actively work on banning us, and they will succeed.
Let's try Tor.
Happy to help out on the tracker side of this :) Sounds very promising |
Let's fill in all the details of this idea by @christophschuhmann. For example: