High-frequency queries from https://wikidocumentaries-demo.wmcloud.org/ #1548

Open · hannahbast opened this issue Oct 11, 2024 · 6 comments

@hannahbast (Member)

@tuukka For some time now, we have been receiving a very high volume of queries (ten per second and more, around the clock) from https://wikidocumentaries-demo.wmcloud.org. This looks like either disrespectful crawlers or bots, or a script gone astray. Can you please check?

And are you using some caching mechanism to avoid issuing too many queries?


tuukka commented Oct 13, 2024

I haven't made any changes lately, but now that I check, Google seems to have suddenly resumed crawling the site (using the user agent GoogleOther), at a rate of some 100K pages per day, i.e. roughly one page per second.

There's one quick way to reduce the number of requests I send in your direction: I've now disabled the retry logic (exponential backoff) that I had in place for the error responses 400 and 429 (nondeterministic out-of-memory errors, etc.). Does that help?
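
For context, here is a minimal sketch of the kind of retry logic described above (not the actual Wikidocumentaries code; the endpoint handling, headers, and limits are assumptions). With `retryOnError` set to `false`, a 400 or 429 response is returned immediately instead of being retried with exponential backoff:

```typescript
// Sketch only: fetch a SPARQL query result, optionally retrying 400/429
// responses with exponential backoff. Disabling the retries (retryOnError =
// false) means each failed page load causes exactly one request.
async function fetchSparql(
  endpoint: string,
  query: string,
  retryOnError = false,
  maxRetries = 3,
): Promise<Response> {
  let attempt = 0;
  while (true) {
    const response = await fetch(
      `${endpoint}?query=${encodeURIComponent(query)}`,
      { headers: { Accept: "application/sparql-results+json" } },
    );
    if (response.ok) return response;
    const retriable = response.status === 400 || response.status === 429;
    if (!retryOnError || !retriable || attempt >= maxRetries) return response;
    // Exponential backoff: wait 1 s, 2 s, 4 s, ... before trying again.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    attempt++;
  }
}
```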

I'm not using any caching mechanism, and I don't think it would help: each crawled page refers to a different Wikidata item, so the SPARQL queries also differ in which item they query.

I'm currently sending as many queries as there are facets in the UI (currently three). I don't know whether joining these queries into one would work or whether it would just cause more out-of-memory errors.

@hannahbast (Member, Author)

@tuukka Thank you for your reply! It's now back to one query every 1-2 seconds, which is reasonable.

But I am curious: can you tell from your logs how many queries per day come from actual users and how many come from bots?


tuukka commented Oct 16, 2024

> But I am curious: can you tell from your logs how many queries per day come from actual users and how many come from bots?

I don't see the requests that the clients make towards QLever, so I have to use page loads as an approximation.

I ran a simple analysis for yesterday's page load logs:

- 52162 loads (55.47%) by GoogleOther
- 38120 loads (40.55%) by other bots
- 3746 loads (3.98%) by actual users

Regarding the actual users, I had a look at the referer data:

- 311 loads (8.30%) came directly from the wikis (mainly Commons)
- 960 loads (25.63%) had no referer data
- 2441 loads (65.16%) came from navigation within the service
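
For illustration, a rough sketch of this kind of breakdown, assuming a standard "combined" access-log format where the request, referer, and user agent are the quoted fields; the bot patterns and file name are assumptions, not the actual Wikidocumentaries setup:

```typescript
// Sketch: classify page loads by user agent from an access log.
import { readFileSync } from "node:fs";

const lines = readFileSync("access.log", "utf8").split("\n").filter(Boolean);
const counts = { googleOther: 0, otherBots: 0, users: 0 };

for (const line of lines) {
  // In the combined log format the quoted fields are: request, referer, user agent.
  const quoted = [...line.matchAll(/"([^"]*)"/g)].map((m) => m[1]);
  const userAgent = quoted[2] ?? "";
  if (/GoogleOther/i.test(userAgent)) counts.googleOther++;
  else if (/bot|crawler|spider/i.test(userAgent)) counts.otherBots++;
  else counts.users++;
}

const total = lines.length;
for (const [label, count] of Object.entries(counts)) {
  console.log(`${label}: ${count} loads (${((100 * count) / total).toFixed(2)}%)`);
}
```

A referer breakdown works the same way, just keyed on `quoted[1]` instead of the user agent.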

The query numbers you see on your side should be at least 3 times higher than these page loads (one query per facet). Now that I think of it, if I fetched the facets only after fetching the images, I could skip the facet queries whenever I get 0 images. Further, whenever the number of images is small, I could perhaps compute the facets fully client-side.
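
A hypothetical sketch of that ordering, with placeholder names (`fetchImages`, `fetchFacetsFromEndpoint`, `computeFacetsLocally`) and an arbitrary threshold, not the actual Wikidocumentaries code:

```typescript
// Sketch: fetch images first, skip the facet queries when there are no
// results, and compute facets client-side when the result set is small.
interface ImageResult { url: string; depicts: string[] }
interface Facet { name: string; counts: Map<string, number> }

// Placeholder stubs standing in for the real query code.
async function fetchImages(itemId: string): Promise<ImageResult[]> {
  return []; // would run the image SPARQL query for this item
}
async function fetchFacetsFromEndpoint(itemId: string): Promise<Facet[]> {
  return []; // would run one SPARQL query per facet (currently three)
}
function computeFacetsLocally(images: ImageResult[]): Facet[] {
  return []; // would aggregate facet values from the already-fetched images
}

const CLIENT_SIDE_FACET_LIMIT = 200; // threshold is an arbitrary assumption

async function loadItemPage(itemId: string) {
  const images = await fetchImages(itemId);
  if (images.length === 0) {
    return { images, facets: [] as Facet[] }; // no facet queries at all
  }
  const facets =
    images.length <= CLIENT_SIDE_FACET_LIMIT
      ? computeFacetsLocally(images) // no extra SPARQL queries
      : await fetchFacetsFromEndpoint(itemId); // one query per facet, as today
  return { images, facets };
}
```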

@hannahbast (Member, Author)

@tuukka Coming back to this after some time. It seems wrong that only 4% of the queries come from users and 96% from bots. What could we do about it?

This seems to be an important question when running a service like the WDQS.


tuukka commented Nov 21, 2024

Could you clarify what you mean by 'wrong'? E.g. "there should be more users", "users should cause more queries", "Google should index a smaller proportion of the site", "the site should generate about 96% fewer queries to stay within a per-site reasonable-use quota".

Are you familiar with the traffic statistics of Wikidata and WDQS? I'm not, but I would guess that it's typical for a large proportion of the traffic to be bots, given the "long-tail" nature of Wikidata (and Wikipedia): there are lots of items that some, but not many, humans are interested in.

Do you have a way to track the cost of queries, as opposed to just their number? I would guess the queries for the less interesting items are cheaper, as there is less data and there are fewer images about them. (See also my previous comment for some ideas on how I could reduce the number of these cheaper queries.)

@hannahbast (Member, Author)

@tuukka I am familiar with the statistics for https://dblp.org and https://sparql.dblp.org, which get millions of requests per day, both from users and from bots/scripts.

What I meant by "wrong" is the following: it costs hardware and energy to answer queries. That is fine when a human being asks or triggers a query. It seems "wrong" when bots trigger complex queries and then don't really do anything with the results. Then it's just machines wasting energy for nothing. It's easy to think of scenarios (bots asking other bots asking other bots ... to do complex things) where enormous amounts of energy are wasted without any human ever in the loop.

Two ways come to mind to deal with this: (1) for the bots, have static versions of the pages, which lag somewhat behind in up-to-dateness; (2) for the bots, have reduced versions of the pages. For https://dblp.org, we do (1). For example, https://dblp.org/pid/b/HannahBast.html is a static HTML page, which is produced daily from numerous queries to numerous systems.
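
As an illustration of approach (1), here is a hedged sketch of serving pre-rendered pages to bots, written as hypothetical Express middleware; the paths, bot patterns, and route are assumptions, not how https://dblp.org or Wikidocumentaries actually implement it:

```typescript
// Sketch: bots get a daily pre-rendered snapshot, humans get the live app.
import express from "express";
import { existsSync } from "node:fs";
import path from "node:path";

const app = express();
const BOT_PATTERN = /bot|crawler|spider|GoogleOther/i;
const STATIC_DIR = "/var/cache/prerendered"; // refreshed by a daily batch job

app.get("/wiki/:itemId", (req, res, next) => {
  const isBot = BOT_PATTERN.test(req.get("user-agent") ?? "");
  const staticPage = path.join(STATIC_DIR, `${req.params.itemId}.html`);
  if (isBot && existsSync(staticPage)) {
    // Bots get yesterday's snapshot; no live SPARQL queries are triggered.
    res.sendFile(staticPage);
    return;
  }
  next(); // humans (and items without a snapshot) fall through to the live app
});
```

Approach (2) would differ only in what the batch job writes out: a reduced version of the page rather than the full rendered one.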

It's a challenging topic, I know. Some time ago, we had problems with a browser extension running amok and bringing parts of https://dblp.org down. We couldn't meaningfully block it because it was a browser extension (so the requests came from the IP addresses of the users of the extension). The final fix was not on our side, but in the extension's script, which was changed to do something more meaningful.
