High-frequency queries from https://wikidocumentaries-demo.wmcloud.org/ #1548
I haven't made any changes lately, but now that I check, Google seems to have suddenly resumed crawling the site (using the user agent …).

There's one quick way to reduce the number of requests I send in your direction: I've now disabled the retry logic (exponential backoff) that I had in place for the error responses 400 and 429 (indeterministic out-of-memory errors etc.). Does that help at all?

I'm not using any caching mechanism, and I don't think it would help: all the crawled pages refer to different Wikidata items, so the SPARQL queries also differ in the Wikidata item they query. I'm currently sending as many queries as there are facets in the UI (currently three) - I don't know whether joining these queries into one would work, or whether it would cause more out-of-memory errors.
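For illustration, here is a minimal sketch of the kind of exponential-backoff retry described above, written as a fetch wrapper. The function name, parameters and retry limits are assumptions for this sketch, not the actual Wikidocumentaries client code:

```typescript
// Sketch only: retry a request on HTTP 400/429 with exponential backoff.
// Disabling this kind of logic means each failed page load produces one
// request instead of several.
async function fetchWithBackoff(
  url: string,
  init: RequestInit = {},
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, init);
    const retryable = response.status === 400 || response.status === 429;
    if (!retryable || attempt >= maxRetries) {
      return response;
    }
    // Exponential backoff: wait 1s, 2s, 4s, ... before retrying.
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```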
@tuukka Thank you for your reply! It's now back to one query every 1-2 seconds, which is reasonable. But I am curious: can you tell from your logs how many queries per day come from actual users and how many come from bots?
I don't see the requests the clients make towards QLever, so I have to use page loads as an approximation. I ran a simple analysis of yesterday's page-load logs.

Regarding actual users, I had a look at the referer data. The numbers you see should be at least 3 times higher (based on the number of facets).

Now that I think of it, if I fetched the facets only after fetching the images, I could skip the facet queries whenever I get 0 images (see the sketch below). Further, whenever the number of images is small, I could perhaps compute the facets fully client-side.
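A minimal sketch of that idea, assuming hypothetical query helpers (the function names and facet list are placeholders, not existing Wikidocumentaries code): run the facet queries only when the image query returns a non-empty result.

```typescript
// Sketch only: skip the per-facet SPARQL queries when there are no images.
type QueryImages = (wikidataId: string) => Promise<string[]>;
type QueryFacet = (wikidataId: string, facet: string) => Promise<Record<string, number>>;

// Stand-ins for the three facets currently shown in the UI.
const FACETS = ["facetA", "facetB", "facetC"];

async function loadItemPage(
  wikidataId: string,
  queryImages: QueryImages,
  queryFacet: QueryFacet,
) {
  const images = await queryImages(wikidataId); // one SPARQL query
  if (images.length === 0) {
    // No images for this item: the three facet queries can be skipped entirely.
    return { images, facets: {} as Record<string, Record<string, number>> };
  }
  const entries = await Promise.all(
    FACETS.map(async (facet) => [facet, await queryFacet(wikidataId, facet)] as const),
  );
  return { images, facets: Object.fromEntries(entries) };
}
```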
@tuukka Coming back to this after some time. It seems wrong that only 4% of the queries come from users and 96% come from bots. What could we do about it? This seems to be an important question when running a service like the WDQS.
Could you clarify what you mean by 'wrong'? E.g. "there should be more users", "users should cause more queries", "Google should index a smaller proportion of the site", or "the site should generate about 96% fewer queries to stay within a per-site reasonable-use quota".

Are you familiar with the traffic statistics of Wikidata and WDQS? I'm not, but I would guess that it's typical for a large proportion to be bots, given the "long-tail" nature of Wikidata (and Wikipedia): there are lots of items that some, but not many, humans are interested in.

Do you have a way to track the cost of queries, as opposed to just their number? I guess the queries for the less interesting items will be cheaper, as there will be less data and fewer images about them. (See also my previous comment for some ideas on how I could reduce the number of these cheaper queries.)
@tuukka I am familiar with the statistics for https://dblp.org and https://sparql.dblp.org, which get millions of requests per day, both from users and from bots/scripts.

What I meant by "wrong" is the following: it costs hardware and energy to answer queries. This is fine when a human being asks or triggers a query. It seems "wrong" when bots trigger complex queries and then don't really do anything with the results; then it's just machines wasting energy for nothing. It's easy to think of scenarios (bots asking other bots asking other bots ... to do complex things) where enormous amounts of energy are wasted without any human ever in the loop.

Two ways come to mind to deal with this: (1) for the bots, have static versions of the pages, which lag behind in terms of up-to-dateness; (2) for the bots, have reduced versions of the pages. For https://dblp.org, we do (1). For example, https://dblp.org/pid/b/HannahBast.html is a static HTML page, which is produced daily from numerous queries to numerous systems (a rough sketch of this kind of setup follows below).

It's a challenging topic, I know. Some time ago, we had problems with a browser extension running amok and bringing parts of https://dblp.org down. We couldn't meaningfully block it because it was a browser extension (so the requests came from the IP addresses of the extension's users). The final fix was not on our side, but in the script itself, which was changed to do something more meaningful.
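A minimal sketch of the bot-handling idea, assuming a Node-based front end: detect crawlers by User-Agent and serve a pre-rendered snapshot instead of triggering live SPARQL queries. The bot pattern, file path and port are illustrative assumptions, not dblp's or Wikidocumentaries' actual setup.

```typescript
// Sketch only: serve a pre-rendered static snapshot to known crawlers,
// so they never trigger live SPARQL queries.
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";

const BOT_PATTERN = /googlebot|bingbot|yandexbot|duckduckbot|baiduspider/i;

const server = createServer(async (req, res) => {
  const userAgent = req.headers["user-agent"] ?? "";
  if (BOT_PATTERN.test(userAgent)) {
    // Crawlers get a snapshot regenerated periodically by a batch job,
    // analogous to dblp's daily static pages.
    const html = await readFile("./snapshots/index.html", "utf8");
    res.writeHead(200, { "content-type": "text/html" }).end(html);
    return;
  }
  // Human visitors get the normal dynamic page (placeholder here).
  res.writeHead(200, { "content-type": "text/html" }).end("<!-- dynamic page -->");
});

server.listen(8080);
```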
@tuukka For some time now, we have been receiving a very high volume of queries (ten queries per second and more, around the clock) from https://wikidocumentaries-demo.wmcloud.org. This looks like either disrespectful crawlers or bots, or a script gone astray. Can you please check?
And are you using some caching mechanism to avoid issuing too many queries?