Investigate missing Top 1k home pages #222

Open
rviscomi opened this issue Nov 13, 2023 · 12 comments

Comments

@rviscomi
Member

For some reason, HTTP Archive has no data for ~90 of the top 1k sites in CrUX:

https://allegro.pl/
https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://auth.uber.com/
https://betproexch.com/
https://blaze-1.com/
https://bollyflix.tax/
https://brainly.com.br/
https://brainly.in/
https://brainly.lat/
https://chance.enjoy.point.auone.jp/
https://cookpad.com/
https://detail.chiebukuro.yahoo.co.jp/
https://e-okul.meb.gov.tr/
https://filmyfly.club/
https://game.hiroba.dpoint.docomo.ne.jp/
https://gamewith.jp/
https://gdz.ru/
https://hdhub4u.markets/
https://hentailib.me/
https://holoo.fun/
https://ifilo.net/
https://indianhardtube.com/
https://login.caixa.gov.br/
https://m.autoplius.lt/
https://m.fmkorea.com/
https://m.happymh.com/
https://m.pgf-asw0zz.com/
https://m.porno365.pics/
https://m.skelbiu.lt/
https://mangalib.me/
https://mangalivre.net/
https://mnregaweb4.nic.in/
https://myaadhaar.uidai.gov.in/
https://myreadingmanga.info/
https://namu.wiki/
https://nhattruyenplus.com/
https://nhentai.net/
https://onlar.az/
https://page.auctions.yahoo.co.jp/
https://passbook.epfindia.gov.in/
https://pixbet.com/
https://pmkisan.gov.in/
https://quizlet.com/
https://schools.emaktab.uz/
https://schools.madrasati.sa/
https://scratch.mit.edu/
https://supjav.com/
https://tathya.uidai.gov.in/
https://uchi.ru/
https://v.daum.net/
https://vl2.xvideos98.pro/
https://vlxx.moe/
https://www.avto.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.cardmarket.com/
https://www.chegg.com/
https://www.cityheaven.net/
https://www.deviantart.com/
https://www.dns-shop.ru/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.hotstar.com/
https://www.idealista.com/
https://www.idealista.it/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.makemytrip.com/
https://www.mediaexpert.pl/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.nettruyenus.com/
https://www.ninisite.com/
https://www.nitrotype.com/
https://www.otvfoco.com.br/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.shahrekhabar.com/
https://www.si.com/
https://www.studocu.com/
https://www.thenetnaija.net/
https://www.varzesh3.com/
https://www.wannonce.com/
https://www.wayfair.com/
https://www.winzogames.com/
https://www.zillow.com/
https://znanija.com/
WITH ha AS (
  SELECT
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01' AND
    rank = 1000 AND
    is_root_page
),

crux AS (
  SELECT
    DISTINCT CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = '2023-09-01' AND
    rank = 1000
)


SELECT
  page
FROM
  crux
LEFT OUTER JOIN
  ha
USING
  (page)
WHERE
  ha.page IS NULL
ORDER BY
  page

This has been pretty consistent:

Row | date       | top_1k
1   | 2023-01-01 | 918
2   | 2023-02-01 | 922
3   | 2023-03-01 | 910
4   | 2023-04-01 | 924
5   | 2023-05-01 | 916
6   | 2023-06-01 | 913
7   | 2023-07-01 | 908
8   | 2023-08-01 | 917
9   | 2023-09-01 | 910
10  | 2023-10-01 | 908
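
Counts like these can be reproduced with a variation of the query above. A sketch follows; the one-month offset between the CrUX ranking month and the HA crawl date mirrors the dates used above and is an assumption:

-- Sketch: per month, count how many CrUX top-1k origins have a matching HA home page.
WITH crux AS (
  SELECT DISTINCT date, CONCAT(origin, '/') AS page
  FROM `chrome-ux-report.materialized.metrics_summary`
  WHERE rank = 1000 AND date BETWEEN '2022-12-01' AND '2023-09-01'
),

ha AS (
  SELECT DISTINCT date, page
  FROM `httparchive.all.pages`
  WHERE rank = 1000 AND is_root_page AND date BETWEEN '2023-01-01' AND '2023-10-01'
)

SELECT
  ha.date,
  COUNT(DISTINCT ha.page) AS top_1k
FROM crux
JOIN ha
ON ha.page = crux.page
  AND ha.date = DATE_ADD(crux.date, INTERVAL 1 MONTH)  -- crawl uses the previous CrUX month's ranking (assumption)
GROUP BY ha.date
ORDER BY ha.date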

And here are the top 1k home pages that have consistently been missing all year (202301–202309):

https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://betproexch.com/
https://brainly.in/
https://chance.enjoy.point.auone.jp/
https://detail.chiebukuro.yahoo.co.jp/
https://game.hiroba.dpoint.docomo.ne.jp/
https://login.caixa.gov.br/
https://m.fmkorea.com/
https://m.happymh.com/
https://mangalib.me/
https://mangalivre.net/
https://myreadingmanga.info/
https://namu.wiki/
https://page.auctions.yahoo.co.jp/
https://pmkisan.gov.in/
https://quizlet.com/
https://scratch.mit.edu/
https://v.daum.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.deviantart.com/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.idealista.com/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.ninisite.com/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.wannonce.com/

Are the tests erroring out? Are they blocking us?

@tunetheweb
Member

Just trying the first one (https://allegro.pl/), it also fails in the public WebPageTest with a 403:
https://www.webpagetest.org/result/231113_AiDcFK_98G/1/details/#waterfall_view_step1

When I try with curl, it asks for JS to be enabled and depends on something loading https://ct.captcha-delivery.com/c.js

So I would guess it's just blocked.

@max-ostapenko
Contributor

max-ostapenko commented Sep 20, 2024

I looked into this for the September crawl, and the number of missing pages increased to 20%.

There are other reasons besides a 403 response, such as redirects.

The debug information in the staging dataset would help us see expected vs. unexpected cases.

@pmeenan do we log reasons for not collecting crawl data that we could JOIN here?

@tunetheweb
Member

Are those sites also available as their own pages?

@max-ostapenko
Contributor

I've found https://www.clever.com/ in CrUX, but not the other one.
So yeah, we're losing some pages here (maybe deduplicating in BQ post-crawl could be an alternative).

And could we also run the crawl in a headful browser? I believe it would fix a big part of the blocked pages.
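
A post-crawl dedup could look roughly like this. It is only a sketch: the '$._final_url' payload path is an assumption and the real field name may differ.

-- Sketch: keep one crawled row per final landing page per client.
-- The '$._final_url' payload path is an assumption.
SELECT page, final_page
FROM (
  SELECT
    page,
    JSON_VALUE(payload, '$._final_url') AS final_page,
    ROW_NUMBER() OVER (
      PARTITION BY client, JSON_VALUE(payload, '$._final_url')
      ORDER BY page
    ) AS rn
  FROM `httparchive.all.pages`
  WHERE date = '2024-09-01' AND is_root_page
)
WHERE rn = 1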

@tunetheweb
Member

Well, if a page is popular enough then I would expect it to be in CrUX. It's weird that the pre-redirect one is in CrUX at all, but maybe they just moved to www this month? Or it's used for some other non-public reason (e.g. clever.com/intranet).

We do have WPTS in our user agent header, so we're easy to block for people that don't want crawlers/bots. We could remove that, but we'd rather be a good net citizen and be honest about this.

Another issue is that we only crawl from US data centres, which can affect things. For example, www.bbc.co.uk redirects to www.bbc.com for US visitors (which is in CrUX separately anyway).

So I'm not sure moving to a headed browser would fix most of the things that are blocking us.

@max-ostapenko
Contributor

You're right, the user agent is more obvious than headless signals.

I'd still like to get a report of crawling 'failures' at the page level, so that we can have an overview of the reasons for the discrepancies instead of checking them one by one manually.

@pmeenan
Member

pmeenan commented Sep 20, 2024

FWIW, we crawl with a full, headful Chrome browser running in an Xorg virtual framebuffer.

We could upload failed pages to a different table so we'd at least have the test results, if that would help diagnose the issues, or I could just log them somewhere along with the test IDs.

Blocking visitors coming from Google Cloud isn't necessarily surprising, since not many actual users will be browsing from a cloud provider. If we can find out which CDN they are using, we can see if that CDN classifies us appropriately.

@max-ostapenko
Contributor

A table, preferably, plus the requests data.
I understand the Wappalyzer test runs on these pages too, so identifying security technologies should be easy.

And I hope to be able to categorize and match the reasons (a rough version is sketched below):

  • page redirect,
  • bot blocking (e.g. captcha, rules),
  • other.
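
A minimal sketch of that categorization, assuming the crawl_failures.pages table described later in this thread (the mapping from result codes to categories is illustrative only):

-- Sketch: bucket failed top-1k pages into redirect / bot blocking / other.
-- The result-code-to-category mapping here is illustrative, not definitive.
SELECT
  CASE JSON_VALUE(payload, '$._result')
    WHEN '888' THEN 'page redirect'
    WHEN '403' THEN 'bot blocking'
    WHEN '429' THEN 'bot blocking'
    ELSE 'other'
  END AS reason,
  COUNT(*) AS num
FROM `httparchive.crawl_failures.pages`
WHERE date = '2024-10-01' AND rank = 1000
GROUP BY reason
ORDER BY num DESC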

@pmeenan
Member

pmeenan commented Oct 2, 2024

In theory, the next crawl should write results for failed tests to tables in the crawl_failures dataset (pages, requests and parsed_css). It will only write a failed test after the 2 retries (it won't write the transient failures that succeed when retried).

The HARs and full test results will also be uploaded so we can look at the raw WPT tests as needed (in theory - that part of the pipeline doesn't have a good way to test until the crawl starts).
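
Once those tables exist, a query along these lines could pull failed pages together with their captured requests. A sketch, assuming crawl_failures.requests mirrors the date/client/page/url columns of the existing requests table:

-- Sketch: list the requests captured for each failed top-1k page.
-- Assumes crawl_failures.requests shares the schema of httparchive.all.requests.
SELECT
  p.page,
  JSON_VALUE(p.payload, '$._result') AS result,
  r.url AS request_url
FROM `httparchive.crawl_failures.pages` AS p
JOIN `httparchive.crawl_failures.requests` AS r
ON r.date = p.date
  AND r.client = p.client
  AND r.page = p.page
WHERE p.date = '2024-10-01' AND p.rank = 1000
LIMIT 100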

@pmeenan
Member

pmeenan commented Oct 13, 2024

Looks like the crawl_failures dataset is populating. So far it looks like all of the failures are legit.

SELECT
  JSON_VALUE(payload, '$._result') as result,
  count(*) as num
FROM `httparchive.crawl_failures.pages`
WHERE
  date = "2024-10-01" AND
  rank = 1000
GROUP BY JSON_VALUE(payload, '$._result')
ORDER BY num DESC
Row | result | num
1   | 888    | 53
2   | 403    | 33
3   | 404    | 19
4   | 429    | 2
5   | 500    | 1

888 is a custom result code we use when the final page has a different origin from the page we navigated to (redirected).
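
To spot-check that, something like this could list where the 888 pages ended up. A sketch: the '$._final_url' payload path is an assumption and the real field name may differ:

-- Sketch: for 888 failures, compare the requested page with its final URL.
-- The '$._final_url' payload path is an assumption.
SELECT
  page,
  JSON_VALUE(payload, '$._final_url') AS final_url
FROM `httparchive.crawl_failures.pages`
WHERE
  date = '2024-10-01' AND
  rank = 1000 AND
  JSON_VALUE(payload, '$._result') = '888'
LIMIT 50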

Without the rank filter, the main errors are similar but the ratios change a bit (and the long tail of error codes is long):

Row | result | num
1   | 888    | 308821
2   | 404    | 300417
3   | 403    | 153971
4   | 500    | 8438
5   | 400    | 6477

@max-ostapenko
Contributor

Seems aligned with the ranks (mobile here):

[Screenshot: missing-page counts by rank, mobile]

But ~1M pages are still missing somehow:

WITH pages AS (
  SELECT
    page
  FROM `httparchive.all.pages`
  WHERE date = '2024-10-01'
    AND is_root_page
    AND client = 'mobile'
), fails AS (
  SELECT
    page
  FROM `httparchive.crawl_failures.pages`
  WHERE date = '2024-10-01'
    AND is_root_page
    AND client = 'mobile'
), crux AS (
  SELECT
    origin || "/" AS page
  FROM `chrome-ux-report.experimental.global`
  WHERE yyyymm = 202409
)

SELECT
  crux.page
FROM crux
LEFT JOIN pages
ON crux.page = pages.page
LEFT JOIN fails
ON crux.page = fails.page
WHERE pages.page IS NULL AND fails.page IS NULL

Examples from rank = 1000:
1  | https://lms.lausd.net/
2  | https://www.kidsa-z.com/
3  | https://gall.dcinside.com/
4  | https://clever.com/
5  | https://meet.google.com/
6  | https://www.starfall.com/
7  | https://www.zearn.org/
8  | https://search.naver.com/
9  | https://www.nitrotype.com/
10 | https://www.fmkorea.com/
11 | https://app.hubspot.com/
12 | https://www.inven.co.kr/
13 | https://myapps.classlink.com/
14 | https://www.ixl.com/
15 | https://www.naver.com/
16| https://www.bilibili.com/

@pmeenan
Member

pmeenan commented Oct 15, 2024

Possibly something is causing the failures to not get logged on the 3rd retry all the time, but spot-checking a few of those, it looks like they are mostly redirects to different origins as well.
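
One way to spot-check that at scale rather than one by one: for each missing CrUX origin, look for a crawled page on the same registrable domain, which would suggest a redirect to a different origin. A sketch reusing the CTEs from the query above; NET.REG_DOMAIN is standard BigQuery:

-- Sketch: how many of the missing CrUX origins have some other origin on the same
-- registrable domain in the crawl (a likely cross-origin redirect target)?
WITH crux AS (
  SELECT origin || "/" AS page
  FROM `chrome-ux-report.experimental.global`
  WHERE yyyymm = 202409
), pages AS (
  SELECT page
  FROM `httparchive.all.pages`
  WHERE date = '2024-10-01' AND is_root_page AND client = 'mobile'
), fails AS (
  SELECT page
  FROM `httparchive.crawl_failures.pages`
  WHERE date = '2024-10-01' AND is_root_page AND client = 'mobile'
), missing AS (
  SELECT crux.page
  FROM crux
  LEFT JOIN pages ON crux.page = pages.page
  LEFT JOIN fails ON crux.page = fails.page
  WHERE pages.page IS NULL AND fails.page IS NULL
), crawled_domains AS (
  SELECT DISTINCT NET.REG_DOMAIN(page) AS domain
  FROM pages
)

SELECT
  COUNTIF(crawled_domains.domain IS NOT NULL) AS same_domain_crawled,
  COUNTIF(crawled_domains.domain IS NULL) AS no_crawled_page_on_domain
FROM missing
LEFT JOIN crawled_domains
ON NET.REG_DOMAIN(missing.page) = crawled_domains.domain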
