Reorganize the BigQuery datasets to be more efficient #15

rviscomi · 2022-02-01T18:13:25Z

rviscomi · 2022-05-10T18:49:27Z

Brainstorming new tables/schemas.

httparchive
- all
  - pages
  - requests

`httparchive.all.pages`

All-in-one table containing all page-based results from all crawls for all clients.

Partition by:

date (required)

Cluster by:

client
is_root_page
rank

Combines data from multiple existing tables:

pages
summary_pages
blink_features
technologies
lighthouse

Additional enhancements:

New custom_metrics field
features and technologies fields are well-defined repeated structs
- Combining categories into a single repeated STRING field, rather than creating new technology+category pairs, which tripped up many users (including myself)
summary field is JSON-encoded so we can support all historical summary_pages stats if we'd like and add/remove anything over time without breaking the schema

Schema

[
    {
        "name": "date",
        "type": "DATE",
        "mode": "REQUIRED",
        "description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
    },
    {
        "name": "client",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "Test environment: desktop or mobile"
    },
    {
        "name": "page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the page being tested"
    },
    {
        "name": "is_root_page",
        "type": "BOOLEAN",
        "mode": "REQUIRED",
        "description": "Whether the page is the root of the origin"
    },
    {
        "name": "root_page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the root page being tested, the origin followed by /"
    },
    {
        "name": "rank",
        "type": "INTEGER",
        "mode": "NULLABLE",
        "description": "Site popularity rank, from CrUX"
    },
    {
        "name": "wptid",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "ID of the WebPageTest results"
    },
    {
        "name": "payload",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded WebPageTest results for the page"
    },
    {
        "name": "summary",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded summarization of the page-level data"
    },
    {
        "name": "custom_metrics",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded test results of the custom metrics"
    },
    {
        "name": "lighthouse",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded Lighthouse report"
    },
    {
        "name": "features",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Blink features detected at runtime (see https://chromestatus.com/features)",
        "fields": [
            {
                "name": "feature",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature name"
            },
            {
                "name": "id",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature ID"
            },
            {
                "name": "type",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Blink feature type (css, default)"
            }
        ]
    },
    {
        "name": "technologies",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Technologies detected at runtime (see https://www.wappalyzer.com/)",
        "fields": [
            {
                "name": "technology",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Name of the detected technology"
            },
            {
                "name": "categories",
                "type": "STRING",
                "mode": "REPEATED",
                "description": "List of categories to which this technology belongs"
            },
            {
                "name": "info",
                "type": "STRING",
                "mode": "REPEATED",
                "description": "Additional metadata about the detected technology, ie version number"
            }
        ]
    },
    {
        "name": "metadata",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Additional metadata about the test"
    }
]

Example queries

Median page weight over time

SELECT
  # No need to parse the date or client from _TABLE_SUFFIX
  date,
  client,
  # Existing summary stats can be extracted from the JSON-encoded `summary` field
  APPROX_QUANTILES(JSON_VALUE(summary, '$.bytesTotal'), 1000)[500] AS median_page_weight
FROM
  # New table address
  `httparchive.all.pages`
WHERE
  # The table is partitioned by date, so we don't incur any costs for data older than 2020
  date >= '2020-01-01' AND
  # Only measure root/home pages for consistency
  is_root_page
GROUP BY
  date,
  client
ORDER BY
  date,
  client

`httparchive.all.requests`

All-in-one table containing all request-based results for all crawls and clients.

Partition by:

date (required)

Cluster by:

client
is_root_page
is_main_document
type

Combines data from multiple existing tables:

requests
summary_requests

Additional enhancements:

Top-level request_headers and response_headers fields that are well-defined repeated structs of key/value pairs, borrowing from the almanac.summary_response_bodies table
summary field is JSON-encoded so we can support all historical summary_requests stats if we'd like and add/remove anything over time without breaking the schema

Schema

[
    {
        "name": "date",
        "type": "DATE",
        "mode": "REQUIRED",
        "description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
    },
    {
        "name": "client",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "Test environment: desktop or mobile"
    },
    {
        "name": "page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the page being tested"
    },
    {
        "name": "is_root_page",
        "type": "BOOLEAN",
        "mode": "NULLABLE",
        "description": "Whether the page is the root of the origin."
    },
    {
        "name": "root_page",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the root page being tested"
    },
    {
        "name": "url",
        "type": "STRING",
        "mode": "REQUIRED",
        "description": "The URL of the request"
    },
    {
        "name": "is_main_document",
        "type": "BOOLEAN",
        "mode": "REQUIRED",
        "description": "Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"
    },
    {
        "name": "type",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Simplified description of the type of resource (script, html, css, text, other, etc)"
    },
    {
        "name": "index",
        "type": "INTEGER",
        "mode": "NULLABLE",
        "description": "The sequential 0-based index of the request"
    },
    {
        "name": "payload",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded WebPageTest result data for this request"
    },
    {
        "name": "summary",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "JSON-encoded summarization of request data"
    },
    {
        "name": "request_headers",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Request headers",
        "fields": [
            {
                "name": "name",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Request header name"
            },
            {
                "name": "value",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Request header value"
            }
        ]
    },
    {
        "name": "response_headers",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "Response headers",
        "fields": [
            {
                "name": "name",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Response header name"
            },
            {
                "name": "value",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "Response header value"
            }
        ]
    },
    {
        "name": "response_body",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "Text-based response body"
    }
]

Example queries

rviscomi · 2022-05-22T16:00:45Z

Generating May 2022 data for httparchive.all.pages:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
  let value = $[`_${name}`];
  if (typeof value == 'string') {
    try {
      value = JSON.parse(value);
    } catch (e) {
      // The value is not a JSON string.
    }
  }
  return [name, value];
})));
''';

CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
  function getFeatureNames(featureMap, featureType) {
    try {
      return Object.entries(featureMap).map(([key, value]) => {
        // After Feb 2020 keys are feature IDs.
        if (value.name) {
          return {'feature': value.name, 'type': featureType, 'id': key};
        }
        // Prior to Feb 2020 keys fell back to IDs if the name was unknown.
        if (idPattern.test(key)) {
          return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
        }
        // Prior to Feb 2020 keys were names by default.
        return {'feature': key, 'type': featureType, 'id': ''};
      });
    } catch (e) {
      return [];
    }
  }
  
  var $ = JSON.parse(payload);
  if (!$._blinkFeatureFirstUsed) return [];
  
  var idPattern = new RegExp('^Feature_(\\d+)$');
  return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
    .concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
    .concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';

WITH pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    JSON_VALUE(payload, '$._metadata.rank') AS rank,
    JSON_VALUE(payload, '$._metadata.crawl_depth') = '0' AS is_root_page,
    JSON_VALUE(payload, '$._testID') AS wptid,
    GET_CUSTOM_METRICS(payload) AS custom_metrics,
    JSON_QUERY(payload, '$._metadata') AS metadata,
    payload
  FROM
    `httparchive.pages.2022_05_01_*`
# TODO: Backfill when the summary pages is ready.
/* ), summary_pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    TO_JSON_STRING(summary_pages) AS summary
  FROM
    `httparchive.summary_pages.2022_05_01_*` AS summary_pages */
), loose_technologies AS (
    SELECT
      _TABLE_SUFFIX AS client,
      url AS page,
      STRUCT(
        app AS technology,
        ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
        ARRAY_AGG(info) AS info
      ) AS technology
    FROM
      `httparchive.technologies.2022_05_01_*`
    GROUP BY
      client,
      page,
      app
), techs AS (
  SELECT
    client,
    page,
    ARRAY_AGG(technology) AS technologies
  FROM
    loose_technologies
  GROUP BY
    client,
    page
), lh AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    report AS lighthouse
  FROM
    `httparchive.lighthouse.2022_05_01_*`
)

SELECT
  DATE('2022-05-01') AS date,
  client,
  page,
  is_root_page,
  rank,
  wptid,
  payload,
  # TODO: Update when the summary pipeline completes successfully.
  '' AS summary,
  custom_metrics,
  lighthouse,
  GET_FEATURES(payload) AS features,
  technologies,
  metadata
FROM
  pages
LEFT JOIN
  techs
USING
  (client, page)
LEFT JOIN
  lh
USING
  (client, page)

rviscomi · 2022-05-24T01:14:11Z

Pages data for April 2022. Changed rank to an INT64 field and added the required root_page field.

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
  let value = $[`_${name}`];
  if (typeof value == 'string') {
    try {
      value = JSON.parse(value);
    } catch (e) {
      // The value is not a JSON string.
    }
  }
  return [name, value];
})));
''';

CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
  function getFeatureNames(featureMap, featureType) {
    try {
      return Object.entries(featureMap).map(([key, value]) => {
        // After Feb 2020 keys are feature IDs.
        if (value.name) {
          return {'feature': value.name, 'type': featureType, 'id': key};
        }
        // Prior to Feb 2020 keys fell back to IDs if the name was unknown.
        if (idPattern.test(key)) {
          return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
        }
        // Prior to Feb 2020 keys were names by default.
        return {'feature': key, 'type': featureType, 'id': ''};
      });
    } catch (e) {
      return [];
    }
  }
  
  var $ = JSON.parse(payload);
  if (!$._blinkFeatureFirstUsed) return [];
  
  var idPattern = new RegExp('^Feature_(\\d+)$');
  return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
    .concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
    .concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';

WITH pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    SAFE_CAST(JSON_VALUE(payload, '$._metadata.rank') AS INT64) AS rank,
    COALESCE(JSON_VALUE(payload, '$._metadata.crawl_depth') = '0', TRUE) AS is_root_page,
    COALESCE(JSON_VALUE(payload, '$._metadata.root_page_url'), url) AS root_page,
    JSON_VALUE(payload, '$._testID') AS wptid,
    GET_CUSTOM_METRICS(payload) AS custom_metrics,
    JSON_QUERY(payload, '$._metadata') AS metadata,
    payload
  FROM
    `httparchive.pages.2022_04_01_*`
), summary_pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    rank,
    TO_JSON_STRING(summary_pages) AS summary
  FROM
    `httparchive.summary_pages.2022_04_01_*` AS summary_pages
), loose_technologies AS (
    SELECT
      _TABLE_SUFFIX AS client,
      url AS page,
      STRUCT(
        app AS technology,
        ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
        ARRAY_AGG(info) AS info
      ) AS technology
    FROM
      `httparchive.technologies.2022_04_01_*`
    GROUP BY
      client,
      page,
      app
), techs AS (
  SELECT
    client,
    page,
    ARRAY_AGG(technology) AS technologies
  FROM
    loose_technologies
  GROUP BY
    client,
    page
), lh AS (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    report AS lighthouse
  FROM
    `httparchive.lighthouse.2022_04_01_*`
)

SELECT
  DATE('2022-04-01') AS date,
  client,
  page,
  is_root_page,
  root_page,
  COALESCE(pages.rank, summary_pages.rank) AS rank,
  wptid,
  payload,
  summary,
  custom_metrics,
  lighthouse,
  GET_FEATURES(payload) AS features,
  technologies,
  metadata
FROM
  pages
LEFT JOIN
  summary_pages
USING
  (client, page)
LEFT JOIN
  techs
USING
  (client, page)
LEFT JOIN
  lh
USING
  (client, page)

rviscomi · 2022-05-24T16:24:05Z

Ran into OOM issues with generating the all.requests table directly in BQ.

Since we'll eventually need to generate tables in the new all dataset from Dataflow, I decided to prototype how that pipeline would work. See HTTPArchive/bigquery#170. Summary data is omitted for now since we'll be merging pipelines soon anyway.

I've successfully tested it on a single HAR and now attempting a full-scale test on the entire 2022_05_12 crawl, both desktop and mobile concurrently. I expect it to take 6-8 hours (running for ~2 so far).

tunetheweb · 2022-10-24T13:57:00Z

The all.requests table is clustered by the following columns:

client
is_root_page
is_main_document
type

As well as implicitly by the date partitioning on this table.

We're only allowed 4 cluster fields meaning we cannot add any more. However I'm wondering if is_main_document is that useful? It might be much more beneficial to cluster based on page to allow us to cheaply join to the all.pages table. We could still basically filter for is_main_document relatively cheaply by looking at html type and then filtering to is_main_document (appreciate not ALL main documents will be type HTML but most should be). Changing to page would allow us to filter based on rank, and technologies (again both via join) and similarly other items.

WDYT?

rviscomi · 2022-10-26T00:04:20Z

is_main_document is equivalent to the old firstHtml field, so having that parity might make migrating to the new schema easier. But I'm happy to explore other faster/cheaper schemas. Would you be willing to create a temp table with the proposed schema and benchmark a few test queries against each?

brendankenny · 2022-11-16T17:54:21Z

If I query for

SELECT
  JSON_VALUE(lighthouse, '$.finalUrl') AS final_url,
FROM `httparchive.all.pages`
WHERE
  date = "2022-10-01"
  AND client = 'mobile'
  AND is_root_page

The BQ UI warns me I'm going to be querying 13.65 TB, vs 4.63 TB for the (currently) equivalent

SELECT
  JSON_VALUE(report, '$.finalUrl') AS final_url,
FROM `httparchive.lighthouse.2022_10_01_mobile`

Is there some partitioning/clustering trick I should be doing to bring the all number more in line with the lighthouse number? Or more that could be done to the table structure? That would be an unfortunate cost increase for querying the raw data.

tunetheweb · 2022-11-16T17:58:02Z

Weirdly I don't see that and they look similar cost to me

Sure you're "holding it right"?

brendankenny · 2022-11-16T18:01:02Z

huh, so weird. Pasting into a fresh editor, same 13.65 TB

could being a member of the project somehow affect that? I don't see how, but not sure what other difference there could be.

rviscomi · 2022-11-16T18:07:13Z

The old BQ estimation behavior was to not take clustered fields into consideration. I'm also seeing the new behavior showing the 4.67 TB estimate, so maybe they've only partially rolled it out.

brendankenny · 2022-11-16T18:36:25Z

via the CLI:

bq query --nouse_legacy_sql --dry_run 'SELECT
  JSON_VALUE(lighthouse, "$.finalUrl") AS final_url,
FROM `httparchive.all.pages`
WHERE
  date = "2022-10-01"
  AND client = "mobile"
  AND is_root_page'

also gives me "this query will process upper bound of 15012318480052 bytes of data" (aka 13.65 TiB). Can either of you try it with a secondary cloud project to see if you get the same numbers?

If everyone eventually gets the lower number, obviously this isn't an actual issue.

tunetheweb · 2022-11-16T18:40:16Z

I see 13.65TB when using a random account. Hmm....

rviscomi · 2022-11-16T19:09:52Z

If you run the query it should still only process 4.67 TiB

max-ostapenko · 2024-10-18T21:17:21Z

Well, here is the current plan for this reorganization: https://docs.google.com/document/d/1kNCpVgvC0W77HJzXiNvlchuGjCCY6uymIfv0nH10tF0/edit

rviscomi added this to the M3: Improving the analysis pipeline milestone Feb 1, 2022

rviscomi self-assigned this Feb 1, 2022

rviscomi assigned paulcalvano and giancarloaf Feb 1, 2022

This was referenced May 6, 2022

M0 5/pipeline poc #34

Merged

Add the ability to test secondary pages #12

Closed

rviscomi mentioned this issue May 13, 2022

Omit secondary pages from canonical BQ tables #51

Closed

4 tasks

rviscomi mentioned this issue May 26, 2022

Add import_all pipeline #75

Merged

rviscomi mentioned this issue Jun 22, 2022

Keep secondary page data separate from home page data (for now) #93

Closed

giancarloaf mentioned this issue Aug 29, 2022

Bug/backfill fixes #136

Merged

giancarloaf mentioned this issue Sep 15, 2022

Create a process to backfill all tables from HARs older than March 2022 #138

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reorganize the BigQuery datasets to be more efficient #15

Reorganize the BigQuery datasets to be more efficient #15

rviscomi commented Feb 1, 2022 •

edited by max-ostapenko

Loading

rviscomi commented May 10, 2022 •

edited

Loading

rviscomi commented May 22, 2022

rviscomi commented May 24, 2022 •

edited

Loading

rviscomi commented May 24, 2022 •

edited

Loading

tunetheweb commented Oct 24, 2022 •

edited

Loading

rviscomi commented Oct 26, 2022

brendankenny commented Nov 16, 2022

tunetheweb commented Nov 16, 2022

brendankenny commented Nov 16, 2022

rviscomi commented Nov 16, 2022

brendankenny commented Nov 16, 2022

tunetheweb commented Nov 16, 2022

rviscomi commented Nov 16, 2022

max-ostapenko commented Oct 18, 2024

Reorganize the BigQuery datasets to be more efficient #15

Reorganize the BigQuery datasets to be more efficient #15

Comments

rviscomi commented Feb 1, 2022 • edited by max-ostapenko Loading

rviscomi commented May 10, 2022 • edited Loading

httparchive.all.pages

Schema

Example queries

httparchive.all.requests

Schema

Example queries

rviscomi commented May 22, 2022

rviscomi commented May 24, 2022 • edited Loading

rviscomi commented May 24, 2022 • edited Loading

tunetheweb commented Oct 24, 2022 • edited Loading

rviscomi commented Oct 26, 2022

brendankenny commented Nov 16, 2022

tunetheweb commented Nov 16, 2022

brendankenny commented Nov 16, 2022

rviscomi commented Nov 16, 2022

brendankenny commented Nov 16, 2022

tunetheweb commented Nov 16, 2022

rviscomi commented Nov 16, 2022

max-ostapenko commented Oct 18, 2024

rviscomi commented Feb 1, 2022 •

edited by max-ostapenko

Loading

rviscomi commented May 10, 2022 •

edited

Loading

`httparchive.all.pages`

`httparchive.all.requests`

rviscomi commented May 24, 2022 •

edited

Loading

rviscomi commented May 24, 2022 •

edited

Loading

tunetheweb commented Oct 24, 2022 •

edited

Loading