Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate views on all dataset #142

Open
tunetheweb opened this issue Sep 21, 2022 · 7 comments
Open

Investigate views on all dataset #142

tunetheweb opened this issue Sep 21, 2022 · 7 comments

Comments

@tunetheweb
Copy link
Member

tunetheweb commented Sep 21, 2022

@rviscomi , @paulcalvano , @pmeenan as discussed on last call.

I ran the following:

CREATE VIEW `httparchive.scratchspace.test_view_2022_08_01_mobile` AS (
  SELECT
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2022-08-01' AND
    client = 'mobile'
);

Then when I run this:

SELECT page FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

It uses up 1.55 GB - so clearly isn't including the payload (in this view), and definitely not the Lighthouse results (not in this view, but in underlying table).

image

Changing it to this uses up 1.76GB (so slightly more as two columns):

SELECT page, rank FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

Changing it to either of these uses up 5.38TB:

SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`
SELECT * FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

This SQL is only 409.83 MB:

SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile` WHERE rank = 1000

So it looks to me like it IS using the clustering on the underlying table and not charging you the full amount each time, as you thought it would @rviscomi ? Unless I'm missing something?

This means we could in theory move everything to the new all schema (including backfilling) and replace all the existing old tables with views to maintain backwards compatibility for old data, but also have it in the new schema format, without duplicating data.

We could in theory also keep creating these old-style views every month so no one has to learn the new schema - they just get cheaper queries. That would also potentially reduce risk of someone accidentally querying all columns, or large date ranges, and running up huge bills too easily.

However I did discover one flaw in this plan in that you can't do this:

SELECT
  _TABLE_SUFFIX AS client,
  page,
  rank,
  payload
FROM
  `httparchive.scratchspace.test_view_2022_08_01_*`
WHERE
  rank = 1000

As you get this error 😔:

Views cannot be queried through prefix. First view httparchive:scratchspace.test_view_2022_08_01_desktop.

Turns out this limitation is documented:

Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.

Still, non-wildcard queries would work on the views.

@tunetheweb
Copy link
Member Author

You could do this:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_view_2022_08_01_star` AS (
  SELECT
    client AS TABLE_SUFFIX,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2022-08-01'
);

But three downsides to this hack:

  1. The view is called ..._star instead of ...*
  2. The table suffix column is called TABLE_SUFFIX instead of _TABLE_SUFFIX
  3. Wouldn't scale to include alternative *s

So not a drop in replacement.

Plus it's just an awful, awful, hack.

@tunetheweb
Copy link
Member Author

@tunetheweb
Copy link
Member Author

I'm gonna close this. The Views look like a good option for latest dataset as being discussed in #141

Given the wildcard issue, they don't look like a good option to fully replicate our current schema, from the new schema. There's also a bigger question as to whether we want to maintain the old schema indefinitely even if we could do that with views.

I do have one further discussion point for potentially using views to restrict access to some of the more expensive queries which I've raised #149 for, but think that's a separate issue to this one, which was to investigate what views meant: did they indeed lead to cheaper queries (yes!) and are there any limits on them (yes - wildcards), so let's close this out.

@tunetheweb
Copy link
Member Author

Reopening. FYI @rviscomi

@tunetheweb tunetheweb reopened this Jun 20, 2024
@rviscomi
Copy link
Member

I'd be interested to use views for the July dataset and seeing what breaks

@tunetheweb
Copy link
Member Author

Quite a few queries in our BigQuery repo that fed our site for a start.

@max-ostapenko
Copy link
Contributor

@tunetheweb you reopened it a few months ago, is there still some research to be done?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants