Improved documentation for the reports #114

Open
Themanwithoutaplan opened this issue Jul 4, 2022 · 1 comment

@Themanwithoutaplan

I know that the per-website, per-month focus of the old httparchive.org is no longer maintained, but it's still useful to be able to provide historical analysis for groups of websites, so reliable metadata is important. Also, as changes are made to the dataset, such as the important introduction of secondary pages, it's good to be able to distinguish between them:

  • There are multiple summary reports for May 2022 (the 1st and the 12th) with no explanation of the difference, though it looks like the second includes only secondary pages. A field to indicate homepage or secondary page would be useful.

  • The report label should match the title. The reports with the title 2022-01-05 have the label 2022-01-12. Even though the label has always been slightly misleading, consistency is important and useful in reporting.

@rviscomi
Member

rviscomi commented Jul 7, 2022

Thanks for flagging this @Themanwithoutaplan!

We're in an awkward transition period while we get the new pipeline working. It should only be a few more weeks, but I'll try to clarify how things work now and where we're going.

The way things are currently organized:

  • tables dated YYYY_MM_01 (first day of the month) always contain home page data only
  • other tables dated YYYY_MM_DD (so far just 2022_05_12 and 2022_06_09) contain data for home and secondary pages
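As a rough illustration of the naming rule above, here is a minimal Python sketch that classifies a table by its date suffix. The helper names (`parse_table_date`, `page_scope`) and the returned labels are hypothetical and not part of any HTTP Archive tooling; the only rule encoded is the one stated above, that `_01` tables hold home pages only and other dates hold home and secondary pages.

```python
import re
from datetime import date

# Matches a table date suffix such as "2022_05_12" (YYYY_MM_DD).
_TABLE_DATE = re.compile(r"(\d{4})_(\d{2})_(\d{2})")


def parse_table_date(suffix: str) -> date:
    """Parse a YYYY_MM_DD table suffix into a date (hypothetical helper)."""
    match = _TABLE_DATE.fullmatch(suffix)
    if match is None:
        raise ValueError(f"not a YYYY_MM_DD table suffix: {suffix!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def page_scope(suffix: str) -> str:
    """Return "home" for first-of-month tables, else "home+secondary".

    This encodes only the rule described in this thread; it does not
    check whether a table with that suffix actually exists.
    """
    crawl_date = parse_table_date(suffix)
    return "home" if crawl_date.day == 1 else "home+secondary"
```

For example, `page_scope("2022_05_01")` classifies the table as home-page-only, while `page_scope("2022_05_12")` and `page_scope("2022_06_09")` are classified as containing both home and secondary pages.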

The way things will work with the new pipeline:

  • tables dated YYYY_MM_01 will continue to contain home page data
  • all home and secondary page data will be written to the new httparchive.all.pages and httparchive.all.requests tables
  • other tables dated YYYY_MM_DD will be migrated to the all dataset and deprecated

Long term:

  • all tables dated YYYY_MM_DD (and 01) will be migrated and deprecated
  • the all dataset will be the canonical place to access all current and historical page/request data

We're taking care to ensure that existing queries on 01 tables continue to work as expected by only including home page data. To start querying secondary page data, one would need to consciously switch to a non-standard table date like 2022_06_09 or the new all dataset.

I'll leave this issue open until the new pipeline is working and we migrate the secondary page tables.

“A field to indicate homepage or secondary page would be useful.”

Yes, this is exactly what we have planned for the all dataset; we're naming it is_root_page.
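To make the planned field concrete, here is a hedged Python sketch that builds a query string against the new table. Only the table name `httparchive.all.pages` and the field `is_root_page` come from this thread; the `date` and `page` column names and the `build_pages_query` helper are assumptions for illustration, and the final schema may differ.

```python
from datetime import date


def build_pages_query(crawl_date: str, root_only: bool = True) -> str:
    """Build a SQL string selecting home or secondary pages for one crawl.

    Assumptions (not confirmed in this thread): the table has a `date`
    column identifying the crawl and a `page` column holding the URL.
    """
    # Validate and normalise the date so arbitrary text never reaches the SQL.
    parsed = date.fromisoformat(crawl_date)
    flag = "TRUE" if root_only else "FALSE"
    return (
        "SELECT page, is_root_page\n"
        "FROM `httparchive.all.pages`\n"
        f"WHERE date = '{parsed.isoformat()}'\n"
        f"  AND is_root_page = {flag}"
    )
```

Calling `build_pages_query("2022-06-09")` yields a query restricted to home pages, and passing `root_only=False` restricts it to secondary pages instead. Filtering on the crawl date also matters for cost if the table is partitioned by it, which is likewise an assumption here.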
