I know that the per-website, per-month focus of the old httparchive.org is no longer maintained, but since it's still useful to be able to provide historical analysis for groups of websites, reliable metadata is important. Also, as changes are made to the dataset, such as the important introduction of secondary pages, it's good to be able to distinguish these:
there are multiple summary reports for May 2022 (1st and 12th) with no explanation of the difference, though it looks like the second includes only secondary pages. A field to indicate homepage or secondary page would be useful.
alignment of the report label with the title. The reports with the title 2022-01-05 have the label 2022-01-12. Even though the label has always been slightly misleading, consistency is important and useful in reporting.
We're in an awkward transition period while we get the new pipeline working. It should only be a few more weeks but I'll try to clarify how things work now and where we're going.
The way things are currently organized:
tables dated YYYY_MM_01 (first day of the month) always contain home page data only
other tables dated YYYY_MM_DD (so far just 2022_05_12 and 2022_06_09) contain data for home and secondary pages
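As a rough illustration of the current convention, the table date alone tells you what a table contains. The sketch below is only an assumption-laden helper, not part of HTTP Archive's tooling: the function name `table_scope` and the returned labels are hypothetical, and the rule it encodes applies only to the transitional layout described above.

```python
import re

# Hypothetical helper: the name and return labels are illustrative only.
# It encodes the transitional rule described above:
#   YYYY_MM_01         -> home page data only
#   any other YYYY_MM_DD -> home and secondary page data
_SUFFIX = re.compile(r"^(\d{4})_(\d{2})_(\d{2})$")


def table_scope(suffix: str) -> str:
    """Classify a table date suffix such as '2022_05_12'."""
    match = _SUFFIX.match(suffix)
    if match is None:
        raise ValueError(f"expected a YYYY_MM_DD suffix, got {suffix!r}")
    day = match.group(3)
    return "home" if day == "01" else "home+secondary"
```

For example, `table_scope("2022_05_01")` gives `"home"`, while `table_scope("2022_05_12")` and `table_scope("2022_06_09")` give `"home+secondary"`.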
The way things will work with the new pipeline:
tables dated YYYY_MM_01 will continue to contain home page data
all home and secondary page data will be written to the new httparchive.all.pages and httparchive.all.requests tables
other tables dated YYYY_MM_DD will be migrated to the all dataset and deprecated
Long term:
all tables dated YYYY_MM_DD (and 01) will be migrated and deprecated
the all dataset will be the canonical place to access all current and historical page/request data
We're taking care to ensure that existing queries on 01 tables continue to work as expected by only including home page data. To start querying secondary page data, one would need to consciously switch to a non-standard table date like 2022_06_09 or the new all dataset.
I'll leave this issue open until the new pipeline is working and we migrate the secondary page tables.
A field to indicate homepage or secondary page would be useful.
Yes, this is exactly what we have planned for the all dataset. We're naming the field is_root_page.
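Once the all dataset is available, separating home pages from secondary pages might look something like the sketch below. It only builds a query string and does not run anything. The table name `httparchive.all.pages` and the field `is_root_page` come from this thread; the `date` and `page` column names and the helper `root_page_query` are assumptions for illustration, so check the final schema before relying on them.

```python
from datetime import date


def root_page_query(crawl_date: str, root_only: bool = True) -> str:
    """Build a SQL string selecting home (root) or secondary pages.

    Hypothetical helper. `date` and `page` are assumed column names;
    only `is_root_page` and the table name are taken from the thread.
    """
    # Validate the input so that only a well-formed ISO date is interpolated.
    parsed = date.fromisoformat(crawl_date)
    flag = "TRUE" if root_only else "FALSE"
    return (
        "SELECT page, is_root_page\n"
        "FROM `httparchive.all.pages`\n"
        f"WHERE date = '{parsed.isoformat()}'\n"
        f"  AND is_root_page = {flag}"
    )
```

Calling `root_page_query("2022-06-09")` would target home pages only, and passing `root_only=False` would target secondary pages instead.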