Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine how release statistics should be stored #30

Open
preaction opened this issue Jun 20, 2018 · 2 comments
Open

Determine how release statistics should be stored #30

preaction opened this issue Jun 20, 2018 · 2 comments

Comments

@preaction
Copy link
Member

Presently, the per-release summary statistics are stored in two tables: release_data and release_summary. These two tables have the exact same schema, but slightly different uses:

  • The release_data table stores one row per test report. One of the pass, fail, na, unknown columns will have a 1 in it.
  • The release_summary table stores one row per distribution version. The pass, fail, na, and unknown columns will have the count of each test report grade.

In essence, the release_summary table is the sum of all the related release_data rows (this is also technically a duplication of the cpanstats table (which, technically is a duplication of test_report table with data extracted from the JSON)).

Now that we have a dedicated database server with a few more CPU cycles than we had previously, we can look at how we store this data: Do we need the intermediate state of the release_data table, or can we just store the release_summary? Or, should we avoid the further step of summing the values and storing them in release_summary and just keep release_data? Or can we get rid of these tables entirely and just build this data on-the-fly from cpanstats?

@barbie
Copy link

barbie commented Jun 20, 2018 via email

@preaction
Copy link
Member Author

Yep, presently the release summary APIs use the release_summary table directly, and that's still how MetaCPAN is getting their data. Even if the underlying data storage is changed, the APIs must remain the same.

But, we may not need to generate and store the derived data anymore. I find it highly unlikely, for the same reasons you mentioned, but it might be possible to do all of this on-the-fly.

But, if it ends up that we do need to generate and store the derived data, we might not need both steps to be stored. It might be possible to drop release_data and keep release_summary. When new reports come in, they increment the correct value in the release_summary table. I'm not sure I like this idea, because it's a lot more work for the database to validate that the summary data is correct.

More likely, the release summary data may be able to be generated on-the-fly from the release_data table using a bunch of SUM(...) functions in MySQL. It's easy to validate release_data against cpanstats: Two simple queries (one 1:1 LEFT JOIN, one 1:1 RIGHT JOIN) can find what records are incorrect.

I'm not confident that any improvements can be made here, but it's something we can look into. The smaller the schema we have, the easier it will be to start deriving all this data for other languages (like Perl 6). Also, if we can derive this data easily, we can offer more query options from the API side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants