Report Z-test Validations & DropInfo Self-adoption #53

Open
GalvinGao opened this issue Apr 27, 2022 · 1 comment

Comments

@GalvinGao
Member

GalvinGao commented Apr 27, 2022

Currently there are only a few simple, if not naive, approaches to report validation. We previously proposed a Z-test mechanism and implemented it on the previous backend; however, due to the MongoDB evaluation bottleneck on that backend, we unfortunately had to disable the feature because of its heavy performance cost.

The backend-next project now has both the flexibility and the performance headroom to let us relaunch such a mechanism for checking reports.
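To illustrate the general idea only (not the actual implementation that was on the old backend), a minimal sketch in Go of what a per-stage check could look like: a batch of incoming reports is compared against the drop rate accumulated so far, and flagged when the z statistic exceeds a chosen critical value. The function name and threshold below are assumptions for illustration.

```go
package validation

import "math"

// OneProportionZ is a hypothetical helper: it compares the drop rate observed
// in a new batch of reports (drops out of times) against the drop rate p0
// accumulated in the dataset so far, returning the z statistic.
func OneProportionZ(drops, times int64, p0 float64) float64 {
	phat := float64(drops) / float64(times)
	se := math.Sqrt(p0 * (1 - p0) / float64(times))
	return (phat - p0) / se
}

// A batch would be flagged (or rejected) when |z| exceeds a chosen critical
// value, e.g. 3.29 for a two-sided test at alpha = 0.001; the exact threshold
// is an open design choice.
```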

Moreover, DropInfo is currently decided somewhat artificially and may not fit the first several hundred reports, since we cannot predict in advance the finite set of possible drops. This has previously caused several issues where DropInfo was not applied properly at first, potentially introducing deviations into the dataset. Although we have been fixing those manually, doing so is time-consuming and far from an optimal solution. Therefore, there could also be a mechanism where DropInfo itself adapts continuously as the report dataset grows. However, the implementation details of this self-adoption are still a huge topic to discuss.
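Purely as a hedged sketch of one possible shape for the self-adoption rule (the types and thresholds below are illustrative, not existing project code): an item not yet listed in DropInfo is held in a pending set and only adopted once it has been reported by enough distinct accounts, so a single bad source cannot extend the drop set on its own.

```go
package validation

// pendingDrop tracks an item that has been reported for a stage but is not
// yet part of that stage's DropInfo.
type pendingDrop struct {
	reports  int
	accounts map[int64]struct{} // distinct reporter account IDs
}

func (p *pendingDrop) observe(accountID int64) {
	if p.accounts == nil {
		p.accounts = make(map[int64]struct{})
	}
	p.accounts[accountID] = struct{}{}
	p.reports++
}

// shouldAdopt reports whether the item has been seen in at least minReports
// reports from at least minAccounts distinct accounts; both thresholds are
// illustrative placeholders, not decided values.
func (p *pendingDrop) shouldAdopt(minReports, minAccounts int) bool {
	return p.reports >= minReports && len(p.accounts) >= minAccounts
}
```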


Just to note here: these statistics-based tests are all fairly susceptible to an attack where someone submits several hundred to a thousand false reports in the very first moments after a stage opens, causing the dataset to converge to a skewed result. Any reports afterwards would then be considered invalid, and the true reports would be rejected. Such an attack could be mitigated by randomly sampling reports across different accounts and IPs, and by carefully designing the threshold at which the Z-test kicks in, to minimize the effect the attack could have.
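One straightforward variant of the "picking reports across different accounts/IPs" mitigation is to cap (or randomly subsample) what each source contributes to the baseline before the Z-test is even evaluated. A minimal capping sketch, with a simplified stand-in Report type:

```go
package validation

// Report is a simplified stand-in for the stored report record.
type Report struct {
	AccountID string
	IP        string
}

// capPerSource keeps at most maxPerSource reports per account when building
// the baseline dataset, so a burst of fabricated reports submitted right
// after a stage opens cannot dominate the converged drop rate. The same idea
// applies to capping by IP.
func capPerSource(reports []Report, maxPerSource int) []Report {
	seen := make(map[string]int)
	kept := make([]Report, 0, len(reports))
	for _, r := range reports {
		if seen[r.AccountID] >= maxPerSource {
			continue
		}
		seen[r.AccountID]++
		kept = append(kept, r)
	}
	return kept
}
```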

@FlandiaYingman
Collaborator

FlandiaYingman commented Apr 28, 2022

From the point of view of attack prevention, we could run a two-proportion z-test, which lets us compare two proportions (in our case, reports categorized by IP or account) to see whether they are the same. If the result shows that the two proportions are not the same, we have reason to suspect that one of them is false.
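A minimal sketch of the two-proportion z-test with a pooled standard error (the function name is illustrative, not existing project code): compare, say, the reports from one IP or account against everyone else's.

```go
package validation

import "math"

// TwoProportionZ compares the drop rates of two groups of reports, e.g. the
// reports from one IP/account versus all other reports: x1 drops out of n1
// attempts versus x2 drops out of n2 attempts. It returns the z statistic and
// the two-sided p-value under the null hypothesis that both groups share the
// same true drop rate.
func TwoProportionZ(x1, n1, x2, n2 float64) (z, p float64) {
	p1 := x1 / n1
	p2 := x2 / n2
	pooled := (x1 + x2) / (n1 + n2) // pooled proportion under the null
	se := math.Sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
	z = (p1 - p2) / se
	// Two-sided p-value via the standard normal CDF.
	p = 2 * (1 - 0.5*(1+math.Erf(math.Abs(z)/math.Sqrt2)))
	return z, p
}
```

A small p-value (for example below 0.001) would then be grounds to flag one of the two groups for review rather than to reject it outright.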

Also, instead of stating the null hypothesis based on the first n reports, we could state it based on the first n groups of reports, where reports are grouped by IP, account, or report method (recognition or manual). That way, our dataset wouldn't be affected by a huge number of reports from a single source.
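To make the "first n groups" idea concrete, a hedged sketch (the Report fields and helper are assumptions for illustration): aggregate reports into one proportion per source before stating the null hypothesis, so each source contributes a single data point.

```go
package validation

// Report is re-declared here so the snippet stands alone; the fields are a
// simplified stand-in for the stored report record.
type Report struct {
	AccountID string
	IP        string
	Method    string // "recognition" or "manual"
	Drops     int
	Times     int
}

// groupProportions aggregates reports into one drop rate per group, keyed by
// whatever source attribute the caller chooses (account, IP, or method), so
// a single prolific source contributes only one data point to the baseline.
func groupProportions(reports []Report, key func(Report) string) map[string]float64 {
	drops := make(map[string]int)
	times := make(map[string]int)
	for _, r := range reports {
		k := key(r)
		drops[k] += r.Drops
		times[k] += r.Times
	}
	rates := make(map[string]float64, len(times))
	for k, t := range times {
		if t > 0 {
			rates[k] = float64(drops[k]) / float64(t)
		}
	}
	return rates
}

// Example: per-account drop rates would be
//   groupProportions(reports, func(r Report) string { return r.AccountID })
```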
