Testing strategy #520
Comments
Thank you for the detailed write-up - I will need some time to digest it, but I just want to first express that we appreciate how much thought you put into it.
The goal of this issue is to kick off the discussion on the testing strategy. Hopefully we can unify names/concepts.
Our current testing landscape
Some of the advantages/disadvantages listed above are specific to our domain (like the fact that certain properties of the output are only valid after several steps of execution), but others are typical characteristics of E2E/big tests.
Our desired testing landscape
It is generally considered that a healthy codebase has a test suite with a pyramid shape [1, 2]: a large base of unit tests that give quick developer feedback, are easy to understand, and catch regressions early; a smaller set of integration tests that verify the connections between objects; and finally an even smaller set of big, slow, but user-facing E2E tests.
In systems design, it is always important to identify and differentiate between what Brooks [4] coined accidental complexity and essential complexity. Accidental complexity has to do with details specific to solving a problem when executing it on a computer: for example, the fact that Blackjax runs on top of JAX and the code needs to be jittable for performance. This is complementary to essential complexity, which is inherent to the problem being solved: for example, the fact that in order to run HMC we need to differentiate, or the fact that we need to sample in order to implement the RMH acceptance rule. This is closely related to the concept of "domain logic": the set of ideas that we want our code to represent, and the operations related to them. In our case, these are the components used to build samplers and the way they interoperate. We would like to be able to clearly represent samplers in code, and thus be able to test them in a simple and fast way, trying to isolate the essential complexity from the accidental complexity.
Theory to practice
All of this looks great in theory, but in practice Blackjax is a library with a heavy dependence on JAX. We want all our code to be jittable and runnable on devices such as GPUs, so we need to balance testing the domain logic with making sure certain accidental properties hold.
Quoting from [5]:
JAX relies extensively on code transformation and compilation, meaning that it can be hard to ensure that code is properly tested. For instance, just testing a python function using JAX code will not cover the actual code path that is executed when jitted, and that path will also differ whether the code is jitted for CPU, GPU, or TPU. This has been a source of obscure and hard to catch bugs where XLA changes would lead to undesirable behaviours that however only manifest in one specific code transformation.
To account for this situation, the JAX ecosystem provides Chex, which, among other things, exposes chex.TestCase. It forces code compilation and emulates the presence/absence of devices such as GPUs. Since the compilation is just-in-time, JAX requires us to run the code on JAX-compilable inputs in order to extract the shapes, and then compiles the code for those shapes. As a consequence, Test Double [3] libraries like MagicMock can't be used at all in Chex tests, since their objects are not JAX-compilable.
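As an illustration, here is a minimal sketch of a Chex-based test (a toy stand-in, not Blackjax code): chex.variants runs the same test body with and without jit, and self.variant wraps the function under test accordingly.

```python
import chex
import jax
import jax.numpy as jnp


class OneStepTest(chex.TestCase):
    @chex.variants(with_jit=True, without_jit=True)
    def test_step_runs_under_jit(self):
        # Toy stand-in for a sampler step: one gradient-descent step on sum(x^2).
        def one_step(position, step_size):
            grad = jax.grad(lambda x: jnp.sum(x ** 2))(position)
            return position - step_size * grad

        step = self.variant(one_step)  # jitted or not, depending on the variant
        new_position = step(jnp.ones(3), 0.1)
        chex.assert_shape(new_position, (3,))
```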
A real example

Let's try to analyze this code, which corresponds to the NUTS kernel.
Let's start by noting that kernel is a factory: it builds another object, in this case the one_step function. But that is not its only responsibility (I am talking about the SOLID notion of responsibility here): it also has one_step defined inside it, so all the behavior of one_step is also defined inside kernel. Now, if we think about one_step, its main goal should be, given an input, to execute one NUTS step. But here, again, there are more responsibilities inside:
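To make the structure concrete, here is a heavily simplified, purely illustrative sketch of that shape (the concrete list of responsibilities from the original write-up is not reproduced here):

```python
def kernel(integrator, divergence_threshold=1000):
    # Responsibility 1: act as a factory and build `one_step`.

    def one_step(rng_key, state, logdensity_fn, step_size, inverse_mass_matrix):
        # Responsibilities 2..n, all hidden in this closure: build the metric,
        # instantiate the integrator, build the proposal generator, expand the
        # trajectory, apply the acceptance rule. None of these collaborators
        # can be observed or substituted from a test.
        ...

    return one_step
```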
The main problem with this code from a testing standpoint is that everything listed here ends up being tested together: there is no way to decouple the step test from an execution of the gaussian_euclidean code. Naturally, this leads to bigger tests, closer to E2E than to unit tests. Searching for usages of this code, it seems to be exercised in test_compilation and test_benchmarks. If we remove those two, the file has 45% test coverage.
Design suggestion 1.
In this case, the suggestion is to split the building/factory code from the sampling code proper.
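The snippet that accompanied this suggestion is not reproduced here; a sketch of what the split could look like (all names are illustrative, not an actual Blackjax API): the factory is reduced to wiring, and the stepping logic becomes a free function that receives its collaborators as arguments, so each side can be tested on its own.

```python
from typing import Callable, NamedTuple

import jax.numpy as jnp


class IntegratorState(NamedTuple):
    """Illustrative stand-in for the real integrator state."""
    position: jnp.ndarray
    momentum: jnp.ndarray


def nuts_kernel(rng_key, position, momentum_generator: Callable,
                proposal_generator: Callable, step_size: float):
    """Sampling logic only: advance the chain by one step.

    The collaborators are arguments, so tests can substitute a stub, fake,
    or mock for either of them.
    """
    momentum = momentum_generator(rng_key, position)
    integrator_state = IntegratorState(position, momentum)
    return proposal_generator(rng_key, integrator_state, step_size)


def build_nuts_kernel(momentum_generator, proposal_generator):
    """Building/factory logic only: wire the already-built collaborators."""

    def one_step(rng_key, position, step_size):
        return nuts_kernel(rng_key, position, momentum_generator,
                           proposal_generator, step_size)

    return one_step
```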
So this code looks far more testable than before; let's try to do that. Let's first look at the components the file has.
I’ll start with nuts_kernel. I could first try to build a test like this one:
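(The original snippet did not survive the copy; the following sketch, built on the illustrative nuts_kernel above, shows the kind of test meant here.)

```python
import jax
import jax.numpy as jnp


def test_nuts_kernel_returns_the_proposal_and_info():
    expected_position = jnp.array([1.0, 2.0])
    dummy_info = object()                       # Dummy: returned, never inspected

    def momentum_generator(rng_key, position):  # Stub: canned momentum
        return jnp.ones_like(position)

    def proposal_generator(rng_key, integrator_state, step_size):  # Stub: canned answer
        return expected_position, dummy_info

    position, info = nuts_kernel(jax.random.PRNGKey(0), jnp.zeros(2),
                                 momentum_generator, proposal_generator, 1e-3)

    assert (position == expected_position).all()
    assert info is dummy_info
```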
Note that I have used a Stub for proposal_generator and a Dummy for the expected info.
It works, and the assertions make sense. But there is room to break the code, so let's try that:
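(The broken snippet from the original write-up is not reproduced here; a regression of the kind described could look like this, purely for illustration.)

```python
def nuts_kernel(rng_key, position, momentum_generator, proposal_generator, step_size):
    momentum = momentum_generator(rng_key, position)
    # Regression: the freshly drawn momentum is silently dropped.
    integrator_state = IntegratorState(position, jnp.zeros_like(momentum))
    return proposal_generator(rng_key, integrator_state, step_size)
```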
I have included a regression inside the code on purpose. If we rerun the above test, no failure is raised. There is no assertion whatsoever that the integrator_state passed to the proposal_generator, or any other parameter, is correctly computed, which is part of this function's responsibility.
There are two solutions to this type of problem: use Chex runtime assertions so that proposal_generator becomes a Mock, or use a Fake.
With Fake
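(A sketch under the same illustrative names, since the original snippet is not reproduced.) The Fake is a drastically simplified but real proposal generator: it takes one jump along the momentum it receives, so a wrongly computed integrator_state changes the output and the assertion fails.

```python
import chex
import jax
import jax.numpy as jnp


def test_nuts_kernel_with_a_fake_proposal_generator():
    def momentum_generator(rng_key, position):   # Stub: canned momentum
        return jnp.ones_like(position)

    # Fake: a real (if trivial) proposal that actually uses the state it receives.
    def fake_proposal_generator(rng_key, integrator_state, step_size):
        new_position = integrator_state.position + step_size * integrator_state.momentum
        return new_position, None

    position, _ = nuts_kernel(jax.random.PRNGKey(0), jnp.zeros(2),
                              momentum_generator, fake_proposal_generator, 0.5)

    chex.assert_trees_all_close(position, 0.5 * jnp.ones(2))
```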
With Mock
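(Same caveat: a sketch, not the original snippet.) The Mock verifies the interaction itself: Chex assertions inside the test double check the integrator_state it receives before returning a canned answer.

```python
import chex
import jax
import jax.numpy as jnp


def test_nuts_kernel_passes_the_correct_integrator_state():
    expected_momentum = jnp.ones(2)

    def momentum_generator(rng_key, position):   # Stub: canned momentum
        return expected_momentum

    # Mock: asserts on the arguments it receives, then returns a canned answer.
    def mock_proposal_generator(rng_key, integrator_state, step_size):
        chex.assert_trees_all_close(integrator_state.position, jnp.zeros(2))
        chex.assert_trees_all_close(integrator_state.momentum, expected_momentum)
        return integrator_state.position, None

    nuts_kernel(jax.random.PRNGKey(0), jnp.zeros(2),
                momentum_generator, mock_proposal_generator, 1e-3)
```

Note that this sketch runs un-jitted; inside jitted code the checks would need Chex's runtime-assertion support rather than the plain value assertions used here.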
Both approaches have advantages and disadvantages. The Mock is simpler to understand for someone new to the algorithm. The Fake has the challenge that we need to think of the simplest case that both makes sense for the algorithm and is understandable to someone else, so it may require more explanation.
Summary of suggestions so far:
[1] https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
[2] https://martinfowler.com/articles/practical-test-pyramid.html
[3] https://martinfowler.com/bliki/TestDouble.html
[4] https://en.wikipedia.org/wiki/No_Silver_Bullet
[5] https://github.com/deepmind/chex