Idea: Using machine learning to suggest new inputs #1495

Open · benjaminy opened this issue Aug 12, 2022 · 5 comments
Labels: enhancement, help wanted

@benjaminy

This is just the very beginning of an idea, but your "ideas" page says you're interested in "machine learning something something", so I figured I'd toss it out there. The problem this idea starts from is the general one that coverage-guided fuzzers don't "understand" program logic in any deep way, and so can get stuck in a rut after many iterations.

Every iteration of a fuzzer starts with some input and results in some projection of the program's behavior (coverage, traces, etc.).
The idea is to take all those input-behavior pairs, flip them around, and train an ML model to predict program inputs from a description of the program's behavior.

Given such a model, you can then throw in different program behavior descriptions and see what program input comes out. You could imagine starting with existing program behavior descriptions and tweaking them a little ("like this run, but take that branch"). Or throw in a lot of randomness just to see what happens.

I expect the training process would be quite expensive, so I'm imagining this as a completely separate "background" process that pops up every once in a while to suggest a different input. I expect a model like the one I'm suggesting would be very noisy, but if it managed to "understand" something nontrivial about the structure of the program, it might be useful.
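A minimal sketch of what training such an inverse model could look like, assuming fixed-length inputs, an AFL-style 64 KiB coverage bitmap as the "behavior description", and PyTorch (`InverseModel`, `MAP_SIZE`, and `INPUT_LEN` are hypothetical names for illustration, not AFL++ API):

```python
# Hypothetical sketch: learn to predict program inputs from coverage bitmaps.
# Training pairs (coverage, input) would come from past fuzzer executions.
import torch
import torch.nn as nn

MAP_SIZE = 65536   # AFL-style coverage bitmap size (assumption)
INPUT_LEN = 256    # fixed input length for this toy setup (assumption)

class InverseModel(nn.Module):
    """Maps a coverage bitmap to a distribution over input bytes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MAP_SIZE, 1024), nn.ReLU(),
            nn.Linear(1024, INPUT_LEN * 256),  # 256 logits per byte position
        )

    def forward(self, coverage):
        return self.net(coverage).view(-1, INPUT_LEN, 256)

def train_step(model, optimizer, coverage, inputs):
    """coverage: (B, MAP_SIZE) floats in [0, 1]; inputs: (B, INPUT_LEN) bytes."""
    logits = model(coverage)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 256), inputs.reshape(-1).long())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At suggestion time you would feed in a (possibly tweaked) coverage bitmap and sample bytes from the per-position logits, e.g. `model(bitmap).argmax(dim=-1)` for a greedy decode.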

@vanhauser-thc
Member

That idea is already many years old, and so far - to my knowledge - not a single paper or implementation has managed to create something useful given the ML overhead.
Right, @eqv?

@benjaminy
Author

> That idea is already many years old, and so far - to my knowledge - not a single paper or implementation has managed to create something useful given the ML overhead.
> Right, @eqv?

Thanks for the reply. It's not too surprising that it hasn't been successful. That said, for the applications where one leaves a computer in the closet to fuzz indefinitely, the overhead kind of doesn't matter: once the fuzzer is mostly just spinning its wheels, anything that might find a novel input is useful.

@eqv
Contributor

eqv commented Aug 15, 2022

@vanhauser-thc that is correct (to the best of my knowledge). @benjaminy obviously feel free to experiment with that. There are a couple of approaches that were tried in the past that sound very similar to what you describe, for example https://arxiv.org/pdf/1807.05620.pdf and https://arxiv.org/pdf/1701.07232.pdf. Again, to the best of my knowledge, those approaches weren't easily reproduced by independent groups.

I believe the fundamental problem with most ML-based papers is that when you train on the existing corpus, you are training on all the things that the fuzzer already knows about (and doesn't care about anymore). If you are training on the current corpus, the ML won't be able to learn the unknown unknowns needed to find more coverage. Fundamentally, I believe you'd need something like GPT-3, trained on millions of different file formats, to ensure the model is able to find common patterns in man-made file formats and guess what the not-yet-discovered parts would look like.

@benjaminy
Author

Thanks for the links @eqv

I don't anticipate having the time to experiment any time soon, but 🤞

Training on all the world's programs/data formats sounds promising in principle, but... uh... ambitious.

I agree with your point that a network trained on observed executions of a program will tend to produce more inputs that drive the program along the same paths. Here's a sketch of how I think one might be able to bust out of that (see the code sketch after this list):

  • Use a generative adversarial network (GAN) architecture, where
    • The generator network takes a tuple of an execution profile (coverage, etc.) and a large random vector, and produces a program input
    • The discriminator network takes a tuple of an execution profile and a program input, and outputs its guess whether the pair is "real" or produced by the generator network
    • The generator and the discriminator play the usual GAN game
  • If this works well, you end up with a network that can take an execution profile and produce inputs that are likely to drive the program to that execution profile. Now how do we get it to generate new/interesting inputs?
    • The most obvious thing is to fiddle with the random input. This seems likely to perform badly for exactly the reason @vanhauser-thc points out
    • The next thing to try is to fiddle with the execution profile input. This might do something interesting, but it's hard to predict, since the network hasn't been trained on unseen execution profiles.
    • I think the most promising idea is to gently vandalize the weights of the network itself, then run it with the "best" (most coverage, etc.) execution profiles. This might explore the input space in the wildly high-dimensional neighborhood around known inputs in a useful way.
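A minimal sketch of what this conditional-GAN setup could look like, again assuming fixed-length inputs and an AFL-style coverage bitmap as the execution profile (`Generator`, `Discriminator`, `vandalize_weights`, and `NOISE_DIM` are hypothetical names for illustration, not AFL++ API or any published design):

```python
# Hypothetical sketch of the conditional GAN described above.
import torch
import torch.nn as nn

MAP_SIZE, INPUT_LEN, NOISE_DIM = 65536, 256, 128  # toy sizes (assumptions)

class Generator(nn.Module):
    """(execution profile, large random vector) -> program input (byte logits)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MAP_SIZE + NOISE_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, INPUT_LEN * 256),
        )

    def forward(self, coverage, noise):
        x = torch.cat([coverage, noise], dim=1)
        return self.net(x).view(-1, INPUT_LEN, 256)

class Discriminator(nn.Module):
    """(execution profile, program input) -> guess that the pair is "real"."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MAP_SIZE + INPUT_LEN, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, coverage, program_input):
        x = torch.cat([coverage, program_input], dim=1)
        return self.net(x)

def vandalize_weights(model, scale=0.01):
    """The "gently vandalize" step: perturb every weight with small
    Gaussian noise, then re-sample inputs for the best known profiles."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * scale)
```

The two networks would play the usual GAN game on (profile, input) pairs from past executions; the three exploration knobs in the list above then correspond to resampling `noise`, editing `coverage`, and calling `vandalize_weights(generator)` before sampling.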

@AltayAkkus

I don't think a reasonably sized/reasonably fast ML model could help penetrate deeper into more edges/paths.
Also, AFL already does a pretty good job.
The model would have to accurately predict which value and which manipulation will lead the program to meet its condition, i.e. correctly represent and predict the full program based on return values from previous attempts.

AFL does a good job at this because it has assembly-level instrumentation that gives it feedback.
And it still brute-forces: as seen in the JPEG article above, it does not use a single piece of feedback to synthesize the entire new value that will lead to the new path; it tries things out bit by bit (in most cases).
So calling a model with previous inputs and then synthesizing a full new input that achieves deeper penetration is very, very unlikely to work. What could be done is improving the "brute force", but then you would have to call the model after each execution, which means significant time if your model is big, and a small model won't have the capability to understand such complex structures.

Just my 50 cents, I may be wrong :D

@vanhauser-thc added the enhancement and help wanted labels on Apr 19, 2024