
Add a way to preprocess chunks before attempting to reduce the test case #13

Open
sethfowler opened this issue Apr 22, 2017 · 2 comments

Comments

@sethfowler

It'd be great to be able to provide, in addition to an "interestingness" test, a chunk preprocessing strategy.

Here's what I'm interested in using it for:

What I observe when using lithium is that at large chunk sizes, a chunk often could have been removed (i.e., the file was still interesting without it), except that a poorly placed chunk boundary introduced a syntax error. Since removing large chunks early on drastically reduces runtimes, I'd really like to help those large chunk removals succeed. It seems to me that even a small amount of knowledge about the syntax of the file lithium is processing would go a long way.

What I plan to do as a first attempt is to process the input file with pygments, a Python library for syntax highlighting. The syntax highlighting definitions are mostly implemented using regular expressions, but a state stack is included to support grammars that involve nesting. Essentially, pygments provides simple parsers for a very large number of programming languages.
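For concreteness, here's a minimal sketch of that first step (assuming pygments is installed; testcase.js is just a placeholder):

from pygments.lexers import get_lexer_for_filename

# Pick a lexer based on the input file's name; pygments resolves
# hundreds of languages this way.
with open("testcase.js") as f:
    text = f.read()
lexer = get_lexer_for_filename("testcase.js")

# get_tokens_unprocessed() yields (offset, token_type, value) tuples,
# so every token carries its character offset into the original text.
for offset, token_type, value in lexer.get_tokens_unprocessed(text):
    print(offset, token_type, repr(value))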

Most of the information pygments produces is specific to a particular file format, but what interests me is the stack. I'd expect that if a chunk has exactly the same stack at its beginning and its end (i.e., none of the original stack elements have been popped off, and everything pushed in the interim has been popped off), then the chunk is much more likely to be syntactically self-contained and hence removable. Given this information, we can move chunk boundaries around in a way that should help us remove more large chunks.
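To illustrate the boundary-adjusting idea without pygments itself, here's a sketch that uses a simple brace/bracket counter as a stand-in for the lexer stack (it ignores strings and comments, so it's only an approximation); it shrinks a chunk until its end returns to the nesting depth it started at. All names are hypothetical:

def nesting_depths(lines):
    """Depth after each line, counting {, }, [ and ] as a crude
    stand-in for a real lexer state stack (strings and comments
    are ignored, so this is only an approximation)."""
    depth, depths = 0, []
    for line in lines:
        depth += line.count("{") + line.count("[")
        depth -= line.count("}") + line.count("]")
        depths.append(depth)
    return depths

def snap_chunk_end(lines, start, end):
    """Shrink the half-open chunk [start, end) until the nesting depth
    at its last line matches the depth just before its first line,
    i.e. until the chunk is balanced."""
    depths = nesting_depths(lines)
    base = depths[start - 1] if start > 0 else 0
    while end > start and depths[end - 1] != base:
        end -= 1
    return start, end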

The pygments stuff is speculative at this point, and it may be a bit much to include in upstream lithium (though I'd be happy to fold it in if there's interest). Even so, I think that offering a general way to preprocess the chunks selected for each pass would probably be useful in all sorts of ways.
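One hypothetical shape for such a hook, purely as a sketch (lithium has no such interface today, so the class and parameter names below are invented):

class ChunkPreprocessor:
    """Hypothetical hook: given all lines and a proposed half-open
    chunk [start, end), return an adjusted (start, end) pair, or
    None to leave the chunk as-is."""

    def adjust(self, lines, start, end):
        return None

def remove_chunk(lines, start, end, preprocessor=None):
    """Drop a chunk, letting an optional preprocessor nudge its
    boundaries first (e.g. to avoid splitting a nested construct)."""
    if preprocessor is not None:
        adjusted = preprocessor.adjust(lines, start, end)
        if adjusted is not None:
            start, end = adjusted
    return lines[:start] + lines[end:]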

Does that sound like a feature you'd like to include?

@nth10sd
Contributor

nth10sd commented May 1, 2017

This does sound interesting, although having it as an experiment / branch might be a better way to start off. Note that #11 is a primitive way to parse the syntax by looking for matching closing braces/square brackets.

With your method, we might even be able to strip try ... catch wrappers, reducing:

try {
    <code>
} catch (e) {}

to just:

<code>
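A rough sketch of that rewrite (purely hypothetical, not anything in lithium or #11; it assumes the try and the empty catch each sit on their own line, exactly as above):

import re

TRY_RE = re.compile(r"^\s*try\s*\{\s*$")
CATCH_RE = re.compile(r"^\s*\}\s*catch\s*\([^)]*\)\s*\{\s*\}\s*$")

def unwrap_try_catch(lines):
    """If the chunk is exactly a try { ... } catch (e) {} wrapper,
    return just the body; otherwise return the lines unchanged."""
    if len(lines) >= 2 and TRY_RE.match(lines[0]) and CATCH_RE.match(lines[-1]):
        return lines[1:-1]
    return lines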

@sethfowler
Author

Oh nice! Thanks for pointing out #11; that looks like a big win.

I agree, this should definitely start out as an experiment. I'm planning to hack something together within the next month or two, so I'll come back then and report my initial results. I'm hoping that a big payoff is possible with a relatively small amount of work.
