Lexer speedup #81
Hello, I'm back from my paternity leave. I can finally answer your questions. So, since token namespaces are dropped, a single regular expression is created which combines all token definitions, and is then used to match the data. The token name is used as the identifier of the named capture ((?P<TOKEN_NAME>…)). The fact you iterate over the tokens with a Generator is nice. However, I think it is possible to have a similar algorithm with token namespaces. Instead of combining all token definitions into a single pattern, it should be possible to combine one pattern (of token definitions) per namespace and use the pattern of the current namespace when matching. Anyway, if there is a single namespace, this optimisation is very interesting. I just would like to play with your benchmark a little bit if possible. Are you willing to make a PR? |
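A minimal sketch of that per-namespace idea. Everything below (the pattern table, the token names, the transition table) is invented for illustration and is not Hoa's actual API:

<?php

// One combined pattern per namespace; each token name becomes a named capture.
$patterns = [
    'default' => '~\G(?:(?<word>[a-z]+)|(?<number>\d+)|(?<ws>\s+)|(?<quote>"))~',
    'string'  => '~\G(?:(?<text>[^"]+)|(?<unquote>"))~',
];

// Tokens that switch the current namespace, as Hoa grammars can do.
$transitions = ['quote' => 'string', 'unquote' => 'default'];

$input     = 'abc 42 "hello"';
$namespace = 'default';
$offset    = 0;

while ($offset < strlen($input)
    && preg_match($patterns[$namespace], $input, $m, 0, $offset)) {
    // The first non-empty named capture identifies the matched token.
    foreach ($m as $name => $value) {
        if (is_string($name) && $value !== '') {
            echo "$namespace:$name => '$value'\n";
            $namespace = $transitions[$name] ?? $namespace;
            break;
        }
    }
    $offset += strlen($m[0]);
}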
@Hywan Synthetic tests: https://gist.github.com/SerafimArts/a976562e388d6b8d8863941348720211 It is better to run these tests from different files. |
I thought about it, but I do not have any really working code with such grammar assemblies that use namespaces. It is necessary to do it either as done in … |
Adapting the algorithm to Hoa is a rather large task. I think it will be easier for the author of Hoa to do it. Moreover, the author already understands this algorithm, so everything is pretty obvious to him (I'm just lazy). :D |
👍 |
Hello! I moved all the lexers into a separate component that does not depend on the parser and other pieces, so it can also be used standalone: https://github.com/railt/lexer In addition, I added support for a lexertl driver (https://github.com/BenHanson/lexertl), which speeds things up another 8 times compared with the previous version published above. I can suggest using this component for lexical analysis in cases where grammar tokens do not use namespaces. |
@SerafimArts I don't understand how you connect the PHP code with the C++ library you linked. Care to explain? |
|
It reminds me of http://nikic.github.io/2014/02/18/Fast-request-routing-using-regular-expressions.html#combined-regular-expressions Also, try to make a benchmark with … AFAIU, … |
I'm trying to close some huge projects at work, and then I will get back to this issue, I hope shortly. |
Modifier …? |
yield from $result;
yield Token::eof($offset);

If I understand it right, this is just a preparation for a Generator (from the interface point of view at least), but in fact it doesn't save memory. I ran into this issue myself (I wanted to make it possible to parse a very long stream, for example). Probably, it can be resolved via … |
A few months ago I had to parse a JSON file over 500 GB in size (one big array of complex structured objects with random newlines). Yes, it was an awful idea by the guys, but it could be useful to parse from streams as well. Finally, I found some library for PHP that parses content from streams (probably this one: https://github.com/salsify/jsonstreamingparser, but I'm not sure). Sorry for the offtopic. |
Theoretically, I can try to build a hybrid through

\preg_match('/(?P<TOKEN_A>...)|(?P<TOKEN_B>...)/', ...);

This would significantly reduce the time spent on compiling PCRE patterns in the classic Hoa version and at the same time keep the possibility of lazy evaluation, which is missing from the implementation of … |
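A sketch of that hybrid: a lazy generator over one combined named-group pattern. The token names and the yielded tuple shape are assumptions for illustration, not any library's real API:

<?php

function lex(string $input): \Generator
{
    // One pattern, compiled (and cached by PCRE) once.
    $pattern = '/\G(?:(?P<T_WORD>[a-z]+)|(?P<T_NUMBER>\d+)|(?P<T_WS>\s+))/';
    $offset  = 0;

    while ($offset < strlen($input)) {
        if (!preg_match($pattern, $input, $m, 0, $offset)) {
            throw new \RuntimeException("Unexpected character at offset $offset");
        }
        foreach ($m as $name => $value) {
            if (is_string($name) && $value !== '') {
                // Tokens are produced one at a time, so a huge input is
                // never tokenized up front.
                yield [$name, $value, $offset];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
}

foreach (lex('foo 42 bar') as [$name, $value, $offset]) {
    echo "$name '$value' @ $offset\n";
}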
btw, a few weeks ago I found out about this approach: ~
\G
(?|
(?:[^",\r\n]+)(*MARK:token0)
|
(?:"[^"\\]*(?:\\.[^"\\]*)*")(*MARK:token1)
|
(?:,)(*MARK:token2)
|
(?:\r?\n)(*MARK:token3)
|
[\S\s](*MARK:error)
)
~Jx

You can put arbitrary labels in the (*MARK:…) verb. When you receive a token with a namespace, you can, for example, throw an exception out of … However, as I said, it does not allow one to understand in what order the tokens are located. |
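One way to consume such a pattern incrementally, with a single preg_match per token; note how the [\S\s](*MARK:error) fallback turns unknown characters into an explicit failure. The loop and names below are a sketch, not anyone's actual implementation:

<?php

// Nowdoc keeps the regex free of PHP string escaping.
$pattern = <<<'REGEX'
~\G(?|
    (?:[^",\r\n]+)                 (*MARK:value)
  | (?:"[^"\\]*(?:\\.[^"\\]*)*")   (*MARK:string)
  | (?:,)                          (*MARK:comma)
  | (?:\r?\n)                      (*MARK:newline)
  | [\S\s]                         (*MARK:error)
)~Jx
REGEX;

$input  = "foo,\"bar\",baz\n";
$offset = 0;

while ($offset < strlen($input)
    && preg_match($pattern, $input, $m, 0, $offset)) {
    // PHP exposes the (*MARK) label under the 'MARK' key.
    if ($m['MARK'] === 'error') {
        throw new \RuntimeException("Unexpected '{$m[0]}' at offset $offset");
    }
    echo "{$m['MARK']} => {$m[0]}\n";
    $offset += strlen($m[0]);
}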
P.S. You can also read my article on Habr on this topic, where I give different implementations as examples, but you would have to translate it from Russian: https://habr.com/ru/post/435102/ |
? |
<?php
preg_match_all(
'~\\G
(?|
(?:(.?)([<>=\\^]))(*MARK:T_ALIGN)
|
(?:[-+ ])(*MARK:T_SIGN)
|
(?:\\#)(*MARK:T_HASH)
|
(?:[0-9]+)(*MARK:T_WIDTH)
|
(?:,)(*MARK:T_COMMA)
|
(?:\\.(\\d+))(*MARK:T_PRECISION)
|
(?:[a-zA-Z%\\s:-]+)(*MARK:T_TYPE)
|
(?:\\})(*MARK:T_CL_PL__SWITCH__PLACEHOLDER)
|
(?:\\{([a-zA-Z][a-zA-Z0-9_.-]*)\\})(*MARK:T_NESTED_PL)
)~AJx',
',.42015+f{foo}{bar}',
$matches,
PREG_SET_ORDER
);
var_dump($matches);
|
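Side note on the example above: when a (*MARK) fires, PHP adds a 'MARK' key to the match array, so with PREG_SET_ORDER the token name of the i-th match is $matches[$i]['MARK'], and the sets come back in match order.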
Well, I meant "named groups", not "marks". |
I mean, in order to restart … But more importantly for me, the API of PCRE in PHP is just not really comfortable. So I don't see a winner here; it depends, that's what I'm saying. |
Yes, it is. But in most cases it is possible to manage with one "state"; the PCRE syntax is quite powerful. For example, here are the tokens of the PHP language: https://gist.github.com/SerafimArts/bb9363fbd2d5cca8693cfcdc4631c2f7#file-lexer-php-L24-L178 Or GraphQL: https://github.com/railt/railt/blob/1e43c86734e3a7dc8f34172729246dec1fcdc5ff/resources/grammar/lexemes.pp2 |
How does it handle a case like

"foo" //
<?php
// ...

without context/state ("out-of-code", "code")? Also, the string literals look simplified, i.e. you would have to use another lexer later to find all variables in double-quoted strings, for example: |
class PHPLexer extends Lexer
{
public function lex($sources): iterable
{
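// Emulate a second "state" outside of the regex: tokens are only
// forwarded while we are between T_OPEN_TAG and T_CLOSE_TAG.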
$enable = false;
foreach (parent::lex($sources) as $token) {
switch (true) {
case $token->is('T_OPEN_TAG'):
$enable = true;
break;
case $token->is('T_CLOSE_TAG'):
$enable = false;
break;
case $enable:
yield $token;
break;
}
}
}
}
Yep. In this case, the second state is required. But I wrote "in most cases", not "always" ))) |
I think it doesn't work in the general case, e.g. …
|
https://gist.github.com/SerafimArts/bb9363fbd2d5cca8693cfcdc4631c2f7#file-lexer-php-L25 |
I don't have the ability to test it right now, but I suppose this lexer will treat the first line as a comment, while in fact it is plain text. Your point was … But it seems like the example is not that convenient. |
Oh, yes, I did not understand what you meant the first time. That's right, without two states it's impossible to parse this =) |
Could someone summarize this? I would like to know the following: …

Maybe I will come back later and try to answer these questions if they have not been answered yet. |
You need to get rid of the unnecessary preg_match calls. To do this, you can use the approach with named groups. |
I implemented a simple multistate implementation based on the approach described above. Benchmarks: https://travis-ci.org/SerafimArts/lexer-benchmarks/builds |
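A rough sketch of what such a multistate driver can look like: one combined (*MARK) pattern per state, plus a table saying which token moves the lexer into which state. The state names, tokens, and class API are assumptions for illustration, not railt/lexer's real interface:

<?php

final class MultistateLexer
{
    /** @var array<string, string> state => compiled pattern */
    private array $states;

    /** @var array<string, string> token => next state */
    private array $transitions;

    public function __construct(array $states, array $transitions)
    {
        $this->states      = $states;
        $this->transitions = $transitions;
    }

    public function lex(string $input): \Generator
    {
        $state  = array_key_first($this->states);
        $offset = 0;

        while ($offset < strlen($input)
            && preg_match($this->states[$state], $input, $m, 0, $offset)) {
            $token = $m['MARK'];
            yield [$state, $token, $m[0]];

            // Some tokens switch the active state (and thus the pattern).
            $state   = $this->transitions[$token] ?? $state;
            $offset += strlen($m[0]);
        }
    }
}

$lexer = new MultistateLexer(
    [
        'text' => '~\G(?|(?:<\?php)(*MARK:T_OPEN_TAG)|(?:[\S\s]+?(?=<\?php)|[\S\s]+)(*MARK:T_TEXT))~',
        'code' => '~\G(?|(?:\?>)(*MARK:T_CLOSE_TAG)|(?:[\S\s]+?(?=\?>)|[\S\s]+)(*MARK:T_CODE))~',
    ],
    ['T_OPEN_TAG' => 'code', 'T_CLOSE_TAG' => 'text'],
);

foreach ($lexer->lex('hi <?php echo 1; ?> bye') as [$state, $token, $value]) {
    echo "$state/$token: '$value'\n";
}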
That's great, @SerafimArts, congratulations on your good results :D |
You can achieve better performance with code generation and …:

/**
* @ParamProviders({"samples"})
*/
public function benchGenerated(array $params): void
{
$tokens = GeneratedLexer::lex(\file_get_contents($params[0]));
foreach ($tokens as $token) {
//
}
}
https://gist.github.com/unkind/87995556f5d8661aa09f5cf82478ba5e |
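If "code generation" here means dumping the fully compiled lexer into a plain PHP file that OPcache can keep hot, a sketch could look like this; the file name and the shape of the compiled data are guesses, not the gist's actual code:

<?php

// Compile once (e.g. at build time), dump the result as plain PHP.
$compiled = [
    'pattern' => '~\G(?|(?:\d+)(*MARK:T_NUMBER)|(?:\s+)(*MARK:T_WS))~',
];

file_put_contents(
    __DIR__ . '/GeneratedLexerData.php',
    '<?php return ' . var_export($compiled, true) . ';'
);

// At runtime there is no grammar parsing or pattern assembly left to do.
$data = require __DIR__ . '/GeneratedLexerData.php';

preg_match($data['pattern'], '42', $m);
echo $m['MARK']; // T_NUMBER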
@unkind I don't understand how it works, but I didn't find anything about the (*MARK) verb in the documentation … |
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC27 (Cmd+F «*MARK»). By the way, another optimization is … |
I rewrote the lexer using the approach described above. Comparison of the old preg_replace algorithm (left) and the new preg_match one (right): (benchmark charts omitted) All sample files are available here: https://github.com/railt/lexer/tree/32366b7d9dddabd05cf294d47c3e03a42025b335/tests/resources Benchmarks: https://travis-ci.org/SerafimArts/lexer-benchmarks/jobs/508295208 |
Unfortunately, I don't have examples right now. This issue may be overrated, but it may break BC for current library users. |
@unkind in such places the parser will crash, since it generates a trace that weighs several orders of magnitude more than the token sequence. P.S. this trace contains the sequence of correctly parsed rules, which is then converted into an AST: https://github.com/hoaproject/Compiler/blob/master/Llk/Parser.php#L503 I tried to implement it in a lazier way and generate AST parts on the fly, but so far these experiments have not led to success. |
@SerafimArts … |
@flip111 All grammars are there: https://github.com/railt/lexer/tree/32366b7d9dddabd05cf294d47c3e03a42025b335/tests/resources/pp2 =) Except GraphQL, which is here: https://github.com/railt/graphql/tree/b90978e070230f461f10a5870d2736997064b6e2/resources/graphql because it is split across several files. |
If you remove support for token namespaces, the Lexer can be sped up significantly. This simply requires completely rewriting the algorithm.
Benchmarks:

Test stand: …

Hoa Original
Sources: Original Lexer
AVG: 29.0824s (408 token/s)

Completely rewritten (Hoa-like)
Sources: Rewritten Lexer
AVG: 30.9218s (383.7 token/s)

Fast Lexer (without namespaces support)
Sources: FastLexer
AVG: 0.2091s (56752.7 token/s)
Could it make sense to adapt it for Hoa, and, in cases where the user's grammar contains only the default namespace, to use the FastLexer implementation?