detect invalid regex in lexer #15

CircleCode · 2013-12-02T09:05:12Z

in Lexer.php, preg_match could return false in case of regex error.

the false return should be tested and explained at https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L270

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

The text was updated successfully, but these errors were encountered:

Hywan · 2013-12-02T09:29:52Z

Hello :-),

This will change with this PR #6.

Hywan · 2013-12-02T09:44:23Z

And additionnaly, this can be done separately in this block.

CircleCode · 2013-12-02T10:02:57Z

PR #6 still has this issue

on another side, I like the idea of detecting wrong regexes during pp parsing, like you suggest it, but I think having to test each regex at this stage has an overhead, and preg_match could still return false at line 270 (even with valid regex - for example when offset > length(string)… this should never happen, right…)
moreover, just testing false === $preg does not seem a huge overhead to me, but the error would be reported late in the process, which is sad

jubianchi · 2016-03-11T17:03:08Z

I'm adding the required check and benchmarking it.

Here is the script I use to bench:

<?php

require_once __DIR__ . '/vendor/autoload.php';

passthru(PHP_BINARY . ' -n -v');
$contents = <<<PP
#rule:
    choice()

choice:
    concatenation() ( ::or:: concatenation() #choice )*

concatenation:
    repetition() ( repetition() #concatenation )*

repetition:
    simple() ( quantifier() #repetition )? <node>?

simple:
    ::capturing_:: choice() ::_capturing::
  | ::skipped:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::skipped:: #skipped
  | ::kept_:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::_kept:: #kept
  | <token> ::named::
    ( ::unification_:: <unification> ::_unification:: )? #named

quantifier:
    <zero_or_one>
  | <one_or_more>
  | <zero_or_more>
  | <n_to_m>
  | <n_or_more>
  | <exactly_n>
PP;


$bench = new Hoa\Bench\Bench();
$bench->one_iteration->start();
try {
    $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
    $parser->parse($contents);
} catch (\exception $e) {}
$bench->one_iteration->stop();

$bench->hundred_iterations->start();

for ($i = 100; $i > 0; $i--) {
    try {
        $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
        $parser->parse($contents);
    } catch (\exception $e) {}
}

$bench->hundred_iterations->stop();

echo $bench;

Executing it with php -n test.phpproduces the following results:

No PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   696ms, 100.0%
one_iteration       |                                               18ms,   2.6%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    678ms,  97.4%

With PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   954ms, 100.0%
one_iteration       |                                               19ms,   2.0%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    934ms,  97.9%

This was implemented using $preg = @preg_match(...); if (false === $preg) { throw ... } and a custom error handler.

The check adds up to 30% computation time: this is too costly. I think this should only be enabled when running in a debug mode (see #14)

Hywan · 2016-03-15T07:52:10Z

Maybe we can check tokens once when “instanciating” the lexer. Is it where your placed it?

Hywan · 2016-08-15T09:22:26Z

ping @CircleCode?

flip111 · 2019-01-25T15:54:46Z

The regexes come from the tokens

Compiler/Llk/Lexer.php

Lines 192 to 201 in fd6f3f9

    
           $tokenArray = &$this->_tokens[$this->_lexerState]; 
        
           foreach ($tokenArray as $lexeme => $bucket) { 
        
               list($regex, $nextState) = $bucket; 
        
               if (null === $nextState) { 
        
                   $nextState = $this->_lexerState; 
        
               } 
        
               $out = $this->matchLexeme($lexeme, $regex, $offset);

These are supplied in

Compiler/Llk/Lexer.php

Line 111 in fd6f3f9

public function lexMe($text, array $tokens)

The interesting places where lexMe is called is as follows:

Compiler/Llk/Rule/Analyzer.php

Line 149 in ff934cc

$this->_lexer = new Iterator\Lookahead($lexer->lexMe($value, static::$_ppLexemes));
Compiler/Llk/Parser.php

Line 163 in ff934cc

$lexer->lexMe($text, $this->_tokens),

The first code calls lexMe in a loop with the same tokens each time. Possibly this is an opportunity for optimizing this code.

Anyway i both cases it seems fine to place the check in lexMe and throw an exception.

As for how to check if the regex is valid i created this feature request which describes it in more detail https://bugs.php.net/bug.php?id=77521

flip111 · 2019-01-25T17:04:13Z

<?php
// returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred
// The @ is used (unfortunately) to stop the warnings in your error_log
// The ! revedrses the condition, so the real one will be true, the bad one false

function isRegex($string)
{
    return !(@preg_match($string, null) === false);
}

var_dump(isRegex('/\w+/'));
var_dump(isRegex('thisis|64*%notRe3!ex/'));

Hywan added ready difficulty: casual enhancement and removed ready labels Aug 6, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

detect invalid regex in lexer #15

detect invalid regex in lexer #15

CircleCode commented Dec 2, 2013 •

edited by Hywan

Loading

Hywan commented Dec 2, 2013

Hywan commented Dec 2, 2013

CircleCode commented Dec 2, 2013

jubianchi commented Mar 11, 2016

Hywan commented Mar 15, 2016

Hywan commented Aug 15, 2016

flip111 commented Jan 25, 2019

flip111 commented Jan 25, 2019

detect invalid regex in lexer #15

detect invalid regex in lexer #15

Comments

CircleCode commented Dec 2, 2013 • edited by Hywan Loading

Hywan commented Dec 2, 2013

Hywan commented Dec 2, 2013

CircleCode commented Dec 2, 2013

jubianchi commented Mar 11, 2016

No PCRE check

With PCRE check

Hywan commented Mar 15, 2016

Hywan commented Aug 15, 2016

flip111 commented Jan 25, 2019

flip111 commented Jan 25, 2019

CircleCode commented Dec 2, 2013 •

edited by Hywan

Loading