Skip to content
This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

detect invalid regex in lexer #15

Open
CircleCode opened this issue Dec 2, 2013 · 8 comments
Open

detect invalid regex in lexer #15

CircleCode opened this issue Dec 2, 2013 · 8 comments

Comments

@CircleCode
Copy link
Member

CircleCode commented Dec 2, 2013

in Lexer.php, preg_match could return false in case of regex error.

the false return should be tested and explained at https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L270


Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@Hywan
Copy link
Member

Hywan commented Dec 2, 2013

Hello :-),

This will change with this PR #6.

@Hywan
Copy link
Member

Hywan commented Dec 2, 2013

And additionnaly, this can be done separately in this block.

@CircleCode
Copy link
Member Author

PR #6 still has this issue

on another side, I like the idea of detecting wrong regexes during pp parsing, like you suggest it, but I think having to test each regex at this stage has an overhead, and preg_match could still return false at line 270 (even with valid regex - for example when offset > length(string)… this should never happen, right…)
moreover, just testing false === $preg does not seem a huge overhead to me, but the error would be reported late in the process, which is sad

@jubianchi
Copy link
Member

I'm adding the required check and benchmarking it.

Here is the script I use to bench:

<?php

require_once __DIR__ . '/vendor/autoload.php';

passthru(PHP_BINARY . ' -n -v');
$contents = <<<PP
#rule:
    choice()

choice:
    concatenation() ( ::or:: concatenation() #choice )*

concatenation:
    repetition() ( repetition() #concatenation )*

repetition:
    simple() ( quantifier() #repetition )? <node>?

simple:
    ::capturing_:: choice() ::_capturing::
  | ::skipped:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::skipped:: #skipped
  | ::kept_:: <token> ( ::unification_:: <unification> ::_unification:: )?
    ::_kept:: #kept
  | <token> ::named::
    ( ::unification_:: <unification> ::_unification:: )? #named

quantifier:
    <zero_or_one>
  | <one_or_more>
  | <zero_or_more>
  | <n_to_m>
  | <n_or_more>
  | <exactly_n>
PP;


$bench = new Hoa\Bench\Bench();
$bench->one_iteration->start();
try {
    $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
    $parser->parse($contents);
} catch (\exception $e) {}
$bench->one_iteration->stop();

$bench->hundred_iterations->start();

for ($i = 100; $i > 0; $i--) {
    try {
        $parser = \Hoa\Compiler\Llk\Llk::load(new \Hoa\File\Read(__DIR__ . '/Llk/Llk.pp'));
        $parser->parse($contents);
    } catch (\exception $e) {}
}

$bench->hundred_iterations->stop();

echo $bench;

Executing it with php -n test.phpproduces the following results:

No PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   696ms, 100.0%
one_iteration       |                                               18ms,   2.6%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    678ms,  97.4%

With PCRE check

PHP 7.0.3 (cli) (built: Feb 17 2016 22:30:41) ( ZTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
__global__          ||||||||||||||||||||||||||||||||||||||||||||   954ms, 100.0%
one_iteration       |                                               19ms,   2.0%
hundred_iterations  |||||||||||||||||||||||||||||||||||||||||||    934ms,  97.9%

This was implemented using $preg = @preg_match(...); if (false === $preg) { throw ... } and a custom error handler.

The check adds up to 30% computation time: this is too costly. I think this should only be enabled when running in a debug mode (see #14)

@Hywan
Copy link
Member

Hywan commented Mar 15, 2016

Maybe we can check tokens once when “instanciating” the lexer. Is it where your placed it?

@Hywan
Copy link
Member

Hywan commented Aug 15, 2016

ping @CircleCode?

@flip111
Copy link

flip111 commented Jan 25, 2019

The regexes come from the tokens

Compiler/Llk/Lexer.php

Lines 192 to 201 in fd6f3f9

$tokenArray = &$this->_tokens[$this->_lexerState];
foreach ($tokenArray as $lexeme => $bucket) {
list($regex, $nextState) = $bucket;
if (null === $nextState) {
$nextState = $this->_lexerState;
}
$out = $this->matchLexeme($lexeme, $regex, $offset);

These are supplied in

public function lexMe($text, array $tokens)

The interesting places where lexMe is called is as follows:

The first code calls lexMe in a loop with the same tokens each time. Possibly this is an opportunity for optimizing this code.

Anyway i both cases it seems fine to place the check in lexMe and throw an exception.

As for how to check if the regex is valid i created this feature request which describes it in more detail https://bugs.php.net/bug.php?id=77521

@flip111
Copy link

flip111 commented Jan 25, 2019

<?php
// returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred
// The @ is used (unfortunately) to stop the warnings in your error_log
// The ! revedrses the condition, so the real one will be true, the bad one false

function isRegex($string)
{
    return !(@preg_match($string, null) === false);
}

var_dump(isRegex('/\w+/'));
var_dump(isRegex('thisis|64*%notRe3!ex/'));

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

No branches or pull requests

4 participants