re2-java

re2 for Java

Warning: Only 64-bit Linux is supported for now. It should be easy to add support for other platforms.

Licence

Like the RE2 library itself, this library can be distributed and used under the terms of the BSD 3-Clause License.

Installation

Requirements

Java 7 (JDK 1.7; never tested on Java 8). Set the environment variable JAVA_HOME to point to the root directory of the JDK.
Maven 3.x (http://maven.apache.org/).

Check that the mvn command can be run from your command line.

gcc 4.5.x or higher.
Boost C++ Libraries (http://www.boost.org/); any version newer than Stonehenge should be enough.
wget

Compilation

Simply type:

$ make

It downloads the latest stable revision of re2, builds the re2 library in a separate directory, and builds another library with the JNI bindings. Finally, a jar file that includes the .so library files is produced in the target folder.

You can type:

$ make clean

to remove all files created during a normal run of make.

After a successful compilation you can run:

$ mvn test

Note that the tests consume a lot of time and memory and, at present, print a lot of debug messages. Sorry if that is annoying; this binding is still under development.

Generating header files - example

From the root folder containing pom.xml, src/, etc.:

javah -jni -classpath "/home/<user>/repos/re2-java/src/main/java" -o src/main/java/com/logentries/re2/RE2.h com.logentries.re2.RE2

Installation

After running make, the target directory contains a jar file with the library. You can add it to your classpath. The native library files (libre2.so and libre2-java.so) are part of the jar file as well. They are extracted after JVM startup, saved into temporary files and dynamically loaded into the address space of the JVM.
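
The extract-and-load step described above can be sketched roughly as follows. This is only an illustration of the general technique using the plain JDK, not the loader that ships with this library; the class name NativeLoaderSketch, its method names and the resource path you would pass in are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of re2-java: shows the general technique only.
final class NativeLoaderSketch {

    // Copies a stream (e.g. a .so resource inside the jar) into a temporary file.
    static Path copyToTempFile(InputStream in, String prefix, String suffix) throws IOException {
        Path tmp = Files.createTempFile(prefix, suffix);
        tmp.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Extracts a classpath resource and loads it into the JVM as a native library.
    // The resource path is an assumption; the real jar layout may differ.
    static void loadFromJar(String resourcePath) throws IOException {
        try (InputStream in = NativeLoaderSketch.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourcePath);
            }
            Path lib = copyToTempFile(in, "native-", ".so");
            System.load(lib.toAbsolutePath().toString());
        }
    }
}
```

System.load needs an absolute file system path, which is why the library has to be copied out of the jar first.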

Changelog

v1.3

added option UNICODE_WORD to transparently handle unicode words and word boundaries. Take a look at the section below.

v1.2

added RE2.compile static method, similar to Pattern.compile. The main difference from the RE2 constructor is that the compile method doesn't throw a checked exception, so you can avoid the try/catch block.
support for RE2String that can be reused with multiple patterns, in order to avoid multiple copies of the same string.
generalization of RE2.matcher that now accepts CharSequence rather than String

v1.1

support for RE2Matcher

Usage

To use the library, import com.logentries.re2.RE2 and com.logentries.re2.Options.

Basic usage of re2-java is quite similar to the C++ library.

The static functions RE2.fullMatch(.) and RE2.partialMatch(.) can be used.

You can create a precompiled RE in this way:

RE2 re = new RE2("\\d+");

Because the object allocates memory that is not under the control of the JVM, it should be freed explicitly. You can use either the member function dispoze() or the member function close(). Class RE2 also overrides the finalize() method, which is called automatically before the object is destroyed by the garbage collector. This method ensures that the additional memory is freed, releasing it itself if necessary. But it is usually a bad idea to rely on the Java GC. :-)

Any attempt to use the object after calling dispoze() or close() will throw an IllegalStateException.
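
The lifecycle described above (explicit release, then IllegalStateException on any later use) can be sketched like this. NativeResourceSketch is a hypothetical stand-in written for this README, not the real RE2 class, and the value returned by use() is just a placeholder for a call into native code.

```java
// Hypothetical stand-in for an object that owns native memory; not the real RE2 class.
final class NativeResourceSketch implements AutoCloseable {
    private boolean closed = false;

    // Every operation first verifies that the resource has not been released.
    int use() {
        checkState();
        return 42; // placeholder for a call into native code
    }

    private void checkState() {
        if (closed) {
            throw new IllegalStateException("Resource has already been closed");
        }
    }

    // Idempotent: calling close() a second time is harmless.
    @Override
    public void close() {
        if (!closed) {
            closed = true; // a real object would free its native memory here
        }
    }
}
```

Because the sketch implements AutoCloseable, it can be used in a try-with-resources statement, which releases it even when an exception is thrown inside the block.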

A precompiled RE supports the member functions partialMatch(.) and fullMatch(.).

re.fullMatch("2569");
re.partialMatch("xxx=2569");

The RE2 constructor declares a checked exception that is thrown if the regex is malformed. This is quite annoying if the regex is a static variable instantiated at startup. In that case you can use the static method RE2.compile, which wraps the checked exception in an unchecked IllegalArgumentException.

public class MyClass {
    private static RE2 regex = RE2.compile("...");
}
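
For readers curious how such a wrapper works, here is a minimal sketch of the pattern. It uses java.util.regex as a stand-in for the native engine; CheckedRegexSketch and RegexSyntaxException are hypothetical names invented for this example and do not exist in the library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical checked exception; the name is illustrative only.
class RegexSyntaxException extends Exception {
    RegexSyntaxException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical class; java.util.regex stands in for the native engine.
final class CheckedRegexSketch {
    private final Pattern pattern;

    // The constructor declares a checked exception, like the RE2 constructor described above.
    CheckedRegexSketch(String regex) throws RegexSyntaxException {
        try {
            this.pattern = Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            throw new RegexSyntaxException("Malformed regex: " + regex, e);
        }
    }

    // The static factory converts the checked exception into IllegalArgumentException,
    // so it can be used in a static field initializer without try/catch.
    static CheckedRegexSketch compile(String regex) {
        try {
            return new CheckedRegexSketch(regex);
        } catch (RegexSyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    boolean fullMatch(String input) {
        return pattern.matcher(input).matches();
    }
}
```

The trade-off is that a malformed static regex now fails at class initialization time instead of being caught by the compiler's checked-exception rules.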

Matcher

The RE2 object also supports a more Java-like interface, similar to java.util.regex.Pattern and java.util.regex.Matcher.

RE2 re = new RE2("..(..)");
RE2Matcher matcher = re.matcher("my input string");
if (matcher.find()) {
  // get matching string(s),
  // see java.util.regex.Matcher javadoc or
  // com.logentries.re2.RE2Matcher code for additional details
  // eg. matcher.group(<n>) or matcher.start(<n>) and matcher.end(<n>)
  ...
}

You can also iterate over the input string, searching for repeated occurrences of the pattern:

RE2 re = new RE2("bla?");
RE2Matcher matcher = re.matcher("my bla input string bl bla");
while (matcher.findNext()) {
  // 3 iterations, get positions using matcher.start() and matcher.end()
}

RE2Matcher also implements java.util.Iterable<java.util.regex.MatchResult>. It can be used this way:

int c = 0;
for (MatchResult mr : new RE2("t").matcher("input text")) {
    // play with matches using mr.start, mr.end, mr.group
    c++;
}
assertEquals(3, c);

This can be very useful when playing with this library in Scala:

import scala.collection.JavaConversions._
import com.logentries.re2._

new RE2("abc?") matcher "abc and abc ab ab" map( _.group ) foreach println
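
To make the Iterable idea more concrete, here is a rough sketch of how a matcher can be exposed as an Iterable of MatchResult. It is built on java.util.regex rather than on RE2, and IterableMatchesSketch is a hypothetical class written only for this illustration; the real RE2Matcher may be implemented quite differently.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical adapter built on java.util.regex; it only illustrates the Iterable idea.
final class IterableMatchesSketch implements Iterable<MatchResult> {
    private final Pattern pattern;
    private final CharSequence input;

    IterableMatchesSketch(String regex, CharSequence input) {
        this.pattern = Pattern.compile(regex);
        this.input = input;
    }

    @Override
    public Iterator<MatchResult> iterator() {
        final Matcher m = pattern.matcher(input);
        return new Iterator<MatchResult>() {
            private boolean advanced = false; // true once find() has run for the pending item
            private boolean found = false;

            @Override
            public boolean hasNext() {
                if (!advanced) {
                    found = m.find();
                    advanced = true;
                }
                return found;
            }

            @Override
            public MatchResult next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                advanced = false;
                return m.toMatchResult(); // immutable snapshot of the current match
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

Each call to iterator() starts a fresh scan of the input, so the same object can be used in more than one for-each loop.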

If you are not interested in fetching group offsets, you can disable this feature by using:

RE2Matcher m = new RE2("ab(c?)").matcher("abc and abc ab ab", false);
assertEquals(1, m.GroupCount());
// now m contains information only for group 0
// so m.start(), m.end() and m.group()
// trying m.{start|end|group}(n : n > 0) always fails

If your regex is very complex (most likely programmatically composed by concatenating different patterns) and the number of groups is huge, this can improve performance significantly (data structures to contain all possible matches are not allocated).

NOTE 1: An RE2Matcher object holds a pointer to a char buffer that is used on the C++ side to manage the current string, in order to avoid a copy on each iteration. For this reason, RE2Matcher implements the AutoCloseable interface, so it can be used in a try-with-resources statement. The close method is also called in finalize(), so the garbage collector will free the memory sooner or later. This is the same pattern used for the RE2 object, but RE2 regexes are usually compiled once and then used many times, while RE2Matcher objects are typically short-lived and you will most likely want to release them as soon as you are done with them. In that case, you can use a try-with-resources block to make sure you don't miss anything:

try (RE2Matcher matcher = re.matcher("my bla input string bl bla")) {
  matcher. ....
}

NOTE 2: RE2Matcher is not thread-safe, just like java.util.regex.Matcher

Re-using strings

Whenever an RE2Matcher is created, the content of the string is copied to make it accessible from the C++ side. If you have to search for several patterns in the same string, this can hurt performance, because you are copying the same string multiple times.

For this reason, since version v1.2 there is a new object, RE2String, which is a wrapper for a CharSequence. You can create an instance of this object in advance and then create an RE2Matcher from your RE2String. The same RE2String can be reused multiple times to create matchers for different patterns. When an RE2Matcher is created from an RE2String, it doesn't copy the string, and closing it (see above about the AutoCloseable interface) simply does nothing. Similarly, RE2String implements the AutoCloseable interface, and its finalize method has been overridden to let the GC clean up resources for you.

RE2 regex1 = RE2.compile("\\b[\\d]{5}\\b");
RE2 regex2 = RE2.compile("\\b[a-zA-Z]{5}\\b");

String input = ....
RE2String rstring = new RE2String(input);

RE2Matcher m1 = regex1.matcher(rstring);
RE2Matcher m2 = regex2.matcher(rstring);
while(m1.find()) {
    int endFirst = m1.end();
    if (m2.find(endFirst, endFirst + 10)) {
        ...
    }
}

// here m1.close() and m2.close() do nothing

Submatch extraction

Both static and member match functions support additional parameters in which submatches will be stored. Java does not support passing arguments by reference, so we use arrays to store submatches:

int[] x = new int[1];
long[] y = new long[1];
RE2.fullMatch("121:2569856321142", "(\\d+):(\\d+)", x, y);
// x[0] == 121, y[0] == 2569856321142

Arrays of length greater than 1 can be used as well. Such an array stores as many consecutive submatches as its length:

int[] x = new int[2];
String[] s = new String[1];
long[] y = new long[3];
RE2 re = new RE2("(\\d+):(\\d+)-([a-zA-Z]+)-(\\d+):(\\d+):(\\d+)");
re.fullMatch("225:3-xxx-2:2555422298777:7", x, s, y);
// x[0] == 225, x[1] == 3, s[0] == xxx, y[0] == 2, y[1] == 2555422298777, y[2] == 7

So far, only int[], long[], float[], double[] and String[] are supported. Adding other types should be quite easy.
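
The array convention above can be imitated in plain Java as a rough sketch. SubmatchSketch is a hypothetical helper built on java.util.regex, written only to show how typed output arrays can be filled from consecutive capture groups; it is not how the library does it internally, and it handles only int[], long[] and String[].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on java.util.regex; it mimics the array-based
// out-parameter convention and is not the library's implementation.
final class SubmatchSketch {

    // Fills each output array with consecutive capture groups, converting by array type.
    // Caveat: a lone String[] argument would be taken as the varargs array itself,
    // so it would need a cast to Object.
    static boolean fullMatch(String input, String regex, Object... outs) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return false;
        }
        int group = 1; // group 0 is the whole match
        for (Object out : outs) {
            if (out instanceof int[]) {
                int[] a = (int[]) out;
                for (int i = 0; i < a.length; i++) {
                    a[i] = Integer.parseInt(m.group(group++));
                }
            } else if (out instanceof long[]) {
                long[] a = (long[]) out;
                for (int i = 0; i < a.length; i++) {
                    a[i] = Long.parseLong(m.group(group++));
                }
            } else if (out instanceof String[]) {
                String[] a = (String[]) out;
                for (int i = 0; i < a.length; i++) {
                    a[i] = m.group(group++);
                }
            } else {
                throw new IllegalArgumentException("Unsupported output type: " + out.getClass());
            }
        }
        return true;
    }
}
```

If the arrays ask for more submatches than the pattern has groups, m.group throws IndexOutOfBoundsException in this sketch.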

Little comment about the interface and passing by reference

I know that a lot of Java programmers may complain that an interface based on passing parameters by reference through the array trick is bad practice, a dirty trick, and that it introduces something that is in fact not present in Java.

But after trying it in real code, I decided that it is the best way to pass the values of submatches. ~~If you have any idea how to implement it in a different way, please let me know.~~ See the Matcher interface above.

Named capture group extraction

Capture group entities have a sub-string and a reference to the beginning and end index that this string corresponds to in a matched event. Named capture group entities wrap this and include a name.

getCaptureGroups() and getCaptureGroupNames() are two methods that are called by getNamedCaptureGroups() to create a list of NamedGroup entities. The lists returned by these methods are in order, which allows getNamedCaptureGroups() to associate them. If the lengths of the returned lists differ, we assume that the association cannot be maintained and return an empty list.

getCaptureGroupNames() uses the native RE2 method; getCaptureGroups() uses the contributor code to get RE2Matcher objects.
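
The association rule described above (pair by position, give up on a length mismatch) can be sketched as follows. NamedGroupSketch and NamedGroupZipper are hypothetical types with made-up field names, and representing each captured span as an int[] of start and end is an assumption made for brevity; the library's own classes may look different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value type; field names are illustrative, not the library's.
final class NamedGroupSketch {
    final String name;
    final String text;
    final int start;
    final int end;

    NamedGroupSketch(String name, String text, int start, int end) {
        this.name = name;
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

final class NamedGroupZipper {

    // Pairs names with captured spans by position.
    // Each span is {start, end}; an empty list is returned when the lengths differ.
    static List<NamedGroupSketch> associate(List<String> names, List<int[]> spans, CharSequence input) {
        if (names.size() != spans.size()) {
            return Collections.emptyList();
        }
        List<NamedGroupSketch> result = new ArrayList<NamedGroupSketch>(names.size());
        for (int i = 0; i < names.size(); i++) {
            int start = spans.get(i)[0];
            int end = spans.get(i)[1];
            String text = input.subSequence(start, end).toString();
            result.add(new NamedGroupSketch(names.get(i), text, start, end));
        }
        return result;
    }
}
```

Returning an empty list rather than a partial one avoids attaching a name to the wrong substring when the two sources disagree.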

Options

The com.logentries.re2.Options object encapsulates the configuration used when creating an RE2 object. It is more or less equivalent to RE2::Options from the C++ interface. It can be passed as the second argument to the RE2 constructor.

It uses several setter methods to set the configuration values:

Options opt = new Options();
opt.setNeverNl(true);
opt.setWordBoundary(false);

or equivalently:

Options opt = new Options().setNeverNl(true).setWordBoundary(false);

The RE2 constructor is now overloaded to support an explicit flag list, mimicking the C++ style:

    RE2 regex = new RE2("TGIF?",
        Options.CASE_INSENSITIVE,
        Options.ENCODING(Encoding.UTF8),
        Options.PERL_CLASSES(false)
    );

See the Options static fields for further details.

Unicode words and word boundaries

RE2 natively supports Unicode words (with special character classes like \pN and \pL) but it does not support Unicode word boundaries (\b), because \b is implemented with a single-byte lookahead. Even with a multi-byte lookahead, the resulting automata would be huge and more difficult to handle.

Since version 1.3, we transparently handle some of these cases, namely those we think are the most meaningful in real-world scenarios. This behavior is enabled through the option UNICODE_WORD and works by replacing word and boundary classes with named groups that are then removed from the output. While there is no problem replacing \w and \W, there are some problems with \b. The replacement makes sense in cases where \b is used "properly", i.e. without repetitions or non-word characters close to it. In the remaining "extreme" cases, the replacement actually changes the semantics of the regular expression, so we may produce false matches or miss some matches (with respect to the original expression).

Anyway, this patch works well in the majority of "normal" uses of \b, that is, near word characters. For example, without the option, the regex \bball\b would match "ball" in the text "Fußball" (because "ß" is a non-word character in ASCII). With the option UNICODE_WORD we can transparently avoid this match.
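
The difference between the two notions of a word boundary in the "Fußball" example can be reproduced with a small sketch. Note that it uses java.util.regex lookarounds, which RE2 does not support, so this is purely an illustration of the semantics and not of how UNICODE_WORD is implemented; WordBoundarySketch and its members are hypothetical names.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, not RE2: lookarounds emulate the two
// definitions of a word boundary so that the difference becomes visible.
final class WordBoundarySketch {

    // ASCII view: only [A-Za-z0-9_] count as word characters.
    static final Pattern ASCII_BALL =
        Pattern.compile("(?<![A-Za-z0-9_])ball(?![A-Za-z0-9_])");

    // Unicode view: any letter or number counts as a word character.
    static final Pattern UNICODE_BALL =
        Pattern.compile("(?<![\\p{L}\\p{N}_])ball(?![\\p{L}\\p{N}_])");

    static boolean asciiFinds(String text) {
        return ASCII_BALL.matcher(text).find();
    }

    static boolean unicodeFinds(String text) {
        return UNICODE_BALL.matcher(text).find();
    }
}
```

With the input "Fu\u00DFball" (that is, "Fußball"), the ASCII variant finds "ball" because "ß" is not in its word class, while the Unicode variant finds nothing because "ß" is a letter. For "a ball here" both variants find a match.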

SpazioDati/re2-java
