Parsinator

Parsinator turns structured and unstructured text into a header-detail representation. You could use Parsinator to create an XML file from a PDF file, build an object from a printer spool file, or parse relevant data from any text.

Why?

One day, you are asked to write a piece of software that extracts relevant information from a text-based PDF file so your main software can process it later. You have to finish it by the end of the week, and it has to support PDF files with different layouts. By the way, you're behind schedule. Et voilà! Here you have Parsinator. You can read more about the motivation behind Parsinator in A Tale of a Pdf Parser.

Getting Started

Usage

Parsinator uses three types of entities:

Skipper: It removes chunks of text from the input before parsing. For example, legal notices, headers and footers from an invoice.
Parser: It captures text based on a pattern. For example, it captures the first match of a regular expression in the input text or at a given line.
Transformation: It reduces lines spanning multiple pages into a single stream of text.

Remember you can roll your own entities!

Skippers

Skippers are applied to every page of the input text. These are Parsinator's skippers:

SkipBeforeRegexAndAfterRegex
SkipBlankLines
SkipFromFirstMatchOfRegex
SkipFromFirstRegexToLastRegex
SkipIfDoesNotMatch
SkipIfMatches
SkipLineCountFromEnd
SkipLineCountFromLineNumber
SkipLineCountFromStart
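
To make the idea concrete, here is a minimal Python sketch of what applying skippers to every page could look like. It is not Parsinator's API: the function names only echo some of the skipper names above, and their behavior is an assumption inferred from those names. A page is assumed to be a plain list of lines.

```python
import re
from typing import Callable, List

Page = List[str]                   # assumption: a page is a list of lines
Skipper = Callable[[Page], Page]   # hypothetical shape, not Parsinator's API


def skip_blank_lines(page: Page) -> Page:
    """Drop lines that are empty or contain only whitespace."""
    return [line for line in page if line.strip()]


def skip_if_matches(pattern: str) -> Skipper:
    """Build a skipper that drops every line matching the regular expression."""
    regex = re.compile(pattern)
    return lambda page: [line for line in page if not regex.search(line)]


def skip_line_count_from_start(count: int) -> Skipper:
    """Build a skipper that drops the first `count` lines of a page."""
    return lambda page: page[count:]


def apply_skippers(pages: List[Page], skippers: List[Skipper]) -> List[Page]:
    """Run every skipper, in order, on every page."""
    cleaned = []
    for page in pages:
        for skipper in skippers:
            page = skipper(page)
        cleaned.append(page)
    return cleaned


pages = [
    ["ACME Corp", "Invoice 001", "", "Widget 2 10.00", "Page 1 of 2"],
    ["ACME Corp", "Gadget 1 5.00", "", "Page 2 of 2"],
]
skippers = [
    skip_line_count_from_start(1),            # one-line page header
    skip_blank_lines,
    skip_if_matches(r"^Page \d+ of \d+$"),    # page footer
]
cleaned_pages = apply_skippers(pages, skippers)
print(cleaned_pages)
```

The order of the skippers matters in this sketch: the header is removed by position first, so the later skippers only see the remaining lines.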

Parsers

Parsers can be applied once per page or line, or repeatedly to each of the lines spanning multiple pages. These are Parsinator's parsers:

AndThen
Concatenate
IfThen
Not
OrElse
ParseFromFirstRegexToLastRegex
ParseFromFirstRegexToRegex
ParseFromGenerator
ParseFromLineNumberUntilFirstMatchOfRegex
ParseFromLineNumberWithRegex
ParseFromLineWithCountAfterPosition
ParseFromMultiGroupRegex
ParseFromOutput
ParseFromRegex
ParseFromRegexToLastRegex
ParseFromRegexToRegex
ParseFromSplitting
ParseFromValue
Required
Validate
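
As an illustration of capturing text by pattern, the following Python sketch shows a parser that takes the first match of a regular expression, one that looks only at a given line, and a fallback combinator. None of this is Parsinator's API: the names, signatures and the fallback behavior are assumptions made for the example.

```python
import re
from typing import Callable, List, Optional

# Hypothetical shape: a parser takes the lines of a page and returns the
# captured text, or None when nothing was captured.
Parser = Callable[[List[str]], Optional[str]]


def parse_from_regex(pattern: str) -> Parser:
    """Capture group 1 of the first line that matches the pattern."""
    regex = re.compile(pattern)

    def parse(lines: List[str]) -> Optional[str]:
        for line in lines:
            match = regex.search(line)
            if match:
                return match.group(1)
        return None

    return parse


def parse_from_line_number(line_number: int, pattern: str) -> Parser:
    """Capture group 1 of the pattern, looking only at one 1-based line."""
    regex = re.compile(pattern)

    def parse(lines: List[str]) -> Optional[str]:
        if not 1 <= line_number <= len(lines):
            return None
        match = regex.search(lines[line_number - 1])
        return match.group(1) if match else None

    return parse


def or_else(first: Parser, second: Parser) -> Parser:
    """Try the first parser; fall back to the second if it captures nothing."""
    def parse(lines: List[str]) -> Optional[str]:
        result = first(lines)
        return result if result is not None else second(lines)

    return parse


page = ["Invoice No: 001", "Date: 2020-01-31", "Total due 15.00"]

invoice_number = parse_from_regex(r"Invoice No: (\d+)")(page)
date = parse_from_line_number(2, r"Date: (\S+)")(page)
total = or_else(
    parse_from_regex(r"Amount: ([\d.]+)"),     # layout A (absent here)
    parse_from_regex(r"Total due ([\d.]+)"),   # layout B
)(page)
print(invoice_number, date, total)
```

A fallback like this is one way to support files with different layouts: each layout gets its own pattern and the first one that captures something wins.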

Transformations

Transformations can use a single skipper or multiple skippers. You can also create your own transformation to suit your needs.

TransformFromMultipleSkips
TransformFromSingleSkip
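
The sketch below shows, in Python, one way a transformation could flatten lines that continue across pages into a single stream. It is an assumption-laden illustration, not Parsinator's implementation: the function name `transform_between` and the idea of delimiting lines with a start and an end pattern are hypothetical.

```python
import re
from typing import List


def transform_between(pages: List[List[str]],
                      start_pattern: str,
                      end_pattern: str) -> List[str]:
    """Collect, from every page, the lines after the start marker and
    before the end marker, and return them as one flat list."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    stream: List[str] = []
    for page in pages:
        inside = False
        for line in page:
            if not inside and start.search(line):
                inside = True          # marker line itself is not kept
            elif inside and end.search(line):
                break                  # stop at the end marker on this page
            elif inside:
                stream.append(line)
    return stream


pages = [
    ["Invoice 001", "Item Qty Price", "Widget 2 10.00", "Continued on next page"],
    ["Invoice 001", "Item Qty Price", "Gadget 1 5.00", "Total 15.00"],
]
detail_lines = transform_between(
    pages,
    start_pattern=r"^Item\s+Qty\s+Price$",
    end_pattern=r"^(Continued|Total)",
)
print(detail_lines)
```

Pages without a start marker contribute nothing in this sketch, which keeps cover pages or trailing legal pages out of the stream.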

Use cases

Create an XML file from a PDF file

Create a unit test
Manually create the expected XML file
Create a dataset for the XML file to hold the data once the file is parsed
- Name your dataset after your root node
- Create a table for every node. Name it after the corresponding parsed section
- Create a column for every value or attribute. Name it after the corresponding parsed value in that section
Identify blocks of text that can be ignored, if any. For example: legal notices, headers, footers and empty lines
Add the appropriate skippers
Identify the patterns of text to extract from a given page or line
Add the appropriate parsers
Identify lines that span multiple pages. They may sit between certain keywords or patterns
Add a transformation to make a single stream of lines from them
- Use existing skippers to delimit these lines
- Add your own transformation
Add the appropriate parsers to extract text from each of the transformed lines
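
The steps above can be sketched end to end in Python. This is only an illustration of the skip, parse and transform flow under stated assumptions: the invoice text, the element names and every function here are hypothetical, the pages are assumed to be already extracted from the PDF as lists of lines, and none of it reflects Parsinator's actual API.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input: text already extracted from a two-page PDF invoice.
PAGES = [
    ["ACME Corp", "Invoice No: 001", "", "Item Qty Price",
     "Widget 2 10.00", "Page 1 of 2"],
    ["ACME Corp", "Item Qty Price", "Gadget 1 5.00",
     "Total 15.00", "Page 2 of 2"],
]


def skip(page):
    """Skippers: drop the one-line header, blank lines and the page footer."""
    body = page[1:]
    return [line for line in body
            if line.strip() and not re.match(r"Page \d+ of \d+$", line)]


def first_match(lines, pattern):
    """Parser: group 1 of the first matching line, or None."""
    for line in lines:
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


def detail_lines(pages):
    """Transformation: one stream of the lines under each table heading."""
    stream = []
    for page in pages:
        inside = False
        for line in page:
            if re.match(r"Item\s+Qty\s+Price$", line):
                inside = True
            elif re.match(r"Total\b", line):
                inside = False
            elif inside:
                stream.append(line)
    return stream


def build_invoice(pages):
    """Header values come from the cleaned pages, details from the stream."""
    cleaned = [skip(page) for page in pages]
    all_lines = [line for page in cleaned for line in page]

    root = ET.Element("Invoice")
    header = ET.SubElement(root, "Header")
    ET.SubElement(header, "Number").text = first_match(all_lines, r"Invoice No: (\d+)")
    ET.SubElement(header, "Total").text = first_match(all_lines, r"Total ([\d.]+)")

    for line in detail_lines(cleaned):
        match = re.match(r"(\w+) (\d+) ([\d.]+)$", line)
        if match:
            ET.SubElement(root, "Line", name=match.group(1),
                          quantity=match.group(2), price=match.group(3))
    return root


invoice = build_invoice(PAGES)
print(ET.tostring(invoice, encoding="unicode"))
```

In a unit test, the resulting tree would be compared against the manually created expected XML file from the first steps.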

Please take a look at the Sample project to see how to parse a plain-text invoice and a GPS frame.
