Cesar Aguirre edited this page Aug 3, 2020 · 2 revisions

Parsinator

Parsinator turns structured and unstructured text into a header-detail representation. You can use Parsinator to create an XML file from a PDF file, to build an object from a printer spool file, or to parse relevant data from any text.

Why?

One day, you are asked to write a piece of software that extracts relevant information from a text-based PDF file so that your main software can process it later. You have to finish it by the end of the week, and it has to support PDF files with different layouts. By the way, you're behind schedule. Et voilà! Here you have Parsinator. You can read more about the motivation behind Parsinator in A Tale of a Pdf Parser.

Getting Started

Usage

Parsinator uses three types of entities:

  • Skipper: It removes chunks of text from the input before parsing. For example, legal notices, or the header and footer of an invoice.
  • Parser: It captures text based on a pattern. For example, the first match of a regular expression in the input text or at a given line.
  • Transformation: It reduces lines spanning multiple pages into a single stream of text.

Remember you can roll your own entities!

Skippers

Skippers are applied to every page of the input text. These are Parsinator's skippers:

  • SkipBeforeRegexAndAfterRegex
  • SkipBlankLines
  • SkipFromFirstMatchOfRegex
  • SkipFromFirstRegexToLastRegex
  • SkipIfDoesNotMatch
  • SkipIfMatches
  • SkipLineCountFromEnd
  • SkipLineCountFromLineNumber
  • SkipLineCountFromStart
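
To make the "applied to every page" behaviour concrete, here is a minimal sketch in Python. It is not Parsinator's API: `skip_blank_lines`, `skip_if_matches` and `apply_skippers` are hypothetical functions that only mirror what names such as SkipBlankLines and SkipIfMatches suggest, and a page is assumed to be a plain list of lines.

```python
import re
from typing import Callable, List

Page = List[str]
Skipper = Callable[[Page], Page]


def skip_blank_lines(page: Page) -> Page:
    """Drop lines that are empty or contain only whitespace."""
    return [line for line in page if line.strip()]


def skip_if_matches(pattern: str) -> Skipper:
    """Build a skipper that drops every line matching the pattern."""
    regex = re.compile(pattern)

    def skip(page: Page) -> Page:
        return [line for line in page if not regex.search(line)]

    return skip


def apply_skippers(pages: List[Page], skippers: List[Skipper]) -> List[Page]:
    """Run every skipper, in order, on every page."""
    result = []
    for page in pages:
        for skipper in skippers:
            page = skipper(page)
        result.append(page)
    return result


# Made-up two-page input with blank lines and page footers.
pages = [
    ["ACME Corp", "", "Invoice No. 123", "Page 1 of 2"],
    ["ACME Corp", "Total: 45.00", "", "Page 2 of 2"],
]
cleaned = apply_skippers(
    pages, [skip_blank_lines, skip_if_matches(r"^Page \d+ of \d+$")]
)
```

After this runs, `cleaned` holds the same two pages without blank lines or footers, which is the state the parsers would then work on.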

Parsers

Parsers can be applied once per page or per line, or repeatedly to each of the lines that span multiple pages. These are Parsinator's parsers:

  • AndThen
  • Concatenate
  • IfThen
  • Not
  • OrElse
  • ParseFromFirstRegexToLastRegex
  • ParseFromFirstRegexToRegex
  • ParseFromGenerator
  • ParseFromLineNumberUntilFirstMatchOfRegex
  • ParseFromLineNumberWithRegex
  • ParseFromLineWithCountAfterPosition
  • ParseFromMultiGroupRegex
  • ParseFromOutput
  • ParseFromRegex
  • ParseFromRegexToLastRegex
  • ParseFromRegexToRegex
  • ParseFromSplitting
  • ParseFromValue
  • Required
  • Validate
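
The list mixes parsers that capture text (such as ParseFromRegex) with parsers that combine other parsers (such as OrElse). The Python sketch below illustrates that split under stated assumptions: it is not Parsinator's API, `parse_from_regex` and `or_else` are hypothetical names, and a parser is assumed to return a key-value pair or nothing.

```python
import re
from typing import Callable, Dict, List, Optional

Parser = Callable[[List[str]], Optional[Dict[str, str]]]


def parse_from_regex(key: str, pattern: str) -> Parser:
    """Capture group 1 of the first line that matches the pattern."""
    regex = re.compile(pattern)

    def parse(lines: List[str]) -> Optional[Dict[str, str]]:
        for line in lines:
            match = regex.search(line)
            if match:
                return {key: match.group(1)}
        return None

    return parse


def or_else(first: Parser, second: Parser) -> Parser:
    """Try the first parser; fall back to the second if it finds nothing."""

    def parse(lines: List[str]) -> Optional[Dict[str, str]]:
        result = first(lines)
        return result if result is not None else second(lines)

    return parse


# Made-up page; the two patterns stand for two invoice layouts.
page = ["Invoice No. 123", "Date: 2020-08-03", "Total: 45.00"]
number_parser = or_else(
    parse_from_regex("Number", r"Invoice #(\d+)"),
    parse_from_regex("Number", r"Invoice No\. (\d+)"),
)
number = number_parser(page)
```

Here the first pattern does not match the page, so the fallback captures the invoice number. Combinators like this are one way to support several layouts with the same set of keys.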

Transformations

Transformations can use one or more skippers. You can also create your own transformation to suit your needs.

  • TransformFromMultipleSkips
  • TransformFromSingleSkip
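
As a rough illustration of a transformation built from a single skipper, the Python sketch below keeps only the lines between two patterns on every page and joins what is left into one stream. It is not Parsinator's API: `skip_before_and_after` and `transform_from_single_skip` are hypothetical, and their behaviour is an assumption based on the names SkipBeforeRegexAndAfterRegex and TransformFromSingleSkip.

```python
import re
from typing import Callable, List

Page = List[str]
Skipper = Callable[[Page], Page]


def skip_before_and_after(before: str, after: str) -> Skipper:
    """Keep only the lines strictly between the two patterns."""
    before_re, after_re = re.compile(before), re.compile(after)

    def skip(page: Page) -> Page:
        kept, inside = [], False
        for line in page:
            if not inside and before_re.search(line):
                inside = True  # start keeping lines after this one
            elif inside and after_re.search(line):
                break  # everything from here on is skipped
            elif inside:
                kept.append(line)
        return kept

    return skip


def transform_from_single_skip(skipper: Skipper) -> Callable[[List[Page]], List[str]]:
    """Apply the skipper to every page and concatenate the remaining lines."""

    def transform(pages: List[Page]) -> List[str]:
        return [line for page in pages for line in skipper(page)]

    return transform


# Made-up invoice whose detail lines continue on a second page.
pages = [
    ["Invoice No. 123", "Description  Qty", "Widget  2", "Gadget  1", "Continued on next page"],
    ["Invoice No. 123", "Description  Qty", "Gizmo  5", "Total: 45.00"],
]
transform = transform_from_single_skip(
    skip_before_and_after(r"^Description", r"^(Continued|Total)")
)
detail_lines = transform(pages)
```

The result is a flat list of detail lines with the per-page headings and totals gone, so a line-level parser can run over it without knowing about page breaks.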

Use cases

Create an XML file from a PDF file

  1. Create a unit test
  2. Manually create the expected XML file
  3. Create a dataset that represents the XML file once the input file has been parsed.
    • Name your dataset after your root node
    • Create a table for every node. Name it after the corresponding parsed section
    • Create a column for every value or attribute. Name it after the corresponding parsed value in the given section
  4. Identify blocks of text that can be ignored, if any. For example: legal notices, headers, footers, empty lines
  5. Add the appropriate skippers
  6. Identify the patterns of text to extract from a given page or line
  7. Add the appropriate parsers
  8. Identify lines that span multiple pages. They may sit between certain keywords or patterns
  9. Add a transformation to make a single stream of lines from them
    • Use existing skippers to delimit these lines
    • Add your own transformation
  10. Add the appropriate parsers to extract text from each of the transformed lines
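
The steps above can be sketched end to end in Python. This is only an outline under stated assumptions, not Parsinator's API: the invoice text, the `parse_invoice` function and the nested-dictionary "dataset" (root node, then one table per section, then one column per value) are all made up for illustration.

```python
import re
from typing import Dict, List


def parse_invoice(pages: List[List[str]]) -> Dict:
    # Steps 4-5: skip blank lines and page footers on every page.
    footer = re.compile(r"^Page \d+ of \d+$")
    pages = [
        [line for line in page if line.strip() and not footer.match(line)]
        for page in pages
    ]

    # Steps 6-7: parse header values from the first page.
    header = {}
    for line in pages[0]:
        match = re.match(r"^Invoice No\. (\d+)$", line)
        if match:
            header["Number"] = match.group(1)
            break

    # Steps 8-9: turn the detail lines of all pages into a single stream.
    stream = []
    for page in pages:
        inside = False
        for line in page:
            if line.startswith("Description"):
                inside = True
            elif line.startswith(("Continued", "Total")):
                inside = False
            elif inside:
                stream.append(line)

    # Step 10: parse every line of the stream.
    items = []
    for line in stream:
        match = re.match(r"^(\w+)\s+(\d+)$", line)
        if match:
            items.append({"Description": match.group(1), "Quantity": match.group(2)})

    # Step 3: dataset named after the root node, one table per section.
    return {"Invoice": {"Header": [header], "Items": items}}


def test_parse_invoice() -> None:
    # Steps 1-2: the test and the expected result are written first.
    pages = [
        ["Invoice No. 123", "", "Description  Qty", "Widget  2", "Continued", "Page 1 of 2"],
        ["Invoice No. 123", "Description  Qty", "Gizmo  5", "Total: 45.00", "Page 2 of 2"],
    ]
    expected = {
        "Invoice": {
            "Header": [{"Number": "123"}],
            "Items": [
                {"Description": "Widget", "Quantity": "2"},
                {"Description": "Gizmo", "Quantity": "5"},
            ],
        }
    }
    assert parse_invoice(pages) == expected


test_parse_invoice()
```

Writing the expected result first keeps the skippers, parsers and transformation honest: each one is added only to move the actual output closer to the expected one.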

Please take a look at the Sample project to see how to parse a plain-text invoice and a GPS frame.