DelimiterFinder

DelimiterFinder is a Python package for probabilistic delimiter detection. It is a fast, efficient, and easy-to-use tool for identifying unknown delimiters within tabular data.

Key Features

Versatility: Detects both single- and multi-character delimiters.
Versatility: Supports tabular data stored in a variety of formats, including common tabular data files (e.g., CSV, TSV, TXT) and Python string and list types.
Robustness: Leverages Bayesian techniques to probabilistically identify unknown delimiters given data.
Robustness: Includes significance testing for all results.
Robustness: Tolerates malformed data (not an "all or nothing" approach in the case of malformed rows).
Transparency: Reports posterior probabilities for all identified candidate delimiters.
Fast and efficient: Detects delimiters with a high level of confidence given just 10 rows.

Installation

Install the latest released version from PyPI.

pip install DelimiterFinder

User Guide

Parameters and methods for `DelimiterFinder.finder.Finder`

class DelimiterFinder.finder.Finder(ignore_chars=None)

Parameter	Type	Default	Optional	Description
`ignore_chars`	`list`	None	Yes	List of non-alphanumeric characters which should not be considered candidate delimiters.

Attributes	Type	Description
`posterior`	`dict`	The posterior probability of each candidate delimiter.
`bayes_factor`	`float`	Evidence in favor of the most likely delimiter (MAP) relative to the second most likely delimiter.

Methods:

find(data, is_path=False, num_samples=20, new_line_sep="\n")

Parameter	Type	Default	Optional	Description
`data`	`str` or `list`		No	The input data, either as a single string with rows separated by `new_line_sep` or as a list where each element is a row. Alternatively, a path to a text file (e.g., .TXT, .CSV) may be passed, in which case the `is_path` parameter should be set to `True`. Data should have more than one row.
`is_path`	`bool`	False	Yes	An indicator for whether the value passed to the `data` parameter is a file path.
`num_samples`	`int`	20	Yes	Number of rows to sample for inference.
`new_line_sep`	`str`	"\n"	Yes	The new line separator for the rows in the data.

Return	Type	Description
`delim`	`str`	The maximum a posteriori probability (MAP) estimate.

Example

Using DelimiterFinder is easy. To get started, simply create an instance of the Finder class and pass your data to the find method. The example below walks through a simple implementation.

>>> from DelimiterFinder.finder import Finder
>>> # example data
>>> data = "c_1~|~c_2~|~c_3\n1~|~2~|~3\n4~|~~|~\n5~|~~|~6"
>>> # create instance of Finder and fit to data
>>> delim_locator = Finder()
>>> delim = delim_locator.find(data)
>>> # check the most likely delimiter
>>> print(delim)
~|~
>>> # check the probabilities for each delimiter
>>> print(delim_locator.posterior)
{'_': 0.022, '~|~': 0.977}
>>> # check the results of the significance test
>>> print(delim_locator.bayes_factor)
42.66

As the output above shows, DelimiterFinder identified an unknown three-character delimiter. The posterior attribute is a dictionary of all tested candidate delimiters and their posterior probabilities. The bayes_factor attribute shows that there is strong evidence (i.e., a value greater than 10) in favor of the most likely delimiter relative to the second most likely delimiter. All with just 4 rows of data!

DelimiterFinder can handle much more complicated data than the example above, and confidence in its decision increases with the number of rows provided. It has been tested for robustness against hundreds of randomly generated test cases, which can be found in the tests directory of the GitHub repo.
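The parameter table for find also lists list and file-path inputs. The sketch below passes the same example table in all three documented forms. The guarded import and the temporary file are illustration-only assumptions so that the snippet still runs where the package is not installed; no output is shown because it depends on the installed version.

```python
import os
import tempfile

# Guarded import (an assumption for illustration): skip the calls if the
# package is not available in this environment.
try:
    from DelimiterFinder.finder import Finder
except ImportError:
    Finder = None

# The same table in the three documented input forms.
text = "c_1~|~c_2~|~c_3\n1~|~2~|~3\n4~|~~|~\n5~|~~|~6"
rows = text.split("\n")  # list input: one element per row

# Write the table to a temporary file for the path-based call.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(text)
    path = handle.name

if Finder is not None:
    finder = Finder()
    print(finder.find(text))                # single string
    print(finder.find(rows))                # list of rows
    print(finder.find(path, is_path=True))  # path to a text file

os.remove(path)  # clean up the temporary file
```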

Bayesian Methods

Inference

DelimiterFinder leverages Bayesian techniques to probabilistically identify unknown delimiters given data. In particular, DelimiterFinder fits a model using sequential Bayesian updating.

The model is given as follows:
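Written out from the definitions that follow, this is Bayes' rule over a finite set of candidates. The notation is a reconstruction. The explicit form of the likelihood, where c_theta(.) counts the columns obtained by splitting on theta and H is the header, is an assumption about what "proportion" means:

```latex
P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{\sum_{\theta'} P(X \mid \theta')\, P(\theta')},
\qquad
P(X \mid \theta) = \frac{\min\bigl(c_\theta(H),\, c_\theta(X)\bigr)}{\max\bigl(c_\theta(H),\, c_\theta(X)\bigr)}
\quad \text{(assumed form)}
```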

Here, theta is one of a finite set of candidate delimiters. Candidate delimiters are all contiguous strings of valid (i.e., not in the given ignore_chars list) non-alphanumeric characters in the first row of data (assumed to be the header). The prior for each candidate delimiter is its relative frequency among the candidates found in the header. The variable X represents a row of data. The likelihood is the ratio between the number of columns in the header and the number of columns in the given row of data, assuming delimiter theta is the true delimiter. Since this is a discrete distribution with a finite number of candidate delimiters, the denominator (normalization constant) is the sum over all thetas of the likelihood times the prior.
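A minimal sketch of these ingredients, not the package's actual code. The function names are hypothetical, and the likelihood is assumed to be the smaller column count divided by the larger one, which is one reading of the description above.

```python
import re
from collections import Counter


def candidate_prior(header, ignore_chars=None):
    """Return {candidate: prior} from runs of non-alphanumeric characters."""
    ignored = "".join(re.escape(ch) for ch in (ignore_chars or []))
    # A run may contain anything except letters, digits, and ignored characters.
    pattern = "[^A-Za-z0-9" + ignored + "]+"
    counts = Counter(re.findall(pattern, header))
    total = sum(counts.values())
    return {cand: n / total for cand, n in counts.items()}


def column_ratio(header, row, delim):
    """Assumed likelihood: smaller column count divided by the larger one."""
    n_header = len(header.split(delim))
    n_row = len(row.split(delim))
    return min(n_header, n_row) / max(n_header, n_row)


def posterior_after_row(prior, header, row):
    """One application of Bayes' rule over the finite candidate set."""
    unnormalized = {d: column_ratio(header, row, d) * p for d, p in prior.items()}
    evidence = sum(unnormalized.values())  # normalization constant
    return {d: v / evidence for d, v in unnormalized.items()}


header = "c_1~|~c_2~|~c_3"
prior = candidate_prior(header)
print(prior)  # {'_': 0.6, '~|~': 0.4}
print(posterior_after_row(prior, header, "1~|~2~|~3"))
```

Passing ignore_chars=["_"] to candidate_prior leaves "~|~" as the only candidate for this header.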

The model is updated sequentially over M rows of data as follows:
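In symbols, with X_1, ..., X_M denoting the rows, the update can be written as below. The notation is a reconstruction from the surrounding description rather than the original equation:

```latex
P(\theta \mid X_1, \dots, X_{N+1}) =
\frac{P(X_{N+1} \mid \theta)\, P(\theta \mid X_1, \dots, X_N)}
     {\sum_{\theta'} P(X_{N+1} \mid \theta')\, P(\theta' \mid X_1, \dots, X_N)},
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid X_1, \dots, X_M)
```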

The posterior probabilities from row N are used as priors in row N+1. This is implemented sequentially for all rows 1...N...M. Finally, the maximum a posteriori probability (MAP) estimate is taken to be the delimiter.
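The loop can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: sequential_update and column_ratio are hypothetical names, and the likelihood is assumed to be the smaller column count divided by the larger one. With the example data from this README, the sketch arrives at the same delimiter and approximately the same posterior probabilities.

```python
def sequential_update(prior, rows, likelihood):
    """Fold rows into the prior one at a time; return (MAP estimate, posterior)."""
    posterior = dict(prior)
    for row in rows:
        unnormalized = {d: likelihood(row, d) * p for d, p in posterior.items()}
        evidence = sum(unnormalized.values())
        # The normalized posterior becomes the prior for the next row.
        posterior = {d: v / evidence for d, v in unnormalized.items()}
    return max(posterior, key=posterior.get), posterior


header = "c_1~|~c_2~|~c_3"
rows = ["1~|~2~|~3", "4~|~~|~", "5~|~~|~6"]


def column_ratio(row, delim):
    # Assumed likelihood: agreement between header and row column counts.
    n_header, n_row = len(header.split(delim)), len(row.split(delim))
    return min(n_header, n_row) / max(n_header, n_row)


# Prior from the header: '_' occurs three times, '~|~' twice.
prior = {"_": 0.6, "~|~": 0.4}
delim, posterior = sequential_update(prior, rows, column_ratio)
print(delim)                                            # ~|~
print({d: round(p, 3) for d, p in posterior.items()})   # {'_': 0.023, '~|~': 0.977}
```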

Hypothesis Testing

A Bayesian hypothesis test is used to evaluate the significance of the most likely delimiter. The framework for this hypothesis test is as follows: hypothesis one is that the delimiter with the highest posterior probability (MAP estimate) is the true delimiter, and hypothesis two is that the delimiter with the second highest posterior probability is the true delimiter. The more likely hypothesis one is relative to hypothesis two, the more confident we are in the model's choice of most likely delimiter.

To conduct this hypothesis test, we calculate the Bayes factor, which is the ratio of the likelihoods of the two hypotheses.
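One way to write this down, with theta_(1) the MAP estimate and theta_(2) the runner-up after M rows, is shown below. This sketch assumes the factor is computed as the ratio of the two posterior probabilities, which is consistent with the statement that values below 1 cannot occur:

```latex
K = \frac{P(\theta_{(1)} \mid X_1, \dots, X_M)}{P(\theta_{(2)} \mid X_1, \dots, X_M)} \;\geq\; 1
```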

The following rules are used to determine the significance of the results given the Bayes factor:

1.) Bayes factor = 1: no evidence.
2.) 1 < Bayes factor < 3: weak evidence.
3.) 3 ≤ Bayes factor < 10: substantial evidence.
4.) Bayes factor ≥ 10: strong evidence.
5.) Bayes factor < 1: not possible in this hypothesis test.

Source: Jeffreys, Harold (1998) [1961]. The Theory of Probability (3rd ed.). Oxford, England. p. 432.

DelimiterFinder will raise a warning if the Bayes factor for the chosen delimiter is less than 3. Increasing the number of rows or adding unwanted characters to the ignore_chars list will generally increase the Bayes factor.
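The rules and the warning threshold can be sketched as below. The function names are hypothetical, the factor is assumed to be the ratio of the two largest posterior probabilities, and the boundary values 3 and 10 are assigned to the higher category so that the warning threshold of 3 lines up with the "weak evidence" band. The posterior values are illustrative, chosen so that the ratio is close to the 42.66 shown in the example.

```python
import warnings


def bayes_factor(posterior):
    """Ratio of the largest posterior probability to the second largest."""
    ranked = sorted(posterior.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf")  # no competing candidate
    return ranked[0] / ranked[1]


def evidence_label(factor):
    """Map a Bayes factor onto the categories listed above."""
    if factor < 1:
        raise ValueError("the MAP estimate cannot be less likely than the runner-up")
    if factor == 1:
        return "no evidence"
    if factor < 3:
        return "weak evidence"
    if factor < 10:
        return "substantial evidence"
    return "strong evidence"


# Illustrative, unrounded posterior for two candidates.
posterior = {"_": 0.009375 / 0.409375, "~|~": 0.4 / 0.409375}
factor = bayes_factor(posterior)
if factor < 3:
    # Mirrors the documented behavior: warn on weak evidence.
    warnings.warn("low confidence in the chosen delimiter")
print(round(factor, 2), evidence_label(factor))  # 42.67 strong evidence
```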
