Merge pull request #10 from pgolo/dev
Release 1.1.0
pgolo authored Oct 31, 2020
2 parents 30aa135 + c40fdbf commit 56986f1
Showing 28 changed files with 702 additions and 85 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ build/*
cythonized/*
pyd/*
dist/*
bin/*
!dist/*.whl
!dist/*.tar.gz
*.spec
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,19 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.1.0] - 2020-10-31

### Added

- Implicit instantiation of core classes
- Classes and functions for ad hoc creation of a tokenization config
- Methods to save (pickle) and load (unpickle) compiled Normalizer instance
- Wheel for Python 3.9

### Removed

### Changed

## [1.0.6] - 2020-09-10

### Changed
203 changes: 178 additions & 25 deletions README.md
@@ -175,24 +175,24 @@ Below are descriptions and examples of tokenizer config elements.
| `<split>` | where="l" value="?" | Separates token specified in `value` from **left** part of a bigger token. | where="l" value="kappa": `nf kappab` --> `nf kappa b` |
| `<split>` | where="m" value="?" | Separates token specified in `value` when it is found in the **middle** of a bigger token. | where="m" value="kappa": `nfkappab` --> `nf kappa b` |
| `<split>` | where="r" value="?" | Separates token specified in `value` from **right** part of a bigger token. | where="r" value="gamma": `ifngamma` --> `ifn gamma` |
| `<token>` | to="?" from="?" | Replaces token specified in `from` with another token specified in `to`. | to="protein" from="gene": `nf kappa b gene` --> `nf kappa b protein` |
| `<character>` | to="?" from="?" | Replaces character specified in `from` with another character specified in `to`. | to="e" from="ë": `citroën` --> `citroen` |

The `where` attribute of the `<split>` element may contain any combination of
the `l`, `m`, and `r` literals when the specified substring must be separated
in different places of a bigger string. So, instead of three different elements

```xml
<split where="l" value="word">
<split where="m" value="word">
<split where="r" value="word">
<split where="l" value="word" />
<split where="m" value="word" />
<split where="r" value="word" />
```

using the following single one

```xml
<split where="lmr" value="word">
<split where="lmr" value="word" />
```

will achieve the same result.
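
The same rule can also be declared directly in Python through the `sic.Model`
class described later in this README. A minimal sketch:

```python
import sic

model = sic.Model()
# equivalent to XML <split where="lmr" value="word" />
model.add_rule(sic.SplitToken('word', 'lmr'))
```
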
@@ -214,17 +214,68 @@ import sic
For a detailed description of all functions and methods, see the comments in
the source code.

### Class `sic.Model`

This class is designed to create tokenization rules directly in Python on the
fly. It is neither convenient nor recommended for complex normalization tasks,
but it can be handy for small ones where using an external XML config might be
overkill.

```python
# instantiate Model
model = sic.Model()

# model is case-sensitive
model.case_sensitive = True

# model will do nothing
model.bypass = True
```

**Method** `sic.Model.add_rule` adds a single tokenization instruction to the
Model instance:

```python
# equivalent to XML <split where="lmr" value="beta" />
model.add_rule(sic.SplitToken('beta', 'lmr'))

# equivalent to XML <token to="good" from="bad" />
model.add_rule(sic.ReplaceToken('bad', 'good'))

# equivalent to XML <character to="z" from="a" />
model.add_rule(sic.ReplaceCharacter('a', 'z'))
```

> **NB**: in case a new `sic.ReplaceToken` or `sic.ReplaceCharacter` instruction
> contradicts something that is already in the model, the newer instruction
> overrides the older one:
>
> ```python
> model.add_rule(sic.ReplaceToken('bad', 'good'))
> model.add_rule(sic.ReplaceToken('bad', 'better'))
> ```
>
> "bad" --> "good" will not be used; "bad" --> "better" will be used instead
**Method** `sic.Model.remove_rule` removes single tokenization instruction from
Model instance if is there:
```python
model.remove_rule(sic.ReplaceToken('bad', 'good'))
# tokenization rule that fits definition above will be removed from model
```
### Class `sic.Builder`

**Method** `sic.Builder.build_normalizer()` reads a tokenization config,
instantiates a `sic.Normalizer` object that performs tokenization according to
the rules specified in that config, and returns this `sic.Normalizer`
instance.

| ARGUMENT | TYPE       | DEFAULT | DESCRIPTION                                                 |
|:--------:|:----------:|:-------:|:-----------------------------------------------------------:|
| endpoint | str, Model | None    | Path to tokenizer configuration file, or a `Model` instance. |

```python
# create Builder object
builder = sic.Builder()

# create Normalizer object with default set of rules
machine = builder.build_normalizer()

# create Normalizer object with custom set of rules
machine = builder.build_normalizer('/path/to/config.xml')

# create Normalizer object using ad hoc model
model = sic.Model()
model.add_rule(sic.SplitToken('beta', 'lmr'))
machine = builder.build_normalizer(model)
```
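
As a quick check of the ad hoc model above, the resulting normalizer should
split `beta` out of bigger tokens. A minimal sketch (the input string and the
output shown in the comment are only illustrative):

```python
x = machine.normalize('tgfbeta1')
print(x)  # 'beta' is expected to be separated out, e.g. 'tgf beta 1'
```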

### Class `sic.Normalizer`

**Method** `sic.Normalizer.save()` saves the data structure of the
`sic.Normalizer` instance to a specified file (pickle).

| ARGUMENT | TYPE | DEFAULT | DESCRIPTION |
|:--------:|:----:|:-------:|:-------------------------------:|
| filename | str | n/a | Path and name of file to write. |

**Method** `sic.Normalizer.load()` reads a specified file (pickle) and loads
the stored data structure into the `sic.Normalizer` instance.

| ARGUMENT | TYPE | DEFAULT | DESCRIPTION |
|:--------:|:----:|:-------:|:------------------------------:|
| filename | str | n/a | Path and name of file to read. |

**Method** `sic.Normalizer.normalize()` performs string normalization
according to the rules ingested at the time of class initialization, and
returns the normalized string.

| ARGUMENT | TYPE | DEFAULT | DESCRIPTION |
|:-----------------:|:----:|:-------:|:----------------------------------:|
| source_string     | str  | n/a     | String to normalize.               |
| word_separator    | str  | ' '     | Word delimiter (single character). |
| normalizer_option | int  | 0       | Mode of post-processing.           |

**Argument** `normalizer_option` controls the way tokenized string is
post-processed:

| VALUE | DESCRIPTION                                                    |
|:-----:|:--------------------------------------------------------------:|
| 0     | Keep tokens in their original order.                           |
| 1     | Rearrange tokens in alphabetical order.                        |
| 2 | Rearrange tokens in alphabetical order and remove duplicates. |

**Property** `sic.Normalizer.result` retains the result of the last call to the
`sic.Normalizer.normalize` method as a dict object that includes the following
keys:

| KEY | VALUE TYPE | DESCRIPTION |
|:------------:|:---------------:|:----------------------------------------------------:|
| 'map'        | list(int)       | Map between normalized and original strings.          |
| 'r_map' | list(list(int)) | Reverse map between original and normalized strings. |

`sic.Normalizer.result['map']`: `sic.Normalizer.normalize()` not only generates
a normalized string out of the originally provided one, it also tries to map
character indexes in the normalized string back to those in the original one.
This map is represented as a list of integers where the item index is the
character position in the normalized string and the item value is the character
position in the original string. The map is only valid when the
`normalizer_option` argument of the `sic.Normalizer.normalize()` call has been
set to 0.

`sic.Normalizer.result['r_map']`: Reverse map between character locations in the
original string and its normalized reflection (the item index is the character
position in the original string; the item value is a list [`x`, `y`] where `x`
and `y` are respectively the lowest and highest indexes of the mapped character
in the normalized string).
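
A minimal sketch of how these maps can be read (the concrete index values
depend on the active tokenization config, so the comments below are only
indicative):

```python
import sic

builder = sic.Builder()
machine = builder.build_normalizer()

normalized = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=0)
result = machine.result

# result['map'][i] is the position in the original string of the character
# that ended up at position i in the normalized string
origin_of_first_char = result['map'][0]

# result['r_map'][j] is a [lowest, highest] pair of positions in the
# normalized string covered by the character at position j in the original
span_of_first_char = result['r_map'][0]

print(normalized, origin_of_first_char, span_of_first_char)
```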

### Function `sic.build_normalizer()`

`sic.build_normalizer()` implicitly creates a single instance of the
`sic.Normalizer` class that is accessible globally from the `sic` namespace.
Arguments are the same as for the `sic.Builder.build_normalizer()` method.

### Function `sic.save()`

`sic.save()` saves the data structure stored in the global instance of the
`sic.Normalizer` class to a specified file (pickle). Arguments are the same as
for the `sic.Normalizer.save()` method.

### Function `sic.load()`

`sic.load()` reads a specified file (pickle) and places the data structure
stored in that file into the global instance of the `sic.Normalizer` class.
Arguments are the same as for the `sic.Normalizer.load()` method.

### Function `sic.normalize()`

`sic.normalize(*args, **kwargs)` either uses the global `sic.Normalizer`
instance or creates a new local `sic.Normalizer` instance on the fly, and uses
it to perform the requested string normalization.

| ARGUMENT | TYPE | DEFAULT | DESCRIPTION |
|:-----------------:|:----:|:-------:|:-------------------------------------:|
| source_string | str | n/a | String to normalize. |
| word_separator | str | ' ' | Word delimiter (single character). |
| normalizer_option | int | 0 | Mode of post-processing. |
| tokenizer_config | str | None | Path to tokenizer configuration file. |

If the `tokenizer_config` argument is not provided, the function will use the
global instance of the `sic.Normalizer` class (creating it if it is not yet
initialized).

### Function `sic.reset()`

`sic.reset()` resets the global `sic.Normalizer` instance to `None`, forcing a
subsequent `sic.normalize()` call to create a new global instance again if it
needs one.
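
Putting the implicit workflow together, a minimal sketch (`/path/to/config.xml`
and `/path/to/normalizer.bin` are placeholder paths):

```python
import sic

# compile the global normalizer from a config and pickle it to disk
sic.build_normalizer('/path/to/config.xml')
sic.save('/path/to/normalizer.bin')

# restore the pickled data structure into the global normalizer
sic.load('/path/to/normalizer.bin')
print(sic.normalize('some string'))

# discard the global instance; a later sic.normalize() call will create
# a fresh default one if it needs it
sic.reset()
```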

### Attribute `sic.result`, function `sic.result()`

The `sic.result` attribute retains the value of the `sic.Normalizer.result`
property of the most recently used `sic.Normalizer` instance accessed through
the `sic.normalize()` function (either global or local).

Python 3.6 does not support [PEP-562](https://www.python.org/dev/peps/pep-0562/)
(module attributes). So in Python 3.6, use function `sic.result()` rather than
attribute `sic.result`:

```python
sic.result() # will work in Python >= 3.6
sic.result # will work in Python >= 3.7
```
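
For example, a minimal sketch (assuming the `sic.result()` function returns the
same dict that the `sic.result` attribute exposes):

```python
sic.normalize('alpha-2-macroglobulin-p')
mapping = sic.result()['map']  # Python >= 3.6
mapping = sic.result['map']    # Python >= 3.7
```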

## Examples

```python
import sic

# create Builder object
builder = sic.Builder()
# create Normalizer object with default set of rules
machine = builder.build_normalizer()

# using default word_separator and normalizer_option
x = machine.normalize('alpha-2-macroglobulin-p')
print(x) # 'alpha - 2 - macroglobulin - p'

# using custom word_separator
x = machine.normalize('alpha-2-macroglobulin-p', word_separator='|')
print(x) # 'alpha|-|2|-|macroglobulin|-|p'

# using normalizer_option=1
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=1)
print(x) # '- - - 2 alpha macroglobulin p'

# using normalizer_option=2
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=2)
print(x) # '- 2 alpha macroglobulin p'

# ad hoc normalization
x = sic.normalize('alpha-2-macroglobulin-p', word_separator='|')
print(x) # 'alpha|-|2|-|macroglobulin|-|p'

sic.build_normalizer('/path/to/config.xml')
x = sic.normalize('some string')
print(x) # will be normalized according to config at /path/to/config.xml

x = sic.normalize('some string', tokenizer_config='/path/to/another/config.xml')
print(x) # will be normalized according to config at /path/to/another/config.xml

# save/load compiled normalizer to/from disk
machine.save('/path/to/file') # will write /path/to/file
another_machine = sic.Normalizer()
another_machine.load('/path/to/file') # will read /path/to/file
```
Binary file removed dist/sic-1.0.6-cp36-cp36m-win_amd64.whl
Binary file not shown.
Binary file removed dist/sic-1.0.6-cp37-cp37m-win_amd64.whl
Binary file not shown.
Binary file removed dist/sic-1.0.6-cp38-cp38-win_amd64.whl
Binary file not shown.
Binary file removed dist/sic-1.0.6.tar.gz
Binary file not shown.
Binary file added dist/sic-1.1.0-cp36-cp36m-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.1.0-cp37-cp37m-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.1.0-cp38-cp38-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.1.0-cp39-cp39-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.1.0.tar.gz
Binary file not shown.
Binary file modified requirements.txt
Binary file not shown.
5 changes: 4 additions & 1 deletion scripts/linux/buildso.sh
@@ -4,12 +4,15 @@ MYDIR=`pwd`
ROOT=${MYDIR}/../..
ENV=.env.36
SRC=${ROOT}/sic
DIST=${ROOT}/bin
TEST=${ROOT}/test
cd ${ROOT}
rm -r ${ROOT}/build
rm -r ${ROOT}/cythonized
rm -r ${DIST}
mkdir -p ${DIST}
${ROOT}/${ENV}/bin/python3 ${TEST}/compile.py build_ext --inplace
mv ${SRC}/*.so ${DIST}
cp ${SRC}/__init__.py ${DIST}
cp ${SRC}/*.xml ${DIST}
cd ${RUNDIR}
5 changes: 4 additions & 1 deletion scripts/win/buildpyd.bat
@@ -3,12 +3,15 @@ set RUNDIR=%cd%
set ROOT=%~dp0..\..
set ENV=.env.37
set SRC=%ROOT%\sic
set DIST=%ROOT%\bin
set TEST=%ROOT%\test
cd %ROOT%
rmdir /S /Q %ROOT%\build
rmdir /S /Q %ROOT%\cythonized
rmdir /S /Q %DIST%
if not exist %DIST%\nul mkdir %DIST%
call %ROOT%\%ENV%\Scripts\python %TEST%\compile.py build_ext --inplace
move /Y %SRC%\*.pyd %DIST%\
copy /Y %SRC%\__init__.py %DIST%\
copy /Y %SRC%\*.xml %DIST%\
cd %RUNDIR%
11 changes: 6 additions & 5 deletions scripts/win/buildtargz.bat
@@ -12,11 +12,12 @@ set SHIPPING=%ROOT%\shipping

if (%1)==() (cd %RUNDIR% && exit)
if not exist "%1" (echo "%1": Python not found && cd %RUNDIR% && exit)
cd "%ROOT%"
virtualenv -p "%1" "%ENV%"
"%ENV%"\Scripts\python "%SHIPPING%"\make_setup.py sdist
"%ENV%"\Scripts\python "%ROOT%"\setup.py sdist
rmdir /S /Q "%ENV%"
cd %ROOT%
virtualenv -p %1 %ENV%
%ENV%\Scripts\python %SHIPPING%\make_setup.py sdist
%ENV%\Scripts\python %ROOT%\setup.py sdist
rmdir /S /Q %ENV%
rmdir /S /Q %ROOT%\sic.egg-info

del /Q "%ROOT%"\setup.py

22 changes: 12 additions & 10 deletions scripts/win/buildwheel.bat
@@ -14,18 +14,20 @@ set SHIPPING=%ROOT%\shipping
:BUILD
if (%1)==() (goto FINISH)
if not exist "%1" (echo "%1": Python not found && shift && goto BUILD)
cd "%ROOT%"
virtualenv -p "%1" "%ENV%"
"%ENV%"\Scripts\python -m pip install --no-cache-dir -r "%REQUIREMENTS%"
"%ENV%"\Scripts\python "%SHIPPING%"\make_setup.py bdist_wheel
"%ENV%"\Scripts\python "%ROOT%"\setup.py bdist_wheel
rmdir /S /Q "%ENV%"
rmdir /S /Q "%ROOT%"\sic.egg-info
rmdir /S /Q "%ROOT%"\build
del /Q "%ROOT%"\sic\core.c
cd %ROOT%
virtualenv -p %1 %ENV%
%ENV%\Scripts\python -m pip install --no-cache-dir -r %REQUIREMENTS%
%ENV%\Scripts\python "%SHIPPING%"\make_setup.py bdist_wheel
%ENV%\Scripts\python "%ROOT%"\setup.py bdist_wheel
rmdir /S /Q %ENV%
rmdir /S /Q %ROOT%\sic.egg-info
rmdir /S /Q %ROOT%\build
del /Q %ROOT%\sic\core.c
del /Q %ROOT%\sic\implicit.c
del /Q %ROOT%\sic\implicit.h
shift
goto BUILD

:FINISH
del /Q "%ROOT%"\setup.py
del /Q %ROOT%\setup.py
cd %RUNDIR%
2 changes: 1 addition & 1 deletion shipping/cythonize.py
@@ -1,5 +1,5 @@
try:
    from Cython.Build import cythonize
    ext_modules = cythonize(['sic/core.py', 'sic/implicit.py'], compiler_directives={'language_level': '3'})
except:
    pass
