Merge pull request #34 from pgolo/dev
sic-1.0.4
pgolo authored Sep 3, 2020
2 parents 9b514b3 + 5e07952 commit f9f0221
Showing 22 changed files with 254 additions and 56 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,6 +3,8 @@
build/*
cythonized/*
dist/*
!dist/*.whl
!dist/*.tar.gz
*.spec
sic.egg-info/*
sic/core.c
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,13 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.4] - 2020-09-03

### Added

- Normalizer.result['r_map'] attribute
- Scripts to build wheels

## [1.0.3] - 2020-07-30

### Added
1 change: 0 additions & 1 deletion MANIFEST.in
@@ -1,6 +1,5 @@
include README.md
include LICENSE
include sic/core.c
include sic/tokenizer.greek.xml
include sic/tokenizer.standard.xml
include sic/tokenizer.western.xml
40 changes: 29 additions & 11 deletions README.md
@@ -40,10 +40,18 @@ pip install sic

- `sic` is designed to work in Python 3 environment.
- `sic` only needs Python Standard Library (no other packages).
- Although `sic` leaves very little footprint, it is recommended that in
production environment, `Cython` is installed at the time of `sic`
installation. Then the module will be cythonized and will work much faster.
`Cython` is not required for `sic` to run, once `sic` is installed.

**Windows users:** The PyPI distribution includes wheels for Python 3.6, 3.7, and
3.8. The wheels include the cythonized module, so no additional steps are
required to achieve the best performance.

**Linux users:** The PyPI distribution does not include wheels for Linux, so
`pip` installs the source package when run under Linux. Installed this way,
the module still works, but it runs slower than the cythonized build. To
achieve higher processing speed, clone the repository, build the wheel on the
local system, and install it in the environment. See
`scripts/linux/buildwheel.sh` and the comments there for details.
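
One way to confirm which variant ended up in an environment is to look at the file that backs the `sic.core` module. The sketch below is only an illustration: the helper name `is_compiled_module` is hypothetical, and it assumes the cythonized build is importable as `sic.core` (the module that `shipping/cythonize.py` compiles).

```python
from importlib.machinery import EXTENSION_SUFFIXES


def is_compiled_module(module_file):
    """Return True if the path ends with a binary extension suffix.

    EXTENSION_SUFFIXES lists the suffixes the running interpreter accepts
    for compiled extension modules (for example '.so' or '.pyd').
    """
    return any(module_file.endswith(suffix) for suffix in EXTENSION_SUFFIXES)


if __name__ == '__main__':
    try:
        import sic.core  # assumption: this is the module that gets cythonized
    except ImportError:
        print('sic is not installed in this environment')
    else:
        kind = 'cythonized' if is_compiled_module(sic.core.__file__) else 'pure Python'
        print('sic.core is loaded from a', kind, 'module:', sic.core.__file__)
```

If the check reports a pure Python module on Linux, building and installing a local wheel as described above would be the next step.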


## Tokenization configs

@@ -249,12 +257,13 @@ controls the way tokenized string is post-processed:
**Property** `Normalizer.result` retains the result of last call for
`Normalizer.normalize` function as dict object with the following keys:

| KEY | VALUE TYPE | DESCRIPTION |
|:------------:|:----------:|:--------------------------------------------:|
| 'original' | str | Original string value that was processed. |
| 'normalized' | str | Returned normalized string value. |
| 'map' | list(int) | Map between original and normalized strings. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| KEY | VALUE TYPE | DESCRIPTION |
|:------------:|:---------------:|:----------------------------------------------------:|
| 'original' | str | Original string value that was processed. |
| 'normalized' | str | Returned normalized string value. |
| 'map' | list(int) | Map between original and normalized strings. |
| 'r_map' | list(list(int)) | Reverse map between original and normalized strings. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

`Normalizer.result['map']`: Not only does `Normalizer.normalize()` generate a
normalized string from the one originally provided, it also tries to map character
@@ -264,6 +273,12 @@ normalized string and item value is character position in original string. This
is only valid when `normalizer_option` argument for `Normalizer.normalize()`
call has been set to 0.

`Normalizer.result['r_map']`: Reverse map between character locations in the
original string and its normalized reflection (item index is a character
position in the original string; item value is a list [`x`, `y`] where `x` and
`y` are, respectively, the lowest and highest indexes of the mapped character in
the normalized string).
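
As a rough illustration of the structure described above, the sketch below derives a reverse map from a forward map shaped like `Normalizer.result['map']`. The function name `build_reverse_map` is hypothetical, and this is not necessarily how `sic` computes `r_map` internally; leaving unmapped original positions as `None` is also an assumption.

```python
def build_reverse_map(forward_map):
    """Build [lowest, highest] normalized indexes for each original position.

    forward_map[i] is the original-string position that the i-th character
    of the normalized string maps to.
    """
    if not forward_map:
        return []
    # Assumption: the original string is no longer than its highest mapped
    # position; positions that nothing maps to stay None.
    reverse = [None] * (max(forward_map) + 1)
    for normalized_pos, original_pos in enumerate(forward_map):
        if reverse[original_pos] is None:
            # First normalized index seen for this position is the lowest.
            reverse[original_pos] = [normalized_pos, normalized_pos]
        else:
            # Later normalized indexes extend the upper bound.
            reverse[original_pos][1] = normalized_pos
    return reverse


# Forward map for 'alpha-2-macroglobulin-p' -> 'alpha - 2 - macroglobulin - p'
forward_map = [
    0, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20, 21, 21, 22, 22,
]
print(build_reverse_map(forward_map)[5:9])  # [[5, 6], [7, 8], [9, 10], [11, 12]]
```

The slice printed at the end covers the `-2-m` part of the original string, where each original character corresponds to two characters of the normalized string.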

```python
# using default word_separator and normalizer_option
x = machine.normalize('alpha-2-macroglobulin-p')
@@ -275,6 +290,9 @@ print(machine.result)
'normalized': 'alpha - 2 - macroglobulin - p',
'map': [
0, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 21, 22, 22
],
'r_map': [
[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 26], [27, 28]
]
}
"""
@@ -288,6 +306,6 @@ x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=1)
print(x) # '- - - 2 alpha macroglobulin p'

# using normalizer_option=2
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=1)
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=2)
print(x) # '- 2 alpha macroglobulin p'
```
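
A possible use of `r_map` is locating a substring of the original string inside the normalized string. The sketch below reuses the values from the example output above; the helper name `normalized_span` is hypothetical and not part of `sic`.

```python
def normalized_span(r_map, start, end):
    """Map an inclusive span of original positions to an inclusive
    span of normalized positions using the reverse map."""
    return r_map[start][0], r_map[end][1]


original = 'alpha-2-macroglobulin-p'
normalized = 'alpha - 2 - macroglobulin - p'
r_map = [
    [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 6], [7, 8], [9, 10],
    [11, 12], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18],
    [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 26],
    [27, 28],
]

start = original.index('macroglobulin')      # 8
end = start + len('macroglobulin') - 1       # 20
low, high = normalized_span(r_map, start, end)
print((low, high))                           # (11, 24)
print(repr(normalized[low:high + 1]))        # ' macroglobulin'
```

The leading space in the result comes from the map itself: the separator inserted before `m` is attributed to the same original position as `m`, so a caller that wants the bare token may need to strip whitespace.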
Binary file added dist/sic-1.0.4-cp36-cp36m-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.0.4-cp37-cp37m-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.0.4-cp38-cp38-win_amd64.whl
Binary file not shown.
Binary file added dist/sic-1.0.4.tar.gz
Binary file not shown.
Binary file added requirements.txt
Binary file not shown.
File renamed without changes.
35 changes: 35 additions & 0 deletions scripts/linux/buildtargz.sh
@@ -0,0 +1,35 @@
# Usage:
# buildtargz.sh path/to/python
#
# The package will be placed in dist/ directory.

RUNDIR=`pwd`
cd `dirname $0`
MYDIR=`pwd`
ROOT=${MYDIR}/../..
REQUIREMENTS=${ROOT}/requirements.txt
ENV=${ROOT}/.env.build
SHIPPING=${ROOT}/shipping

if [ $# -eq 0 ]
then
    cd ${RUNDIR}
    exit
fi

if [ ! -f $1 ]
then
    echo $1: Python not found
    cd ${RUNDIR}
    exit
fi

cd ${ROOT}

virtualenv -p $1 ${ENV}
${ENV}/bin/python3 ${SHIPPING}/make_setup.py sdist
${ENV}/bin/python3 ${ROOT}/setup.py sdist

rm ${ROOT}/setup.py

cd ${RUNDIR}
34 changes: 34 additions & 0 deletions scripts/linux/buildwheel.sh
@@ -0,0 +1,34 @@
# Usage:
# buildwheel.sh path/to/python3.6 path/to/python3.7 path/to/python3.8
#
# The wheels will be placed in dist/ directory.

RUNDIR=`pwd`
cd `dirname $0`
MYDIR=`pwd`
ROOT=${MYDIR}/../..
REQUIREMENTS=${ROOT}/requirements.txt
ENV=${ROOT}/.env.build
SHIPPING=${ROOT}/shipping

cd ${ROOT}

for PY in "$@"
do
    if [ ! -f ${PY} ]
    then
        echo ${PY}: Python not found
    else
        virtualenv -p ${PY} ${ENV}
        ${ENV}/bin/python3 -m pip install --no-cache-dir -r ${REQUIREMENTS}
        ${ENV}/bin/python3 ${SHIPPING}/make_setup.py bdist_wheel
        ${ENV}/bin/python3 ${ROOT}/setup.py bdist_wheel
        rm -r ${ENV}
        rm -r ${ROOT}/sic.egg-info
        rm -r ${ROOT}/build
        rm ${ROOT}/sic/core.c
        rm ${ROOT}/setup.py
    fi
done

cd ${RUNDIR}
File renamed without changes.
23 changes: 23 additions & 0 deletions scripts/win/buildtargz.bat
@@ -0,0 +1,23 @@
@echo off
rem Usage:
rem buildtargz.bat path\to\python
rem
rem The package will be placed in dist\ directory.

set RUNDIR=%cd%
set MYDIR=%~dp0
set ROOT=%MYDIR%\..\..
set ENV=%ROOT%\.env.build
set SHIPPING=%ROOT%\shipping

if (%1)==() (cd %RUNDIR% && exit)
if not exist "%1" (echo "%1": Python not found && cd %RUNDIR% && exit)
cd "%ROOT%"
virtualenv -p "%1" "%ENV%"
"%ENV%"\Scripts\python "%SHIPPING%"\make_setup.py sdist
"%ENV%"\Scripts\python "%ROOT%"\setup.py sdist
rmdir /S /Q "%ENV%"

del /Q "%ROOT%"\setup.py

cd %RUNDIR%
31 changes: 31 additions & 0 deletions scripts/win/buildwheel.bat
@@ -0,0 +1,31 @@
@echo off
rem Usage:
rem buildwheel.bat path\to\python36 path\to\python37 path\to\python38
rem
rem The package will be placed in dist\ directory.

set RUNDIR=%cd%
set MYDIR=%~dp0
set ROOT=%MYDIR%\..\..
set REQUIREMENTS=%ROOT%\requirements.txt
set ENV=%ROOT%\.env.build
set SHIPPING=%ROOT%\shipping

:BUILD
if (%1)==() (goto FINISH)
if not exist "%1" (echo "%1": Python not found && shift && goto BUILD)
cd "%ROOT%"
virtualenv -p "%1" "%ENV%"
"%ENV%"\Scripts\python -m pip install --no-cache-dir -r "%REQUIREMENTS%"
"%ENV%"\Scripts\python "%SHIPPING%"\make_setup.py bdist_wheel
"%ENV%"\Scripts\python "%ROOT%"\setup.py bdist_wheel
rmdir /S /Q "%ENV%"
rmdir /S /Q "%ROOT%"\sic.egg-info
rmdir /S /Q "%ROOT%"\build
del /Q "%ROOT%"\sic\core.c
shift
goto BUILD

:FINISH
del /Q "%ROOT%"\setup.py
cd %RUNDIR%
39 changes: 0 additions & 39 deletions setup.py

This file was deleted.

5 changes: 5 additions & 0 deletions shipping/cythonize.py
@@ -0,0 +1,5 @@
try:
    from Cython.Build import cythonize
    ext_modules = cythonize(['sic/core.py'], compiler_directives={'language_level': '3'})
except:
    pass
16 changes: 16 additions & 0 deletions shipping/make_setup.py
@@ -0,0 +1,16 @@
import sys

def just_do_it(option):
    cythonize = ''
    if option in ['bdist_wheel']:
        with open('shipping/cythonize.py', mode='r', encoding='utf8') as f:
            cythonize = f.read()
    with open('shipping/setup.py', mode='r', encoding='utf8') as i, open('./setup.py', mode='w', encoding='utf8') as o:
        for line in i:
            if line.strip() != '# sic: cythonize?':
                o.write(line)
            else:
                o.write(cythonize)

if __name__ == '__main__':
    just_do_it(sys.argv[1] if len(sys.argv) > 1 else '')
36 changes: 36 additions & 0 deletions shipping/setup.py
@@ -0,0 +1,36 @@
import sys
from setuptools import setup

ext_modules = None
with open('README.md', mode='r', encoding='utf8') as f:
    long_description = f.read()

# sic: cythonize?

setup(
    name='sic',
    version='1.0.4',
    description='Utility for string normalization',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/pgolo/sic',
    author='Pavel Golovatenko-Abramov',
    author_email='[email protected]',
    packages=['sic'],
    ext_modules=ext_modules,
    include_package_data=True,
    license='MIT',
    platforms=['any'],
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Text Processing :: Linguistic',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
    ],
    python_requires='>=3.6'
)
8 changes: 8 additions & 0 deletions sic/core.pxd
@@ -32,6 +32,14 @@ cdef class Normalizer():

    cdef int chargroup(self, str s)

    @cython.locals(
        ret=cython.list,
        i=cython.int,
        j=cython.int,
        k=cython.int
    )
    cdef list reverse_map(self, list m)

    @cython.locals(
        original_string=cython.str,
        subtrie=cython.dict,