packaging upgrades, faster language id, bug fixes

@bdewilde bdewilde released this 02 Apr 22:46
d94c618

Took a (longer than expected) break from NLP, so this release is mostly maintenance and bug fixes, in anticipation of more interesting updates to come.

  • upgraded built-in language identification model (PR #375)
    • replaced v2 thinc/cld3 model with v3 floret/fasttext model, which has much faster predictions and comparable but more consistent performance
  • modernized and improved Python packaging for faster, simpler installation and testing (PR #368 and #369)
    • all package metadata and configuration moved into a single pyproject.toml file
    • code formatting and linting updated to use ruff plus newer versions of mypy and black, and their use in GitHub Actions CI has been consolidated
    • bumped supported Python versions range from 3.8–3.10 to 3.9–3.11 (PR #369)
    • added full CI testing matrix for Python 3.9/3.10/3.11 x Linux/macOS/Windows, and removed extraneous AppVeyor integration
  • updated and improved type hints throughout, reducing number of mypy complaints by ~80% (PR #372)

Fixed

  • fixed ReDoS bugs in regex patterns (PR #371)
  • fixed breaking API issues with newer networkx/scikit-learn versions (PR #367)
  • improved dev workflow documentation and code to better incorporate language data (PR #363)
  • updated caching code with a fix from the upstream pysize library; the underlying bug was preventing the Russian-language spaCy model from loading properly (PR #358)
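
For readers unfamiliar with the ReDoS item above, here is a minimal sketch of the general problem class. The patterns are hypothetical and chosen only for illustration; they are not the regexes that were changed in PR #371. The idea is that nested quantifiers can make a backtracking engine try exponentially many ways to split the input before giving up, while an equivalent pattern with a single quantifier fails quickly.

```python
import re
import time

# Hypothetical patterns for illustration only; these are NOT textacy's regexes.
# Nested quantifiers let the engine split a run of "a"s in exponentially many ways.
VULNERABLE = re.compile(r"^(a+)+$")
# Accepts the same strings (one or more "a"), but with a single quantifier.
SAFE = re.compile(r"^a+$")


def time_match(pattern: re.Pattern, text: str) -> float:
    """Return the seconds spent attempting to match `text` against `pattern`."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # A trailing "b" forces the match to fail, which triggers the backtracking.
    for n in (14, 16, 18, 20):
        text = "a" * n + "b"
        slow = time_match(VULNERABLE, text)
        fast = time_match(SAFE, text)
        print(f"n={n}: vulnerable {slow:.4f}s, safe {fast:.6f}s")
```

Running the script should show the time for the vulnerable pattern growing rapidly with each step in `n`, while the safe pattern stays roughly flat. Exact timings depend on the machine and Python version.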

Contributors

Big thanks to @jonwiggins, @Hironsan, and @kevinbackhouse for the fixes!