diff --git a/CHANGELOG.md b/CHANGELOG.md
index 52bf5a989f..067a361c29 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,50 +1,133 @@
Changes
=======
-## Unreleased
+## 4.0.0beta, FIXME 2020-10-??
+
+**⚠️ Gensim 4.0 contains breaking API changes! See the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) to update your existing Gensim 3.x code and models.**
+
+Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.
+
+### Main highlights (see also *👍 Improvements* below)
+
+* Massively optimized popular algorithms the community has grown to love: [fastText](https://radimrehurek.com/gensim/models/fasttext.html), [word2vec](https://radimrehurek.com/gensim/models/word2vec.html), [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html), [phrases](https://radimrehurek.com/gensim/models/phrases.html):
+
+ a. **Efficiency**
+
+ | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
+ |----------|------------|--------|
+ | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / **1.26 GB** / 914k words/s |
+ | word2vec | 1.7h / 0.36 GB / 1685k words/s | **1.2h** / 0.33 GB / 1762k words/s |
+
+ In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. ([4.0 benchmarks](https://github.com/RaRe-Technologies/gensim/issues/2887#issuecomment-711097334))
+
+ b. **Robustness**. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)
+
+ c. **Simplified OOP model** for easier model exports and integration with TensorFlow, PyTorch &co.
+
+  These improvements come to you transparently, aka "for free", but see the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) for some changes that break the old Gensim 3.x API. **Update your code accordingly** (see the short example after this list).
+
+* Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.
+  - Code quality was not up to our standards, and there was no one to maintain them, answer user questions, or support these modules.
+
+  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.
+
+* Dropped Python 2. Gensim 4.0 is Py3.6+. Read our [Python version support policy](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility).
+ - If you still need Python 2 for some reason, stay at [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).
+
+* A new [Gensim website](https://radimrehurek.com/gensim_4.0.0) – finally! 🙃
+
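+For illustration, a minimal sketch of the kind of change the Migration guide covers. The attribute names below (`wv.index_to_key` replacing the old `wv.vocab`-based lookups) are our reading of the new `KeyedVectors` API; consult the guide for the authoritative list:
+
+```python
+from gensim.models import Word2Vec
+
+# Toy corpus, just to have a trained model to poke at.
+model = Word2Vec(sentences=[["hello", "world"], ["hello", "gensim"]], min_count=1)
+
+# Gensim 3.x: vocabulary lived in a dict of Vocab objects.
+# words = list(model.wv.vocab)
+
+# Gensim 4.0: KeyedVectors expose the vocabulary directly.
+words = model.wv.index_to_key   # words, ordered by descending frequency
+vector = model.wv["hello"]      # per-word vectors work as before
+```
+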
+So, a major clean-up release overall. We're happy with this **tighter, leaner and faster Gensim**.
+
+This is the direction we'll keep going forward: less of a kitchen sink of "latest academic algorithms", more focus on robust engineering, targeting common, concrete NLP & document similarity use-cases.
+
+### Why pre-release?
+
+This 4.0.0beta pre-release is for users who want the **cutting-edge performance and bug fixes**, and for users who want to help out by **testing and providing feedback** on code, documentation, workflows… Please let us know on the [mailing list](https://groups.google.com/forum/#!forum/gensim)!
+
+Install the pre-release with:
+
+```bash
+pip install --pre --upgrade gensim
+```
+
+### What will change between this pre-release and a "full" 4.0 release?
+
+Production stability is important to Gensim, so we're improving the process of **upgrading already-trained saved models**. There'll be an explicit model upgrade script between each `4.n` and `4.(n+1)` Gensim release. Check progress [here](https://github.com/RaRe-Technologies/gensim/milestone/3).
-This release contains a major refactoring.
### :+1: Improvements
-* Refactor ldamulticore to serialize less data (PR [#2300](https://github.com/RaRe-Technologies/gensim/pull/2300), __[@horpto](https://github.com/horpto)__)
-* KeyedVectors & X2Vec API streamlining, consistency (PR [#2698](https://github.com/RaRe-Technologies/gensim/pull/2698), __[@gojomo](https://github.com/gojomo)__)
-* No more wheels for x32 platforms (if you need x32 binaries, please build them yourself).
- (__[menshikh-iv](https://github.com/menshikh-iv)__, [#6](https://github.com/RaRe-Technologies/gensim-wheels/pull/6))
-* Speed up random number generation in word2vec model (PR [#2864](https://github.com/RaRe-Technologies/gensim/pull/2864), __[@zygm0nt](https://github.com/zygm0nt)__)
-* Fix deprecations in SoftCosineSimilarity (PR [#2940](https://github.com/RaRe-Technologies/gensim/pull/2940), __[@Witiko](https://github.com/Witiko)__)
-* Remove Keras dependency (PR [#2937](https://github.com/RaRe-Technologies/gensim/pull/2937), __[@piskvorky](https://github.com/piskvorky)__)
-* Bump minimum Python version to 3.6 (PR [#2947](https://github.com/RaRe-Technologies/gensim/pull/2947), __[@gojomo](https://github.com/gojomo)__)
+* [#2947](https://github.com/RaRe-Technologies/gensim/pull/2947): Bump minimum Python version to 3.6, by [@gojomo](https://github.com/gojomo)
+* [#2939](https://github.com/RaRe-Technologies/gensim/pull/2939) + [#2984](https://github.com/RaRe-Technologies/gensim/pull/2984): Code style & py3 migration clean up, by [@piskvorky](https://github.com/piskvorky)
+* [#2300](https://github.com/RaRe-Technologies/gensim/pull/2300): Use less RAM in LdaMulticore, by [@horpto](https://github.com/horpto)
+* [#2698](https://github.com/RaRe-Technologies/gensim/pull/2698): Streamline KeyedVectors & X2Vec API, by [@gojomo](https://github.com/gojomo)
+* [#2864](https://github.com/RaRe-Technologies/gensim/pull/2864): Speed up random number generation in word2vec, by [@zygm0nt](https://github.com/zygm0nt)
+* [#2976](https://github.com/RaRe-Technologies/gensim/pull/2976): Speed up phrase (collocation) detection, by [@piskvorky](https://github.com/piskvorky)
+* [#2979](https://github.com/RaRe-Technologies/gensim/pull/2979): Allow skipping common English words in multi-word phrases (example below, after this list), by [@piskvorky](https://github.com/piskvorky)
+* [#2867](https://github.com/RaRe-Technologies/gensim/pull/2867): Expose `max_final_vocab` parameter in fastText constructor, by [@mpenkov](https://github.com/mpenkov)
+* [#2931](https://github.com/RaRe-Technologies/gensim/pull/2931): Clear up job queue parameters in word2vec, by [@lunastera](https://github.com/lunastera)
+* [#2939](https://github.com/RaRe-Technologies/gensim/pull/2939): X2Vec SaveLoad improvements, by [@piskvorky](https://github.com/piskvorky)
+
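+As an example of the new phrase-detection option above, here is a hedged sketch. The `connector_words` parameter and the `ENGLISH_CONNECTOR_WORDS` constant are our reading of that change; check the `gensim.models.phrases` docs for the exact names:
+
+```python
+from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
+
+sentences = [
+    ["the", "bank", "of", "america", "raised", "rates"],
+    ["bank", "of", "america", "is", "a", "bank"],
+]
+
+# Common English words ("of", "the", ...) act as connectors, so multi-word
+# phrases such as "bank_of_america" can still be detected across them.
+phrases = Phrases(sentences, min_count=1, threshold=0.1,
+                  connector_words=ENGLISH_CONNECTOR_WORDS)
+print(phrases[["bank", "of", "america", "raised", "rates"]])
+```
+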
+### :books: Tutorials and docs
+
+* [#2954](https://github.com/RaRe-Technologies/gensim/pull/2954): New theme for the Gensim website, by [@dvorakvaclav](https://github.com/dvorakvaclav)
+* [#2960](https://github.com/RaRe-Technologies/gensim/issues/2960): Added [Gensim and Compatibility](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility) Wiki page, by [@piskvorky](https://github.com/piskvorky)
+* [#2960](https://github.com/RaRe-Technologies/gensim/issues/2960): Reworked & simplified the [Developer Wiki page](https://github.com/RaRe-Technologies/gensim/wiki/Developer-page), by [@piskvorky](https://github.com/piskvorky)
+* [#2968](https://github.com/RaRe-Technologies/gensim/pull/2968): Migrate tutorials & how-tos to 4.0.0, by [@piskvorky](https://github.com/piskvorky)
+* [#2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Clean up of language and formatting of docstrings, by [@piskvorky](https://github.com/piskvorky)
+* [#2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Added documentation for NMSLIB indexer, by [@piskvorky](https://github.com/piskvorky)
+* [#2832](https://github.com/RaRe-Technologies/gensim/pull/2832): Clear up LdaModel documentation, by [@FyzHsn](https://github.com/FyzHsn)
+* [#2871](https://github.com/RaRe-Technologies/gensim/pull/2871): Clarify that license is LGPL-2.1, by [@pombredanne](https://github.com/pombredanne)
+* [#2896](https://github.com/RaRe-Technologies/gensim/pull/2896): Make docs clearer on `alpha` parameter in LDA model, by [@xh2](https://github.com/xh2)
+* [#2897](https://github.com/RaRe-Technologies/gensim/pull/2897): Update Hoffman paper link for Online LDA, by [@xh2](https://github.com/xh2)
+* [#2910](https://github.com/RaRe-Technologies/gensim/pull/2910): Refresh docs for run_annoy tutorial, by [@piskvorky](https://github.com/piskvorky)
+* [#2935](https://github.com/RaRe-Technologies/gensim/pull/2935): Fix "generator" language in word2vec docs, by [@polm](https://github.com/polm)
-### :books: Tutorial and doc improvements
+### :red_circle: Bug fixes
+
+* [#2891](https://github.com/RaRe-Technologies/gensim/pull/2891): Fix fastText word-vectors with ngrams off, by [@gojomo](https://github.com/gojomo)
+* [#2907](https://github.com/RaRe-Technologies/gensim/pull/2907): Fix doc2vec crash for large sets of doc-vectors, by [@gojomo](https://github.com/gojomo)
+* [#2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Fix similarity bug in NMSLIB indexer, by [@piskvorky](https://github.com/piskvorky)
+* [#2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Fix deprecation warnings in Annoy integration, by [@piskvorky](https://github.com/piskvorky)
+* [#2901](https://github.com/RaRe-Technologies/gensim/pull/2901): Fix inheritance of WikiCorpus from TextCorpus, by [@jenishah](https://github.com/jenishah)
+* [#2940](https://github.com/RaRe-Technologies/gensim/pull/2940): Fix deprecations in SoftCosineSimilarity, by [@Witiko](https://github.com/Witiko)
+* [#2944](https://github.com/RaRe-Technologies/gensim/pull/2944): Fix `save_facebook_model` failure after update-vocab & other initialization streamlining, by [@gojomo](https://github.com/gojomo)
+* [#2846](https://github.com/RaRe-Technologies/gensim/pull/2846): Fix for Python 3.9/3.10: remove `xml.etree.cElementTree`, by [@hugovk](https://github.com/hugovk)
+* [#2973](https://github.com/RaRe-Technologies/gensim/issues/2973): Fix `phrases.export_phrases()` not yielding all bigrams
+* [#2942](https://github.com/RaRe-Technologies/gensim/issues/2942): Fix segfault when training doc2vec
- * Clear up LdaModel documentation - remove claim that it accepts CSC matrix as input (PR [#2832](https://github.com/RaRe-Technologies/gensim/pull/2832), [@FyzHsn](https://github.com/FyzHsn))
- * Fix "generator" language in word2vec docs (PR [#2935](https://github.com/RaRe-Technologies/gensim/pull/2935), __[@polm](https://github.com/polm)__)
+### :warning: Removed functionality & deprecations
-### :warning: Removed functionality
+* [#6](https://github.com/RaRe-Technologies/gensim-wheels/pull/6): No more binary wheels for x32 platforms, by [@menshikh-iv](https://github.com/menshikh-iv)
+* [#2899](https://github.com/RaRe-Technologies/gensim/pull/2899): Renamed overly broad `similarities.index` to the more appropriate `similarities.annoy` (import example below, after this list), by [@piskvorky](https://github.com/piskvorky)
+* [#2958](https://github.com/RaRe-Technologies/gensim/pull/2958): Remove gensim.summarization subpackage, docs and test data, by [@mpenkov](https://github.com/mpenkov)
+* [#2926](https://github.com/RaRe-Technologies/gensim/pull/2926): Rename `num_words` to `topn` in dtm_coherence, by [@MeganStodel](https://github.com/MeganStodel)
+* [#2937](https://github.com/RaRe-Technologies/gensim/pull/2937): Remove Keras dependency, by [@piskvorky](https://github.com/piskvorky)
+* Removed all code, methods, attributes and functions marked as deprecated in [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).
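+
+A hedged sketch of the `similarities.index` ➡ `similarities.annoy` rename above; we assume the `AnnoyIndexer` class itself is unchanged by the move (it still needs the optional `annoy` package installed):
+
+```python
+# Gensim 3.x:
+# from gensim.similarities.index import AnnoyIndexer
+
+# Gensim 4.0: only the module path changes.
+from gensim.similarities.annoy import AnnoyIndexer
+
+indexer = AnnoyIndexer()  # or AnnoyIndexer(some_trained_model, num_trees=100)
+```
+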
- * Remove gensim.summarization subpackage, docs and test data (PR [#2958](https://github.com/RaRe-Technologies/gensim/pull/2958), __[@mpenkov](https://github.com/mpenkov)__)
+---
-## :warning: 3.8.x will be the last gensim version to support Py2.7. Starting with 4.0.0, gensim will only support Py3.5 and above
## 3.8.3, 2020-05-03
+**:warning: 3.8.x will be the last Gensim version to support Py2.7. Starting with 4.0.0, Gensim will only support Py3.5 and above.**
+
This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3.8.
### :red_circle: Bug fixes
-* Bring back Py27 support (PR [#2812](https://github.com/RaRe-Technologies/gensim/pull/2812), __[@mpenkov](https://github.com/mpenkov)__)
+* Bring back Py27 support (PR [#2812](https://github.com/RaRe-Technologies/gensim/pull/2812), [@mpenkov](https://github.com/mpenkov))
* Fix wrong version reported by setup.py (Issue [#2796](https://github.com/RaRe-Technologies/gensim/issues/2796))
* Fix missing C extensions (Issues [#2794](https://github.com/RaRe-Technologies/gensim/issues/2794) and [#2802](https://github.com/RaRe-Technologies/gensim/issues/2802))
### :+1: Improvements
-* Wheels for Python 3.8 (__[@menshikh-iv](https://github.com/menshikh-iv)__)
-* Prepare for removal of deprecated `lxml.etree.cElementTree` (PR [#2777](https://github.com/RaRe-Technologies/gensim/pull/2777), __[@tirkarthi](https://github.com/tirkarthi)__)
+* Wheels for Python 3.8 ([@menshikh-iv](https://github.com/menshikh-iv))
+* Prepare for removal of deprecated `lxml.etree.cElementTree` (PR [#2777](https://github.com/RaRe-Technologies/gensim/pull/2777), [@tirkarthi](https://github.com/tirkarthi))
### :books: Tutorial and doc improvements
-* Update test instructions in README (PR [#2814](https://github.com/RaRe-Technologies/gensim/pull/2814), __[@piskvorky](https://github.com/piskvorky)__)
+* Update test instructions in README (PR [#2814](https://github.com/RaRe-Technologies/gensim/pull/2814), [@piskvorky](https://github.com/piskvorky))
### :warning: Deprecations (will be removed in the next major release)
@@ -68,6 +151,8 @@ This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`
+---
+
## 3.8.2, 2020-04-10
### :red_circle: Bug fixes
@@ -96,23 +181,25 @@ This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`
+---
+
## 3.8.1, 2019-09-23
### :red_circle: Bug fixes
-* Fix usage of base_dir instead of BASE_DIR in _load_info in downloader. (__[movb](https://github.com/movb)__, [#2605](https://github.com/RaRe-Technologies/gensim/pull/2605))
-* Update the version of smart_open in the setup.py file (__[AMR-KELEG](https://github.com/AMR-KELEG)__, [#2582](https://github.com/RaRe-Technologies/gensim/pull/2582))
-* Properly handle unicode_errors arg parameter when loading a vocab file (__[wmtzk](https://github.com/wmtzk)__, [#2570](https://github.com/RaRe-Technologies/gensim/pull/2570))
-* Catch loading older TfidfModels without smartirs (__[bnomis](https://github.com/bnomis)__, [#2559](https://github.com/RaRe-Technologies/gensim/pull/2559))
-* Fix bug where a module import set up logging, pin doctools for Py2 (__[piskvorky](https://github.com/piskvorky)__, [#2552](https://github.com/RaRe-Technologies/gensim/pull/2552))
+* Fix usage of base_dir instead of BASE_DIR in _load_info in downloader. ([movb](https://github.com/movb), [#2605](https://github.com/RaRe-Technologies/gensim/pull/2605))
+* Update the version of smart_open in the setup.py file ([AMR-KELEG](https://github.com/AMR-KELEG), [#2582](https://github.com/RaRe-Technologies/gensim/pull/2582))
+* Properly handle unicode_errors arg parameter when loading a vocab file ([wmtzk](https://github.com/wmtzk), [#2570](https://github.com/RaRe-Technologies/gensim/pull/2570))
+* Catch loading older TfidfModels without smartirs ([bnomis](https://github.com/bnomis), [#2559](https://github.com/RaRe-Technologies/gensim/pull/2559))
+* Fix bug where a module import set up logging, pin doctools for Py2 ([piskvorky](https://github.com/piskvorky), [#2552](https://github.com/RaRe-Technologies/gensim/pull/2552))
### :books: Tutorial and doc improvements
-* Fix usage example in phrases.py (__[piskvorky](https://github.com/piskvorky)__, [#2575](https://github.com/RaRe-Technologies/gensim/pull/2575))
+* Fix usage example in phrases.py ([piskvorky](https://github.com/piskvorky), [#2575](https://github.com/RaRe-Technologies/gensim/pull/2575))
### :+1: Improvements
-* Optimize Poincare model training (__[koiizukag](https://github.com/koiizukag)__, [#2589](https://github.com/RaRe-Technologies/gensim/pull/2589))
+* Optimize Poincare model training ([koiizukag](https://github.com/koiizukag), [#2589](https://github.com/RaRe-Technologies/gensim/pull/2589))
### :warning: Deprecations (will be removed in the next major release)
@@ -136,34 +223,36 @@ This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`
+---
+
## 3.8.0, 2019-07-08
### :star2: New Features
-* Enable online training of Poincare models (__[koiizukag](https://github.com/koiizukag)__, [#2505](https://github.com/RaRe-Technologies/gensim/pull/2505))
-* Make BM25 more scalable by adding support for generator inputs (__[saraswatmks](https://github.com/saraswatmks)__, [#2479](https://github.com/RaRe-Technologies/gensim/pull/2479))
-* Allow the Gensim dataset / pre-trained model downloader `gensim.downloader` to run offline, by introducing a local file cache (__[mpenkov](https://github.com/mpenkov)__, [#2545](https://github.com/RaRe-Technologies/gensim/pull/2545))
-* Make the `gensim.downloader` target directory configurable (__[mpenkov](https://github.com/mpenkov)__, [#2456](https://github.com/RaRe-Technologies/gensim/pull/2456))
-* Add `nmslib` indexer (__[masa3141](https://github.com/masa3141)__, [#2417](https://github.com/RaRe-Technologies/gensim/pull/2417))
+* Enable online training of Poincare models ([koiizukag](https://github.com/koiizukag), [#2505](https://github.com/RaRe-Technologies/gensim/pull/2505))
+* Make BM25 more scalable by adding support for generator inputs ([saraswatmks](https://github.com/saraswatmks), [#2479](https://github.com/RaRe-Technologies/gensim/pull/2479))
+* Allow the Gensim dataset / pre-trained model downloader `gensim.downloader` to run offline, by introducing a local file cache ([mpenkov](https://github.com/mpenkov), [#2545](https://github.com/RaRe-Technologies/gensim/pull/2545))
+* Make the `gensim.downloader` target directory configurable ([mpenkov](https://github.com/mpenkov), [#2456](https://github.com/RaRe-Technologies/gensim/pull/2456))
+* Add `nmslib` indexer ([masa3141](https://github.com/masa3141), [#2417](https://github.com/RaRe-Technologies/gensim/pull/2417))
### :red_circle: Bug fixes
-* Fix `smart_open` deprecation warning globally (__[itayB](https://github.com/itayB)__, [#2530](https://github.com/RaRe-Technologies/gensim/pull/2530))
-* Fix AppVeyor issues with Windows and Py2 (__[mpenkov](https://github.com/mpenkov)__, [#2546](https://github.com/RaRe-Technologies/gensim/pull/2546))
-* Fix `topn=0` versus `topn=None` bug in `most_similar`, accept `topn` of any integer type (__[Witiko](https://github.com/Witiko)__, [#2497](https://github.com/RaRe-Technologies/gensim/pull/2497))
-* Fix Python version check (__[charsyam](https://github.com/charsyam)__, [#2547](https://github.com/RaRe-Technologies/gensim/pull/2547))
-* Fix typo in FastText documentation (__[Guitaricet](https://github.com/Guitaricet)__, [#2518](https://github.com/RaRe-Technologies/gensim/pull/2518))
-* Fix "Market Matrix" to "Matrix Market" typo. (__[Shooter23](https://github.com/Shooter23)__, [#2513](https://github.com/RaRe-Technologies/gensim/pull/2513))
-* Fix auto-generated hyperlinks in `CHANGELOG.md` (__[mpenkov](https://github.com/mpenkov)__, [#2482](https://github.com/RaRe-Technologies/gensim/pull/2482))
+* Fix `smart_open` deprecation warning globally ([itayB](https://github.com/itayB), [#2530](https://github.com/RaRe-Technologies/gensim/pull/2530))
+* Fix AppVeyor issues with Windows and Py2 ([mpenkov](https://github.com/mpenkov), [#2546](https://github.com/RaRe-Technologies/gensim/pull/2546))
+* Fix `topn=0` versus `topn=None` bug in `most_similar`, accept `topn` of any integer type ([Witiko](https://github.com/Witiko), [#2497](https://github.com/RaRe-Technologies/gensim/pull/2497))
+* Fix Python version check ([charsyam](https://github.com/charsyam), [#2547](https://github.com/RaRe-Technologies/gensim/pull/2547))
+* Fix typo in FastText documentation ([Guitaricet](https://github.com/Guitaricet), [#2518](https://github.com/RaRe-Technologies/gensim/pull/2518))
+* Fix "Market Matrix" to "Matrix Market" typo. ([Shooter23](https://github.com/Shooter23), [#2513](https://github.com/RaRe-Technologies/gensim/pull/2513))
+* Fix auto-generated hyperlinks in `CHANGELOG.md` ([mpenkov](https://github.com/mpenkov), [#2482](https://github.com/RaRe-Technologies/gensim/pull/2482))
### :books: Tutorial and doc improvements
-* Generate documentation for the `gensim.similarities.termsim` module (__[Witiko](https://github.com/Witiko)__, [#2485](https://github.com/RaRe-Technologies/gensim/pull/2485))
-* Simplify the `Support` section in README (__[piskvorky](https://github.com/piskvorky)__, [#2542](https://github.com/RaRe-Technologies/gensim/pull/2542))
+* Generate documentation for the `gensim.similarities.termsim` module ([Witiko](https://github.com/Witiko), [#2485](https://github.com/RaRe-Technologies/gensim/pull/2485))
+* Simplify the `Support` section in README ([piskvorky](https://github.com/piskvorky), [#2542](https://github.com/RaRe-Technologies/gensim/pull/2542))
### :+1: Improvements
-* Pin sklearn version for Py2, because sklearn dropped py2 support (__[mpenkov](https://github.com/mpenkov)__, [#2510](https://github.com/RaRe-Technologies/gensim/pull/2510))
+* Pin sklearn version for Py2, because sklearn dropped py2 support ([mpenkov](https://github.com/mpenkov), [#2510](https://github.com/RaRe-Technologies/gensim/pull/2510))
### :warning: Deprecations (will be removed in the next major release)
@@ -192,24 +281,24 @@ This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3
### :red_circle: Bug fixes
-* Fix fasttext model loading from gzip files (__[mpenkov](https://github.com/mpenkov)__, [#2476](https://github.com/RaRe-Technologies/gensim/pull/2476))
-* Fix misleading `Doc2Vec.docvecs` comment (__[gojomo](https://github.com/gojomo)__, [#2472](https://github.com/RaRe-Technologies/gensim/pull/2472))
-* NMF bugfix (__[mpenkov](https://github.com/mpenkov)__, [#2466](https://github.com/RaRe-Technologies/gensim/pull/2466))
-* Fix `WordEmbeddingsKeyedVectors.most_similar` (__[Witiko](https://github.com/Witiko)__, [#2461](https://github.com/RaRe-Technologies/gensim/pull/2461))
-* Fix LdaSequence model by updating to num_documents (__[Bharat123rox](https://github.com/Bharat123rox)__, [#2410](https://github.com/RaRe-Technologies/gensim/pull/2410))
-* Make termsim matrix positive definite even with negative similarities (__[Witiko](https://github.com/Witiko)__, [#2397](https://github.com/RaRe-Technologies/gensim/pull/2397))
-* Fix the off-by-one bug in the TFIDF model. (__[AMR-KELEG](https://github.com/AMR-KELEG)__, [#2392](https://github.com/RaRe-Technologies/gensim/pull/2392))
-* Update legacy model loading (__[mpenkov](https://github.com/mpenkov)__, [#2454](https://github.com/RaRe-Technologies/gensim/pull/2454), [#2457](https://github.com/RaRe-Technologies/gensim/pull/2457))
-* Make `matutils.unitvec` always return float norm when requested (__[Witiko](https://github.com/Witiko)__, [#2419](https://github.com/RaRe-Technologies/gensim/pull/2419))
+* Fix fasttext model loading from gzip files ([mpenkov](https://github.com/mpenkov), [#2476](https://github.com/RaRe-Technologies/gensim/pull/2476))
+* Fix misleading `Doc2Vec.docvecs` comment ([gojomo](https://github.com/gojomo), [#2472](https://github.com/RaRe-Technologies/gensim/pull/2472))
+* NMF bugfix ([mpenkov](https://github.com/mpenkov), [#2466](https://github.com/RaRe-Technologies/gensim/pull/2466))
+* Fix `WordEmbeddingsKeyedVectors.most_similar` ([Witiko](https://github.com/Witiko), [#2461](https://github.com/RaRe-Technologies/gensim/pull/2461))
+* Fix LdaSequence model by updating to num_documents ([Bharat123rox](https://github.com/Bharat123rox), [#2410](https://github.com/RaRe-Technologies/gensim/pull/2410))
+* Make termsim matrix positive definite even with negative similarities ([Witiko](https://github.com/Witiko), [#2397](https://github.com/RaRe-Technologies/gensim/pull/2397))
+* Fix the off-by-one bug in the TFIDF model. ([AMR-KELEG](https://github.com/AMR-KELEG), [#2392](https://github.com/RaRe-Technologies/gensim/pull/2392))
+* Update legacy model loading ([mpenkov](https://github.com/mpenkov), [#2454](https://github.com/RaRe-Technologies/gensim/pull/2454), [#2457](https://github.com/RaRe-Technologies/gensim/pull/2457))
+* Make `matutils.unitvec` always return float norm when requested ([Witiko](https://github.com/Witiko), [#2419](https://github.com/RaRe-Technologies/gensim/pull/2419))
### :books: Tutorial and doc improvements
-* Update word2vec.ipynb (__[asyabo](https://github.com/asyabo)__, [#2423](https://github.com/RaRe-Technologies/gensim/pull/2423))
+* Update word2vec.ipynb ([asyabo](https://github.com/asyabo), [#2423](https://github.com/RaRe-Technologies/gensim/pull/2423))
### :+1: Improvements
-* Adding type check for corpus_file argument (__[saraswatmks](https://github.com/saraswatmks)__, [#2469](https://github.com/RaRe-Technologies/gensim/pull/2469))
-* Clean up FastText Cython code, fix division by zero (__[mpenkov](https://github.com/mpenkov)__, [#2382](https://github.com/RaRe-Technologies/gensim/pull/2382))
+* Adding type check for corpus_file argument ([saraswatmks](https://github.com/saraswatmks), [#2469](https://github.com/RaRe-Technologies/gensim/pull/2469))
+* Clean up FastText Cython code, fix division by zero ([mpenkov](https://github.com/mpenkov), [#2382](https://github.com/RaRe-Technologies/gensim/pull/2382))
### :warning: Deprecations (will be removed in the next major release)
@@ -242,43 +331,43 @@ This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3
### :red_circle: Bug fixes
-* Fix unicode error when loading FastText vocabulary (__[@mpenkov](https://github.com/mpenkov)__, [#2390](https://github.com/RaRe-Technologies/gensim/pull/2390))
-* Avoid division by zero in fasttext_inner.pyx (__[@mpenkov](https://github.com/mpenkov)__, [#2404](https://github.com/RaRe-Technologies/gensim/pull/2404))
-* Avoid incorrect filename inference when loading model (__[@mpenkov](https://github.com/mpenkov)__, [#2408](https://github.com/RaRe-Technologies/gensim/pull/2408))
-* Handle invalid unicode when loading native FastText models (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
-* Avoid divide by zero when calculating vectors for terms with no ngrams (__[@mpenkov](https://github.com/mpenkov)__, [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
+* Fix unicode error when loading FastText vocabulary ([@mpenkov](https://github.com/mpenkov), [#2390](https://github.com/RaRe-Technologies/gensim/pull/2390))
+* Avoid division by zero in fasttext_inner.pyx ([@mpenkov](https://github.com/mpenkov), [#2404](https://github.com/RaRe-Technologies/gensim/pull/2404))
+* Avoid incorrect filename inference when loading model ([@mpenkov](https://github.com/mpenkov), [#2408](https://github.com/RaRe-Technologies/gensim/pull/2408))
+* Handle invalid unicode when loading native FastText models ([@mpenkov](https://github.com/mpenkov), [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
+* Avoid divide by zero when calculating vectors for terms with no ngrams ([@mpenkov](https://github.com/mpenkov), [#2411](https://github.com/RaRe-Technologies/gensim/pull/2411))
### :books: Tutorial and doc improvements
-* Add link to bindr (__[rogueleaderr](https://github.com/rogueleaderr)__, [#2387](https://github.com/RaRe-Technologies/gensim/pull/2387))
+* Add link to bindr ([rogueleaderr](https://github.com/rogueleaderr), [#2387](https://github.com/RaRe-Technologies/gensim/pull/2387))
### :+1: Improvements
-* Undo the hash2index optimization (__[mpenkov](https://github.com/mpenkov)__, [#2370](https://github.com/RaRe-Technologies/gensim/pull/2370))
+* Undo the hash2index optimization ([mpenkov](https://github.com/mpenkov), [#2370](https://github.com/RaRe-Technologies/gensim/pull/2370))
### :warning: Changes in FastText behavior
#### Out-of-vocab word handling
To achieve consistency with the reference implementation from Facebook,
-a `FastText` model will now always report any word, out-of-vocabulary or
-not, as being in the model, and always return some vector for any word
+a `FastText` model will now always report any word, out-of-vocabulary or
+not, as being in the model, and always return some vector for any word
looked-up. Specifically:
-1. `'any_word' in ft_model` will always return `True`. Previously, it
-returned `True` only if the full word was in the vocabulary. (To test if a
-full word is in the known vocabulary, you can consult the `wv.vocab`
-property: `'any_word' in ft_model.wv.vocab` will return `False` if the full
+1. `'any_word' in ft_model` will always return `True`. Previously, it
+returned `True` only if the full word was in the vocabulary. (To test if a
+full word is in the known vocabulary, you can consult the `wv.vocab`
+property: `'any_word' in ft_model.wv.vocab` will return `False` if the full
word wasn't learned during model training.)
-2. `ft_model['any_word']` will always return a vector. Previously, it
-raised `KeyError` for OOV words when the model had no vectors
+2. `ft_model['any_word']` will always return a vector. Previously, it
+raised `KeyError` for OOV words when the model had no vectors
for **any** ngrams of the word.
3. If no ngrams from the term are present in the model,
or when no ngrams could be extracted from the term, a vector pointing
to the origin will be returned. Previously, a vector of NaN (not a number)
was returned as a consequence of a divide-by-zero problem.
4. Models may use more memory, or take longer for word-vector
-lookup, especially after training on smaller corpuses where the previous
+lookup, especially after training on smaller corpuses where the previous
non-compliant behavior discarded some ngrams from consideration.
#### Loading models in Facebook .bin format
@@ -291,7 +380,7 @@ Since this function is deprecated, consider using one of its alternatives (see b
Furthermore, you must now pass the full path to the file to load, **including the file extension.**
Previously, if you specified a model path that ends with anything other than .bin, the code automatically appended .bin to the path before loading the model.
This behavior was [confusing](https://github.com/RaRe-Technologies/gensim/issues/2407), so we removed it.
-
+
### :warning: Deprecations (will be removed in the next major release)
Remove:
@@ -302,28 +391,28 @@ Remove:
### :+1: Improvements
-* NMF optimization & documentation (__[@anotherbugmaster](https://github.com/anotherbugmaster)__, [#2361](https://github.com/RaRe-Technologies/gensim/pull/2361))
-* Optimize `FastText.load_fasttext_model` (__[@mpenkov](https://github.com/mpenkov)__, [#2340](https://github.com/RaRe-Technologies/gensim/pull/2340))
-* Add warning when string is used as argument to `Doc2Vec.infer_vector` (__[@tobycheese](https://github.com/tobycheese)__, [#2347](https://github.com/RaRe-Technologies/gensim/pull/2347))
-* Fix light linting issues in `LdaSeqModel` (__[@horpto](https://github.com/horpto)__, [#2360](https://github.com/RaRe-Technologies/gensim/pull/2360))
-* Move out `process_result_queue` from cycle in `LdaMulticore` (__[@horpto](https://github.com/horpto)__, [#2358](https://github.com/RaRe-Technologies/gensim/pull/2358))
+* NMF optimization & documentation ([@anotherbugmaster](https://github.com/anotherbugmaster), [#2361](https://github.com/RaRe-Technologies/gensim/pull/2361))
+* Optimize `FastText.load_fasttext_model` ([@mpenkov](https://github.com/mpenkov), [#2340](https://github.com/RaRe-Technologies/gensim/pull/2340))
+* Add warning when string is used as argument to `Doc2Vec.infer_vector` ([@tobycheese](https://github.com/tobycheese), [#2347](https://github.com/RaRe-Technologies/gensim/pull/2347))
+* Fix light linting issues in `LdaSeqModel` ([@horpto](https://github.com/horpto), [#2360](https://github.com/RaRe-Technologies/gensim/pull/2360))
+* Move out `process_result_queue` from cycle in `LdaMulticore` ([@horpto](https://github.com/horpto), [#2358](https://github.com/RaRe-Technologies/gensim/pull/2358))
### :red_circle: Bug fixes
-* Fix infinite diff in `LdaModel.do_mstep` (__[@horpto](https://github.com/horpto)__, [#2344](https://github.com/RaRe-Technologies/gensim/pull/2344))
-* Fix backward compatibility issue: loading `FastTextKeyedVectors` using `KeyedVectors` (missing attribute `compatible_hash`) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2349](https://github.com/RaRe-Technologies/gensim/pull/2349))
-* Fix logging issue (conda-forge related) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2339](https://github.com/RaRe-Technologies/gensim/pull/2339))
-* Fix `WordEmbeddingsKeyedVectors.most_similar` (__[@Witiko](https://github.com/Witiko)__, [#2356](https://github.com/RaRe-Technologies/gensim/pull/2356))
-* Fix issues of `flake8==3.7.1` (__[@horpto](https://github.com/horpto)__, [#2365](https://github.com/RaRe-Technologies/gensim/pull/2365))
+* Fix infinite diff in `LdaModel.do_mstep` ([@horpto](https://github.com/horpto), [#2344](https://github.com/RaRe-Technologies/gensim/pull/2344))
+* Fix backward compatibility issue: loading `FastTextKeyedVectors` using `KeyedVectors` (missing attribute `compatible_hash`) ([@menshikh-iv](https://github.com/menshikh-iv), [#2349](https://github.com/RaRe-Technologies/gensim/pull/2349))
+* Fix logging issue (conda-forge related) ([@menshikh-iv](https://github.com/menshikh-iv), [#2339](https://github.com/RaRe-Technologies/gensim/pull/2339))
+* Fix `WordEmbeddingsKeyedVectors.most_similar` ([@Witiko](https://github.com/Witiko), [#2356](https://github.com/RaRe-Technologies/gensim/pull/2356))
+* Fix issues of `flake8==3.7.1` ([@horpto](https://github.com/horpto), [#2365](https://github.com/RaRe-Technologies/gensim/pull/2365))
### :books: Tutorial and doc improvements
-* Improve `FastText` documentation (__[@mpenkov](https://github.com/mpenkov)__, [#2353](https://github.com/RaRe-Technologies/gensim/pull/2353))
-* Minor corrections and improvements in `Any*Vec` docstrings (__[@tobycheese](https://github.com/tobycheese)__, [#2345](https://github.com/RaRe-Technologies/gensim/pull/2345))
-* Fix the example code for SparseTermSimilarityMatrix (__[@Witiko](https://github.com/Witiko)__, [#2359](https://github.com/RaRe-Technologies/gensim/pull/2359))
-* Update `poincare` documentation to indicate the relation format (__[@AMR-KELEG](https://github.com/AMR-KELEG)__, [#2357](https://github.com/RaRe-Technologies/gensim/pull/2357))
+* Improve `FastText` documentation ([@mpenkov](https://github.com/mpenkov), [#2353](https://github.com/RaRe-Technologies/gensim/pull/2353))
+* Minor corrections and improvements in `Any*Vec` docstrings ([@tobycheese](https://github.com/tobycheese), [#2345](https://github.com/RaRe-Technologies/gensim/pull/2345))
+* Fix the example code for SparseTermSimilarityMatrix ([@Witiko](https://github.com/Witiko), [#2359](https://github.com/RaRe-Technologies/gensim/pull/2359))
+* Update `poincare` documentation to indicate the relation format ([@AMR-KELEG](https://github.com/AMR-KELEG), [#2357](https://github.com/RaRe-Technologies/gensim/pull/2357))
### :warning: Deprecations (will be removed in the next major release)
@@ -352,7 +441,7 @@ Remove:
### :star2: New features
-* Fast Online NMF (__[@anotherbugmaster](https://github.com/anotherbugmaster)__, [#2007](https://github.com/RaRe-Technologies/gensim/pull/2007))
+* Fast Online NMF ([@anotherbugmaster](https://github.com/anotherbugmaster), [#2007](https://github.com/RaRe-Technologies/gensim/pull/2007))
- Benchmark `wiki-english-20171001`
| Model | Perplexity | Coherence | L2 norm | Train time (minutes) |
@@ -398,7 +487,7 @@ Remove:
- [NMF tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/nmf_tutorial.ipynb)
- [Full NMF Benchmark](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/nmf_wikipedia.ipynb)
-* Massive improvement`FastText` compatibilities (__[@mpenkov](https://github.com/mpenkov)__, [#2313](https://github.com/RaRe-Technologies/gensim/pull/2313))
+* Massively improved `FastText` compatibility ([@mpenkov](https://github.com/mpenkov), [#2313](https://github.com/RaRe-Technologies/gensim/pull/2313))
```python
from gensim.models import FastText
@@ -444,7 +533,7 @@ Remove:
model.train(corpus, total_examples=len(corpus), epochs=5)
```
-* Similarity search improvements (__[@Witiko](https://github.com/Witiko)__, [#2016](https://github.com/RaRe-Technologies/gensim/pull/2016))
+* Similarity search improvements ([@Witiko](https://github.com/Witiko), [#2016](https://github.com/RaRe-Technologies/gensim/pull/2016))
- Add similarity search using the Levenshtein distance in `gensim.similarities.LevenshteinSimilarityIndex`
- Performance optimizations to `gensim.similarities.SoftCosineSimilarity` ([full benchmark](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_benchmark.ipynb))
@@ -459,83 +548,83 @@ Remove:
- See [updated soft-cosine tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb) for more information and usage examples
-* Add `python3.7` support (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2211](https://github.com/RaRe-Technologies/gensim/pull/2211))
- - Wheels for Window, OSX and Linux platforms (__[@menshikh-iv](https://github.com/menshikh-iv)__, [MacPython/gensim-wheels/#12](https://github.com/MacPython/gensim-wheels/pull/12))
+* Add `python3.7` support ([@menshikh-iv](https://github.com/menshikh-iv), [#2211](https://github.com/RaRe-Technologies/gensim/pull/2211))
+  - Wheels for Windows, OSX and Linux platforms ([@menshikh-iv](https://github.com/menshikh-iv), [MacPython/gensim-wheels/#12](https://github.com/MacPython/gensim-wheels/pull/12))
- Faster installation
### :+1: Improvements
##### Optimizations
-* Reduce `Phraser` memory usage (drop frequencies) (__[@jenishah](https://github.com/jenishah)__, [#2208](https://github.com/RaRe-Technologies/gensim/pull/2208))
-* Reduce memory consumption of summarizer (__[@horpto](https://github.com/horpto)__, [#2298](https://github.com/RaRe-Technologies/gensim/pull/2298))
-* Replace inline slow equivalent of mean_absolute_difference with fast (__[@horpto](https://github.com/horpto)__, [#2284](https://github.com/RaRe-Technologies/gensim/pull/2284))
-* Reuse precalculated updated prior in `ldamodel.update_dir_prior` (__[@horpto](https://github.com/horpto)__, [#2274](https://github.com/RaRe-Technologies/gensim/pull/2274))
-* Improve `KeyedVector.wmdistance` (__[@horpto](https://github.com/horpto)__, [#2326](https://github.com/RaRe-Technologies/gensim/pull/2326))
-* Optimize `remove_unreachable_nodes` in `gensim.summarization` (__[@horpto](https://github.com/horpto)__, [#2263](https://github.com/RaRe-Technologies/gensim/pull/2263))
-* Optimize `mz_entropy` from `gensim.summarization` (__[@horpto](https://github.com/horpto)__, [#2267](https://github.com/RaRe-Technologies/gensim/pull/2267))
-* Improve `filter_extremes` methods in `Dictionary` and `HashDictionary` (__[@horpto](https://github.com/horpto)__, [#2303](https://github.com/RaRe-Technologies/gensim/pull/2303))
+* Reduce `Phraser` memory usage (drop frequencies) ([@jenishah](https://github.com/jenishah), [#2208](https://github.com/RaRe-Technologies/gensim/pull/2208))
+* Reduce memory consumption of summarizer ([@horpto](https://github.com/horpto), [#2298](https://github.com/RaRe-Technologies/gensim/pull/2298))
+* Replace inline slow equivalent of mean_absolute_difference with fast ([@horpto](https://github.com/horpto), [#2284](https://github.com/RaRe-Technologies/gensim/pull/2284))
+* Reuse precalculated updated prior in `ldamodel.update_dir_prior` ([@horpto](https://github.com/horpto), [#2274](https://github.com/RaRe-Technologies/gensim/pull/2274))
+* Improve `KeyedVector.wmdistance` ([@horpto](https://github.com/horpto), [#2326](https://github.com/RaRe-Technologies/gensim/pull/2326))
+* Optimize `remove_unreachable_nodes` in `gensim.summarization` ([@horpto](https://github.com/horpto), [#2263](https://github.com/RaRe-Technologies/gensim/pull/2263))
+* Optimize `mz_entropy` from `gensim.summarization` ([@horpto](https://github.com/horpto), [#2267](https://github.com/RaRe-Technologies/gensim/pull/2267))
+* Improve `filter_extremes` methods in `Dictionary` and `HashDictionary` ([@horpto](https://github.com/horpto), [#2303](https://github.com/RaRe-Technologies/gensim/pull/2303))
##### Additions
-* Add `KeyedVectors.relative_cosine_similarity` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2307](https://github.com/RaRe-Technologies/gensim/pull/2307))
-* Add `random_seed` to `LdaMallet` (__[@Zohaggie](https://github.com/Zohaggie)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2153](https://github.com/RaRe-Technologies/gensim/pull/2153))
-* Add `common_terms` parameter to `sklearn_api.PhrasesTransformer` (__[@pmlk](https://github.com/pmlk)__, [#2074](https://github.com/RaRe-Technologies/gensim/pull/2074))
-* Add method for patch `corpora.Dictionary` based on special tokens (__[@Froskekongen](https://github.com/Froskekongen)__, [#2200](https://github.com/RaRe-Technologies/gensim/pull/2200))
+* Add `KeyedVectors.relative_cosine_similarity` ([@rsdel2007](https://github.com/rsdel2007), [#2307](https://github.com/RaRe-Technologies/gensim/pull/2307))
+* Add `random_seed` to `LdaMallet` ([@Zohaggie](https://github.com/Zohaggie) & [@menshikh-iv](https://github.com/menshikh-iv), [#2153](https://github.com/RaRe-Technologies/gensim/pull/2153))
+* Add `common_terms` parameter to `sklearn_api.PhrasesTransformer` ([@pmlk](https://github.com/pmlk), [#2074](https://github.com/RaRe-Technologies/gensim/pull/2074))
+* Add method to patch `corpora.Dictionary` based on special tokens ([@Froskekongen](https://github.com/Froskekongen), [#2200](https://github.com/RaRe-Technologies/gensim/pull/2200))
##### Cleanup
-* Improve `six` usage (`xrange`, `map`, `zip`) (__[@horpto](https://github.com/horpto)__, [#2264](https://github.com/RaRe-Technologies/gensim/pull/2264))
-* Refactor `line2doc` methods of `LowCorpus` and `MalletCorpus` (__[@horpto](https://github.com/horpto)__, [#2269](https://github.com/RaRe-Technologies/gensim/pull/2269))
-* Get rid most of warnings in testing (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2191](https://github.com/RaRe-Technologies/gensim/pull/2191))
-* Fix non-deterministic test failures (pin `PYTHONHASHSEED`) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2196](https://github.com/RaRe-Technologies/gensim/pull/2196))
-* Fix "aliasing chunkize to chunkize_serial" warning on Windows (__[@aquatiko](https://github.com/aquatiko)__, [#2202](https://github.com/RaRe-Technologies/gensim/pull/2202))
-* Remove `__getitem__` code duplication in `gensim.models.phrases` (__[@jenishah](https://github.com/jenishah)__, [#2206](https://github.com/RaRe-Technologies/gensim/pull/2206))
-* Add `flake8-rst` for docstring code examples (__[@kataev](https://github.com/kataev)__, [#2192](https://github.com/RaRe-Technologies/gensim/pull/2192))
-* Get rid `py26` stuff (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2214](https://github.com/RaRe-Technologies/gensim/pull/2214))
-* Use `itertools.chain` instead of `sum` to concatenate lists (__[@Stigjb](https://github.com/Stigjb)__, [#2212](https://github.com/RaRe-Technologies/gensim/pull/2212))
-* Fix flake8 warnings W605, W504 (__[@horpto](https://github.com/horpto)__, [#2256](https://github.com/RaRe-Technologies/gensim/pull/2256))
-* Remove unnecessary creations of lists at all (__[@horpto](https://github.com/horpto)__, [#2261](https://github.com/RaRe-Technologies/gensim/pull/2261))
-* Fix extra list creation in `utils.get_max_id` (__[@horpto](https://github.com/horpto)__, [#2254](https://github.com/RaRe-Technologies/gensim/pull/2254))
-* Fix deprecation warning `np.sum(generator)` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2296](https://github.com/RaRe-Technologies/gensim/pull/2296))
-* Refactor `BM25` (__[@horpto](https://github.com/horpto)__, [#2275](https://github.com/RaRe-Technologies/gensim/pull/2275))
-* Fix pyemd import (__[@ramprakash-94](https://github.com/ramprakash-94)__, [#2240](https://github.com/RaRe-Technologies/gensim/pull/2240))
-* Set `metadata=True` for `make_wikicorpus` script by default (__[@Xinyi2016](https://github.com/Xinyi2016)__, [#2245](https://github.com/RaRe-Technologies/gensim/pull/2245))
-* Remove unimportant warning from `Phrases` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2331](https://github.com/RaRe-Technologies/gensim/pull/2331))
-* Replace `open()` by `smart_open()` in `gensim.models.fasttext._load_fasttext_format` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2335](https://github.com/RaRe-Technologies/gensim/pull/2335))
+* Improve `six` usage (`xrange`, `map`, `zip`) ([@horpto](https://github.com/horpto), [#2264](https://github.com/RaRe-Technologies/gensim/pull/2264))
+* Refactor `line2doc` methods of `LowCorpus` and `MalletCorpus` ([@horpto](https://github.com/horpto), [#2269](https://github.com/RaRe-Technologies/gensim/pull/2269))
+* Get rid of most warnings in testing ([@menshikh-iv](https://github.com/menshikh-iv), [#2191](https://github.com/RaRe-Technologies/gensim/pull/2191))
+* Fix non-deterministic test failures (pin `PYTHONHASHSEED`) ([@menshikh-iv](https://github.com/menshikh-iv), [#2196](https://github.com/RaRe-Technologies/gensim/pull/2196))
+* Fix "aliasing chunkize to chunkize_serial" warning on Windows ([@aquatiko](https://github.com/aquatiko), [#2202](https://github.com/RaRe-Technologies/gensim/pull/2202))
+* Remove `__getitem__` code duplication in `gensim.models.phrases` ([@jenishah](https://github.com/jenishah), [#2206](https://github.com/RaRe-Technologies/gensim/pull/2206))
+* Add `flake8-rst` for docstring code examples ([@kataev](https://github.com/kataev), [#2192](https://github.com/RaRe-Technologies/gensim/pull/2192))
+* Get rid of `py26` stuff ([@menshikh-iv](https://github.com/menshikh-iv), [#2214](https://github.com/RaRe-Technologies/gensim/pull/2214))
+* Use `itertools.chain` instead of `sum` to concatenate lists ([@Stigjb](https://github.com/Stigjb), [#2212](https://github.com/RaRe-Technologies/gensim/pull/2212))
+* Fix flake8 warnings W605, W504 ([@horpto](https://github.com/horpto), [#2256](https://github.com/RaRe-Technologies/gensim/pull/2256))
+* Remove unnecessary list creations ([@horpto](https://github.com/horpto), [#2261](https://github.com/RaRe-Technologies/gensim/pull/2261))
+* Fix extra list creation in `utils.get_max_id` ([@horpto](https://github.com/horpto), [#2254](https://github.com/RaRe-Technologies/gensim/pull/2254))
+* Fix deprecation warning `np.sum(generator)` ([@rsdel2007](https://github.com/rsdel2007), [#2296](https://github.com/RaRe-Technologies/gensim/pull/2296))
+* Refactor `BM25` ([@horpto](https://github.com/horpto), [#2275](https://github.com/RaRe-Technologies/gensim/pull/2275))
+* Fix pyemd import ([@ramprakash-94](https://github.com/ramprakash-94), [#2240](https://github.com/RaRe-Technologies/gensim/pull/2240))
+* Set `metadata=True` for `make_wikicorpus` script by default ([@Xinyi2016](https://github.com/Xinyi2016), [#2245](https://github.com/RaRe-Technologies/gensim/pull/2245))
+* Remove unimportant warning from `Phrases` ([@rsdel2007](https://github.com/rsdel2007), [#2331](https://github.com/RaRe-Technologies/gensim/pull/2331))
+* Replace `open()` by `smart_open()` in `gensim.models.fasttext._load_fasttext_format` ([@rsdel2007](https://github.com/rsdel2007), [#2335](https://github.com/RaRe-Technologies/gensim/pull/2335))
### :red_circle: Bug fixes
-* Fix overflow error for `*Vec` corpusfile-based training (__[@bm371613](https://github.com/bm371613)__, [#2239](https://github.com/RaRe-Technologies/gensim/pull/2239))
-* Fix `malletmodel2ldamodel` conversion (__[@horpto](https://github.com/horpto)__, [#2288](https://github.com/RaRe-Technologies/gensim/pull/2288))
-* Replace custom epsilons with numpy equivalent in `LdaModel` (__[@horpto](https://github.com/horpto)__, [#2308](https://github.com/RaRe-Technologies/gensim/pull/2308))
-* Add missing content to tarball (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2194](https://github.com/RaRe-Technologies/gensim/pull/2194))
-* Fixes divided by zero when w_star_count==0 (__[@allenyllee](https://github.com/allenyllee)__, [#2259](https://github.com/RaRe-Technologies/gensim/pull/2259))
-* Fix check for callbacks (__[@allenyllee](https://github.com/allenyllee)__, [#2251](https://github.com/RaRe-Technologies/gensim/pull/2251))
-* Fix `SvmLightCorpus.serialize` if `labels` instance of numpy.ndarray (__[@aquatiko](https://github.com/aquatiko)__, [#2243](https://github.com/RaRe-Technologies/gensim/pull/2243))
-* Fix poincate viz incompatibility with `plotly>=3.0.0` (__[@jenishah](https://github.com/jenishah)__, [#2226](https://github.com/RaRe-Technologies/gensim/pull/2226))
-* Fix `keep_n` behavior for `Dictionary.filter_extremes` (__[@johann-petrak](https://github.com/johann-petrak)__, [#2232](https://github.com/RaRe-Technologies/gensim/pull/2232))
-* Fix for `sphinx==1.8.1` (last r (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#None](https://github.com/RaRe-Technologies/gensim/pull/None))
-* Fix `np.issubdtype` warnings (__[@marioyc](https://github.com/marioyc)__, [#2210](https://github.com/RaRe-Technologies/gensim/pull/2210))
-* Drop wrong key `-c` from `gensim.downloader` description (__[@horpto](https://github.com/horpto)__, [#2262](https://github.com/RaRe-Technologies/gensim/pull/2262))
-* Fix gensim build (docs & pyemd issues) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2318](https://github.com/RaRe-Technologies/gensim/pull/2318))
-* Limit visdom version (avoid py2 issue from the latest visdom release) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2334](https://github.com/RaRe-Technologies/gensim/pull/2334))
-* Fix visdom integration (using `viz.line()` instead of `viz.updatetrace()`) (__[@allenyllee](https://github.com/allenyllee)__, [#2252](https://github.com/RaRe-Technologies/gensim/pull/2252))
+* Fix overflow error for `*Vec` corpusfile-based training ([@bm371613](https://github.com/bm371613), [#2239](https://github.com/RaRe-Technologies/gensim/pull/2239))
+* Fix `malletmodel2ldamodel` conversion ([@horpto](https://github.com/horpto), [#2288](https://github.com/RaRe-Technologies/gensim/pull/2288))
+* Replace custom epsilons with numpy equivalent in `LdaModel` ([@horpto](https://github.com/horpto), [#2308](https://github.com/RaRe-Technologies/gensim/pull/2308))
+* Add missing content to tarball ([@menshikh-iv](https://github.com/menshikh-iv), [#2194](https://github.com/RaRe-Technologies/gensim/pull/2194))
+* Fix division by zero when `w_star_count==0` ([@allenyllee](https://github.com/allenyllee), [#2259](https://github.com/RaRe-Technologies/gensim/pull/2259))
+* Fix check for callbacks ([@allenyllee](https://github.com/allenyllee), [#2251](https://github.com/RaRe-Technologies/gensim/pull/2251))
+* Fix `SvmLightCorpus.serialize` if `labels` is an instance of `numpy.ndarray` ([@aquatiko](https://github.com/aquatiko), [#2243](https://github.com/RaRe-Technologies/gensim/pull/2243))
+* Fix Poincaré viz incompatibility with `plotly>=3.0.0` ([@jenishah](https://github.com/jenishah), [#2226](https://github.com/RaRe-Technologies/gensim/pull/2226))
+* Fix `keep_n` behavior for `Dictionary.filter_extremes` ([@johann-petrak](https://github.com/johann-petrak), [#2232](https://github.com/RaRe-Technologies/gensim/pull/2232))
+* Fix for `sphinx==1.8.1` (last release) ([@menshikh-iv](https://github.com/menshikh-iv), [#None](https://github.com/RaRe-Technologies/gensim/pull/None))
+* Fix `np.issubdtype` warnings ([@marioyc](https://github.com/marioyc), [#2210](https://github.com/RaRe-Technologies/gensim/pull/2210))
+* Drop wrong key `-c` from `gensim.downloader` description ([@horpto](https://github.com/horpto), [#2262](https://github.com/RaRe-Technologies/gensim/pull/2262))
+* Fix gensim build (docs & pyemd issues) ([@menshikh-iv](https://github.com/menshikh-iv), [#2318](https://github.com/RaRe-Technologies/gensim/pull/2318))
+* Limit visdom version (avoid py2 issue from the latest visdom release) ([@menshikh-iv](https://github.com/menshikh-iv), [#2334](https://github.com/RaRe-Technologies/gensim/pull/2334))
+* Fix visdom integration (using `viz.line()` instead of `viz.updatetrace()`) ([@allenyllee](https://github.com/allenyllee), [#2252](https://github.com/RaRe-Technologies/gensim/pull/2252))
### :books: Tutorial and doc improvements
-* Add gensim-data repo to `gensim.downloader` & fix rendering of code examples (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2327](https://github.com/RaRe-Technologies/gensim/pull/2327))
-* Fix typos in `gensim.models` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2323](https://github.com/RaRe-Technologies/gensim/pull/2323))
-* Fixed typos in notebooks (__[@rsdel2007](https://github.com/rsdel2007)__, [#2322](https://github.com/RaRe-Technologies/gensim/pull/2322))
-* Update `Doc2Vec` documentation: how tags are assigned in `corpus_file` mode (__[@persiyanov](https://github.com/persiyanov)__, [#2320](https://github.com/RaRe-Technologies/gensim/pull/2320))
-* Fix typos in `gensim/models/keyedvectors.py` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2290](https://github.com/RaRe-Technologies/gensim/pull/2290))
-* Add documentation about ranges to scoring functions for `Phrases` (__[@jenishah](https://github.com/jenishah)__, [#2242](https://github.com/RaRe-Technologies/gensim/pull/2242))
-* Update return sections for `KeyedVectors.evaluate_word_*` (__[@Stigjb](https://github.com/Stigjb)__, [#2205](https://github.com/RaRe-Technologies/gensim/pull/2205))
-* Fix return type in `KeyedVector.evaluate_word_analogies` (__[@Stigjb](https://github.com/Stigjb)__, [#2207](https://github.com/RaRe-Technologies/gensim/pull/2207))
-* Fix `WmdSimilarity` documentation (__[@jagmoreira](https://github.com/jagmoreira)__, [#2217](https://github.com/RaRe-Technologies/gensim/pull/2217))
-* Replace `fify -> fifty` in `gensim.parsing.preprocessing.STOPWORDS` (__[@coderwassananmol](https://github.com/coderwassananmol)__, [#2220](https://github.com/RaRe-Technologies/gensim/pull/2220))
-* Remove `alpha="auto"` from `LdaMulticore` (not supported yet) (__[@johann-petrak](https://github.com/johann-petrak)__, [#2225](https://github.com/RaRe-Technologies/gensim/pull/2225))
-* Update Adopters in README (__[@piskvorky](https://github.com/piskvorky)__, [#2234](https://github.com/RaRe-Technologies/gensim/pull/2234))
-* Fix broken link in `tutorials.md` (__[@rsdel2007](https://github.com/rsdel2007)__, [#2302](https://github.com/RaRe-Technologies/gensim/pull/2302))
+* Add gensim-data repo to `gensim.downloader` & fix rendering of code examples ([@menshikh-iv](https://github.com/menshikh-iv), [#2327](https://github.com/RaRe-Technologies/gensim/pull/2327))
+* Fix typos in `gensim.models` ([@rsdel2007](https://github.com/rsdel2007), [#2323](https://github.com/RaRe-Technologies/gensim/pull/2323))
+* Fix typos in notebooks ([@rsdel2007](https://github.com/rsdel2007), [#2322](https://github.com/RaRe-Technologies/gensim/pull/2322))
+* Update `Doc2Vec` documentation: how tags are assigned in `corpus_file` mode ([@persiyanov](https://github.com/persiyanov), [#2320](https://github.com/RaRe-Technologies/gensim/pull/2320))
+* Fix typos in `gensim/models/keyedvectors.py` ([@rsdel2007](https://github.com/rsdel2007), [#2290](https://github.com/RaRe-Technologies/gensim/pull/2290))
+* Add documentation about ranges to scoring functions for `Phrases` ([@jenishah](https://github.com/jenishah), [#2242](https://github.com/RaRe-Technologies/gensim/pull/2242))
+* Update return sections for `KeyedVectors.evaluate_word_*` ([@Stigjb](https://github.com/Stigjb), [#2205](https://github.com/RaRe-Technologies/gensim/pull/2205))
+* Fix return type in `KeyedVector.evaluate_word_analogies` ([@Stigjb](https://github.com/Stigjb), [#2207](https://github.com/RaRe-Technologies/gensim/pull/2207))
+* Fix `WmdSimilarity` documentation ([@jagmoreira](https://github.com/jagmoreira), [#2217](https://github.com/RaRe-Technologies/gensim/pull/2217))
+* Replace `fify -> fifty` in `gensim.parsing.preprocessing.STOPWORDS` ([@coderwassananmol](https://github.com/coderwassananmol), [#2220](https://github.com/RaRe-Technologies/gensim/pull/2220))
+* Remove `alpha="auto"` from `LdaMulticore` (not supported yet) ([@johann-petrak](https://github.com/johann-petrak), [#2225](https://github.com/RaRe-Technologies/gensim/pull/2225))
+* Update Adopters in README ([@piskvorky](https://github.com/piskvorky), [#2234](https://github.com/RaRe-Technologies/gensim/pull/2234))
+* Fix broken link in `tutorials.md` ([@rsdel2007](https://github.com/rsdel2007), [#2302](https://github.com/RaRe-Technologies/gensim/pull/2302))
### :warning: Deprecations (will be removed in the next major release)
@@ -563,7 +652,7 @@ Remove:
## 3.6.0, 2018-09-20
### :star2: New features
-* File-based training for `*2Vec` models (__[@persiyanov](https://github.com/persiyanov)__, [#2127](https://github.com/RaRe-Technologies/gensim/pull/2127) & [#2078](https://github.com/RaRe-Technologies/gensim/pull/2078) & [#2048](https://github.com/RaRe-Technologies/gensim/pull/2048))
+* File-based training for `*2Vec` models ([@persiyanov](https://github.com/persiyanov), [#2127](https://github.com/RaRe-Technologies/gensim/pull/2127) & [#2078](https://github.com/RaRe-Technologies/gensim/pull/2078) & [#2048](https://github.com/RaRe-Technologies/gensim/pull/2048))
New training mode for `*2Vec` models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). The result of our Google Summer of Code 2018 project by Dmitry Persiyanov.
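For illustration, a minimal sketch of the file-based mode (hypothetical example: the `corpus.txt` path and parameter values are made up; the file is assumed to be plain text with one sentence per line and whitespace-separated tokens):

```python
from gensim.models.word2vec import Word2Vec

# `corpus_file` replaces the `sentences` iterable: each worker thread reads its
# own slice of the file directly from disk, so training runs without the GIL.
model = Word2Vec(corpus_file='corpus.txt', workers=8)
model.save('word2vec.model')
```

The same `corpus_file` argument is accepted by `Doc2Vec` and `FastText`.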
@@ -609,36 +698,36 @@ Remove:
### :+1: Improvements
-* Add scikit-learn wrapper for `FastText` (__[@mcemilg](https://github.com/mcemilg)__, [#2178](https://github.com/RaRe-Technologies/gensim/pull/2178))
-* Add multiprocessing support for `BM25` (__[@Shiki-H](https://github.com/Shiki-H)__, [#2146](https://github.com/RaRe-Technologies/gensim/pull/2146))
-* Add `name_only` option for downloader api (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#2143](https://github.com/RaRe-Technologies/gensim/pull/2143))
-* Make `word2vec2tensor` script compatible with `python3` (__[@vsocrates](https://github.com/vsocrates)__, [#2147](https://github.com/RaRe-Technologies/gensim/pull/2147))
-* Add custom filter for `Wikicorpus` (__[@mattilyra](https://github.com/mattilyra)__, [#2089](https://github.com/RaRe-Technologies/gensim/pull/2089))
-* Make `similarity_matrix` support non-contiguous dictionaries (__[@Witiko](https://github.com/Witiko)__, [#2047](https://github.com/RaRe-Technologies/gensim/pull/2047))
+* Add scikit-learn wrapper for `FastText` ([@mcemilg](https://github.com/mcemilg), [#2178](https://github.com/RaRe-Technologies/gensim/pull/2178))
+* Add multiprocessing support for `BM25` ([@Shiki-H](https://github.com/Shiki-H), [#2146](https://github.com/RaRe-Technologies/gensim/pull/2146))
+* Add `name_only` option for downloader api ([@aneesh-joshi](https://github.com/aneesh-joshi), [#2143](https://github.com/RaRe-Technologies/gensim/pull/2143))
+* Make `word2vec2tensor` script compatible with `python3` ([@vsocrates](https://github.com/vsocrates), [#2147](https://github.com/RaRe-Technologies/gensim/pull/2147))
+* Add custom filter for `Wikicorpus` ([@mattilyra](https://github.com/mattilyra), [#2089](https://github.com/RaRe-Technologies/gensim/pull/2089))
+* Make `similarity_matrix` support non-contiguous dictionaries ([@Witiko](https://github.com/Witiko), [#2047](https://github.com/RaRe-Technologies/gensim/pull/2047))
### :red_circle: Bug fixes
-* Fix memory consumption in `AuthorTopicModel` (__[@philipphager](https://github.com/philipphager)__, [#2122](https://github.com/RaRe-Technologies/gensim/pull/2122))
-* Correctly process empty documents in `AuthorTopicModel` (__[@probinso](https://github.com/probinso)__, [#2133](https://github.com/RaRe-Technologies/gensim/pull/2133))
-* Fix ZeroDivisionError `keywords` issue with short input (__[@LShostenko](https://github.com/LShostenko)__, [#2154](https://github.com/RaRe-Technologies/gensim/pull/2154))
-* Fix `min_count` handling in phrases detection using `npmi_scorer` (__[@lopusz](https://github.com/lopusz)__, [#2072](https://github.com/RaRe-Technologies/gensim/pull/2072))
-* Remove duplicate count from `Phraser` log message (__[@robguinness](https://github.com/robguinness)__, [#2151](https://github.com/RaRe-Technologies/gensim/pull/2151))
-* Replace `np.integer` -> `np.int` in `AuthorTopicModel` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2145](https://github.com/RaRe-Technologies/gensim/pull/2145))
+* Fix memory consumption in `AuthorTopicModel` ([@philipphager](https://github.com/philipphager), [#2122](https://github.com/RaRe-Technologies/gensim/pull/2122))
+* Correctly process empty documents in `AuthorTopicModel` ([@probinso](https://github.com/probinso), [#2133](https://github.com/RaRe-Technologies/gensim/pull/2133))
+* Fix ZeroDivisionError `keywords` issue with short input ([@LShostenko](https://github.com/LShostenko), [#2154](https://github.com/RaRe-Technologies/gensim/pull/2154))
+* Fix `min_count` handling in phrases detection using `npmi_scorer` ([@lopusz](https://github.com/lopusz), [#2072](https://github.com/RaRe-Technologies/gensim/pull/2072))
+* Remove duplicate count from `Phraser` log message ([@robguinness](https://github.com/robguinness), [#2151](https://github.com/RaRe-Technologies/gensim/pull/2151))
+* Replace `np.integer` -> `np.int` in `AuthorTopicModel` ([@menshikh-iv](https://github.com/menshikh-iv), [#2145](https://github.com/RaRe-Technologies/gensim/pull/2145))
### :books: Tutorial and doc improvements
-* Update docstring with new analogy evaluation method (__[@akutuzov](https://github.com/akutuzov)__, [#2130](https://github.com/RaRe-Technologies/gensim/pull/2130))
-* Improve `prune_at` parameter description for `gensim.corpora.Dictionary` (__[@yxonic](https://github.com/yxonic)__, [#2128](https://github.com/RaRe-Technologies/gensim/pull/2128))
-* Fix `default` -> `auto` prior parameter in documentation for lda-related models (__[@Laubeee](https://github.com/Laubeee)__, [#2156](https://github.com/RaRe-Technologies/gensim/pull/2156))
-* Use heading instead of bold style in `gensim.models.translation_matrix` (__[@nzw0301](https://github.com/nzw0301)__, [#2164](https://github.com/RaRe-Technologies/gensim/pull/2164))
-* Fix quote of vocabulary from `gensim.models.Word2Vec` (__[@nzw0301](https://github.com/nzw0301)__, [#2161](https://github.com/RaRe-Technologies/gensim/pull/2161))
-* Replace deprecated parameters with new in docstring of `gensim.models.Doc2Vec` (__[@xuhdev](https://github.com/xuhdev)__, [#2165](https://github.com/RaRe-Technologies/gensim/pull/2165))
-* Fix formula in Mallet documentation (__[@Laubeee](https://github.com/Laubeee)__, [#2186](https://github.com/RaRe-Technologies/gensim/pull/2186))
-* Fix minor semantic issue in docs for `Phrases` (__[@RunHorst](https://github.com/RunHorst)__, [#2148](https://github.com/RaRe-Technologies/gensim/pull/2148))
-* Fix typo in documentation (__[@KenjiOhtsuka](https://github.com/KenjiOhtsuka)__, [#2157](https://github.com/RaRe-Technologies/gensim/pull/2157))
-* Additional documentation fixes (__[@piskvorky](https://github.com/piskvorky)__, [#2121](https://github.com/RaRe-Technologies/gensim/pull/2121))
+* Update docstring with new analogy evaluation method ([@akutuzov](https://github.com/akutuzov), [#2130](https://github.com/RaRe-Technologies/gensim/pull/2130))
+* Improve `prune_at` parameter description for `gensim.corpora.Dictionary` ([@yxonic](https://github.com/yxonic), [#2128](https://github.com/RaRe-Technologies/gensim/pull/2128))
+* Fix `default` -> `auto` prior parameter in documentation for lda-related models ([@Laubeee](https://github.com/Laubeee), [#2156](https://github.com/RaRe-Technologies/gensim/pull/2156))
+* Use heading instead of bold style in `gensim.models.translation_matrix` ([@nzw0301](https://github.com/nzw0301), [#2164](https://github.com/RaRe-Technologies/gensim/pull/2164))
+* Fix quote of vocabulary from `gensim.models.Word2Vec` ([@nzw0301](https://github.com/nzw0301), [#2161](https://github.com/RaRe-Technologies/gensim/pull/2161))
+* Replace deprecated parameters with new in docstring of `gensim.models.Doc2Vec` ([@xuhdev](https://github.com/xuhdev), [#2165](https://github.com/RaRe-Technologies/gensim/pull/2165))
+* Fix formula in Mallet documentation ([@Laubeee](https://github.com/Laubeee), [#2186](https://github.com/RaRe-Technologies/gensim/pull/2186))
+* Fix minor semantic issue in docs for `Phrases` ([@RunHorst](https://github.com/RunHorst), [#2148](https://github.com/RaRe-Technologies/gensim/pull/2148))
+* Fix typo in documentation ([@KenjiOhtsuka](https://github.com/KenjiOhtsuka), [#2157](https://github.com/RaRe-Technologies/gensim/pull/2157))
+* Additional documentation fixes ([@piskvorky](https://github.com/piskvorky), [#2121](https://github.com/RaRe-Technologies/gensim/pull/2121))
### :warning: Deprecations (will be removed in the next major release)
@@ -673,60 +762,60 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
### :books: Documentation improvements
-* Overhaul documentation for `*2vec` models (__[@steremma](https://github.com/steremma)__ & __[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1944](https://github.com/RaRe-Technologies/gensim/pull/1944), [#2087](https://github.com/RaRe-Technologies/gensim/pull/2087))
-* Fix documentation for LDA-related models (__[@steremma](https://github.com/steremma)__ & __[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2026](https://github.com/RaRe-Technologies/gensim/pull/2026))
-* Fix documentation for utils, corpora, inferfaces (__[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2096](https://github.com/RaRe-Technologies/gensim/pull/2096))
-* Update non-API docs (about, intro, license etc) (__[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2101](https://github.com/RaRe-Technologies/gensim/pull/2101))
-* Refactor documentation for `gensim.models.phrases` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1950](https://github.com/RaRe-Technologies/gensim/pull/1950))
-* Fix HashDictionary documentation (__[@piskvorky](https://github.com/piskvorky)__, [#2073](https://github.com/RaRe-Technologies/gensim/pull/2073))
-* Fix docstrings for `gensim.models.AuthorTopicModel` (__[@souravsingh](https://github.com/souravsingh)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1907](https://github.com/RaRe-Technologies/gensim/pull/1907))
-* Fix docstrings for HdpModel, lda_worker & lda_dispatcher (__[@gyanesh-m](https://github.com/gyanesh-m)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1912](https://github.com/RaRe-Technologies/gensim/pull/1912))
-* Fix format & links for `gensim.similarities.docsim` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2030](https://github.com/RaRe-Technologies/gensim/pull/2030))
-* Remove duplication of class documentation for `IndexedCorpus` (__[@darindf](https://github.com/darindf)__, [#2033](https://github.com/RaRe-Technologies/gensim/pull/2033))
-* Refactor documentation for `gensim.models.coherencemodel` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1933](https://github.com/RaRe-Technologies/gensim/pull/1933))
-* Fix docstrings for `gensim.sklearn_api` (__[@steremma](https://github.com/steremma)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1895](https://github.com/RaRe-Technologies/gensim/pull/1895))
-* Disable google-style docstring support (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2106](https://github.com/RaRe-Technologies/gensim/pull/2106))
-* Fix docstring of `gensim.models.KeyedVectors.similarity_matrix` (__[@Witiko](https://github.com/Witiko)__, [#1971](https://github.com/RaRe-Technologies/gensim/pull/1971))
-* Consistently use `smart_open()` instead of `open()` in notebooks (__[@sharanry](https://github.com/sharanry)__, [#1812](https://github.com/RaRe-Technologies/gensim/pull/1812))
+* Overhaul documentation for `*2vec` models ([@steremma](https://github.com/steremma) & [@piskvorky](https://github.com/piskvorky) & [@menshikh-iv](https://github.com/menshikh-iv), [#1944](https://github.com/RaRe-Technologies/gensim/pull/1944), [#2087](https://github.com/RaRe-Technologies/gensim/pull/2087))
+* Fix documentation for LDA-related models ([@steremma](https://github.com/steremma) & [@piskvorky](https://github.com/piskvorky) & [@menshikh-iv](https://github.com/menshikh-iv), [#2026](https://github.com/RaRe-Technologies/gensim/pull/2026))
+* Fix documentation for utils, corpora, interfaces ([@piskvorky](https://github.com/piskvorky) & [@menshikh-iv](https://github.com/menshikh-iv), [#2096](https://github.com/RaRe-Technologies/gensim/pull/2096))
+* Update non-API docs (about, intro, license etc) ([@piskvorky](https://github.com/piskvorky) & [@menshikh-iv](https://github.com/menshikh-iv), [#2101](https://github.com/RaRe-Technologies/gensim/pull/2101))
+* Refactor documentation for `gensim.models.phrases` ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1950](https://github.com/RaRe-Technologies/gensim/pull/1950))
+* Fix HashDictionary documentation ([@piskvorky](https://github.com/piskvorky), [#2073](https://github.com/RaRe-Technologies/gensim/pull/2073))
+* Fix docstrings for `gensim.models.AuthorTopicModel` ([@souravsingh](https://github.com/souravsingh) & [@menshikh-iv](https://github.com/menshikh-iv), [#1907](https://github.com/RaRe-Technologies/gensim/pull/1907))
+* Fix docstrings for HdpModel, lda_worker & lda_dispatcher ([@gyanesh-m](https://github.com/gyanesh-m) & [@menshikh-iv](https://github.com/menshikh-iv), [#1912](https://github.com/RaRe-Technologies/gensim/pull/1912))
+* Fix format & links for `gensim.similarities.docsim` ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#2030](https://github.com/RaRe-Technologies/gensim/pull/2030))
+* Remove duplication of class documentation for `IndexedCorpus` ([@darindf](https://github.com/darindf), [#2033](https://github.com/RaRe-Technologies/gensim/pull/2033))
+* Refactor documentation for `gensim.models.coherencemodel` ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1933](https://github.com/RaRe-Technologies/gensim/pull/1933))
+* Fix docstrings for `gensim.sklearn_api` ([@steremma](https://github.com/steremma) & [@menshikh-iv](https://github.com/menshikh-iv), [#1895](https://github.com/RaRe-Technologies/gensim/pull/1895))
+* Disable google-style docstring support ([@menshikh-iv](https://github.com/menshikh-iv), [#2106](https://github.com/RaRe-Technologies/gensim/pull/2106))
+* Fix docstring of `gensim.models.KeyedVectors.similarity_matrix` ([@Witiko](https://github.com/Witiko), [#1971](https://github.com/RaRe-Technologies/gensim/pull/1971))
+* Consistently use `smart_open()` instead of `open()` in notebooks ([@sharanry](https://github.com/sharanry), [#1812](https://github.com/RaRe-Technologies/gensim/pull/1812))
### :star2: New features:
-* Add `add_entity` method to `KeyedVectors` to allow adding word vectors manually (__[@persiyanov](https://github.com/persiyanov)__, [#1957](https://github.com/RaRe-Technologies/gensim/pull/1957))
-* Add inference for new unseen author to `AuthorTopicModel` (__[@Stamenov](https://github.com/Stamenov)__, [#1766](https://github.com/RaRe-Technologies/gensim/pull/1766))
-* Add `evaluate_word_analogies` (will replace `accuracy`) method to `KeyedVectors` (__[@akutuzov](https://github.com/akutuzov)__, [#1935](https://github.com/RaRe-Technologies/gensim/pull/1935))
-* Add Pivot Normalization to `TfidfModel` (__[@markroxor](https://github.com/markroxor)__, [#1780](https://github.com/RaRe-Technologies/gensim/pull/1780))
+* Add `add_entity` method to `KeyedVectors` to allow adding word vectors manually ([@persiyanov](https://github.com/persiyanov), [#1957](https://github.com/RaRe-Technologies/gensim/pull/1957))
+* Add inference for new unseen author to `AuthorTopicModel` ([@Stamenov](https://github.com/Stamenov), [#1766](https://github.com/RaRe-Technologies/gensim/pull/1766))
+* Add `evaluate_word_analogies` (will replace `accuracy`) method to `KeyedVectors`; see the sketch after this list ([@akutuzov](https://github.com/akutuzov), [#1935](https://github.com/RaRe-Technologies/gensim/pull/1935))
+* Add Pivot Normalization to `TfidfModel` ([@markroxor](https://github.com/markroxor), [#1780](https://github.com/RaRe-Technologies/gensim/pull/1780))
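For a quick look at the new `evaluate_word_analogies` method mentioned above, a hedged sketch: the pretrained `glove-wiki-gigaword-50` vectors are just one convenient choice, and `questions-words.txt` is the analogy file shipped with Gensim's test data.

```python
import gensim.downloader as api
from gensim.test.utils import datapath

wv = api.load('glove-wiki-gigaword-50')  # any KeyedVectors instance works here

# Returns the overall accuracy plus a per-section breakdown; replaces the older `accuracy()`.
score, sections = wv.evaluate_word_analogies(datapath('questions-words.txt'))
print('analogy accuracy: %.3f' % score)
```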
### :+1: Improvements
-* Allow initialization with `max_final_vocab` in lieu of `min_count` in `Word2Vec`(__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1915](https://github.com/RaRe-Technologies/gensim/pull/1915))
-* Add `dtype` argument for `chunkize_serial` in `LdaModel` (__[@darindf](https://github.com/darindf)__, [#2027](https://github.com/RaRe-Technologies/gensim/pull/2027))
-* Increase performance in `Phrases.analyze_sentence` (__[@JonathanHourany](https://github.com/JonathanHourany)__, [#2070](https://github.com/RaRe-Technologies/gensim/pull/2070))
-* Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models (__[@fernandocamargoti](https://github.com/fernandocamargoti)__, [#2093](https://github.com/RaRe-Technologies/gensim/pull/2093))
+* Allow initialization with `max_final_vocab` in lieu of `min_count` in `Word2Vec`; see the sketch after this list ([@aneesh-joshi](https://github.com/aneesh-joshi), [#1915](https://github.com/RaRe-Technologies/gensim/pull/1915))
+* Add `dtype` argument for `chunkize_serial` in `LdaModel` ([@darindf](https://github.com/darindf), [#2027](https://github.com/RaRe-Technologies/gensim/pull/2027))
+* Increase performance in `Phrases.analyze_sentence` ([@JonathanHourany](https://github.com/JonathanHourany), [#2070](https://github.com/RaRe-Technologies/gensim/pull/2070))
+* Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models ([@fernandocamargoti](https://github.com/fernandocamargoti), [#2093](https://github.com/RaRe-Technologies/gensim/pull/2093))
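A hedged sketch of the two new `Word2Vec` knobs from this list, using Gensim's tiny bundled `common_texts` corpus (the parameter values are made up for illustration):

```python
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(
    common_texts,
    min_count=1,
    max_final_vocab=20,  # cap the surviving vocabulary instead of hand-tuning min_count
    ns_exponent=0.5,     # 1.0 = raw unigram frequency, 0.0 = uniform negative sampling
)
print(model.wv.most_similar('computer', topn=3))  # 'computer' occurs in common_texts
```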
### :red_circle: Bug fixes:
-* Fix `Doc2Vec.infer_vector` + notebook cleanup (__[@gojomo](https://github.com/gojomo)__, [#2103](https://github.com/RaRe-Technologies/gensim/pull/2103))
-* Fix linear decay for learning rate in `Doc2Vec.infer_vector` (__[@umangv](https://github.com/umangv)__, [#2063](https://github.com/RaRe-Technologies/gensim/pull/2063))
-* Fix negative sampling floating-point error for `gensim.models.Poincare (__[@jayantj](https://github.com/jayantj)__, [#1959](https://github.com/RaRe-Technologies/gensim/pull/1959))
-* Fix loading `word2vec` and `doc2vec` models saved using old Gensim versions (__[@manneshiva](https://github.com/manneshiva)__, [#2012](https://github.com/RaRe-Technologies/gensim/pull/2012))
-* Fix `SoftCosineSimilarity.get_similarities` on corpora ssues/1955) (__[@Witiko](https://github.com/Witiko)__, [#1972](https://github.com/RaRe-Technologies/gensim/pull/1972))
-* Fix return dtype for `matutils.unitvec` according to input dtype (__[@o-P-o](https://github.com/o-P-o)__, [#1992](https://github.com/RaRe-Technologies/gensim/pull/1992))
-* Fix passing empty dictionary to `gensim.corpora.WikiCorpus` (__[@steremma](https://github.com/steremma)__, [#2042](https://github.com/RaRe-Technologies/gensim/pull/2042))
-* Fix bug in `Similarity.query_shards` in multiprocessing case (__[@bohea](https://github.com/bohea)__, [#2044](https://github.com/RaRe-Technologies/gensim/pull/2044))
-* Fix SMART from TfidfModel for case when `df == "n"` (__[@PeteBleackley](https://github.com/PeteBleackley)__, [#2021](https://github.com/RaRe-Technologies/gensim/pull/2021))
-* Fix OverflowError when loading a large term-document matrix in compiled MatrixMarket format (__[@arlenk](https://github.com/arlenk)__, [#2001](https://github.com/RaRe-Technologies/gensim/pull/2001))
-* Update rules for removing table markup from Wikipedia dumps (__[@chaitaliSaini](https://github.com/chaitaliSaini)__, [#1954](https://github.com/RaRe-Technologies/gensim/pull/1954))
-* Fix `_is_single` from `Phrases` for case when corpus is a NumPy array (__[@rmalouf](https://github.com/rmalouf)__, [#1987](https://github.com/RaRe-Technologies/gensim/pull/1987))
-* Fix tests for `EuclideanKeyedVectors.similarity_matrix` (__[@Witiko](https://github.com/Witiko)__, [#1984](https://github.com/RaRe-Technologies/gensim/pull/1984))
-* Fix deprecated parameters in `D2VTransformer` and `W2VTransformer`(__[@MritunjayMohitesh](https://github.com/MritunjayMohitesh)__, [#1945](https://github.com/RaRe-Technologies/gensim/pull/1945))
-* Fix `Doc2Vec.infer_vector` after loading old `Doc2Vec` (`gensim<=3.2`)(__[@manneshiva](https://github.com/manneshiva)__, [#1974](https://github.com/RaRe-Technologies/gensim/pull/1974))
-* Fix inheritance chain for `load_word2vec_format` (__[@DennisChen0307](https://github.com/DennisChen0307)__, [#1968](https://github.com/RaRe-Technologies/gensim/pull/1968))
-* Update Keras version (avoid bug from `keras==2.1.5`) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1963](https://github.com/RaRe-Technologies/gensim/pull/1963))
+* Fix `Doc2Vec.infer_vector` + notebook cleanup ([@gojomo](https://github.com/gojomo), [#2103](https://github.com/RaRe-Technologies/gensim/pull/2103))
+* Fix linear decay for learning rate in `Doc2Vec.infer_vector` ([@umangv](https://github.com/umangv), [#2063](https://github.com/RaRe-Technologies/gensim/pull/2063))
+* Fix negative sampling floating-point error for `gensim.models.Poincare` ([@jayantj](https://github.com/jayantj), [#1959](https://github.com/RaRe-Technologies/gensim/pull/1959))
+* Fix loading `word2vec` and `doc2vec` models saved using old Gensim versions ([@manneshiva](https://github.com/manneshiva), [#2012](https://github.com/RaRe-Technologies/gensim/pull/2012))
+* Fix `SoftCosineSimilarity.get_similarities` on corpora. Fix #1955 ([@Witiko](https://github.com/Witiko), [#1972](https://github.com/RaRe-Technologies/gensim/pull/1972))
+* Fix return dtype for `matutils.unitvec` according to input dtype ([@o-P-o](https://github.com/o-P-o), [#1992](https://github.com/RaRe-Technologies/gensim/pull/1992))
+* Fix passing empty dictionary to `gensim.corpora.WikiCorpus` ([@steremma](https://github.com/steremma), [#2042](https://github.com/RaRe-Technologies/gensim/pull/2042))
+* Fix bug in `Similarity.query_shards` in multiprocessing case ([@bohea](https://github.com/bohea), [#2044](https://github.com/RaRe-Technologies/gensim/pull/2044))
+* Fix SMART scheme handling in `TfidfModel` for the case when `df == "n"` ([@PeteBleackley](https://github.com/PeteBleackley), [#2021](https://github.com/RaRe-Technologies/gensim/pull/2021))
+* Fix OverflowError when loading a large term-document matrix in compiled MatrixMarket format ([@arlenk](https://github.com/arlenk), [#2001](https://github.com/RaRe-Technologies/gensim/pull/2001))
+* Update rules for removing table markup from Wikipedia dumps ([@chaitaliSaini](https://github.com/chaitaliSaini), [#1954](https://github.com/RaRe-Technologies/gensim/pull/1954))
+* Fix `_is_single` from `Phrases` for case when corpus is a NumPy array ([@rmalouf](https://github.com/rmalouf), [#1987](https://github.com/RaRe-Technologies/gensim/pull/1987))
+* Fix tests for `EuclideanKeyedVectors.similarity_matrix` ([@Witiko](https://github.com/Witiko), [#1984](https://github.com/RaRe-Technologies/gensim/pull/1984))
+* Fix deprecated parameters in `D2VTransformer` and `W2VTransformer` ([@MritunjayMohitesh](https://github.com/MritunjayMohitesh), [#1945](https://github.com/RaRe-Technologies/gensim/pull/1945))
+* Fix `Doc2Vec.infer_vector` after loading old `Doc2Vec` (`gensim<=3.2`) ([@manneshiva](https://github.com/manneshiva), [#1974](https://github.com/RaRe-Technologies/gensim/pull/1974))
+* Fix inheritance chain for `load_word2vec_format` ([@DennisChen0307](https://github.com/DennisChen0307), [#1968](https://github.com/RaRe-Technologies/gensim/pull/1968))
+* Update Keras version (avoid bug from `keras==2.1.5`) ([@menshikh-iv](https://github.com/menshikh-iv), [#1963](https://github.com/RaRe-Technologies/gensim/pull/1963))
@@ -754,7 +843,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
## 3.4.0, 2018-03-01
### :star2: New features:
-* Massive optimizations of `gensim.models.LdaModel`: much faster training, using Cython. (__[@arlenk](https://github.com/arlenk)__, [#1767](https://github.com/RaRe-Technologies/gensim/pull/1767))
+* Massive optimizations of `gensim.models.LdaModel`: much faster training, using Cython. ([@arlenk](https://github.com/arlenk), [#1767](https://github.com/RaRe-Technologies/gensim/pull/1767))
- Training benchmark :boom:
| dataset | old LDA [sec] | optimized LDA [sec] | speed up |
@@ -763,7 +852,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
| enron | 774 | **437** | **1.77x** |
- This change **affects all models that depend on `LdaModel`**, such as `LdaMulticore`, `LdaSeqModel`, `AuthorTopicModel`.
-* Huge speed-ups to corpus I/O with `MmCorpus` (Cython) (__[@arlenk](https://github.com/arlenk)__, [#1825](https://github.com/RaRe-Technologies/gensim/pull/1825))
+* Huge speed-ups to corpus I/O with `MmCorpus` (Cython) ([@arlenk](https://github.com/arlenk), [#1825](https://github.com/RaRe-Technologies/gensim/pull/1825))
- File reading benchmark
| dataset | file compressed? | old MmReader [sec] | optimized MmReader [sec] | speed up |
@@ -777,7 +866,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
- Overall, a **2.5x** speedup for compressed `.mm.gz` input and **8.5x** :fire::fire::fire: for uncompressed plaintext `.mm`.
-* Performance and memory optimization to `gensim.models.FastText` :rocket: (__[@jbaiter](https://github.com/jbaiter)__, [#1916](https://github.com/RaRe-Technologies/gensim/pull/1916))
+* Performance and memory optimization to `gensim.models.FastText` :rocket: ([@jbaiter](https://github.com/jbaiter), [#1916](https://github.com/RaRe-Technologies/gensim/pull/1916))
- Benchmark (first 500,000 articles from English Wikipedia)
| Metric | old FastText | optimized FastText | improvement |
@@ -789,7 +878,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
- Overall, a **2.5x** speedup & memory usage reduced by **30%**.
-* Implemented [Soft Cosine Measure](https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure) (__[@Witiko](https://github.com/Witiko)__, [#1827](https://github.com/RaRe-Technologies/gensim/pull/1827))
+* Implemented [Soft Cosine Measure](https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure) ([@Witiko](https://github.com/Witiko), [#1827](https://github.com/RaRe-Technologies/gensim/pull/1827))
- New method for assessing document similarity, a nice faster alternative to [WMD, Word Mover's Distance](http://proceedings.mlr.press/v37/kusnerb15.pdf)
- Benchmark
@@ -808,39 +897,39 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
### :+1: Improvements:
-* New method to show the Gensim installation parameters: `python -m gensim.scripts.package_info --info`. Use this when reporting problems, for easier debugging. Fix #1902 (__[@sharanry](https://github.com/sharanry)__, [#1903](https://github.com/RaRe-Technologies/gensim/pull/1903))
-* Added a flag to optionally skip network-related tests, to help maintainers avoid network issues with CI services (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1930](https://github.com/RaRe-Technologies/gensim/pull/1930))
-* Added `license` field to `setup.py`, allowing the use of tools like `pip-licenses` (__[@nils-werner](https://github.com/nils-werner)__, [#1909](https://github.com/RaRe-Technologies/gensim/pull/1909))
+* New method to show the Gensim installation parameters: `python -m gensim.scripts.package_info --info`. Use this when reporting problems, for easier debugging. Fix #1902 ([@sharanry](https://github.com/sharanry), [#1903](https://github.com/RaRe-Technologies/gensim/pull/1903))
+* Added a flag to optionally skip network-related tests, to help maintainers avoid network issues with CI services ([@menshikh-iv](https://github.com/menshikh-iv), [#1930](https://github.com/RaRe-Technologies/gensim/pull/1930))
+* Added `license` field to `setup.py`, allowing the use of tools like `pip-licenses` ([@nils-werner](https://github.com/nils-werner), [#1909](https://github.com/RaRe-Technologies/gensim/pull/1909))
### :red_circle: Bug fixes:
-* Fix Python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` (__[@darindf](https://github.com/darindf)__, [#1875](https://github.com/RaRe-Technologies/gensim/pull/1875))
-* Add `wv` property to KeyedVectors for backward compatibility. Fix #1882 (__[@manneshiva](https://github.com/manneshiva)__, [#1884](https://github.com/RaRe-Technologies/gensim/pull/1884))
-* Fix deprecation warning from `inspect.getargspec`. Fix #1878 (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1887](https://github.com/RaRe-Technologies/gensim/pull/1887))
-* Add `LabeledSentence` to `gensim.models.doc2vec` for backward compatibility. Fix #1886 (__[@manneshiva](https://github.com/manneshiva)__, [#1891](https://github.com/RaRe-Technologies/gensim/pull/1891))
-* Fix empty output bug in `Phrases` (when using `model[tokens]` twice). Fix #1401 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1853](https://github.com/RaRe-Technologies/gensim/pull/1853))
-* Fix type problems for `D2VTransformer.fit_transform`. Fix #1834 (__[@Utkarsh-Mishra-CIC](https://github.com/Utkarsh-Mishra-CIC)__, [#1845](https://github.com/RaRe-Technologies/gensim/pull/1845))
-* Fix `datatype` parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 (__[@pushpankar](https://github.com/pushpankar)__, [#1819](https://github.com/RaRe-Technologies/gensim/pull/1819))
-* Fix deprecated parameters in `doc2vec-lee` notebook (__[@TheFlash10](https://github.com/TheFlash10)__, [#1918](https://github.com/RaRe-Technologies/gensim/pull/1918))
-* Fix file-like closing bug in `gensim.corpora.MmCorpus`. Fix #1869 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1911](https://github.com/RaRe-Technologies/gensim/pull/1911))
-* Fix precision problem in `test_similarities.py`, no more FP fails. (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1928](https://github.com/RaRe-Technologies/gensim/pull/1928))
-* Fix encoding in Lee corpus reader. (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1931](https://github.com/RaRe-Technologies/gensim/pull/1931))
-* Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs`. (__[@akutuzov](https://github.com/akutuzov)__, [#1934](https://github.com/RaRe-Technologies/gensim/pull/1934))
+* Fix Python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` ([@darindf](https://github.com/darindf), [#1875](https://github.com/RaRe-Technologies/gensim/pull/1875))
+* Add `wv` property to KeyedVectors for backward compatibility. Fix #1882 ([@manneshiva](https://github.com/manneshiva), [#1884](https://github.com/RaRe-Technologies/gensim/pull/1884))
+* Fix deprecation warning from `inspect.getargspec`. Fix #1878 ([@aneesh-joshi](https://github.com/aneesh-joshi), [#1887](https://github.com/RaRe-Technologies/gensim/pull/1887))
+* Add `LabeledSentence` to `gensim.models.doc2vec` for backward compatibility. Fix #1886 ([@manneshiva](https://github.com/manneshiva), [#1891](https://github.com/RaRe-Technologies/gensim/pull/1891))
+* Fix empty output bug in `Phrases` (when using `model[tokens]` twice). Fix #1401 ([@sj29-innovate](https://github.com/sj29-innovate), [#1853](https://github.com/RaRe-Technologies/gensim/pull/1853))
+* Fix type problems for `D2VTransformer.fit_transform`. Fix #1834 ([@Utkarsh-Mishra-CIC](https://github.com/Utkarsh-Mishra-CIC), [#1845](https://github.com/RaRe-Technologies/gensim/pull/1845))
+* Fix `datatype` parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 ([@pushpankar](https://github.com/pushpankar), [#1819](https://github.com/RaRe-Technologies/gensim/pull/1819))
+* Fix deprecated parameters in `doc2vec-lee` notebook ([@TheFlash10](https://github.com/TheFlash10), [#1918](https://github.com/RaRe-Technologies/gensim/pull/1918))
+* Fix file-like closing bug in `gensim.corpora.MmCorpus`. Fix #1869 ([@sj29-innovate](https://github.com/sj29-innovate), [#1911](https://github.com/RaRe-Technologies/gensim/pull/1911))
+* Fix precision problem in `test_similarities.py`, no more FP fails. ([@menshikh-iv](https://github.com/menshikh-iv), [#1928](https://github.com/RaRe-Technologies/gensim/pull/1928))
+* Fix encoding in Lee corpus reader. ([@menshikh-iv](https://github.com/menshikh-iv), [#1931](https://github.com/RaRe-Technologies/gensim/pull/1931))
+* Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs`. ([@akutuzov](https://github.com/akutuzov), [#1934](https://github.com/RaRe-Technologies/gensim/pull/1934))
### :books: Tutorial and doc improvements:
-* Fix example block for `gensim.models.Word2Vec` (__[@nzw0301](https://github.com/nzw0301)__, [#1870](https://github.com/RaRe-Technologies/gensim/pull/1876))
-* Fix `doc2vec-lee` notebook (__[@numericlee](https://github.com/numericlee)__, [#1870](https://github.com/RaRe-Technologies/gensim/pull/1870))
-* Store images from `README.md` directly in repository. Fix #1849 (__[@ibrahimsharaf](https://github.com/ibrahimsharaf)__, [#1861](https://github.com/RaRe-Technologies/gensim/pull/1861))
-* Add windows venv activate command to `CONTRIBUTING.md` (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1880](https://github.com/RaRe-Technologies/gensim/pull/1880))
-* Add anaconda-cloud badge. Partial fix #1901 (__[@sharanry](https://github.com/sharanry)__, [#1905](https://github.com/RaRe-Technologies/gensim/pull/1905))
-* Fix docstrings for lsi-related code (__[@steremma](https://github.com/steremma)__, [#1892](https://github.com/RaRe-Technologies/gensim/pull/1892))
-* Fix parameter description of `sg` parameter for `gensim.models.word2vec` (__[@mdcclv](https://github.com/mdcclv)__, [#1919](https://github.com/RaRe-Technologies/gensim/pull/1919))
-* Refactor documentation for `gensim.similarities.docsim` and `MmCorpus-related`. (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1910](https://github.com/RaRe-Technologies/gensim/pull/1910))
-* Fix docstrings for `gensim.test.utils` (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1904](https://github.com/RaRe-Technologies/gensim/pull/1904))
-* Refactor docstrings for `gensim.scripts`. Partial fix #1665 (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1792](https://github.com/RaRe-Technologies/gensim/pull/1792))
-* Refactor API reference `gensim.corpora`. Partial fix #1671 (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1835](https://github.com/RaRe-Technologies/gensim/pull/1835))
-* Fix documentation for `gensim.models.wrappers` (__[@kakshay21](https://github.com/kakshay21)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1859](https://github.com/RaRe-Technologies/gensim/pull/1859))
-* Fix docstrings for `gensim.interfaces` (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1913](https://github.com/RaRe-Technologies/gensim/pull/1913))
+* Fix example block for `gensim.models.Word2Vec` ([@nzw0301](https://github.com/nzw0301), [#1876](https://github.com/RaRe-Technologies/gensim/pull/1876))
+* Fix `doc2vec-lee` notebook ([@numericlee](https://github.com/numericlee), [#1870](https://github.com/RaRe-Technologies/gensim/pull/1870))
+* Store images from `README.md` directly in repository. Fix #1849 ([@ibrahimsharaf](https://github.com/ibrahimsharaf), [#1861](https://github.com/RaRe-Technologies/gensim/pull/1861))
+* Add windows venv activate command to `CONTRIBUTING.md` ([@aneesh-joshi](https://github.com/aneesh-joshi), [#1880](https://github.com/RaRe-Technologies/gensim/pull/1880))
+* Add anaconda-cloud badge. Partial fix #1901 ([@sharanry](https://github.com/sharanry), [#1905](https://github.com/RaRe-Technologies/gensim/pull/1905))
+* Fix docstrings for lsi-related code ([@steremma](https://github.com/steremma), [#1892](https://github.com/RaRe-Technologies/gensim/pull/1892))
+* Fix parameter description of `sg` parameter for `gensim.models.word2vec` ([@mdcclv](https://github.com/mdcclv), [#1919](https://github.com/RaRe-Technologies/gensim/pull/1919))
+* Refactor documentation for `gensim.similarities.docsim` and `MmCorpus`-related code. ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1910](https://github.com/RaRe-Technologies/gensim/pull/1910))
+* Fix docstrings for `gensim.test.utils` ([@yurkai](https://github.com/yurkai) & [@menshikh-iv](https://github.com/menshikh-iv), [#1904](https://github.com/RaRe-Technologies/gensim/pull/1904))
+* Refactor docstrings for `gensim.scripts`. Partial fix #1665 ([@yurkai](https://github.com/yurkai) & [@menshikh-iv](https://github.com/menshikh-iv), [#1792](https://github.com/RaRe-Technologies/gensim/pull/1792))
+* Refactor API reference `gensim.corpora`. Partial fix #1671 ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1835](https://github.com/RaRe-Technologies/gensim/pull/1835))
+* Fix documentation for `gensim.models.wrappers` ([@kakshay21](https://github.com/kakshay21) & [@menshikh-iv](https://github.com/menshikh-iv), [#1859](https://github.com/RaRe-Technologies/gensim/pull/1859))
+* Fix docstrings for `gensim.interfaces` ([@yurkai](https://github.com/yurkai) & [@menshikh-iv](https://github.com/menshikh-iv), [#1913](https://github.com/RaRe-Technologies/gensim/pull/1913))
### :warning: Deprecations (will be removed in the next major release)
@@ -867,13 +956,13 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
## 3.3.0, 2018-02-02
:star2: New features:
-* Re-designed all "*2vec" implementations (__[@manneshiva](https://github.com/manneshiva)__, [#1777](https://github.com/RaRe-Technologies/gensim/pull/1777))
+* Re-designed all "*2vec" implementations ([@manneshiva](https://github.com/manneshiva), [#1777](https://github.com/RaRe-Technologies/gensim/pull/1777))
- Modular organization of `Word2Vec`, `Doc2Vec`, `FastText`, etc ..., making it easier to add new models in the future and re-use code
- Fully backward compatible (even with loading models stored by a previous Gensim version)
- [Detailed documentation for the *2vec refactoring project](https://github.com/manneshiva/gensim/wiki/Any2Vec-Refactoring-Summary)
* Improve `gensim.scripts.segment_wiki` by retaining interwiki links. Fix #1712
- (__[@steremma](https://github.com/steremma)__, [PR #1839](https://github.com/RaRe-Technologies/gensim/pull/1839))
+ ([@steremma](https://github.com/steremma), [PR #1839](https://github.com/RaRe-Technologies/gensim/pull/1839))
- Optionally extract interlinks from Wikipedia pages (use the `--include-interlinks` option). This will output one additional JSON dict for each article:
```
{
@@ -914,7 +1003,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
```
-* Add support for [SMART notation](https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html) for `TfidfModel`. Fix #1785 (__[@markroxor](https://github.com/markroxor)__, [#1791](https://github.com/RaRe-Technologies/gensim/pull/1791))
+* Add support for [SMART notation](https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html) for `TfidfModel`. Fix #1785 ([@markroxor](https://github.com/markroxor), [#1791](https://github.com/RaRe-Technologies/gensim/pull/1791))
- Natural extension of `TfidfModel` to allow different weighting and normalization schemes
```python
from gensim.corpora import Dictionary
@@ -938,7 +1027,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
```
- [SMART Information Retrieval System (wiki)](https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System)
-* Add CircleCI for building Gensim documentation. Fix #1807 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1822](https://github.com/RaRe-Technologies/gensim/pull/1822))
+* Add CircleCI for building Gensim documentation. Fix #1807 ([@menshikh-iv](https://github.com/menshikh-iv), [#1822](https://github.com/RaRe-Technologies/gensim/pull/1822))
- An easy way to preview the rendered documentation (especially if you don't use Linux)
- Go to the "Details" link of CircleCI in your PR, click on the "Artifacts" tab, and choose the HTML file you want to view; a new tab will open with the rendered HTML page
- Integration with Github, to see the documentation directly from the pull request page
@@ -948,47 +1037,47 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
:red_circle: Bug fixes:
-* Fix import in `get_my_ip`. Fix #1771 (__[@darindf](https://github.com/darindf)__, [#1772](https://github.com/RaRe-Technologies/gensim/pull/1772))
-* Fix tox.ini/setup.cfg configuration (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1815](https://github.com/RaRe-Technologies/gensim/pull/1815))
-* Fix formula in `gensim.summarization.bm25`. Fix #1828 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1833](https://github.com/RaRe-Technologies/gensim/pull/1833))
-* Fix the train method of `TranslationMatrix` (__[@robotcator](https://github.com/robotcator)__, [#1838](https://github.com/RaRe-Technologies/gensim/pull/1838))
-* Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (__[@Alexjmsherman](https://github.com/Alexjmsherman)__, [#1823](https://github.com/RaRe-Technologies/gensim/pull/1823))
-* Fix parameter setting for `FastText.train`. Fix #1818 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1837](https://github.com/RaRe-Technologies/gensim/pull/1837))
-* Pin python2 explicitly for building documentation (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1840](https://github.com/RaRe-Technologies/gensim/pull/1840))
-* Remove dispatcher deadlock for distributed LDA (__[@darindf](https://github.com/darindf)__, [#1817](https://github.com/RaRe-Technologies/gensim/pull/1817))
-* Fix `score_function` from `LexicalEntailmentEvaluation`. Fix #1858 (__[@hachibaka](https://github.com/hachibaka)__, [#1863](https://github.com/RaRe-Technologies/gensim/pull/1863))
-* Fix symmetrical case for hellinger distance. Fix #1854 (__[@caiyulun](https://github.com/caiyulun)__, [#1860](https://github.com/RaRe-Technologies/gensim/pull/1860))
-* Remove wrong logging at import. Fix #1706 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1871](https://github.com/RaRe-Technologies/gensim/pull/1871))
+* Fix import in `get_my_ip`. Fix #1771 ([@darindf](https://github.com/darindf), [#1772](https://github.com/RaRe-Technologies/gensim/pull/1772))
+* Fix tox.ini/setup.cfg configuration ([@menshikh-iv](https://github.com/menshikh-iv), [#1815](https://github.com/RaRe-Technologies/gensim/pull/1815))
+* Fix formula in `gensim.summarization.bm25`. Fix #1828 ([@sj29-innovate](https://github.com/sj29-innovate), [#1833](https://github.com/RaRe-Technologies/gensim/pull/1833))
+* Fix the train method of `TranslationMatrix` ([@robotcator](https://github.com/robotcator), [#1838](https://github.com/RaRe-Technologies/gensim/pull/1838))
+* Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` ([@Alexjmsherman](https://github.com/Alexjmsherman), [#1823](https://github.com/RaRe-Technologies/gensim/pull/1823))
+* Fix parameter setting for `FastText.train`. Fix #1818 ([@sj29-innovate](https://github.com/sj29-innovate), [#1837](https://github.com/RaRe-Technologies/gensim/pull/1837))
+* Pin python2 explicitly for building documentation ([@menshikh-iv](https://github.com/menshikh-iv), [#1840](https://github.com/RaRe-Technologies/gensim/pull/1840))
+* Remove dispatcher deadlock for distributed LDA ([@darindf](https://github.com/darindf), [#1817](https://github.com/RaRe-Technologies/gensim/pull/1817))
+* Fix `score_function` from `LexicalEntailmentEvaluation`. Fix #1858 ([@hachibaka](https://github.com/hachibaka), [#1863](https://github.com/RaRe-Technologies/gensim/pull/1863))
+* Fix symmetrical case for Hellinger distance. Fix #1854 ([@caiyulun](https://github.com/caiyulun), [#1860](https://github.com/RaRe-Technologies/gensim/pull/1860))
+* Remove wrong logging at import. Fix #1706 ([@menshikh-iv](https://github.com/menshikh-iv), [#1871](https://github.com/RaRe-Technologies/gensim/pull/1871))
:books: Tutorial and doc improvements:
-* Refactor documentation API Reference for `gensim.summarization` (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1709](https://github.com/RaRe-Technologies/gensim/pull/1709))
-* Fix docstrings for `gensim.similarities.index`. Partial fix #1666 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1681](https://github.com/RaRe-Technologies/gensim/pull/1681))
-* Fix docstrings for `gensim.models.translation_matrix` (__[@KokuKUSIAKU](https://github.com/KokuKUSIAKU)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1806](https://github.com/RaRe-Technologies/gensim/pull/1806))
-* Fix docstrings for `gensim.models.rpmodel` (__[@jazzmuesli](https://github.com/jazzmuesli)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1802](https://github.com/RaRe-Technologies/gensim/pull/1802))
-* Fix docstrings for `gensim.utils` (__[@kakshay21](https://github.com/kakshay21)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1797](https://github.com/RaRe-Technologies/gensim/pull/1797))
-* Fix docstrings for `gensim.matutils` (__[@Cheukting](https://github.com/Cheukting)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1804](https://github.com/RaRe-Technologies/gensim/pull/1804))
-* Fix docstrings for `gensim.models.logentropy_model` (__[@minggli](https://github.com/minggli)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1803](https://github.com/RaRe-Technologies/gensim/pull/1803))
-* Fix docstrings for `gensim.models.normmodel` (__[@AustenLamacraft](https://github.com/AustenLamacraft)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1805](https://github.com/RaRe-Technologies/gensim/pull/1805))
-* Refactor API reference `gensim.topic_coherence`. Fix #1669 (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1714](https://github.com/RaRe-Technologies/gensim/pull/1714))
-* Fix documentation for `gensim.corpora.dictionary` and `gensim.corpora.hashdictionary`. Partial fix #1671 (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1814](https://github.com/RaRe-Technologies/gensim/pull/1814))
-* Fix documentation for `gensim.corpora`. Partial fix #1671 (__[@anotherbugmaster](https://github.com/anotherbugmaster)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1729](https://github.com/RaRe-Technologies/gensim/pull/1729))
-* Update banner in doc pages (__[@piskvorky](https://github.com/piskvorky)__, [#1865](https://github.com/RaRe-Technologies/gensim/pull/1865))
-* Fix errors in the doc2vec-lee notebook (__[@PeterHamilton](https://github.com/PeterHamilton)__, [#1841](https://github.com/RaRe-Technologies/gensim/pull/1841))
-* Add wordnet mammal train file for Poincare notebook (__[@jayantj](https://github.com/jayantj)__, [#1781](https://github.com/RaRe-Technologies/gensim/pull/1781))
-* Update Poincare notebooks (#1774) (__[@jayantj](https://github.com/jayantj)__, [#1774](https://github.com/RaRe-Technologies/gensim/pull/1774))
-* Update contributing guide. Fix #1786 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1793](https://github.com/RaRe-Technologies/gensim/pull/1793))
-* Add `model_to_dict` one-liner to word2vec notebook. Fix #1269 (__[@kakshay21](https://github.com/kakshay21)__, [#1776](https://github.com/RaRe-Technologies/gensim/pull/1776))
-* Add word embedding viz to word2vec notebook. Fix #1419 (__[@markroxor](https://github.com/markroxor)__, [#1800](https://github.com/RaRe-Technologies/gensim/pull/1800))
-* Fix description of `sg` parameter for `gensim.models.FastText` (__[@akutuzov](https://github.com/akutuzov)__, [#1801](https://github.com/RaRe-Technologies/gensim/pull/1801))
-* Fix typo in `doc2vec-IMDB`. Fix #1788 (__[@apoorvaeternity](https://github.com/apoorvaeternity)__, [#1796](https://github.com/RaRe-Technologies/gensim/pull/1796))
-* Remove outdated bz2 examples from tutorials[2] (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1868](https://github.com/RaRe-Technologies/gensim/pull/1868))
-* Remove outdated `bz2` + `MmCorpus` examples from tutorials (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1867](https://github.com/RaRe-Technologies/gensim/pull/1867))
+* Refactor documentation API Reference for `gensim.summarization` ([@yurkai](https://github.com/yurkai) & [@menshikh-iv](https://github.com/menshikh-iv), [#1709](https://github.com/RaRe-Technologies/gensim/pull/1709))
+* Fix docstrings for `gensim.similarities.index`. Partial fix #1666 ([@menshikh-iv](https://github.com/menshikh-iv), [#1681](https://github.com/RaRe-Technologies/gensim/pull/1681))
+* Fix docstrings for `gensim.models.translation_matrix` ([@KokuKUSIAKU](https://github.com/KokuKUSIAKU) & [@menshikh-iv](https://github.com/menshikh-iv), [#1806](https://github.com/RaRe-Technologies/gensim/pull/1806))
+* Fix docstrings for `gensim.models.rpmodel` ([@jazzmuesli](https://github.com/jazzmuesli) & [@menshikh-iv](https://github.com/menshikh-iv), [#1802](https://github.com/RaRe-Technologies/gensim/pull/1802))
+* Fix docstrings for `gensim.utils` ([@kakshay21](https://github.com/kakshay21) & [@menshikh-iv](https://github.com/menshikh-iv), [#1797](https://github.com/RaRe-Technologies/gensim/pull/1797))
+* Fix docstrings for `gensim.matutils` ([@Cheukting](https://github.com/Cheukting) & [@menshikh-iv](https://github.com/menshikh-iv), [#1804](https://github.com/RaRe-Technologies/gensim/pull/1804))
+* Fix docstrings for `gensim.models.logentropy_model` ([@minggli](https://github.com/minggli) & [@menshikh-iv](https://github.com/menshikh-iv), [#1803](https://github.com/RaRe-Technologies/gensim/pull/1803))
+* Fix docstrings for `gensim.models.normmodel` ([@AustenLamacraft](https://github.com/AustenLamacraft) & [@menshikh-iv](https://github.com/menshikh-iv), [#1805](https://github.com/RaRe-Technologies/gensim/pull/1805))
+* Refactor API reference `gensim.topic_coherence`. Fix #1669 ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1714](https://github.com/RaRe-Technologies/gensim/pull/1714))
+* Fix documentation for `gensim.corpora.dictionary` and `gensim.corpora.hashdictionary`. Partial fix #1671 ([@CLearERR](https://github.com/CLearERR) & [@menshikh-iv](https://github.com/menshikh-iv), [#1814](https://github.com/RaRe-Technologies/gensim/pull/1814))
+* Fix documentation for `gensim.corpora`. Partial fix #1671 ([@anotherbugmaster](https://github.com/anotherbugmaster) & [@menshikh-iv](https://github.com/menshikh-iv), [#1729](https://github.com/RaRe-Technologies/gensim/pull/1729))
+* Update banner in doc pages ([@piskvorky](https://github.com/piskvorky), [#1865](https://github.com/RaRe-Technologies/gensim/pull/1865))
+* Fix errors in the doc2vec-lee notebook ([@PeterHamilton](https://github.com/PeterHamilton), [#1841](https://github.com/RaRe-Technologies/gensim/pull/1841))
+* Add wordnet mammal train file for Poincare notebook ([@jayantj](https://github.com/jayantj), [#1781](https://github.com/RaRe-Technologies/gensim/pull/1781))
+* Update Poincare notebooks (#1774) ([@jayantj](https://github.com/jayantj), [#1774](https://github.com/RaRe-Technologies/gensim/pull/1774))
+* Update contributing guide. Fix #1786 ([@menshikh-iv](https://github.com/menshikh-iv), [#1793](https://github.com/RaRe-Technologies/gensim/pull/1793))
+* Add `model_to_dict` one-liner to word2vec notebook. Fix #1269 ([@kakshay21](https://github.com/kakshay21), [#1776](https://github.com/RaRe-Technologies/gensim/pull/1776))
+* Add word embedding viz to word2vec notebook. Fix #1419 ([@markroxor](https://github.com/markroxor), [#1800](https://github.com/RaRe-Technologies/gensim/pull/1800))
+* Fix description of `sg` parameter for `gensim.models.FastText` ([@akutuzov](https://github.com/akutuzov), [#1801](https://github.com/RaRe-Technologies/gensim/pull/1801))
+* Fix typo in `doc2vec-IMDB`. Fix #1788 ([@apoorvaeternity](https://github.com/apoorvaeternity), [#1796](https://github.com/RaRe-Technologies/gensim/pull/1796))
+* Remove outdated bz2 examples from tutorials[2] ([@menshikh-iv](https://github.com/menshikh-iv), [#1868](https://github.com/RaRe-Technologies/gensim/pull/1868))
+* Remove outdated `bz2` + `MmCorpus` examples from tutorials ([@menshikh-iv](https://github.com/menshikh-iv), [#1867](https://github.com/RaRe-Technologies/gensim/pull/1867))
:+1: Improvements:
-* Refactor tests for `gensim.corpora.WikiCorpus` (__[@steremma](https://github.com/steremma)__, [#1821](https://github.com/RaRe-Technologies/gensim/pull/1821))
+* Refactor tests for `gensim.corpora.WikiCorpus` ([@steremma](https://github.com/steremma), [#1821](https://github.com/RaRe-Technologies/gensim/pull/1821))
:warning: Deprecations (will be removed in the next major release)
@@ -1016,7 +1105,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
:star2: New features:
-* New download API for corpora and pre-trained models (__[@chaitaliSaini](https://github.com/chaitaliSaini)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1705](https://github.com/RaRe-Technologies/gensim/pull/1705) & [#1632](https://github.com/RaRe-Technologies/gensim/pull/1632) & [#1492](https://github.com/RaRe-Technologies/gensim/pull/1492))
+* New download API for corpora and pre-trained models ([@chaitaliSaini](https://github.com/chaitaliSaini) & [@menshikh-iv](https://github.com/menshikh-iv), [#1705](https://github.com/RaRe-Technologies/gensim/pull/1705) & [#1632](https://github.com/RaRe-Technologies/gensim/pull/1632) & [#1492](https://github.com/RaRe-Technologies/gensim/pull/1492))
- Download large NLP datasets in one line of Python, then use with memory-efficient data streaming:
```python
import gensim.downloader as api
@@ -1046,7 +1135,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
- [Blog post](https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/) introducing the API and design decisions.
- [Notebook with examples](https://github.com/RaRe-Technologies/gensim/blob/be4500e4f0616ec2864c2ce70cb5d4db4b46512d/docs/notebooks/downloader_api_tutorial.ipynb)
-* New model: Poincaré embeddings (__[@jayantj](https://github.com/jayantj)__, [#1696](https://github.com/RaRe-Technologies/gensim/pull/1696) & [#1700](https://github.com/RaRe-Technologies/gensim/pull/1700) & [#1757](https://github.com/RaRe-Technologies/gensim/pull/1757) & [#1734](https://github.com/RaRe-Technologies/gensim/pull/1734))
+* New model: Poincaré embeddings ([@jayantj](https://github.com/jayantj), [#1696](https://github.com/RaRe-Technologies/gensim/pull/1696) & [#1700](https://github.com/RaRe-Technologies/gensim/pull/1700) & [#1757](https://github.com/RaRe-Technologies/gensim/pull/1757) & [#1734](https://github.com/RaRe-Technologies/gensim/pull/1734))
- Embed a graph (taxonomy) in the same way as word2vec embeds words:
```python
from gensim.models.poincare import PoincareRelations, PoincareModel
@@ -1067,7 +1156,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
- [Model introduction and the journey of its implementation](https://rare-technologies.com/implementing-poincare-embeddings/)
- [Original paper](https://arxiv.org/abs/1705.08039) on arXiv
-* Optimized FastText (__[@manneshiva](https://github.com/manneshiva)__, [#1742](https://github.com/RaRe-Technologies/gensim/pull/1742))
+* Optimized FastText ([@manneshiva](https://github.com/manneshiva), [#1742](https://github.com/RaRe-Technologies/gensim/pull/1742))
- New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation.
```python
import gensim.downloader as api
@@ -1090,54 +1179,54 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
```
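  - A minimal training sketch for the native implementation, assuming the 3.x parameter names of this release (`size` was later renamed to `vector_size` in Gensim 4.0):
  ```python
  from gensim.models import FastText
  from gensim.test.utils import common_texts  # tiny toy corpus bundled with gensim

  model = FastText(size=32, window=3, min_count=1)  # `size` in 3.x, `vector_size` in 4.0+
  model.build_vocab(sentences=common_texts)
  model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)

  # Vectors for out-of-vocabulary words are composed from character n-grams.
  print(model.wv["computation"][:5])
  ```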
-* Binary pre-compiled wheels for Windows, OSX and Linux (__[@menshikh-iv](https://github.com/menshikh-iv)__, [MacPython/gensim-wheels/#7](https://github.com/MacPython/gensim-wheels/pull/7))
+* Binary pre-compiled wheels for Windows, OSX and Linux ([@menshikh-iv](https://github.com/menshikh-iv), [MacPython/gensim-wheels/#7](https://github.com/MacPython/gensim-wheels/pull/7))
  - Users no longer need a C compiler to use the fast (Cythonized) version of word2vec, doc2vec, etc.
- Faster Gensim pip installation
* Added `DeprecationWarnings` to deprecated methods and parameters, with a clear schedule for removal.
:+1: Improvements:
-* Add Montemurro and Zanette's entropy based keyword extraction algorithm. Fix #665 (__[@PeteBleackley](https://github.com/PeteBleackley)__, [#1738](https://github.com/RaRe-Technologies/gensim/pull/1738))
-* Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (__[@horpto](https://github.com/horpto)__, [#1689](https://github.com/RaRe-Technologies/gensim/pull/1689))
-* Reduce distribution size. Fix #1698 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1699](https://github.com/RaRe-Technologies/gensim/pull/1699))
-* Improve `scan_vocab` speed, `build_vocab_from_freq` method (__[@jodevak](https://github.com/jodevak)__, [#1695](https://github.com/RaRe-Technologies/gensim/pull/1695))
-* Improve `segment_wiki` script (__[@piskvorky](https://github.com/piskvorky)__, [#1707](https://github.com/RaRe-Technologies/gensim/pull/1707))
-* Add custom `dtype` support for `LdaModel`. Partially fix #1576 (__[@xelez](https://github.com/xelez)__, [#1656](https://github.com/RaRe-Technologies/gensim/pull/1656))
-* Add `doc2idx` method for `gensim.corpora.Dictionary`. Fix #1634 (__[@roopalgarg](https://github.com/roopalgarg)__, [#1720](https://github.com/RaRe-Technologies/gensim/pull/1720))
-* Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1721](https://github.com/RaRe-Technologies/gensim/pull/1721))
-* Add flag for hiding outdated data for `gensim.downloader.info` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1736](https://github.com/RaRe-Technologies/gensim/pull/1736))
-* Add reproducible order between python versions for `gensim.corpora.Dictionary` (__[@formi23](https://github.com/formi23)__, [#1715](https://github.com/RaRe-Technologies/gensim/pull/1715))
-* Update `tox.ini`, `setup.cfg`, `README.md` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1741](https://github.com/RaRe-Technologies/gensim/pull/1741))
-* Add custom `logsumexp` for `LdaModel` (__[@arlenk](https://github.com/arlenk)__, [#1745](https://github.com/RaRe-Technologies/gensim/pull/1745))
+* Add Montemurro and Zanette's entropy based keyword extraction algorithm. Fix #665 ([@PeteBleackley](https://github.com/PeteBleackley), [#1738](https://github.com/RaRe-Technologies/gensim/pull/1738))
+* Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 ([@horpto](https://github.com/horpto), [#1689](https://github.com/RaRe-Technologies/gensim/pull/1689))
+* Reduce distribution size. Fix #1698 ([@menshikh-iv](https://github.com/menshikh-iv), [#1699](https://github.com/RaRe-Technologies/gensim/pull/1699))
+* Improve `scan_vocab` speed, `build_vocab_from_freq` method ([@jodevak](https://github.com/jodevak), [#1695](https://github.com/RaRe-Technologies/gensim/pull/1695))
+* Improve `segment_wiki` script ([@piskvorky](https://github.com/piskvorky), [#1707](https://github.com/RaRe-Technologies/gensim/pull/1707))
+* Add custom `dtype` support for `LdaModel`. Partially fix #1576 ([@xelez](https://github.com/xelez), [#1656](https://github.com/RaRe-Technologies/gensim/pull/1656))
+* Add `doc2idx` method for `gensim.corpora.Dictionary`. Fix #1634 ([@roopalgarg](https://github.com/roopalgarg), [#1720](https://github.com/RaRe-Technologies/gensim/pull/1720))
+* Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 ([@menshikh-iv](https://github.com/menshikh-iv), [#1721](https://github.com/RaRe-Technologies/gensim/pull/1721))
+* Add flag for hiding outdated data for `gensim.downloader.info` ([@menshikh-iv](https://github.com/menshikh-iv), [#1736](https://github.com/RaRe-Technologies/gensim/pull/1736))
+* Add reproducible order between python versions for `gensim.corpora.Dictionary` ([@formi23](https://github.com/formi23), [#1715](https://github.com/RaRe-Technologies/gensim/pull/1715))
+* Update `tox.ini`, `setup.cfg`, `README.md` ([@menshikh-iv](https://github.com/menshikh-iv), [#1741](https://github.com/RaRe-Technologies/gensim/pull/1741))
+* Add custom `logsumexp` for `LdaModel` ([@arlenk](https://github.com/arlenk), [#1745](https://github.com/RaRe-Technologies/gensim/pull/1745))
:red_circle: Bug fixes:
-* Fix ranking formula in `gensim.summarization.bm25`. Fix #1718 (__[@souravsingh](https://github.com/souravsingh)__, [#1726](https://github.com/RaRe-Technologies/gensim/pull/1726))
-* Fixed incompatibility in persistence for `FastText` wrapper. Fix #1642 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1723](https://github.com/RaRe-Technologies/gensim/pull/1723))
-* Fix `gensim.sklearn_api` bug with `documents_columns` parameter. Fix #1676 (__[@chinmayapancholi13](https://github.com/chinmayapancholi13)__, [#1704](https://github.com/RaRe-Technologies/gensim/pull/1704))
-* Fix slowdown of CI, remove pytest-cov (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1728](https://github.com/RaRe-Technologies/gensim/pull/1728))
-* Replace outdated packages in Dockerfile (__[@rbahumi](https://github.com/rbahumi)__, [#1730](https://github.com/RaRe-Technologies/gensim/pull/1730))
-* Replace `num_words` to `topn` in `LdaMallet.show_topics`. Fix #1747 (__[@apoorvaeternity](https://github.com/apoorvaeternity)__, [#1749](https://github.com/RaRe-Technologies/gensim/pull/1749))
-* Fix `os.rename` from `gensim.downloader` when 'src' and 'dst' on different partitions (__[@anotherbugmaster](https://github.com/anotherbugmaster)__, [#1733](https://github.com/RaRe-Technologies/gensim/pull/1733))
-* Fix `DeprecationWarning` from `logsumexp` (__[@dreamgonfly](https://github.com/dreamgonfly)__, [#1703](https://github.com/RaRe-Technologies/gensim/pull/1703))
-* Fix backward compatibility problem in `Phrases.load`. Fix #1751 (__[@alexgarel](https://github.com/alexgarel)__, [#1758](https://github.com/RaRe-Technologies/gensim/pull/1758))
-* Fix `load_word2vec_format` from `FastText`. Fix #1743 (__[@manneshiva](https://github.com/manneshiva)__, [#1755](https://github.com/RaRe-Technologies/gensim/pull/1755))
-* Fix ipython kernel version in `Dockerfile`. Fix #1762 (__[@rbahumi](https://github.com/rbahumi)__, [#1764](https://github.com/RaRe-Technologies/gensim/pull/1764))
-* Fix writing in `segment_wiki` (__[@horpto](https://github.com/horpto)__, [#1763](https://github.com/RaRe-Technologies/gensim/pull/1763))
-* Fix write method of file requires byte-like object in `segment_wiki` (__[@horpto](https://github.com/horpto)__, [#1750](https://github.com/RaRe-Technologies/gensim/pull/1750))
-* Fix incorrect vectors learned during online training for `FastText`. Fix #1752 (__[@manneshiva](https://github.com/manneshiva)__, [#1756](https://github.com/RaRe-Technologies/gensim/pull/1756))
-* Fix `dtype` of `model.wv.syn0_vocab` on updating `vocab` for `FastText`. Fix #1759 (__[@manneshiva](https://github.com/manneshiva)__, [#1760](https://github.com/RaRe-Technologies/gensim/pull/1760))
-* Fix hashing-trick from `FastText.build_vocab`. Fix #1765 (__[@manneshiva](https://github.com/manneshiva)__, [#1768](https://github.com/RaRe-Technologies/gensim/pull/1768))
-* Add explicit `DeprecationWarning` for all outdated stuff. Fix #1753 (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1769](https://github.com/RaRe-Technologies/gensim/pull/1769))
-* Fix epsilon according to `dtype` in `LdaModel` (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1770](https://github.com/RaRe-Technologies/gensim/pull/1770))
+* Fix ranking formula in `gensim.summarization.bm25`. Fix #1718 ([@souravsingh](https://github.com/souravsingh), [#1726](https://github.com/RaRe-Technologies/gensim/pull/1726))
+* Fixed incompatibility in persistence for `FastText` wrapper. Fix #1642 ([@chinmayapancholi13](https://github.com/chinmayapancholi13), [#1723](https://github.com/RaRe-Technologies/gensim/pull/1723))
+* Fix `gensim.sklearn_api` bug with `documents_columns` parameter. Fix #1676 ([@chinmayapancholi13](https://github.com/chinmayapancholi13), [#1704](https://github.com/RaRe-Technologies/gensim/pull/1704))
+* Fix slowdown of CI, remove pytest-cov ([@menshikh-iv](https://github.com/menshikh-iv), [#1728](https://github.com/RaRe-Technologies/gensim/pull/1728))
+* Replace outdated packages in Dockerfile ([@rbahumi](https://github.com/rbahumi), [#1730](https://github.com/RaRe-Technologies/gensim/pull/1730))
+* Replace `num_words` to `topn` in `LdaMallet.show_topics`. Fix #1747 ([@apoorvaeternity](https://github.com/apoorvaeternity), [#1749](https://github.com/RaRe-Technologies/gensim/pull/1749))
+* Fix `os.rename` from `gensim.downloader` when 'src' and 'dst' on different partitions ([@anotherbugmaster](https://github.com/anotherbugmaster), [#1733](https://github.com/RaRe-Technologies/gensim/pull/1733))
+* Fix `DeprecationWarning` from `logsumexp` ([@dreamgonfly](https://github.com/dreamgonfly), [#1703](https://github.com/RaRe-Technologies/gensim/pull/1703))
+* Fix backward compatibility problem in `Phrases.load`. Fix #1751 ([@alexgarel](https://github.com/alexgarel), [#1758](https://github.com/RaRe-Technologies/gensim/pull/1758))
+* Fix `load_word2vec_format` from `FastText`. Fix #1743 ([@manneshiva](https://github.com/manneshiva), [#1755](https://github.com/RaRe-Technologies/gensim/pull/1755))
+* Fix ipython kernel version in `Dockerfile`. Fix #1762 ([@rbahumi](https://github.com/rbahumi), [#1764](https://github.com/RaRe-Technologies/gensim/pull/1764))
+* Fix writing in `segment_wiki` ([@horpto](https://github.com/horpto), [#1763](https://github.com/RaRe-Technologies/gensim/pull/1763))
+* Fix write method of file requires byte-like object in `segment_wiki` ([@horpto](https://github.com/horpto), [#1750](https://github.com/RaRe-Technologies/gensim/pull/1750))
+* Fix incorrect vectors learned during online training for `FastText`. Fix #1752 ([@manneshiva](https://github.com/manneshiva), [#1756](https://github.com/RaRe-Technologies/gensim/pull/1756))
+* Fix `dtype` of `model.wv.syn0_vocab` on updating `vocab` for `FastText`. Fix #1759 ([@manneshiva](https://github.com/manneshiva), [#1760](https://github.com/RaRe-Technologies/gensim/pull/1760))
+* Fix hashing-trick from `FastText.build_vocab`. Fix #1765 ([@manneshiva](https://github.com/manneshiva), [#1768](https://github.com/RaRe-Technologies/gensim/pull/1768))
+* Add explicit `DeprecationWarning` for all outdated stuff. Fix #1753 ([@menshikh-iv](https://github.com/menshikh-iv), [#1769](https://github.com/RaRe-Technologies/gensim/pull/1769))
+* Fix epsilon according to `dtype` in `LdaModel` ([@menshikh-iv](https://github.com/menshikh-iv), [#1770](https://github.com/RaRe-Technologies/gensim/pull/1770))
:books: Tutorial and doc improvements:
-* Update perf numbers of `segment_wiki` (__[@piskvorky](https://github.com/piskvorky)__, [#1708](https://github.com/RaRe-Technologies/gensim/pull/1708))
-* Update docstring for `gensim.summarization.summarize`. Fix #1575 (__[@fbarrios](https://github.com/fbarrios)__, [#1702](https://github.com/RaRe-Technologies/gensim/pull/1702))
-* Refactor API Reference for `gensim.parsing`. Fix #1664 (__[@CLearERR](https://github.com/CLearERR)__, [#1684](https://github.com/RaRe-Technologies/gensim/pull/1684))
-* Fix typos in doc2vec-wikipedia notebook (__[@youqad](https://github.com/youqad)__, [#1727](https://github.com/RaRe-Technologies/gensim/pull/1727))
-* Fix PyPI long description rendering (__[@edigaryev](https://github.com/edigaryev)__, [#1739](https://github.com/RaRe-Technologies/gensim/pull/1739))
-* Fix twitter badge src (__[@menshikh-iv](https://github.com/menshikh-iv)__)
-* Fix maillist badge color (__[@menshikh-iv](https://github.com/menshikh-iv)__)
+* Update perf numbers of `segment_wiki` ([@piskvorky](https://github.com/piskvorky), [#1708](https://github.com/RaRe-Technologies/gensim/pull/1708))
+* Update docstring for `gensim.summarization.summarize`. Fix #1575 ([@fbarrios](https://github.com/fbarrios), [#1702](https://github.com/RaRe-Technologies/gensim/pull/1702))
+* Refactor API Reference for `gensim.parsing`. Fix #1664 ([@CLearERR](https://github.com/CLearERR), [#1684](https://github.com/RaRe-Technologies/gensim/pull/1684))
+* Fix typos in doc2vec-wikipedia notebook ([@youqad](https://github.com/youqad), [#1727](https://github.com/RaRe-Technologies/gensim/pull/1727))
+* Fix PyPI long description rendering ([@edigaryev](https://github.com/edigaryev), [#1739](https://github.com/RaRe-Technologies/gensim/pull/1739))
+* Fix twitter badge src ([@menshikh-iv](https://github.com/menshikh-iv))
+* Fix maillist badge color ([@menshikh-iv](https://github.com/menshikh-iv))
:warning: Deprecations (will be removed in the next major release)
* Remove
@@ -1162,7 +1251,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
:star2: New features:
-* Massive optimizations to LSI model training (__[@isamaru](https://github.com/isamaru)__, [#1620](https://github.com/RaRe-Technologies/gensim/pull/1620) & [#1622](https://github.com/RaRe-Technologies/gensim/pull/1622))
+* Massive optimizations to LSI model training ([@isamaru](https://github.com/isamaru), [#1620](https://github.com/RaRe-Technologies/gensim/pull/1620) & [#1622](https://github.com/RaRe-Technologies/gensim/pull/1622))
  - The LSI model now supports single precision (float32), consuming *40% less memory* while being *40% faster*.
  - The LSI model can now also accept a CSC matrix as input, for a further memory and speed boost.
- Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!
@@ -1180,7 +1269,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
```python
model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
```
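  - And a hedged sketch of the new CSC input path (the toy matrix is illustrative; whether you also need an `id2word` mapping depends on your downstream use):
  ```python
  import numpy as np
  from scipy.sparse import random as sparse_random
  from gensim.models import LsiModel

  # Illustrative sparse term-document matrix: 1000 terms x 200 documents, CSC format.
  term_doc = sparse_random(1000, 200, density=0.01, format="csc", dtype=np.float32)

  lsi = LsiModel(corpus=term_doc, num_topics=50, dtype=np.float32)
  ```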
-* Add common terms to Phrases. Fix #1258 (__[@alexgarel](https://github.com/alexgarel)__, [#1568](https://github.com/RaRe-Technologies/gensim/pull/1568))
+* Add common terms to Phrases. Fix #1258 ([@alexgarel](https://github.com/alexgarel), [#1568](https://github.com/RaRe-Technologies/gensim/pull/1568))
  - Phrases now supports "common terms" (stop words) inside bigrams, inspired by the [ES common grams token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-common-grams-tokenfilter.html). Previously, to reveal ngrams like `car_with_driver` and `car_without_driver`, you could either remove stop words before processing (and then only find `car_driver`), or keep them and find neither form, because they span three words and the high frequency of `with` prevents them from being scored correctly. A constructor sketch follows the example below:
```python
phr_old = Phrases(corpus)
@@ -1189,7 +1278,7 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
print(phr_old[["we", "provide", "car", "with", "driver"]]) # ["we", "provide", "car_with", "driver"]
print(phr_new[["we", "provide", "car", "with", "driver"]]) # ["we", "provide", "car_with_driver"]
```
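  - A constructor sketch, with an illustrative toy corpus and stop-word set (`common_terms` is the 3.x name of the parameter; it was later renamed `connector_words` in Gensim 4.0):
  ```python
  from gensim.models.phrases import Phrases

  # Toy corpus, repeated so the co-occurrence statistics clear the (lowered) threshold.
  sentences = [
      ["we", "provide", "car", "with", "driver"],
      ["we", "provide", "car", "without", "driver"],
  ] * 100

  stop_words = frozenset(["with", "without"])  # illustrative connector words
  phrases = Phrases(sentences, min_count=1, threshold=0.01, common_terms=stop_words)

  print(phrases[["we", "provide", "car", "with", "driver"]])
  ```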
-* New [segment_wiki.py](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py) script (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1483](https://github.com/RaRe-Technologies/gensim/pull/1483) & [#1694](https://github.com/RaRe-Technologies/gensim/pull/1694))
+* New [segment_wiki.py](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/segment_wiki.py) script ([@menshikh-iv](https://github.com/menshikh-iv), [#1483](https://github.com/RaRe-Technologies/gensim/pull/1483) & [#1694](https://github.com/RaRe-Technologies/gensim/pull/1694))
  - CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by WikiMedia) to extract its articles in plain text. It extracts each article's title, section names and section content, and saves them as JSON lines (one article per line); a reading sketch follows the command below:
```bash
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz
@@ -1207,33 +1296,33 @@ Apart from the **massive overhaul of all Gensim documentation** (including docst
```
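  - And a hedged sketch of consuming the resulting file (the per-line keys `title`, `section_titles` and `section_texts` are the fields this script is documented to emit; treat them as assumptions if your version differs):
  ```python
  import json
  from smart_open import open  # transparently handles the .gz compression

  with open("enwiki-latest-pages-articles.json.gz") as fin:
      for line in fin:
          article = json.loads(line)
          # Assumed per-line keys: "title", "section_titles", "section_texts".
          print(article["title"], "-", len(article["section_titles"]), "sections")
          break  # just peek at the first article
  ```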
:+1: Improvements:
-* Speedup FastText tests (__[@horpto](https://github.com/horpto)__, [#1686](https://github.com/RaRe-Technologies/gensim/pull/1686))
-* Add optimization for `SlicedCorpus.__len__` (__[@horpto](https://github.com/horpto)__, [#1679](https://github.com/RaRe-Technologies/gensim/pull/1679))
-* Make `word_vec` return immutable vector. Fix #1651 (__[@CLearERR](https://github.com/CLearERR)__, [#1662](https://github.com/RaRe-Technologies/gensim/pull/1662))
-* Drop Win x32 support & add rolling builds (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1652](https://github.com/RaRe-Technologies/gensim/pull/1652))
-* Fix scoring function in Phrases. Fix #1533, #1635 (__[@michaelwsherman](https://github.com/michaelwsherman)__, [#1573](https://github.com/RaRe-Technologies/gensim/pull/1573))
-* Add configuration for flake8 to setup.cfg (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1636](https://github.com/RaRe-Technologies/gensim/pull/1636))
-* Add `build_vocab_from_freq` to Word2Vec, speedup scan\_vocab (__[@jodevak](https://github.com/jodevak)__, [#1599](https://github.com/RaRe-Technologies/gensim/pull/1599))
-* Add `most_similar_to_given` method for KeyedVectors (__[@TheMathMajor](https://github.com/TheMathMajor)__, [#1582](https://github.com/RaRe-Technologies/gensim/pull/1582))
-* Add `__getitem__` method to Sparse2Corpus to allow direct queries (__[@isamaru](https://github.com/isamaru)__, [#1621](https://github.com/RaRe-Technologies/gensim/pull/1621))
+* Speedup FastText tests ([@horpto](https://github.com/horpto), [#1686](https://github.com/RaRe-Technologies/gensim/pull/1686))
+* Add optimization for `SlicedCorpus.len` ([@horpto](https://github.com/horpto), [#1679](https://github.com/RaRe-Technologies/gensim/pull/1679))
+* Make `word_vec` return immutable vector. Fix #1651 ([@CLearERR](https://github.com/CLearERR), [#1662](https://github.com/RaRe-Technologies/gensim/pull/1662))
+* Drop Win x32 support & add rolling builds ([@menshikh-iv](https://github.com/menshikh-iv), [#1652](https://github.com/RaRe-Technologies/gensim/pull/1652))
+* Fix scoring function in Phrases. Fix #1533, #1635 ([@michaelwsherman](https://github.com/michaelwsherman), [#1573](https://github.com/RaRe-Technologies/gensim/pull/1573))
+* Add configuration for flake8 to setup.cfg ([@mcobzarenco](https://github.com/mcobzarenco), [#1636](https://github.com/RaRe-Technologies/gensim/pull/1636))
+* Add `build_vocab_from_freq` to Word2Vec, speedup scan\_vocab ([@jodevak](https://github.com/jodevak), [#1599](https://github.com/RaRe-Technologies/gensim/pull/1599))
+* Add `most_similar_to_given` method for KeyedVectors ([@TheMathMajor](https://github.com/TheMathMajor), [#1582](https://github.com/RaRe-Technologies/gensim/pull/1582))
+* Add `getitem` method to Sparse2Corpus to allow direct queries ([@isamaru](https://github.com/isamaru), [#1621](https://github.com/RaRe-Technologies/gensim/pull/1621))
:red_circle: Bug fixes:
-* Add single core mode to CoherenceModel. Fix #1683 (__[@horpto](https://github.com/horpto)__, [#1685](https://github.com/RaRe-Technologies/gensim/pull/1685))
-* Fix ResourceWarnings in tests. Partially fix #1519 (__[@horpto](https://github.com/horpto)__, [#1660](https://github.com/RaRe-Technologies/gensim/pull/1660))
-* Fix DeprecationWarnings generated by deprecated assertEquals. Partial fix #1519 (__[@poornagurram](https://github.com/poornagurram)__, [#1658](https://github.com/RaRe-Technologies/gensim/pull/1658))
-* Fix DeprecationWarnings for regex string literals. Fix #1646 (__[@franklsf95](https://github.com/franklsf95)__, [#1649](https://github.com/RaRe-Technologies/gensim/pull/1649))
-* Fix pagerank algorithm. Fix #805 (__[@xelez](https://github.com/xelez)__, [#1653](https://github.com/RaRe-Technologies/gensim/pull/1653))
-* Fix FastText inconsistent dtype. Fix #1637 (__[@mcobzarenco](https://github.com/mcobzarenco)__, [#1638](https://github.com/RaRe-Technologies/gensim/pull/1638))
-* Fix `test_filename_filtering` test (__[@nehaljwani](https://github.com/nehaljwani)__, [#1647](https://github.com/RaRe-Technologies/gensim/pull/1647))
+* Add single core mode to CoherenceModel. Fix #1683 ([@horpto](https://github.com/horpto), [#1685](https://github.com/RaRe-Technologies/gensim/pull/1685))
+* Fix ResourceWarnings in tests. Partially fix #1519 ([@horpto](https://github.com/horpto), [#1660](https://github.com/RaRe-Technologies/gensim/pull/1660))
+* Fix DeprecationWarnings generated by deprecated assertEquals. Partial fix #1519 ([@poornagurram](https://github.com/poornagurram), [#1658](https://github.com/RaRe-Technologies/gensim/pull/1658))
+* Fix DeprecationWarnings for regex string literals. Fix #1646 ([@franklsf95](https://github.com/franklsf95), [#1649](https://github.com/RaRe-Technologies/gensim/pull/1649))
+* Fix pagerank algorithm. Fix #805 ([@xelez](https://github.com/xelez), [#1653](https://github.com/RaRe-Technologies/gensim/pull/1653))
+* Fix FastText inconsistent dtype. Fix #1637 ([@mcobzarenco](https://github.com/mcobzarenco), [#1638](https://github.com/RaRe-Technologies/gensim/pull/1638))
+* Fix `test_filename_filtering` test ([@nehaljwani](https://github.com/nehaljwani), [#1647](https://github.com/RaRe-Technologies/gensim/pull/1647))
:books: Tutorial and doc improvements:
-* Fix code/docstring style (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1650](https://github.com/RaRe-Technologies/gensim/pull/1650))
-* Update error message for supervised FastText. Fix #1498 (__[@ElSaico](https://github.com/ElSaico)__, [#1645](https://github.com/RaRe-Technologies/gensim/pull/1645))
-* Add "DOI badge" to README. Fix #1610 (__[@dphov](https://github.com/dphov)__, [#1639](https://github.com/RaRe-Technologies/gensim/pull/1639))
-* Remove duplicate annoy notebook. Fix #1415 (__[@Karamax](https://github.com/Karamax)__, [#1640](https://github.com/RaRe-Technologies/gensim/pull/1640))
-* Fix duplication and wrong markup in docs (__[@horpto](https://github.com/horpto)__, [#1633](https://github.com/RaRe-Technologies/gensim/pull/1633))
-* Refactor dendrogram & topic network notebooks (__[@parulsethi](https://github.com/parulsethi)__, [#1571](https://github.com/RaRe-Technologies/gensim/pull/1571))
-* Fix release badge (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1631](https://github.com/RaRe-Technologies/gensim/pull/1631))
+* Fix code/docstring style ([@menshikh-iv](https://github.com/menshikh-iv), [#1650](https://github.com/RaRe-Technologies/gensim/pull/1650))
+* Update error message for supervised FastText. Fix #1498 ([@ElSaico](https://github.com/ElSaico), [#1645](https://github.com/RaRe-Technologies/gensim/pull/1645))
+* Add "DOI badge" to README. Fix #1610 ([@dphov](https://github.com/dphov), [#1639](https://github.com/RaRe-Technologies/gensim/pull/1639))
+* Remove duplicate annoy notebook. Fix #1415 ([@Karamax](https://github.com/Karamax), [#1640](https://github.com/RaRe-Technologies/gensim/pull/1640))
+* Fix duplication and wrong markup in docs ([@horpto](https://github.com/horpto), [#1633](https://github.com/RaRe-Technologies/gensim/pull/1633))
+* Refactor dendrogram & topic network notebooks ([@parulsethi](https://github.com/parulsethi), [#1571](https://github.com/RaRe-Technologies/gensim/pull/1571))
+* Fix release badge ([@menshikh-iv](https://github.com/menshikh-iv), [#1631](https://github.com/RaRe-Technologies/gensim/pull/1631))
:warning: Deprecations (will come into force in the next major release)
* Remove
@@ -1676,7 +1765,7 @@ Tutorial and doc improvements:
* Tutorial: Reproducing the Doc2vec paper results on Wikipedia (@isohyt, [#654](https://github.com/RaRe-Technologies/gensim/pull/654))
* Add Save/Load interface to AnnoyIndexer for index persistence (@fortiema, [#845](https://github.com/RaRe-Technologies/gensim/pull/845))
* Fixed issue [#938](https://github.com/RaRe-Technologies/gensim/issues/938), creating a unified base class for all topic models. ([@markroxor](https://github.com/markroxor), [#946](https://github.com/RaRe-Technologies/gensim/pull/946))
- - _breaking change in HdpTopicFormatter.show_\__topics_
+ - breaking change in `HdpTopicFormatter.show_topics`
* Add Phraser for Phrases optimization (@gojomo & @anujkhare, [#837](https://github.com/RaRe-Technologies/gensim/pull/837))
* Fix issue #743: word2vec's `n_similarity` method raised `ZeroDivisionError` if at least one empty list was passed (@pranay360, [#883](https://github.com/RaRe-Technologies/gensim/pull/883))
* Change export_phrases in Phrases model. Fix issue #794 (@AadityaJ, [#879](https://github.com/RaRe-Technologies/gensim/pull/879))
diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb
index 998115a80e..40a3324206 100644
--- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb
+++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb
@@ -152,7 +152,7 @@
},
"outputs": [],
"source": [
- "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus(object):\n def __iter__(self):\n for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())"
+ "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus:\n def __iter__(self):\n for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())"
]
},
{
diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py
index 5a77b4e637..0a49614123 100644
--- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py
+++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py
@@ -136,7 +136,7 @@
from smart_open import open # for transparently opening remote files
-class MyCorpus(object):
+class MyCorpus:
def __iter__(self):
for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
# assume there's one document per line, tokens separated by whitespace
diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5 b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5
index 9e8401aae5..935e0357af 100644
--- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5
+++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5
@@ -1 +1 @@
-c239d5c523ea2b3af1f6d4c6c51e7925
\ No newline at end of file
+6b98413399bca9fd1ed8fe420da85692
\ No newline at end of file
diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst
index 4b55ff959e..7f8d25cfec 100644
--- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst
+++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst
@@ -159,10 +159,10 @@ between the questions and ids is called a dictionary:
.. code-block:: none
- 2020-10-19 01:23:37,722 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
- 2020-10-19 01:23:37,722 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
- 2020-10-19 01:23:37,722 : INFO : saving Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) under /tmp/deerwester.dict, separately None
- 2020-10-19 01:23:37,723 : INFO : saved /tmp/deerwester.dict
+ 2020-10-28 00:52:02,550 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
+ 2020-10-28 00:52:02,550 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
+ 2020-10-28 00:52:02,550 : INFO : saving Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) under /tmp/deerwester.dict, separately None
+ 2020-10-28 00:52:02,552 : INFO : saved /tmp/deerwester.dict
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
@@ -244,11 +244,11 @@ therefore reads: in the document `"Human computer interaction"`, the words `comp
.. code-block:: none
- 2020-10-19 01:23:38,012 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
- 2020-10-19 01:23:38,013 : INFO : saving sparse matrix to /tmp/deerwester.mm
- 2020-10-19 01:23:38,013 : INFO : PROGRESS: saving document #0
- 2020-10-19 01:23:38,016 : INFO : saved 9x12 matrix, density=25.926% (28/108)
- 2020-10-19 01:23:38,016 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index
+ 2020-10-28 00:52:02,830 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
+ 2020-10-28 00:52:02,832 : INFO : saving sparse matrix to /tmp/deerwester.mm
+ 2020-10-28 00:52:02,832 : INFO : PROGRESS: saving document #0
+ 2020-10-28 00:52:02,834 : INFO : saved 9x12 matrix, density=25.926% (28/108)
+ 2020-10-28 00:52:02,834 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
@@ -276,7 +276,7 @@ only requires that a corpus must be able to return one document vector at a time
from smart_open import open # for transparently opening remote files
- class MyCorpus(object):
+ class MyCorpus:
def __iter__(self):
for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
# assume there's one document per line, tokens separated by whitespace
@@ -334,7 +334,7 @@ then convert the tokens via a dictionary to their ids and yield the resulting sp
.. code-block:: none
- <__main__.MyCorpus object at 0x117e06828>
+ <__main__.MyCorpus object at 0x11e77bb38>
@@ -406,8 +406,8 @@ Similarly, to construct the dictionary without loading all texts into memory:
.. code-block:: none
- 2020-10-19 01:23:38,980 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
- 2020-10-19 01:23:38,981 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
+ 2020-10-28 00:52:04,241 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
+ 2020-10-28 00:52:04,243 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
@@ -454,11 +454,11 @@ create a toy corpus of 2 documents, as a plain Python list
.. code-block:: none
- 2020-10-19 01:23:39,099 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
- 2020-10-19 01:23:39,100 : INFO : saving sparse matrix to /tmp/corpus.mm
- 2020-10-19 01:23:39,100 : INFO : PROGRESS: saving document #0
- 2020-10-19 01:23:39,101 : INFO : saved 2x2 matrix, density=25.000% (1/4)
- 2020-10-19 01:23:39,101 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
+ 2020-10-28 00:52:04,368 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
+ 2020-10-28 00:52:04,370 : INFO : saving sparse matrix to /tmp/corpus.mm
+ 2020-10-28 00:52:04,370 : INFO : PROGRESS: saving document #0
+ 2020-10-28 00:52:04,370 : INFO : saved 2x2 matrix, density=25.000% (1/4)
+ 2020-10-28 00:52:04,370 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
@@ -486,16 +486,16 @@ Other formats include `Joachim's SVMlight format
.. code-block:: none
- 2020-10-19 01:23:39,152 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
- 2020-10-19 01:23:39,153 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index
- 2020-10-19 01:23:39,154 : INFO : no word id mapping provided; initializing from corpus
- 2020-10-19 01:23:39,154 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
- 2020-10-19 01:23:39,154 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
- 2020-10-19 01:23:39,154 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
- 2020-10-19 01:23:39,206 : INFO : no word id mapping provided; initializing from corpus
- 2020-10-19 01:23:39,207 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
- 2020-10-19 01:23:39,207 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value
- 2020-10-19 01:23:39,207 : INFO : saving LowCorpus index to /tmp/corpus.low.index
+ 2020-10-28 00:52:04,425 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
+ 2020-10-28 00:52:04,426 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index
+ 2020-10-28 00:52:04,427 : INFO : no word id mapping provided; initializing from corpus
+ 2020-10-28 00:52:04,427 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
+ 2020-10-28 00:52:04,427 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
+ 2020-10-28 00:52:04,427 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
+ 2020-10-28 00:52:04,481 : INFO : no word id mapping provided; initializing from corpus
+ 2020-10-28 00:52:04,481 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
+ 2020-10-28 00:52:04,482 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value
+ 2020-10-28 00:52:04,482 : INFO : saving LowCorpus index to /tmp/corpus.low.index
@@ -518,9 +518,9 @@ Conversely, to load a corpus iterator from a Matrix Market file:
.. code-block:: none
- 2020-10-19 01:23:39,260 : INFO : loaded corpus index from /tmp/corpus.mm.index
- 2020-10-19 01:23:39,262 : INFO : initializing cython corpus reader from /tmp/corpus.mm
- 2020-10-19 01:23:39,262 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
+ 2020-10-28 00:52:04,538 : INFO : loaded corpus index from /tmp/corpus.mm.index
+ 2020-10-28 00:52:04,540 : INFO : initializing cython corpus reader from /tmp/corpus.mm
+ 2020-10-28 00:52:04,540 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
@@ -619,10 +619,10 @@ To save the same Matrix Market document stream in Blei's LDA-C format,
.. code-block:: none
- 2020-10-19 01:23:39,634 : INFO : no word id mapping provided; initializing from corpus
- 2020-10-19 01:23:39,636 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
- 2020-10-19 01:23:39,636 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
- 2020-10-19 01:23:39,636 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
+ 2020-10-28 00:52:04,921 : INFO : no word id mapping provided; initializing from corpus
+ 2020-10-28 00:52:04,922 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
+ 2020-10-28 00:52:04,923 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
+ 2020-10-28 00:52:04,923 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
@@ -710,9 +710,9 @@ Optimize converting between corpora and NumPy/SciPy arrays?), see the :ref:`apir
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 0 minutes 2.979 seconds)
+ **Total running time of the script:** ( 0 minutes 4.010 seconds)
-**Estimated memory usage:** 39 MB
+**Estimated memory usage:** 40 MB
.. _sphx_glr_download_auto_examples_core_run_corpora_and_vector_spaces.py:
diff --git a/docs/src/auto_examples/core/sg_execution_times.rst b/docs/src/auto_examples/core/sg_execution_times.rst
index d346e546cb..9e36b38b09 100644
--- a/docs/src/auto_examples/core/sg_execution_times.rst
+++ b/docs/src/auto_examples/core/sg_execution_times.rst
@@ -5,10 +5,10 @@
Computation times
=================
-**00:02.979** total execution time for **auto_examples_core** files:
+**00:04.010** total execution time for **auto_examples_core** files:
+--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:02.979 | 38.7 MB |
+| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:04.010 | 39.8 MB |
+--------------------------------------------------------------------------------------------------------------+-----------+---------+
| :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``) | 00:00.000 | 0.0 MB |
+--------------------------------------------------------------------------------------------------------------+-----------+---------+
diff --git a/docs/src/auto_examples/tutorials/images/sphx_glr_run_word2vec_001.png b/docs/src/auto_examples/tutorials/images/sphx_glr_run_word2vec_001.png
index 6fafecbcf3..35e81eff92 100644
Binary files a/docs/src/auto_examples/tutorials/images/sphx_glr_run_word2vec_001.png and b/docs/src/auto_examples/tutorials/images/sphx_glr_run_word2vec_001.png differ
diff --git a/docs/src/auto_examples/tutorials/images/thumb/sphx_glr_run_word2vec_thumb.png b/docs/src/auto_examples/tutorials/images/thumb/sphx_glr_run_word2vec_thumb.png
index 49032072ea..38b2ba3507 100644
Binary files a/docs/src/auto_examples/tutorials/images/thumb/sphx_glr_run_word2vec_thumb.png and b/docs/src/auto_examples/tutorials/images/thumb/sphx_glr_run_word2vec_thumb.png differ
diff --git a/docs/src/auto_examples/tutorials/run_word2vec.ipynb b/docs/src/auto_examples/tutorials/run_word2vec.ipynb
index 9f132e84c9..13f779c2c2 100644
--- a/docs/src/auto_examples/tutorials/run_word2vec.ipynb
+++ b/docs/src/auto_examples/tutorials/run_word2vec.ipynb
@@ -177,7 +177,7 @@
},
"outputs": [],
"source": [
- "from gensim.test.utils import datapath\nfrom gensim import utils\n\nclass MyCorpus(object):\n \"\"\"An interator that yields sentences (lists of str).\"\"\"\n\n def __iter__(self):\n corpus_path = datapath('lee_background.cor')\n for line in open(corpus_path):\n # assume there's one document per line, tokens separated by whitespace\n yield utils.simple_preprocess(line)"
+ "from gensim.test.utils import datapath\nfrom gensim import utils\n\nclass MyCorpus:\n \"\"\"An iterator that yields sentences (lists of str).\"\"\"\n\n def __iter__(self):\n corpus_path = datapath('lee_background.cor')\n for line in open(corpus_path):\n # assume there's one document per line, tokens separated by whitespace\n yield utils.simple_preprocess(line)"
]
},
{
diff --git a/docs/src/auto_examples/tutorials/run_word2vec.py b/docs/src/auto_examples/tutorials/run_word2vec.py
index 01b0e2bb86..c5ef323bb2 100644
--- a/docs/src/auto_examples/tutorials/run_word2vec.py
+++ b/docs/src/auto_examples/tutorials/run_word2vec.py
@@ -197,8 +197,8 @@
from gensim.test.utils import datapath
from gensim import utils
-class MyCorpus(object):
- """An interator that yields sentences (lists of str)."""
+class MyCorpus:
+ """An iterator that yields sentences (lists of str)."""
def __iter__(self):
corpus_path = datapath('lee_background.cor')
diff --git a/docs/src/auto_examples/tutorials/run_word2vec.py.md5 b/docs/src/auto_examples/tutorials/run_word2vec.py.md5
index cbd7db4cf6..6d0ea3457b 100644
--- a/docs/src/auto_examples/tutorials/run_word2vec.py.md5
+++ b/docs/src/auto_examples/tutorials/run_word2vec.py.md5
@@ -1 +1 @@
-559f9ed4b873b99bf4882096b146691d
\ No newline at end of file
+4598eccb1c465c724d8cfa99e216689d
\ No newline at end of file
diff --git a/docs/src/auto_examples/tutorials/run_word2vec.rst b/docs/src/auto_examples/tutorials/run_word2vec.rst
index 7d7ea275e1..dd40ba3ddf 100644
--- a/docs/src/auto_examples/tutorials/run_word2vec.rst
+++ b/docs/src/auto_examples/tutorials/run_word2vec.rst
@@ -159,15 +159,6 @@ this vector algebra for yourself. That demo runs ``word2vec`` on the
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:00:40,474 : INFO : loading projection weights from /Users/kofola3/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
- 2020-09-30 17:01:46,484 : INFO : loaded (3000000, 300) matrix from /Users/kofola3/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
-
@@ -351,8 +342,8 @@ would handle a larger corpus.
from gensim.test.utils import datapath
from gensim import utils
- class MyCorpus(object):
- """An interator that yields sentences (lists of str)."""
+ class MyCorpus:
+ """An iterator that yields sentences (lists of str)."""
def __iter__(self):
corpus_path = datapath('lee_background.cor')
@@ -364,15 +355,6 @@ would handle a larger corpus.
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:00,362 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
- 2020-09-30 17:02:00,366 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
-
@@ -398,46 +380,6 @@ training parameters much for now, we'll revisit them later.
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:00,550 : INFO : collecting all words and their counts
- 2020-09-30 17:02:00,551 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:00,657 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
- 2020-09-30 17:02:00,657 : INFO : Loading a fresh vocabulary
- 2020-09-30 17:02:00,668 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
- 2020-09-30 17:02:00,668 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
- 2020-09-30 17:02:00,683 : INFO : deleting the raw counts dictionary of 6981 items
- 2020-09-30 17:02:00,741 : INFO : sample=0.001 downsamples 51 most-common words
- 2020-09-30 17:02:00,741 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
- 2020-09-30 17:02:00,769 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
- 2020-09-30 17:02:00,770 : INFO : resetting layer weights
- 2020-09-30 17:02:00,875 : INFO : training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:00,993 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:00,994 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:00,999 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:00,999 : INFO : EPOCH - 1 : training on 58152 raw words (35967 effective words) took 0.1s, 305737 effective words/s
- 2020-09-30 17:02:01,099 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:01,103 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:01,106 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:01,106 : INFO : EPOCH - 2 : training on 58152 raw words (35955 effective words) took 0.1s, 343839 effective words/s
- 2020-09-30 17:02:01,210 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:01,218 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:01,220 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:01,220 : INFO : EPOCH - 3 : training on 58152 raw words (35878 effective words) took 0.1s, 316674 effective words/s
- 2020-09-30 17:02:01,326 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:01,333 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:01,336 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:01,336 : INFO : EPOCH - 4 : training on 58152 raw words (35809 effective words) took 0.1s, 312256 effective words/s
- 2020-09-30 17:02:01,434 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:01,438 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:01,441 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:01,441 : INFO : EPOCH - 5 : training on 58152 raw words (35998 effective words) took 0.1s, 344237 effective words/s
- 2020-09-30 17:02:01,441 : INFO : training on a 290760 raw words (179607 effective words) took 0.6s, 317010 effective words/s
-
@@ -522,20 +464,6 @@ You can store/load models using the standard gensim methods:
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:01,737 : INFO : saving Word2Vec object under /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d, separately None
- 2020-09-30 17:02:01,740 : INFO : not storing attribute cum_table
- 2020-09-30 17:02:01,785 : INFO : saved /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d
- 2020-09-30 17:02:01,786 : INFO : loading Word2Vec object from /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d
- 2020-09-30 17:02:01,801 : INFO : loading wv recursively from /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d.wv.* with mmap=None
- 2020-09-30 17:02:01,801 : INFO : setting ignored attribute cum_table to None
- 2020-09-30 17:02:01,821 : INFO : loaded /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d
-
@@ -575,46 +503,6 @@ default value of min_count=5
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:01,918 : INFO : collecting all words and their counts
- 2020-09-30 17:02:01,921 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:02,011 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
- 2020-09-30 17:02:02,011 : INFO : Loading a fresh vocabulary
- 2020-09-30 17:02:02,018 : INFO : effective_min_count=10 retains 889 unique words (12% of original 6981, drops 6092)
- 2020-09-30 17:02:02,018 : INFO : effective_min_count=10 leaves 43776 word corpus (75% of original 58152, drops 14376)
- 2020-09-30 17:02:02,028 : INFO : deleting the raw counts dictionary of 6981 items
- 2020-09-30 17:02:02,029 : INFO : sample=0.001 downsamples 55 most-common words
- 2020-09-30 17:02:02,029 : INFO : downsampling leaves estimated 29691 word corpus (67.8% of prior 43776)
- 2020-09-30 17:02:02,041 : INFO : estimated required memory for 889 words and 100 dimensions: 1155700 bytes
- 2020-09-30 17:02:02,041 : INFO : resetting layer weights
- 2020-09-30 17:02:02,083 : INFO : training model with 3 workers on 889 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:02,184 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,190 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,192 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,192 : INFO : EPOCH - 1 : training on 58152 raw words (29629 effective words) took 0.1s, 276020 effective words/s
- 2020-09-30 17:02:02,287 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,292 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,295 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,295 : INFO : EPOCH - 2 : training on 58152 raw words (29624 effective words) took 0.1s, 290768 effective words/s
- 2020-09-30 17:02:02,394 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,397 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,400 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,400 : INFO : EPOCH - 3 : training on 58152 raw words (29769 effective words) took 0.1s, 286475 effective words/s
- 2020-09-30 17:02:02,496 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,499 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,501 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,502 : INFO : EPOCH - 4 : training on 58152 raw words (29578 effective words) took 0.1s, 293835 effective words/s
- 2020-09-30 17:02:02,598 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,601 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,604 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,604 : INFO : EPOCH - 5 : training on 58152 raw words (29707 effective words) took 0.1s, 292782 effective words/s
- 2020-09-30 17:02:02,604 : INFO : training on a 290760 raw words (148307 effective words) took 0.5s, 284858 effective words/s
-
@@ -639,46 +527,6 @@ accurate) models. Reasonable values are in the tens to hundreds.
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:02,626 : INFO : collecting all words and their counts
- 2020-09-30 17:02:02,628 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:02,722 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
- 2020-09-30 17:02:02,722 : INFO : Loading a fresh vocabulary
- 2020-09-30 17:02:02,734 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
- 2020-09-30 17:02:02,734 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
- 2020-09-30 17:02:02,748 : INFO : deleting the raw counts dictionary of 6981 items
- 2020-09-30 17:02:02,748 : INFO : sample=0.001 downsamples 51 most-common words
- 2020-09-30 17:02:02,748 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
- 2020-09-30 17:02:02,770 : INFO : estimated required memory for 1750 words and 200 dimensions: 3675000 bytes
- 2020-09-30 17:02:02,770 : INFO : resetting layer weights
- 2020-09-30 17:02:02,864 : INFO : training model with 3 workers on 1750 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:02,973 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:02,979 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:02,982 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:02,982 : INFO : EPOCH - 1 : training on 58152 raw words (35994 effective words) took 0.1s, 307729 effective words/s
- 2020-09-30 17:02:03,087 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,093 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,097 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,097 : INFO : EPOCH - 2 : training on 58152 raw words (35944 effective words) took 0.1s, 317636 effective words/s
- 2020-09-30 17:02:03,202 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,208 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,212 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,212 : INFO : EPOCH - 3 : training on 58152 raw words (36007 effective words) took 0.1s, 314282 effective words/s
- 2020-09-30 17:02:03,320 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,327 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,330 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,330 : INFO : EPOCH - 4 : training on 58152 raw words (35992 effective words) took 0.1s, 307219 effective words/s
- 2020-09-30 17:02:03,436 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,442 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,445 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,446 : INFO : EPOCH - 5 : training on 58152 raw words (36003 effective words) took 0.1s, 314793 effective words/s
- 2020-09-30 17:02:03,446 : INFO : training on a 290760 raw words (179940 effective words) took 0.6s, 309327 effective words/s
-
@@ -701,51 +549,6 @@ is for training parallelization, to speed up training:
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:03,470 : INFO : collecting all words and their counts
- 2020-09-30 17:02:03,472 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:03,571 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
- 2020-09-30 17:02:03,571 : INFO : Loading a fresh vocabulary
- 2020-09-30 17:02:03,582 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
- 2020-09-30 17:02:03,582 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
- 2020-09-30 17:02:03,595 : INFO : deleting the raw counts dictionary of 6981 items
- 2020-09-30 17:02:03,595 : INFO : sample=0.001 downsamples 51 most-common words
- 2020-09-30 17:02:03,595 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
- 2020-09-30 17:02:03,616 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
- 2020-09-30 17:02:03,616 : INFO : resetting layer weights
- 2020-09-30 17:02:03,704 : INFO : training model with 4 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:03,809 : INFO : worker thread finished; awaiting finish of 3 more threads
- 2020-09-30 17:02:03,810 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,810 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,815 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,816 : INFO : EPOCH - 1 : training on 58152 raw words (35953 effective words) took 0.1s, 326539 effective words/s
- 2020-09-30 17:02:03,912 : INFO : worker thread finished; awaiting finish of 3 more threads
- 2020-09-30 17:02:03,913 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:03,915 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:03,920 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:03,920 : INFO : EPOCH - 2 : training on 58152 raw words (35895 effective words) took 0.1s, 348415 effective words/s
- 2020-09-30 17:02:04,017 : INFO : worker thread finished; awaiting finish of 3 more threads
- 2020-09-30 17:02:04,018 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,021 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,024 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,024 : INFO : EPOCH - 3 : training on 58152 raw words (35907 effective words) took 0.1s, 347822 effective words/s
- 2020-09-30 17:02:04,127 : INFO : worker thread finished; awaiting finish of 3 more threads
- 2020-09-30 17:02:04,127 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,128 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,134 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,134 : INFO : EPOCH - 4 : training on 58152 raw words (35909 effective words) took 0.1s, 333947 effective words/s
- 2020-09-30 17:02:04,232 : INFO : worker thread finished; awaiting finish of 3 more threads
- 2020-09-30 17:02:04,232 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,233 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,238 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,238 : INFO : EPOCH - 5 : training on 58152 raw words (35957 effective words) took 0.1s, 347693 effective words/s
- 2020-09-30 17:02:04,238 : INFO : training on a 290760 raw words (179621 effective words) took 0.5s, 335988 effective words/s
-
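
The training log above corresponds to CBOW with negative sampling on the small Lee corpus. As a rough sketch (not the tutorial's verbatim code; `corpus` stands in for any restartable iterable of tokenized sentences), the parameters visible in the log map onto the Gensim 4.0 constructor like this:

```python
from gensim.models import Word2Vec

# `corpus` is a placeholder: any restartable iterable of lists of str.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,
    min_count=5,      # drop words with fewer than 5 occurrences
    sample=1e-3,      # downsample the most frequent words
    sg=0, hs=0,       # CBOW with negative sampling
    negative=5,
    workers=4,
    epochs=5,
)
```
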
@@ -811,21 +614,8 @@ Gensim supports the same evaluation set, in exactly the same format:
.. code-block:: none
- 2020-09-30 17:02:04,350 : INFO : Evaluating word analogies for top 300000 words in the model on /Volumes/work/workspace/gensim/trunk/gensim/test/test_data/questions-words.txt
- 2020-09-30 17:02:04,358 : INFO : capital-common-countries: 0.0% (0/6)
- 2020-09-30 17:02:04,376 : INFO : capital-world: 0.0% (0/2)
- 2020-09-30 17:02:04,392 : INFO : family: 0.0% (0/6)
- 2020-09-30 17:02:04,409 : INFO : gram3-comparative: 0.0% (0/20)
- 2020-09-30 17:02:04,416 : INFO : gram4-superlative: 0.0% (0/12)
- 2020-09-30 17:02:04,423 : INFO : gram5-present-participle: 0.0% (0/20)
- 2020-09-30 17:02:04,435 : INFO : gram6-nationality-adjective: 0.0% (0/30)
- 2020-09-30 17:02:04,445 : INFO : gram7-past-tense: 0.0% (0/20)
- 2020-09-30 17:02:04,457 : INFO : gram8-plural: 3.3% (1/30)
- 2020-09-30 17:02:04,462 : INFO : Quadruplets with out-of-vocabulary words: 99.3%
- 2020-09-30 17:02:04,465 : INFO : NB: analogies containing OOV words were skipped from evaluation! To change this behavior, use "dummy4unknown=True"
- 2020-09-30 17:02:04,465 : INFO : Total accuracy: 0.7% (1/146)
- (0.00684931506849315, [{'section': 'capital-common-countries', 'correct': [], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]}, {'section': 'capital-world', 'correct': [], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]}, {'section': 'currency', 'correct': [], 'incorrect': []}, {'section': 'city-in-state', 'correct': [], 'incorrect': []}, {'section': 'family', 'correct': [], 'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HE', 'SHE', 'MAN', 'WOMAN'), ('HIS', 'HER', 'MAN', 'WOMAN'), ('HIS', 'HER', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HIS', 'HER')]}, {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []}, {'section': 'gram2-opposite', 'correct': [], 'incorrect': []}, {'section': 'gram3-comparative', 'correct': [], 'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'), ('GOOD', 'BETTER', 'LONG', 'LONGER'), ('GOOD', 'BETTER', 'LOW', 'LOWER'), ('GOOD', 'BETTER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'LONG', 'LONGER'), ('GREAT', 'GREATER', 'LOW', 'LOWER'), ('GREAT', 'GREATER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'LOW', 'LOWER'), ('LONG', 'LONGER', 'SMALL', 'SMALLER'), ('LONG', 'LONGER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'SMALL', 'SMALLER'), ('LOW', 'LOWER', 'GOOD', 'BETTER'), ('LOW', 'LOWER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'GOOD', 'BETTER'), ('SMALL', 'SMALLER', 'GREAT', 'GREATER'), ('SMALL', 'SMALLER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'LOW', 'LOWER')]}, {'section': 'gram4-superlative', 'correct': [], 'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'), ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'), ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'GREAT', 'GREATEST'), ('GOOD', 'BEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'), ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'), ('LARGE', 'LARGEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]}, {'section': 'gram5-present-participle', 'correct': [], 'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'), ('GO', 'GOING', 'PLAY', 'PLAYING'), ('GO', 'GOING', 'RUN', 'RUNNING'), ('GO', 'GOING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'), ('LOOK', 'LOOKING', 'RUN', 'RUNNING'), ('LOOK', 'LOOKING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'RUN', 'RUNNING'), ('PLAY', 'PLAYING', 'SAY', 'SAYING'), ('PLAY', 'PLAYING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'SAY', 'SAYING'), ('RUN', 'RUNNING', 'GO', 'GOING'), ('RUN', 'RUNNING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'GO', 'GOING'), ('SAY', 'SAYING', 'LOOK', 'LOOKING'), ('SAY', 'SAYING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'RUN', 'RUNNING')]}, {'section': 'gram6-nationality-adjective', 'correct': [], 'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'), ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'), ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'), ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'), ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'), ('FRANCE', 
'FRENCH', 'INDIA', 'INDIAN'), ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'), ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'), ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'), ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'), ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'), ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'), ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'), ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'), ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'), ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'), ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'), ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'), ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'), ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'), ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]}, {'section': 'gram7-past-tense', 'correct': [], 'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'), ('GOING', 'WENT', 'PLAYING', 'PLAYED'), ('GOING', 'WENT', 'SAYING', 'SAID'), ('GOING', 'WENT', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'PLAYING', 'PLAYED'), ('PAYING', 'PAID', 'SAYING', 'SAID'), ('PAYING', 'PAID', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'SAYING', 'SAID'), ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'), ('PLAYING', 'PLAYED', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'TAKING', 'TOOK'), ('SAYING', 'SAID', 'GOING', 'WENT'), ('SAYING', 'SAID', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'GOING', 'WENT'), ('TAKING', 'TOOK', 'PAYING', 'PAID'), ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'SAYING', 'SAID')]}, {'section': 'gram8-plural', 'correct': [('CAR', 'CARS', 'BUILDING', 'BUILDINGS')], 'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'), ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'), ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'), ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'), ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'CHILD', 'CHILDREN'), ('CAR', 'CARS', 'MAN', 'MEN'), ('CAR', 'CARS', 'ROAD', 'ROADS'), ('CAR', 'CARS', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'MAN', 'MEN'), ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'), ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'CAR', 'CARS'), ('MAN', 'MEN', 'ROAD', 'ROADS'), ('MAN', 'MEN', 'WOMAN', 'WOMEN'), ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'), ('MAN', 'MEN', 'CAR', 'CARS'), ('MAN', 'MEN', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'), ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'), ('ROAD', 'ROADS', 'CAR', 'CARS'), ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'), ('WOMAN', 'WOMEN', 'CAR', 'CARS'), ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'), ('WOMAN', 'WOMEN', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}, {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []}, {'section': 'Total accuracy', 'correct': [('CAR', 'CARS', 'BUILDING', 'BUILDINGS')], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 
'FRANCE', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('HE', 'SHE', 'HIS', 'HER'), ('HE', 'SHE', 'MAN', 'WOMAN'), ('HIS', 'HER', 'MAN', 'WOMAN'), ('HIS', 'HER', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HIS', 'HER'), ('GOOD', 'BETTER', 'GREAT', 'GREATER'), ('GOOD', 'BETTER', 'LONG', 'LONGER'), ('GOOD', 'BETTER', 'LOW', 'LOWER'), ('GOOD', 'BETTER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'LONG', 'LONGER'), ('GREAT', 'GREATER', 'LOW', 'LOWER'), ('GREAT', 'GREATER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'LOW', 'LOWER'), ('LONG', 'LONGER', 'SMALL', 'SMALLER'), ('LONG', 'LONGER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'SMALL', 'SMALLER'), ('LOW', 'LOWER', 'GOOD', 'BETTER'), ('LOW', 'LOWER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'GOOD', 'BETTER'), ('SMALL', 'SMALLER', 'GREAT', 'GREATER'), ('SMALL', 'SMALLER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'LOW', 'LOWER'), ('BIG', 'BIGGEST', 'GOOD', 'BEST'), ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'), ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'GREAT', 'GREATEST'), ('GOOD', 'BEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'), ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'), ('LARGE', 'LARGEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'), ('GO', 'GOING', 'LOOK', 'LOOKING'), ('GO', 'GOING', 'PLAY', 'PLAYING'), ('GO', 'GOING', 'RUN', 'RUNNING'), ('GO', 'GOING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'), ('LOOK', 'LOOKING', 'RUN', 'RUNNING'), ('LOOK', 'LOOKING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'RUN', 'RUNNING'), ('PLAY', 'PLAYING', 'SAY', 'SAYING'), ('PLAY', 'PLAYING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'SAY', 'SAYING'), ('RUN', 'RUNNING', 'GO', 'GOING'), ('RUN', 'RUNNING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'GO', 'GOING'), ('SAY', 'SAYING', 'LOOK', 'LOOKING'), ('SAY', 'SAYING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'RUN', 'RUNNING'), ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'), ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'), ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'), ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'), ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'), ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'), ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'), ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'), ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'), ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'), ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'), ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'), ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'), ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'), ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'), ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'), ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'), 
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'), ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'), ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'), ('GOING', 'WENT', 'PAYING', 'PAID'), ('GOING', 'WENT', 'PLAYING', 'PLAYED'), ('GOING', 'WENT', 'SAYING', 'SAID'), ('GOING', 'WENT', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'PLAYING', 'PLAYED'), ('PAYING', 'PAID', 'SAYING', 'SAID'), ('PAYING', 'PAID', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'SAYING', 'SAID'), ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'), ('PLAYING', 'PLAYED', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'TAKING', 'TOOK'), ('SAYING', 'SAID', 'GOING', 'WENT'), ('SAYING', 'SAID', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'GOING', 'WENT'), ('TAKING', 'TOOK', 'PAYING', 'PAID'), ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'SAYING', 'SAID'), ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'), ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'), ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'), ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'), ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'CHILD', 'CHILDREN'), ('CAR', 'CARS', 'MAN', 'MEN'), ('CAR', 'CARS', 'ROAD', 'ROADS'), ('CAR', 'CARS', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'MAN', 'MEN'), ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'), ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'CAR', 'CARS'), ('MAN', 'MEN', 'ROAD', 'ROADS'), ('MAN', 'MEN', 'WOMAN', 'WOMEN'), ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'), ('MAN', 'MEN', 'CAR', 'CARS'), ('MAN', 'MEN', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'), ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'), ('ROAD', 'ROADS', 'CAR', 'CARS'), ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'), ('WOMAN', 'WOMEN', 'CAR', 'CARS'), ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'), ('WOMAN', 'WOMEN', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}])
+ (0.0, [{'section': 'capital-common-countries', 'correct': [], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]}, {'section': 'capital-world', 'correct': [], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]}, {'section': 'currency', 'correct': [], 'incorrect': []}, {'section': 'city-in-state', 'correct': [], 'incorrect': []}, {'section': 'family', 'correct': [], 'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HE', 'SHE', 'MAN', 'WOMAN'), ('HIS', 'HER', 'MAN', 'WOMAN'), ('HIS', 'HER', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HIS', 'HER')]}, {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []}, {'section': 'gram2-opposite', 'correct': [], 'incorrect': []}, {'section': 'gram3-comparative', 'correct': [], 'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'), ('GOOD', 'BETTER', 'LONG', 'LONGER'), ('GOOD', 'BETTER', 'LOW', 'LOWER'), ('GOOD', 'BETTER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'LONG', 'LONGER'), ('GREAT', 'GREATER', 'LOW', 'LOWER'), ('GREAT', 'GREATER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'LOW', 'LOWER'), ('LONG', 'LONGER', 'SMALL', 'SMALLER'), ('LONG', 'LONGER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'SMALL', 'SMALLER'), ('LOW', 'LOWER', 'GOOD', 'BETTER'), ('LOW', 'LOWER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'GOOD', 'BETTER'), ('SMALL', 'SMALLER', 'GREAT', 'GREATER'), ('SMALL', 'SMALLER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'LOW', 'LOWER')]}, {'section': 'gram4-superlative', 'correct': [], 'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'), ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'), ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'GREAT', 'GREATEST'), ('GOOD', 'BEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'), ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'), ('LARGE', 'LARGEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]}, {'section': 'gram5-present-participle', 'correct': [], 'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'), ('GO', 'GOING', 'PLAY', 'PLAYING'), ('GO', 'GOING', 'RUN', 'RUNNING'), ('GO', 'GOING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'), ('LOOK', 'LOOKING', 'RUN', 'RUNNING'), ('LOOK', 'LOOKING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'RUN', 'RUNNING'), ('PLAY', 'PLAYING', 'SAY', 'SAYING'), ('PLAY', 'PLAYING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'SAY', 'SAYING'), ('RUN', 'RUNNING', 'GO', 'GOING'), ('RUN', 'RUNNING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'GO', 'GOING'), ('SAY', 'SAYING', 'LOOK', 'LOOKING'), ('SAY', 'SAYING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'RUN', 'RUNNING')]}, {'section': 'gram6-nationality-adjective', 'correct': [], 'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'), ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'), ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'), ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'), ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 
'INDIA', 'INDIAN'), ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'), ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'), ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'), ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'), ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'), ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'), ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'), ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'), ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'), ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'), ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'), ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'), ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'), ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'), ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]}, {'section': 'gram7-past-tense', 'correct': [], 'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'), ('GOING', 'WENT', 'PLAYING', 'PLAYED'), ('GOING', 'WENT', 'SAYING', 'SAID'), ('GOING', 'WENT', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'PLAYING', 'PLAYED'), ('PAYING', 'PAID', 'SAYING', 'SAID'), ('PAYING', 'PAID', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'SAYING', 'SAID'), ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'), ('PLAYING', 'PLAYED', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'TAKING', 'TOOK'), ('SAYING', 'SAID', 'GOING', 'WENT'), ('SAYING', 'SAID', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'GOING', 'WENT'), ('TAKING', 'TOOK', 'PAYING', 'PAID'), ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'SAYING', 'SAID')]}, {'section': 'gram8-plural', 'correct': [], 'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'), ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'), ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'), ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'), ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'CHILD', 'CHILDREN'), ('CAR', 'CARS', 'MAN', 'MEN'), ('CAR', 'CARS', 'ROAD', 'ROADS'), ('CAR', 'CARS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'MAN', 'MEN'), ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'), ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'CAR', 'CARS'), ('MAN', 'MEN', 'ROAD', 'ROADS'), ('MAN', 'MEN', 'WOMAN', 'WOMEN'), ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'), ('MAN', 'MEN', 'CAR', 'CARS'), ('MAN', 'MEN', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'), ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'), ('ROAD', 'ROADS', 'CAR', 'CARS'), ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'), ('WOMAN', 'WOMEN', 'CAR', 'CARS'), ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'), ('WOMAN', 'WOMEN', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}, {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []}, {'section': 'Total accuracy', 'correct': [], 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'), ('PARIS', 'FRANCE', 
'KABUL', 'AFGHANISTAN'), ('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'), ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'), ('HE', 'SHE', 'HIS', 'HER'), ('HE', 'SHE', 'MAN', 'WOMAN'), ('HIS', 'HER', 'MAN', 'WOMAN'), ('HIS', 'HER', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HE', 'SHE'), ('MAN', 'WOMAN', 'HIS', 'HER'), ('GOOD', 'BETTER', 'GREAT', 'GREATER'), ('GOOD', 'BETTER', 'LONG', 'LONGER'), ('GOOD', 'BETTER', 'LOW', 'LOWER'), ('GOOD', 'BETTER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'LONG', 'LONGER'), ('GREAT', 'GREATER', 'LOW', 'LOWER'), ('GREAT', 'GREATER', 'SMALL', 'SMALLER'), ('GREAT', 'GREATER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'LOW', 'LOWER'), ('LONG', 'LONGER', 'SMALL', 'SMALLER'), ('LONG', 'LONGER', 'GOOD', 'BETTER'), ('LONG', 'LONGER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'SMALL', 'SMALLER'), ('LOW', 'LOWER', 'GOOD', 'BETTER'), ('LOW', 'LOWER', 'GREAT', 'GREATER'), ('LOW', 'LOWER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'GOOD', 'BETTER'), ('SMALL', 'SMALLER', 'GREAT', 'GREATER'), ('SMALL', 'SMALLER', 'LONG', 'LONGER'), ('SMALL', 'SMALLER', 'LOW', 'LOWER'), ('BIG', 'BIGGEST', 'GOOD', 'BEST'), ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'), ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'GREAT', 'GREATEST'), ('GOOD', 'BEST', 'LARGE', 'LARGEST'), ('GOOD', 'BEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'), ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'), ('GREAT', 'GREATEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'), ('LARGE', 'LARGEST', 'GOOD', 'BEST'), ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'), ('GO', 'GOING', 'LOOK', 'LOOKING'), ('GO', 'GOING', 'PLAY', 'PLAYING'), ('GO', 'GOING', 'RUN', 'RUNNING'), ('GO', 'GOING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'), ('LOOK', 'LOOKING', 'RUN', 'RUNNING'), ('LOOK', 'LOOKING', 'SAY', 'SAYING'), ('LOOK', 'LOOKING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'RUN', 'RUNNING'), ('PLAY', 'PLAYING', 'SAY', 'SAYING'), ('PLAY', 'PLAYING', 'GO', 'GOING'), ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'SAY', 'SAYING'), ('RUN', 'RUNNING', 'GO', 'GOING'), ('RUN', 'RUNNING', 'LOOK', 'LOOKING'), ('RUN', 'RUNNING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'GO', 'GOING'), ('SAY', 'SAYING', 'LOOK', 'LOOKING'), ('SAY', 'SAYING', 'PLAY', 'PLAYING'), ('SAY', 'SAYING', 'RUN', 'RUNNING'), ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'), ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'), ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'), ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'), ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'), ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'), ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'), ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'), ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'), ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'), ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'), ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'), ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'), ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'), ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'), ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'), ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'), ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'), ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'), ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'), ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'), ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'), 
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'), ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'), ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'), ('GOING', 'WENT', 'PAYING', 'PAID'), ('GOING', 'WENT', 'PLAYING', 'PLAYED'), ('GOING', 'WENT', 'SAYING', 'SAID'), ('GOING', 'WENT', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'PLAYING', 'PLAYED'), ('PAYING', 'PAID', 'SAYING', 'SAID'), ('PAYING', 'PAID', 'TAKING', 'TOOK'), ('PAYING', 'PAID', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'SAYING', 'SAID'), ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'), ('PLAYING', 'PLAYED', 'GOING', 'WENT'), ('PLAYING', 'PLAYED', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'TAKING', 'TOOK'), ('SAYING', 'SAID', 'GOING', 'WENT'), ('SAYING', 'SAID', 'PAYING', 'PAID'), ('SAYING', 'SAID', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'GOING', 'WENT'), ('TAKING', 'TOOK', 'PAYING', 'PAID'), ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'), ('TAKING', 'TOOK', 'SAYING', 'SAID'), ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'), ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'), ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'), ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'), ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'CHILD', 'CHILDREN'), ('CAR', 'CARS', 'MAN', 'MEN'), ('CAR', 'CARS', 'ROAD', 'ROADS'), ('CAR', 'CARS', 'WOMAN', 'WOMEN'), ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'MAN', 'MEN'), ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'), ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'), ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'), ('CHILD', 'CHILDREN', 'CAR', 'CARS'), ('MAN', 'MEN', 'ROAD', 'ROADS'), ('MAN', 'MEN', 'WOMAN', 'WOMEN'), ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'), ('MAN', 'MEN', 'CAR', 'CARS'), ('MAN', 'MEN', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'), ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'), ('ROAD', 'ROADS', 'CAR', 'CARS'), ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'), ('ROAD', 'ROADS', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'), ('WOMAN', 'WOMEN', 'CAR', 'CARS'), ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'), ('WOMAN', 'WOMEN', 'MAN', 'MEN'), ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}])
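
For reference, the analogy report above is produced by `KeyedVectors.evaluate_word_analogies`; a minimal sketch of the call, assuming a trained `model` as in the log:

```python
from gensim.test.utils import datapath

# Returns (overall_score, per_section_details). Analogies containing OOV words
# are skipped unless dummy4unknown=True is passed.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print(score)
```
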
@@ -859,11 +649,8 @@ are less similar because they are related but not interchangeable.
.. code-block:: none
- 2020-09-30 17:02:04,681 : INFO : Pearson correlation coefficient against /Volumes/work/workspace/gensim/trunk/gensim/test/test_data/wordsim353.tsv: 0.1072
- 2020-09-30 17:02:04,682 : INFO : Spearman rank-order correlation coefficient against /Volumes/work/workspace/gensim/trunk/gensim/test/test_data/wordsim353.tsv: 0.0977
- 2020-09-30 17:02:04,682 : INFO : Pairs with unknown words ratio: 83.0%
- ((0.10718629411012633, 0.41498744701424156), SpearmanrResult(correlation=0.09773516803468056, pvalue=0.4575366217424267), 83.0028328611898)
+ ((0.1014236962315867, 0.44065378924434523), SpearmanrResult(correlation=0.07441989763914543, pvalue=0.5719973648460552), 83.0028328611898)
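
The Pearson/Spearman figures above come from `KeyedVectors.evaluate_word_pairs`; a minimal sketch, again assuming a trained `model`:

```python
from gensim.test.utils import datapath

# Returns (pearson, spearman, percentage_of_pairs_with_unknown_words).
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
```
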
@@ -901,50 +688,6 @@ and `new vocabulary words `_:
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
- 2020-09-30 17:02:04,775 : INFO : loading Word2Vec object from /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d
- 2020-09-30 17:02:04,788 : INFO : loading wv recursively from /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d.wv.* with mmap=None
- 2020-09-30 17:02:04,789 : INFO : setting ignored attribute cum_table to None
- 2020-09-30 17:02:04,809 : INFO : loaded /var/folders/w0/f7blghz9277068cnyyd3nd200000gn/T/gensim-model-36yeu47d
- 2020-09-30 17:02:04,809 : INFO : collecting all words and their counts
- 2020-09-30 17:02:04,809 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:04,810 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
- 2020-09-30 17:02:04,810 : INFO : Updating model with new vocabulary
- 2020-09-30 17:02:04,819 : INFO : New added 0 unique words (0% of original 13) and increased the count of 0 pre-existing words (0% of original 13)
- 2020-09-30 17:02:04,819 : INFO : deleting the raw counts dictionary of 13 items
- 2020-09-30 17:02:04,819 : INFO : sample=0.001 downsamples 0 most-common words
- 2020-09-30 17:02:04,819 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
- 2020-09-30 17:02:04,838 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
- 2020-09-30 17:02:04,838 : INFO : updating layer weights
- 2020-09-30 17:02:04,839 : WARNING : Effective 'alpha' higher than previous training cycles
- 2020-09-30 17:02:04,839 : INFO : training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:04,842 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,843 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,843 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,843 : INFO : EPOCH - 1 : training on 13 raw words (6 effective words) took 0.0s, 5326 effective words/s
- 2020-09-30 17:02:04,844 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,845 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,845 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,845 : INFO : EPOCH - 2 : training on 13 raw words (5 effective words) took 0.0s, 6975 effective words/s
- 2020-09-30 17:02:04,846 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,846 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,846 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,846 : INFO : EPOCH - 3 : training on 13 raw words (5 effective words) took 0.0s, 8539 effective words/s
- 2020-09-30 17:02:04,847 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,847 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,847 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,848 : INFO : EPOCH - 4 : training on 13 raw words (6 effective words) took 0.0s, 11100 effective words/s
- 2020-09-30 17:02:04,848 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:04,849 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:04,849 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:04,849 : INFO : EPOCH - 5 : training on 13 raw words (5 effective words) took 0.0s, 9718 effective words/s
- 2020-09-30 17:02:04,849 : INFO : training on a 65 raw words (27 effective words) took 0.0s, 2900 effective words/s
-
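
The resumed-training log above follows the load / `build_vocab(update=True)` / `train` pattern. A sketch under the assumption that `temporary_filepath` points at a previously saved model and `more_sentences` is a new iterable of tokenized sentences:

```python
from gensim.models import Word2Vec

model = Word2Vec.load(temporary_filepath)
model.build_vocab(more_sentences, update=True)  # fold any new words into the vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)
```
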
@@ -994,40 +737,7 @@ attribute ``running_training_loss`` and can be retrieved using the function
.. code-block:: none
- 2020-09-30 17:02:05,025 : INFO : collecting all words and their counts
- 2020-09-30 17:02:05,027 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2020-09-30 17:02:05,112 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
- 2020-09-30 17:02:05,112 : INFO : Loading a fresh vocabulary
- 2020-09-30 17:02:05,152 : INFO : effective_min_count=1 retains 6981 unique words (100% of original 6981, drops 0)
- 2020-09-30 17:02:05,152 : INFO : effective_min_count=1 leaves 58152 word corpus (100% of original 58152, drops 0)
- 2020-09-30 17:02:05,207 : INFO : deleting the raw counts dictionary of 6981 items
- 2020-09-30 17:02:05,207 : INFO : sample=0.001 downsamples 43 most-common words
- 2020-09-30 17:02:05,207 : INFO : downsampling leaves estimated 45723 word corpus (78.6% of prior 58152)
- 2020-09-30 17:02:05,294 : INFO : estimated required memory for 6981 words and 100 dimensions: 9075300 bytes
- 2020-09-30 17:02:05,294 : INFO : resetting layer weights
- 2020-09-30 17:02:05,651 : INFO : training model with 3 workers on 6981 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
- 2020-09-30 17:02:05,800 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:05,839 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:05,841 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:05,841 : INFO : EPOCH - 1 : training on 58152 raw words (45692 effective words) took 0.2s, 242729 effective words/s
- 2020-09-30 17:02:06,028 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:06,032 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:06,037 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:06,037 : INFO : EPOCH - 2 : training on 58152 raw words (45778 effective words) took 0.2s, 234367 effective words/s
- 2020-09-30 17:02:06,218 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:06,222 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:06,225 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:06,225 : INFO : EPOCH - 3 : training on 58152 raw words (45684 effective words) took 0.2s, 244363 effective words/s
- 2020-09-30 17:02:06,400 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:06,407 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:06,409 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:06,409 : INFO : EPOCH - 4 : training on 58152 raw words (45651 effective words) took 0.2s, 249862 effective words/s
- 2020-09-30 17:02:06,558 : INFO : worker thread finished; awaiting finish of 2 more threads
- 2020-09-30 17:02:06,597 : INFO : worker thread finished; awaiting finish of 1 more threads
- 2020-09-30 17:02:06,600 : INFO : worker thread finished; awaiting finish of 0 more threads
- 2020-09-30 17:02:06,600 : INFO : EPOCH - 5 : training on 58152 raw words (45745 effective words) took 0.2s, 240328 effective words/s
- 2020-09-30 17:02:06,600 : INFO : training on a 290760 raw words (228550 effective words) took 0.9s, 240759 effective words/s
- 1365568.125
+ 1369454.25
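
The final number above (`1369454.25`) is the running training loss, which is only accumulated when the model is constructed with `compute_loss=True`. A minimal sketch matching the skip-gram settings visible in the log (`corpus` is again a placeholder iterable):

```python
from gensim.models import Word2Vec

model_with_loss = Word2Vec(
    sentences=corpus,
    min_count=1,
    sg=1,               # skip-gram, as in the log above
    compute_loss=True,  # accumulate running_training_loss during training
)
print(model_with_loss.get_latest_training_loss())
```
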
@@ -1165,55 +875,55 @@ standard deviation of the test duration.
.. code-block:: none
- Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.3307774066925049, 'train_time_std': 0.00578659163388716}
- Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.3314487934112549, 'train_time_std': 0.004201913501261655}
- Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.5213752587636312, 'train_time_std': 0.008047867089155704}
- Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.5293020407358805, 'train_time_std': 0.005368254032954145}
- Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.571751594543457, 'train_time_std': 0.001023259266794945}
- Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.5736987590789795, 'train_time_std': 0.00740075638673385}
- Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 1.1089734236399333, 'train_time_std': 0.029923990619945186}
- Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 1.2068419456481934, 'train_time_std': 0.006783016321594606}
- Word2vec model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.9139569600423177, 'train_time_std': 0.04541121423444599}
- Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.9152584075927734, 'train_time_std': 0.05191135337399049}
- Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 1.6703286170959473, 'train_time_std': 0.11292966925292192}
- Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 1.583152135213216, 'train_time_std': 0.04577290669842482}
- Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 1.811710516611735, 'train_time_std': 0.01081321887556254}
- Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 1.8143157164255779, 'train_time_std': 0.026406013455100835}
- Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 3.5845812956492105, 'train_time_std': 0.08968344917541199}
- Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 3.6167975266774497, 'train_time_std': 0.14609390508721276}
- Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 8.021462361017862, 'train_time_std': 0.21593094159548987}
- Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 7.931290070215861, 'train_time_std': 0.25084118769867136}
- Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 15.51533571879069, 'train_time_std': 0.8857355166766315}
- Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 15.930208921432495, 'train_time_std': 0.6417048653898146}
- Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 21.687038342158, 'train_time_std': 0.3261330075856754}
- Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 21.280882279078167, 'train_time_std': 0.12885843584913614}
- Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 43.11969208717346, 'train_time_std': 0.8133788671881127}
- Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 40.59294398625692, 'train_time_std': 0.47622639550838375}
+ Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.25217413902282715, 'train_time_std': 0.020226552024939795}
+ Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.25898512204488117, 'train_time_std': 0.026276375796854143}
+ Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.4194076855977376, 'train_time_std': 0.0021983060310549808}
+ Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.4308760166168213, 'train_time_std': 0.0009999532723555815}
+ Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.47211599349975586, 'train_time_std': 0.015136686417800442}
+ Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.4695216814676921, 'train_time_std': 0.0033446725418043747}
+ Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 0.9502590497334799, 'train_time_std': 0.005153258425238986}
+ Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 0.9424160321553549, 'train_time_std': 0.009776048211734903}
+ Word2vec model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.6441135406494141, 'train_time_std': 0.00934594899599891}
+ Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.656217098236084, 'train_time_std': 0.02703627277086478}
+ Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 1.3150715033213298, 'train_time_std': 0.09457246701267184}
+ Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 1.205832560857137, 'train_time_std': 0.005158620074483131}
+ Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 1.5065066814422607, 'train_time_std': 0.036966116484319765}
+ Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 1.537813663482666, 'train_time_std': 0.01020688183426915}
+ Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 3.302257219950358, 'train_time_std': 0.04523242606424026}
+ Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 3.4928714434305825, 'train_time_std': 0.19327551634697}
+ Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 7.446084260940552, 'train_time_std': 0.7894319693665308}
+ Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 7.060012976328532, 'train_time_std': 0.2136692186366028}
+ Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 14.277136087417603, 'train_time_std': 0.7441633349142932}
+ Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 13.758649031321207, 'train_time_std': 0.37393987718126326}
+ Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 20.35730775197347, 'train_time_std': 0.41241047454786994}
+ Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 21.380844751993816, 'train_time_std': 1.6909472056783184}
+ Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 44.4877184232076, 'train_time_std': 1.1314265197889173}
+ Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 44.517534812291466, 'train_time_std': 1.4472790491207064}
compute_loss hs sg train_data train_time_mean train_time_std
- 4 True 0 1 25kB 0.571752 0.001023
- 5 False 0 1 25kB 0.573699 0.007401
- 6 True 1 1 25kB 1.108973 0.029924
- 7 False 1 1 25kB 1.206842 0.006783
- 0 True 0 0 25kB 0.330777 0.005787
- 1 False 0 0 25kB 0.331449 0.004202
- 2 True 1 0 25kB 0.521375 0.008048
- 3 False 1 0 25kB 0.529302 0.005368
- 12 True 0 1 1MB 1.811711 0.010813
- 13 False 0 1 1MB 1.814316 0.026406
- 14 True 1 1 1MB 3.584581 0.089683
- 15 False 1 1 1MB 3.616798 0.146094
- 8 True 0 0 1MB 0.913957 0.045411
- 9 False 0 0 1MB 0.915258 0.051911
- 10 True 1 0 1MB 1.670329 0.112930
- 11 False 1 0 1MB 1.583152 0.045773
- 20 True 0 1 10MB 21.687038 0.326133
- 21 False 0 1 10MB 21.280882 0.128858
- 22 True 1 1 10MB 43.119692 0.813379
- 23 False 1 1 10MB 40.592944 0.476226
- 16 True 0 0 10MB 8.021462 0.215931
- 17 False 0 0 10MB 7.931290 0.250841
- 18 True 1 0 10MB 15.515336 0.885736
- 19 False 1 0 10MB 15.930209 0.641705
+ 4 True 0 1 25kB 0.472116 0.015137
+ 5 False 0 1 25kB 0.469522 0.003345
+ 6 True 1 1 25kB 0.950259 0.005153
+ 7 False 1 1 25kB 0.942416 0.009776
+ 0 True 0 0 25kB 0.252174 0.020227
+ 1 False 0 0 25kB 0.258985 0.026276
+ 2 True 1 0 25kB 0.419408 0.002198
+ 3 False 1 0 25kB 0.430876 0.001000
+ 12 True 0 1 1MB 1.506507 0.036966
+ 13 False 0 1 1MB 1.537814 0.010207
+ 14 True 1 1 1MB 3.302257 0.045232
+ 15 False 1 1 1MB 3.492871 0.193276
+ 8 True 0 0 1MB 0.644114 0.009346
+ 9 False 0 0 1MB 0.656217 0.027036
+ 10 True 1 0 1MB 1.315072 0.094572
+ 11 False 1 0 1MB 1.205833 0.005159
+ 20 True 0 1 10MB 20.357308 0.412410
+ 21 False 0 1 10MB 21.380845 1.690947
+ 22 True 1 1 10MB 44.487718 1.131427
+ 23 False 1 1 10MB 44.517535 1.447279
+ 16 True 0 0 10MB 7.446084 0.789432
+ 17 False 0 0 10MB 7.060013 0.213669
+ 18 True 1 0 10MB 14.277136 0.744163
+ 19 False 1 0 10MB 13.758649 0.373940
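
The mean/std columns above are wall-clock training times over repeated runs. A hedged sketch of how such numbers could be gathered (the parameter grid and number of repeats here are illustrative, not the benchmark script that produced this table):

```python
import time
import numpy as np
from gensim.models import Word2Vec

def time_training(sentences, sg, hs, compute_loss, runs=3):
    """Return (mean, std) of wall-clock training time over `runs` repetitions."""
    durations = []
    for _ in range(runs):
        start = time.time()
        Word2Vec(sentences=sentences, sg=sg, hs=hs, compute_loss=compute_loss)
        durations.append(time.time() - start)
    return np.mean(durations), np.std(durations)
```
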
@@ -1333,9 +1043,9 @@ Links
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 11 minutes 46.634 seconds)
+ **Total running time of the script:** ( 11 minutes 26.674 seconds)
-**Estimated memory usage:** 6399 MB
+**Estimated memory usage:** 7177 MB
.. _sphx_glr_download_auto_examples_tutorials_run_word2vec.py:
diff --git a/docs/src/auto_examples/tutorials/sg_execution_times.rst b/docs/src/auto_examples/tutorials/sg_execution_times.rst
index af55f3f18a..7003c2957e 100644
--- a/docs/src/auto_examples/tutorials/sg_execution_times.rst
+++ b/docs/src/auto_examples/tutorials/sg_execution_times.rst
@@ -5,18 +5,18 @@
Computation times
=================
-**00:07.863** total execution time for **auto_examples_tutorials** files:
+**11:26.674** total execution time for **auto_examples_tutorials** files:
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_annoy.py` (``run_annoy.py``) | 14:40.672 | 752.8 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` (``run_doc2vec_lee.py``) | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_fasttext.py` (``run_fasttext.py``) | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_lda.py` (``run_lda.py``) | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_wmd.py` (``run_wmd.py``) | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_word2vec.py` (``run_word2vec.py``) | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------+-----------+----------+
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_word2vec.py` (``run_word2vec.py``) | 11:26.674 | 7177.5 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_annoy.py` (``run_annoy.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` (``run_doc2vec_lee.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_fasttext.py` (``run_fasttext.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_lda.py` (``run_lda.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_wmd.py` (``run_wmd.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+-----------+
diff --git a/docs/src/gallery/core/run_corpora_and_vector_spaces.py b/docs/src/gallery/core/run_corpora_and_vector_spaces.py
index 5a77b4e637..0a49614123 100644
--- a/docs/src/gallery/core/run_corpora_and_vector_spaces.py
+++ b/docs/src/gallery/core/run_corpora_and_vector_spaces.py
@@ -136,7 +136,7 @@
from smart_open import open # for transparently opening remote files
-class MyCorpus(object):
+class MyCorpus:
def __iter__(self):
for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
# assume there's one document per line, tokens separated by whitespace
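
A corpus class like this is consumed lazily, one document per iteration, so the full file never has to fit in memory. A usage sketch (the exact item yielded depends on the body of `__iter__`, which is elided in this hunk):

```python
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory
for doc in corpus_memory_friendly:   # streams one document at a time
    print(doc)
```
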
diff --git a/docs/src/gallery/tutorials/run_word2vec.py b/docs/src/gallery/tutorials/run_word2vec.py
index 01b0e2bb86..c5ef323bb2 100644
--- a/docs/src/gallery/tutorials/run_word2vec.py
+++ b/docs/src/gallery/tutorials/run_word2vec.py
@@ -197,8 +197,8 @@
from gensim.test.utils import datapath
from gensim import utils
-class MyCorpus(object):
- """An interator that yields sentences (lists of str)."""
+class MyCorpus:
+ """An iterator that yields sentences (lists of str)."""
def __iter__(self):
corpus_path = datapath('lee_background.cor')
diff --git a/gensim/examples/dmlcz/dmlcorpus.py b/gensim/examples/dmlcz/dmlcorpus.py
index a0d9007fa3..d76c622c95 100644
--- a/gensim/examples/dmlcz/dmlcorpus.py
+++ b/gensim/examples/dmlcz/dmlcorpus.py
@@ -20,7 +20,7 @@
logger = logging.getLogger('gensim.corpora.dmlcorpus')
-class DmlConfig(object):
+class DmlConfig:
"""
DmlConfig contains parameters necessary for the abstraction of a 'corpus of
articles' (see the `DmlCorpus` class).
diff --git a/gensim/examples/dmlcz/sources.py b/gensim/examples/dmlcz/sources.py
index c9782f80c4..4c6eb8a048 100644
--- a/gensim/examples/dmlcz/sources.py
+++ b/gensim/examples/dmlcz/sources.py
@@ -32,7 +32,7 @@
logger = logging.getLogger('gensim.corpora.sources')
-class ArticleSource(object):
+class ArticleSource:
"""
Objects of this class describe a single source of articles.
diff --git a/gensim/matutils.py b/gensim/matutils.py
index dbdd3f1439..48a7ee180a 100644
--- a/gensim/matutils.py
+++ b/gensim/matutils.py
@@ -336,7 +336,7 @@ def scipy2sparse(vec, eps=1e-9):
return [(int(pos), float(val)) for pos, val in zip(vec.indices, vec.data) if np.abs(val) > eps]
-class Scipy2Corpus(object):
+class Scipy2Corpus:
"""Convert a sequence of dense/sparse vectors into a streamed Gensim corpus object.
See Also
@@ -508,7 +508,7 @@ def corpus2dense(corpus, num_terms, num_docs=None, dtype=np.float32):
return result.astype(dtype)
-class Dense2Corpus(object):
+class Dense2Corpus:
"""Treat dense numpy array as a streamed Gensim corpus in the bag-of-words format.
Notes
@@ -555,7 +555,7 @@ def __len__(self):
return len(self.dense)
-class Sparse2Corpus(object):
+class Sparse2Corpus:
"""Convert a matrix in scipy.sparse format into a streaming Gensim corpus.
See Also
@@ -1132,7 +1132,7 @@ def qr_destroy(la):
return q, r
-class MmWriter(object):
+class MmWriter:
"""Store a corpus in `Matrix Market format `_,
using :class:`~gensim.corpora.mmcorpus.MmCorpus`.
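
These matutils wrappers keep the same behaviour after the `object` base class was dropped; for instance, `Sparse2Corpus` and `Dense2Corpus` still stream matrices as bag-of-words documents. A small illustrative sketch:

```python
import numpy as np
import scipy.sparse
from gensim.matutils import Dense2Corpus, Sparse2Corpus

sparse = scipy.sparse.random(5, 4, density=0.5, format='csc')  # terms x documents
corpus_from_sparse = Sparse2Corpus(sparse, documents_columns=True)

dense = np.random.rand(4, 5)                                   # documents x terms
corpus_from_dense = Dense2Corpus(dense, documents_columns=False)

for doc in corpus_from_sparse:
    print(doc)  # list of (term_id, value) pairs
```
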
diff --git a/gensim/models/basemodel.py b/gensim/models/basemodel.py
index 04422f8199..01466f68f9 100644
--- a/gensim/models/basemodel.py
+++ b/gensim/models/basemodel.py
@@ -1,4 +1,5 @@
-class BaseTopicModel(object):
+class BaseTopicModel:
+
def print_topic(self, topicno, topn=10):
"""Get a single topic as a formatted string.
diff --git a/gensim/models/callbacks.py b/gensim/models/callbacks.py
index cefdd33091..ab2bb05d8e 100644
--- a/gensim/models/callbacks.py
+++ b/gensim/models/callbacks.py
@@ -61,7 +61,7 @@
...
>>>
>>> epoch_logger = EpochLogger()
- >>> w2v_model = Word2Vec(common_texts, iter=5, size=10, min_count=0, seed=42, callbacks=[epoch_logger])
+ >>> w2v_model = Word2Vec(common_texts, epochs=5, vector_size=10, min_count=0, seed=42, callbacks=[epoch_logger])
Epoch #0 start
Epoch #0 end
Epoch #1 start
@@ -106,7 +106,7 @@
VISDOM_INSTALLED = False
-class Metric(object):
+class Metric:
"""Base Metric class for topic model evaluation metrics.
Concrete implementations include:
@@ -442,7 +442,7 @@ def get_value(self, **kwargs):
return np.sum(diff_diagonal)
-class Callback(object):
+class Callback:
"""A class representing routines called reactively at specific phases during trained.
These can be used to log or visualize the training progress using any of the metric scores developed before.
@@ -568,7 +568,7 @@ def on_epoch_end(self, epoch, topics=None):
return current_metrics
-class CallbackAny2Vec(object):
+class CallbackAny2Vec:
"""Base class to build callbacks for :class:`~gensim.models.word2vec.Word2Vec` & subclasses.
Callbacks are used to apply custom functions over the model at specific points
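
For reference, the `EpochLogger` used in the doctest above is a `CallbackAny2Vec` subclass along these lines (a sketch consistent with the callback hooks, not necessarily the module's verbatim example):

```python
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    """Print a message at the start and end of each training epoch."""

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print("Epoch #{} start".format(self.epoch))

    def on_epoch_end(self, model):
        print("Epoch #{} end".format(self.epoch))
        self.epoch += 1
```
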
diff --git a/gensim/models/doc2vec.py b/gensim/models/doc2vec.py
index 9d8489657e..51fdfdce43 100644
--- a/gensim/models/doc2vec.py
+++ b/gensim/models/doc2vec.py
@@ -751,7 +751,7 @@ def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True, prefix='*
@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions. "
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
def init_sims(self, replace=False):
"""
@@ -1085,7 +1085,7 @@ class Doc2VecTrainables(utils.SaveLoad):
"""Obsolete class retained for now as load-compatibility state capture"""
-class TaggedBrownCorpus(object):
+class TaggedBrownCorpus:
def __init__(self, dirname):
"""Reader for the `Brown corpus (part of NLTK data) `_.
@@ -1123,7 +1123,7 @@ def __iter__(self):
yield TaggedDocument(words, ['%s_SENT_%s' % (fname, item_no)])
-class TaggedLineDocument(object):
+class TaggedLineDocument:
def __init__(self, source):
"""Iterate over a file that contains documents: one line = :class:`~gensim.models.doc2vec.TaggedDocument` object.
diff --git a/gensim/models/fasttext.py b/gensim/models/fasttext.py
index 460a1682f5..38dd4172f6 100644
--- a/gensim/models/fasttext.py
+++ b/gensim/models/fasttext.py
@@ -30,7 +30,7 @@
.. sourcecode:: pycon
- >>> # from gensim.models import FastText # FIXME: why does Sphinx dislike this import?
+ >>> from gensim.models import FastText
>>> from gensim.test.utils import common_texts # some example sentences
>>>
>>> print(common_texts[0])
@@ -50,16 +50,7 @@
.. sourcecode:: pycon
- >>> model2 = FastText(vector_size=4, window=3, min_count=1, sentences=common_texts, iter=10)
-
-.. Important::
- This style of initialize-and-train in a single line is **deprecated**. We include it here
- for backward compatibility only.
-
- Please use the initialize-`build_vocab`-`train` pattern above instead, including using `epochs`
- instead of `iter`.
- The motivation is to simplify the API and resolve naming inconsistencies,
- e.g. the iter parameter to the constructor is called epochs in the train function.
+ >>> model2 = FastText(vector_size=4, window=3, min_count=1, sentences=common_texts, epochs=10)
The two models above are instantiated differently, but behave identically.
For example, we can compare the embeddings they've calculated for the word "computer":
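
The comparison itself is elided from this hunk; a hedged sketch of what it could look like (whether the vectors match to the stated tolerance depends on the determinism of the two training runs):

```python
import numpy as np

# model and model2 were trained on the same data with the same hyperparameters.
np.allclose(model.wv['computer'], model2.wv['computer'], atol=1e-4)
```
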
@@ -108,7 +99,7 @@
>>> from gensim import utils
>>>
>>>
- >>> class MyIter(object):
+ >>> class MyIter:
... def __iter__(self):
... path = datapath('crime-and-punishment.txt')
... with utils.open(path, 'r', encoding='utf-8') as fin:
@@ -139,7 +130,7 @@
>>> import numpy as np
>>>
- >>> 'computation' in model.wv.vocab # New word, currently out of vocab
+ >>> 'computation' in model.wv.key_to_index # New word, currently out of vocab
False
>>> old_vector = np.copy(model.wv['computation']) # Grab the existing vector
>>> new_sentences = [
@@ -157,7 +148,7 @@
>>> new_vector = model.wv['computation']
>>> np.allclose(old_vector, new_vector, atol=1e-4) # Vector has changed, model has learnt something
False
- >>> 'computation' in model.wv.vocab # Word is still out of vocab
+ >>> 'computation' in model.wv.key_to_index # Word is still out of vocab
False
.. Important::
@@ -178,7 +169,7 @@
.. sourcecode:: pycon
- >>> 'computer' in fb_model.wv.vocab # New word, currently out of vocab
+ >>> 'computer' in fb_model.wv.key_to_index # New word, currently out of vocab
False
>>> old_computer = np.copy(fb_model.wv['computer']) # Calculate current vectors
>>> fb_model.build_vocab(new_sentences, update=True)
@@ -186,7 +177,7 @@
>>> new_computer = fb_model.wv['computer']
>>> np.allclose(old_computer, new_computer, atol=1e-4) # Vector has changed, model has learnt something
False
- >>> 'computer' in fb_model.wv.vocab # New word is now in the vocabulary
+ >>> 'computer' in fb_model.wv.key_to_index # New word is now in the vocabulary
True
If you do not intend to continue training the model, consider using the
@@ -200,25 +191,25 @@
>>> cap_path = datapath("crime-and-punishment.bin")
>>> wv = load_facebook_vectors(cap_path)
>>>
- >>> 'landlord' in wv.vocab # Word is out of vocabulary
+ >>> 'landlord' in wv.key_to_index # Word is out of vocabulary
False
- >>> oov_vector = wv['landlord']
+ >>> oov_vector = wv['landlord'] # Even OOV words have vectors in FastText
>>>
- >>> 'landlady' in wv.vocab # Word is in the vocabulary
+ >>> 'landlady' in wv.key_to_index # Word is in the vocabulary
True
>>> iv_vector = wv['landlady']
-Retrieve word-vector for vocab and out-of-vocab word:
+Retrieve word vectors for in-vocabulary and out-of-vocabulary words:
.. sourcecode:: pycon
>>> existent_word = "computer"
- >>> existent_word in model.wv.vocab
+ >>> existent_word in model.wv.key_to_index
True
>>> computer_vec = model.wv[existent_word] # numpy vector of a word
>>>
>>> oov_word = "graph-out-of-vocab"
- >>> oov_word in model.wv.vocab
+ >>> oov_word in model.wv.key_to_index
False
>>> oov_vec = model.wv[oov_word] # numpy vector for OOV word
@@ -488,9 +479,9 @@ def estimate_memory(self, vocab_size=None, report=None):
hashes = ft_ngram_hashes(word, self.wv.min_n, self.wv.max_n, self.wv.bucket)
num_ngrams += len(hashes)
# A list (64 bytes) with one np.array (100 bytes) per key, with a total of
- # num_ngrams uint32s (4 bytes) amongst them
- # Only used during training, not stored with the model
- report['buckets_word'] = 64 + (100 * len(self.wv)) + (4 * num_ngrams) # FIXME: caching & calc sensible?
+ # num_ngrams uint32s (4 bytes) amongst them.
+ # Only used during training, not stored with the model.
+ report['buckets_word'] = 64 + (100 * len(self.wv)) + (4 * num_ngrams) # TODO: caching & calc sensible?
report['total'] = sum(report.values())
logger.info(
"estimated required memory for %i words, %i buckets and %i dimensions: %i bytes",
@@ -541,7 +532,7 @@ def _do_train_job(self, sentences, alpha, inits):
@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions. "
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
def init_sims(self, replace=False):
"""
@@ -699,11 +690,11 @@ def load_facebook_model(path, encoding='utf-8'):
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)
>>>
- >>> 'landlord' in fb_model.wv.vocab # Word is out of vocabulary
+ >>> 'landlord' in fb_model.wv.key_to_index # Word is out of vocabulary
False
>>> oov_term = fb_model.wv['landlord']
>>>
- >>> 'landlady' in fb_model.wv.vocab # Word is in the vocabulary
+ >>> 'landlady' in fb_model.wv.key_to_index # Word is in the vocabulary
True
>>> iv_term = fb_model.wv['landlady']
>>>
@@ -764,11 +755,11 @@ def load_facebook_vectors(path, encoding='utf-8'):
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fbkv = load_facebook_vectors(cap_path)
>>>
- >>> 'landlord' in fbkv.vocab # Word is out of vocabulary
+ >>> 'landlord' in fbkv.key_to_index # Word is out of vocabulary
False
>>> oov_vector = fbkv['landlord']
>>>
- >>> 'landlady' in fbkv.vocab # Word is in the vocabulary
+ >>> 'landlady' in fbkv.key_to_index # Word is in the vocabulary
True
>>> iv_vector = fbkv['landlady']
@@ -1193,7 +1184,7 @@ def recalc_char_ngram_buckets(self):
Scan the vocabulary, calculate ngrams and their hashes, and cache the list of ngrams for each known word.
"""
- # FIXME: evaluate if precaching even necessary, compared to recalculating as needed
+ # TODO: evaluate if precaching even necessary, compared to recalculating as needed.
if self.bucket == 0:
self.buckets_word = [np.array([], dtype=np.uint32)] * len(self.index_to_key)
return
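A minimal sketch (again, not part of the patch) of the `.vocab` → `.key_to_index` migration that the fasttext.py docstrings above now demonstrate, assuming Gensim 4.0.0beta:

```python
from gensim.models import FastText
from gensim.test.utils import common_texts

model = FastText(vector_size=4, window=3, min_count=1, sentences=common_texts, epochs=10)

word = "computation"
print(word in model.wv.key_to_index)  # Gensim 3.x spelling: word in model.wv.vocab
print(model.wv[word].shape)           # FastText still returns an ngram-based vector for OOV words
```
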
diff --git a/gensim/models/hdpmodel.py b/gensim/models/hdpmodel.py
index c29e0a0737..5e0fbfe3e2 100755
--- a/gensim/models/hdpmodel.py
+++ b/gensim/models/hdpmodel.py
@@ -140,7 +140,7 @@ def lda_e_step(doc_word_ids, doc_word_counts, alpha, beta, max_iter=100):
return likelihood, gamma
-class SuffStats(object):
+class SuffStats:
"""Stores sufficient statistics for the current chunk of document(s) whenever Hdp model is updated with new corpus.
These stats are used when updating lambda and top level sticks. The statistics include number of documents in the
chunk, length of words in the documents and top level truncation level.
@@ -953,7 +953,7 @@ def evaluate_test_corpus(self, corpus):
return score
-class HdpTopicFormatter(object):
+class HdpTopicFormatter:
"""Helper class for :class:`gensim.models.hdpmodel.HdpModel` to format the output of topics."""
(STYLE_GENSIM, STYLE_PRETTY) = (1, 2)
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index 3a92c24f62..193c5f8f0f 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -67,7 +67,7 @@
>>> from gensim.test.utils import lee_corpus_list
>>> from gensim.models import Word2Vec
>>>
- >>> model = Word2Vec(lee_corpus_list, size=24, epochs=100)
+ >>> model = Word2Vec(lee_corpus_list, vector_size=24, epochs=100)
>>> word_vectors = model.wv
Persist the word vectors to disk with
@@ -215,7 +215,7 @@ def __init__(self, vector_size, count=0, dtype=np.float32, mapfile_path=None):
Vector dimensions will default to `np.float32` (AKA `REAL` in some Gensim code) unless
another type is provided here.
mapfile_path : string, optional
- FIXME: UNDER CONSTRUCTION / WILL CHANGE PRE-4.0.0 PER #2955 / #2975
+ FIXME: UNDER CONSTRUCTION / WILL CHANGE PRE-4.0.0 PER #2955 / #2975.
"""
self.vector_size = vector_size
# pre-allocating `index_to_key` to full size helps avoid redundant re-allocations, esp for `expandos`
@@ -587,7 +587,7 @@ def vectors_norm(self):
raise AttributeError(
"The `.vectors_norm` attribute is computed dynamically since Gensim 4.0.0. "
"Use `.get_normed_vectors()` instead.\n"
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
@vectors_norm.setter
@@ -625,7 +625,7 @@ def fill_norms(self, force=False):
def index2entity(self):
raise AttributeError(
"The index2entity attribute has been replaced by index_to_key since Gensim 4.0.0.\n"
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
@index2entity.setter
@@ -636,7 +636,7 @@ def index2entity(self, value):
def index2word(self):
raise AttributeError(
"The index2word attribute has been replaced by index_to_key since Gensim 4.0.0.\n"
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
@index2word.setter
@@ -649,7 +649,7 @@ def vocab(self):
"The vocab attribute was removed from KeyedVector in Gensim 4.0.0.\n"
"Use KeyedVector's .key_to_index dict, .index_to_key list, and methods "
".get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.\n"
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
@vocab.setter
@@ -974,7 +974,7 @@ def most_similar_cosmul(self, positive=None, negative=None, topn=10):
one-dimensional numpy array with the size of the vocabulary.
"""
- # FIXME: Update to better match & share code with most_similar()
+ # TODO: Update to better match & share code with most_similar()
if isinstance(topn, Integral) and topn < 1:
return []
@@ -1435,7 +1435,7 @@ def evaluate_word_pairs(
@deprecated(
"Use fill_norms() instead. "
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
def init_sims(self, replace=False):
"""Precompute data helpful for bulk similarity calculations.
@@ -1510,9 +1510,9 @@ def save_word2vec_format(
Parameters
----------
fname : str
- The file path used to save the vectors in.
+ File path to save the vectors to.
fvocab : str, optional
- File path used to save the vocabulary.
+ File path to save additional vocabulary information to. `None` to not store the vocabulary.
binary : bool, optional
If True, the data wil be saved in binary word2vec format, else it will be saved in plain text.
total_vec : int, optional
@@ -1520,49 +1520,69 @@ def save_word2vec_format(
(in case word vectors are appended with document vectors afterwards).
write_header : bool, optional
If False, don't write the 1st line declaring the count of vectors and dimensions.
- FIXME: doc prefix, append, sort_attr
+ This is the format used by e.g. GloVe vectors.
+ prefix : str, optional
+ String to prepend in front of each stored word. Default = no prefix.
+ append : bool, optional
+ If set, open `fname` in `ab` mode instead of the default `wb` mode.
+ sort_attr : str, optional
+ Sort the output vectors in descending order of this attribute. Default: most frequent keys first.
+
"""
if total_vec is None:
total_vec = len(self.index_to_key)
mode = 'wb' if not append else 'ab'
- if 'count' in self.expandos:
- # if frequency-info available, store in most-to-least-frequent order
+
+ if sort_attr in self.expandos:
store_order_vocab_keys = sorted(self.key_to_index.keys(), key=lambda k: -self.get_vecattr(k, sort_attr))
else:
+ # This can happen even for the default `count`: the "native C word2vec" format does not store counts,
+ # so models loaded via load_word2vec_format() do not have the "count" attribute set. They have
+ # no attributes at all, and fall under this code path.
+ if fvocab is not None:
+ raise ValueError(f"Cannot store vocabulary with '{sort_attr}' because that attribute does not exist")
+ logger.warning(
+ "attribute %s not present in %s; will store in internal index_to_key order",
+ sort_attr, self,
+ )
store_order_vocab_keys = self.index_to_key
if fvocab is not None:
logger.info("storing vocabulary in %s", fvocab)
with utils.open(fvocab, mode) as vout:
for word in store_order_vocab_keys:
- vout.write(utils.to_utf8("%s%s %s\n" % (prefix, word, self.get_vecattr(word, sort_attr))))
+ vout.write(f"{prefix}{word} {self.get_vecattr(word, sort_attr)}\n".encode('utf8'))
logger.info("storing %sx%s projection weights into %s", total_vec, self.vector_size, fname)
assert (len(self.index_to_key), self.vector_size) == self.vectors.shape
- # after (possibly-empty) initial range of int-only keys,
- # store in sorted order: most frequent keys at the top
+ # After (possibly-empty) initial range of int-only keys in Doc2Vec,
+ # store in sorted order: most frequent keys at the top.
+ # XXX: get rid of this: not used much, too complex and brittle.
+ # See https://github.com/RaRe-Technologies/gensim/pull/2981#discussion_r512969788
index_id_count = 0
for i, val in enumerate(self.index_to_key):
- if not (i == val):
+ if i != val:
break
index_id_count += 1
keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys)
+ # Store the actual vectors to the output file, in the order defined by sort_attr.
with utils.open(fname, mode) as fout:
if write_header:
- fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
+ fout.write(f"{total_vec} {self.vector_size}\n".encode('utf8'))
for key in keys_to_write:
- row = self[key]
+ key_vector = self[key]
if binary:
- row = row.astype(REAL)
- fout.write(utils.to_utf8(prefix + str(key)) + b" " + row.tobytes())
+ fout.write(f"{prefix}{key} ".encode('utf8') + key_vector.astype(REAL).tobytes())
else:
- fout.write(utils.to_utf8("%s%s %s\n" % (prefix, str(key), ' '.join(repr(val) for val in row))))
+ fout.write(f"{prefix}{key} {' '.join(repr(val) for val in key_vector)}\n".encode('utf8'))
@classmethod
- def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
- limit=None, datatype=REAL, no_header=False):
+ def load_word2vec_format(
+ cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
+ limit=None, datatype=REAL, no_header=False,
+ ):
"""Load the input-hidden weight matrix from the original C word2vec-tool format.
Warnings
@@ -1607,7 +1627,8 @@ def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8',
"""
return _load_word2vec_format(
cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
- limit=limit, datatype=datatype, no_header=no_header)
+ limit=limit, datatype=datatype, no_header=no_header,
+ )
def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict'):
"""Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format,
@@ -1699,7 +1720,7 @@ def similarity_unseen_docs(self, *args, **kwargs):
EuclideanKeyedVectors = KeyedVectors
-class CompatVocab(object):
+class CompatVocab:
def __init__(self, **kwargs):
"""A single vocabulary item, used internally for collecting per-word frequency/sampling info,
and for constructing binary trees (incl. both word leaves and inner nodes).
@@ -1811,8 +1832,10 @@ def _word2vec_detect_sizes_text(fin, limit, datatype, unicode_errors, encoding):
return vocab_size, vector_size
-def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
- limit=sys.maxsize, datatype=REAL, no_header=False, binary_chunk_size=100 * 1024):
+def _load_word2vec_format(
+ cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
+ limit=sys.maxsize, datatype=REAL, no_header=False, binary_chunk_size=100 * 1024,
+ ):
"""Load the input-hidden weight matrix from the original C word2vec-tool format.
Note that the information stored in the file is incomplete (the binary tree is missing),
@@ -1850,7 +1873,6 @@ def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8'
Returns the loaded model as an instance of :class:`cls`.
"""
-
counts = None
if fvocab is not None:
logger.info("loading word counts from %s", fvocab)
@@ -1872,15 +1894,14 @@ def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8'
fin = utils.open(fname, 'rb')
else:
header = utils.to_unicode(fin.readline(), encoding=encoding)
- vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
+ vocab_size, vector_size = [int(x) for x in header.split()] # throws for invalid file format
if limit:
vocab_size = min(vocab_size, limit)
kv = cls(vector_size, vocab_size, dtype=datatype)
if binary:
_word2vec_read_binary(
- fin, kv, counts,
- vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size,
+ fin, kv, counts, vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size,
)
else:
_word2vec_read_text(fin, kv, counts, vocab_size, vector_size, datatype, unicode_errors, encoding)
@@ -1915,9 +1936,11 @@ def pseudorandom_weak_vector(size, seed_string=None, hashfxn=hash):
def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
- """FIXME: NAME/DOCS CHANGES PRE-4.0.0 FOR #2955/#2975 MMAP & OTHER INITIALIZATION CLEANUP WORK
- Return a numpy array of the given shape. Reuse prior_vectors object or values
- to extent possible. Initialize new values randomly if requested."""
+ """Return a numpy array of the given shape. Reuse prior_vectors object or values
+ to extent possible. Initialize new values randomly if requested.
+
+ FIXME: NAME/DOCS CHANGES PRE-4.0.0 FOR #2955/#2975 MMAP & OTHER INITIALIZATION CLEANUP WORK.
+ """
if prior_vectors is None:
prior_vectors = np.zeros((0, 0))
if prior_vectors.shape == target_shape:
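A minimal sketch of the `save_word2vec_format()` / `load_word2vec_format()` round trip touched above, assuming Gensim 4.0.0beta; the temporary file name is arbitrary and the comments follow the docstrings in this patch:

```python
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import common_texts, get_tmpfile

model = Word2Vec(common_texts, vector_size=20, min_count=1)
fname = get_tmpfile("vectors.txt")

# Plain-text word2vec format; keys are written most-frequent-first by default (see `sort_attr` above).
model.wv.save_word2vec_format(fname, binary=False)

# The plain word2vec format stores no counts, so the reloaded KeyedVectors has no 'count'
# attribute and would take the warning branch above if saved again with an fvocab file.
kv = KeyedVectors.load_word2vec_format(fname, binary=False)
print(len(kv), kv.vector_size)
```
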
diff --git a/gensim/models/lda_dispatcher.py b/gensim/models/lda_dispatcher.py
index f81cd806eb..41dc3e632b 100755
--- a/gensim/models/lda_dispatcher.py
+++ b/gensim/models/lda_dispatcher.py
@@ -87,7 +87,7 @@
LDA_DISPATCHER_PREFIX = 'gensim.lda_dispatcher'
-class Dispatcher(object):
+class Dispatcher:
"""Dispatcher object that communicates and coordinates individual workers.
Warnings
diff --git a/gensim/models/lda_worker.py b/gensim/models/lda_worker.py
index cac24c2698..25d787738e 100755
--- a/gensim/models/lda_worker.py
+++ b/gensim/models/lda_worker.py
@@ -78,7 +78,7 @@
LDA_WORKER_PREFIX = 'gensim.lda_worker'
-class Worker(object):
+class Worker:
"""Used as a Pyro4 class with exposed methods.
Exposes every non-private method and property of the class automatically to be available for remote access.
diff --git a/gensim/models/lsi_worker.py b/gensim/models/lsi_worker.py
index 4a38ba8e2d..a3b5845f19 100755
--- a/gensim/models/lsi_worker.py
+++ b/gensim/models/lsi_worker.py
@@ -1,194 +1,193 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-#
-# Copyright (C) 2010 Radim Rehurek
-# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
-
-"""Worker ("slave") process used in computing distributed Latent Semantic Indexing (LSI,
-:class:`~gensim.models.lsimodel.LsiModel`) models.
-
-Run this script on every node in your cluster. If you wish, you may even run it multiple times on a single machine,
-to make better use of multiple cores (just beware that memory footprint increases linearly).
-
-
-How to use distributed LSI
---------------------------
-
-#. Install needed dependencies (Pyro4) ::
-
- pip install gensim[distributed]
-
-#. Setup serialization (on each machine) ::
-
- export PYRO_SERIALIZERS_ACCEPTED=pickle
- export PYRO_SERIALIZER=pickle
-
-#. Run nameserver ::
-
- python -m Pyro4.naming -n 0.0.0.0 &
-
-#. Run workers (on each machine) ::
-
- python -m gensim.models.lsi_worker &
-
-#. Run dispatcher ::
-
- python -m gensim.models.lsi_dispatcher &
-
-#. Run :class:`~gensim.models.lsimodel.LsiModel` in distributed mode:
-
- .. sourcecode:: pycon
-
- >>> from gensim.test.utils import common_corpus, common_dictionary
- >>> from gensim.models import LsiModel
- >>>
- >>> model = LsiModel(common_corpus, id2word=common_dictionary, distributed=True)
-
-
-Command line arguments
-----------------------
-
-.. program-output:: python -m gensim.models.lsi_worker --help
- :ellipsis: 0, -3
-
-"""
-
-from __future__ import with_statement
-import os
-import sys
-import logging
-import argparse
-import threading
-import tempfile
-try:
- import Queue
-except ImportError:
- import queue as Queue
-import Pyro4
-from gensim.models import lsimodel
-from gensim import utils
-
-logger = logging.getLogger(__name__)
-
-
-SAVE_DEBUG = 0 # save intermediate models after every SAVE_DEBUG updates (0 for never)
-
-
-class Worker(object):
- def __init__(self):
- """Partly initialize the model.
-
- A full initialization requires a call to :meth:`~gensim.models.lsi_worker.Worker.initialize`.
-
- """
- self.model = None
-
- @Pyro4.expose
- def initialize(self, myid, dispatcher, **model_params):
- """Fully initialize the worker.
-
- Parameters
- ----------
- myid : int
- An ID number used to identify this worker in the dispatcher object.
- dispatcher : :class:`~gensim.models.lsi_dispatcher.Dispatcher`
- The dispatcher responsible for scheduling this worker.
- **model_params
- Keyword parameters to initialize the inner LSI model, see :class:`~gensim.models.lsimodel.LsiModel`.
-
- """
- self.lock_update = threading.Lock()
- self.jobsdone = 0 # how many jobs has this worker completed?
- # id of this worker in the dispatcher; just a convenience var for easy access/logging TODO remove?
- self.myid = myid
- self.dispatcher = dispatcher
- self.finished = False
- logger.info("initializing worker #%s", myid)
- self.model = lsimodel.LsiModel(**model_params)
-
- @Pyro4.expose
- @Pyro4.oneway
- def requestjob(self):
- """Request jobs from the dispatcher, in a perpetual loop until :meth:`~gensim.models.lsi_worker.Worker.getstate`
- is called.
-
- Raises
- ------
- RuntimeError
- If `self.model` is None (i.e. worker not initialized).
-
- """
- if self.model is None:
- raise RuntimeError("worker must be initialized before receiving jobs")
-
- job = None
- while job is None and not self.finished:
- try:
- job = self.dispatcher.getjob(self.myid)
- except Queue.Empty:
- # no new job: try again, unless we're finished with all work
- continue
- if job is not None:
- logger.info("worker #%s received job #%i", self.myid, self.jobsdone)
- self.processjob(job)
- self.dispatcher.jobdone(self.myid)
- else:
- logger.info("worker #%i stopping asking for jobs", self.myid)
-
- @utils.synchronous('lock_update')
- def processjob(self, job):
- """Incrementally process the job and potentially logs progress.
-
- Parameters
- ----------
- job : iterable of list of (int, float)
- Corpus in BoW format.
-
- """
- self.model.add_documents(job)
- self.jobsdone += 1
- if SAVE_DEBUG and self.jobsdone % SAVE_DEBUG == 0:
- fname = os.path.join(tempfile.gettempdir(), 'lsi_worker.pkl')
- self.model.save(fname)
-
- @Pyro4.expose
- @utils.synchronous('lock_update')
- def getstate(self):
- """Log and get the LSI model's current projection.
-
- Returns
- -------
- :class:`~gensim.models.lsimodel.Projection`
- The current projection.
-
- """
- logger.info("worker #%i returning its state after %s jobs", self.myid, self.jobsdone)
- assert isinstance(self.model.projection, lsimodel.Projection)
- self.finished = True
- return self.model.projection
-
- @Pyro4.expose
- @utils.synchronous('lock_update')
- def reset(self):
- """Reset the worker by deleting its current projection."""
- logger.info("resetting worker #%i", self.myid)
- self.model.projection = self.model.projection.empty_like()
- self.finished = False
-
- @Pyro4.oneway
- def exit(self):
- """Terminate the worker."""
- logger.info("terminating worker #%i", self.myid)
- os._exit(0)
-
-
-if __name__ == '__main__':
- """The main script. """
- logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
-
- parser = argparse.ArgumentParser(description=__doc__[:-135], formatter_class=argparse.RawTextHelpFormatter)
- _ = parser.parse_args()
-
- logger.info("running %s", " ".join(sys.argv))
- utils.pyro_daemon('gensim.lsi_worker', Worker(), random_suffix=True)
- logger.info("finished running %s", parser.prog)
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2010 Radim Rehurek
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+
+"""Worker ("slave") process used in computing distributed Latent Semantic Indexing (LSI,
+:class:`~gensim.models.lsimodel.LsiModel`) models.
+
+Run this script on every node in your cluster. If you wish, you may even run it multiple times on a single machine,
+to make better use of multiple cores (just beware that memory footprint increases linearly).
+
+
+How to use distributed LSI
+--------------------------
+
+#. Install needed dependencies (Pyro4) ::
+
+ pip install gensim[distributed]
+
+#. Setup serialization (on each machine) ::
+
+ export PYRO_SERIALIZERS_ACCEPTED=pickle
+ export PYRO_SERIALIZER=pickle
+
+#. Run nameserver ::
+
+ python -m Pyro4.naming -n 0.0.0.0 &
+
+#. Run workers (on each machine) ::
+
+ python -m gensim.models.lsi_worker &
+
+#. Run dispatcher ::
+
+ python -m gensim.models.lsi_dispatcher &
+
+#. Run :class:`~gensim.models.lsimodel.LsiModel` in distributed mode:
+
+ .. sourcecode:: pycon
+
+ >>> from gensim.test.utils import common_corpus, common_dictionary
+ >>> from gensim.models import LsiModel
+ >>>
+ >>> model = LsiModel(common_corpus, id2word=common_dictionary, distributed=True)
+
+
+Command line arguments
+----------------------
+
+.. program-output:: python -m gensim.models.lsi_worker --help
+ :ellipsis: 0, -3
+
+"""
+
+import os
+import sys
+import logging
+import argparse
+import threading
+import tempfile
+import queue as Queue
+
+import Pyro4
+
+from gensim.models import lsimodel
+from gensim import utils
+
+
+logger = logging.getLogger(__name__)
+
+
+SAVE_DEBUG = 0 # save intermediate models after every SAVE_DEBUG updates (0 for never)
+
+
+class Worker:
+ def __init__(self):
+ """Partly initialize the model.
+
+ A full initialization requires a call to :meth:`~gensim.models.lsi_worker.Worker.initialize`.
+
+ """
+ self.model = None
+
+ @Pyro4.expose
+ def initialize(self, myid, dispatcher, **model_params):
+ """Fully initialize the worker.
+
+ Parameters
+ ----------
+ myid : int
+ An ID number used to identify this worker in the dispatcher object.
+ dispatcher : :class:`~gensim.models.lsi_dispatcher.Dispatcher`
+ The dispatcher responsible for scheduling this worker.
+ **model_params
+ Keyword parameters to initialize the inner LSI model, see :class:`~gensim.models.lsimodel.LsiModel`.
+
+ """
+ self.lock_update = threading.Lock()
+ self.jobsdone = 0 # how many jobs has this worker completed?
+ # id of this worker in the dispatcher; just a convenience var for easy access/logging TODO remove?
+ self.myid = myid
+ self.dispatcher = dispatcher
+ self.finished = False
+ logger.info("initializing worker #%s", myid)
+ self.model = lsimodel.LsiModel(**model_params)
+
+ @Pyro4.expose
+ @Pyro4.oneway
+ def requestjob(self):
+ """Request jobs from the dispatcher, in a perpetual loop until :meth:`~gensim.models.lsi_worker.Worker.getstate`
+ is called.
+
+ Raises
+ ------
+ RuntimeError
+ If `self.model` is None (i.e. worker not initialized).
+
+ """
+ if self.model is None:
+ raise RuntimeError("worker must be initialized before receiving jobs")
+
+ job = None
+ while job is None and not self.finished:
+ try:
+ job = self.dispatcher.getjob(self.myid)
+ except Queue.Empty:
+ # no new job: try again, unless we're finished with all work
+ continue
+ if job is not None:
+ logger.info("worker #%s received job #%i", self.myid, self.jobsdone)
+ self.processjob(job)
+ self.dispatcher.jobdone(self.myid)
+ else:
+ logger.info("worker #%i stopping asking for jobs", self.myid)
+
+ @utils.synchronous('lock_update')
+ def processjob(self, job):
+ """Incrementally process the job and potentially logs progress.
+
+ Parameters
+ ----------
+ job : iterable of list of (int, float)
+ Corpus in BoW format.
+
+ """
+ self.model.add_documents(job)
+ self.jobsdone += 1
+ if SAVE_DEBUG and self.jobsdone % SAVE_DEBUG == 0:
+ fname = os.path.join(tempfile.gettempdir(), 'lsi_worker.pkl')
+ self.model.save(fname)
+
+ @Pyro4.expose
+ @utils.synchronous('lock_update')
+ def getstate(self):
+ """Log and get the LSI model's current projection.
+
+ Returns
+ -------
+ :class:`~gensim.models.lsimodel.Projection`
+ The current projection.
+
+ """
+ logger.info("worker #%i returning its state after %s jobs", self.myid, self.jobsdone)
+ assert isinstance(self.model.projection, lsimodel.Projection)
+ self.finished = True
+ return self.model.projection
+
+ @Pyro4.expose
+ @utils.synchronous('lock_update')
+ def reset(self):
+ """Reset the worker by deleting its current projection."""
+ logger.info("resetting worker #%i", self.myid)
+ self.model.projection = self.model.projection.empty_like()
+ self.finished = False
+
+ @Pyro4.oneway
+ def exit(self):
+ """Terminate the worker."""
+ logger.info("terminating worker #%i", self.myid)
+ os._exit(0)
+
+
+if __name__ == '__main__':
+ """The main script. """
+ logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
+
+ parser = argparse.ArgumentParser(description=__doc__[:-135], formatter_class=argparse.RawTextHelpFormatter)
+ _ = parser.parse_args()
+
+ logger.info("running %s", " ".join(sys.argv))
+ utils.pyro_daemon('gensim.lsi_worker', Worker(), random_suffix=True)
+ logger.info("finished running %s", parser.prog)
diff --git a/gensim/models/poincare.py b/gensim/models/poincare.py
index 050aa52e9b..136fd6b6d5 100644
--- a/gensim/models/poincare.py
+++ b/gensim/models/poincare.py
@@ -699,7 +699,7 @@ def _train_batchwise(self, epochs, batch_size=10, print_every=1000, check_gradie
avg_loss = 0.0
-class PoincareBatch(object):
+class PoincareBatch:
"""Compute Poincare distances, gradients and loss for a training batch.
Store intermediate state to avoid recomputing multiple times.
@@ -1305,7 +1305,7 @@ def difference_in_hierarchy(self, node_or_vector_1, node_or_vector_2):
return self.norm(node_or_vector_2) - self.norm(node_or_vector_1)
-class PoincareRelations(object):
+class PoincareRelations:
"""Stream relations for `PoincareModel` from a tsv-like file."""
def __init__(self, file_path, encoding='utf8', delimiter='\t'):
@@ -1354,7 +1354,7 @@ def __iter__(self):
yield tuple(row)
-class NegativesBuffer(object):
+class NegativesBuffer:
"""Buffer and return negative samples."""
def __init__(self, items):
@@ -1405,7 +1405,7 @@ def get_items(self, num_items):
return self._items[start_index:end_index]
-class ReconstructionEvaluation(object):
+class ReconstructionEvaluation:
"""Evaluate reconstruction on given network for given embedding."""
def __init__(self, file_path, embedding):
@@ -1508,7 +1508,7 @@ def evaluate_mean_rank_and_map(self, max_n=None):
return np.mean(ranks), np.mean(avg_precision_scores)
-class LinkPredictionEvaluation(object):
+class LinkPredictionEvaluation:
"""Evaluate reconstruction on given network for given embedding."""
def __init__(self, train_path, test_path, embedding):
@@ -1619,7 +1619,7 @@ def evaluate_mean_rank_and_map(self, max_n=None):
return np.mean(ranks), np.mean(avg_precision_scores)
-class LexicalEntailmentEvaluation(object):
+class LexicalEntailmentEvaluation:
"""Evaluate reconstruction on given network for any embedding."""
def __init__(self, filepath):
diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 05b7f43cd6..c53d252bf4 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -861,7 +861,7 @@ def update_weights(self):
@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions. "
- "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
+ "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
)
def init_sims(self, replace=False):
"""
@@ -1856,8 +1856,8 @@ def __str__(self):
and learning rate.
"""
- return "%s(vocab=%s, size=%s, alpha=%s)" % (
- self.__class__.__name__, len(self.wv.index_to_key), self.wv.vector_size, self.alpha
+ return "%s(vocab=%s, vector_size=%s, alpha=%s)" % (
+ self.__class__.__name__, len(self.wv.index_to_key), self.wv.vector_size, self.alpha,
)
def save(self, *args, **kwargs):
@@ -1965,7 +1965,7 @@ def get_latest_training_loss(self):
return self.running_training_loss
-class BrownCorpus(object):
+class BrownCorpus:
def __init__(self, dirname):
"""Iterate over sentences from the `Brown corpus `_
(part of `NLTK data `_).
@@ -1991,7 +1991,7 @@ def __iter__(self):
yield words
-class Text8Corpus(object):
+class Text8Corpus:
def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
"""Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip."""
self.fname = fname
@@ -2019,7 +2019,7 @@ def __iter__(self):
sentence = sentence[self.max_sentence_length:]
-class LineSentence(object):
+class LineSentence:
def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
"""Iterate over a file that contains sentences: one line = one sentence.
Words must be already preprocessed and separated by whitespace.
@@ -2068,7 +2068,7 @@ def __iter__(self):
i += self.max_sentence_length
-class PathLineSentences(object):
+class PathLineSentences:
def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
"""Like :class:`~gensim.models.word2vec.LineSentence`, but process all files in a directory
in alphabetical order by filename.
@@ -2259,9 +2259,9 @@ def _assign_binary_codes(wv):
corpus = LineSentence(args.train)
model = Word2Vec(
- corpus, size=args.size, min_count=args.min_count, workers=args.threads,
+ corpus, vector_size=args.size, min_count=args.min_count, workers=args.threads,
window=args.window, sample=args.sample, sg=skipgram, hs=args.hs,
- negative=args.negative, cbow_mean=1, iter=args.iter
+ negative=args.negative, cbow_mean=1, epochs=args.iter,
)
if args.output:
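A minimal sketch of the constructor renames applied throughout this patch (`size` → `vector_size`, `iter` → `epochs`), assuming Gensim 4.0.0beta; the toy sentences are the ones used in the annoy/nmslib docstrings below:

```python
from gensim.models import Word2Vec

sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]

# Gensim 3.x: Word2Vec(sentences, size=100, iter=10, min_count=1)
model = Word2Vec(sentences, vector_size=100, epochs=10, min_count=1)
print(model)  # e.g. "Word2Vec(vocab=..., vector_size=100, alpha=0.025)", per the updated __str__ above
```
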
diff --git a/gensim/models/wrappers/wordrank.py b/gensim/models/wrappers/wordrank.py
index ba49d73c14..6de3c256ad 100644
--- a/gensim/models/wrappers/wordrank.py
+++ b/gensim/models/wrappers/wordrank.py
@@ -49,13 +49,12 @@
import os
import copy
import multiprocessing
+from shutil import copyfile, rmtree
from gensim import utils
from gensim.models.keyedvectors import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
-from shutil import copyfile, rmtree
-
logger = logging.getLogger(__name__)
diff --git a/gensim/parsing/porter.py b/gensim/parsing/porter.py
index 528103a874..81c465e1b7 100644
--- a/gensim/parsing/porter.py
+++ b/gensim/parsing/porter.py
@@ -30,7 +30,7 @@
"""
-class PorterStemmer(object):
+class PorterStemmer:
"""Class contains implementation of Porter stemming algorithm.
Attributes
diff --git a/gensim/scripts/word2vec_standalone.py b/gensim/scripts/word2vec_standalone.py
index 57f4d907ba..22be887cd1 100644
--- a/gensim/scripts/word2vec_standalone.py
+++ b/gensim/scripts/word2vec_standalone.py
@@ -120,9 +120,9 @@
corpus = LineSentence(args.train)
model = Word2Vec(
- corpus, size=args.size, min_count=args.min_count, workers=args.threads,
+ corpus, vector_size=args.size, min_count=args.min_count, workers=args.threads,
window=args.window, sample=args.sample, alpha=args.alpha, sg=skipgram,
- hs=args.hs, negative=args.negative, cbow_mean=1, iter=args.iter
+ hs=args.hs, negative=args.negative, cbow_mean=1, epochs=args.iter,
)
if args.output:
diff --git a/gensim/similarities/annoy.py b/gensim/similarities/annoy.py
index eaebaaa770..b237c11a99 100644
--- a/gensim/similarities/annoy.py
+++ b/gensim/similarities/annoy.py
@@ -123,7 +123,7 @@ def load(self, fname):
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
- >>> model = Word2Vec(sentences, min_count=1, seed=1, iter=10)
+ >>> model = Word2Vec(sentences, min_count=1, seed=1, epochs=10)
>>>
>>> indexer = AnnoyIndexer(model, 2)
>>> _, temp_fn = mkstemp()
diff --git a/gensim/similarities/docsim.py b/gensim/similarities/docsim.py
index 14ada07904..bce59620b7 100755
--- a/gensim/similarities/docsim.py
+++ b/gensim/similarities/docsim.py
@@ -890,7 +890,7 @@ class SoftCosineSimilarity(interfaces.SimilarityABC):
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>>
- >>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
+ >>> model = Word2Vec(common_texts, vector_size=20, min_count=1) # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
@@ -1006,7 +1006,7 @@ class WmdSimilarity(interfaces.SimilarityABC):
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import WmdSimilarity
>>>
- >>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
+ >>> model = Word2Vec(common_texts, vector_size=20, min_count=1) # train word-vectors
>>>
>>> index = WmdSimilarity(common_texts, model)
>>> # Make query.
diff --git a/gensim/similarities/nmslib.py b/gensim/similarities/nmslib.py
index 7ff78539c1..620f32e519 100644
--- a/gensim/similarities/nmslib.py
+++ b/gensim/similarities/nmslib.py
@@ -26,7 +26,7 @@
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
- >>> model = Word2Vec(sentences, min_count=1, iter=10, seed=2)
+ >>> model = Word2Vec(sentences, min_count=1, epochs=10, seed=2)
>>>
>>> indexer = NmslibIndexer(model)
>>> model.wv.most_similar("cat", topn=2, indexer=indexer)
@@ -42,7 +42,7 @@
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
- >>> model = Word2Vec(sentences, min_count=1, seed=2, iter=10)
+ >>> model = Word2Vec(sentences, min_count=1, seed=2, epochs=10)
>>>
>>> indexer = NmslibIndexer(model)
>>> _, temp_fn = mkstemp()
diff --git a/gensim/similarities/termsim.py b/gensim/similarities/termsim.py
index 3949d77960..c047587339 100644
--- a/gensim/similarities/termsim.py
+++ b/gensim/similarities/termsim.py
@@ -410,7 +410,7 @@ class SparseTermSimilarityMatrix(SaveLoad):
>>> from gensim.similarities.index import AnnoyIndexer
>>> from scikits.sparse.cholmod import cholesky
>>>
- >>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
+ >>> model = Word2Vec(common_texts, vector_size=20, min_count=1) # train word-vectors
>>> annoy = AnnoyIndexer(model, num_trees=2) # use annoy for faster word similarity lookups
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv, kwargs={'indexer': annoy})
>>> dictionary = Dictionary(common_texts)
diff --git a/gensim/sklearn_api/d2vmodel.py b/gensim/sklearn_api/d2vmodel.py
index 9f01f9818b..660d101131 100644
--- a/gensim/sklearn_api/d2vmodel.py
+++ b/gensim/sklearn_api/d2vmodel.py
@@ -15,7 +15,7 @@
>>> from gensim.test.utils import common_texts
>>> from gensim.sklearn_api import D2VTransformer
>>>
- >>> model = D2VTransformer(min_count=1, size=5)
+ >>> model = D2VTransformer(min_count=1, vector_size=5)
>>> docvecs = model.fit_transform(common_texts) # represent `common_texts` as vectors
"""
diff --git a/gensim/test/svd_error.py b/gensim/test/svd_error.py
index e6ab11bb78..1763866e00 100755
--- a/gensim/test/svd_error.py
+++ b/gensim/test/svd_error.py
@@ -72,7 +72,7 @@ def print_error(name, aat, u, s, ideal_nf, ideal_n2):
sys.stdout.flush()
-class ClippedCorpus(object):
+class ClippedCorpus:
def __init__(self, corpus, max_docs, max_terms):
self.corpus = corpus
self.max_docs, self.max_terms = max_docs, max_terms
diff --git a/gensim/test/test_big.py b/gensim/test/test_big.py
index f422953d18..1716285c10 100644
--- a/gensim/test/test_big.py
+++ b/gensim/test/test_big.py
@@ -19,7 +19,7 @@
from gensim.test.utils import get_tmpfile
-class BigCorpus(object):
+class BigCorpus:
"""A corpus of a large number of docs & large vocab"""
def __init__(self, words_only=False, num_terms=200000, num_docs=1000000, doc_len=100):
@@ -46,7 +46,7 @@ class TestLargeData(unittest.TestCase):
def testWord2Vec(self):
corpus = BigCorpus(words_only=True, num_docs=100000, num_terms=3000000, doc_len=200)
tmpf = get_tmpfile('gensim_big.tst')
- model = gensim.models.Word2Vec(corpus, size=300, workers=4)
+ model = gensim.models.Word2Vec(corpus, vector_size=300, workers=4)
model.save(tmpf, ignore=['syn1'])
del model
gensim.models.Word2Vec.load(tmpf)
diff --git a/gensim/test/test_corpora.py b/gensim/test/test_corpora.py
index e13e06ca36..611cc875eb 100644
--- a/gensim/test/test_corpora.py
+++ b/gensim/test/test_corpora.py
@@ -30,7 +30,7 @@
AZURE = bool(os.environ.get('PIPELINE_WORKSPACE'))
-class DummyTransformer(object):
+class DummyTransformer:
def __getitem__(self, bow):
if len(next(iter(bow))) == 2:
# single bag of words
diff --git a/gensim/test/test_d2vmodel.py b/gensim/test/test_d2vmodel.py
index 44d3d1612c..aa24203277 100644
--- a/gensim/test/test_d2vmodel.py
+++ b/gensim/test/test_d2vmodel.py
@@ -48,7 +48,7 @@ def __iter__(self):
class TestD2VTransformer(unittest.TestCase):
def TestWorksWithIterableNotHavingElementWithZeroIndex(self):
a = IterableWithoutZeroElement(common_texts)
- transformer = D2VTransformer(min_count=1, size=5)
+ transformer = D2VTransformer(min_count=1, vector_size=5)
transformer.fit(a)
diff --git a/gensim/test/test_fasttext.py b/gensim/test/test_fasttext.py
index 3d3537d03e..4864802b5c 100644
--- a/gensim/test/test_fasttext.py
+++ b/gensim/test/test_fasttext.py
@@ -18,8 +18,9 @@
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText as FT_gensim, FastTextKeyedVectors, _unpack
from gensim.models.keyedvectors import KeyedVectors
-from gensim.test.utils import datapath, get_tmpfile, temporary_file, common_texts as sentences, \
- lee_corpus_list as list_corpus
+from gensim.test.utils import (
+ datapath, get_tmpfile, temporary_file, common_texts as sentences, lee_corpus_list as list_corpus,
+)
from gensim.test.test_word2vec import TestWord2VecModel
import gensim.models._fasttext_bin
from gensim.models.fasttext_inner import compute_ngrams, compute_ngrams_bytes, ft_hash_bytes
@@ -813,7 +814,7 @@ def test_estimate_memory(self):
self.assertEqual(report['syn0_vocab'], 192)
self.assertEqual(report['syn1'], 192)
self.assertEqual(report['syn1neg'], 192)
- # FIXME: these fixed numbers for particular implementation generations encumber changes without real QA
+ # TODO: these fixed numbers for particular implementation generations encumber changes without real QA
# perhaps instead verify reports' total is within some close factor of a deep-audit of actual memory used?
self.assertEqual(report['syn0_ngrams'], model.vector_size * np.dtype(np.float32).itemsize * BUCKET)
self.assertEqual(report['buckets_word'], 688)
@@ -996,10 +997,10 @@ def test_continuation_native(self):
self.model_structural_sanity(native)
#
- # Pick a word that's is in both corpuses.
+ # Pick a word that is in both corpuses.
# Its vectors should be different between training runs.
#
- word = 'human' # FIXME: this isn't actually in model, except via OOV ngrams
+ word = 'society'
old_vector = native.wv.get_vector(word).tolist()
native.train(list_corpus, total_examples=len(list_corpus), epochs=native.epochs)
diff --git a/gensim/test/test_ldamodel.py b/gensim/test/test_ldamodel.py
index f1c17ac0c9..2b92e3887b 100644
--- a/gensim/test/test_ldamodel.py
+++ b/gensim/test/test_ldamodel.py
@@ -292,7 +292,7 @@ def testGetDocumentTopics(self):
self.assertTrue(isinstance(phi_values, list))
# word_topics looks like this: ({word_id => [topic_id_most_probable, topic_id_second_most_probable, ...]).
- # we check one case in word_topics, i.e of the first word in the doc, and it's likely topics.
+ # we check one case in word_topics, i.e. the first word in the doc and its likely topics.
# FIXME: Fails on osx and win
# expected_word = 0
diff --git a/gensim/test/test_probability_estimation.py b/gensim/test/test_probability_estimation.py
index 1e674415f3..73820d8df5 100644
--- a/gensim/test/test_probability_estimation.py
+++ b/gensim/test/test_probability_estimation.py
@@ -16,7 +16,7 @@
from gensim.topic_coherence import probability_estimation
-class BaseTestCases(object):
+class BaseTestCases:
class ProbabilityEstimationBase(unittest.TestCase):
texts = [
diff --git a/gensim/test/test_similarities.py b/gensim/test/test_similarities.py
index 6a0321fdbe..819493a3fe 100644
--- a/gensim/test/test_similarities.py
+++ b/gensim/test/test_similarities.py
@@ -556,7 +556,7 @@ def testWord2Vec(self):
self.assertLoadedIndexEqual(index, model)
def testFastText(self):
- class LeeReader(object):
+ class LeeReader:
def __init__(self, fn):
self.fn = fn
@@ -715,7 +715,7 @@ def test_word2vec(self):
self.assertLoadedIndexEqual(index, model)
def test_fasttext(self):
- class LeeReader(object):
+ class LeeReader:
def __init__(self, fn):
self.fn = fn
diff --git a/gensim/test/test_text_analysis.py b/gensim/test/test_text_analysis.py
index 83df8ece57..2f4524aacc 100644
--- a/gensim/test/test_text_analysis.py
+++ b/gensim/test/test_text_analysis.py
@@ -8,7 +8,7 @@
from gensim.test.utils import common_texts
-class BaseTestCases(object):
+class BaseTestCases:
class TextAnalyzerTestBase(unittest.TestCase):
texts = [
diff --git a/gensim/test/test_word2vec.py b/gensim/test/test_word2vec.py
index a9e57036f0..c8219cdddd 100644
--- a/gensim/test/test_word2vec.py
+++ b/gensim/test/test_word2vec.py
@@ -409,8 +409,10 @@ def testPersistenceWord2VecFormatWithVocab(self):
testvocab = get_tmpfile('gensim_word2vec.vocab')
model.wv.save_word2vec_format(tmpf, testvocab, binary=True)
binary_model_with_vocab_kv = keyedvectors.KeyedVectors.load_word2vec_format(tmpf, testvocab, binary=True)
- self.assertEqual(model.wv.get_vecattr('human', 'count'),
- binary_model_with_vocab_kv.get_vecattr('human', 'count'))
+ self.assertEqual(
+ model.wv.get_vecattr('human', 'count'),
+ binary_model_with_vocab_kv.get_vecattr('human', 'count'),
+ )
def testPersistenceKeyedVectorsFormatWithVocab(self):
"""Test storing/loading the entire model and vocabulary in word2vec format."""
@@ -419,8 +421,10 @@ def testPersistenceKeyedVectorsFormatWithVocab(self):
testvocab = get_tmpfile('gensim_word2vec.vocab')
model.wv.save_word2vec_format(tmpf, testvocab, binary=True)
kv_binary_model_with_vocab = keyedvectors.KeyedVectors.load_word2vec_format(tmpf, testvocab, binary=True)
- self.assertEqual(model.wv.get_vecattr('human', 'count'),
- kv_binary_model_with_vocab.get_vecattr('human', 'count'))
+ self.assertEqual(
+ model.wv.get_vecattr('human', 'count'),
+ kv_binary_model_with_vocab.get_vecattr('human', 'count'),
+ )
def testPersistenceWord2VecFormatCombinationWithStandardPersistence(self):
"""Test storing/loading the entire model and vocabulary in word2vec format chained with
diff --git a/gensim/test/utils.py b/gensim/test/utils.py
index 158c0989b8..526d3a436a 100644
--- a/gensim/test/utils.py
+++ b/gensim/test/utils.py
@@ -208,7 +208,7 @@ def temporary_file(name=""):
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
-class LeeCorpus(object):
+class LeeCorpus:
def __iter__(self):
with open(datapath('lee_background.cor')) as f:
for line in f:
diff --git a/gensim/topic_coherence/indirect_confirmation_measure.py b/gensim/topic_coherence/indirect_confirmation_measure.py
index 76077813d0..3e96c8fbc2 100644
--- a/gensim/topic_coherence/indirect_confirmation_measure.py
+++ b/gensim/topic_coherence/indirect_confirmation_measure.py
@@ -182,7 +182,7 @@ def cosine_similarity(segmented_topics, accumulator, topics, measure='nlr',
return topic_coherences
-class ContextVectorComputer(object):
+class ContextVectorComputer:
"""Lazily compute context vectors for topic segments.
Parameters
diff --git a/gensim/topic_coherence/probability_estimation.py b/gensim/topic_coherence/probability_estimation.py
index 27aa95b80d..6296437a94 100644
--- a/gensim/topic_coherence/probability_estimation.py
+++ b/gensim/topic_coherence/probability_estimation.py
@@ -1,7 +1,6 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
-# Copyright (C) 2013 Radim Rehurek
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
"""This module contains functions to perform segmentation on a list of topics."""
@@ -11,7 +10,8 @@
from gensim.topic_coherence.text_analysis import (
CorpusAccumulator, WordOccurrenceAccumulator, ParallelWordOccurrenceAccumulator,
- WordVectorsAccumulator)
+ WordVectorsAccumulator,
+)
logger = logging.getLogger(__name__)
@@ -218,7 +218,7 @@ def p_word2vec(texts, segmented_topics, dictionary, window_size=None, processes=
... ['human', 'interface', 'computer'],
... ['survey', 'user', 'computer', 'system', 'response', 'time']
... ]
- >>> model = word2vec.Word2Vec(sentences, size=100, min_count=1)
+ >>> model = word2vec.Word2Vec(sentences, vector_size=100, min_count=1)
>>> accumulator = probability_estimation.p_word2vec(texts, segmented_topics, dictionary, 2, 1, model)
"""
diff --git a/gensim/topic_coherence/text_analysis.py b/gensim/topic_coherence/text_analysis.py
index fb4fda99b8..83cbdc6471 100644
--- a/gensim/topic_coherence/text_analysis.py
+++ b/gensim/topic_coherence/text_analysis.py
@@ -67,7 +67,7 @@ def _ids_to_words(ids, dictionary):
return top_words
-class BaseAnalyzer(object):
+class BaseAnalyzer:
"""Base class for corpus and text analyzers.
Attributes
diff --git a/gensim/utils.py b/gensim/utils.py
index 49877249f1..ba6171f109 100644
--- a/gensim/utils.py
+++ b/gensim/utils.py
@@ -1679,9 +1679,9 @@ def lemmatize(content, allowed_tags=re.compile(r'(NN|VB|JJ|RB)'), light=False,
import warnings
warnings.warn("The light flag is no longer supported by pattern.")
- # tokenization in `pattern` is weird; it gets thrown off by non-letters,
- # producing '==relate/VBN' or '**/NN'... try to preprocess the text a little
- # FIXME this throws away all fancy parsing cues, including sentence structure,
+ # Tokenization in `pattern` is weird; it gets thrown off by non-letters,
+ # producing '==relate/VBN' or '**/NN'... try to preprocess the text a little.
+ # XXX: this throws away all fancy parsing cues, including sentence structure,
# abbreviations etc.
content = ' '.join(tokenize(content, lower=True, errors='ignore'))
diff --git a/release/generate_changelog.py b/release/generate_changelog.py
new file mode 100644
index 0000000000..72b03c7cda
--- /dev/null
+++ b/release/generate_changelog.py
@@ -0,0 +1,86 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Author: Gensim Contributors
+# Copyright (C) 2020 RaRe Technologies s.r.o.
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+
+"""Generate changelog entries for all PRs merged since the last release."""
+import re
+import requests
+
+
+#
+# The releases get sorted in reverse chronological order, so the first release
+# in the list is the most recent.
+#
+get = requests.get('https://api.github.com/repos/RaRe-Technologies/gensim/releases')
+get.raise_for_status()
+most_recent_release = get.json()[0]
+release_timestamp = most_recent_release['published_at']
+
+
+def iter_merged_prs(since=release_timestamp):
+ page = 1
+ while True:
+ get = requests.get(
+ 'https://api.github.com/repos/RaRe-Technologies/gensim/pulls',
+ params={'state': 'closed', 'page': page},
+ )
+ get.raise_for_status()
+ pulls = get.json()
+ if not pulls:
+ break
+
+ for i, pr in enumerate(pulls):
+ if pr['merged_at'] and pr['merged_at'] > since:
+ yield pr
+
+ page += 1
+
+
+def iter_closed_issues(since=release_timestamp):
+ page = 1
+ while True:
+ get = requests.get(
+ 'https://api.github.com/repos/RaRe-Technologies/gensim/issues',
+ params={'state': 'closed', 'page': page, 'since': since},
+ )
+ get.raise_for_status()
+ issues = get.json()
+ if not issues:
+ break
+
+ for i, issue in enumerate(issues):
+ #
+ # In the GitHub API, all pull requests are issues, but not vice versa.
+ #
+ if 'pull_request' not in issue and issue['closed_at'] > since:
+ yield issue
+ page += 1
+
+
+fixed_issue_numbers = set()
+for pr in iter_merged_prs(since=release_timestamp):
+ pr['user_login'] = pr['user']['login']
+ pr['user_html_url'] = pr['user']['html_url']
+ print('* [#%(number)d](%(html_url)s): %(title)s, by [@%(user_login)s](%(user_html_url)s)' % pr)
+
+ #
+ # Unfortunately, the GitHub API doesn't link PRs to issues that they fix,
+ # so we have to do it ourselves.
+ #
+ for match in re.finditer(r'fix(es)? #(?P<number>\d+)\b', pr['body'], flags=re.IGNORECASE):
+ fixed_issue_numbers.add(int(match.group('number')))
+
+
+print()
+print('### :question: Closed issues')
+print()
+print('TODO: move each issue to its appropriate section or delete if irrelevant')
+print()
+
+for issue in iter_closed_issues(since=release_timestamp):
+ if 'pull_request' in issue or issue['number'] in fixed_issue_numbers:
+ continue
+ print('* [#%(number)d](%(html_url)s): %(title)s' % issue)
diff --git a/tox.ini b/tox.ini
index 1d0c0b0e09..f73dedf3d7 100644
--- a/tox.ini
+++ b/tox.ini
@@ -5,20 +5,29 @@ skipsdist = True
platform = linux: linux
win: win64
+
[flake8]
ignore = E12, W503
max-line-length = 120
show-source = True
+
[flake8-rst]
filename = *.rst *.py
max-line-length = 120
-ignore = F821 ; TODO remove me when all examples in docstrings will be executable
-exclude=.venv, .git, .tox, dist, doc, build, gensim/models/deprecated
+ignore = E203, # space before :
+ E402, # module level import not at top of file
+ # Classes / functions in a docstring block generate those errors
+ E302, # expected 2 blank lines, found 0
+ E305, # expected 2 blank lines after class or function definition, found 0
+ F821, # undefined name; remove once all docstrings are fully executable
+exclude = .venv, .git, .tox, dist, doc, build, gensim/models/deprecated
+
[pytest]
addopts = -rfxEXs --durations=20 --showlocals
+
[testenv]
recreate = True
@@ -51,14 +60,19 @@ commands =
[testenv:flake8]
recreate = True
deps =
- flake8==3.7.9 # 3.8.0 triggers "AttributeError: 'Namespace' object has no attribute 'output_file'"
+ # Pinned to 3.7.9 because >3.8.0 triggers "AttributeError: 'Namespace' object has no attribute 'output_file'"
+ # in flake8-rst. Apparently some bug in flake8-rst:
+ # https://gitlab.com/pycqa/flake8/-/issues/641
+ # https://github.com/kataev/flake8-rst/pull/23/files
+ flake8==3.7.9
commands = flake8 gensim/ {posargs}
+
[testenv:flake8-docs]
recreate = True
deps =
- flake8-rst==0.4.3
+ flake8-rst==0.7.2
flake8==3.7.9
commands = flake8-rst gensim/ docs/ {posargs}