This is a word aligner for English: given two English sentences, it aligns related words in the two sentences. It exploits the semantic and contextual similarities of the words to make alignment decisions.
This project started as a fork of ma-sultan/monolingual-word-aligner, the aligner presented in Sultan et al. (2015), which has been very successful in the SemEval STS (Semantic Textual Similarity) task in recent years.
In 2016, the UWB team (Brychcin and Svoboda, 2016) improved the aligner by introducing IDF weighting into its Jaccard-based similarity formula, but they did not release their source code. This repository therefore shares an implementation of that improvement.
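To give a rough idea of what IDF weighting means here, below is a minimal, hypothetical sketch of an IDF-weighted Jaccard-style overlap between two token sets. It is not the exact UWB formulation (see Brychcin and Svoboda, 2016 for that), and the toy `idf` table and the `idf_weighted_jaccard` helper are illustrative assumptions only.

```python
# Hypothetical sketch of an IDF-weighted Jaccard-style overlap between two token sets.
# This is NOT the exact UWB formula; see Brychcin and Svoboda (2016) for the real one.

def idf_weighted_jaccard(tokens1, tokens2, idf, default_idf=1.0):
    """Weight each token by its IDF instead of counting it as 1."""
    set1, set2 = set(tokens1), set(tokens2)
    intersection = sum(idf.get(w, default_idf) for w in set1 & set2)
    union = sum(idf.get(w, default_idf) for w in set1 | set2)
    return intersection / union if union else 0.0

# Toy IDF table (in practice the weights would come from a large corpus).
idf = {"the": 0.1, "cat": 2.3, "sat": 1.7, "on": 0.2, "mat": 2.9}
print(idf_weighted_jaccard("the cat sat on the mat".split(),
                           "the cat sat".split(), idf))
```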
In the `docs/` directory, you can find the papers cited above.
The results of the two implementations on the SemEval-2016 STS task cross-lingual track evaluation data are reported below.
Method | News | Multi-Src | Mean |
---|---|---|---|
The initial implementation of ma-sultan | 0.89604 | 0.71850 | 0.80831 |
The implementation with IDF weighting | 0.90601 | 0.81447 | 0.86078 |
The results of the two implementations on the SemEval-2017 STS task Spanish-English cross-lingual track evaluation data are reported below.
Method | track4a | track4b | Mean |
---|---|---|---|
The initial implementation of ma-sultan | 0.66961 | 0.08250 | 0.37605 |
The implementation with IDF weighting | 0.76006 | 0.12447 | 0.44226 |
In the `semeval_data/` directory, you can find all the data needed to reproduce these tests yourself.
For the 2016 evaluation, there are two sets of data, called `news` and `multisource`.
For the 2017 evaluation, there are two sets of data, called `track4a` and `track4b`.
The gold standard files (expected scores) for the four sets are also in the directory.
You can verify the correlation between the output of the aligner and the corresponding gold standard file with the correlation Perl script, as follows:
`perl correlation.pl STS.gs.XXX.txt your_output_for_XXX.txt`
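If you prefer Python over Perl, a roughly equivalent check can be sketched as follows. This hypothetical `pearson_check.py` helper is not part of the repository; it assumes both files contain one numeric score per line, in the same order, and simply computes their Pearson correlation.

```python
# pearson_check.py -- hypothetical alternative to correlation.pl.
# Assumes both input files contain one numeric score per line, in the same order.
import sys
from math import sqrt

def read_scores(path):
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

if __name__ == "__main__":
    gold = read_scores(sys.argv[1])      # e.g. STS.gs.XXX.txt
    system = read_scores(sys.argv[2])    # e.g. your_output_for_XXX.txt
    print("Pearson correlation: %.5f" % pearson(gold, system))
```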
- Python NLTK
- The Python wrapper for Stanford CoreNLP
- Install the above tools.
- Change line 107 of corenlp.py from `rel, left, right = map(lambda x: remove_id(x), split_entry)` to `rel, left, right = split_entry`.
- Install the NLTK stopword corpus and jsonrpclib:
`python -m nltk.downloader stopwords`
`sudo pip install jsonrpclib`
- Download the aligner:
`git clone https://github.com/FerreroJeremy/monolingual-word-aligner.git`
- Run the `corenlp.py` script to launch the server (a minimal sketch of querying this server over JSON-RPC is given after this list):
`python stanford-corenlp-python/corenlp.py`
- Wait for the models to load; once loading is complete, you should see the following in the terminal:
`Loading Models: 5/5`
`INFO:__main__:Serving on http://127.0.0.1:8080`
- In another terminal, run the `testAlign_idf.py` script to launch the comparison between the files specified in the source code:
`python testAlign_idf.py`
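For reference, once the server is running you can also query it directly over JSON-RPC. The minimal sketch below is illustrative only: it assumes the default address shown above (http://127.0.0.1:8080) and that the wrapper exposes a `parse` method returning a JSON string, as described in the stanford-corenlp-python documentation.

```python
# Hypothetical JSON-RPC client for the stanford-corenlp-python wrapper.
# Assumes the server launched above is listening on 127.0.0.1:8080 and
# exposes a `parse` method that returns a JSON string.
import json
import jsonrpclib

server = jsonrpclib.Server("http://127.0.0.1:8080")
result = json.loads(server.parse("The cat sat on the mat."))
print(result)  # parsed output (tokens, POS tags, parses, ...) used by the aligner
```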