This is a word aligner for English: given two English sentences, it aligns related words in the two sentences. It exploits the semantic and contextual similarities of the words to make alignment decisions.
This project started as a fork of ma-sultan/monolingual-word-aligner, the aligner presented in Sultan et al. (2015), which has been very successful in the SemEval STS (Semantic Textual Similarity) task in recent years.
In 2016, the UWB team (Brychcin and Svoboda, 2016) improved the aligner by introducing IDF weighting into its Jaccard-based similarity formula, but they did not release their source code. This repository therefore shares an implementation of that improvement.
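To give a rough idea of what IDF weighting means here, below is a minimal, hypothetical sketch of an IDF-weighted Jaccard-style overlap between two token sets. It is not the exact UWB formulation (see Brychcin and Svoboda, 2016 for that), and the toy `idf` table and the `idf_weighted_jaccard` helper are illustrative assumptions only.

```python
# Hypothetical sketch of an IDF-weighted Jaccard-style overlap between two token sets.
# This is NOT the exact UWB formula; see Brychcin and Svoboda (2016) for the real one.

def idf_weighted_jaccard(tokens1, tokens2, idf, default_idf=1.0):
    """Weight each token by its IDF instead of counting it as 1."""
    set1, set2 = set(tokens1), set(tokens2)
    intersection = sum(idf.get(w, default_idf) for w in set1 & set2)
    union = sum(idf.get(w, default_idf) for w in set1 | set2)
    return intersection / union if union else 0.0

# Toy IDF table (in practice the weights would come from a large corpus).
idf = {"the": 0.1, "cat": 2.3, "sat": 1.7, "on": 0.2, "mat": 2.9}
print(idf_weighted_jaccard("the cat sat on the mat".split(),
                           "the cat sat".split(), idf))
```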
In the `docs/` directory, you can find the papers cited above.
The results of the two implementations on the SemEval-2016 STS task cross-lingual track evaluation data are reported below.
Method | News | Multi-Src | Mean |
---|---|---|---|
The initial implementation of ma-sultan | 0.89604 | 0.71850 | 0.80831 |
The implementation with IDF weighting | 0.90601 | 0.81447 | 0.86078 |
The results of the two implementations on the SemEval-2017 STS task Spanish-English cross-lingual track evaluation data are reported below.
Method | track4a | track4b | Mean |
---|---|---|---|
The initial implementation of ma-sultan | 0.66961 | 0.08250 | 0.37605 |
The implementation with IDF weighting | 0.76006 | 0.12447 | 0.44226 |
In the `semeval_data/` directory, you can find all the data needed to reproduce these tests yourself.
For the 2016 evaluation, there are two sets of data, called `news` and `multisource`.
For the 2017 evaluation, there are two sets of data, called `track4a` and `track4b`.
The gold standard files (expected scores) for the four sets are also in the directory.
You can verify the correlation between the output of the aligner and the corresponding gold standard file with the correlation Perl script, as follows:
`perl correlation.pl STS.gs.XXX.txt your_output_for_XXX.txt`
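If you prefer Python over Perl, a roughly equivalent check can be sketched as follows. This hypothetical `pearson_check.py` helper is not part of the repository; it assumes both files contain one numeric score per line, in the same order, and simply computes their Pearson correlation.

```python
# pearson_check.py -- hypothetical alternative to correlation.pl.
# Assumes both input files contain one numeric score per line, in the same order.
import sys
from math import sqrt

def read_scores(path):
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

if __name__ == "__main__":
    gold = read_scores(sys.argv[1])      # e.g. STS.gs.XXX.txt
    system = read_scores(sys.argv[2])    # e.g. your_output_for_XXX.txt
    print("Pearson correlation: %.5f" % pearson(gold, system))
```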
- Python NLTK
- The Python wrapper for Stanford CoreNLP
- Install the above tools.
- Change line 107 of corenlp.py from `rel, left, right = map(lambda x: remove_id(x), split_entry)` to `rel, left, right = split_entry`.
- Install the NLTK stopword corpus and jsonrpclib:
`python -m nltk.downloader stopwords`
`sudo pip install jsonrpclib`
- Download the aligner:
`git clone https://github.com/FerreroJeremy/monolingual-word-aligner.git`
- Run the `corenlp.py` script to launch the server (a minimal sketch of querying this server over JSON-RPC is given after this list):
`python stanford-corenlp-python/corenlp.py`
- Wait for the models to load; once loading is complete, you should see the following in the terminal:
`Loading Models: 5/5`
`INFO:__main__:Serving on http://127.0.0.1:8080`
- In another terminal, run the `testAlign_idf.py` script to launch the comparison between the files specified in the source code:
`python testAlign_idf.py`
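For reference, once the server is running you can also query it directly over JSON-RPC. The minimal sketch below is illustrative only: it assumes the default address shown above (http://127.0.0.1:8080) and that the wrapper exposes a `parse` method returning a JSON string, as described in the stanford-corenlp-python documentation.

```python
# Hypothetical JSON-RPC client for the stanford-corenlp-python wrapper.
# Assumes the server launched above is listening on 127.0.0.1:8080 and
# exposes a `parse` method that returns a JSON string.
import json
import jsonrpclib

server = jsonrpclib.Server("http://127.0.0.1:8080")
result = json.loads(server.parse("The cat sat on the mat."))
print(result)  # parsed output (tokens, POS tags, parses, ...) used by the aligner
```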