Improving methods to learn word representations
===============================================
for efficient semantic similarities computations
================================================

ABOUT

This repository contains the PhD thesis of Julien Tissier, entitled
"Improving methods to learn word representations for efficient semantic
similarities computations". It also contains all the source materials used to
produce the thesis, including the LaTeX .tex source files, the images and the
source files used to generate or modify them (either the LibreOffice Draw
sources or the Python code), and the slides of the PhD defense.

CONTENT

This repository is composed of:

- chapters/: this folder contains all the chapters of the thesis, as ".tex"
  source files. There are 10 chapters (from 00-introduction.tex to
  09-software.tex), a cover page (000-garde.tex) and the bibliography
  (99-bibliography.bib).

- images/: this folder contains all the images used in the thesis (i.e.
  included with the \includegraphics{} command in the .tex files), either as
  PNG or PDF.

- images-code/: this folder contains the Python code used to generate some
  plots or illustration images of the thesis with the Matplotlib library.

- images-src/: this folder contains the source files of some illustration
  images used in the thesis, as LibreOffice Draw files (.odg).

- PhD-Defense-Julien-Tissier.pdf: the defense presentation as PDF, 48 slides.

- PhD-Thesis-Julien-Tissier.pdf: the thesis as PDF, 127 pages.

- makefile: used to generate the thesis from the source files. Run the command
  `make` at the root of this repository to produce it. You will need the
  following tools: make, pdflatex and bibtex.

- phd-thesis.tex: the main .tex file, containing all the LaTeX packages to use
  and the different chapters to include.

SUMMARY

Many natural language processing applications rely on word embeddings (also
called word representations) to achieve state-of-the-art results. These
numerical representations of the language should encode both syntactic and
semantic information to perform well in downstream tasks. However, common
models (word2vec, GloVe) learn them from generic corpora such as Wikipedia, so
the embeddings lack specific semantic information. Moreover, storing them
requires a large amount of memory because the number of representations to
save can be in the order of a million. The topic of my thesis is to develop
new learning algorithms that both improve the semantic information encoded
within the representations and reduce the memory space they require, for
storage as well as for their use in NLP tasks.

The first part of my work improves the semantic information contained in word
embeddings. I developed dict2vec, a model that uses additional information
from online lexical dictionaries when learning word representations. The
dict2vec word embeddings perform ∼15% better than the embeddings learned by
other models on word semantic similarity tasks.

The second part of my work reduces the memory size of the embeddings. I
developed an architecture based on an autoencoder to transform commonly used
real-valued embeddings into binary embeddings, reducing their size in memory
by 97% with only a ∼2% loss in accuracy on downstream NLP tasks.
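To give a rough idea of where the 97% figure comes from, the back-of-the-envelope
sketch below compares the storage cost of real-valued and binary embeddings. The
vocabulary size, the 300-dimensional float32 vectors and the 256-bit binary codes
are assumptions chosen for illustration, not values taken from this README.

    # Rough storage comparison between real-valued and binary word embeddings.
    # All sizes below are illustrative assumptions, not figures from the thesis.
    VOCAB_SIZE = 1_000_000              # "in the order of a million" word vectors
    REAL_DIM, BYTES_PER_FLOAT = 300, 4  # assumed 300-d float32 embeddings
    BINARY_BITS = 256                   # assumed 256-bit binary codes

    real_bytes = VOCAB_SIZE * REAL_DIM * BYTES_PER_FLOAT   # 1.2 GB in total
    binary_bytes = VOCAB_SIZE * BINARY_BITS // 8           # 32 MB in total

    print(f"real-valued: {real_bytes / 1e6:.0f} MB")
    print(f"binary:      {binary_bytes / 1e6:.0f} MB")
    print(f"reduction:   {1 - binary_bytes / real_bytes:.1%}")  # ~97.3%

With these assumed sizes, each vector shrinks from 1200 bytes to 32 bytes, which
is in line with the ∼97% reduction reported above.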
AUTHOR

Written by Julien Tissier <[email protected]>.

COPYRIGHT

This thesis and all the files in this repository are licensed under the
"Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License". By using or downloading this repository, you agree to:

1. NonCommercial - You may not use the material for commercial purposes.

2. Attribution - You must give appropriate credit, provide a link to the
   licensor, and indicate if changes were made. You may do so in any
   reasonable manner, but not in any way that suggests the licensor endorses
   you or your use.

3. ShareAlike - If you remix, transform, or build upon the material, you must
   distribute your contributions under the same license as the original.

4. No additional restrictions - You may not apply legal terms or technological
   measures that legally restrict others from doing anything the license
   permits.

For more details, see https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode