Zero-shot classification using pretrained NLP models

This repo explores the performance of modern pretrained NLP models on a simple zero-shot classification task where the goal is to predict the gender of an individual given their name.

It is remarkable that we can take knowledge that is freely available on the web, 'compress' it into a language model (along with other auxiliary tasks), and then transfer that knowledge to any classification problem of interest at almost zero cost.

My simple analysis compares two models:

Huggingface's transformers library, and more specifically its zero-shot classification pipeline. By default it uses bart-large-mnli (see References).
The Universal Sentence Encoder (USE). As far as I know, TensorFlow Hub does not provide ready-to-use pipelines (correct me if I am wrong), so I built a simple one manually, which serves as a baseline for comparison.
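The first backend can be sketched roughly as follows. This is a minimal illustration and not the code from the zeroshot package: the helper names `load_zero_shot_classifier` and `predict_gender` are hypothetical, and the model identifier `facebook/bart-large-mnli` is assumed to be the checkpoint behind the pipeline's default.

```python
from typing import Callable, Dict, Sequence


def load_zero_shot_classifier(model_name: str = "facebook/bart-large-mnli"):
    """Build the zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("zero-shot-classification", model=model_name)


def predict_gender(
    names: Sequence[str],
    classifier: Callable,
    labels: Sequence[str] = ("female", "male"),
) -> Dict[str, str]:
    """Map each name to its top-scoring candidate label."""
    results = classifier(list(names), candidate_labels=list(labels))
    if isinstance(results, dict):  # a single input may come back as one dict
        results = [results]
    # Each result lists the candidate labels sorted by descending score.
    return {r["sequence"]: r["labels"][0] for r in results}
```

Calling `predict_gender(["George", "Mary"], load_zero_shot_classifier())` would return a dictionary from each name to its most likely label. Passing the classifier in as an argument keeps the function easy to exercise with a stub, without downloading the model.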

Ultimately, the goal is to see how well gender information is encoded in the LM-generated embeddings, and how efficiently we can extract it.
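One simple way to read gender information out of embeddings is to embed both the names and the candidate labels, then pick the label whose vector is closest by cosine similarity. The sketch below assumes that approach; it is not necessarily what the manual USE pipeline in this repo does. The helper names are hypothetical, and the TensorFlow Hub URL is assumed to point at version 4 of the encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def load_use_embedder(
    url: str = "https://tfhub.dev/google/universal-sentence-encoder/4",
) -> Callable[[Sequence[str]], List[List[float]]]:
    """Return a function mapping a list of strings to embedding vectors."""
    import tensorflow_hub as hub  # imported lazily: heavy dependency

    model = hub.load(url)
    return lambda texts: model(list(texts)).numpy().tolist()


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(
    names: Sequence[str],
    labels: Sequence[str],
    embed: Callable[[Sequence[str]], Sequence[Vector]],
) -> List[str]:
    """For each name, return the label with the most similar embedding."""
    name_vecs = embed(names)
    label_vecs = embed(labels)
    return [
        labels[max(range(len(labels)), key=lambda j: cosine(nv, label_vecs[j]))]
        for nv in name_vecs
    ]
```

With the real encoder this would be called as `nearest_label(["Thanos", "Aggeliki"], ["female", "male"], load_use_embedder())`. The similarity is written in plain Python so the logic can be checked with toy vectors before any model is downloaded.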

Details regarding the tasks

Our goal is to take first names as input and predict:

the associated gender (female / male)
the name's origin (the region where it is most commonly used)

Evaluation

To evaluate the proposed solutions, I extracted the most popular first names per region from the corresponding Wikipedia page.

A processed version of these tables can be found in the /data folder

Note 1: Some names are repeated if they are 'popular' in multiple areas

Note 2: For some countries, fewer than 10 popular names are listed

Note 3: Israel's unisex names have been omitted

The evaluation is included in the Jupyter notebook (Zero-shot examples.ipynb)
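As a rough idea of what such an evaluation can look like, the sketch below computes gender accuracy per region. The `(name, region, gender)` row layout is an assumption made for illustration, not the actual schema of the files in /data, and `accuracy_by_region` is a hypothetical helper. Names repeated across regions (Note 1) are counted once per region.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

Row = Tuple[str, str, str]  # (name, region, gender): hypothetical layout


def accuracy_by_region(
    rows: Iterable[Row], predictions: Mapping[str, str]
) -> Dict[str, float]:
    """Fraction of names per region whose predicted gender matches the label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for name, region, gender in rows:
        total[region] += 1
        # A name missing from the predictions counts as an error.
        correct[region] += int(predictions.get(name) == gender)
    return {region: correct[region] / total[region] for region in total}
```

Reporting accuracy per region rather than as one overall figure makes it easier to see where a model's knowledge of names is weaker.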

Project setup

The structure of this repo is simple because the investigation is organised into Jupyter notebooks. You can replicate the experiments by creating a Python 3 environment (virtualenv, Pipenv, ...) and installing the following dependencies:

tensorflow
transformers
tensorflow_hub
scikit-learn
numpy
pandas
jupyter

Here I describe a setup using the awesome Poetry tool. You can install it by following the instructions in its official documentation.

Then run:

poetry install

to create the virtual env and install the module, then

poetry shell

to activate the Poetry virtual env.

Finally, you can pass a space-separated list of names for gender prediction, optionally selecting the backend NLP model:

predict <name> [<name> ...] [--use4]

For example:

predict George Donald Mary Georgia (no flag: the default Huggingface backend)

predict Thanos Dimitris Aggeliki --use4 (Universal Sentence Encoder backend)

Note: The first time you run it, it will take some time to download all the necessary pretrained NLP models
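For readers curious how a command line like this can be wired up, here is a minimal sketch using argparse from the standard library. It only mirrors the interface described above (one or more names plus an optional `--use4` flag); the actual entry point in the zeroshot package may be implemented differently, and `parse_args` is a hypothetical helper.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a space-separated list of names and the optional backend flag."""
    parser = argparse.ArgumentParser(
        prog="predict",
        description="Predict the gender associated with first names.",
    )
    parser.add_argument("names", nargs="+", help="one or more first names")
    parser.add_argument(
        "--use4",
        action="store_true",
        help="use the Universal Sentence Encoder backend instead of Huggingface",
    )
    return parser.parse_args(argv)
```

The parsed `names` list and the boolean `use4` can then be used to choose a backend and run the prediction.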

References

(if you want to spend 2 hours reading NLP related material)

GPT-3 paper
Zero-Shot Learning in Modern NLP
Huggingface's awesome Transformers library

Next steps

Evaluate on a larger corpus
Try few-shot instead of just zero-shot
Use region-specific gender attributes e.g. femme, homme in French
