erwtokritos/zero-shot-classification-examples

Zero-shot classification using pretrained NLP models

This repo explores the performance of modern pretrained NLP models on a simple zero-shot classification task where the goal is to predict the gender of an individual given their name.

It is remarkable that we can take knowledge that is freely available on the web, 'compress' it into a language model (along with other auxiliary tasks), and then transfer that knowledge to any classification problem of interest at almost zero cost.

My simple analysis considers two models:

  • Huggingface's transformers library, specifically its zero-shot classification pipeline. By default it uses bart-large-mnli (check ref.)

  • The Universal Sentence Encoder (USE). TensorFlow Hub does not provide ready-to-use pipelines (correct me if I am wrong), so I built a simple one manually, which serves as a baseline for comparison.
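As a rough illustration of the first approach, here is a minimal sketch built on the transformers zero-shot classification pipeline. The helper names (`load_default_classifier`, `predict_gender`) and the label list are assumptions made for this example and are not necessarily what the notebook uses. The classifier is passed in as an argument so the logic can be tried with a stand-in before downloading the real model.

```python
from typing import Callable, Dict, Optional, Sequence

# Candidate labels for the gender task (assumed wording, not taken from the repo).
GENDER_LABELS = ["female", "male"]


def load_default_classifier() -> Callable:
    """Build the Hugging Face zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("zero-shot-classification")


def predict_gender(
    names: Sequence[str],
    classifier: Optional[Callable] = None,
    labels: Sequence[str] = GENDER_LABELS,
) -> Dict[str, str]:
    """Map each name to its highest-scoring candidate label."""
    if classifier is None:
        classifier = load_default_classifier()
    results = classifier(list(names), candidate_labels=list(labels))
    # The pipeline may return a single dict instead of a list for one input.
    if isinstance(results, dict):
        results = [results]
    # Each result lists its labels sorted by descending score.
    return {name: res["labels"][0] for name, res in zip(names, results)}
```

Calling `predict_gender(["George", "Mary"])` with no classifier loads the default pipeline. The same function could be reused for the origin task by passing region names as `labels`.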

Ultimately, the goal is to see how well gender information is encoded in the LM-generated embeddings, and how efficiently we can extract it.
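One simple way to extract label information from embeddings, sketched below, is to embed both the names and the candidate labels and pick the label with the highest cosine similarity. This is an assumption about how such a baseline could look, not a copy of the notebook's code. The model URL (USE version 4) and all function names are illustrative, and the similarity is written in plain Python so the sketch does not depend on numpy.

```python
import math
from typing import Callable, List, Sequence

# Assumed model handle for the Universal Sentence Encoder, version 4.
USE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

Embedder = Callable[[List[str]], List[List[float]]]


def load_use_embedder(url: str = USE_URL) -> Embedder:
    """Load USE from TensorFlow Hub and wrap it to return plain lists of floats."""
    import tensorflow_hub as hub  # imported lazily: heavy dependency

    model = hub.load(url)
    return lambda texts: model(texts).numpy().tolist()


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_by_similarity(
    names: Sequence[str], labels: Sequence[str], embed: Embedder
) -> List[str]:
    """Assign each name the label whose embedding is closest to the name's."""
    name_vecs = embed(list(names))
    label_vecs = embed(list(labels))
    predictions = []
    for vec in name_vecs:
        scores = [cosine(vec, label_vec) for label_vec in label_vecs]
        predictions.append(labels[scores.index(max(scores))])
    return predictions
```

With the real model this would be called as `classify_by_similarity(names, ["female", "male"], load_use_embedder())`.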


Details regarding the tasks

The goal is to take first names as input and predict:

  • the associated gender (female / male)
  • the name's origin (the region where it is most commonly used)

Evaluation

To evaluate the proposed solutions, I extracted the most popular first names per region from the corresponding Wikipedia page.

A processed version of these tables can be found in the /data folder

Note 1: Some names are repeated if they are 'popular' in multiple areas

Note 2: For some countries, fewer than 10 most common names are listed

Note 3: Israel's unisex names have been omitted

The evaluation is included in the Jupyter notebook (Zero-shot examples.ipynb)
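For orientation, a per-region accuracy computation could look roughly like the sketch below. The row layout `(name, gender, region)` and the function name are assumptions for illustration. The actual files in /data and the metrics used in the notebook may differ.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical row layout: (first name, true gender label, region).
Row = Tuple[str, str, str]


def accuracy_by_region(
    rows: Iterable[Row], predict: Callable[[str], str]
) -> Dict[str, float]:
    """Fraction of names per region whose predicted gender matches the true label."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for name, gender, region in rows:
        totals[region] += 1
        if predict(name) == gender:
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}
```

Here `predict` is any function that maps a single name to a label, for example a thin wrapper around one of the two backends.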


Project setup

The structure of this repo is simple because the investigation is organised into Jupyter notebooks. You can replicate the experiments by creating a Python 3 environment (virtualenv, Pipenv, ...) and installing the following dependencies:

  • tensorflow
  • transformers
  • tensorflow_hub
  • scikit-learn (imported as sklearn)
  • numpy
  • pandas
  • jupyter

Below I describe a setup using the awesome poetry tool. You can install it by following its installation instructions.

Then run:

poetry install

to create the virtual env and install the module. Then run

poetry shell

to activate the poetry virtual env.

Finally, you can pass a (space-separated) list of names for gender prediction and select the backend NLP model:

predict <name> [--use4]

For example:

predict George Donald Mary Georgia (no flag: uses the default Huggingface backend)

predict Thanos Dimitris Aggeliki --use4 (Universal Sentence Encoder backend)

Note: The first time you run it, it will take some time to download all the necessary pretrained NLP models
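To make the command-line interface above concrete, here is a hedged sketch of how such a `predict` entry point could parse its arguments with argparse. The function names and the backend strings are hypothetical, and the repo's real implementation may use a different CLI library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the usage shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="predict", description="Predict the gender associated with first names."
    )
    parser.add_argument("names", nargs="+", help="one or more first names")
    parser.add_argument(
        "--use4",
        action="store_true",
        help="use the Universal Sentence Encoder backend instead of Huggingface",
    )
    return parser


def select_backend(args: argparse.Namespace) -> str:
    """Return a backend identifier; the string values are made up for this sketch."""
    return "use4" if args.use4 else "huggingface"


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(f"backend={select_backend(parsed)} names={parsed.names}")
```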

References

(if you want to spend 2 hours reading NLP related material)

Next steps

  1. Evaluate on a larger corpus
  2. Try few-shot instead of just zero-shot
  3. Use region-specific gender attributes e.g. femme, homme in French
