A simple Python script for transforming a corpus of documents into text vectors suitable for visualization

rosette-api-community/visualize-embeddings

Rosette API Text Embeddings Visualization Sample Code

A simple Python script that transforms a corpus of documents into text vectors and writes them in .tsv format for visualization. It uses the Rosette API's /text-embedding endpoint and the BBC News Corpus. Note that the corpus is free for research purposes only.

Getting started

  1. Clone the repo and open the files in your favorite text editor or Python IDE.

  2. Download the raw text archive, bbc-fulltext.zip, from http://mlg.ucd.ie/datasets/bbc.html and extract it into the project root folder. You should get a folder called "bbc".

  3. Run visualize-embeddings.py from your Python IDE or the command line (replace ROSAPI_KEY with your Rosette API key):

     $ python visualize-embeddings.py --key ROSAPI_KEY
    

You'll see that the script parses the raw text files of the corpus into a list of documents. Each document consists of three fields:

  • category
  • headline
  • content
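The parsing step can be sketched roughly as follows. This is a minimal illustration, not the code in visualize-embeddings.py: it assumes the extracted corpus is laid out as bbc/<category>/<name>.txt with the headline on the first non-empty line of each file, and the names Document and parse_corpus are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class Document(NamedTuple):
    """One corpus entry with the three fields described above (hypothetical type)."""
    category: str
    headline: str
    content: str


def parse_corpus(root):
    """Read <root>/<category>/<name>.txt files into a list of Document records.

    Assumption: the folder name is the category and the first non-empty
    line of each file is the headline; the remaining lines are the content.
    """
    documents = []
    for path in sorted(Path(root).glob("*/*.txt")):
        # Replace undecodable bytes rather than failing on a stray character.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty files
        documents.append(Document(
            category=path.parent.name,
            headline=lines[0],
            content=" ".join(lines[1:]),
        ))
    return documents
```

Calling parse_corpus("bbc") from the project root would then return one record per article, which is also the shape to aim for when substituting a different corpus.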

The script then creates two files:

  • embeddings.tsv: a TSV file where each line contains the text vector for a document's content field.
  • metadata.tsv: a TSV file where each line contains a document's metadata (i.e. category and headline).
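As a rough sketch of what writing these two files could look like, the function below takes vectors that have already been computed and writes them alongside the metadata. It is not the script's actual code: write_projector_files and its parameters are hypothetical, and the header row in metadata.tsv is an assumption based on the Embedding Projector expecting column labels when a metadata file has more than one column.

```python
def write_projector_files(vectors, metadata,
                          embeddings_path="embeddings.tsv",
                          metadata_path="metadata.tsv"):
    """Write vectors and (category, headline) pairs as two aligned TSV files.

    vectors:  sequence of numeric sequences, one per document
    metadata: sequence of (category, headline) pairs in the same order
    """
    if len(vectors) != len(metadata):
        raise ValueError("vectors and metadata must have the same length")

    def clean(value):
        # Collapse tabs and newlines so each field stays in its own column.
        return " ".join(str(value).split())

    # One line per document, tab-separated numbers, no header row.
    with open(embeddings_path, "w", encoding="utf-8", newline="\n") as f:
        for vector in vectors:
            f.write("\t".join(str(x) for x in vector) + "\n")

    # Assumed layout: a header row followed by one line per document.
    with open(metadata_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("category\theadline\n")
        for category, headline in metadata:
            f.write(clean(category) + "\t" + clean(headline) + "\n")
```

Line i of embeddings.tsv and data line i of metadata.tsv must describe the same document, which is why the function rejects inputs of different lengths.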

To visualize the embeddings, load them into Google TensorFlow's Embedding Projector. Turn on color coding by category to really see the vectors in action. You can see our projection at this link.

Customize for your data

Try replacing the BBC News corpus with your own data. And if you find anything interesting, we'd love to hear about it! Find us at [email protected].
