wissam-sib/universal-sentence-encoder-java

Using Universal Sentence Encoder in Java

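Universal Sentence Encoder (USE) is a family of models that turn natural-language sentences into fixed-size float vectors. It is published in a Transformer variant and a Deep Averaging Network (DAN) variant; the version 4 model used here is the DAN variant.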

This repository contains a minimal Java project (using Maven to manage the TensorFlow dependencies) that provides a class, UseRepresentation, which applies USE to English sentences and turns each one into a float array of dimension 512. These vectors can then be used to solve several NLP tasks such as classification and similarity. I adapted solutions from here to implement this project.
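
As an illustration of the similarity use case mentioned above, here is a minimal sketch that compares embeddings with cosine similarity. The class EmbeddingSimilarity and its methods are hypothetical helpers, not part of this repository, and the toy 3-dimensional vectors only stand in for the 512-dimensional arrays that the model would produce.

```java
// Hypothetical helper, not part of this repository.
class EmbeddingSimilarity {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the candidate most similar to the query, or -1 if there are no candidates. */
    static int mostSimilar(float[] query, float[][] candidates) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < candidates.length; i++) {
            double score = cosine(query, candidates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for real sentence embeddings.
        float[] query = {1f, 0f, 0f};
        float[][] candidates = {{0f, 1f, 0f}, {2f, 0.1f, 0f}, {-1f, 0f, 0f}};
        System.out.println("Most similar candidate index: " + mostSimilar(query, candidates));
    }
}
```

With real embeddings, the query and candidates would be rows of the float[][] returned by the model. Cosine similarity is one common choice for comparing sentence embeddings; other measures may suit a given task better.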

Tested on Windows 7 and Ubuntu 20.x.

Step 1 : Prepare the model

Note : You can skip this step and download the model directly here (just unzip the folder universal-sentence-encoder-4-java at a location of your choice).

To prepare the model, I used TensorFlow Hub in Python to load it into a KerasLayer. Then I saved it with explicitly named inputs and outputs:

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

# Use TF1-style graphs and sessions, which the export below relies on.
tf.disable_v2_behavior()

url = "https://tfhub.dev/google/universal-sentence-encoder/4"
save_path = "path/to/universal-sentence-encoder-4-java/"

with tf.Graph().as_default():
    module = hub.KerasLayer(url)
    # Name the input and output tensors explicitly so they can be looked up by name later.
    model_input = tf.placeholder(tf.string, name="input")
    model_output = tf.identity(module(model_input), name="output")
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        tf.saved_model.simple_save(
            session,
            save_path,
            inputs={'input': model_input},
            outputs={'output': model_output},
            # Initialize the lookup tables whenever the saved model is loaded.
            legacy_init_op=tf.initializers.tables_initializer(name='init_all_tables'))

I used tensorflow_hub 0.9.0 and tensorflow 2.3.1.

Step 2 : Import the project in Java and run it

The project uses Maven to manage dependencies. In pom.xml, you'll find the Java version (1.8, though higher versions will probably work) and a TensorFlow snapshot dependency (TensorFlow version 2.3.1).

In UseRepresentation.java, edit the main method to specify the path to "universal-sentence-encoder-4-java/", then run it to test the model :

// Needs: import java.util.Arrays; import java.io.UnsupportedEncodingException;

// Load the saved model from the folder prepared in Step 1.
UseRepresentation model = new UseRepresentation("path/to/universal-sentence-encoder-4-java");
String[] myStringArray = new String[] { "Hello World", "I am going to be converted to an embedding", "For various NLP tasks" };

try {
	// One row of 512 floats per input sentence.
	float[][] vectors = model.embed(myStringArray);
	// Print one vector per line.
	System.out.println(Arrays.deepToString(vectors).replace("], ", "]\n"));
} catch (UnsupportedEncodingException e) {
	e.printStackTrace();
}

Loading the model takes a few seconds, so do it only once in your project. After that, converting sentences to vectors is rather fast.
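
One way to load the model only once is to wrap its construction in a lazily initialized, thread-safe holder. The sketch below is an assumption about how this could be done, not something the repository provides: LazyModel is a hypothetical class, and the demo uses a stand-in loader so that it runs without TensorFlow.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not part of this repository.
final class LazyModel<T> {
    private final Supplier<T> loader;
    private volatile T instance;

    LazyModel(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the loaded value, invoking the loader on first use only (double-checked locking). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // The loader is assumed to return a non-null value;
                    // a null result would cause it to be invoked again on the next call.
                    local = loader.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // Stand-in loader; in the real project this would construct the model instead.
        LazyModel<String> model = new LazyModel<>(() -> {
            loads.incrementAndGet();
            return "stand-in model";
        });
        model.get();
        model.get();
        System.out.println("Number of loads: " + loads.get());
    }
}
```

In the actual project, the loader could be something like () -> new UseRepresentation("path/to/universal-sentence-encoder-4-java"), assuming the constructor does not throw checked exceptions; if it does, the lambda would need to catch and wrap them.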
