Releases: sign-language-translator/sign-language-datasets
Landmark Datasets
This release contains sign language videos embedded as landmarks and stored as CSV files inside zip archives. The landmarks are rounded to 4 decimal places, which gives a precision of 0.1 mm in world coordinates and 1 pixel in a 10k-resolution image.
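To make the precision claim concrete, a quick arithmetic sketch:

```python
# Rounding to 4 decimal places keeps increments of 0.0001.
world_precision_m = 0.0001           # world coordinates are in meters
print(world_precision_m * 1000)      # 0.1 -> 0.1 mm precision

# Image coordinates are fractions of the frame size,
# so 0.0001 of a 10,000-pixel frame is exactly one pixel.
print(0.0001 * 10_000)               # 1.0 -> 1 pixel at 10k resolution
```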
Text transcription (gloss) of the signs is present in the file names. More synonyms and translations that map to these signs can be found in the JSON data in the repo. The dataset has the following categories:
- Standard Dictionary: (788 + 1)
  Standard sign language dictionaries obtained from recognized organizations. The names are country-organization-groupNumber.landmarks-embeddingModel-extension.zip
- Dictionary Replications: (788 * 12 * 4 = 37,824) (coming soon!)
  Manually recorded sign language videos that replicate the reference clips. The names are country-organization-groupNumber_personCode_cameraAngle.landmarks-embeddingModel-extension.zip
MediaPipe Landmarks Header
World coordinates are 3D body joint positions in meters. Image coordinates are the fraction of the video width/height at which the landmark is located, and the z value is the depth from the camera.
For both models, we get 33 pose landmarks and 21 landmarks per hand, with 5 values per landmark (x, y, z, visibility, presence).
total_rows = number_of_frames in source video
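As an illustration, here is a minimal sketch of reading one of these CSVs straight from its zip archive. The archive name below is hypothetical (built from the naming scheme above, assuming embeddingModel = mediapipe and extension = csv), and the column arithmetic follows the landmark counts just described:

```python
import zipfile
import pandas as pd

# Hypothetical archive name following the scheme above; adjust to a real asset.
archive = "pk-hfad-1.landmarks-mediapipe-csv.zip"

with zipfile.ZipFile(archive) as zf:
    member = zf.namelist()[0]              # one CSV per source video
    with zf.open(member) as file:
        df = pd.read_csv(file)

# 33 pose + 2 * 21 hand landmarks, 5 values each:
values_per_model = (33 + 21 + 21) * 5      # = 375 values per model
print(df.shape)                            # (number_of_frames, total_columns)
```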
Video Datasets
This release contains sign language videos inside zip archives. Text transcription (gloss) of the videos is present in the file names. More synonyms and translations that map to these videos can be found in the JSON data in the repo. The dataset has the following categories:
- Standard Dictionary: (788)
  Standard sign language dictionaries obtained from recognized organizations. The names are country-organization-groupNumber.videos-mp4.zip
- Dictionary Recordings: (788 * 12 * 4 = 37,824) (coming soon!)
  Manually recorded sign language videos that replicate the reference clips. The names are country-organization-groupNumber_personCode-cameraAngle.videos-mp4.zip
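For illustration, a short sketch of downloading and unpacking one archive. The URL's release tag is a placeholder (copy a real asset URL from this release page); the file name just follows the naming scheme above:

```python
import zipfile
from urllib.request import urlretrieve

# <tag> is a placeholder; use the actual asset URL from this release page.
archive = "pk-hfad-1.videos-mp4.zip"
url = f"https://github.com/sign-language-translator/sign-language-datasets/releases/download/<tag>/{archive}"

urlretrieve(url, archive)              # download the zip archive
with zipfile.ZipFile(archive) as zf:
    print(zf.namelist()[:5])           # individual clips, named by gloss
    zf.extractall("videos")
```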
Dictionary
This release contains all the base videos featuring someone performing sign language. They are also called reference_clips because they were shown to other performers, who replicated the signs to create a diverse dataset.
The purpose of this release is to provide a URL for each individual video file so that the rule-based translator can fetch only the required files. If you want this data in bulk, see the video_datasets release instead, which bundles the same files into archives.
The file labels had to be converted to English via translation or transliteration, so these filenames may not correspond to the JSON word-mapping data. However, urls.json & extra_urls.json provide a mapping from the standard filenames to these individual assets.
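For illustration, a minimal sketch of resolving one clip through that mapping. It assumes urls.json is a flat filename-to-URL dictionary, which you should verify against the actual file:

```python
import json
from urllib.request import urlretrieve

# Assumption: urls.json maps a standard filename to a direct download URL.
with open("urls.json") as f:
    url_map = json.load(f)

filename = "pk-hfad-1_spring.season.mp4"   # example filename from below
urlretrieve(url_map[filename], filename)   # fetch only this one clip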
The filename format is as follows:
country-organization-session_text-equivalent-of-video[.disambiguation].mp4
e.g. pk-hfad-1_spring.season.mp4
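A small sketch of parsing that format in Python; the group names are illustrative, not an official schema:

```python
import re

# country-organization-session_text-equivalent-of-video[.disambiguation].mp4
pattern = re.compile(
    r"(?P<country>[a-z]+)-(?P<organization>[^-]+)-(?P<session>[^_]+)"
    r"_(?P<text>[^.]+)(?:\.(?P<disambiguation>[^.]+))?\.mp4"
)

print(pattern.fullmatch("pk-hfad-1_spring.season.mp4").groupdict())
# {'country': 'pk', 'organization': 'hfad', 'session': '1',
#  'text': 'spring', 'disambiguation': 'season'}
```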
Language Models for Dataset generation
This release includes language models that write text which the rule-based text-to-sign translator can translate.
The tlm_14.0.pt model (sign_language_translator.models.TransformerLanguageModel) is a custom transformer trained on ~800 MB of text composed only of the words for which PakistanSignLanguage signs are available (see sign_recordings/collection_to_label_to_language_to_words.json). The tokenizer used is sign_language_translator.languages.text.urdu.Urdu().tokenizer, with the digits in numbers and the letters in acronyms split apart as individual tokens to limit the vocab size. A later update will generate disambiguated words. The start & end tokens are "<" & ">".
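A rough illustration of that splitting step as a standalone regex sketch; this is not the library's actual tokenizer:

```python
import re

def split_special_tokens(tokens):
    """Split digits in numbers and letters in acronyms into single-char tokens."""
    out = []
    for token in tokens:
        if re.fullmatch(r"\d+", token) or re.fullmatch(r"[A-Z]{2,}", token):
            out.extend(token)          # e.g. "2024" -> "2", "0", "2", "4"
        else:
            out.append(token)
    return out

print(split_special_tokens(["USA", "won", "2024"]))
# ['U', 'S', 'A', 'won', '2', '0', '2', '4']
```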
The -mixed-.pkl model is trained on unambiguous supported Urdu words from a corpus of around 10 MB (2.4 million tokens). It is a mix of 6 n-gram models with context window sizes from 1 to 6, so it cannot handle longer-range dependencies and concept drift can be observed in longer generations. The tokenizer used is slt.languages.text.urdu.Urdu().tokenizer. The start & end tokens are "<" & ">".
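A minimal sketch of how such a mixture can combine its component models; the toy models, weights, and interface here are assumptions, and the release's .pkl model may differ:

```python
# Toy "models": each returns a next-token probability distribution for a context.
unigram = lambda context: {"a": 0.7, "b": 0.3}
bigram  = lambda context: {"a": 0.2, "b": 0.8} if context else {"a": 0.5, "b": 0.5}

def mixed_next_distribution(models, weights, context):
    """Weighted average of next-token distributions from models of different orders."""
    combined = {}
    for model, weight in zip(models, weights):
        for token, prob in model(context).items():
            combined[token] = combined.get(token, 0.0) + weight * prob
    return combined

print(mixed_next_distribution([unigram, bigram], [0.4, 0.6], ["x"]))
# {'a': 0.4, 'b': 0.6}
```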
The *.json models are made to demonstrate the functionality of n-gram models. The training data is text_preprocessing.json:person_names.
- They are n-gram based statistical language models trained on 366 Urdu and 366 English names commonly used in Pakistan.
- The models predict the next character based on the previous 1-3 characters.
- The start and end of sequence tokens are "[" and "]".
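A from-scratch sketch of the idea behind those models, trained here on a few made-up names with the same "[" / "]" boundary tokens (the real models' counts and storage format will differ):

```python
import random
from collections import Counter, defaultdict

def fit_char_ngram(names, n=3):
    """Count next-character frequencies after each context of up to n characters."""
    counts = defaultdict(Counter)
    for name in names:
        seq = "[" + name + "]"                      # start/end-of-sequence markers
        for i in range(len(seq) - 1):
            context = seq[max(0, i - n + 1): i + 1]
            counts[context][seq[i + 1]] += 1
    return counts

def sample_name(counts, n=3):
    """Generate a name character by character until the end token appears."""
    seq = "["
    while not seq.endswith("]"):
        context = seq[-n:]
        chars, freqs = zip(*counts[context].items())
        seq += random.choices(chars, weights=freqs)[0]
    return seq[1:-1]                                # strip the boundary tokens

model = fit_char_ngram(["amina", "amir", "sana"], n=3)
print(sample_name(model))  # e.g. "amina", or a blend of the training names
```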