Releases: sign-language-translator/sign-language-datasets
Landmark Datasets
This release contains sign language videos embedded as landmarks and stored as CSV files inside zip archives. The landmarks are rounded to 4 decimal places, which gives a precision of 0.1 mm in world coordinates and 1 pixel in a 10k-resolution image.
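To make the precision claim concrete, a quick arithmetic sketch:

```python
# Rounding to 4 decimal places keeps increments of 0.0001.
world_precision_m = 0.0001           # world coordinates are in meters
print(world_precision_m * 1000)      # 0.1 -> 0.1 mm precision

# Image coordinates are fractions of the frame size,
# so 0.0001 of a 10,000-pixel frame is exactly one pixel.
print(0.0001 * 10_000)               # 1.0 -> 1 pixel at 10k resolution
```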
Text transcription (gloss) of the signs is present in the file names. More synonyms and translations that map to these signs can be found in the JSON data in the repo. The dataset has the following categories:
- Standard Dictionary: (788 + 1)
  Standard sign language dictionaries obtained from recognized organizations. The names are country-organization-groupNumber.landmarks-embeddingModel-extension.zip
- Dictionary Replications: (788 * 12 * 4 = 37,824) (coming soon!)
  Manually recorded sign language videos that replicate the reference clips. The names are country-organization-groupNumber_personCode_cameraAngle.landmarks-embeddingModel-extension.zip
MediaPipe Landmarks Header
World coordinates are 3D body joint positions in meters. Image coordinates are the fraction of the video width/height at which the landmark is located, and the z value is the depth from the camera.
For both models, we get 33 pose landmarks and 21 landmarks per hand, with 5 values per landmark (x, y, z, visibility, presence).
total_rows = number_of_frames in source video
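As an illustration, here is a minimal sketch of reading one of these CSVs straight from its zip archive. The archive name below is hypothetical (built from the naming scheme above, assuming embeddingModel = mediapipe and extension = csv), and the column arithmetic follows the landmark counts just described:

```python
import zipfile
import pandas as pd

# Hypothetical archive name following the scheme above; adjust to a real asset.
archive = "pk-hfad-1.landmarks-mediapipe-csv.zip"

with zipfile.ZipFile(archive) as zf:
    member = zf.namelist()[0]              # one CSV per source video
    with zf.open(member) as file:
        df = pd.read_csv(file)

# 33 pose + 2 * 21 hand landmarks, 5 values each:
values_per_model = (33 + 21 + 21) * 5      # = 375 values per model
print(df.shape)                            # (number_of_frames, total_columns)
```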
Video Datasets
This release contains sign language videos inside zip archives. Text transcription (gloss) of the videos is present in the file names. More synonyms and translations that map to these videos can be found in the JSON data in the repo. The dataset has the following categories:
- Standard Dictionary: (788)
  Standard sign language dictionaries obtained from recognized organizations. The names are country-organization-groupNumber.videos-mp4.zip
- Dictionary Recordings: (788 * 12 * 4 = 37,824) (coming soon!)
  Manually recorded sign language videos that replicate the reference clips. The names are country-organization-groupNumber_personCode-cameraAngle.videos-mp4.zip
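For illustration, a short sketch of downloading and unpacking one archive. The URL's release tag is a placeholder (copy a real asset URL from this release page); the file name just follows the naming scheme above:

```python
import zipfile
from urllib.request import urlretrieve

# <tag> is a placeholder; use the actual asset URL from this release page.
archive = "pk-hfad-1.videos-mp4.zip"
url = f"https://github.com/sign-language-translator/sign-language-datasets/releases/download/<tag>/{archive}"

urlretrieve(url, archive)              # download the zip archive
with zipfile.ZipFile(archive) as zf:
    print(zf.namelist()[:5])           # individual clips, named by gloss
    zf.extractall("videos")
```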
Dictionary
This release contains all the base videos featuring someone performing sign language. They are also called reference_clips because they were shown to other performers, who replicated the signs to create a diverse dataset.
The purpose of this release is to provide a URL for each individual video file so that the rule-based translator can fetch only the required files. If you want this data in bulk, see the video_datasets release instead, which bundles the same files into archives.
The file labels had to be converted to English via translation or transliteration, so these filenames may not correspond to the JSON word-mapping data. However, urls.json & extra_urls.json provide a mapping from the standard filenames to these individual assets.
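For illustration, a minimal sketch of resolving one clip through that mapping. It assumes urls.json is a flat filename-to-URL dictionary, which you should verify against the actual file:

```python
import json
from urllib.request import urlretrieve

# Assumption: urls.json maps a standard filename to a direct download URL.
with open("urls.json") as f:
    url_map = json.load(f)

filename = "pk-hfad-1_spring.season.mp4"   # example filename from below
urlretrieve(url_map[filename], filename)   # fetch only this one clip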
The filename format is as follows:
country-organization-session_text-equivalent-of-video[.disambiguation].mp4
e.g. pk-hfad-1_spring.season.mp4
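A small sketch of parsing that format in Python; the group names are illustrative, not an official schema:

```python
import re

# country-organization-session_text-equivalent-of-video[.disambiguation].mp4
pattern = re.compile(
    r"(?P<country>[a-z]+)-(?P<organization>[^-]+)-(?P<session>[^_]+)"
    r"_(?P<text>[^.]+)(?:\.(?P<disambiguation>[^.]+))?\.mp4"
)

print(pattern.fullmatch("pk-hfad-1_spring.season.mp4").groupdict())
# {'country': 'pk', 'organization': 'hfad', 'session': '1',
#  'text': 'spring', 'disambiguation': 'season'}
```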
Language Models for Dataset generation
This release includes language models that write text which the rule-based text-to-sign translator can translate.
The tlm_14.0.pt model (sign_language_translator.models.TransformerLanguageModel) is a custom transformer trained on ~800 MB of text composed only of the words for which PakistanSignLanguage signs are available (see sign_recordings/collection_to_label_to_language_to_words.json). The tokenizer used is sign_language_translator.languages.text.urdu.Urdu().tokenizer, with the digits in numbers and the letters in acronyms split apart as individual tokens to limit the vocab size. A later update will generate disambiguated words. The start & end tokens are "<" & ">".
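A rough illustration of that splitting step as a standalone regex sketch; this is not the library's actual tokenizer:

```python
import re

def split_special_tokens(tokens):
    """Split digits in numbers and letters in acronyms into single-char tokens."""
    out = []
    for token in tokens:
        if re.fullmatch(r"\d+", token) or re.fullmatch(r"[A-Z]{2,}", token):
            out.extend(token)          # e.g. "2024" -> "2", "0", "2", "4"
        else:
            out.append(token)
    return out

print(split_special_tokens(["USA", "won", "2024"]))
# ['U', 'S', 'A', 'won', '2', '0', '2', '4']
```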
The -mixed-.pkl model is trained on unambiguous supported Urdu words from a corpus of around 10 MB (2.4 million tokens). It is a mix of 6 n-gram models with context window sizes from 1 to 6, so it cannot handle longer-range dependencies and concept drift can be observed in longer generations. The tokenizer used is slt.languages.text.urdu.Urdu().tokenizer. The start & end tokens are "<" & ">".
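A minimal sketch of how such a mixture can combine its component models; the toy models, weights, and interface here are assumptions, and the release's .pkl model may differ:

```python
# Toy "models": each returns a next-token probability distribution for a context.
unigram = lambda context: {"a": 0.7, "b": 0.3}
bigram  = lambda context: {"a": 0.2, "b": 0.8} if context else {"a": 0.5, "b": 0.5}

def mixed_next_distribution(models, weights, context):
    """Weighted average of next-token distributions from models of different orders."""
    combined = {}
    for model, weight in zip(models, weights):
        for token, prob in model(context).items():
            combined[token] = combined.get(token, 0.0) + weight * prob
    return combined

print(mixed_next_distribution([unigram, bigram], [0.4, 0.6], ["x"]))
# {'a': 0.4, 'b': 0.6}
```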
The *.json models are made to demonstrate the functionality of n-gram models. The training data is text_preprocessing.json:person_names.
- They are n-gram based statistical language models trained on 366 Urdu and 366 English names commonly used in Pakistan.
- The models predict the next character based on the previous 1-3 characters.
- The start and end of sequence tokens are "[" and "]".
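A from-scratch sketch of the idea behind those models, trained here on a few made-up names with the same "[" / "]" boundary tokens (the real models' counts and storage format will differ):

```python
import random
from collections import Counter, defaultdict

def fit_char_ngram(names, n=3):
    """Count next-character frequencies after each context of up to n characters."""
    counts = defaultdict(Counter)
    for name in names:
        seq = "[" + name + "]"                      # start/end-of-sequence markers
        for i in range(len(seq) - 1):
            context = seq[max(0, i - n + 1): i + 1]
            counts[context][seq[i + 1]] += 1
    return counts

def sample_name(counts, n=3):
    """Generate a name character by character until the end token appears."""
    seq = "["
    while not seq.endswith("]"):
        context = seq[-n:]
        chars, freqs = zip(*counts[context].items())
        seq += random.choices(chars, weights=freqs)[0]
    return seq[1:-1]                                # strip the boundary tokens

model = fit_char_ngram(["amina", "amir", "sana"], n=3)
print(sample_name(model))  # e.g. "amina", or a blend of the training names
```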