Releases: lhotse-speech/lhotse

v1.9 Neighboring Peaks

20 Oct 18:32

Major features

  • MultiCut data type: simplifies working with multi-channel data (contribution from @desh2608)
  • CSJ recipe (contribution from @teowenshen)
  • lots of bug fixes

What's Changed

  • create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
  • kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
  • Update test packages by @pzelasko in #837
  • MultiCut to store multi-channel recordings with shared supervision by @desh2608 in #822
  • Use CutSet for whisper annotation workflow by @desh2608 in #834
  • use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
  • Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
  • make regexp string raw by @jtrmal in #836
  • Use absolute recording paths in yesno recipe by @pzelasko in #845
  • Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
  • Fixing some QA functions for lazy manifests by @desh2608 in #848
  • Fix timestamps in Whisper annotation workflow by @pzelasko in #847
  • Update supervisions channels in multi-channel recipes by @desh2608 in #838
  • Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
  • Match cut_id to utt_id if there is exactly one supervision per cut by @wgb14 in #853
  • forced alignment: use num2words to get word timestamps for numbers by @eschmidbauer in #849
  • Prepare CSJ by @teowenshen in #851
  • Small changes in trim_to_supervisions() by @desh2608 in #855
  • Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
  • Update fisher_english.py by @maxlvov in #858

New Contributors

Full Changelog: v1.8...v1.9

v1.8 Sudden Avalanche

30 Sep 13:18

Breaking changes

  • Python 3.6 is no longer supported as of Lhotse v1.8. If you need to use Python 3.6, please revert to Lhotse 1.7 and earlier.
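
The breaking change above can be guarded against in setup scripts. Below is a minimal sketch, not part of Lhotse itself: the helper names are hypothetical, and the minimum version of 3.7 is an assumption inferred from the note that 3.6 was dropped in v1.8.

```python
import sys

# Assumption: since Python 3.6 was dropped in v1.8, 3.7 is taken as the minimum.
MIN_PYTHON_FOR_LHOTSE_1_8 = (3, 7)


def can_use_lhotse_1_8(version_info=sys.version_info):
    """Return True if the interpreter is new enough for Lhotse v1.8+ (hypothetical helper)."""
    return tuple(version_info[:2]) >= MIN_PYTHON_FOR_LHOTSE_1_8


def suggested_requirement(version_info=sys.version_info):
    """Suggest a pip requirement string that follows the release note above."""
    return "lhotse>=1.8" if can_use_lhotse_1_8(version_info) else "lhotse<1.8"


if __name__ == "__main__":
    print(suggested_requirement())
```

On a Python 3.6 interpreter this suggests pinning to a release before 1.8, which matches the advice to stay on Lhotse 1.7 and earlier.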

Highlights

  • A new experimental Lhotse module, workflows, integrates optional third-party packages that assist corpus creators in automated data curation. As of release 1.8, it supports OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/HuBERT ASR bundles for forced alignment.

What's Changed

  • Fix read and write in piped CLI by @desh2608 in #807
  • Default behavior of CutSet.mix by @ZuoyunZheng in #809
  • Adding more info about resampling options by @RuABraun in #815
  • Add pad_silence option to extend_by by @desh2608 in #816
  • Message when calling len() on LazyFilter by @desh2608 in #817
  • Refactor cut and retain git blame history by @desh2608 in #820
  • Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
  • Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support by @pzelasko in #824
  • Drop support for Python 3.6 by @pzelasko in #829
  • [workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827

New Contributors

Full Changelog: v1.7...v1.8

v1.7 - Rejuvenation Potion

12 Sep 21:38

What's Changed

  • add test data to bvcc by @oplatek in #797
  • Add reverb with fast RIR generator by @desh2608 in #799
  • Support snip_edges=True in online_inference of Kaldi feature extractors by @pzelasko in #802
  • Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
  • Update the documentation related to optional packages by @pzelasko in #805

Full Changelog: v1.6...v1.7

1.6 - Frozen Palm Tree

27 Aug 21:13

What's Changed

  • Feature/fix 754 voxceleb download by @mikuchar in #776
  • Support Kaldi data directories without segments file. by @MartinKocour in #789
  • Add normalization for text of multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
  • Improve support for custom Recordings by @pzelasko in #791
  • Add Cut.has(field) method to query Cuts for custom attributes by @pzelasko in #792
  • Add normalization for aishell2 recipe by @shanguanma in #790

New Contributors

Full Changelog: 1.5...v1.6

1.5 - Little Leaf

09 Aug 01:04

What's Changed

New Contributors

Full Changelog: v1.4...v1.5

v1.4 - Candescent Crust

07 Jul 01:08

What's Changed

  • Fix lambda warnings from lazy manifests + leverage dill if installed for pickling lambdas by @pzelasko in #748
  • multi_cn recipes: aishell2, magicdata, primewords, stcmds, tal_asr, tal_csasr, thchs_30 by @shanguanma in #738
  • Deprecate strict, proportional_sampling, and bucket_method arguments by @pzelasko in #756
  • Fix lhotse cut simple CLI by @pzelasko in #759
  • Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
  • DailyTalk recipe by @pzelasko in #767
  • add aishell2 dev test by @yuekaizhang in #766
  • Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
  • Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764

New Contributors

Full Changelog: v1.3...1.4

v1.3 - Curiously Inviting Icicles

11 Jun 03:35

What's Changed

New Contributors

Full Changelog: v1.2...v1.3

v1.2 - Winter in the South

19 May 17:09

New Recipes

What's Changed

  • Fix import namespaces by @pzelasko in #698
  • .repeat(..., preserve_id=...) option for repeating manifests by @pzelasko in #699
  • Kaldi impex: remove invalid test by @jtrmal in #700
  • Minor fix in base url for AliMeeting download by @desh2608 in #702
  • [aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
  • [ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
  • [ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
  • Test for webdataset data de-duplication across ranks by @pzelasko in #706
  • Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
  • Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
  • Full shard shuffling with webdataset by @pzelasko in #711
  • Raise an error when BucketingSampler is used with a lazy CutSet by @pzelasko in #710
  • Normalize output path names for recipes by @desh2608 in #712
  • [webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
  • Update examples of combining datasets with RoundRobinSampler and add stop_early option. by @pzelasko in #716
  • pre-commit, isort + CI checks + running it on all code by @pzelasko in #720

New Contributors

Full Changelog: v1.1...v1.2

v1.1 - Minor Mana Potion

03 May 12:05

What's Changed

  • auto_increment_epoch in IterableDatasetWrapper, strict=True default for bucketing by @pzelasko in #661
  • Support storing Recording objects in cuts custom fields by @pzelasko in #662
  • nara_wpe based WPE dereverberation as data augmentation by @pzelasko in #663
  • Update rirs noises path by @Tomiinek in #665
  • Fix reading duration when importing piped input data from kaldi by @csukuangfj in #667
  • Fixes to split-lazy by @wgb14 in #664
  • Read matrix shape information without reading the whole matrix. by @csukuangfj in #668
  • Set a num_digits for split-lazy by @wgb14 in #669
  • DynamicBucketingSampler supports very small data by @pzelasko in #670
  • ~20x faster speed perturbation by @pzelasko in #672
  • Fix assertions, rename variables by @pzelasko in #677
  • Add max_cuts option to DynamicBucketingSampler by @pzelasko in #681
  • Fix recording_id in MixedCut.compute_and_store_features(..., mix_eagerly=True) by @pzelasko in #682
  • Support for restoring state of dynamic samplers by @pzelasko in #684
  • Add .cut_into_windows method to individual cuts by @pzelasko in #685
  • kaldi pipeline -- small whitespace fixes and making pycheck happier by @jtrmal in #686
  • Some minor fixes. by @csukuangfj in #688
  • kaldi import/export -- adding basic tests for kaldi import/export by @jtrmal in #687
  • Fix bug in cut.truncate, format selection in cut.save_audio, less memory use in AudioMixer by @pzelasko in #690
  • Pad channels to same length when loading audio by @desh2608 in #689
  • Fix restoring state in dynamic samplers with DataLoader num_workers>0 by @pzelasko in #692
  • Fix samplers dropping cuts when world_size > 1 by @pzelasko in #695
  • FIX: kaldi import feats.scp -- use the correct id when underscore mapping by @jtrmal in #694
  • Audio loading fault tolerant feature extraction in compute_and_store_features by @pzelasko in #683

New Contributors

Full Changelog: v1.0...v1.1

South Peak

06 Apr 01:25

Recipes

New corpora

Existing recipes improvements

New features

Support for custom attributes in Cuts

This feature allows attaching arbitrary types of data to cuts: alignments, multiple feature sets, etc.

  • Array and TemporalArray manifests (generalization of Features for arbitrary data) by @pzelasko in #458
  • Support custom MonoCut attrs (with special support for loading Arrays) by @pzelasko in #459
  • Collation of custom cut fields into PyTorch tensors by @pzelasko in #476
  • custom_collate_field pads to the longest read array instead of using CutSet.pad() by @pzelasko in #482
  • Add with_path_prefix to Array and TemporalArray by @pzelasko in #499
  • Collation promotes custom field int sequences to int64 by @pzelasko in #507
  • More flexibility in mixing cuts with custom attributes by @pzelasko in #642

Data augmentation

  • Reverberation using room impulse response by @desh2608 in #477
  • Add early reverb option for RIR-based augmentation by @desh2608 in #524
  • Options for handling multi-channel RIR by @desh2608 in #621

"Lazy cuts" (less memory / faster execution)

  • Option to create a CutSet lazily when everything is sorted on recording IDs by @pzelasko in #493
  • Enable adding/combining of lazy manifests by @pzelasko in #495
  • Resumable batch feature extraction with reduced memory usage by @pzelasko in #508
  • DynamicBucketingSampler: on-the-fly bucketing with restricted memory usage by @pzelasko in #517
  • DynamicCutSampler (like DynamicBucketingSampler but without bucketing) by @pzelasko in #579
  • Custom binary feature storage format by @pzelasko in #522
  • lhotse split-lazy and CutSet.split_lazy() for memory-efficient splits by @pzelasko in #558
  • Optional dependency orjson for up to 50% faster JSONL reading by @pzelasko in #563
  • CutSet multiplexing by @pzelasko in #565
  • stop_early arg for CutSet.mux by @pzelasko in #585
  • Option to set shuffle buffer size in dynamic samplers by @pzelasko in #587
  • WebDataset integration for optimized sequential I/O by @pzelasko in #582
  • Add WebDataset export CLI and a fault_tolerant option by @pzelasko in #599
  • Add WebdatasetWriter for iterative cut writing by @pzelasko in #602
  • More lazily evaluated methods: map, filter, repeat, shuffle by @pzelasko in #626
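
Two ideas from the list above, lazy evaluation and an optional orjson dependency, can be illustrated with a small generic sketch. This is not Lhotse's actual implementation: `read_jsonl` and the sample records are hypothetical, and the code only shows the common pattern of falling back to the standard `json` module when `orjson` is not installed.

```python
import io
import json

try:
    import orjson  # optional third-party dependency; used only if installed

    _loads = orjson.loads
except ImportError:
    _loads = json.loads  # standard-library fallback


def read_jsonl(lines):
    """Lazily yield one decoded object per non-empty JSONL line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield _loads(line)


# Hypothetical in-memory manifest with two records, one JSON object per line.
sample = io.StringIO(
    '{"id": "cut-1", "duration": 2.5}\n'
    '{"id": "cut-2", "duration": 4.0}\n'
)

# Nothing is parsed until the generator is consumed.
items = list(read_jsonl(sample))
```

Because `read_jsonl` is a generator, only one line is held in memory at a time, which is the general reason lazy manifests use less memory than eager ones.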

New and refreshed APIs

Improved Kaldi import/export

PyTorch API

Documentation

  • Refresh tutorials in examples + quality of life improvements in code by @pzelasko in #617
  • Tutorial for Lhotse's WebDataset integration by @pzelasko in #619
  • Add links in tutorials, fix some issues with lazy manifes...