20 Oct 18:32

pzelasko

7d9fd0d

v1.9 Neighboring Peaks

Major features

MultiCut data type: simplifies working with multi-channel data (contribution from @desh2608)
CSJ recipe (contribution from @teowenshen)
lots of bug fixes

What's Changed

create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
Update test packages by @pzelasko in #837
MultiCut to store multi-channel recordings with shared supervision by @desh2608 in #822
Use CutSet for whisper annotation workflow by @desh2608 in #834
use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
make regexp string raw by @jtrmal in #836
Use absolute recording paths in yesno recipe by @pzelasko in #845
Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
Fixing some QA functions for lazy manifests by @desh2608 in #848
Fix timestamps in Whisper annotation workflow by @pzelasko in #847
Update supervisions channels in multi-channel recipes by @desh2608 in #838
Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
Match cut_id to utt_id if there is exactly one supervision per cut by @wgb14 in #853
forced alignment: use num2words to get word timestamps for numbers by @eschmidbauer in #849
Prepare CSJ by @teowenshen in #851
Small changes in trim_to_supervisions() by @desh2608 in #855
Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
Update fisher_english.py by @maxlvov in #858

New Contributors

@eschmidbauer made their first contribution in #849
@teowenshen made their first contribution in #851
@maxlvov made their first contribution in #858

Full Changelog: v1.8...v1.9

Contributors

desh2608, eschmidbauer, and 5 other contributors

30 Sep 13:18

pzelasko

v1.8

8db6a02

v1.8 Sudden Avalanche

Breaking changes

Python 3.6 is no longer supported as of Lhotse v1.8. If you need Python 3.6, stay on Lhotse 1.7 or earlier.
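
A downstream project could check the interpreter version at startup before importing a newer Lhotse. This is only a minimal sketch: the helper name python_is_supported is hypothetical, and the 3.7 floor is an assumption inferred from the note above that 3.6 was dropped.

```python
import sys

# Assumption: 3.7 is the lowest version still supported once 3.6 is dropped.
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit(
            "This interpreter is older than Python %d.%d; "
            "use Lhotse 1.7 or earlier, or upgrade Python." % MIN_PYTHON
        )
```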

Highlights

A new experimental Lhotse module, workflows, integrates optional third-party packages that help corpus creators automate data curation. As of release 1.8, it supports OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/Hubert ASR bundles for forced alignment.

What's Changed

Fix read and write in piped CLI by @desh2608 in #807
Default behavior of CutSet.mix by @ZuoyunZheng in #809
Adding more info about resampling options by @RuABraun in #815
Add pad_silence option to extend_by by @desh2608 in #816
Message when calling len() on LazyFilter by @desh2608 in #817
Refactor cut and retain git blame history by @desh2608 in #820
Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support by @pzelasko in #824
Drop support for Python 3.6 by @pzelasko in #829
[workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827

New Contributors

@ZuoyunZheng made their first contribution in #809

Full Changelog: v1.7...v1.8

Contributors

desh2608, RuABraun, and 2 other contributors

12 Sep 21:38

pzelasko

v1.7

695abb6

v1.7 - Rejuvenation Potion

What's Changed

add test data to bvcc by @oplatek in #797
Add reverb with fast RIR generator by @desh2608 in #799
Support snip_edges=True in online_inference of Kaldi feature extractors by @pzelasko in #802
Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
Update the documentation related to optional packages by @pzelasko in #805

Full Changelog: v1.6...v1.7

Contributors

oplatek, desh2608, and pzelasko

27 Aug 21:13

pzelasko

v1.6

5e734e5

1.6 - Frozen Palm Tree

What's Changed

Feature/fix 754 voxceleb download by @mikuchar in #776
Support Kaldi data directories without segments file. by @MartinKocour in #789
Add normalization for text of multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
Improve support for custom Recordings by @pzelasko in #791
Add Cut.has(field) method to query Cuts for custom attributes by @pzelasko in #792
Add normalization for aishell2 recipe by @shanguanma in #790

New Contributors

@mikuchar made their first contribution in #776
@MartinKocour made their first contribution in #789

Full Changelog: 1.5...v1.6

Contributors

MartinKocour, pzelasko, and 2 other contributors

09 Aug 01:04

pzelasko

1.5

445cf01

1.5 - Little Leaf

What's Changed

Describe more information about cuts by @pzelasko in #772
Change vctk.py to adapt the vctk dataset downloaded from edinburgh url by @luomingshuang in #775
Fix restoring sampler state with world_size>1 by @pzelasko in #773
Revert #738 to use aidatatang as the prefix for aidatatang_200zh. by @csukuangfj in #782
use tolerance when checking duration mismatch by @shaynemei in #781

New Contributors

@shaynemei made their first contribution in #781

Full Changelog: v1.4...v1.5

Contributors

csukuangfj, pzelasko, and 2 other contributors

07 Jul 01:08

pzelasko

1.4

609af97

v1.4 - Candescent Crust

What's Changed

Fix lambda warnings from lazy manifests + leverage dill if installed for pickling lambdas by @pzelasko in #748
multi_cn recipes: aishell2, magicdata, primewords, stcmds, tal_asr, tal_csasr, thchs_30 by @shanguanma in #738
Deprecate strict, proportional_sampling, and bucket_method arguments by @pzelasko in #756
Fix lhotse cut simple CLI by @pzelasko in #759
Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
DailyTalk recipe by @pzelasko in #767
add aishell2 dev test by @yuekaizhang in #766
Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764

New Contributors

@yuekaizhang made their first contribution in #766

Full Changelog: v1.3...1.4

Contributors

pzelasko, yuekaizhang, and shanguanma

11 Jun 03:35

pzelasko

v1.3

4d22c32

v1.3 - Curiously Inviting Icicles

What's Changed

Fix plotting MixedCut audio tracks by @pzelasko in #723
[continued] Fixes Bucketing sampler equal duration method that drops cuts by @m-wiesner in #724
feature extraction will read RecordingSet from a file, not just json. by @RuABraun in #728
Use lilcom_chunky as default in CLI by @pzelasko in #729
Set CLI torch number of threads to 1 by @pzelasko in #732
Update wenet_speech.py by @fanlu in #731
Fix heroico regex strings by @jtrmal in #734
Update mgb2.py by @AmirHussein96 in #725
Remove file handle caching from LilcomChunkyReader by @pzelasko in #737
Make h5py an optional dependency by @pzelasko in #741
Assert CutSet.mix() argument cuts is not a lazy manifest by @pzelasko in #742
CutSet: more methods are lazy + two simplified common use-cases attach_tensor and load_audio by @pzelasko in #744
Collections: support reading from/writing to "-" (including webdataset) by @pzelasko in #745
fix CommonVoice prepare by @mohsen-goodarzi in #743

New Contributors

@RuABraun made their first contribution in #728
@mohsen-goodarzi made their first contribution in #743

Full Changelog: v1.2...v1.3

Contributors

fanlu, jtrmal, and 5 other contributors

19 May 17:09

pzelasko

v1.2

024890f

v1.2 - Winter in the South

New Recipes

Adding lhotse recipe to prepare eval2000 data by @GoVivace in #679
adding Earnings-21 dataset from rev-dot-com by @jtrmal in #709
Adding the second revdotcom's earnings corpus by @jtrmal in #713
MGB2 recipe by @AmirHussein96 in #718

What's Changed

Fix import namespaces by @pzelasko in #698
.repeat(..., preserve_id=...) option for repeating manifests by @pzelasko in #699
Kaldi impex: remove invalid test by @jtrmal in #700
Minor fix in base url for AliMeeting download by @desh2608 in #702
[aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
[ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
[ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
Test for webdataset data de-duplication across ranks by @pzelasko in #706
Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
Full shard shuffling with webdataset by @pzelasko in #711
Raise an error when BucketingSampler is used with a lazy CutSet by @pzelasko in #710
Normalize output path names for recipes by @desh2608 in #712
[webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
Update examples of combining datasets with RoundRobinSampler and add stop_early option. by @pzelasko in #716
pre-commit, isort + CI checks + running it on all code by @pzelasko in #720

New Contributors

@GoVivace made their first contribution in #679
@AmirHussein96 made their first contribution in #718

Full Changelog: v1.1...v1.2

Contributors

desh2608, jtrmal, and 4 other contributors

03 May 12:05

pzelasko

v1.1

fbe0461

v1.1 - Minor Mana Potion

What's Changed

auto_increment_epoch in IterableDatasetWrapper, strict=True default for bucketing by @pzelasko in #661
Support storing Recording objects in cuts custom fields by @pzelasko in #662
nara_wpe based WPE dereverberation as data augmentation by @pzelasko in #663
Update rirs noises path by @Tomiinek in #665
Fix reading duration when importing piped input data from kaldi by @csukuangfj in #667
Fixes to split-lazy by @wgb14 in #664
Read matrix shape information without reading the whole matrix. by @csukuangfj in #668
Set a num_digits for split-lazy by @wgb14 in #669
DynamicBucketingSampler supports very small data by @pzelasko in #670
~20x faster speed perturbation by @pzelasko in #672
Fix assertions, rename variables by @pzelasko in #677
Add max_cuts option to DynamicBucketingSampler by @pzelasko in #681
Fix recording_id in MixedCut.compute_and_store_features(..., mix_eagerly=True) by @pzelasko in #682
Support for restoring state of dynamic samplers by @pzelasko in #684
Add .cut_into_windows method to individual cuts by @pzelasko in #685
kaldi pipeline -- small whitespace fixes and making pycheck happier by @jtrmal in #686
Some minor fixes. by @csukuangfj in #688
kaldi import/export -- adding basic tests for kaldi import/export by @jtrmal in #687
Fix bug in cut.truncate, format selection in cut.save_audio, less memory use in AudioMixer by @pzelasko in #690
Pad channels to same length when loading audio by @desh2608 in #689
Fix restoring state in dynamic samplers with DataLoader num_workers>0 by @pzelasko in #692
Fix samplers dropping cuts when world_size > 1 by @pzelasko in #695
FIX: kaldi import feats.scp -- use the correct id when underscore mapping by @jtrmal in #694
Audio loading fault tolerant feature extraction in compute_and_store_features by @pzelasko in #683

New Contributors

@Tomiinek made their first contribution in #665
@wgb14 made their first contribution in #664

Full Changelog: v1.0...v1.1

Contributors

csukuangfj, desh2608, and 4 other contributors

06 Apr 01:25

pzelasko

v1.0

68f8792

South Peak

Recipes

New corpora

VoxCeleb and RIR recipes by @desh2608 in #475
Add WenetSpeech recipe by @pkufool in #487
ICSI meeting corpus by @desh2608 in #526
ASpIRE data preparation recipe by @desh2608 in #528
The People's Speech recipe by @pzelasko in #529
bvcc/VoiceMOS challenge data recipe by @oplatek in #578
Add recipe for aidatatang_200zh. by @csukuangfj in #593
SPGISpeech recipe by @desh2608 in #600
[recipe] AliMeeting by @desh2608 in #608

Existing recipes improvements

A change for tedlium.py by @luomingshuang in #479
improve adept preparation: adds text interpretation by @oplatek in #474
Reduce memory usage when writing GigaSpeech manifests by @pzelasko in #494
adding previous utterance for libritts supervisions by @oplatek in #510
bugfix for prepare_librispeech command by @rosrad in #516
Improving Fisher recipe by @pzelasko in #539
Some Fisher fixes and manifest validation fixes by @pzelasko in #541
ICSI Recipe - Minor Doc correction by @LasseWolter in #544
Minor updates for AMI and ICSI by @desh2608 in #545
[AMI] Added missing IHM channels by @LasseWolter in #555
Minor changes to LibriCSS and AISHELL-4 recipes by @desh2608 in #580
Updated ICSI download structure/args for clarity by @LasseWolter in #583
Fix(icsi-recipe): Use new directory structure in prepare_icsi by @LasseWolter in #592
Babel recipe fix by @m-wiesner in #647

New features

Support for custom attributes in Cuts

This feature allows attaching arbitrary types of data to cuts: alignments, multiple feature sets, etc.

Array and TemporalArray manifests (generalization of Features for arbitrary data) by @pzelasko in #458
Support custom MonoCut attrs (with special support for loading Arrays) by @pzelasko in #459
Collation of custom cut fields into PyTorch tensors by @pzelasko in #476
custom_collate_field pads to the longest read array instead of using CutSet.pad() by @pzelasko in #482
Add with_path_prefix to Array and TemporalArray by @pzelasko in #499
Collation promotes custom field int sequences to int64 by @pzelasko in #507
More flexibility in mixing cuts with custom attributes by @pzelasko in #642
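
The idea of custom attributes can be illustrated with a small standalone sketch. It is not Lhotse's implementation: SketchCut is a hypothetical stand-in class, and the assumption here is simply that unknown attributes are kept in a dictionary named custom and exposed through normal attribute access.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchCut:
    """Hypothetical stand-in for a cut; not Lhotse's MonoCut."""

    id: str
    start: float
    duration: float
    custom: Optional[Dict[str, Any]] = None

    def __setattr__(self, name: str, value: Any) -> None:
        # Declared fields are stored normally; anything else goes to `custom`.
        if name in self.__dataclass_fields__:
            object.__setattr__(self, name, value)
        else:
            if self.custom is None:
                object.__setattr__(self, "custom", {})
            self.custom[name] = value

    def __getattr__(self, name: str) -> Any:
        # Called only when regular lookup fails, so fields are unaffected.
        custom = self.__dict__.get("custom")
        if custom is not None and name in custom:
            return custom[name]
        raise AttributeError(name)


cut = SketchCut(id="utt-1", start=0.0, duration=2.5)
# Attach arbitrary data: a word alignment and a second feature-like vector.
cut.alignment = [("hello", 0.0, 0.4), ("world", 0.5, 0.9)]
cut.speaker_embedding = [0.1, 0.2, 0.3]
```

Keeping the extra data in a single dictionary means a manifest with custom fields can still be serialized as one plain record, which is the main reason for this layout in the sketch.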

Data augmentation

Reverberation using room impulse response by @desh2608 in #477
Add early reverb option for RIR-based augmentation by @desh2608 in #524
Options for handling multi-channel RIR by @desh2608 in #621

"Lazy cuts" (less memory / faster execution)

Option to create a CutSet lazily when everything is sorted on recording IDs by @pzelasko in #493
Enable adding/combining of lazy manifests by @pzelasko in #495
Resumable batch feature extraction with reduced memory usage by @pzelasko in #508
DynamicBucketingSampler: on-the-fly bucketing with restricted memory usage by @pzelasko in #517
DynamicCutSampler (like DynamicBucketingSampler but without bucketing) by @pzelasko in #579
Custom binary feature storage format by @pzelasko in #522
lhotse split-lazy and CutSet.split_lazy() for memory-efficient splits by @pzelasko in #558
Optional dependency orjson for up to 50% faster JSONL reading by @pzelasko in #563
CutSet multiplexing by @pzelasko in #565
stop_early arg for CutSet.mux by @pzelasko in #585
Option to set shuffle buffer size in dynamic samplers by @pzelasko in #587
WebDataset integration for optimized sequential I/O by @pzelasko in #582
Add WebDataset export CLI and a fault_tolerant option by @pzelasko in #599
Add WebdatasetWriter for iterative cut writing by @pzelasko in #602
More lazily evaluated methods: map, filter, repeat, shuffle by @pzelasko in #626
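
The memory savings behind lazy manifests come from processing one item at a time instead of materializing the whole set. Below is a rough standard-library sketch of that pattern over JSONL input. The helper names read_jsonl, lazy_filter and lazy_map are hypothetical and do not reflect Lhotse's actual classes.

```python
import io
import json
from typing import Callable, Iterable, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def lazy_filter(items: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Keep only items for which predicate is true, evaluated on demand."""
    return (item for item in items if predicate(item))


def lazy_map(items: Iterable[dict], fn: Callable[[dict], object]) -> Iterator[object]:
    """Apply fn to each item, evaluated on demand."""
    return (fn(item) for item in items)


# In-memory stand-in for a JSONL manifest file (made-up example records).
raw = io.StringIO(
    '{"id": "a", "duration": 1.5}\n'
    '{"id": "b", "duration": 12.0}\n'
    '{"id": "c", "duration": 3.0}\n'
)

cuts = read_jsonl(raw)
short = lazy_filter(cuts, lambda c: c["duration"] < 10.0)
ids = lazy_map(short, lambda c: c["id"])

# Nothing has been parsed yet; pulling one item parses only the first line.
first = next(ids)
rest = list(ids)
```

Because every stage is a generator, memory use stays roughly constant regardless of manifest size. The trade-off is that operations needing the full collection, such as len() or sorting, are not available without iterating.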

New and refreshed APIs

Deprecate compute_and_store_recording, add save_audio by @desh2608 in #486
CutSet.decompose() with CLI and streaming read/write enhancements by @pzelasko in #496
Opensmile wrapper by @marcinwitkowski in #504
fill_supervision method for cuts and cut sets by @pzelasko in #505
merge_supervisions method for cuts and cut sets by @pzelasko in #503
Add CLI for lhotse cut trim-to-supervision <in> <out> by @pzelasko in #514
CLI for validating recordings+supervisions and extra supervision check by @pzelasko in #515
rng arg for CutSet.truncate() by @pzelasko in #557
Add cut extend method by @desh2608 in #571
Output more information from CutSet.describe. by @csukuangfj in #606
Cut into windows with hop by @marcinwitkowski in #651

Improved Kaldi import/export

KaldiWriter for exporting features to feats.scp by @pzelasko in #473
Support multi-channel wavs in Kaldi export by @pzelasko in #480
Make speaker_id prefix of utt_id for Kaldi by @desh2608 in #530
Fix bug in export to kaldi function by @HuangZiliAndy in #531
Add method to create SupervisionSet from RTTM files by @desh2608 in #566
Replace kaldiio with kaldi_native_io. by @csukuangfj in #584

PyTorch API

Add SpecAugment state dict by @janvainer in #472
Enable threaded batch IO by @pzelasko in #520
GlobalMVN and Wav2LogFilterBank torchscriptable by @janvainer in #521
Streaming Kaldi feature extractors by @pzelasko in #523
Skipping problematic audios during dataloading - continued by @pzelasko in #533
Make samplers picklable by @pzelasko in #542
Rename SingleCutSampler -> SimpleCutSampler by @pzelasko in #546
Sampler reports for how much padding is typically used by @pzelasko in #560
Modified SpecAugment by @luomingshuang in #598
optimization for specaugment by @luomingshuang in #604
Return audio_lens in multi channel audio collater by @desh2608 in #616
Add sampling statistics report to dynamic samplers by @pzelasko in #628
New parameter: OnTheFlyFeatures(..., return_audio=True) by @pzelasko in #629
RoundRobinSampler: samples mini-batches in turn from each sub-sampler by @pzelasko in #649
Sampler diagnostics updates: preserved across epochs, fixes for various samplers, extended unit tests by @pzelasko in #639
Stricter batch constraint exceeding checks by @pzelasko in #653
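
The round-robin idea behind RoundRobinSampler can be sketched with plain iterators. This is an illustration only, not the library's code: the function round_robin is hypothetical, and the stop_early behavior shown (finish as soon as any source is exhausted) is an assumption about the intended semantics.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def round_robin(sources: List[Iterable[T]], stop_early: bool = False) -> Iterator[T]:
    """Yield one item from each source in turn.

    With stop_early=False, exhausted sources are dropped and iteration
    continues until every source is empty. With stop_early=True, iteration
    ends as soon as any source runs out.
    """
    iterators = [iter(source) for source in sources]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                if stop_early:
                    return
                continue  # drop this exhausted source, keep the others
            yield item
            still_active.append(it)
        iterators = still_active


# Stand-ins for mini-batches produced by two sub-samplers.
batches_a = ["a1", "a2", "a3"]
batches_b = ["b1"]

all_batches = list(round_robin([batches_a, batches_b]))
early_batches = list(round_robin([batches_a, batches_b], stop_early=True))
```

Alternating sources this way keeps each mini-batch drawn from a single dataset, which can matter when the datasets differ in duration distribution or sampling constraints.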

Documentation

Refresh tutorials in examples + quality of life improvements in code by @pzelasko in #617
Tutorial for Lhotse's WebDataset integration by @pzelasko in #619
Add links in tutorials, fix some issues with lazy manifes...

Contributors

oplatek, rosrad, and 13 other contributors

Releases: lhotse-speech/lhotse
