Releases: lhotse-speech/lhotse
v1.9 Neighboring Peaks
Major features
- `MultiCut` data type: simplifies working with multi-channel data (contribution from @desh2608)
- CSJ recipe (contribution from @teowenshen)
- lots of bug fixes
What's Changed
- create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
- kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
- Update test packages by @pzelasko in #837
- `MultiCut` to store multi-channel recordings with shared supervision by @desh2608 in #822
- Use CutSet for whisper annotation workflow by @desh2608 in #834
- use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
- Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
- make regexp string raw by @jtrmal in #836
- Use absolute recording paths in yesno recipe by @pzelasko in #845
- Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
- Fixing some QA functions for lazy manifests by @desh2608 in #848
- Fix timestamps in Whisper annotation workflow by @pzelasko in #847
- Update supervisions channels in multi-channel recipes by @desh2608 in #838
- Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
- Match `cut_id` to `utt_id` if there is exactly one supervision per cut by @wgb14 in #853
- forced alignment: use `num2words` to get word timestamps for numbers by @eschmidbauer in #849
- Prepare CSJ by @teowenshen in #851
- Small changes in `trim_to_supervisions()` by @desh2608 in #855
- Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
- Update fisher_english.py by @maxlvov in #858
New Contributors
- @eschmidbauer made their first contribution in #849
- @teowenshen made their first contribution in #851
- @maxlvov made their first contribution in #858
Full Changelog: v1.8...v1.9
v1.8 Sudden Avalanche
Breaking changes
- Python 3.6 is no longer supported as of Lhotse v1.8. If you need to use Python 3.6, please revert to Lhotse 1.7 or earlier.
Highlights
- New experimental Lhotse module: `workflows`. It integrates optional third-party packages that assist corpus creators in automated data curation. With release 1.8, we support OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/Hubert ASR bundles for forced alignment.
What's Changed
- Fix read and write in piped CLI by @desh2608 in #807
- Default behavior of CutSet.mix by @ZuoyunZheng in #809
- Adding more info about resampling options by @RuABraun in #815
- Add `pad_silence` option to `extend_by` by @desh2608 in #816
- Message when calling len() on LazyFilter by @desh2608 in #817
- Refactor cut and retain `git blame` history by @desh2608 in #820
- Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
- Experimental Lhotse feature: corpus creation tools (`workflows`), starting with OpenAI Whisper support by @pzelasko in #824
- Drop support for Python 3.6 by @pzelasko in #829
- [workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827
New Contributors
- @ZuoyunZheng made their first contribution in #809
Full Changelog: v1.7...v1.8
v1.7 - Rejuvenation Potion
What's Changed
- add test data to bvcc by @oplatek in #797
- Add reverb with fast RIR generator by @desh2608 in #799
- Support `snip_edges=True` in `online_inference` of Kaldi feature extractors by @pzelasko in #802
- Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
- Update the documentation related to optional packages by @pzelasko in #805
Full Changelog: v1.6...v1.7
1.6 - Frozen Palm Tree
What's Changed
- Feature/fix 754 voxceleb download by @mikuchar in #776
- Support Kaldi data directories without segments file. by @MartinKocour in #789
- Add normalization for text of multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
- Improve support for custom Recordings by @pzelasko in #791
- Add `Cut.has(field)` method to query Cuts for custom attributes by @pzelasko in #792
- Add normalization for aishell2 recipe by @shanguanma in #790
New Contributors
- @mikuchar made their first contribution in #776
- @MartinKocour made their first contribution in #789
Full Changelog: 1.5...v1.6
1.5 - Little Leaf
What's Changed
- Describe more information about cuts by @pzelasko in #772
- Change vctk.py to adapt to the VCTK dataset downloaded from the Edinburgh URL by @luomingshuang in #775
- Fix restoring sampler state with `world_size>1` by @pzelasko in #773
- Revert #738 to use aidatatang as the prefix for aidatatang_200zh. by @csukuangfj in #782
- use tolerance when checking duration mismatch by @shaynemei in #781
New Contributors
- @shaynemei made their first contribution in #781
Full Changelog: v1.4...v1.5
v1.4 - Candescent Crust
What's Changed
- Fix lambda warnings from lazy manifests + leverage `dill` if installed for pickling lambdas by @pzelasko in #748
- `multi_cn` recipes: `aishell2`, `magicdata`, `primewords`, `stcmds`, `tal_asr`, `tal_csasr`, `thchs_30` by @shanguanma in #738
- Deprecate `strict`, `proportional_sampling`, and `bucket_method` arguments by @pzelasko in #756
- Fix `lhotse cut simple` CLI by @pzelasko in #759
- Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
- DailyTalk recipe by @pzelasko in #767
- add aishell2 dev test by @yuekaizhang in #766
- Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
- Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764
New Contributors
- @yuekaizhang made their first contribution in #766
Full Changelog: v1.3...1.4
v1.3 - Curiously Inviting Icicles
What's Changed
- Fix plotting MixedCut audio tracks by @pzelasko in #723
- [continued] Fixes Bucketing sampler equal duration method that drops cuts by @m-wiesner in #724
- feature extraction will read RecordingSet from a file, not just json. by @RuABraun in #728
- Use `lilcom_chunky` as default in CLI by @pzelasko in #729
- Set CLI torch number of threads to 1 by @pzelasko in #732
- Update wenet_speech.py by @fanlu in #731
- Fix heroico regex strings by @jtrmal in #734
- Update mgb2.py by @AmirHussein96 in #725
- Remove file handle caching from LilcomChunkyReader by @pzelasko in #737
- Make `h5py` an optional dependency by @pzelasko in #741
- Assert `CutSet.mix()` argument `cuts` is not a lazy manifest by @pzelasko in #742
- `CutSet`: more methods are lazy + two simplified common use-cases `attach_tensor` and `load_audio` by @pzelasko in #744
- Collections: support reading from/writing to "-" (including webdataset) by @pzelasko in #745
- fix CommonVoice prepare by @mohsen-goodarzi in #743
New Contributors
- @RuABraun made their first contribution in #728
- @mohsen-goodarzi made their first contribution in #743
Full Changelog: v1.2...v1.3
v1.2 - Winter in the South
New Recipes
- Adding lhotse recipe to prepare eval2000 data by @GoVivace in #679
- adding Earnings-21 dataset from rev-dot-com by @jtrmal in #709
- Adding the second revdotcom's earnings corpus by @jtrmal in #713
- MGB2 recipe by @AmirHussein96 in #718
What's Changed
- Fix import namespaces by @pzelasko in #698
- `.repeat(..., preserve_id=...)` option for repeating manifests by @pzelasko in #699
- Kaldi impex: remove invalid test by @jtrmal in #700
- Minor fix in base url for AliMeeting download by @desh2608 in #702
- [aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
- [ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
- [ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
- Test for webdataset data de-duplication across ranks by @pzelasko in #706
- Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
- Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
- Full shard shuffling with webdataset by @pzelasko in #711
- Raise an error when `BucketingSampler` is used with a lazy `CutSet` by @pzelasko in #710
- Normalize output path names for recipes by @desh2608 in #712
- [webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
- Update examples of combining datasets with RoundRobinSampler and add `stop_early` option. by @pzelasko in #716
- `pre-commit`, `isort` + CI checks + running it on all code by @pzelasko in #720
New Contributors
- @GoVivace made their first contribution in #679
- @AmirHussein96 made their first contribution in #718
Full Changelog: v1.1...v1.2
v1.1 - Minor Mana Potion
What's Changed
- `auto_increment_epoch` in `IterableDatasetWrapper`, `strict=True` default for bucketing by @pzelasko in #661
- Support storing `Recording` objects in cuts custom fields by @pzelasko in #662
- `nara_wpe` based WPE dereverberation as data augmentation by @pzelasko in #663
- Update rirs noises path by @Tomiinek in #665
- Fix reading duration when importing piped input data from kaldi by @csukuangfj in #667
- Fixes to split-lazy by @wgb14 in #664
- Read matrix shape information without reading the whole matrix. by @csukuangfj in #668
- Set a num_digits for split-lazy by @wgb14 in #669
- DynamicBucketingSampler supports very small data by @pzelasko in #670
- ~20x faster speed perturbation by @pzelasko in #672
- Fix assertions, rename variables by @pzelasko in #677
- Add `max_cuts` option to `DynamicBucketingSampler` by @pzelasko in #681
- Fix `recording_id` in `MixedCut.compute_and_store_features(..., mix_eagerly=True)` by @pzelasko in #682
- Support for restoring state of dynamic samplers by @pzelasko in #684
- Add `.cut_into_windows` method to individual cuts by @pzelasko in #685
- kaldi pipeline -- small whitespace fixes and making pycheck happier by @jtrmal in #686
- Some minor fixes. by @csukuangfj in #688
- kaldi import/export -- adding basic tests for kaldi import/export by @jtrmal in #687
- Fix bug in `cut.truncate`, format selection in `cut.save_audio`, less memory use in `AudioMixer` by @pzelasko in #690
- Pad channels to same length when loading audio by @desh2608 in #689
- Fix restoring state in dynamic samplers with DataLoader num_workers>0 by @pzelasko in #692
- Fix samplers dropping cuts when world_size > 1 by @pzelasko in #695
- FIX: kaldi import feats.scp -- use the correct id when underscore mapping by @jtrmal in #694
- Audio loading fault tolerant feature extraction in `compute_and_store_features` by @pzelasko in #683
New Contributors
Full Changelog: v1.0...v1.1
South Peak
Recipes
New corpora
- VoxCeleb and RIR recipes by @desh2608 in #475
- Add WenetSpeech recipe by @pkufool in #487
- ICSI meeting corpus by @desh2608 in #526
- ASpIRE data preparation recipe by @desh2608 in #528
- The People's Speech recipe by @pzelasko in #529
- bvcc/VoiceMOS challenge data recipe by @oplatek in #578
- Add recipe for aidatatang_200zh. by @csukuangfj in #593
- SPGISpeech recipe by @desh2608 in #600
- [recipe] AliMeeting by @desh2608 in #608
Existing recipes improvements
- A change for tedlium.py by @luomingshuang in #479
- improve adept preparation: adds text interpretation by @oplatek in #474
- Reduce memory usage when writing GigaSpeech manifests by @pzelasko in #494
- adding previous utterance for libritts supervisions by @oplatek in #510
- bugfix for prepare_librispeech command by @rosrad in #516
- Improving Fisher recipe by @pzelasko in #539
- Some Fisher fixes and manifest validation fixes by @pzelasko in #541
- ICSI Recipe - Minor Doc correction by @LasseWolter in #544
- Minor updates for AMI and ICSI by @desh2608 in #545
- [AMI] Added missing IHM channels by @LasseWolter in #555
- Minor changes to LibriCSS and AISHELL-4 recipes by @desh2608 in #580
- Updated ICSI download structure/args for clarity by @LasseWolter in #583
- Fix(icsi-recipe): Use new directory structure in prepare_icsi by @LasseWolter in #592
- Babel recipe fix by @m-wiesner in #647
New features
Support for custom attributes in Cuts
This feature allows attaching arbitrary types of data to cuts: alignments, multiple feature sets, etc.
- Array and TemporalArray manifests (generalization of Features for arbitrary data) by @pzelasko in #458
- Support custom MonoCut attrs (with special support for loading Arrays) by @pzelasko in #459
- Collation of custom cut fields into PyTorch tensors by @pzelasko in #476
- `custom_collate_field` pads to the longest read array instead of using `CutSet.pad()` by @pzelasko in #482
- Add `with_path_prefix` to `Array` and `TemporalArray` by @pzelasko in #499
- Collation promotes custom field int sequences to int64 by @pzelasko in #507
- More flexibility in mixing cuts with custom attributes by @pzelasko in #642
Data augmentation
- Reverberation using room impulse response by @desh2608 in #477
- Add early reverb option for RIR-based augmentation by @desh2608 in #524
- Options for handling multi-channel RIR by @desh2608 in #621
"Lazy cuts" (less memory / faster execution)
- Option to create a CutSet lazily when everything is sorted on recording IDs by @pzelasko in #493
- Enable adding/combining of lazy manifests by @pzelasko in #495
- Resumable batch feature extraction with reduced memory usage by @pzelasko in #508
- `DynamicBucketingSampler`: on-the-fly bucketing with restricted memory usage by @pzelasko in #517
- `DynamicCutSampler` (like `DynamicBucketingSampler` but without bucketing) by @pzelasko in #579
- Custom binary feature storage format by @pzelasko in #522
- `lhotse split-lazy` and `CutSet.split_lazy()` for memory-efficient splits by @pzelasko in #558
- Optional dependency `orjson` for up to 50% faster JSONL reading by @pzelasko in #563
- CutSet multiplexing by @pzelasko in #565
- `stop_early` arg for `CutSet.mux` by @pzelasko in #585
- Option to set shuffle buffer size in dynamic samplers by @pzelasko in #587
- WebDataset integration for optimized sequential I/O by @pzelasko in #582
- Add WebDataset export CLI and a `fault_tolerant` option by @pzelasko in #599
- Add `WebdatasetWriter` for iterative cut writing by @pzelasko in #602
- More lazily evaluated methods: `map`, `filter`, `repeat`, `shuffle` by @pzelasko in #626
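
The multiplexing entries above (CutSet multiplexing and the `stop_early` argument) can be pictured with a small, library-free sketch. This is not Lhotse's implementation: the function name `mux_sketch`, its parameters, and the uniform default weights are assumptions made for illustration. It assumes that multiplexing means lazily interleaving several iterables by weighted random choice, and that `stop_early` ends iteration as soon as any one source runs out.

```python
import random
from typing import Iterable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def mux_sketch(
    *sources: Iterable[T],
    weights: Optional[Sequence[float]] = None,
    seed: int = 0,
    stop_early: bool = False,
) -> Iterator[T]:
    """Lazily interleave items from several iterables (hypothetical helper).

    At each step one still-active source is picked at random, in proportion
    to its weight, and its next item is yielded. Nothing is loaded eagerly.
    """
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    # Assumption: uniform weights when none are given.
    w = list(weights) if weights is not None else [1.0] * len(iters)
    if len(w) != len(iters):
        raise ValueError("Expected exactly one weight per source.")
    while iters:
        idx = rng.choices(range(len(iters)), weights=w, k=1)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            if stop_early:
                return  # one source ran out: stop the whole stream
            # Otherwise drop the exhausted source and keep sampling the rest.
            del iters[idx]
            del w[idx]


full = list(mux_sketch([1, 2, 3], [10, 20], seed=0))
short = list(mux_sketch([1, 2, 3], [10, 20], seed=0, stop_early=True))
```

With `stop_early=False` every item from every source is eventually produced, and the relative order within each source is preserved. With `stop_early=True` the stream may end while other sources still hold items.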
New and refreshed APIs
- Deprecate compute_and_store_recording, add save_audio by @desh2608 in #486
- CutSet.decompose() with CLI and streaming read/write enhancements by @pzelasko in #496
- Opensmile wrapper by @marcinwitkowski in #504
- `fill_supervision` method for cuts and cut sets by @pzelasko in #505
- `merge_supervisions` method for cuts and cut sets by @pzelasko in #503
- Add CLI for `lhotse cut trim-to-supervision <in> <out>` by @pzelasko in #514
- CLI for validating recordings+supervisions and extra supervision check by @pzelasko in #515
- `rng` arg for `CutSet.truncate()` by @pzelasko in #557
- Add cut extend method by @desh2608 in #571
- Output more information from CutSet.describe. by @csukuangfj in #606
- Cut into windows with hop by @marcinwitkowski in #651
Improved Kaldi import/export
- KaldiWriter for exporting features to feats.scp by @pzelasko in #473
- Support multi-channel wavs in Kaldi export by @pzelasko in #480
- Make speaker_id prefix of utt_id for Kaldi by @desh2608 in #530
- Fix bug in export to kaldi function by @HuangZiliAndy in #531
- Add method to create SupervisionSet from RTTM files by @desh2608 in #566
- Replace kaldiio with kaldi_native_io. by @csukuangfj in #584
PyTorch API
- Add SpecAugment state dict by @janvainer in #472
- Enable threaded batch IO by @pzelasko in #520
- GlobalMVN and Wav2LogFilterBank torchscriptable by @janvainer in #521
- Streaming Kaldi feature extractors by @pzelasko in #523
- Skipping problematic audios during dataloading - continued by @pzelasko in #533
- Make samplers picklable by @pzelasko in #542
- Rename `SingleCutSampler` -> `SimpleCutSampler` by @pzelasko in #546
- Sampler reports for how much padding is typically used by @pzelasko in #560
- Modified SpecAugment by @luomingshuang in #598
- optimization for specaugment by @luomingshuang in #604
- Return `audio_lens` in multi channel audio collater by @desh2608 in #616
- Add sampling statistics report to dynamic samplers by @pzelasko in #628
- New parameter: `OnTheFlyFeatures(..., return_audio=True)` by @pzelasko in #629
- RoundRobinSampler: samples mini-batches in turn from each sub-sampler by @pzelasko in #649
- Sampler diagnostics updates: preserved across epochs, fixes for various samplers, extended unit tests by @pzelasko in #639
- Stricter batch constraint exceeding checks by @pzelasko in #653
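
The RoundRobinSampler entry above describes taking mini-batches in turn from each sub-sampler. A minimal, library-free sketch of that idea follows. It is not Lhotse's code: the helper name `round_robin_sketch` is hypothetical, plain Python iterables stand in for samplers, and the `stop_early` behaviour shown (stop as soon as any sub-sampler is exhausted) is an assumption about what such an option would mean.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def round_robin_sketch(
    *samplers: Iterable[T], stop_early: bool = False
) -> Iterator[T]:
    """Yield one batch from each sub-sampler in turn (hypothetical helper)."""
    iters: List[Iterator[T]] = [iter(s) for s in samplers]
    while iters:
        still_active: List[Iterator[T]] = []
        for it in iters:
            try:
                batch = next(it)
            except StopIteration:
                if stop_early:
                    return  # assumed semantics: end when any sampler is done
                continue  # skip the exhausted sampler, keep the others
            yield batch
            still_active.append(it)
        iters = still_active


all_batches = list(round_robin_sketch(["a1", "a2", "a3"], ["b1"]))
# all_batches == ["a1", "b1", "a2", "a3"]
early = list(round_robin_sketch(["a1", "a2", "a3"], ["b1"], stop_early=True))
# early == ["a1", "b1", "a2"]
```

Without `stop_early`, the longer sub-sampler keeps producing batches after the shorter one runs out, so late batches all come from a single dataset. Stopping early avoids that, but leaves some of the larger dataset unseen in the epoch.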