
v0.13.0rc1 - new data structure based on awkward arrays

Pre-release
@grst grst released this 07 Apr 06:58
d8ec147

This update introduces a new data structure based on awkward arrays.
The new data structure is described in more detail in the documentation and is considered the "official" way of representing AIRR data for scverse core and ecosystem packages.

Benefits of the new data structure include:

  • a more natural, lossless representation of AIRR Rearrangement data
  • separation of AIRR data and the receptor model, thereby getting rid of previous limitations (e.g. "only productive chains") and enabling other use-cases (e.g. spatial AIRR data) in the future.
  • clean adata.obs as AIRR data is not expanded into columns
  • support for MuData for working with paired gene expression and AIRR data as separate modalities.

The overall workflow stays the same. However, this update required several backwards-incompatible changes, which are summarized below.

Backwards-incompatible changes

New data structure

Closes issue #327.

Changed behavior:

  • There are no longer "has_ir" and "multichain" columns in adata.obs.
  • By default all fields are imported from AIRR rearrangement and 10x data.
  • The restriction that all chains added to an AirrCell must have the same fields has been removed. Missing fields are automatically filled with missing values.
  • io.upgrade_schema can update from v0.7 to v0.13 schema. AnnData objects generated with scirpy <= 0.6.x cannot be read anymore.
  • pl.spectratype now has a chain attribute and the meaning of the cdr3_col attribute has changed.
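
The relaxed field restriction can be pictured with a minimal sketch. It uses plain dicts in place of chains and is not scirpy's AirrCell implementation; the helper name harmonize_fields, the choice of None as the missing value, and the example sequences are all assumptions made for illustration.

```python
# Illustration only: plain dicts stand in for chains; this is not scirpy's AirrCell.
def harmonize_fields(chains, missing=None):
    """Return copies of `chains` that all share the union of the observed fields."""
    # Collect field names in first-seen order so the output is deterministic.
    all_fields = []
    for chain in chains:
        for field in chain:
            if field not in all_fields:
                all_fields.append(field)
    # Fields a chain does not define are filled with the `missing` placeholder.
    return [
        {field: chain.get(field, missing) for field in all_fields}
        for chain in chains
    ]


# Made-up example chains with different sets of fields.
chains = [
    {"locus": "TRA", "junction_aa": "CAVRDSNYQLIW"},
    {"locus": "TRB", "junction_aa": "CASSLGTDTQYF", "productive": True},
]
harmonized = harmonize_fields(chains)
```

After harmonization, both chains expose the same fields, and the first chain carries the placeholder for "productive".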

New functions:

  • pp.index_chains
  • pp.merge_chains

Removed functions:

  • pp.merge_with_ir
  • pp.merge_airr_chains

API supporting MuData

Closes issue #383

Where applicable, all functions take the following additional, optional keyword arguments:

  • airr_mod: the modality in MuData that contains AIRR information (default: "airr")
  • airr_key: the slot in adata.obsm that contains AIRR rearrangement data (default: "airr")
  • chain_idx_key: the slot in adata.obsm that contains indices specifying which chains in adata.obsm[airr_key] are the primary/secondary chains etc.
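
To make the role of the chain index more concrete, here is a minimal sketch of the idea: the AIRR data stays untouched as a ragged per-cell list of chains, and a separate index records which positions count as primary, secondary, and so on. Nested Python lists stand in for the awkward array, and the function name index_chains_sketch and the ranking rule (descending duplicate_count) are assumptions for illustration, not a description of how pp.index_chains actually selects chains.

```python
# Illustration only: nested lists stand in for the ragged AIRR data
# that would live in adata.obsm[airr_key].
def index_chains_sketch(cells, max_chains=2):
    """For each cell, return chain positions ordered by descending `duplicate_count`.

    The ranking rule is a hypothetical stand-in for a receptor model.
    """
    index = []
    for chains in cells:
        order = sorted(
            range(len(chains)),
            key=lambda i: chains[i]["duplicate_count"],
            reverse=True,
        )
        # Keep at most `max_chains` positions (primary, secondary, ...).
        index.append(order[:max_chains])
    return index


# Three cells: two chains, no chains, one chain (made-up values).
airr = [
    [{"locus": "TRB", "duplicate_count": 3}, {"locus": "TRB", "duplicate_count": 10}],
    [],
    [{"locus": "TRB", "duplicate_count": 5}],
]
chain_idx = index_chains_sketch(airr)
```

Because the index only stores positions, a different receptor model can be applied later without re-importing or modifying the underlying chain data.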

New class:

  • util.DataHandler

Updated example datasets

The example datasets have been updated to use the new data structure and are now provided as MuData objects.

  • The example datasets have been regenerated from scratch using the loader notebooks described in the docstring. The Maynard dataset gene expression is now based on values generated with Salmon instead of RSEM/featurecounts.
  • Scirpy now uses pooch to manage example datasets.

Cleanup

  • Removed the deprecated functions io.from_tcr_objs, io.from_ir_objs, io.to_ir_objs, pp.merge_with_tcr, pp.tcr_neighbors, pp.ir_neighbors, tl.chain_pairing
  • Removed the deprecated classes TcrCell, AirrChain, TcrChain
  • Removed the function pl.cdr_convergence which was never public anyway.

Additions

Easy-access functions (scirpy.get)

Closes issue #184

New functions:

  • get.airr
  • get.obs_context
  • get.airr_context

Fixes

  • Several type hints that were previously inaccurate are now updated.
  • Fix x-axis labelling in pl.clonotype_overlap raises an error if row annotations are not unique for each group.

Documentation

The documentation has been updated to reflect the changes described above, in particular the tutorials and the page about the data structure.

Other changes

  • The minimum required Python version is now 3.8 (#381)
  • Increased the minimum version of tqdm to 4.63 (See tqdm/tqdm#1082)
  • pl.repertoire_overlap now always runs tl.repertoire_overlap internally and doesn't rely on cached values.
  • The mode dendro_only in pl.repertoire_overlap has been removed.
  • Cells that have a receptor but no CDR3 sequence previously received a separate clonotype in tl.define_clonotypes. They now receive no clonotype (i.e. np.nan), as do cells without a receptor.
  • The function tl.clonal_expansion now returns a pd.Series instead of a np.array with inplace=False
  • Removed deprecation for clonotype_imbalanced, see #330
  • The group_abundance tool and plotting function used has_ir as a default group, as we could previously rely on this column being present. With the new data structure, this is no longer the case. To not break old code, the has_ir column is temporarily added when requested. The group_abundance function will have to be rewritten entirely in the future, see #232
  • In pl.spectratype, the parameter groupby has been replaced by chain.
  • We now use isort to organize imports.
  • Static typing has been improved internally (using pylance). It's not perfectly consistent yet, but we will keep working on this in the future.
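
The changed handling of cells without a CDR3 sequence in tl.define_clonotypes (see the list above) can be illustrated with a deliberately simplified sketch. It assigns clonotype ids by exact sequence identity, which is a simplification and not scirpy's actual clonotype definition; the function name assign_clonotypes and the input encoding (None for "no receptor", an empty string for "receptor without CDR3") are assumptions made for this example.

```python
import math


# Illustration only: exact-match clonotypes on a single CDR3 string per cell.
def assign_clonotypes(cdr3_per_cell):
    """Map each cell to a clonotype id, or NaN if it has no usable CDR3 sequence."""
    ids = {}
    result = []
    for cdr3 in cdr3_per_cell:
        if not cdr3:
            # Covers both None (no receptor) and "" (receptor without CDR3):
            # neither receives a clonotype.
            result.append(math.nan)
        else:
            # Reuse the id of an already seen sequence, else assign the next id.
            result.append(ids.setdefault(cdr3, len(ids)))
    return result


# Made-up input: two cells share a sequence, two cells have no usable CDR3.
clonotypes = assign_clonotypes(["CASSF", None, "", "CASSF", "CAVRD"])
```

Under the previous behavior, the cell with the empty sequence would have ended up in a clonotype of its own; in this sketch it is treated the same way as the cell without a receptor.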