tskibd
is a tool designed to calculate Identity by Descent (IBD) segments from genealogical trees represented in the tree sequence format.
It leverages the tskit C API library, which you can find at tskit repository.
This software has been tested on MacOS and Linux operating systems. The dependencies (including version numbers) for compilation and execution are defined in the env.yml Conda recipe. Installation time is typically within a minute.
If you don't have Conda
installed on your system, you can easily install it from here.
Once Conda
is installed, create the tskit
environment using the recipe
provided in env.yml
by running:
conda env create -f ./env.yml
conda activate tskibd
The meson/ninja
build system is used to compile tskibd
.
Please refer to the previous section to prepare the compilation envrionment.
git clone https://github.com/bguo068/tskibd.git
cd tskibd
git submodule update --init --recursive # IMPORTANT!
mv tskit/c/VERSION tskit/c/VERSION.txt # Fix filename issue
meson setup build
ninja -C build tskibd
The compiled executable can be found at build/tskibd
.
Usage: tskibd <chromN> <<bp_per_cm/recombination_rate_map_path>> <sampling_window> <mincm> <treeseq_file> [<out>]
Positional parameters:
<chr_no> Chromosome number (integer). This parameter is used for
naming the output files, such as 1.ibd and 1.map.
<bp_per_cm/recombination_rate_map_path>
This positional argument can be used in two ways:
(1) if it is a numeric type, it will be interpreted as the
base pairs per centimorgan.
For example, use '1000000' for human or
'15000' for p.f. (P. falciparum).
(2) If it is a string, it will be interpreted as the path
to a recombination rate map file. This file is a
space/tab-delimited two-column table without a header. The
first column contains integers indicating the coordinates
in base-pair, and the second column contains float numbers
indicating the coordinates in centimorgan. For positions
that is not explicited included, we use linear
interpolation/extrapolation for bp to cm mapping.
<sampling_window> Sampling window size in base pairs. Use '1' to check all
trees or '1000' to check trees covering positions of 0,
1000, 2000 bp, and so on. We recommend a window size
equivalent to 0.01 centimorgan. For instance, for human
use 10000 bp, and for p.f. use 150.
<mincm> Minimum length in centimorgan of IBD (Identity by Descent)
segment to keep. IBD segments shorter than this value
will be ignored and not written to the output file.
<treeseq_file> Tree sequence file that has finished coalescent.
<out> Optional: output IBD filename.
If unspecified, it will be '{chrno}.ibd'
Otherwise, it will be '{out}.ibd'
Example running on the test data:
# use default output prefix
build/tskibd 1 15000 150 2 example_data/chr1.trees
# or use specified output prefix
build/tskibd 1 15000 150 2 example_data/chr1.trees out/chr1
# from version 0.0.2, we allow specifying recombination rate map
# instead of a constant recombination rate
echo "750000 50.0" > rate_map.txt
echo "1500000 100.0" >> rate_map.txt
build/tskibd 1 rate_map.txt 150 2 example_data/chr1.trees out/chr1
The input genealogical trees are expected to be in the tree sequence format.
Please ensure that the input trees are fully coalesced. For trees simulated from
msprime
, they are typically fully
coalesced. However, if the trees are from a forward simulator like
SLiM
, additional coalescent simulation
might be required to ensure full coalescence (refer to PySLiM
tutorial for more details).
Notes on mutational information of a tree sequence: the mutational information
is not used to infer IBD segments but is used to annotate IBD segments
(the last column of the output .ibd
file).
If all mutations are included in the tree sequeunce, tskibd
might run very slow.
We recommand directly use tree sequence before mutations are added (no mutations)
or keep only those for sites under selection (selected mutation).
- col1 - Id1: node id 1 (a sample node of the specified tree sequence).
- col2 - Id2: node id 2.
- col3 - Start: IBD segment start position in base pair.
- col4 - End: IBD segment end position in base pair.
- col5 - Ancestor: the node id of the ancestor from which the IBD segment is originated.
- col6 - Tmrca: the time (backward) of the ancestor node.
- col7 - HasMutation: whether the IBD segment carries a mutation. (1: yes, 0: no)
If you find this tool useful, please cite our paper: https://www.nature.com/articles/s41467-024-46659-0