Fast heuristic construction of a ancestral recombination graphs from sequence data.
##Installation
Simply run make
within the fastARG directory. This requires a C compiler and the zlib library.
##Basic usage
To construct an ARG from haplotype data
$ fastarg build ex1.txt > output.arg
To output haplotypes from this ARG
$ fastarg leafseq output.arg > output.txt
Other possible commands are displayed by running ./fastARG
with no additional arguments.
##Input format
For m loci (sites) and n haplotypes/genotypes, the input file should consist of m lines, each giving an (integer) position followed by whitespace, then a string of n non-whitespace characters denoting the state of the n genotypes (or haplotypes) at that locus.
The file ex1.txt
in the fastARG directory gives an example for m=3 loci and n=4 haplotypes:
1 1000
2 0011
3 0101
Note that the first integer on the line (the "position") is currently treated only as a label, and otherwise ignored. The position along the genome is given by the position of the line in the file: the first line denotes the state of haplotypes at position 0, the second the stare at position 1, etc.
The sequence data consists of non-whitespace characters with character states represented as 0
or 1
. For unphased (genotype) data additionally the character 2
indicates a heterozygote - if no 2
s are present, the characters are assumed to come from haploid genomes. Any other character is taken as missing data (internally represented as 3
).
#Output format
Each line of the ouput file starts with a single letter which indicates the interpretation of the (tab-separated) digits listed on the rest of that line. The initial lines give the starting parameters:
###Starting parameters
A single line starting with E
is followed by an single (long) integer giving the random seed initially passed to the srand48()
function
A single line starting with N
is followed by two integers giving the number of loci (m) and the number of haplotypes (n) respectively
###ARG specification
The following set of lines describe the ARG. Note that nodes in the ARG are indexed from 0...(n-1) for leaf nodes, with numbers from n upwards used for internal ARG nodes.
Any number of lines starting with C
may exist, which indicate coalescence events. The initial C
is followed by 5 or more tab-delimited integers. The first 4 integers represent the current node ID, a descendant node ID, and the half-closed genomic interval over which this coalescence applies. For instance, the lines
C 5 4 0 2 0
C 5 3 0 3 0
Indicate that node 5 consists of a coalescence between (some of) nodes 4 and 3. For node 4, the parts of the haplotype that coalesce at node 5 are those indexed by the interval [0,2) i.e. loci 0 and 1. For node 3, the parts of the haplotype that coalesce are in the interval [0,3) i.e. loci 0, 1, and 2.
The fifth number on the coalescence line denotes the number of mutations that
Any number of lines starting with R may exist, which indicate recombination events. This is followed by 5 or
###Sequence specification
A single line starting with an S
is followed by the ID of the deepest node in all trees (the root), then a sequence of m characters giving the inferred haplotype at this root. This serves to completely define the ancestral locus history.
#Example
The ex1.txt
file should produce something like the following output
E 11
N 4 3
R 4 2 0 3 0
C 5 4 0 2 0
C 5 3 0 3 0
C 6 0 0 3 1 0
C 6 4 2 3 0
C 7 1 0 3 0
C 7 5 0 3 1 1
C 8 6 0 3 0
C 8 7 0 3 1 2
S 8 000
showing that the random seed is 11 (line 1) and the number of haplotypes and number of loci is 4 and 3 respectively (line 2).
The ARG indicated by the R
and C
records in the above file is shown by the diagram to the left below. Recombination nodes have been prefixed with R, coalescence events with C, and dots represent lines that visually cross, but do not intersect. The diagram to the right is the same but with mutations overlain: *0 represents a mutation at locus 0, *1 at locus 1, and *2 at locus 2
C8 C8
/ \ / *2
/ C7 / C7
/ / \ / / *1
C6 / C5 C6 / C5
/ \. / | / \. / |
/ .\ / | / .\ / |
/ / 4R | *0 / 4R |
| | | | | | | |
0 1 2 3 0 1 2 3
== ARG == == ARG with ==
== mutations==
The output also states that the line joining node 5 to node 4 only applies to loci in the range [0,2), and the line joining node 4 to node 6 applies to loci in the range [2,3). Thus loci 0 and 1 are inferred to be related as shown in the left hand coalescence tree below, whereas the inferred tree for locus 2 is the right hand tree below.
C8 C8
/ \ / *2
/ C7 / C7
/ / *1 / / \
C6 / C5 C6 / C5
/ / / | / \. |
/ / / | / .\ |
*0 / 4R | / / 4R |
| | | | | | | |
0 1 2 3 0 1 2 3
== Loci == == Locus 2 ==
== 0 & 1==
From this we can infer that haplotype 0 has a mutation at position 0, haplotype 1 has a mutation at locus 2, haplotype 2 has a mutation at locus 1, and haplotype 3 has a mutation at loci 1 and 2.
The final line in the file (initial letter S
) states that node 8 (the root) is inferred to have the sequence 000
, hence all mutations are from 0->1, and the haplotypes are
Haplotype
Locus | 0 1 2 3
------+---------
0 | 1 0 0 0
1 | 0 0 1 1
2 | 0 1 0 1
which reconstructs the same data as in the original input file.