title | author | date |
---|---|---|
Graphical Fragment Assembly (GFA) Format Specification |
The GFA Format Specification Working Group |
2015-07-24 |
The master version of this document can be found at
https://github.com/pmelsted/GFA-spec
The purpose of the GFA format is to capture sequence graphs as the product of an assembly, a representation of variation in genomes, splice graphs in genes, or even overlap between reads from long-read sequencing technology.
The GFA format is a tab-delimited text format for describing a set of sequences and their overlap. . The first field of the line identifies the type of the line. Header lines start with H
. Segment lines start with S
. Link lines start with L
. A containment line starts with C
.
- Segment a continuous sequence or subsequence.
- Link an overlap between two segments. Each link is from the end of one segment to the beginning of another segment. The link stores the orientation of each segment and the amount of basepairs overlapping.
- Containment an overlap between two segments where one is contained in the other.
- Path an ordered list of oriented segments, where each consecutive pair of oriented segments are supported by a link record.
Each line in GFA has tab-delimited fields and the first field defines the type of line. The type of the line defines the following required fields. The required fields are followed by optional fields.
Line | Type |
---|---|
H |
Header line |
S |
Segment line |
L |
Link line |
C |
Containment line |
P |
Path line |
All optional fields follow the TAG:TYPE:VALUE
format where TAG
is a two-character string that matches /[A-Za-z][A-Za-z0-9]/
. Each TAG
can only appear once in one line. A TAG
containing lowercase letters are reserved for end users. A TYPE
is a single case-sensitive letter which defines the format of VALUE
.
Type | RegEx | Description |
---|---|---|
A |
[!-~] |
Printable character |
i |
[-+]?[0-9]+ |
Signed integer |
f |
[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? |
Single-precision floating number |
Z |
[ !-~]+ |
Printable string, including space |
J |
[ !-~]+ |
JSON, excluding new-line and tab characters |
H |
[0-9A-F]+ |
Byte array in the Hex format |
B |
[cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+ |
Integer or numeric array |
The header line has only optional fields. The following fields are defined for the header line.
Tag | Type | description |
---|---|---|
VN |
Z |
Version number |
Col | Field | Type | Regexp/Range | Brief description |
---|---|---|---|---|
2 | Name |
String | [!-)+-<>-~][!-~]* |
Segment name |
3 | Sequence |
String | `* | [A-Za-z=.]+` |
Segment names must not contain whitespace characters or start with *
or =
. All other printable ASCII characters are allowed. Names are case sensitive.
The Sequence field can be *
meaning that the sequence is not stored in the GFA file.
Tag | Type | Description |
---|---|---|
LN |
i |
Segment length |
RC |
i |
Read count |
FC |
i |
Fragment count |
KC |
i |
k-mer count |
OC |
J |
Occurences |
The field is designated for pan-genome analysis graphs and represents all occurences of a segment in the genomes. The field should satisfy the following JSON schema:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "Occurrences",
"description": "Occurrences of a segment in genomes",
"type": "array",
"items": {
"title": "Occcurence",
"type": "object",
"properties": {
"chr": {
"description": "An identifier of the chromosome containing the occurence",
"type": "string"
},
"source": {
"description": "A source containing the chromosome, e.g. a FASTA file",
"type": "string"
},
"pos": {
"description": "Zero-based position of the start of the segment on the positive strand of the chromosome",
"type": "number",
"minimum": 0
},
"strand": {
"description": "This value is 'true' if the occurence contains the positive version of the segment, i.e. its read from 5' to 3' end, and is false otherwise",
"type": "boolean"
}
},
"required": ["chr", "pos", "strand"]
}
}
}
Links are the primary mechanism to connect segments. Links connect oriented segments. A link from A
to B
means that the end of A
overlaps with the start of B
. If either is marked with -
, we replace the sequence of the segment with its reverse complement, whereas a +
indicates the segment sequence is used as-is.
The length of the overlap is determined by the CIGAR
string of the link. When the overlap is 0M
the B
segment follows directly after A
. When the CIGAR
string is *
, the nature of the overlap is not specified.
... (explain how to interpret the overlap between segments, also for non-M
it is not symmetric).
Col | Field | Type | Regexp/Range | Brief description |
---|---|---|---|---|
2 | From |
String | [!-)+-<>-~][!-~]* |
name of segment |
3 | FromOrient |
String | `+ | -` |
4 | To |
String | [!-)+-<>-~][!-~]* |
name of segment |
5 | ToOrient |
String | `+ | -` |
6 | Overlap |
String | `* | ([0-9]+[MIDNSHPX=])+` |
Tag | Type | Description |
---|---|---|
MQ |
i |
Mapping quality |
NM |
i |
# mismatches/gaps |
RC |
i |
Read count |
FC |
i |
Fragment count |
KC |
i |
k-mer count |
(need motivation for this)
Col | Field | Type | Regexp/Range | Brief description |
---|---|---|---|---|
2 | From |
String | [!-)+-<>-~][!-~]* |
name of segment |
3 | FromOrient |
String | `+ | -` |
4 | To |
String | [!-)+-<>-~][!-~]* |
name of segment |
5 | ToOrient |
String | `+ | -` |
6 | pos |
int | [0-9]* |
0-based start of contained segment |
7 | Overlap |
String | `* | ([0-9]+[MIDNSHPX=])+` |
Tag | Type | description |
---|---|---|
RC |
i |
Read Coverage |
NM |
i |
Number of mismatches/gaps |
Col | Field | Type | Regexp/Range | Brief description |
---|---|---|---|---|
1 | RecordType |
Char | P |
Record type |
2 | PathName |
String | [!-)+-<>-~][!-~]* |
Path name |
3 | SegmentName |
String | [!-)+-<>-~][!-~]* |
A comma-separated list of segment names and orientations |
4 | CIGAR |
String | `* | ([0-9]+[MIDNSHPX=])+` |
A CIGAR
string may be *
, in which case the CIGAR
string is determined by fetching the CIGAR
string from the corresponding link record, or by performing a pairwise overlap alignment of the two sequences.
None specified.
H VN:Z:1.0
S 11 ACCTT
S 12 TCAAGG
S 13 CTTGATT
L 11 + 12 - 4M
L 12 - 13 + 5M
L 11 + 13 + 3M
P 14 11+,12-,13+ 4M,5M
The resulting path is:
11 ACCTT
12 CCTTGA
13 CTTGATT
14 ACCTTGATT
Almost three years ago, there was a lengthy discussion in the Assemblathon mailing list about a generic format for fragment assemmbly. The end product is the FASTG format. In the discussion, I have expressed several major concerns with the format. The top one is that it is mathematically wrong. Three years later, FASTG is still not widely used. At this point, Adam Phillippy and Pall Melsted openly called for a generic assembly format again. I also feel the pressing necessity of standardization, so decided to give a try myself. This is the Graphical Fragment Assembly format, or GFA in abbreviation.
In this post, I will start from the theoretical basis of assembly graph, describe the format and finally discuss the potential issues with the proposal.
I showed an earlier version of this format to Richard Durbin, Daniel Zerbino and Benedict Paten last night in Oxford. That version was a variant of FASTA. When I was formalizing the format in this post, I found FASTA is too crowded and too limited. Following the suggestion of Daniel, I finally adopted a format similar to ASQG and the PSMC output.
DNA sequence assembly is often (though not always) represented as a graph. There are multiple types of graphs including de Bruijn graph, overlap graph, unitig graph and string graph. They are all birected graph. Briefly, in this graph, each vertex is a sequence and each arc is an overlap. Because DNA sequences have two strands, an arc may have four directions, representing the four possible overlaps: forward-forward, forward-reverse, reverse-forward and reverse-reverse. It should be noted that a k-mer de Bruijn graph is equivalent to an overlap graph for k-mer reads with (k-1)-mer overlaps. It is a bidirected graph, too.
The critical problem with FASTG is that it puts sequneces on arcs/edges. It is
unable to describe a simple topology such as A->B; C->B; C->D
without adding a
dummy node, which breaks the theoretical elegance of assembly graphs. Due to the
historical confusion between vertices and edges, I will avoid using these
terminologies. I will use a segment for a piece of sequence and a link for a
connection between segments.
Although we can describe an assembly graph with bidirected arcs, I find in practice, it is easier and more explicit to describe links between the ends of segments. Gene Myers took a similar approach in his string graph paper. Based on this observation, I uniquely label the 5'-end and the 3'-end of each segment. The following shows an assembly graph with seven segments in GFA:
H VN:Z:1.0
S 1 2 CGATGCAA *
L 2 3 5M
S 3 4 TGCAAAGTAC *
L 3 6 0M
S 5 6 TGCAACGTATAGACTTGTCAC * RC:i:4
L 6 8 1M1D2M1S
S 7 8 GCATATA *
L 7 9 0M
S 9 10 CGATGATA *
S 11 12 ATGA *
C 9 11 2 4M
If we name a segment with the two ordered integers, the example above is
equivalent to a bidirected graph 1:2>->3:4; 5:6>->3:4; 5:6>-<7:8<->9:10
with
11:12
contained in 9:10
. The H
line is the header. An S
line describes a
segment which consists of 5'-end label, 3'-end label, sequence and
pseudo-quality. An L
line represents a link which consists of the labels of
the two ends and a CIGAR that describes the overlap alignment taking the first
end as the target/upper sequence. The CIGAR can describe symmetric overlaps
(e.g. 5M
), assembly gaps (e.g. 10N
), gapped overlaps, open-end alignments
(e.g. 1M1D2M1S
; heading S
for clipping on the second sequence and tailing
S
on the first), or unaligned overlaps (e.g. 5S10I8D2S
; no M
operators).
It is related to but different from the CIGAR used in SAM. A C
line represents
a containment, which is only relevant to read-to-read overlaps.
For all lines, additional information is described with tags in a format identical to SAM. Predefined tags include:
Line Tag Type Meaing
-----------------------------------------------------------------------
H VN Z Version number
H QT A Type of pseudo-quality. Valid values: `Q`, `D` or `K`
S RC i # reads assembled into the segment
L/C MQ i Mapping quality of the overlap/containment
L NM i # mismatches/gaps
S LN i Segment length
-
If this format cannot encode your assembly, please let me know. Thank you. Suggestions on making GFA work would be appreciated even more. :-)
-
It is unusual to uniquely label the two ends of a segment. ABySS, SGA and most other assemblers uniquely label a segment. In my view, end-labeling has a few advantages: a) it requires fewer operations for reverse-complementing and unambiguous merging; b) by representing a bidirected arc with
A+,B-
, we are still convertingA
to two labels; c) my own assembler only works with end-labeling. I think it should always be easy to convert the segment-labeling to the end-labeling but not vice versa. Unless there are strong arguments against end-labeling, I will keep it. -
Use a string to label an end. I like integers for efficiency, but don't object to strings in principle.
-
I don't like the CIGAR I proposed. It is too complex. If you can find a cleaner way to describe all kinds of overlaps and gaps, please let me know. These complex overlaps are not uncommon in a long-read assembly or for scaffolding.
-
In FASTG, we can encode a simple "bubble" with
ACGT[C,T]TAGT
. Although GFA can describe this assembly, it needs to add three more segments and four links, which are quite heavy. One option is to allow such simple bubbles on theS
line with a specific header tag indicating that the file contains small bubbles. Is it a good idea? How many assemblers can take advantage of this potential addition? -
If we can agree on a format, I can write a parser and a few basic tools such as flip, unambiguous merge and perhaps more complex operations such as tip trimming and bubble popping if I have time.
-
Any other suggestions?
Considering to replace end-labeling with the more common segment-labeling. The example above will look like (better or worse?):
H VN:Z:1.0
S 1 CGATGCAA *
L 1 + 2 + 5M
S 2 TGCAAAGTAC *
L 3 + 2 + 0M
S 3 TGCAACGTATAGACTTGTCAC * RC:i:4
L 3 + 4 - 1M1D2M1S
S 4 GCATATA *
L 4 - 5 + 0M
S 5 CGATGATA *
S 6 ATGA *
C 5 + 6 + 2 4M
I was out of the town in the past few days, so have not been able to focus on GFA. Now I am back to work to give the first update on the format based on the comments from many people, which I appreciate a lot.
In comparison to my initial proposal, the first and the major change is to name segments instead of the ends of segments. This seems the consensus so far. Secondly, I am thinking to move the quality field on the "S" line to an optional tag. Not many assemblers produce quality or per-base read depth. Thirdly, more people prefer to explictly encode bubbles as multiple segments, rather than inline them in the sequence. I will use explicit bubbles at least in the initial iteration.
Here is the graph from the previous post in the updated format:
H VN:Z:1.0
S 1 CGATGCAA
L 1 + 2 + 5M
S 2 TGCAAAGTAC
L 3 + 2 + 0M
S 3 TGCAACGTATAGACTTGTCAC RC:i:4
L 3 + 4 - 1M1D2M1S
S 4 GCATATA
L 4 - 5 + 0M
S 5 CGATGATA
S 6 ATGA
C 5 + 6 + 2 4M
A little bit more formally, GFA consists of four types of lines indicated by the first letter at each line. The format of each line is:
Line Fixed fields Comments
---------------------------------------------------------------
H N/A Header
S segName,segSeq Segment
L segName1,segOri1,segName2,segOri2,CIGAR Link
C segName1,segOri1,segName2,segOri2,pos,CIGAR Contained
Here is a list of predefined tags:
Line Tag Type Comments
-----------------------------------------------------------------------
H VN Z Version number
L/S RC i # reads that support the segment/link
L/S FC i # fragments that support the segment/link
L/S KC i # k-mer that support the segment/link
L/C MQ i Mapping quality of the overlap/containment
L/C NM i # mismatches/gaps
S LN i Segment length
Discussions and open issues:
-
How to describe complex overlaps with simple syntax. Currently, GFA uses a CIGAR, but I think it is bit overcomplicated.
-
Random access to GFA. I am not quite sure how this is useful in practice, but it is worth thinking.
-
Small bubbles. Although I said that a few others and I would prefer to encode bubbles as explicit segments in the initial iteration, I know a few would like a better representation.
-
Where to keep the read-to-contig alignment. My preference is to keep them in a separate BAM file.
-
Where to keep the segment sequences. My preference is to keep them in GFA. Nonetheless, we still allow to put a "*" at the sequence field. We can still describe the topology without the sequence data.
-
"Twin edges". A link can be represented in two directions. My preference is to allow both directions. The parser should throw a warning or an error if the two directions are inconsistent.
In the next step, I will write a standalone parser for GFA and clean up a few dirty corners meanwhile. I will also try to write a few converters for existing assembly formats by various assemblers and implement a few basic tools. If you have any suggestions, please let me know. After all, I am not so experienced in de novo assemblies as most of the readers.
Finally, I should emphasize that the format has not been fixed at all, far from it. Please keep the comments coming. The discussions so far are very helpful to me. Thank you!