You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The SAM file pandora is producing allows us to understand exactly how it is mapping regions of reads to PRGs. However, we did not yet evaluate this mapping in details in a coarse-grained way. We can do it with a simulated dataset. This is also related with @Danderson123 project.
1. Main input: PanRG
The main input for our simulation is a PanRG, not a genome. The idea is that from a PanRG we can simulate a genome with possibly unknown regions between the PRGs so that we have a perfect annotation, and know exactly which regions of reads map to which PRG.
2. Simulation algorithm
This is the main algorithm to build a simulated genome G, simulated annotation A, and simulated reads R:
For each PRG P in the PanRG, choose a random path and transform it to a sequence, which will be the allele of P in G (this can be easily done with the pandora random subcommand);
Build a copy number profile for the PanRG. This is basically assigning an integer to each PRG meaning how many copies of that gene is present in G. An integer 0 means the gene is absent from G, 1 means it is present just once, >1 means there are several copies of the gene. The profile is derived from 6 probabilities: CN0, CN1, ..., CN5 which denotes the probability of a gene having CNn copies. These probabilities must sum to 1. For simplicity, in the first iteration, we will just work with CN0 and CN1;
(Step skipped for now): randomly mutates gene copies WRT a mutation profile;
Given which genes are in G and how many times it appears in G, create an order of these genes with strands assigned randomly. This will give us a gene ordering of G, e.g.: gene4+ -> gene2- -> gene1+ -> gene3+ -> gene3-.
Generate genome G using the gene ordering obtained in step 4 by translating the order and the strands to sequences and concatenating them;
Generate an annotation A (gff3) of G using the gene ordering obtained in step 4;
(Optional step): Generate random intergenic regions between two genes in G. This is parameterised by PIR (Probability of Intergenic Region - how probable is two genes have an intergenic region), MinIR (min length of the intergenic region) and MaxIR (max length of the intergenic region). In every junction between two genes, we check if we should add a random intergenic region between them. If we do so, we have to shift the annotation A to the right of the genes that come after this added intergenic region. Bacteria rarely have intergenic regions, but this can also simulate an incomplete annotation;
We simulate long reads from G, with an error rate and coverage profile;
3. Output and evaluation
By the end of all these steps, we will have a set of long reads R from G, where we know exactly where each read come from in G. Using the annotation, we know exactly which region of the read come from each gene. We can then run pandora mapping R to the PanRG, and look at the SAM and see if we were able to map each region of each read to their gene of origin.
4. Possible variations
There are many variations of this pipeline that we can study by changing simulation parameters. We can for example see how pandora mapping accuracy change when we add several genes with copy number 2, 3, 4, etc. We can see the effect of increasing/decreasing mutation rate between copies. We can add frequent and large intergenic regions to simulate a badly annotated genome, etc.
5. Conclusion
I think this is a good way to build a simulated dataset to improve pandora for our current analyses. Note that this simulation and evaluation framework does not care about variations between the genome and the simulated reads, nor try to evaluate if pandora genotypes these variations correctly. This was already extensively evaluated in the pandora paper. This framework concerns with the larger picture. It is all about mapping regions of reads to the correct PRGs, and this is what the current @Danderson123 and @LeahRoberts project is mostly about.
6. Feedback
Could I have any feedback on this from you @iqbal-lab@rmcolq@mbhall88@Danderson123@LeahRoberts ? @iqbal-lab and @Danderson123 IDK how this simulation and evaluation framework relates with the current simulations you are working on for @Danderson123 project. There might be several overlaps between this and your simulations. However, the objectives are different I think. Here I just want to debug if pandora is able to map reads to the correct genes, and how copy number, mutation and error rate affects this mapping. I will not evaluate gene ordering here.
I will wait for your approval or rejection of this development direction. I have to say, however, that trying to evaluate this directly on real data is way too complicated for a long list of reasons...
The text was updated successfully, but these errors were encountered:
@Danderson123 is building a prototype gene order de bruijn graph based on the Pandora Sam. @martinghunt has given him test cases and simulated perfect reads. His goal is
A. to confirm his graph is correctly constructed given perfect input
B. to understand how his graph properties vary with read length distribution, given the length distribution of genes.
To do this, he might as well have mocked perfect Pandora Sam files.
Leandro wants to know if our mapping is working, partly for Dan, but also for basic Pandora. For that, the proposal above makes sense to me. But I'm not going to think hard til Monday
The SAM file
pandora
is producing allows us to understand exactly how it is mapping regions of reads to PRGs. However, we did not yet evaluate this mapping in details in a coarse-grained way. We can do it with a simulated dataset. This is also related with @Danderson123 project.1. Main input: PanRG
The main input for our simulation is a PanRG, not a genome. The idea is that from a PanRG we can simulate a genome with possibly unknown regions between the PRGs so that we have a perfect annotation, and know exactly which regions of reads map to which PRG.
2. Simulation algorithm
This is the main algorithm to build a simulated genome
G
, simulated annotationA
, and simulated readsR
:P
in the PanRG, choose a random path and transform it to a sequence, which will be the allele ofP
inG
(this can be easily done with thepandora random
subcommand);G
. An integer0
means the gene is absent fromG
,1
means it is present just once,>1
means there are several copies of the gene. The profile is derived from 6 probabilities:CN0
,CN1
, ...,CN5
which denotes the probability of a gene havingCNn
copies. These probabilities must sum to 1. For simplicity, in the first iteration, we will just work withCN0
andCN1
;G
and how many times it appears inG
, create an order of these genes with strands assigned randomly. This will give us a gene ordering ofG
, e.g.:gene4+ -> gene2- -> gene1+ -> gene3+ -> gene3-
.G
using the gene ordering obtained in step 4 by translating the order and the strands to sequences and concatenating them;A
(gff3) ofG
using the gene ordering obtained in step 4;G
. This is parameterised byPIR
(Probability of Intergenic Region - how probable is two genes have an intergenic region),MinIR
(min length of the intergenic region) andMaxIR
(max length of the intergenic region). In every junction between two genes, we check if we should add a random intergenic region between them. If we do so, we have to shift the annotationA
to the right of the genes that come after this added intergenic region. Bacteria rarely have intergenic regions, but this can also simulate an incomplete annotation;G
, with an error rate and coverage profile;3. Output and evaluation
By the end of all these steps, we will have a set of long reads
R
fromG
, where we know exactly where each read come from inG
. Using the annotation, we know exactly which region of the read come from each gene. We can then runpandora
mappingR
to the PanRG, and look at the SAM and see if we were able to map each region of each read to their gene of origin.4. Possible variations
There are many variations of this pipeline that we can study by changing simulation parameters. We can for example see how
pandora
mapping accuracy change when we add several genes with copy number 2, 3, 4, etc. We can see the effect of increasing/decreasing mutation rate between copies. We can add frequent and large intergenic regions to simulate a badly annotated genome, etc.5. Conclusion
I think this is a good way to build a simulated dataset to improve
pandora
for our current analyses. Note that this simulation and evaluation framework does not care about variations between the genome and the simulated reads, nor try to evaluate ifpandora
genotypes these variations correctly. This was already extensively evaluated in thepandora
paper. This framework concerns with the larger picture. It is all about mapping regions of reads to the correct PRGs, and this is what the current @Danderson123 and @LeahRoberts project is mostly about.6. Feedback
Could I have any feedback on this from you @iqbal-lab @rmcolq @mbhall88 @Danderson123 @LeahRoberts ? @iqbal-lab and @Danderson123 IDK how this simulation and evaluation framework relates with the current simulations you are working on for @Danderson123 project. There might be several overlaps between this and your simulations. However, the objectives are different I think. Here I just want to debug if
pandora
is able to map reads to the correct genes, and how copy number, mutation and error rate affects this mapping. I will not evaluate gene ordering here.I will wait for your approval or rejection of this development direction. I have to say, however, that trying to evaluate this directly on real data is way too complicated for a long list of reasons...
The text was updated successfully, but these errors were encountered: