Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgress database query question #634

Open
genomicsguy opened this issue Aug 8, 2024 · 2 comments
Open

Postgress database query question #634

genomicsguy opened this issue Aug 8, 2024 · 2 comments

Comments

@genomicsguy
Copy link

Could you provide an example query for how you create the GFF or GTF download for the human annotations.

Please could you also confirm that https://rnacentral.org/static/img/rnacentral_latest_schema.png is the latest schema. For example, the schema shows "RnaPrecomputed" as the main table, yet the table available through the postgress interface is called "rnc_rna_precomputed".

Thank you!

@genomicsguy
Copy link
Author

Here another example. You have two tables "rnc_sequence_regions_active_mapped" and "rnc_sequence_regions_active_provided" within the rnacen schema. I am assuming one or both of these is related to "SequenceRegionActive" in the online schema, but it is not clear which, and the tables contain a column not listed online, "providing_databases."

@blakesweeney
Copy link
Member

Hi, the query we use is:

SELECT
  json_build_object(
      'assembly_id', :'assembly_id',
      'region_id', max(regions.region_name),
      'rna_id', max(pre.id),
      'description', max(pre.short_description),
      'rna_type',  max(pre.rna_type),
      'databases', regexp_split_to_array(max(pre."databases"), ','),
      'providing_databases', array_agg(ac.database),
      'chromosome', max(regions.chromosome),
      'strand', max(regions.strand),
      'identity', max(regions.identity),
      'was_mapped', bool_or(regions.was_mapped),
      'exons', array_agg(distinct exons.*)
  )
FROM rnc_rna_precomputed pre
JOIN rnc_sequence_regions_active regions
ON
  regions.urs_taxid = pre.id
JOIN rnc_sequence_exons exons
ON
  exons.region_id = regions.id
join rnc_accession_sequence_region sra
	on sra.region_id = regions.id
join rnc_accessions ac
	on sra.accession = ac.accession
WHERE
  pre.is_active = true
  AND regions.assembly_id = :'assembly_id'
GROUP BY regions.id
ORDER BY max(regions.chromosome), max(regions.region_start), regions.id

The result of this query is turned into GFF/BED as needed. In the database we have a table of provided coordinates, and of mapped coordinates. We combine these to produce the final coordinates. Note that the table name in the schema diagram follows python naming conventions, while the table names do not. We probably should update the diagram to indicate table names, but they are generally a simple modification of the table names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants