Releases: vanheeringen-lab/genomepy

[0.16.1] - 2023-06-14

14 Jun 13:41

Fixed

  • fix for NCBI's assembly report header "asm_submitter" instead of "submitter"
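
A renamed header like this can be tolerated by accepting either spelling when looking up the column. The snippet below is only a minimal sketch of that idea, not genomepy's actual fix: `find_column`, `SUBMITTER_NAMES` and the example headers are hypothetical names made up for illustration.

```python
def find_column(header, candidates):
    """Return the index of the first candidate name present in the header.

    Hypothetical helper: leading '#' and surrounding whitespace are ignored.
    """
    normalised = [name.strip().lstrip("#").strip().lower() for name in header]
    for candidate in candidates:
        if candidate in normalised:
            return normalised.index(candidate)
    raise KeyError(f"none of {candidates} found in header")


# Illustrative headers only; real NCBI files contain many more columns.
old_header = ["#assembly_accession", "organism_name", "submitter"]
new_header = ["#assembly_accession", "organism_name", "asm_submitter"]

# Try the new name first, then fall back to the old one.
SUBMITTER_NAMES = ("asm_submitter", "submitter")

old_index = find_column(old_header, SUBMITTER_NAMES)
new_index = find_column(new_header, SUBMITTER_NAMES)
```

Both lookups resolve to the same column, so code downstream does not need to know which header variant it received.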

[0.16.0] - 2023-05-31

31 May 11:41

Added

  • genomepy search now accepts the --exact flag
  • genomepy.Annotation.attributes() returns a list of all attributes from the GTF attributes column.
    • e.g. gene_name, gene_version
    • nice to use with genomepy.Annotation.from_attributes() or genomepy.Annotation.gtf_dict()
  • When installing assemblies from older Ensembl release versions, a clearer error message is given if the assembly cannot be found:
    • if the release does not exist, options will be given
    • if the assembly does not exist on the release version, all available options are given
    • if the URL to the genome or annotation files is incorrect, the error message stays the same
  • new config option: ucsc_mirror, options: eu or us.
    • the mirror should only affect download speed
    • can be nice if the other mirror is down!
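
To illustrate what "all attributes from the GTF attributes column" means in the genomepy.Annotation.attributes() entry above, here is a minimal standalone sketch that collects attribute names from GTF lines. It does not use genomepy; `attribute_names` and the example lines are hypothetical, and the parsing assumes attribute values contain no semicolons.

```python
def attribute_names(gtf_lines):
    """Collect the unique attribute keys found in column 9 of GTF lines.

    Hypothetical helper for illustration. Keys are returned in order of
    first appearance. Assumes values do not contain ';'.
    """
    names = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        for field in columns[8].split(";"):
            field = field.strip()
            if not field:
                continue
            key = field.split(None, 1)[0]
            if key not in names:
                names.append(key)
    return names


example_lines = [
    "#!genome-build example\n",
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "abc"; gene_version "1";\n',
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]

found = attribute_names(example_lines)
```

For the example lines this yields the four keys gene_id, gene_name, gene_version and transcript_id.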

Changed

  • function get_division is now a class method of EnsemblProvider
  • EnsemblProvider class methods get_division and get_version now require an assembly name.
  • UCSC data is now downloaded over HTTPS instead of HTTP

Fixed

  • genomepy.install() now returns a Genome instance with updated annotation attributes.
  • now ignoring ~1600 assemblies from the Ensembl database with incorrect metadata
    • no easy way to retrieve this data

[0.15.0] - 2023-02-28

28 Feb 12:48

Added

  • you can now tune the cache expiration time in the config
    • create a config with genomepy config generate, then tweak the values as desired.
  • support for biopython >=1.80 with pyfaidx update
  • raise an informative error when UCSC tools are missing
    • this should only happen in Pip installations

Fixed

  • disabling already disabled plugins no longer throws an error
  • bgzipping fixes:
    • bgzip works again with python>3.7 (openssl shenanigans. tabix was deprecated for htslib)
    • genome index works with genome install --bgzip (a second index is created with the correct naming format)
    • export file works with genome install --bgzip
    • genomepy.install_genome(bgzip=True) returns a Genome class instance with correct paths

[0.14.0] - 2022-08-01

01 Aug 13:37

Added

  • now using filelock for improved thread safety
  • now checking if every API/FTP/HTTP(S) is accessible before proceeding
  • genomepy search improvements:
    • text search now accepts regex, and multiple substrings (space separated) are unordered.
    • taxonomy search now returns all hits that start with the given number.
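
The two search behaviours described above can be sketched in a few lines. This is an illustration of the matching rules only, not genomepy's search code: `text_match` and `taxonomy_match` are hypothetical names, and case-insensitive matching is an assumption.

```python
import re


def text_match(query, description):
    """True if every space-separated term occurs in the description.

    Each term is treated as a regular expression, and the terms may
    occur in any order. Case-insensitivity is an assumption.
    """
    return all(
        re.search(term, description, flags=re.IGNORECASE)
        for term in query.split()
    )


def taxonomy_match(query, tax_id):
    """True if the taxonomy ID starts with the digits in the query."""
    return str(tax_id).startswith(str(query))


description = "Homo sapiens GRCh38"
unordered_hit = text_match("sapiens homo", description)
regex_hit = text_match("GRCh3[78]", description)
partial_miss = text_match("mouse homo", description)
prefix_hit = taxonomy_match(96, 9606)
prefix_miss = taxonomy_match(96, 10090)
```

Requiring every term to match, rather than any term, keeps multi-word queries narrowing the result set instead of widening it.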

Changed

  • switched to pyproject.toml + hatchling for packaging

Fixed

  • updated the README and CLI documentation to mention the Local provider

[0.13.1] - 2022-06-21

21 Jun 15:35

Changed

  • removed unused keys from Ensembl and UCSC databases to reduce their size

Fixed

  • added a retry for initializing the diskcache (seq2science/issues/887)
  • can now find ensembl urls for genomes not using url_names properly (#205)

[0.13.0] - 2022-06-02

02 Jun 15:00

Added

  • genomepy search and genomepy genomes can now return the (unfiltered) absolute genome size with argument --size

Changed

  • changed caching backend to diskcache (thread safe)
  • reduced the local cache size of NCBI (by about half)
    • by only storing assembly summary columns actually used by genomepy

[0.12.0] - 2022-03-28

28 Mar 15:38

Added

  • genomepy.Annotation.lengths() to retrieve the gene/transcript lengths.
  • genomepy.Annotation.from_attributes() can extract any sub-column from that pesky attributes column

Changed

  • updated Boyle-lab blacklists
  • genomepy.Annotation.genes() default changed from bed (commonly containing transcript names) to gtf (gene names)

Fixed

  • blacklists now work with GENCODE
  • query_mygene no longer filters input.
  • genomepy install with local provider now understands you want the annotation if you pass a path to an annotation

[0.11.1] - 2022-01-06

06 Jan 12:05

Added

  • quiet flag for genomepy.Annotation
  • genomepy -v flag

Changed

  • genomepy.Annotation raises a FileNotFoundError instead of a ValueError where appropriate.
  • download_assembly_report refactored. Now downloads the report for the exact same assembly accession (and not the nearest NCBI assembly).
  • broader unit tests for UCSC assembly accession scraping

Fixed

  • inconsistent behaviour with assembly reports (#193 + #194)

[0.11.0] - 2021-11-18

18 Nov 10:17

Added

  • extended docstrings
  • GENCODE support (GENCODE gene annotations with UCSC genomes)
    • only contains the main chromosomes, no scaffolds or alternate haplotypes.
    • only contains 4 assemblies (2 mouse, 2 human)
    • excellent annotations for these regions & species though!
  • Ensembl's GRCh37 can now be downloaded through genomepy
  • Local fasta/gtf/gff(3)/bed file support
    • you can install a local genome and/or annotation by providing local path(s) to genomepy install
      • if annotation downloading is requested, but no annotation path is provided,
        a gtf/gff(3) annotation will be sought in the genome's source directory.
  • Annotation.gtf_dict creates a dictionary for any key-value pair in the GTF columns or attribute fields!
    • e.g. Annotation.gtf_dict("seqname", "gene_name")
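
The idea behind Annotation.gtf_dict, mapping any GTF column or attribute field to any other, can be sketched without genomepy as follows. `parse_record` and `build_gtf_dict` are hypothetical names; keeping only the first value seen per key is an assumption of this sketch and may differ from how genomepy handles repeated keys.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]


def parse_record(line):
    """Turn one GTF line into a flat dict of columns plus attribute fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, columns[:8]))
    for field in columns[8].split(";"):
        parts = field.strip().split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip('"')
    return record


def build_gtf_dict(gtf_lines, key, value):
    """Map `key` to `value` for every record that contains both.

    Assumption: the first value seen for a key wins.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or line.count("\t") < 8:
            continue  # skip comments and incomplete records
        record = parse_record(line)
        if key in record and value in record:
            mapping.setdefault(record[key], record[value])
    return mapping


example_lines = [
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "abc";',
    'chr2\tsrc\tgene\t5\t50\t.\t-\t.\tgene_id "G2"; gene_name "xyz";',
]

by_contig = build_gtf_dict(example_lines, "seqname", "gene_name")
by_gene_id = build_gtf_dict(example_lines, "gene_id", "gene_name")
```

Flattening columns and attributes into one record is what lets the same lookup serve both kinds of key.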

Changed

  • Genome.track2fasta can now ignore comment lines (starting with #)
  • Genome.track2fasta will skip header lines (a warning will be printed)
  • Genome.track2fasta will ignore regions that cannot be parsed (a warning will be printed)
    • these fixes should improve gimme scan performance and feedback
  • UCSC annotation conversion tool settings tweaked. Better results with source gff files.
  • Ensembl now uses HTTP instead of FTP (in some cases). This improves stability on some servers.
  • tweaked search result alignment for clarity
  • explained UCSC annotations in the README
  • better file path handling (relative paths, user home and variables are expanded)
  • Annotation now accepts a file/directory/genomepy name as first argument.
    • this merges 2 arguments into one.
  • Annotation.map_genes now works without a README file
    • you can now set Annotation.tax_id manually.

Fixed

  • Ensembl annotations from previous releases can now be downloaded as intended.
  • Genome.track2fasta will skip regions that clearly don't make sense (start>end or start<0)
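
The Genome.track2fasta entries in this release (ignoring comment lines, skipping header lines and unparseable regions, and skipping regions with start>end or start<0) amount to a set of filtering rules. Below is a minimal sketch of such rules for tab-separated BED-like input. `parse_bed_regions` is a hypothetical function, not genomepy's implementation, and it collects skipped lines in a list where genomepy prints warnings.

```python
def parse_bed_regions(lines):
    """Split BED-like lines into usable regions and skipped lines.

    Hypothetical sketch: comment lines are dropped silently; header
    lines, unparseable lines and impossible regions are reported.
    """
    regions, skipped = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment or empty line
        fields = stripped.split("\t")
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            skipped.append(stripped)  # header or malformed line
            continue
        if start < 0 or start > end:
            skipped.append(stripped)  # coordinates that make no sense
            continue
        regions.append((chrom, start, end))
    return regions, skipped


example = [
    "# comment",
    "chrom\tstart\tend",
    "chr1\t10\t20",
    "chr1\t30\t5",
    "chr2\t-4\t8",
    "chr3\tabc\t9",
    "chr4\t0\t15",
]

regions, skipped = parse_bed_regions(example)
```

Only the chr1:10-20 and chr4:0-15 regions survive here; the header line and the three invalid records end up in `skipped`.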

[0.10.0] - 2021-07-30

30 Jul 13:41

Added

  • Annotation class, containing
    • regex filter (genomepy.Annotation.filter_regex())
    • sanitize functions (genomepy.Annotation.sanitize())
      • option to skip filtering and/or matching the annotation to the genome (also on CLI)
    • gene name remapping to various formats (genomepy.Annotation.map_genes())
      • using MyGene.info. Can be queried separately (genomepy.annotation.query_mygene())
    • contig name remapping to other provider formats (genomepy.Annotation.map_locations())
    • get the annotations, or gene locations, as dataframes (genomepy.Annotation.gtf, bed or gene_coords() respectively)
    • get the gene names as a list (genomepy.Annotation.genes("gtf") or genomepy.Annotation.genes("bed"))
  • genomepy install now attempts to install the NCBI assembly report
  • NCBI provider also indexes the NCBI genbank_historical summary
  • genomepy search now shows if the genome has an annotation
    • this slows down the results a bit
    • to compensate, results are now shown as soon as they are found
    • for UCSC, availability of any of the 4 annotations is shown
  • genomepy annotation shows the first line(s) of each gene annotation.gtf
  • for developers:
    • pre-commit-hooks for linting
    • formatting/linting script tests/format.sh (optional argument lint)
    • isort & autoflake formatters

Changed

  • provider module split per provider
  • ProviderBase overhauled, now called Provider
  • regex filtering separated from Provider.download_genome
  • utils module split into utils, files and online
  • now using loguru for pretty logging
  • accession search improved
    • now finds GCA and GCF accessions
    • now ignores patch levels
  • genomepy install automatic provider selection refactored
    • Provider.online_providers returns a generator (faster!)
  • genomepy install uses a combined filter function (faster!)
  • genomepy install only zips annotation files if the genome is zipped (with the bgzip flag) (faster!)
  • NCBI provider should be parsed faster (faster!)
  • new dependency: pandas
  • tests no longer format code
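
For the accession search change listed above, one way to match accessions regardless of their trailing number is to compare them with that suffix removed. The sketch below assumes that "patch level" refers to the version suffix after the dot (e.g. the ".15" in GCA_000001405.15), which these notes do not spell out. `split_accession` and `same_accession` are hypothetical names.

```python
import re

# Accepts both GenBank (GCA) and RefSeq (GCF) assembly accessions,
# with an optional ".<version>" suffix.
ACCESSION = re.compile(r"^(GC[AF]_\d+)(?:\.(\d+))?$")


def split_accession(accession):
    """Return (base, version) for an accession; version may be None."""
    match = ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    return match.group(1), match.group(2)


def same_accession(query, candidate):
    """True if both accessions share the same base, ignoring the suffix."""
    return split_accession(query)[0] == split_accession(candidate)[0]


versions_ignored = same_accession("GCA_000001405.15", "GCA_000001405.29")
without_suffix = same_accession("GCF_000001405", "GCF_000001405.40")
different_prefix = same_accession("GCA_000001405.15", "GCF_000001405.15")
```

In this sketch a GCA and a GCF accession are never treated as equal, even when their numbers match; whether genomepy links the two is not stated in these notes.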

Fixed

  • broken URLs should now keep genomepy occupied for a shorter time (check_url will immediately return on "Not Found" errors 404/450) (faster!)
  • the Genome class now passes arguments to the parent Fasta class
  • the Genome class now regenerates the sizes and gaps files when the genome file is newer than them, similarly to how the Fasta class handles its index (faster!)
  • somewhat more pythonic tests