Ete4 #750 #751

Merged
merged 8 commits into from
May 31, 2024

dengzq1234
Contributor

This PR addresses #750.

It includes the following features:

  • Convert the methods in GTDBTaxa() that involve meaningless numeric ids into internal methods, such as:
    • get_lineage_translator() -> _get_lineage_translator()
    • get_name_translator() -> _get_name_translator()
    • translate_to_names() -> _translate_to_names()
  • Change the input of get_rank() in GTDBTaxa() from numeric ids to string ids, for example:
print(gtdb.get_rank(['c__Thorarchaeia', 'RS_GCF_001477695.1']))
{'c__Thorarchaeia': 'class', 'RS_GCF_001477695.1': 'subspecies'}
  • Add an ignore_unclassified flag to both NCBITaxa() and GTDBTaxa(). When ignore_unclassified=True, annotate_tree() ignores empty annotations of leaves.

@dengzq1234 dengzq1234 requested a review from jordibc May 20, 2024 14:46
Contributor

@jordibc jordibc left a comment

Thanks @dengzq1234

Before I merge this PR, I'd like to understand the couple of things I mentioned before, and also we should understand why after this change the ncbiquery test fails (which seems to be from your change in line 135 of tests/test_ncbiquery.py, that I imagine you had a reason to do):

$ pytest tests/test_ncbiquery.py 
[...]
tests/test_ncbiquery.py ......F.
[...]
____________________________________________ Test_ncbiquery.test_merged_id _____________________________________________

self = <tests.test_ncbiquery.Test_ncbiquery testMethod=test_merged_id>

    def test_merged_id(self):
      ncbi = NCBITaxa(dbfile=DATABASE_PATH)
      t1 = ncbi.get_lineage(649756)
>     self.assertEqual(t1, [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756])
E     AssertionError: Lists differ: [1, 131567, 2, 1783272, 1239, 186801, 186802, 186803, 207244, 649756] != [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756]
E     
E     First differing element 6:
E     186802
E     3085636
E     
E     - [1, 131567, 2, 1783272, 1239, 186801, 186802, 186803, 207244, 649756]
E     ?                                       ^  ^^^
E     
E     + [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756]
E     ?                                       ^^ + ^^

test_ncbiquery.py:135: AssertionError

def get_rank(self, taxids):
def _dirty_id_suffix(self, taxid):
    pass

Contributor

I don't see this function used. Why is it defined?

Contributor Author

This is a function I want to use to handle GTDB accessions, because I realized that sometimes the ids are written in both forms, for example:
GCA_000003645.1 GB_GCA_000003645.1
GCF_900113245.1 RS_GCF_900113245.1
They are equivalent, so I want a function that handles this ambiguity. But I will keep it in my dev repo.
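
For illustration only, the equivalence described above could be handled by stripping the source prefix before comparing accessions. This is a minimal sketch, not the author's `_dirty_id_suffix()` (whose body is just `pass` in this PR) and not part of ete4. The names `normalize_gtdb_accession` and `same_genome` are hypothetical, and the sketch assumes the only prefixes to strip are `GB_` and `RS_` in front of `GCA_`/`GCF_` accessions, as in the two examples given.

```python
import re

# Hypothetical helper, not ete4 API. Assumption: only the "GB_" and "RS_"
# prefixes appear, and only in front of GCA_/GCF_ accessions.
_GTDB_PREFIX = re.compile(r'^(?:GB|RS)_(?=GC[AF]_)')


def normalize_gtdb_accession(accession):
    """Return the accession without a leading GB_ or RS_ prefix."""
    return _GTDB_PREFIX.sub('', accession.strip())


def same_genome(a, b):
    """Return True if two accessions differ at most by their prefix."""
    return normalize_gtdb_accession(a) == normalize_gtdb_accession(b)


# The two pairs quoted in the comment above compare as equivalent.
pairs = [('GCA_000003645.1', 'GB_GCA_000003645.1'),
         ('GCF_900113245.1', 'RS_GCF_900113245.1')]
all_equivalent = all(same_genome(a, b) for a, b in pairs)
```

The lookahead keeps the `GCA_`/`GCF_` part intact, so an id that merely starts with `GB_` or `RS_` but is not followed by an accession is left unchanged.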

Contributor

Great, thanks!

Comment on lines 152 to 169
def _get_rank(self, taxids):
    """Return dictionary converting taxids to their GTDB taxonomy rank."""
    ids = ','.join('"%s"' % v for v in set(taxids) - {None, ''})
    result = self.db.execute('SELECT taxid, rank FROM species WHERE taxid IN (%s)' % ids)
    return {tax: spname for tax, spname in result.fetchall()}

def get_rank(self, taxids):
    taxid2rank = {}
    name2ids = self._get_name_translator(taxids)
    overlap_ids = name2ids.values()
    taxids = [item for sublist in overlap_ids for item in sublist]
    """Return dictionary converting taxids to their GTDB taxonomy rank."""
    ids = ','.join('"%s"' % v for v in set(taxids) - {None, ''})
    result = self.db.execute('SELECT taxid, rank FROM species WHERE taxid IN (%s)' % ids)
    for tax, rank in result.fetchall():
        taxid2rank[list(self._get_taxid_translator([tax]).values())[0]] = rank

    return taxid2rank
Contributor

I don't understand this change. Why do we have both get_rank() and _get_rank()? What's the difference between them? Also, get_rank() would need a docstring, and we should avoid having similarly named functions, since that makes it hard to guess which one is appropriate to use.
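
As a rough illustration of the point about a single, documented entry point, here is a self-contained sketch of one name-based lookup with a docstring. It is not the code that was merged. The free function `get_rank(db, names)`, the in-memory toy database, and the `spname` column are assumptions made for the example; only the `species` table and its `taxid` and `rank` columns appear in the snippet under review. It also assumes each name maps to a single row, which the real name translator does not guarantee.

```python
import sqlite3


def get_rank(db, names):
    """Return a dict mapping each known taxon name to its rank.

    Empty or None entries are dropped, and names that are not in the
    database are simply absent from the result.
    """
    wanted = sorted(set(names) - {None, ''})
    if not wanted:
        return {}
    # One "?" placeholder per name, so sqlite3 handles the quoting.
    placeholders = ','.join('?' * len(wanted))
    query = 'SELECT spname, rank FROM species WHERE spname IN (%s)' % placeholders
    return dict(db.execute(query, wanted).fetchall())


# Toy database standing in for the real GTDB taxonomy file (assumed schema).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE species (taxid INT, spname TEXT, rank TEXT)')
db.executemany('INSERT INTO species VALUES (?, ?, ?)',
               [(1, 'c__Thorarchaeia', 'class'),
                (2, 'RS_GCF_001477695.1', 'subspecies')])

ranks = get_rank(db, ['c__Thorarchaeia', 'RS_GCF_001477695.1', ''])
```

Using placeholders instead of interpolating quoted strings into the SQL text avoids problems with names that contain quote characters.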

@dengzq1234
Contributor Author

> Before I merge this PR, [...] we should understand why after this change the ncbiquery test fails (which seems to be from your change in line 135 of tests/test_ncbiquery.py, that I imagine you had a reason to do)

I changed it because I updated my NCBI taxonomy database. I wonder whether that would affect test_ncbiquery.taxa.sqlite? With my latest NCBI taxonomy update, this unit test does not pass unless I change it.

@dengzq1234
Contributor Author

@jordibc Already updated with the new NCBI unit test.

Contributor

@jordibc jordibc left a comment

Looks good, thanks @dengzq1234 !

@jordibc jordibc merged commit ff6767d into ete4 May 31, 2024