Ete4 #750 #751

dengzq1234 · 2024-05-20T14:42:32Z

This PR addresses #750.

It includes the following features:

Convert the methods in GTDBTaxa() that involve meaningless numeric ids into internal methods, such as:
- get_lineage_translator() -> _get_lineage_translator()
- get_name_translator() -> _get_name_translator()
- translate_to_names() -> _translate_to_names()
Convert the input of get_rank() in GTDBTaxa() from numeric ids to string ids, for example:

print(gtdb.get_rank(['c__Thorarchaeia', 'RS_GCF_001477695.1']))
{'c__Thorarchaeia': 'class', 'RS_GCF_001477695.1': 'subspecies'}

Add the flag ignore_unclassified to both NCBITaxa() and GTDBTaxa(). When ignore_unclassified=True, annotate_tree() will ignore leaves whose annotation is empty.
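
As a rough illustration of what ignoring empty leaf annotations can mean, here is a toy sketch. It is not ete4 code: the helper common_lineage() is hypothetical, and it assumes that an internal node is annotated with the lineage shared by its leaves. The lineage names are only illustrative.

```python
def common_lineage(leaf_lineages, ignore_unclassified=False):
    """Return the lineage prefix shared by all the given leaves.

    Hypothetical helper (not part of ete4). A leaf without annotation
    is represented by an empty list or None.
    """
    if ignore_unclassified:
        # Drop the leaves that carry no lineage at all.
        leaf_lineages = [lin for lin in leaf_lineages if lin]
    if not leaf_lineages:
        return []
    common = []
    # Walk the lineages rank by rank and stop at the first disagreement.
    # An empty lineage makes zip() stop immediately.
    for names in zip(*[lin or [] for lin in leaf_lineages]):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return common


leaves = [
    ['d__Archaea', 'p__Asgardarchaeota', 'c__Thorarchaeia'],
    ['d__Archaea', 'p__Asgardarchaeota', 'c__Lokiarchaeia'],
    [],  # unclassified leaf
]
print(common_lineage(leaves))
print(common_lineage(leaves, ignore_unclassified=True))
```

In this sketch, a single unclassified leaf wipes out the shared lineage unless the flag is set.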

jordibc

Thanks @dengzq1234

Before I merge this PR, I'd like to understand the couple of things I mentioned before, and also we should understand why after this change the ncbiquery test fails (which seems to be from your change in line 135 of tests/test_ncbiquery.py, that I imagine you had a reason to do):

$ pytest tests/test_ncbiquery.py 
[...]
tests/test_ncbiquery.py ......F.
[...]
____________________________________________ Test_ncbiquery.test_merged_id _____________________________________________

self = <tests.test_ncbiquery.Test_ncbiquery testMethod=test_merged_id>

    def test_merged_id(self):
      ncbi = NCBITaxa(dbfile=DATABASE_PATH)
      t1 = ncbi.get_lineage(649756)
>     self.assertEqual(t1, [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756])
E     AssertionError: Lists differ: [1, 131567, 2, 1783272, 1239, 186801, 186802, 186803, 207244, 649756] != [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756]
E     
E     First differing element 6:
E     186802
E     3085636
E     
E     - [1, 131567, 2, 1783272, 1239, 186801, 186802, 186803, 207244, 649756]
E     ?                                       ^  ^^^
E     
E     + [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756]
E     ?                                       ^^ + ^^

test_ncbiquery.py:135: AssertionError

jordibc · 2024-05-25T04:26:41Z

ete4/gtdb_taxonomy/gtdbquery.py

-    def get_rank(self, taxids):
+    def _dirty_id_suffix(self, taxid):
+        pass
+


I don't see this function used. Why is it defined?

This is a function I want to write to handle GTDB accessions,
because I realized that ids are sometimes written in both forms, for example:
GCA_000003645.1 GB_GCA_000003645.1
GCF_900113245.1 RS_GCF_900113245.1
These are equivalent, so I want a function to handle the ambiguity. But I will keep it in my dev repo.
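
A minimal sketch of how the two forms could be treated as equivalent, assuming the only difference is a leading GB_ or RS_ in front of the GCA_/GCF_ accession. The helpers strip_gtdb_prefix() and same_genome() are hypothetical names, not part of ete4 or of this PR.

```python
import re

# Assumption: the GTDB form only adds "GB_" or "RS_" in front of an
# accession that starts with "GCA_" or "GCF_".
_GTDB_PREFIX = re.compile(r'^(GB|RS)_(?=GC[AF]_)')


def strip_gtdb_prefix(accession):
    """Return the accession without a leading GB_ or RS_ prefix."""
    return _GTDB_PREFIX.sub('', accession)


def same_genome(a, b):
    """Return True if both ids refer to the same accession."""
    return strip_gtdb_prefix(a) == strip_gtdb_prefix(b)


print(strip_gtdb_prefix('GB_GCA_000003645.1'))
print(same_genome('GCF_900113245.1', 'RS_GCF_900113245.1'))
```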

Great, thanks!

jordibc · 2024-05-25T04:33:43Z

ete4/gtdb_taxonomy/gtdbquery.py

+    def _get_rank(self, taxids):
         """Return dictionary converting taxids to their GTDB taxonomy rank."""
         ids = ','.join('"%s"' % v for v in set(taxids) - {None, ''})
         result = self.db.execute('SELECT taxid, rank FROM species WHERE taxid IN (%s)' % ids)
         return {tax: spname for tax, spname in result.fetchall()}
+
+    def get_rank(self, taxids):
+        taxid2rank = {}
+        name2ids = self._get_name_translator(taxids)
+        overlap_ids = name2ids.values()
+        taxids = [item for sublist in overlap_ids for item in sublist]
+        """Return dictionary converting taxids to their GTDB taxonomy rank."""
+        ids = ','.join('"%s"' % v for v in set(taxids) - {None, ''})
+        result = self.db.execute('SELECT taxid, rank FROM species WHERE taxid IN (%s)' % ids)
+        for tax, rank in result.fetchall():
+            taxid2rank[list(self._get_taxid_translator([tax]).values())[0]] = rank
+
+        return taxid2rank


I don't understand this change. Why do we have both get_rank() and _get_rank()? What's the difference between them? Also, get_rank() would need a docstring, and we should avoid having similarly named functions, since that makes it hard to guess which one is appropriate to use.
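
For illustration only, one possible shape for a single documented function that maps names directly to ranks. This is a sketch, not the code in the PR: it is a free function rather than a method, it runs against a toy in-memory table, and the spname column is an assumption about the layout of the species table.

```python
import sqlite3


def get_rank(db, names):
    """Return a dict mapping each known taxon name to its rank."""
    names = sorted(set(names) - {None, ''})
    if not names:
        return {}
    # One "?" placeholder per name, so that sqlite3 does the quoting.
    placeholders = ','.join('?' * len(names))
    query = ('SELECT spname, rank FROM species '
             'WHERE spname IN (%s)' % placeholders)
    return dict(db.execute(query, names).fetchall())


# Toy database standing in for the real one (assumed schema).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE species (taxid TEXT, spname TEXT, rank TEXT)')
db.executemany('INSERT INTO species VALUES (?, ?, ?)', [
    ('1', 'c__Thorarchaeia', 'class'),
    ('2', 'RS_GCF_001477695.1', 'subspecies'),
])

print(get_rank(db, ['c__Thorarchaeia', 'RS_GCF_001477695.1']))
```

Names that are not in the table are simply absent from the returned dictionary.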

dengzq1234 · 2024-05-27T07:15:00Z

I changed line 135 of tests/test_ncbiquery.py because I updated my NCBI taxonomy database. I wonder if that would affect test_ncbiquery.taxa.sqlite? With my latest update of the NCBI taxonomy, this unit test does not pass unless I change it.
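
The mismatch can be pinpointed by comparing the two lineages reported by pytest. The lists below are copied from that output; the variable names are only for illustration.

```python
# Lineage returned by ncbi.get_lineage(649756) in the failing run.
returned = [1, 131567, 2, 1783272, 1239, 186801, 186802, 186803, 207244, 649756]
# Lineage expected by the modified test.
expected = [1, 131567, 2, 1783272, 1239, 186801, 3085636, 186803, 207244, 649756]

# Collect (position, returned taxid, expected taxid) where they differ.
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(returned, expected)) if a != b]
print(diffs)  # [(6, 186802, 3085636)]
```

Only the taxid at position 6 differs, which is consistent with a change between taxonomy database versions rather than a change in the code.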

dengzq1234 · 2024-05-30T20:21:08Z

@jordibc Already updated with the new NCBI unit test.

jordibc

Looks good, thanks @dengzq1234 !

jordibc · 2024-05-31T18:17:32Z

Great, thanks!

dengzq1234 added 4 commits May 20, 2024 14:30

add ignore unclassified in ncbi taxa

3cad7e3

update ncbi test

142cd33

allow ignore_unclassified in gtdb taxa module

1d0122d

update db

a9f4704

dengzq1234 requested a review from jordibc May 20, 2024 14:46

make get_lineage internal

9a04ce4

jordibc requested changes May 25, 2024

dengzq1234 added 3 commits May 27, 2024 11:15

rename _get_rank to _get_id2rank

8ccac7c

Merge branch 'ete4' into ete4_#750

7739f27

update sytanx for new test

c1533ac

jordibc approved these changes May 31, 2024

jordibc merged commit ff6767d into ete4 May 31, 2024
