Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPIKE: In place enrichment #741

Draft
wants to merge 29 commits into
base: main
Choose a base branch
from
Draft

SPIKE: In place enrichment #741

wants to merge 29 commits into from

Conversation

jbothma
Copy link
Contributor

@jbothma jbothma commented Apr 5, 2024

towards #691

@pudo
Copy link
Member

pudo commented Apr 6, 2024

One thing that I'm not sure about: is the memory being eaten up by the index, or by running pairs? The latter makes this massive matrix of scores. If we could live with the memory footprint of the target index (or make the tokenizer emit fewer tokens per entity!), we could also implement a new method on index, call it match(entity), that would find the top N matches for a given entity from input and then we'd score those. This way we don't need the matrix from (https://github.com/opensanctions/nomenklatura/blob/main/nomenklatura/index/index.py#L76) in memory.

@pudo
Copy link
Member

pudo commented Apr 6, 2024

Couldn't help myself, sorry: opensanctions/nomenklatura#143

@jbothma
Copy link
Contributor Author

jbothma commented Apr 22, 2024

@pudo for these type issues the local enricher is gonna return zavod entities and I'm struggling to see how to tell the type system that since it's based on the zavod store we're creating, not an argument to the enricher.

mypy --strict --exclude zavod/tests zavod/
zavod/runner/local_enricher.py:56: error: Argument 1 to "match" of "Index" has incompatible type "CE"; expected "Entity"  [arg-type]
zavod/runner/local_enricher.py:71: error: Incompatible types in "yield" (actual type "Entity", expected type "CE")  [misc]
zavod/runner/local_enricher.py:82: error: Argument 1 to "get_adjacent" of "View" has incompatible type "CE"; expected "Entity"  [arg-type]
zavod/runner/local_enricher.py:83: error: Incompatible types in "yield from" (actual type "Entity", expected type "CE")  [misc]
Found 4 errors in 1 file (checked 80 source files)

could the solution be to decouple the return types from the argument types? This changes the interfaces but I can't think of a scenario where the the enricher needs what's returned to be the same type as the query entity - it just needs to be a CompositeEntity, right?

diff --git a/zavod/zavod/runner/local_enricher.py b/zavod/zavod/runner/local_enricher.py
index 53dc3687..f5f0b0a9 100644
--- a/zavod/zavod/runner/local_enricher.py
+++ b/zavod/zavod/runner/local_enricher.py
@@ -2,7 +2,7 @@ import logging
 from typing import Generator, Optional
 from followthemoney.namespace import Namespace

-from nomenklatura.entity import CE
+from nomenklatura.entity import CE, CompositeEntity
 from nomenklatura.dataset import DS
 from nomenklatura.cache import Cache
 from nomenklatura.enrich.common import Enricher, EnricherConfig
@@ -12,6 +12,7 @@ from nomenklatura.matching import get_algorithm

 from zavod.meta import get_catalog
 from zavod.store import get_store
+from zavod.entity import Entity


 log = logging.getLogger(__name__)
@@ -52,7 +53,7 @@ class LocalEnricher(Enricher):
         if self.get_config_bool("strip_namespace"):
             self._ns = Namespace()

-    def match(self, entity: CE) -> Generator[CE, None, None]:
+    def match(self, entity: CE) -> Generator[Entity, None, None]:
         for match_id, index_score in self._index.match(entity)[:MATCH_CANDIDATES]:
             match = self._view.get_entity(match_id.id)
             if match is None:
@@ -70,7 +71,7 @@ class LocalEnricher(Enricher):
             if result.score >= self._threshold:
                 yield match

-    def _traverse_nested(self, entity: CE, depth: int) -> Generator[CE, None, None]:
+    def _traverse_nested(self, entity: CE, depth: int) -> Generator[CompositeEntity, None, None]:
         if depth == 0:
             return

@@ -82,5 +83,5 @@ class LocalEnricher(Enricher):
         for prop, adjacent in self._view.get_adjacent(entity):
             yield from self._traverse_nested(adjacent, depth - 1)

-    def expand(self, entity: CE, match: CE) -> Generator[CE, None, None]:
+    def expand(self, entity: CE, match: CE) -> Generator[CompositeEntity, None, None]:
         yield from self._traverse_nested(match, 2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants