
[Question]: How can I make sure no duplicate documents are added to the vector store? #13162

Closed
dsanr opened this issue Apr 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

@dsanr

dsanr commented Apr 29, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Currently, I am using an IngestionPipeline:

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=128),
        OpenAIEmbedding(
            model_name='text-embedding-3-small',
            callback_manager=CallbackManager([token_counter]),
        ),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=-1)

Here vector_store is a Chroma vector DB, and the document is the Paul Graham essay.

When I re-run the pipeline with the same documents, no duplicates are added to the docstore. However, embeddings are recalculated for the document, and the chroma_collection count keeps increasing.

How can I ensure no duplicate chunks/nodes are added to the vector_store?

@dsanr dsanr added the question Further information is requested label Apr 29, 2024

dosubot bot commented Apr 29, 2024

Hey @dsanr, nice to see you back with another interesting query! 😊 I'm diving into your issue about preventing duplicates in the vector store. Hang tight, I'll get back to you with some insights shortly.

@logan-markewich
Collaborator

Are you saving and loading the docstore between runs? Do your documents have consistent IDs? Both of those are pretty important
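One way to keep document IDs consistent across runs is to derive each ID deterministically from a stable key such as the file path, rather than relying on randomly generated IDs. A minimal sketch of the idea (the paths here are placeholders; in llama_index, `SimpleDirectoryReader(..., filename_as_id=True)` achieves a similar effect):

```python
import uuid

def stable_doc_id(path: str) -> str:
    # UUIDv5 is deterministic: the same path always maps to the same ID,
    # so the docstore can recognize the document on every run.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, path))

# Re-running produces the identical ID; different files get different IDs.
assert stable_doc_id("data/essay.txt") == stable_doc_id("data/essay.txt")
assert stable_doc_id("data/essay.txt") != stable_doc_id("data/other.txt")
```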


dosubot bot commented Apr 29, 2024

To prevent duplicates in your vector store when re-running the pipeline, set the docstore_strategy of your IngestionPipeline to DocstoreStrategy.UPSERTS. With this strategy, each incoming document is checked against the docstore by its ID and content hash: documents whose ID and hash already match are skipped, while new or changed documents are upserted, so only those are re-embedded and stored.
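As a rough illustration of that upsert check (a toy sketch of the idea, not llama_index's actual implementation), the docstore can be thought of as a map from document ID to content hash:

```python
import hashlib

class ToyDocstore:
    """Toy model of upsert-by-id-and-hash; not the real llama_index docstore."""

    def __init__(self):
        self.hashes = {}  # doc_id -> content hash

    def needs_processing(self, doc_id: str, text: str) -> bool:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == h:
            return False  # same ID, same hash: skip re-embedding
        self.hashes[doc_id] = h  # new or changed document: upsert it
        return True

store = ToyDocstore()
assert store.needs_processing("essay", "What I Worked On")       # first run
assert not store.needs_processing("essay", "What I Worked On")   # re-run: skipped
assert store.needs_processing("essay", "What I Worked On (v2)")  # edited: redone
```

This is why consistent IDs and a persisted docstore both matter: without them, the lookup above can never find a match, and every run re-embeds everything.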

Additionally, save the state of your docstore to disk after each pipeline run and load it at the start of subsequent runs. This maintains a consistent record of known documents across executions, which prevents duplicate chunks/nodes from being added to the vector_store.

Here's a simplified example to guide you:

import os

from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

persist_dir = "path/to/your/persisted/data"

# Initialize your pipeline with the upsert docstore strategy
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    # vector_store=vector_store,  # attach your Chroma vector store here
)

# Load the docstore from disk at the start (only if a previous run persisted it)
if os.path.exists(persist_dir):
    pipeline.load(persist_dir=persist_dir)

# Run your pipeline processing
# nodes = pipeline.run(documents=documents)

# Persist the docstore to disk at the end
pipeline.persist(persist_dir=persist_dir)

This setup ensures that your pipeline checks for duplicates using the document ID and hash, updating the docstore and vector_store only with unique or updated documents. By persisting the docstore state between runs, you maintain consistency in document IDs and avoid reprocessing duplicates.


@dsanr
Author

dsanr commented May 5, 2024

@logan-markewich Saving and loading the docstore between runs made it work. Thank you.
