[Question]: How can I make sure no duplicate documents are added to the vector store? #13162
Comments
Hey @dsanr, nice to see you back with another interesting query! 😊 I'm diving into your issue about preventing duplicates in the vector store. Hang tight, I'll get back to you with some insights shortly.
Are you saving and loading the docstore between runs? Do your documents have consistent IDs? Both of those are pretty important.
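To make the two points above concrete, here is a minimal, self-contained sketch in plain Python — not the llama_index API; ToyDocstore and content_hash are hypothetical names — of how a persisted docstore with stable document IDs and content hashes lets a pipeline skip unchanged documents between runs. (In llama_index itself, stable IDs can come from e.g. SimpleDirectoryReader's filename_as_id=True option.)

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    # Stable content hash, analogous to the docstore's per-document hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ToyDocstore:
    """Hypothetical stand-in for a persisted docstore: maps doc ID -> content hash."""

    def __init__(self, hashes=None):
        self.hashes = dict(hashes or {})

    def needs_processing(self, doc_id: str, text: str) -> bool:
        # Re-process only if the document is new or its content changed.
        return self.hashes.get(doc_id) != content_hash(text)

    def record(self, doc_id: str, text: str) -> None:
        self.hashes[doc_id] = content_hash(text)

    def persist(self, path: str) -> None:
        # Save state to disk so the next run can recognize old documents.
        Path(path).write_text(json.dumps(self.hashes))

    @classmethod
    def load(cls, path: str) -> "ToyDocstore":
        p = Path(path)
        return cls(json.loads(p.read_text())) if p.exists() else cls()
```

Without the persist/load step, every run starts with an empty hash map and re-processes everything — which is exactly the duplicate behaviour described in this issue.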
To address the issue of preventing duplicates in your vector store when re-running the pipeline, it's crucial to implement a strategy that checks for document existence by ID and hash, updating as necessary. This can be achieved by setting the docstore_strategy on the IngestionPipeline (for example, DocstoreStrategy.UPSERTS). Additionally, it's important to save the state of your docstore between runs. Here's a simplified example to guide you:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter

# Initialize your pipeline with the desired docstore strategy
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=SimpleDocumentStore(),
    # Add other necessary initialization parameters here
)

# Load the docstore from disk at the start (if it exists)
pipeline.load(persist_dir="path/to/your/persisted/data")

# Run your pipeline processing
# ...

# Persist the docstore to disk at the end
pipeline.persist(persist_dir="path/to/your/persisted/data")

This setup ensures that your pipeline checks for duplicates using the document ID and hash, updating the docstore and vector store only as necessary.
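The upsert behaviour the example above relies on can be illustrated with a self-contained sketch in plain Python (upsert_documents and the placeholder embedding string are hypothetical, not llama_index internals): unchanged documents are skipped, changed or new ones are re-embedded and written under the same ID.

```python
import hashlib


def upsert_documents(doc_hashes: dict, vector_store: dict, docs: dict) -> list:
    """Sketch of an upsert strategy.

    doc_hashes maps doc_id -> content hash (the docstore's memory),
    vector_store maps doc_id -> embedded payload, docs maps doc_id -> text.
    Returns the IDs that were (re)processed this run.
    """
    processed = []
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_hashes.get(doc_id) == h:
            continue  # unchanged: skip re-embedding and re-inserting
        doc_hashes[doc_id] = h
        vector_store[doc_id] = f"embedding({text})"  # placeholder for a real embedding
        processed.append(doc_id)
    return processed
```

Running this twice on the same input processes nothing the second time, and the vector store never grows past one entry per document ID.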
@logan-markewich Saving and loading the docstore between runs made it work. Thank you.
Question Validation
Question
Currently, I am using the Ingestion Pipeline:
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=128),
        OpenAIEmbedding(model_name='text-embedding-3-small', callback_manager=CallbackManager([token_counter])),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=-1)
vector_store is a Chroma vector DB, and the documents are the Paul Graham essay.
When I re-run the pipeline with the same documents, no duplicates are added to the docstore. But the embeddings are calculated again for the documents, and the chroma_collection count increases.
How can I ensure no duplicate chunks/nodes are added to the vector_store?
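One way to see why the collection count keeps growing: if each run generates fresh node IDs for the chunks, the vector store appends new rows; if the IDs are stable across runs, re-adding overwrites in place. A toy sketch, with a dict standing in for the Chroma collection and a hypothetical add_nodes helper:

```python
import uuid


def add_nodes(collection: dict, texts, stable_ids=None) -> int:
    """Add chunk texts to a toy collection keyed by node ID.

    With fresh UUIDs each call (the duplicate scenario) the count grows on
    every run; with stable IDs, repeat calls overwrite and the count stays flat.
    Returns the collection size after the call.
    """
    for i, text in enumerate(texts):
        node_id = stable_ids[i] if stable_ids else str(uuid.uuid4())
        collection[node_id] = text
    return len(collection)
```

This mirrors the symptom in the question: the docstore dedupes by document ID, but unless the upstream dedup check short-circuits the run, freshly generated chunk nodes still land in the vector store as new entries.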