
[Question]: How can I make sure no duplicate documents are added to the vector store? #13162

Closed
dsanr opened this issue Apr 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

@dsanr

dsanr commented Apr 29, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Currently, I am using an IngestionPipeline:

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=128),
        OpenAIEmbedding(
            model_name='text-embedding-3-small',
            callback_manager=CallbackManager([token_counter]),
        ),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=-1)

Here vector_store is a Chroma vector DB, and the document is the Paul Graham essay.

When I re-run the pipeline with the same documents, no duplicates are added to the docstore. However, embeddings are recalculated for the document, and the chroma_collection count keeps increasing.

How can I ensure no duplicate chunks/nodes are added to the vector_store?

@dsanr dsanr added the question Further information is requested label Apr 29, 2024

dosubot bot commented Apr 29, 2024

Hey @dsanr, nice to see you back with another interesting query! 😊 I'm diving into your issue about preventing duplicates in the vector store. Hang tight, I'll get back to you with some insights shortly.

@logan-markewich
Collaborator

Are you saving and loading the docstore between runs? Do your documents have consistent IDs? Both of those are pretty important
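One way to keep document IDs consistent across runs is to derive each ID deterministically from a stable key such as the file path, rather than relying on randomly generated IDs. A minimal sketch of the idea (the paths here are placeholders; in llama_index, `SimpleDirectoryReader(..., filename_as_id=True)` achieves a similar effect):

```python
import uuid

def stable_doc_id(path: str) -> str:
    # UUIDv5 is deterministic: the same path always maps to the same ID,
    # so the docstore can recognize the document on every run.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, path))

# Re-running produces the identical ID; different files get different IDs.
assert stable_doc_id("data/essay.txt") == stable_doc_id("data/essay.txt")
assert stable_doc_id("data/essay.txt") != stable_doc_id("data/other.txt")
```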


dosubot bot commented Apr 29, 2024

To prevent duplicates in your vector store when re-running the pipeline, set the docstore_strategy of your IngestionPipeline to DocstoreStrategy.UPSERTS. With this strategy, each incoming document is checked against the docstore by its ID and content hash: documents whose ID and hash already match are skipped, while new or changed documents are upserted, so only those are re-embedded and stored.
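As a rough illustration of that upsert check (a toy sketch of the idea, not llama_index's actual implementation), the docstore can be thought of as a map from document ID to content hash:

```python
import hashlib

class ToyDocstore:
    """Toy model of upsert-by-id-and-hash; not the real llama_index docstore."""

    def __init__(self):
        self.hashes = {}  # doc_id -> content hash

    def needs_processing(self, doc_id: str, text: str) -> bool:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == h:
            return False  # same ID, same hash: skip re-embedding
        self.hashes[doc_id] = h  # new or changed document: upsert it
        return True

store = ToyDocstore()
assert store.needs_processing("essay", "What I Worked On")       # first run
assert not store.needs_processing("essay", "What I Worked On")   # re-run: skipped
assert store.needs_processing("essay", "What I Worked On (v2)")  # edited: redone
```

This is why consistent IDs and a persisted docstore both matter: without them, the lookup above can never find a match, and every run re-embeds everything.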

Additionally, save the state of your docstore to disk after each pipeline run and load it at the start of subsequent runs. This maintains a consistent record of known documents across executions, which prevents duplicate chunks/nodes from being added to the vector_store.

Here's a simplified example to guide you:

import os

from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

persist_dir = "path/to/your/persisted/data"

# Initialize your pipeline with the upsert docstore strategy
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    # vector_store=vector_store,  # attach your Chroma vector store here
)

# Load the docstore from disk at the start (only if a previous run persisted it)
if os.path.exists(persist_dir):
    pipeline.load(persist_dir=persist_dir)

# Run your pipeline processing
# nodes = pipeline.run(documents=documents)

# Persist the docstore to disk at the end
pipeline.persist(persist_dir=persist_dir)

This setup ensures that your pipeline checks for duplicates using the document ID and hash, updating the docstore and vector_store only with unique or updated documents. By persisting the docstore state between runs, you maintain consistency in document IDs and avoid reprocessing duplicates.


@dsanr
Author

dsanr commented May 5, 2024

@logan-markewich Saving and loading the docstore between runs made it work. Thank you.
