Dupi is an engine for identifying and exploring duplicative text in sets of documents.
Dupi is in an alpha/early beta stage of development. Please feel free to give it a try (and file issues). We have run it successfully on several document sets, but it still needs more testing.
Throw hundreds of thousands of text documents at it, or extract the text from other kinds of documents and send that to Dupi.
Find and query for repeated chunks of text.
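
To make "repeated chunks of text" concrete, here is a minimal Python sketch of one generic way to detect them: index every run of n consecutive words (a "shingle") and keep the runs that occur in more than one place. This is only an illustration of the idea and is not Dupi's implementation or API; the names `shingles` and `find_repeats`, the sample documents, and the choice of n are all assumptions made for the example.

```python
from collections import defaultdict


def shingles(text, n=4):
    """Yield (word_offset, shingle) pairs; a shingle is n consecutive words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield i, " ".join(words[i:i + n])


def find_repeats(docs, n=4):
    """Return shingles that occur at two or more locations.

    docs maps a document name to its text. The result maps each repeated
    shingle to a list of (document name, word offset) pairs.
    """
    index = defaultdict(list)
    for name, text in docs.items():
        for offset, shingle in shingles(text, n):
            index[shingle].append((name, offset))
    # Keep only shingles seen more than once (possibly within one document).
    return {s: locs for s, locs in index.items() if len(locs) > 1}


# Hypothetical sample documents, used only for illustration.
docs = {
    "a.txt": "the quick brown fox jumps over the lazy dog",
    "b.txt": "yesterday the quick brown fox jumps again",
}
repeats = find_repeats(docs, n=4)
for shingle, locations in repeats.items():
    print(shingle, locations)
```

An in-memory dictionary like this would not scale to hundreds of thousands of documents; it is meant only to show what a repeated chunk and its locations look like.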