fix: elasticsearch (langchain-ai#2402)
- Create a new docker-compose file to start an Elasticsearch instance
for integration tests.
- Add new tests to `test_elasticsearch.py` to verify Elasticsearch
functionality.
- Add an optional group `test_integration` to the `pyproject.toml`
file. This group contains the dependencies for integration tests and
can be installed with `poetry install --with test_integration`. New
dependencies should be added by running
`poetry add some_new_deps --group "test_integration"`

Note:
The new tests run in live mode and perform end-to-end testing against
the OpenAI API. In the future, adding `pytest-vcr` to record and replay
all API requests would be a nice addition to the testing process. More info:
https://pytest-vcr.readthedocs.io/en/latest/

Fixes langchain-ai#2386
sergerdn authored Apr 5, 2023
1 parent 4d730a9 commit b410dc7
Showing 7 changed files with 186 additions and 34 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -141,3 +141,4 @@ wandb/
 
 # asdf tool versions
 .tool-versions
+/.ruff_cache/
2 changes: 1 addition & 1 deletion langchain/vectorstores/elastic_vector_search.py
@@ -241,7 +241,7 @@ def from_texts(
             raise ValueError(
                 "Your elasticsearch client string is misformatted. " f"Got error: {e} "
             )
-        index_name = uuid.uuid4().hex
+        index_name = kwargs.get("index_name", uuid.uuid4().hex)
         embeddings = embedding.embed_documents(texts)
         dim = len(embeddings[0])
         mapping = _default_text_mapping(dim)
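The one-line change above lets callers thread an `index_name` through `**kwargs` while keeping the random default. A minimal sketch of the pattern in isolation (the helper name is hypothetical):

```python
import uuid


def resolve_index_name(**kwargs: str) -> str:
    # Use the caller-supplied index_name when present; otherwise fall
    # back to a random 32-character hex name, as from_texts now does.
    return kwargs.get("index_name", uuid.uuid4().hex)


print(resolve_index_name(index_name="custom_index"))  # custom_index
print(len(resolve_index_name()))  # 32
```

Because `kwargs.get` only falls back when the key is absent, existing callers that never pass `index_name` keep the old behavior.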
21 changes: 10 additions & 11 deletions poetry.lock

Some generated files are not rendered by default.

7 changes: 7 additions & 0 deletions pyproject.toml
@@ -82,6 +82,13 @@ freezegun = "^1.2.2"
 responses = "^0.22.0"
 pytest-asyncio = "^0.20.3"
 
+[tool.poetry.group.test_integration]
+optional = true
+
+[tool.poetry.group.test_integration.dependencies]
+openai = "^0.27.4"
+elasticsearch = {extras = ["async"], version = "^8.6.2"}
+
 [tool.poetry.group.lint.dependencies]
 ruff = "^0.0.249"
 types-toml = "^0.10.8.1"
30 changes: 30 additions & 0 deletions tests/integration_tests/vectorstores/docker-compose/elasticsearch.yml
@@ -0,0 +1,30 @@
version: "3"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.7.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - xpack.security.http.ssl.enabled=false
      - ELASTIC_PASSWORD=password
    ports:
      - "9200:9200"
    healthcheck:
      test: [ "CMD-SHELL", "curl --silent --fail http://localhost:9200/_cluster/health || exit 1" ]
      interval: 1s
      retries: 360

  kibana:
    image: docker.elastic.co/kibana/kibana:8.7.0
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=password
      - KIBANA_PASSWORD=password
    ports:
      - "5601:5601"
    healthcheck:
      test: [ "CMD-SHELL", "curl --silent --fail http://localhost:5601/login || exit 1" ]
      interval: 10s
      retries: 60
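The healthchecks above poll until each service answers; a test harness could wait the same way before running the suite. A hedged Python sketch (the helper name and defaults are assumptions, not part of this commit):

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(
    url: str = "http://localhost:9200/_cluster/health",
    retries: int = 360,
    interval: float = 1.0,
) -> bool:
    """Poll the cluster health endpoint, mirroring the compose healthcheck."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not up yet (connection refused, timeout, etc.).
            pass
        time.sleep(interval)
    return False
```

With the defaults above it gives up after roughly six minutes, matching the compose file's `interval: 1s` / `retries: 360`.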
7 changes: 7 additions & 0 deletions tests/integration_tests/vectorstores/fixtures/sharks.txt
@@ -0,0 +1,7 @@
Sharks are a group of elasmobranch fish characterized by a cartilaginous skeleton, five to seven gill slits on the sides of the head, and pectoral fins that are not fused to the head. Modern sharks are classified within the clade Selachimorpha (or Selachii) and are the sister group to the Batoidea (rays and kin). Some sources extend the term "shark" as an informal category including extinct members of Chondrichthyes (cartilaginous fish) with a shark-like morphology, such as hybodonts and xenacanths. Shark-like chondrichthyans such as Cladoselache and Doliodus first appeared in the Devonian Period (419-359 Ma), though some fossilized chondrichthyan-like scales are as old as the Late Ordovician (458-444 Ma). The oldest modern sharks (selachians) are known from the Early Jurassic, about 200 Ma.

Sharks range in size from the small dwarf lanternshark (Etmopterus perryi), a deep sea species that is only 17 centimetres (6.7 in) in length, to the whale shark (Rhincodon typus), the largest fish in the world, which reaches approximately 12 metres (40 ft) in length. They are found in all seas and are common to depths up to 2,000 metres (6,600 ft). They generally do not live in freshwater, although there are a few known exceptions, such as the bull shark and the river shark, which can be found in both seawater and freshwater.[3] Sharks have a covering of dermal denticles that protects their skin from damage and parasites in addition to improving their fluid dynamics. They have numerous sets of replaceable teeth.

Several species are apex predators, which are organisms that are at the top of their food chain. Select examples include the tiger shark, blue shark, great white shark, mako shark, thresher shark, and hammerhead shark.

Sharks are caught by humans for shark meat or shark fin soup. Many shark populations are threatened by human activities. Since 1970, shark populations have been reduced by 71%, mostly from overfishing.
152 changes: 130 additions & 22 deletions tests/integration_tests/vectorstores/test_elasticsearch.py
@@ -1,29 +1,137 @@
"""Test ElasticSearch functionality."""
import logging
import os
from typing import Generator, List, Union

import pytest
from elasticsearch import Elasticsearch

from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings

logging.basicConfig(level=logging.DEBUG)

"""
cd tests/integration_tests/vectorstores/docker-compose
docker-compose -f elasticsearch.yml up
"""


class TestElasticsearch:
    @pytest.fixture(scope="class", autouse=True)
    def elasticsearch_url(self) -> Union[str, Generator[str, None, None]]:
        """Return the elasticsearch url."""
        url = "http://localhost:9200"
        yield url
        es = Elasticsearch(hosts=url)

        # Clear all indexes
        index_names = es.indices.get(index="_all").keys()
        for index_name in index_names:
            # print(index_name)
            es.indices.delete(index=index_name)

    @pytest.fixture(scope="class", autouse=True)
    def openai_api_key(self) -> Union[str, Generator[str, None, None]]:
        """Return the OpenAI API key."""
        openai_api_key = os.getenv("OPENAI_API_KEY")
        if not openai_api_key:
            raise ValueError("OPENAI_API_KEY environment variable is not set")

        yield openai_api_key

    @pytest.fixture(scope="class")
    def documents(self) -> Generator[List[Document], None, None]:
        """Return a generator that yields a list of documents."""
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

        documents = TextLoader(
            os.path.join(os.path.dirname(__file__), "fixtures", "sharks.txt")
        ).load()
        yield text_splitter.split_documents(documents)

    def test_similarity_search_without_metadata(self, elasticsearch_url: str) -> None:
        """Test end to end construction and search without metadata."""
        texts = ["foo", "bar", "baz"]
        docsearch = ElasticVectorSearch.from_texts(
            texts, FakeEmbeddings(), elasticsearch_url=elasticsearch_url
        )
        output = docsearch.similarity_search("foo", k=1)
        assert output == [Document(page_content="foo")]

    def test_similarity_search_with_metadata(self, elasticsearch_url: str) -> None:
        """Test end to end construction and search with metadata."""
        texts = ["foo", "bar", "baz"]
        metadatas = [{"page": i} for i in range(len(texts))]
        docsearch = ElasticVectorSearch.from_texts(
            texts,
            FakeEmbeddings(),
            metadatas=metadatas,
            elasticsearch_url=elasticsearch_url,
        )
        output = docsearch.similarity_search("foo", k=1)
        assert output == [Document(page_content="foo", metadata={"page": 0})]

    def test_default_index_from_documents(
        self, documents: List[Document], openai_api_key: str, elasticsearch_url: str
    ) -> None:
        """This test checks the construction of a default
        ElasticSearch index using the 'from_documents'."""
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)

        elastic_vector_search = ElasticVectorSearch.from_documents(
            documents=documents,
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
        )

        search_result = elastic_vector_search.similarity_search("sharks")

        print(search_result)
        assert len(search_result) != 0

    def test_custom_index_from_documents(
        self, documents: List[Document], openai_api_key: str, elasticsearch_url: str
    ) -> None:
        """This test checks the construction of a custom
        ElasticSearch index using the 'from_documents'."""
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
        elastic_vector_search = ElasticVectorSearch.from_documents(
            documents=documents,
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
            index_name="custom_index",
        )
        es = Elasticsearch(hosts=elasticsearch_url)
        index_names = es.indices.get(index="_all").keys()
        assert "custom_index" in index_names

        search_result = elastic_vector_search.similarity_search("sharks")
        print(search_result)

        assert len(search_result) != 0

    def test_custom_index_add_documents(
        self, documents: List[Document], openai_api_key: str, elasticsearch_url: str
    ) -> None:
        """This test checks the construction of a custom
        ElasticSearch index using the 'add_documents'."""
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
        elastic_vector_search = ElasticVectorSearch(
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
            index_name="custom_index",
        )
        es = Elasticsearch(hosts=elasticsearch_url)
        index_names = es.indices.get(index="_all").keys()
        assert "custom_index" in index_names

        elastic_vector_search.add_documents(documents)
        search_result = elastic_vector_search.similarity_search("sharks")
        print(search_result)

        assert len(search_result) != 0


-def test_elasticsearch() -> None:
-    """Test end to end construction and search."""
-    texts = ["foo", "bar", "baz"]
-    docsearch = ElasticVectorSearch.from_texts(
-        texts, FakeEmbeddings(), elasticsearch_url="http://localhost:9200"
-    )
-    output = docsearch.similarity_search("foo", k=1)
-    assert output == [Document(page_content="foo")]
-
-
-def test_elasticsearch_with_metadatas() -> None:
-    """Test end to end construction and search."""
-    texts = ["foo", "bar", "baz"]
-    metadatas = [{"page": i} for i in range(len(texts))]
-    docsearch = ElasticVectorSearch.from_texts(
-        texts,
-        FakeEmbeddings(),
-        metadatas=metadatas,
-        elasticsearch_url="http://localhost:9200",
-    )
-    output = docsearch.similarity_search("foo", k=1)
-    assert output == [Document(page_content="foo", metadata={"page": 0})]
