This is my side project: an implementation of an AI-powered enterprise RAG (Retrieval-Augmented Generation) system. It uses a pre-trained model to generate embeddings for books, then uses Elasticsearch to index and search them via multi-modal search:
- traditional text search
- 🧮 cosine-similarity search using embeddings (books are recommended not just by keywords but by semantics, user preferences, etc., all of which are embedded as a vector)
- I did not choose a dedicated vector database, since Elasticsearch already provides vector storage and search capabilities. It is not as capable as a purpose-built vector database, but it is good enough for this project. Milvus is a good alternative if you want a vector database.
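The cosine-similarity part can be sketched in plain NumPy (the embeddings and the 4-dimensional toy vectors below are made up for illustration; in the project itself this scoring is delegated to Elasticsearch):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, book_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k books whose embeddings are most similar to the query."""
    # cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    b = book_vecs / np.linalg.norm(book_vecs, axis=1, keepdims=True)
    scores = b @ q
    return np.argsort(scores)[::-1][:k]

# toy 4-dimensional embeddings for 5 books (made up)
book_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.1, 0.9, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(top_k_cosine(query, book_vecs))  # indices of the most similar books first
```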
- For big firms with more resources, a stronger stack would be: PyTorch + ONNX for model development, FastAPI + Docker for deployment, and Ray + Grafana for the MLOps lifecycle (rather than shipping models as `pickle` files).
If you run this project locally after `git clone`, the indexing and searching steps use only a small sample dataset, so that an interviewer (or anyone interested in trying it) can run the code on their machine and see results quickly; sharing a parquet file with 1.5M records plus its embeddings would take too long. The online version uses the full dataset.
If you haven't tried ONNX before, please check it out. It is a great way to deploy your models when inference performance in production matters.
- Python 3.10.10
- Docker (>= 24.0.5 should work)
- Docker Compose
# check your Python version
# (pyenv is recommended for managing Python versions)
python --version # should be >= 3.10.10
python -m venv venv
source venv/bin/activate
make install
- `make onnx`: construct the ONNX model
- `make elastic-up`: start Elasticsearch
- `make index-books`: index the books (you might need to run this several times, as Elasticsearch might not be ready yet)
- `make run`: start the FastAPI server
- `make test`: run the tests
The port may differ if you already have services running on port 8080.
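Conceptually, `make index-books` (via `search/index_books.py`) has to create an index with a `dense_vector` mapping and bulk-upload the books with their embeddings. A hedged sketch of that shape (the index name, field names, and embedding dimension below are assumptions, not the project's actual values):

```python
# Sketch of what an index_books-style script has to do; index name, field
# names, and embedding dimension are assumptions, not the project's values.
index_name = "books"
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            # dense_vector enables Elasticsearch's vector similarity scoring
            "embedding": {"type": "dense_vector", "dims": 384},
        }
    }
}

def bulk_actions(books):
    """Yield one bulk-API action per book."""
    for book_id, book in enumerate(books):
        yield {"_index": index_name, "_id": book_id, "_source": book}

books = [{"title": "A", "description": "toy record", "embedding": [0.0] * 384}]
actions = list(bulk_actions(books))
print(actions[0]["_index"])  # books

# With a live cluster you would then run, roughly:
#   es = Elasticsearch("http://localhost:9200")
#   es.indices.create(index=index_name, body=mapping)
#   helpers.bulk(es, bulk_actions(books))
```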
TODO: Add deployment instructions
The project uses the fastapi-cookiecutter template. The project structure is as follows:
.
├── app
│   ├── api
│   ├── core
│   ├── __init__.py
│   ├── main.py
│   ├── models
│   ├── __pycache__
│   ├── services
│   └── templates
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── ml
│   ├── data
│   ├── features
│   ├── __init__.py
│   ├── model
│   └── __pycache__
├── notebooks
│   ├── construct_sample_dataset.ipynb
│   └── onnx_runtime.ipynb
├── poetry.lock
├── pyproject.toml
├── README.md
├── search
│   ├── books_embeddings.csv
│   ├── docker-compose.yml
│   └── index_books.py
└── tests
    ├── __init__.py
    ├── __pycache__
    ├── test_api.py
    ├── test_elastic_search.py
    └── test_onnx_embedding.py
The data was originally downloaded from the Goodreads Book Graph Datasets; the author also provides code to download it. I downloaded the data and uploaded it to my Google Cloud Storage bucket. Please let me know if you find that the above links are broken, and I will provide you with the data.
There are many tables in the dataset, but we are only interested in the following tables:
- books: detailed metadata about 2.36M books
- reviews: complete set of 15.7M reviews (~5 GB), with 15M records containing detailed review text
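The Goodreads dumps are distributed as gzipped JSON-lines files, which can be streamed record by record without loading everything into memory (the keys shown are illustrative; the snippet builds a tiny in-memory stand-in for a real `*.json.gz` dump):

```python
import gzip
import io
import json

def iter_records(fileobj, limit=None):
    """Stream records from a gzipped JSON-lines file, one dict per line."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# build a tiny in-memory stand-in for a goodreads *.json.gz dump
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"book_id": "1", "title": "Example"}) + "\n")
    fh.write(json.dumps({"book_id": "2", "title": "Another"}) + "\n")
buf.seek(0)

records = list(iter_records(buf, limit=1))
print(records[0]["title"])  # Example
```

Streaming with a `limit` is also how a small sample can be cut from the 2.36M-book dump without holding the whole file in memory.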