A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. Each vector has a certain number of dimensions, which can range from tens to thousands, depending on the complexity and granularity of the data. The vectors are usually generated by applying some kind of transformation or embedding function to the raw data, such as text, images, audio, or video. The embedding function can be based on various methods, such as machine learning models, word embeddings, or feature extraction algorithms.
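To make the idea of an embedding function concrete, here is a minimal sketch of a hand-written feature extraction step: a hashed bag-of-words that maps any text to a fixed-length vector. The name `toy_embed` and the choice of 8 dimensions are assumptions made for illustration. A real system would typically use a learned model instead, which captures meaning rather than just word counts.

```python
import hashlib

def toy_embed(text, dim=8):
    """Map text to a fixed-length vector by counting hashed words.

    Illustrative stand-in for an embedding function, not a learned model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives a hash that is stable across runs (unlike built-in hash()).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec

vector = toy_embed("Vector databases store vectors")
print(len(vector), vector)
```

Whatever the input length, the output always has `dim` entries, which is the property a vector database relies on.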
The main difference between a vector database and a traditional (relational) database lies in the type of data they store. While relational databases are designed for structured data that fits into tables, vector databases are intended for unstructured data, such as text or images. The type of data that is stored also influences how the data is retrieved:
In a relational database, query results are based on matches for specific keywords.
In a vector database, query results are based on similarity between vectors.
Typically, the SQL query is more complex and takes longer to run, and it does not guarantee the accuracy of the results.
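The two retrieval styles can be contrasted with a small sketch. The records and their 3-dimensional vectors below are made up for illustration, and `keyword_match` and `similarity_search` are hypothetical helper names, not the API of any particular database.

```python
import math

# Hypothetical toy records: each has a text field and a made-up 3-D embedding.
records = [
    {"text": "cheap flights to Paris", "vec": [0.9, 0.1, 0.0]},
    {"text": "low-cost airfare to France", "vec": [0.8, 0.2, 0.1]},
    {"text": "how to bake sourdough bread", "vec": [0.0, 0.1, 0.9]},
]

def keyword_match(keyword):
    """Keyword-style lookup: return texts that contain the exact keyword."""
    return [r["text"] for r in records if keyword in r["text"]]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec, k=2):
    """Vector-style lookup: return the k texts whose vectors are most similar."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["text"] for r in ranked[:k]]

keyword_hits = keyword_match("flights")
similar_hits = similarity_search([0.85, 0.15, 0.05])
print(keyword_hits)
print(similar_hits)
```

The keyword lookup misses the second record because it never uses the word "flights", while the similarity search returns both travel-related records because their vectors point in nearly the same direction as the query.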
Vector embedding is the process of transforming data into a vector representation. The vector representation is a mathematical representation of the data that can be used for various purposes, such as similarity search, classification, and clustering. Some ML algorithms can convert a given object into a numerical representation that preserves the information of that object: an ML model accepts the prompt (for example, a word) and returns a long list of numbers. This long list of numbers is the numerical representation of the word and is called a vector embedding. Because these embeddings are long lists of numbers, we call them high-dimensional. Let's pretend for a second that these embeddings are only three-dimensional to visualize them as shown below.
Source: Explaining Vector Databases in 3 Levels of Difficulty
The numerical representations enable us to apply mathematical calculations to objects, such as words. For example, the following calculation will not work unless you replace the words with their embeddings:
drink - food + hungry = thirsty
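The calculation above can be reproduced with hand-crafted toy embeddings. The three axes and all vector values below are assumptions chosen so that the arithmetic works out exactly. Embeddings from a real model have many more dimensions and only land near the expected word.

```python
import math

# Hand-crafted toy embeddings (an assumption, not model output).
# Axes: [relates to eating, relates to drinking, describes a bodily state]
embeddings = {
    "food":    [1.0, 0.0, 0.0],
    "drink":   [0.0, 1.0, 0.0],
    "hungry":  [1.0, 0.0, 1.0],
    "thirsty": [0.0, 1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(vec):
    """Return the vocabulary word whose embedding is most similar to vec."""
    return max(embeddings, key=lambda word: cosine(vec, embeddings[word]))

# drink - food + hungry, computed element by element
result_vec = [
    d - f + h
    for d, f, h in zip(embeddings["drink"], embeddings["food"], embeddings["hungry"])
]
answer = nearest_word(result_vec)
print(result_vec, answer)
```

Subtracting `food` removes the "eating" component, adding `hungry` adds the "bodily state" component, and what remains is the vector for `thirsty`.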
And because we are able to use the embeddings for calculations, we can also calculate the distance between a pair of embedded objects. The closer two embedded objects are to one another, the more similar they are.
Vector databases were around before the hype around LLMs started. Originally, they were used in recommendation systems because they can quickly find similar objects for a given query. But because they can provide long-term memory to LLMs, they have also been used in question-answering (QA) applications recently.
Vector databases are able to retrieve objects similar to a query quickly because they pre-calculate an index over the stored embeddings ahead of time. The underlying concept is called Approximate Nearest Neighbor (ANN) search, which uses different algorithms for indexing and calculating similarities.
Calculating the similarities between a query and every embedded object you have with a simple k-nearest neighbors (kNN) algorithm can become time-consuming when you have millions of embeddings. With ANN, you can trade some accuracy for speed and retrieve the approximately most similar objects to a query.
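For reference, the exact (brute-force) kNN search described above can be sketched in a few lines. The function name `knn` and the sample vectors are assumptions for illustration, and Euclidean distance is used here as one possible measure.

```python
import heapq
import math

def knn(query, vectors, k):
    """Exact k-nearest-neighbor search: compare the query with every stored vector.

    Returns the indices of the k closest vectors, nearest first.
    """
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )

vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
neighbors = knn([1.0, 1.0], vectors, k=2)
print(neighbors)
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection. That per-query scan is the work an ANN index tries to avoid.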
A vector database indexes the vector embeddings. This step maps the vectors to a data structure that enables faster searching. Indexing thus lets you search only a smaller portion of all the available vectors, which speeds up retrieval. For more detail about indexing, look up Hierarchical Navigable Small World (HNSW).
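As a rough illustration of how an index narrows the search, here is a sketch of a bucket-based index built from random hyperplanes (a simple form of locality-sensitive hashing). This is not HNSW, which is graph-based and considerably more involved. The class name `HyperplaneIndex` and its parameters are assumptions made for this example.

```python
import random
from collections import defaultdict

class HyperplaneIndex:
    """Toy index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        # Each hyperplane passes through the origin and is defined by its normal vector.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _key(self, vec):
        # One boolean per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, vec):
        self.vectors.append(vec)
        self.buckets[self._key(vec)].append(len(self.vectors) - 1)

    def candidates(self, query):
        """Return ids of stored vectors that fall in the same bucket as the query."""
        return self.buckets.get(self._key(query), [])

# Build an index over 1000 random 8-dimensional vectors (made-up data).
rng = random.Random(1)
index = HyperplaneIndex(dim=8)
for _ in range(1000):
    index.add([rng.uniform(-1, 1) for _ in range(8)])

query = index.vectors[42]
candidate_ids = index.candidates(query)
print(len(candidate_ids), "candidates out of", len(index.vectors))
```

Only the candidates then need an exact similarity comparison. The trade-off is that a true nearest neighbor can land in a different bucket and be missed, which is the "approximate" part of ANN.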
To find the nearest neighbors to the query from the indexed vectors, a vector database applies a similarity measure. Common similarity measures include cosine similarity, dot product, Euclidean distance, Manhattan distance, and Hamming distance.
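The measures listed above can each be written in a line or two. The following is a plain-Python sketch with made-up sample vectors. The function names are chosen for this example and do not come from any specific library.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; larger means more aligned (and larger magnitude)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by vector lengths; ranges from -1 to 1."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences; smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    """Number of positions where the two vectors differ (typically binary vectors)."""
    return sum(x != y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(dot_product(a, b))
print(cosine_similarity(a, b))
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Note the direction of each measure: for cosine similarity and dot product a higher value means more similar, while for the three distances a lower value means more similar.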
https://learn.microsoft.com/en-us/semantic-kernel/memories/vector-db?source=docs
https://towardsdatascience.com/explaining-vector-databases-in-3-levels-of-difficulty-fc392e48ab78