Yes, ClickHouse can perform vector search. The main advantages of using ClickHouse for vector search compared to using more specialized vector databases include:
- Using ClickHouse's filtering and full-text search capabilities to refine your dataset before performing a search.
- Performing analytics on your datasets.
- Running a JOIN against your existing data.
- No need to manage yet another database and complicate your infrastructure.
Here is a quick tutorial on how to use ClickHouse for vector search.
1. Create embeddings
Your data (documents, images, or structured data) must be converted to embeddings. We recommend creating embeddings using the OpenAI Embeddings API or using the open-source Python library SentenceTransformers.
You can think of an embedding as a large array of floating-point numbers that represent your data. Check out this guide from OpenAI to learn more about embeddings.
2. Store the embeddings
Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:
CREATE TABLE images
(
    `_file` LowCardinality(String),
    `caption` String,
    `image_embedding` Array(Float32)
)
ENGINE = MergeTree;
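As a rough illustration of getting rows into a table like this, the Python sketch below renders (file, caption, embedding) tuples as a single INSERT statement, writing each embedding as an array literal. The helper names (`sql_string`, `sql_float_array`, `build_insert`) and the sample rows are hypothetical and exist only for this example. In practice, a client library's own insert mechanism is usually a better choice than building SQL strings by hand.

```python
def sql_string(value: str) -> str:
    """Quote a Python string as a SQL string literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def sql_float_array(values) -> str:
    """Render a sequence of numbers as an array literal such as [0.25, -0.5]."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


def build_insert(rows) -> str:
    """Build one INSERT statement for (file, caption, embedding) rows (hypothetical helper)."""
    tuples = [
        f"({sql_string(name)}, {sql_string(caption)}, {sql_float_array(embedding)})"
        for name, caption, embedding in rows
    ]
    header = "INSERT INTO images (_file, caption, image_embedding) VALUES\n"
    return header + ",\n".join(tuples)


# Made-up sample rows; real embeddings typically have hundreds of dimensions or more.
rows = [
    ("dog.jpg", "A dog's portrait", [0.25, -0.5, 0.125]),
    ("cat.jpg", "A sleeping cat", [0.0, 1.0, -1.0]),
]
statement = build_insert(rows)
print(statement)
```

Keeping one embedding per row, next to its metadata columns, is what makes it possible to filter or aggregate on that metadata in the same query as the vector search.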
3. Search for related embeddings
Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like cosineDistance to take an embedding of a dog image and search for related images:
SELECT
    _file,
    caption,
    cosineDistance(
        -- An embedding of your "input" dog picture
        [0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
        image_embedding
    ) AS score
FROM images
ORDER BY score ASC
LIMIT 10
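This query returns the _file names and captions of the top 10 images most likely to be related to your provided dog image. Because a smaller cosine distance means the two embeddings are more similar, sorting by score in ascending order puts the best matches first.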
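To make the ranking logic concrete, here is a minimal pure-Python sketch of the same idea: compute a cosine distance between a query embedding and each stored embedding, sort ascending, and keep the first few. It assumes the usual definition of cosine distance (1 minus cosine similarity). The function names `cosine_distance` and `top_k` and the two-dimensional sample vectors are made up for illustration; this is not how ClickHouse implements the search internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero length (norm 0).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=10):
    """Brute-force search: score every row, sort ascending, keep the k closest."""
    scored = [(cosine_distance(query, embedding), name) for name, embedding in rows]
    scored.sort()
    return scored[:k]


# Toy two-dimensional "embeddings" (hypothetical data).
rows = [
    ("dog_1.jpg", [1.0, 0.0]),
    ("cat_1.jpg", [0.0, 1.0]),
    ("dog_2.jpg", [1.0, 1.0]),
]
results = top_k([2.0, 0.0], rows, k=2)
for score, name in results:
    print(name, round(score, 4))
```

Note that the query vector `[2.0, 0.0]` has a different magnitude from `[1.0, 0.0]` yet is at distance 0 from it: cosine distance depends only on the angle between vectors, not on their lengths.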
Further Reading
To follow a more in-depth tutorial on vector search using ClickHouse, please see: