Smart search, smarter results: Inside an Aerospike-powered search engine

Aerospike Vector Search enhances website search with hybrid models, combining keyword precision and semantic understanding for high-quality results.

December 11, 2024 | 5 min read
Art Anderson
Director of Developer Advocacy

When was the last time you used a website’s built-in search instead of just using Google? If you’re anything like me, it’s probably been a while, and for good reason. Traditional search methods often struggle to truly understand what you’re looking for. But search is evolving. We’ve moved beyond the limitations of exact matches and into a new era where search engines can grasp the meaning of your queries, delivering results that feel almost intuitive.

Following the launch of Aerospike Vector Search earlier this year, two wildly intelligent interns and I set out to see if we could build a better search experience on the Aerospike website. This blog documents part of that journey and explores how Aerospike Vector Search helped us solve some interesting problems while “eating a bit of our own dog food” along the way.

From keywords to meaning: A hybrid search approach

For many years, the Aerospike website used an off-the-shelf search appliance. It was a fairly straightforward keyword search that provided okay results, but things like filtering and ranking weren’t very robust, and we had little control over indexing our content. Most people usually resorted to using Google to search the website. It was time to build something better.

From the beginning, we looked to use a hybrid model, marrying the contextual understanding of vectors with the precision of keyword search. Aerospike, being a lightning-fast key-value store, made things easy when it came to searching the keyword index, and its ability to handle complex data types streamlined building hybrid search. The search engine we built uses an inverted index, a structure common in search applications: it maps each keyword to the documents that contain it, so a query becomes a handful of key lookups instead of a scan through every document.
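To make the inverted index idea concrete, here is a minimal in-memory sketch. It is not the site's implementation: the `tokenize` helper, the stop-word list, and the sample pages are all made up for illustration, and a plain dictionary stands in for the key-value store.

```python
from collections import defaultdict

# Tiny stand-in for real NLP preprocessing; this stop-word list is illustrative only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,!?;:()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def keyword_search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return documents containing every query keyword (one lookup per keyword)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical pages, keyed by path.
docs = {
    "/docs/vector": "Vector search in Aerospike",
    "/docs/kv": "The Aerospike key-value store",
}
index = build_inverted_index(docs)
```

With this layout, `keyword_search(index, "vector search")` touches only two index entries no matter how many pages exist, which is the property that makes a key-value store a natural home for the index.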

While building and searching the keyword index, we used spaCy, a natural language processing library that tokenizes, lemmatizes, and removes stop words. The keyword search does an amazing job… when you spell your keyword correctly or when spaCy derives the appropriate root word, but what about inevitable typos? Synonyms? How do we handle all that?

Enter vectors. Vector search isn’t the only solution to the problem above, but it has unique advantages. The vectors we search, also called embeddings, are created by passing data (in this case, text scraped from our webpages) into a machine learning model. This embedding model takes unstructured data and turns it into a vector that captures the semantic meaning and relationships of the underlying text. This allows you to perform a search by comparing vector embeddings with a distance calculation: the closer two vectors are, the more similar their meaning.
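As one example of such a distance calculation, the sketch below computes cosine distance in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and cosine is only one of several metrics a vector index can be configured to use.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 means same direction, 2 means opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Made-up "embeddings" for illustration.
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]     # same direction as the query
unrelated = [0.0, 1.0, 0.0]   # orthogonal to the query
```

Here `cosine_distance(query, similar)` is approximately 0 and `cosine_distance(query, unrelated)` is 1, so the first vector would rank ahead of the second.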

Aerospike Vector Search makes searching across thousands, millions, and billions of vectors fast and easy by combining the tried-and-true Aerospike key-value storage with a new distributed Hierarchical Navigable Small World (HNSW) index. HNSW does a great job finding the “nearest neighbors” to our query vector, and the Aerospike key-value storage makes retrieval of those records blazing fast. 
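For contrast, this is what an exact nearest-neighbor search looks like without an index: compare the query against every stored vector and keep the closest. It is a baseline sketch with made-up data, not how Aerospike Vector Search works internally. HNSW is an approximate index that avoids this full scan by navigating a layered graph of vectors.

```python
import heapq


def squared_euclidean(a: list[float], b: list[float]) -> float:
    """Squared straight-line distance; it orders neighbors the same way as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nearest_neighbors(
    query: list[float], vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k closest."""
    scored = ((key, squared_euclidean(query, vec)) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Made-up two-dimensional vectors keyed by record ID.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [5.0, 5.0],
}
nearest = exact_nearest_neighbors([0.9, 0.9], vectors, k=2)
```

The cost of this scan grows linearly with the number of vectors, which is why an index becomes necessary at millions or billions of records.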

Aerospike Vector Search: Simple yet powerful

I’d love to share a long tale of overcoming integration challenges, but honestly, adding vector search to our hybrid model was probably the easiest part! Here’s how straightforward it was:

  • Unified database: Aerospike Vector Search uses the same Aerospike key-value store as our keyword index, allowing us to store everything in one database for amazing performance and massive scalability. 

  • Effortless API: The API is extremely simple to use and does all the heavy lifting when it comes to search. 

  • Seamless process: We just add the vectors to the index and pass the query vector in when we need results. Records come back ranked and ready to go!

Code samples

We use Google’s Vertex AI to create our vector embeddings, both for the content we store in Aerospike and for the queries we run against it.

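from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Assumes vertexai.init(...) has already been called with your project settings.
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

def get_embedding(sentences: list[dict], task: str):
   # task is a Vertex AI task type, e.g. "RETRIEVAL_DOCUMENT" for page
   # content or "RETRIEVAL_QUERY" for a user's search query.
   inputs = [
       TextEmbeddingInput(
           text=sentence.get("text"),
           title=sentence.get("title", None),
           task_type=task,
       ) for sentence in sentences
   ]
   # Returns one TextEmbedding per input; the vector itself is in .values.
   return model.get_embeddings(inputs)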
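The function above expects a list of dictionaries with `text` and optional `title` keys. The post does not show how that list is produced, so the following is only a hypothetical sketch: `build_embedding_inputs` and its fixed-size word chunking are assumptions for illustration, not the site's actual chunking strategy.

```python
def build_embedding_inputs(title: str, body: str, max_words: int = 50) -> list[dict]:
    """Split page text into chunks of at most max_words words, each tagged with the page title."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = body.split()
    return [
        {"title": title, "text": " ".join(words[start:start + max_words])}
        for start in range(0, len(words), max_words)
    ]


# Tiny example: five words split into chunks of two.
page = build_embedding_inputs("Vector search", "one two three four five", max_words=2)
```

Each dictionary in `page` has the shape that `get_embedding` reads with `sentence.get("text")` and `sentence.get("title")`.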

Once we have our list of vectors from a document, we upsert them into Aerospike Vector Search.

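from aerospike_vector_search import Client

# NAMESPACE, VECTOR_SET, and VECTOR_FIELD are project-level constants defined elsewhere.
def update_vector_index(vector_client: Client, url: str, embeddings: list[list[float]]):
   # One record per chunk, keyed by the page URL plus the chunk's position.
   for idx, embedding in enumerate(embeddings):
       vector_client.upsert(
           namespace=NAMESPACE,
           set_name=VECTOR_SET,
           key=f"{url}___{idx}",
           record_data={VECTOR_FIELD: embedding}
       )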
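Because each page produces several records keyed as `url___idx`, a search can return more than one chunk from the same page. The sketch below shows one way to map keys back to URLs and keep the best hit per page. The helper names are hypothetical, and it works on plain `(key, distance)` tuples so it makes no assumptions about the client's result objects.

```python
KEY_SEPARATOR = "___"  # matches the f"{url}___{idx}" record keys


def split_vector_key(key: str) -> tuple[str, int]:
    """Recover the page URL and chunk index from a vector record key."""
    url, _, idx = key.rpartition(KEY_SEPARATOR)
    return url, int(idx)


def best_hit_per_url(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Collapse (key, distance) hits to one entry per URL, keeping the smallest distance."""
    best: dict[str, float] = {}
    for key, distance in hits:
        url, _ = split_vector_key(key)
        if url not in best or distance < best[url]:
            best[url] = distance
    return sorted(best.items(), key=lambda item: item[1])


# Made-up hits: two chunks from one page and one chunk from another.
hits = [
    ("https://aerospike.com/docs___0", 0.30),
    ("https://aerospike.com/blog___2", 0.20),
    ("https://aerospike.com/docs___3", 0.10),
]
pages = best_hit_per_url(hits)
```

Using `rpartition` splits on the last separator, so a URL that happens to contain underscores earlier in its path is still parsed correctly.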

We take the user's query, embed it with the same model, and then use the resulting vector to search our index.

vector_results = vector_client.vector_search(
    namespace=NAMESPACE,
    index_name="vector_idx",
    query=embedding,
    limit=count
)

The best part? Vectors capture the semantic meaning of our text, enabling the vector search to return high-quality results, even with typos, synonyms, and other fuzzy lookup issues that our keyword index couldn’t handle.
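A hybrid engine still has to merge the keyword hits with the vector hits. This post does not describe how the two lists are combined, so the sketch below uses reciprocal rank fusion (RRF), a common general-purpose technique, purely as an illustration. The function name, the constant `k=60`, and the sample paths are assumptions, not the site's actual ranking logic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up result lists, best match first.
keyword_hits = ["/docs/vector", "/docs/kv"]
vector_hits = ["/blog/search", "/docs/vector"]
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

A page that appears in both lists, like `/docs/vector` here, accumulates score from each and rises to the top, while pages found by only one method still make the final list.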

If you’d like to explore the full implementation yourself, check out the code here.

What’s next?

Having control over our own search implementation lets us provide a better overall experience and opens the door to more advanced applications that help web visitors explore and learn about Aerospike. Looking ahead, we plan to add more personalized search experiences by implementing retrieval-augmented generation (RAG), which combines relevant search results with large language models (LLMs).

Start learning more about Aerospike Vector Search by searching for it on aerospike.com.
