How to boost your database performance with OpenAI's new v3 embeddings
In late January of this year, OpenAI unveiled their latest v3 embedding models, marking a significant milestone. The announcement was noteworthy not only because OpenAI released two variants of the new model at different levels of complexity (text-embedding-3-small and text-embedding-3-large), but also because these models substantially outperform the previous flagship model text-embedding-ada-002 in terms of accuracy, especially on the multilingual MIRACL benchmark. Our previous post, "Introduction to Retrieval Augmented Generators", highlights the crucial role that semantic search for relevant content plays within the overall system. With these significant advancements in the semantic representation of content, notable progress can be achieved for RAG systems, particularly those involving non-English content.
Benchmark scores of OpenAI's v3 embeddings
As a bonus, the smaller of the two new models is also available at a significantly reduced price compared to its predecessor: at $0.00002 per 1k tokens, it costs just a fifth of the previous model. The larger flagship model, however, comes with a roughly 30% increase in cost ($0.00013 per 1k tokens).
Beyond the welcome cost reductions and performance improvements, the new embeddings introduce an additional feature: native support for shortening embeddings. The new models produce embeddings with 1536 and 3072 dimensions respectively. Thanks to this shortening feature, the embeddings can be reduced in length without losing their semantic integrity, as would happen with conventional embeddings that are simply truncated. This allows for a trade-off between embedding size and the level of accuracy that can be achieved. To demonstrate the capabilities of the new embedding models, benchmark values for the shortened embeddings have also been published. Both model variants achieve a higher MTEB score in their shortened form than their predecessor; remarkably, the larger model does so even when shortened to 256 dimensions.
Benchmark scores of OpenAI's v3 embeddings at different dimensions
But how does this work, and where does this ability to simply shorten the vectors' length come from?
"Matryoshka Embeddings"
Matryoshka dolls (generated by DALL-E)
The new v3 embedding models boast native support for embedding shortening, a feature made possible by training them with a novel technique known as "Matryoshka Representation Learning" (MRL). Introduced in the eponymous paper, this method trains a single high-dimensional vector to encode information at varying degrees of granularity, so that shorter prefixes of the vector remain meaningful representations on their own. The name is inspired by Russian Matryoshka dolls, where smaller dolls nest inside larger ones, one after another.
Schematic Representation of the Functioning of MRL (source: paper "Matryoshka Representation Learning")
If you're interested in understanding the intricacies of this process, the paper detailing it can be found here.
Practical note: To shorten embeddings, the OpenAI API introduces an additional dimensions parameter, which allows users to specify the desired number of dimensions to retain. The same operation can also be performed manually by simply truncating the vector, but in that case the embedding must be normalized afterward, a step that is handled automatically when the dimensions parameter is used. Doing it manually can nevertheless be advantageous in certain scenarios, as it allows you to keep an embedding in several lengths without repeatedly querying the API and incurring additional costs.
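As a minimal sketch of both options (assuming the official openai Python client v1 and numpy; the text and the 256-dimension target are placeholder values):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "Matryoshka embeddings encode information at multiple granularities."

# Option 1: let the API shorten (and re-normalize) the embedding
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=256,
)
short_embedding = response.data[0].embedding  # 256 values, already normalized

# Option 2: fetch the full embedding once, then truncate and normalize locally
response = client.embeddings.create(model="text-embedding-3-large", input=text)
full_embedding = np.array(response.data[0].embedding)  # 3072 values

truncated = full_embedding[:256]
truncated = truncated / np.linalg.norm(truncated)  # re-normalization is mandatory
```

Option 2 makes it possible to derive several lengths (for example 256 for a cheap first pass and 3072 for re-ranking) from a single API call.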
Operational Implications
What does it mean to choose between different lengths of embeddings? This choice significantly impacts the operational aspects of a Retrieval Augmented Generator (RAG) system, especially concerning the vector database. Beyond the linear increase in storage requirements as the size of the embeddings grows, there is also a need for larger and more complex indices, as well as significantly increased computational resources for similarity calculations. This can notably affect the performance of the vector database and, by extension, the entire system. Therefore, finding the optimal dimensions for embeddings extends beyond mere considerations of storage needs.
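To put the storage side into numbers: 1 million float32 embeddings at 3072 dimensions occupy roughly 3072 × 4 bytes × 1,000,000 ≈ 12.3 GB before any index overhead, while the same collection shortened to 512 dimensions needs only about 2 GB, and the cost of every single similarity calculation shrinks by the same factor of six.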
Does this mean we must choose between accuracy and performance? What if we want both?
Fast AND accurate retrieval?
Amid all these announcements about increased accuracy and the option to choose embedding lengths, the real game-changing feature might be overlooked. The ability to shorten embeddings does not merely force a choice between retrieval accuracy and database performance; rather, it offers the possibility to achieve BOTH! The key to this is 'adaptive retrieval'.
Note: This should not be confused with "adaptive RAG", where information is retrieved only when necessary; for more information on that concept, refer to this paper. The method discussed here was already introduced and recommended by the authors of the original paper on "Matryoshka Representation Learning", which is why we stick to the term 'adaptive retrieval' in this article.
How does 'adaptive retrieval' work?
Before delving into the specifics of 'adaptive retrieval', let's first consider the standard retrieval process. To find information relevant to a user query, the query is transformed using the same embedding model as the information in the vector database. A similarity search is then conducted within the database, typically using cosine similarity or, for normalized vectors, the equivalent but faster dot product. This similarity comparison must be performed against every entry in the database, so the computational effort increases with both the number of dimensions and the number of entries. Consequently, large embeddings, such as the 3072-dimensional vectors produced by text-embedding-3-large, can lead to significant performance degradation in a vector database with an extensive knowledge base.
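As a rough sketch of this baseline (assuming all stored embeddings and the query are L2-normalized, so the dot product equals cosine similarity):

```python
import numpy as np

def brute_force_search(query: np.ndarray, embeddings: np.ndarray, top_k: int = 10):
    """Exact (KNN) search: compare the query against every stored embedding.

    `embeddings` has shape (num_entries, num_dimensions); since all vectors
    are normalized, the dot product is equivalent to cosine similarity.
    """
    scores = embeddings @ query              # one dot product per database entry
    top_ids = np.argsort(-scores)[:top_k]    # indices of the most similar entries
    return top_ids, scores[top_ids]
```

The cost of this matrix-vector product grows with both the number of entries and the number of dimensions, which is exactly the scaling problem described above: vector databases mitigate the first factor with approximate (ANN) indices, and shortened embeddings attack the second.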
'Adaptive retrieval' leverages the ability to shorten embeddings in a two-step process that significantly reduces one of these factors at each step. In the first step, the number of database entries to be searched cannot yet be limited, but the shortening of embeddings can be exploited: searching with reduced dimensions decreases the complexity of every similarity calculation. Although accuracy is slightly reduced, this allows for a highly efficient pre-selection of potentially relevant entries from the database.
In the second step, a search is conducted within this pre-selected set using a higher number of dimensions, which restores accuracy. Because the number of elements to be searched has been drastically reduced, this step is much more efficient.
Schema of Adaptive Retrieval
Despite requiring an additional step, the significant reduction in complexity during both phases markedly enhances the overall performance of the retrieval process.
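A minimal sketch of the two passes (hypothetical helper names, reusing the brute-force idea from above; the default values of 512 dimensions and a shortlist of 80 anticipate the case study below):

```python
import numpy as np

def shorten(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Truncate Matryoshka-style embeddings and re-normalize them."""
    shortened = vectors[..., :dims]
    return shortened / np.linalg.norm(shortened, axis=-1, keepdims=True)

def adaptive_retrieval(query: np.ndarray, embeddings: np.ndarray,
                       first_pass_dims: int = 512, shortlist: int = 80,
                       top_k: int = 10) -> np.ndarray:
    # First pass: cheap pre-selection on shortened embeddings (all entries)
    short_db = shorten(embeddings, first_pass_dims)
    short_query = shorten(query, first_pass_dims)
    candidate_ids = np.argsort(-(short_db @ short_query))[:shortlist]

    # Second pass: re-rank only the shortlist with the full-length embeddings
    scores = embeddings[candidate_ids] @ query
    return candidate_ids[np.argsort(-scores)[:top_k]]
```

In a real system the first pass would run against an ANN index built on the shortened vectors rather than a full scan, but the two-stage structure stays the same.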
Case study: Supabase and adaptive retrieval
In this section, we aim to shed light on the tangible results attainable through the application of this technique in a practical setting. We reference a case study conducted by 'Supabase', and full acknowledgment for this section is due to the 'Supabase' team and their insightful blog post, which you can access here.
For the results presented in their blog, the 'dbpedia' dataset (available on HuggingFace), consisting of 1 million embeddings, was processed using OpenAI's text-embedding-3-large model. The database used was PostgreSQL with the pgvector extension.
For all subsequent accuracy evaluations, the following definition is used: a KNN search on the full-sized 3072-dimension vectors serves as the reference, and accuracy is the share of IDs returned by the ANN search that match those found by this KNN search.
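Expressed as a small helper (our own sketch of the metric as described, not Supabase's benchmarking code):

```python
def retrieval_accuracy(ann_ids, knn_ids) -> float:
    """Share of IDs returned by the ANN search that also appear in the
    reference result of the exact, full-dimensional KNN search."""
    return len(set(ann_ids) & set(knn_ids)) / len(knn_ids)

# 9 of the top-10 ANN results match the exact top 10 -> accuracy 0.9
print(retrieval_accuracy([1, 2, 3, 4, 5, 6, 7, 8, 9, 42],
                         [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```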
Accuracy implications of ANN
Since indices that trade minor accuracy losses for better performance are used in practice, the first step was to determine whether these inaccuracies are significant or negligible for the subsequent accuracy considerations.
The initial experiment examined the impact of this indexing (HNSW) and the approximation of the calculations on the results. For this purpose, both a KNN search and an ANN search were performed at 1536 dimensions. The KNN search achieved an accuracy of 89.5%, meaning that, compared to the KNN search with the full 3072 dimensions, 89.5% of identical entries were found in the top 10. The approximate ANN search achieved an accuracy of 89.2% (at 670 QPS). Given this minimal deterioration in results, these performance-enhancing approximations can safely be used in the further considerations.
One pass with reduced dimensionality
Given that approximations are viable, this experiment compared accuracies after the first pass to gauge the impact of dimensionality reduction. With 1536 dimensions, the accuracy drops to about 89%, meaning 9 out of 10 elements still match those found by the exact, full-dimensional KNN search. A reduction to 256 dimensions decreases the accuracy further to 59%, i.e. roughly 6 out of 10 matching search results.
One pass performance with shortened embeddings (source: Supabase blog)
Optimal first-pass dimensionality
As the previous test indicated, the accuracy of the search results in the first pass depends heavily on the chosen number of dimensions. To compensate for this and maintain a consistent accuracy level for the overall process, the number of pre-selected elements, from which the most relevant results are then chosen in the second pass at full resolution, must be increased. Finding the sweet spot for this "load distribution" between the first and second pass is therefore essential. Tests were conducted with embedding lengths from 256 to 768 dimensions in the first step; for the second step, a KNN search with the full 3072 dimensions was performed.
First pass optimization (source: Supabase blog)
It turned out that, while maintaining a constant accuracy of 99% for the overall process, the best performance in terms of QPS was achieved with 512 dimensions. To reach this accuracy for the final top 10 elements, the top 80 elements had to be retrieved in the first pass.
Final performance
Armed with the insights from this optimization, further tests were conducted with 512 dimensions in the first step. With the settings chosen for this test, a robust 580 QPS was achieved at an accuracy of 99%; by sacrificing 5% in accuracy, an impressive 700 QPS was reached.
Performance of final setup (source: Supabase blog)
Key Insights
Comparison Between 'Adaptive Retrieval' and 'Simple Retrieval'
Unfortunately, the blog post does not mention the throughput achievable with an ANN search at the full 3072 dimensions, which would be the natural benchmark for assessing the performance improvement offered by this method. We do know from the first test, however, that 670 QPS were reached with 1536 dimensions at an accuracy of 89%. The 'adaptive retrieval' approach achieved roughly 5% higher throughput and, at the same time, a higher accuracy of 94%. This comparison already indicates that the method improves both performance and accuracy, but drawing the comparison at the full embedding length, and thus at the highest accuracy level, would be intriguing: in that scenario, twice as many dimensions would have to be processed, likely resulting in a significantly reduced throughput for the single-stage process.
Dimension Granularities
Since the embedding models are trained with specific granularities (embedding lengths), it is not advisable to choose the number of dimensions arbitrarily. For lengths that were not explicitly considered during training, accuracy apparently can drop. The Supabase team speculates the following:
As of the time of writing, we don't yet know. But we do know that they were likely trained on at least the following granularities (based on their MTEB comparisons):
- text-embedding-3-small: 512 and 1536
- text-embedding-3-large: 256, 1024, and 3072
Functional Index
For implementing this method, the Supabase team utilized a feature available for pgvector called "functional index". A functional index is defined on the result of a function applied to one or more columns of a single table. This allows for the dynamic selection of dimensions used for indexing (employed in the first pass). Moreover, it eliminates the need to persist embeddings in varying lengths, requiring only the maximum desired length.
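As a sketch of what this can look like (following the pattern from the Supabase post; `documents` and `embedding` are placeholder names, and `sub_vector` is a user-defined SQL helper that truncates a pgvector value to the requested length, as described in their blog):

```python
import psycopg2

# Expression ("functional") index over the first 512 dimensions of the stored
# 3072-dimensional embeddings; only this shortened form is indexed and used
# in the first pass, while the full vectors remain available for re-ranking.
CREATE_INDEX_SQL = """
    create index on documents
    using hnsw ((sub_vector(embedding, 512)::vector(512)) vector_ip_ops);
"""

with psycopg2.connect("dbname=rag_demo") as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_INDEX_SQL)
```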
'Funnel Retrieval'
The two-step process introduced could theoretically be extended to additional stages. This is referred to as 'Funnel Retrieval' in the original MRL paper:
"Funnel thins out the initial shortlist by a repeated re-ranking and shortlisting with a series of increasing capacity representations. Funnel halves the shortlist size and doubles the representation size at every step of re-ranking."
However, since the first step is significantly more complex, Supabase provides the following assessment:
- The first pass is the most expensive, even with an index (as seen using explain analyze). It has to filter the entire table down to a much smaller number of records (relatively). This is why creating an index on the first pass is crucial.
- The second pass is very quick. Even at 3072d using KNN, the time taken to re-rank the initial shortlist is a small fraction of the time taken to complete the first pass.

So the performance gained by splitting the second pass into multiple passes is likely minimal. Finding ways to optimize the first pass will likely result in more gains.
Conclusion
The release of OpenAI's v3 Embedding Models in January 2024 marks a significant advancement in artificial intelligence, introducing two new model variants that outperform their predecessor in accuracy, especially in multilingual benchmarks. This leap forward is crucial for Retrieval Augmented Generator (RAG) systems, enhancing their ability to process non-English content effectively.
One of the most notable features of these models is their native support for embedding shortening, enabled by the novel "Matryoshka Representation Learning" technique. This feature offers unprecedented flexibility, allowing users to balance between embedding size and accuracy. This is particularly beneficial for RAG systems and vector databases, where operational efficiency and accuracy are paramount.
The concept of 'adaptive retrieval', demonstrated through the case study by 'Supabase', showcases how embedding shortening can significantly improve query throughput (queries per second, QPS) while maintaining high accuracy. This method leverages the shortening capability to optimize both the computational load and the accuracy of searches within extensive databases.
The insights from the 'Supabase' case study highlight the importance of considering dimension granularities, utilizing functional indexes for dynamic dimension selection, and the potential for extending the adaptive retrieval process. These findings underscore the sophisticated balance between technological innovation and practical application, paving the way for more efficient and effective AI-driven systems.
In summary, the developments brought forth by OpenAI's latest models, combined with the operational enhancements they enable, represent a transformative step in the realm of semantic search and retrieval. As we explore these technologies further, the integration of increased accuracy with improved performance through adaptive retrieval sets a new benchmark for the future of AI-driven information retrieval.
Further Reading
- How to test your RAG pipeline?
- Recursive Retrieval with llamaindex
- Increase RAG performance using ColBERT reranker
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages, such as:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use