Improving AI Search Accuracy: NVIDIA Strengthens RAG Pipeline with Reranking


Alvin Lang
July 30, 2024 18:19

NVIDIA has introduced reranking capabilities that improve the accuracy and relevance of AI-powered enterprise search results, strengthening both the RAG pipeline and semantic search.

According to the NVIDIA Technology Blog, in the rapidly evolving landscape of AI-based applications, reranking has emerged as a key technology for improving the accuracy and relevance of enterprise search results. Using advanced machine learning algorithms, reranking refines initial search results to better match user intent and context, significantly improving the effectiveness of semantic search.

The Role of Reranking in AI

Reranking plays a critical role in optimizing the retrieval-augmented generation (RAG) pipeline, ensuring that large language models (LLMs) work with the most relevant, high-quality information. This dual benefit, improving both semantic search and the RAG pipeline, makes reranking an indispensable tool for businesses looking to deliver superior search experiences and stay competitive in the digital marketplace.

What Is Reranking?

Reranking is a sophisticated technique that leverages an LLM’s advanced language-understanding capabilities to improve the relevance of search results. First, a set of candidate documents or passages is retrieved using traditional information-retrieval methods such as BM25 or vector similarity search. These candidates are then fed to the LLM, which analyzes the semantic relevance between the query and each document. The LLM assigns each document a relevance score and reorders the list so that the most relevant documents come first.

This process goes beyond simple keyword matching to understand the context and meaning of both the query and the documents, significantly improving the quality of search results. Reranking is typically applied as a second step after an initial fast retrieval phase, ensuring that only the most relevant documents are shown to the user. It can also combine results from multiple data sources and integrate into a RAG pipeline, further ensuring that the context is ideally aligned for a particular query.
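The two-stage flow can be summarized in a few lines of code. The sketch below is purely illustrative, not NVIDIA’s implementation: the token-overlap scorer is a toy stand-in for the LLM-based relevance model described above.

def score_relevance(query: str, doc: str) -> float:
    # Toy scorer: the fraction of query tokens that also appear in the
    # document. A real reranker replaces this with an LLM relevance model.
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / max(len(query_tokens), 1)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score every candidate against the query, then reorder so the most
    # relevant documents come first.
    ranked = sorted(candidates, key=lambda d: score_relevance(query, d), reverse=True)
    return ranked[:top_n]

# The candidates would come from a fast first-stage retriever
# (BM25 or vector similarity search).
candidates = ["Reranking refines search results.", "An unrelated document."]
print(rerank("how does reranking refine results", candidates, top_n=2))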

NVIDIA’s Reranking Implementation

In this post, the NVIDIA Tech Blog explains the use of the NVIDIA NeMo Retriever reranking NIM. This transformer encoder, a LoRA fine-tuned version of Mistral-7B, uses only the first 16 layers for higher throughput. The last embedding output of the decoder model is used as the pooling strategy, and a binary classification head is fine-tuned for the ranking task.

Visit the NVIDIA API Catalog to access the NVIDIA NeMo Retriever collection of world-class information retrieval microservices.
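As a rough sketch of how this microservice can be reached from LangChain (the model identifier below is an assumption based on the API Catalog, and a configured NVIDIA API key is assumed), the later snippets in this post presume a reranker object of this kind:

from langchain_nvidia_ai_endpoints import NVIDIARerank

# Assumed model id for the NeMo Retriever reranking NIM; requires an
# NVIDIA API key (e.g., via the NVIDIA_API_KEY environment variable).
reranker = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3")
reranker.top_n = 5  # keep only the five highest-scoring documents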

Combining Results from Multiple Data Sources

In addition to improving the accuracy of a single data source, reranking can be used to combine multiple data sources in a RAG pipeline. Consider a pipeline with data from a semantic store and a BM25 store. Each store is queried independently and returns results that it considers highly relevant; the role of reranking is then to determine the overall relevance across both result sets, as sketched below.
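The snippets below assume two result sets: docs from a semantic (vector) store and bm25_docs from a BM25 store. One hypothetical setup, using LangChain community retrievers rather than anything shown in the original post (documents is assumed to be a list of LangChain Document objects prepared earlier, and BM25Retriever needs the rank_bm25 package):

from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Semantic store: embed the documents and retrieve by vector similarity.
vectorstore = FAISS.from_documents(documents, NVIDIAEmbeddings())
docs = vectorstore.as_retriever().invoke(query)

# BM25 store: classic keyword-based retrieval over the same documents.
bm25_docs = BM25Retriever.from_documents(documents).invoke(query)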

The following example combines the earlier semantic search results with the BM25 results into combined_docs and passes them to the reranking NIM, which sorts them by relevance to the query.

# Merge the candidate sets from both stores...
all_docs = docs + bm25_docs

# ...then let the reranking NIM keep the five most relevant documents.
reranker.top_n = 5
combined_docs = reranker.compress_documents(query=query, documents=all_docs)

Connecting to the RAG Pipeline

In addition to using reranking on its own, you can add it to the RAG pipeline to further improve responses, ensuring that the most relevant chunks are used to augment the original query.

Here, the compression_retriever object from the previous step is connected to the RAG pipeline.
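One plausible construction, assuming the reranker and vectorstore objects from the earlier sketches, wraps the base retriever so that every retrieved chunk passes through the reranking NIM before reaching the LLM:

from langchain.retrievers import ContextualCompressionRetriever

# The reranking NIM acts as a document compressor: it filters and reorders
# the chunks returned by the base retriever.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(),
)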

from langchain.chains import RetrievalQA
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Build a question-answering chain whose retriever reranks chunks before
# they reach the LLM.
chain = RetrievalQA.from_chain_type(
    llm=ChatNVIDIA(temperature=0), retriever=compression_retriever
)
result = chain({"query": query})
print(result.get("result"))

The RAG pipeline now uses the correct top-ranked chunks and summarizes key insights.

Conclusion

RAG has emerged as a powerful approach that combines the strengths of LLMs and dense vector representations. With dense vector representations, RAG models can scale efficiently, making them well suited for large-scale enterprise applications such as multilingual customer-service chatbots and code-generation agents.

As LLMs continue to evolve, RAG will play an increasingly important role in driving innovation and delivering high-quality intelligent systems that can understand and produce human-like language.

When building a RAG pipeline, it is important to split documents into appropriately sized chunks for the vector store, optimizing chunk size for the specific content, and to select LLMs with suitable context lengths. In some cases, complex chains of multiple LLMs may be required. To optimize RAG performance and measure success, use a robust set of evaluators and metrics.
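As a small illustration of the chunking step (the splitter choice and the 512/64 values below are assumptions for demonstration, not recommendations from the post):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk size and overlap should be tuned to the content and to the
# context length of the chosen LLM.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)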

For more information on additional models and chains, see NVIDIA AI LangChain Endpoint.

Image source: Shutterstock

