According to the NVIDIA Technology Blog, in the rapidly evolving landscape of AI-based applications, reranking has emerged as a key technology for improving the accuracy and relevance of enterprise search results. Using advanced machine learning algorithms, reranking refines initial search results to better match user intent and context, significantly improving the effectiveness of semantic search.
The Role of Reranking in AI
Reranking plays a critical role in optimizing your retrieval-augmented generation (RAG) pipeline, ensuring that your large language models (LLMs) work with the most relevant and high-quality information. This dual benefit of reranking, which improves both semantic search and your RAG pipeline, makes it an indispensable tool for businesses looking to deliver superior search experiences and stay competitive in the digital marketplace.
What is reranking?
Reranking is a sophisticated technique that leverages the advanced language understanding capabilities of LLMs to enhance the relevance of search results. Initially, a set of candidate documents or passages is retrieved using traditional information retrieval methods such as BM25 or vector similarity search. These candidates are then fed into an LLM, which analyzes the semantic relevance between the query and each document. By assigning a relevance score to each candidate, the LLM can reorder the documents so that the most relevant ones are prioritized.
This process goes beyond simple keyword matching to understand the context and meaning of the query and the documents, significantly improving the quality of search results. Reranking is typically applied as a second step after an initial fast retrieval phase, ensuring that only the most relevant documents are shown to the user. Results from multiple data sources can also be combined and integrated into the RAG pipeline, further ensuring that the context is ideally aligned for a particular query.
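As a rough illustration of this two-stage flow, the following sketch uses hypothetical first_stage_retrieve and score_relevance functions standing in for the fast retriever and the LLM-based relevance model; only the retrieve-then-rerank structure is meant to mirror the description above.

# Minimal sketch of the retrieve-then-rerank pattern. `first_stage_retrieve`
# and `score_relevance` are hypothetical placeholders: the first stands in for
# a fast retriever (BM25 or vector similarity), the second for the LLM-based
# relevance model described above.
def rerank(query, candidates, score_relevance, top_n=5):
    # Score each candidate against the query, then keep the highest-scoring documents.
    scored = [(score_relevance(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Usage: fast first-stage retrieval followed by LLM-based reranking.
# candidates = first_stage_retrieve(query, k=100)
# top_docs = rerank(query, candidates, score_relevance)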
NVIDIA’s Reranking Implementation
In this post, the NVIDIA Tech Blog explains the use of the NVIDIA NeMo Retriever reranking NIM. The model is a transformer encoder: a LoRA fine-tuned version of Mistral-7B that uses only the first 16 layers for higher throughput. The last embedding output of the decoder model serves as the pooling strategy, and a binary classification head is fine-tuned for the ranking task.
Visit the NVIDIA API Catalog to access the NVIDIA NeMo Retriever collection of world-class information retrieval microservices.
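The code examples later in this post refer to a reranker object whose construction is not reproduced in the excerpt. A minimal sketch, assuming the langchain-nvidia-ai-endpoints package is installed and an NVIDIA API key is available (typically via the NVIDIA_API_KEY environment variable), might look like this:

# Sketch: instantiating the reranking NIM client through LangChain's NVIDIA AI
# endpoints integration. The later snippets in this post use this `reranker` object.
from langchain_nvidia_ai_endpoints import NVIDIARerank

reranker = NVIDIARerank()  # uses the integration's default hosted reranking model unless one is specified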
Combine results from multiple data sources
In addition to improving the accuracy of a single data source, reranking can be used to combine multiple data sources in a RAG pipeline. Consider a pipeline with data from a semantic store and a BM25 store. Each store is queried independently and returns results that it considers highly relevant; the role of reranking is then to determine the overall relevance across the combined results.
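The docs and bm25_docs variables used below represent the results of those two independent queries. Their construction is not shown in the excerpt; a minimal sketch, assuming a vectorstore and a list of split documents all_splits from earlier pipeline steps, could look like this:

# Sketch: querying a semantic (vector) store and a BM25 store independently.
# `vectorstore`, `all_splits`, and `query` are assumed to exist from earlier
# steps of the pipeline; BM25Retriever additionally requires the rank_bm25 package.
from langchain_community.retrievers import BM25Retriever

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
bm25_retriever = BM25Retriever.from_documents(all_splits)
bm25_retriever.k = 10

docs = semantic_retriever.invoke(query)   # semantic search results
bm25_docs = bm25_retriever.invoke(query)  # keyword (BM25) results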
The following code example combines the previous semantic search results with the BM25 results. The combined set, combined_docs, is then ordered by its relevance to the query by the reranking NIM.
# Merge the candidates from both stores and let the reranker order them.
all_docs = docs + bm25_docs

# Keep only the five most relevant documents across both sources.
reranker.top_n = 5

combined_docs = reranker.compress_documents(query=query, documents=all_docs)
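Here, compress_documents scores each candidate against the query and returns the documents reordered by relevance, with top_n limiting the output to the five best matches across both sources.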
Connect to the RAG pipeline
In addition to using reranking independently, you can further improve the response by adding it to the RAG pipeline to ensure that the most relevant chunks are used to augment the original query.
In this case, connect the compression_retriever object from the previous step to the RAG pipeline.
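That previous step is not reproduced in the excerpt; a minimal sketch of how compression_retriever might be constructed, assuming the reranker and semantic_retriever from the earlier sketches, is:

# Sketch: wrap the base retriever with the reranker so retrieved chunks are
# reranked before being passed to the LLM. `reranker` and `semantic_retriever`
# come from the earlier sketches.
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=semantic_retriever,
)

With compression_retriever in place, the chain below wires it into a RetrievalQA pipeline: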
from langchain.chains import RetrievalQA
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Build a RetrievalQA chain that retrieves (and reranks) chunks before answering.
chain = RetrievalQA.from_chain_type(
    llm=ChatNVIDIA(temperature=0),
    retriever=compression_retriever,
)

result = chain({"query": query})
print(result.get("result"))
The RAG pipeline now uses the correct top-ranked chunks and summarizes the key insights:
The A100 GPU is used for training the 7B model in the supervised fine-tuning/instruction tuning ablation study. The training is performed on 16 A100 GPU nodes, with each node having 8 GPUs. The training hours for each stage of the 7B model are: projector initialization: 4 hours; visual language pre-training: 30 hours; and visual instruction-tuning: 6 hours. The total training time corresponds to 5.1k GPU hours, with most of the computation being spent on the pre-training stage. The training time could potentially be reduced by at least 30% with proper optimization. The high image resolution of 336 × 336 used in the training corresponds to 576 tokens/image.
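(As a quick check, these figures are internally consistent: 16 nodes × 8 GPUs = 128 GPUs and 4 + 30 + 6 = 40 hours of training, giving 128 × 40 = 5,120 GPU hours, which matches the reported 5.1k.)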
Conclusion
RAG has emerged as a powerful approach that combines the strengths of LLMs and dense vector representations. With dense vector representations, RAG models can scale efficiently, making them suitable for large-scale enterprise applications such as multilingual customer service chatbots and code-generating agents.
As LLMs continue to evolve, RAG will play an increasingly important role in driving innovation and delivering high-quality intelligent systems that can understand and produce human-like language.
When building a RAG pipeline, it is important to split the documents in the vector store into well-sized chunks, optimizing the chunk size for the specific content and selecting LLMs with appropriate context lengths. In some cases, complex chains of multiple LLMs may be required. To optimize RAG performance and measure success, a robust set of evaluators and metrics is needed.
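As one concrete illustration of the chunking step, a common approach with LangChain is a recursive character splitter whose parameters are tuned to the content; the values below are illustrative assumptions, not recommendations from the original post.

# Sketch: splitting loaded documents into chunks before indexing them in the
# vector store. `documents` is assumed to have been loaded earlier; chunk_size
# and chunk_overlap should be tuned to the content and the LLM's context length.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
)
all_splits = text_splitter.split_documents(documents)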
For more information on additional models and chains, see NVIDIA AI LangChain Endpoint.