The advent of large language models (LLMs) has brought significant benefits to the AI industry, providing versatile tools that can generate human-like text and handle a wide range of tasks. However, while LLMs demonstrate impressive general knowledge, their out-of-the-box performance in specialized fields such as veterinary medicine is limited. To enhance their utility in specific domains, the industry typically adopts two main strategies: fine-tuning and retrieval-augmented generation (RAG).
Fine-Tuning vs. RAG
Fine-tuning involves training models on carefully curated and structured datasets, requiring significant hardware resources and the involvement of domain experts, a process that is often time-consuming and expensive. Unfortunately, in many fields, accessing domain experts in a way that is compatible with business constraints is prohibitively difficult.
In contrast, RAG involves building a comprehensive corpus of domain literature along with an effective retrieval system that extracts the text chunks most relevant to a user query. By adding this retrieved information to the user query, the LLM can generate better answers. This approach still requires subject matter experts to curate the best sources for the dataset, but it is easier to manage and more business-friendly than fine-tuning. In addition, since it does not require extensive training of the model, it is less computationally intensive and more cost-effective.
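The RAG flow described above can be sketched with a toy example. Here `retrieve_chunks` is a deliberately naive word-overlap retriever and the prompt template is illustrative; a real system would use embeddings and a tuned prompt.

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve_chunks(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank corpus chunks by word overlap with the query."""
    qw = words(query)
    return sorted(corpus, key=lambda c: len(qw & words(c)), reverse=True)[:top_k]

def build_augmented_prompt(query: str, corpus: list[str]) -> str:
    """Prepend the retrieved context to the user query before calling the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve_chunks(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Hyperthyroidism in cats may cause weight loss with increased appetite.",
    "Vaccination schedules vary by region.",
    "Weight loss in cats can indicate gastrointestinal disease.",
]
print(build_augmented_prompt("Why is my cat losing weight?", corpus))
```

The augmented prompt is then sent to the LLM in place of the bare query.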
NVIDIA NIM and NLP Pipelines
NVIDIA NIM simplifies the design of NLP pipelines built on LLMs. These microservices streamline the deployment of generative AI models across platforms, allowing teams to self-host LLMs while providing a standard API for building applications.
NIM abstracts away model inference internals, such as the execution engine and runtime operations, and ensures optimal performance using backends such as TensorRT-LLM and vLLM. Key features include:
- Scalable deployment
- Support for various LLM architectures through optimized engines
- Flexible integration into existing workflows
- Enterprise-grade security with safetensors support and continuous CVE monitoring
Developers can run NIM microservices with Docker and perform inference using the API. They can also modify the container command to use trained model weights specialized for specific tasks, such as document parsing.
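NIM microservices expose an OpenAI-compatible HTTP API once the container is running. The sketch below only builds the request body for a chat-completion call; the model name, host, and port are assumptions for illustration.

```python
# Build the JSON body for an OpenAI-compatible /v1/chat/completions call
# against a self-hosted NIM container. The model name and endpoint below
# are illustrative assumptions.
import json

def build_chat_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }

body = build_chat_request("meta/llama-3.1-70b-instruct",
                          "Summarize this veterinary lab report.")
# With a NIM container listening on localhost:8000, this body could be
# POSTed to http://localhost:8000/v1/chat/completions.
print(json.dumps(body, indent=2))
```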
Reimagining veterinary care using AI
AITEM is part of the NVIDIA Inception program for startups, and its collaboration with NVIDIA has focused on AI-based solutions across a range of industries and the life sciences. In the veterinary field, AITEM is developing LAIKA, an innovative AI copilot designed to support veterinarians by processing patient data and providing diagnostic suggestions, guidance, and explanations.
LAIKA integrates multiple LLMs and a RAG pipeline. The RAG component retrieves relevant information from a curated dataset of veterinary resources. During the preparation phase, each resource is divided into chunks, and an embedding is computed for each chunk and stored in the RAG database. During the inference phase, a query is preprocessed, its embedding is computed, and that embedding is compared to the embeddings in the RAG database using a geometric distance metric. The closest matches are selected as the most relevant and used to generate a response.
Because the RAG database may contain redundant content, multiple retrieved chunks can carry the same information, which limits the diversity of concepts provided to the answering system. To address this, LAIKA uses the Maximal Marginal Relevance (MMR) algorithm to minimize chunk redundancy and ensure a wider range of relevant information.
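A minimal MMR sketch: each step picks the chunk that balances relevance to the query against similarity to the chunks already selected, so duplicates are penalized. The word-overlap `sim` function is a toy stand-in for embedding similarity, and the lambda value is illustrative.

```python
def sim(a: str, b: str) -> float:
    """Toy Jaccard word-overlap similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr(query: str, chunks: list[str], k: int = 2, lam: float = 0.6) -> list[str]:
    """Greedily select k chunks trading off relevance vs. redundancy."""
    selected: list[str] = []
    candidates = list(chunks)
    while candidates and len(selected) < k:
        def score(c: str) -> float:
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * sim(query, c) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

chunks = [
    "weight loss in cats causes",
    "weight loss in cats causes",   # duplicate information
    "hyperthyroidism increases appetite in cats",
]
# The duplicate is skipped in favor of the novel chunk:
print(mmr("cats weight loss", chunks))
```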
NVIDIA NeMo Retriever Re-Ranking NIM Microservices
The NVIDIA API Catalog includes the NeMo Retriever NIM microservices, which enable organizations to seamlessly connect custom models to a variety of business data and provide highly accurate responses. The NVIDIA Retrieval QA Mistral 4B Rerank NIM microservice is designed to assess the probability that a given text passage contains information relevant to answering a user query. Integrating this model into a RAG pipeline lets you filter out retrieved chunks that fail the reranking model's evaluation, ensuring that only the most relevant and accurate information is used.
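The filtering step can be sketched as follows. The chunk texts, logit scores, and threshold here are illustrative, not actual reranker output.

```python
def filter_by_rerank(scored_chunks, top_k=5, min_logit=0.0):
    """Keep at most top_k chunks whose rerank logit clears min_logit,
    highest score first."""
    ranked = sorted(scored_chunks, key=lambda cs: cs[1], reverse=True)
    return [(c, s) for c, s in ranked if s >= min_logit][:top_k]

scored = [
    ("gastrointestinal causes of weight loss", 3.31),
    ("pancreatitis differential in cats", 2.32),
    ("breed counts in the study cohort", -10.33),
    ("owner-reported signs overview", -5.01),
]
# Only the two positively scored chunks survive:
print(filter_by_rerank(scored, top_k=5, min_logit=0.0))
```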
To evaluate the impact of this step on the RAG pipeline, AITEM designed the following experiment:
1. Extract a dataset of approximately 100 anonymized questions from LAIKA users.
2. Run the current RAG pipeline to retrieve chunks for each question.
3. Sort the retrieved chunks by the probabilities produced by the reranking model.
4. Evaluate each chunk for relevance to the query.
5. Analyze the probability distribution of the reranking model against the relevance labels from step 4.
6. Compare the chunk rankings from step 3 with the relevance labels from step 4.
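The sorting-and-comparison part of the experiment can be sketched on toy data: sort chunks by rerank logit, then check where the relevant and irrelevant ones land in the ranking. The logits and labels below are synthetic.

```python
def rank_positions(logits, labels):
    """After sorting by descending logit, return the 1-based rank positions
    of relevant (True) and irrelevant (False) chunks."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    good = [r + 1 for r, i in enumerate(order) if labels[i]]
    bad = [r + 1 for r, i in enumerate(order) if not labels[i]]
    return good, bad

logits = [3.3, 2.3, -5.0, 2.2, -7.4, -10.3]
labels = [True, True, False, True, False, False]  # relevance judgments
good, bad = rank_positions(logits, labels)
print(good, bad)  # relevant chunks occupy the top ranks
```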
LAIKA user questions vary considerably in form. Some include a detailed description of a clinical situation without asking a specific question; others pose precise questions; and others seek guidance or a differential diagnosis based on clinical cases or analysis documents.
Because of the large number of chunks per question, AITEM used the Llama 3.1 70B Instruct NIM microservice, which is also available in the NVIDIA API Catalog, for evaluation.
To better understand the performance of the reranking model, we examined specific queries and model responses in detail. Table 1 highlights the top- and bottom-ranked chunks for a sample query related to the differential diagnosis of a cat with weight loss.
| Text | Rerank logit |
| --- | --- |
| Causes of weight loss that can be particularly difficult to diagnose include gastrointestinal conditions that don't cause vomiting, intestinal conditions that don't cause vomiting or diarrhea, and liver disease. | 3.3125 |
| Differential diagnosis for nonspecific signs such as anorexia, weight loss, vomiting, and diarrhea… Acute pancreatitis is rare in cats,… and signs are nonspecific and poorly defined (anorexia, lethargy, weight loss). | 2.3222 |
| Severe weight loss (with or without increased appetite) may be seen in cancerous cachexia and maldigestion/malabsorption. Some conditions, such as hyperthyroidism in cats, may cause increased appetite. However, a normal appetite does not exclude the presence of a serious condition. | 2.2265 |
| Overall, weight loss was the most common symptom, with little difference between groups. | -5.0078 |
| Other owner complaints include lethargy, loss of appetite, weight loss, and vomiting. | -7.3672 |
| There were six British Shorthairs, four European Shorthairs, and one Bengal cat. The clinical signs reported by the owners were: decreased appetite or anorexia… | -10.3281 |
Figure 4 compares the reranking model's output distribution (in logits) for relevant (good) versus irrelevant (bad) chunks. Good chunks score higher than bad chunks, and a t-test confirms that the difference is statistically significant, with a p-value below 3e-72.
Figure 5 shows the distribution of rank positions produced by the reranking. Good chunks mostly land in the top positions, while bad chunks land in the bottom ones. A Mann-Whitney test confirms that this difference is statistically significant, with a p-value below 9e-31.
Figure 6 shows the rank distribution and helps define an effective cutoff point. Most chunks in the top 5 positions are good, while most chunks in positions 11 to 15 are bad. Keeping only the top 5 retrieved chunks, or another chosen number, is therefore an effective way to exclude most bad chunks.
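Choosing such a cutoff can be sketched by bucketing ranked chunks by position and measuring the fraction labelled relevant in each bucket. The labels below are synthetic, mimicking the pattern reported above (top positions mostly good, positions 11 to 15 mostly bad).

```python
def good_fraction_by_bucket(labels_in_rank_order, bucket_size=5):
    """labels_in_rank_order[i] is True if the chunk at rank i+1 is relevant.
    Returns the fraction of relevant chunks in each consecutive bucket."""
    fractions = []
    for start in range(0, len(labels_in_rank_order), bucket_size):
        bucket = labels_in_rank_order[start:start + bucket_size]
        fractions.append(sum(bucket) / len(bucket))
    return fractions

# Synthetic relevance labels for 15 ranked chunks:
labels = [True] * 4 + [False] + [True] * 2 + [False] * 3 + [False] * 5
print(good_fraction_by_bucket(labels))  # [0.8, 0.4, 0.0]
```

A cutoff after the first bucket keeps mostly good chunks while dropping the bad tail.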
By pairing a lightweight embedding model with the NVIDIA reranking NIM microservice, retrieval accuracy can be improved while optimizing the retrieval pipeline and minimizing ingestion costs. Execution time improves by up to 1.75x (Figure 7).
Better Answers with NVIDIA Reranking NIM Microservices
The results show that adding the NVIDIA reranking NIM microservice to the LAIKA RAG pipeline has a positive impact on the relevance of the retrieved chunks. By delivering more accurate and specialized information to the downstream answering LLM, it provides the model with the knowledge needed for highly specialized fields such as veterinary medicine.
The NVIDIA reranking NIM microservice, available in the NVIDIA API Catalog, simplifies adoption: models are easy to pull, run, and query through APIs. It comes pre-quantized and optimized with NVIDIA TensorRT for a wide range of platforms, removing the burden of setup and manual optimization.
For more information and latest updates on LAIKA and other AITEM projects, visit AITEM Solutions and follow LAIKA and AITEM on LinkedIn.