Crypto Flexs
ADOPTION NEWS

Anyscale, exploring direct preference optimization using synthetic data

By Crypto Flexs, August 22, 2024 (3 min read)

Felix Pinkston
22 Aug 2024 03:00

Anyscale’s latest blog post takes a deep dive into Direct Preference Optimization (DPO) with synthetic data, covering its methodology and applications in language model tuning.





According to Anyscale, Direct Preference Optimization (DPO) has emerged as a prominent method for tuning language models so that their output aligns with human preferences. The company’s latest blog post provides an in-depth case study on applying DPO with synthetic data, specifically for summarization tasks.

Synthetic data generation

Synthetic data generation has become a powerful technique for creating high-quality datasets. Anyscale’s approach uses AI models as data augmenters and judges to improve subsequent models. The blog post describes a detailed pipeline for synthetic data generation and highlights the utility of Ray Data and vLLM for scaling and rapid experimentation.

DPO Training and Insights

Direct Preference Optimization (DPO) is a widely adopted algorithm for preference tuning because it offers a good trade-off between complexity and effectiveness. Anyscale has integrated DPO into its LLM product suite, allowing users to build preference-tuned models through an intuitive API. The blog post covers modeling insights and experiments conducted with DPO for summarization.

Evaluation

Anyscale uses Ray Data and vLLM for batch inference to evaluate the generated summaries at scale. Evaluation is essential for determining model quality, and Anyscale emphasizes the importance of task-specific evaluations that align with the training objective. The blog post provides key details on setting up preference functions for effective evaluation.

Comparison with supervised fine-tuning

The blog post contrasts DPO with traditional supervised fine-tuning (SFT). SFT relies on collecting high-quality data and closely imitating the desired behavior, while preference tuning focuses on which of two responses is preferred. Because preference data can be generated at scale and collected on-policy (that is, sampled from the model being trained), this approach directly targets model-specific issues.

Case Study: Summarization

The case study applies DPO to the Mistral-7B-Instruct-v0.1 model to summarize CNN articles. Anyscale designed a synthetic summarization preference dataset, using a synthetic judge to reduce costs and keep training and evaluation consistent. The preference function scores summaries by combining word count minimization with Q&A accuracy.
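A preference function of this kind can be sketched as follows. This is a minimal illustration, not Anyscale's actual implementation: the `ScoredSummary` type, the `prefer` function, the 0.6 accuracy threshold, and the rule "if both summaries are accurate enough, the shorter one wins; otherwise the more accurate one wins" are all assumptions made here to show how word count and Q&A accuracy might be combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredSummary:
    """A summary plus the judge's Q&A result (hypothetical structure)."""
    text: str
    num_correct: int    # questions answered correctly using only this summary
    num_questions: int  # total multiple-choice questions for the article

    @property
    def accuracy(self) -> float:
        return self.num_correct / self.num_questions

    @property
    def word_count(self) -> int:
        return len(self.text.split())


# Hypothetical cutoff; the source does not state a value.
ACCURACY_THRESHOLD = 0.6


def prefer(a: ScoredSummary, b: ScoredSummary,
           threshold: float = ACCURACY_THRESHOLD) -> Optional[ScoredSummary]:
    """Return the preferred summary, or None when the pair is a tie."""
    if a.accuracy >= threshold and b.accuracy >= threshold:
        # Both are accurate enough: prefer the more compressed summary.
        if a.word_count == b.word_count:
            return None
        return a if a.word_count < b.word_count else b
    # At least one is below the threshold: prefer the more accurate summary.
    if a.accuracy == b.accuracy:
        return None
    return a if a.accuracy > b.accuracy else b
```

Pairs for which `prefer` returns a summary could then be stored as (chosen, rejected) training examples, while ties would simply be skipped.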

Data generation

Anyscale used the Mistral-7B-Instruct-v0.1 model to generate on-policy data for summarization. The process involved generating multiple summaries for each article and using the Llama-3-70B-Instruct model to create multiple-choice questions about the original text and then answer them. This method produced a variety of outputs and allowed them to be scored for accuracy.

DPO Training

Anyscale implemented DPO in its LLM post-training offering, allowing users to configure hyperparameters and compute resources for their training runs. The blog post provides a detailed example of a DPO training configuration and highlights the importance of the β hyperparameter and efficient training using Ray.

Evaluation

The evaluation computed the win rate of each model, comparing the DPO-trained model with the original model and other baselines. The results showed that DPO struck a better balance between accuracy and compression, and outperformed the SFT and GPT-4o baselines.
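To make the role of β concrete, the standard per-example DPO loss can be written in a few lines. This is a sketch of the published DPO objective, not Anyscale's training code; the function names and the default `beta=0.1` are illustrative assumptions, and the inputs are summed sequence log-probabilities that a real trainer would compute from the policy and a frozen reference model.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = chosen_logratio - rejected_logratio
    # -log(sigmoid(z)) equals softplus(-z).
    return _softplus(-beta * margin)
```

When the policy equals the reference model, the margin is zero and the loss is log 2 regardless of β. For a fixed non-zero margin, a larger β scales the same log-ratio difference more strongly, which is one reason the loss is sensitive to this hyperparameter.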
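A win rate of this kind can be computed from pairwise comparisons against a baseline. The sketch below is illustrative only: the outcome labels and the choice to count a tie as half a win are assumptions, since the article does not say how ties were handled.

```python
from typing import Iterable


def win_rate(outcomes: Iterable[str]) -> float:
    """Fraction of comparisons won by the candidate model.

    Each outcome is 'win', 'loss', or 'tie' from the candidate's point of
    view. Ties count as half a win (an assumption, not from the source).
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    scores = [points[outcome] for outcome in outcomes]
    if not scores:
        raise ValueError("win_rate needs at least one comparison")
    return sum(scores) / len(scores)
```

For example, `win_rate(["win", "win", "loss", "tie"])` evaluates to 0.625.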

Insights and Challenges

Anyscale reports key insights into DPO training, including the critical role of the β and learning rate hyperparameters. The blog post also discusses failure modes such as long, off-topic endings and gibberish tokens, emphasizing the need for careful hyperparameter tuning and monitoring.

Iterative on-policy training

The blog post suggests iterative on-policy training as a way to improve DPO performance. By regenerating training data with the fine-tuned model and applying additional rounds of DPO, Anyscale reports significant performance gains, making DPO competitive with traditional RLHF methods.

For the full case study and methodology, refer to Anyscale’s original post.
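The iterative procedure can be outlined as a simple loop. This is a structural sketch only: `build_pairs` and `train_dpo` are hypothetical callables standing in for the data generation and training steps, and the default of two rounds is an arbitrary assumption.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
# (prompt, chosen response, rejected response)
PreferencePair = Tuple[str, str, str]


def iterative_dpo(
    model: Model,
    prompts: List[str],
    build_pairs: Callable[[Model, List[str]], List[PreferencePair]],
    train_dpo: Callable[[Model, List[PreferencePair]], Model],
    rounds: int = 2,
) -> Model:
    """Run several DPO rounds, regenerating data from the latest model."""
    for _ in range(rounds):
        # Sample and judge fresh responses from the current model so the
        # preference data stays on-policy.
        pairs = build_pairs(model, prompts)
        # Fine-tune on the new pairs; the result seeds the next round.
        model = train_dpo(model, pairs)
    return model
```

Each round trains on responses sampled from the most recent checkpoint rather than from the original model, which is what keeps the data on-policy.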


