Together AI has announced the release of Together Inference Engine 2.0, which includes new Turbo and Lite endpoints. The new inference stack is designed to deliver significantly faster decoding throughput and stronger performance than existing open-source and commercial solutions.
Performance Improvement
According to together.ai, Together Inference Engine 2.0 delivers 4x faster decoding throughput than open-source vLLM and is 1.3x to 2.5x faster than commercial solutions such as Amazon Bedrock, Azure AI, Fireworks, and Octo AI. The engine achieves over 400 tokens per second on Meta Llama 3 8B, thanks to advances in FlashAttention-3, faster GEMM and MHA kernels, quality-preserving quantization, and speculative decoding.
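To give a sense of how speculative decoding produces more than one token per target-model step, here is a minimal, self-contained sketch of the greedy variant. The draft and target "models" below are toy stand-in functions, not Together's actual speculators, so the control flow is runnable as-is:

```python
# Toy "models": in a real engine these are neural networks; here they are
# simple deterministic functions so the sketch runs without dependencies.
VOCAB = list("abcde")

def draft_model(context: str) -> str:
    # Cheap draft model: proposes the next token quickly.
    return VOCAB[len(context) % len(VOCAB)]

def target_model(context: str) -> str:
    # Accurate target model: agrees with the draft most of the time,
    # disagreeing at every 7th position to exercise the rejection path.
    if len(context) % 7 == 0:
        return VOCAB[(len(context) + 2) % len(VOCAB)]
    return VOCAB[len(context) % len(VOCAB)]

def speculative_decode(prompt: str, num_tokens: int, gamma: int = 4) -> str:
    """Greedy speculative decoding: the draft proposes up to `gamma` tokens,
    the target checks them (in a real engine, in one batched forward pass,
    which is where the speedup comes from), and the longest agreeing prefix
    is kept, plus one corrected token on a mismatch."""
    out, produced = prompt, 0
    while produced < num_tokens:
        # 1. Draft proposes a short continuation autoregressively (cheap).
        proposal, ctx = [], out
        for _ in range(min(gamma, num_tokens - produced)):
            tok = draft_model(ctx)
            proposal.append(tok)
            ctx += tok
        # 2. Target verifies each proposed position.
        ctx = out
        for tok in proposal:
            expected = target_model(ctx)
            if expected == tok:
                ctx += tok          # accepted: keep the draft token
                produced += 1
            else:
                ctx += expected     # rejected: take the target's token...
                produced += 1
                break               # ...and discard the rest of the draft
        out = ctx
    return out

print(speculative_decode("seed: ", 16))
```

When the draft agrees with the target often, most iterations accept several tokens for a single verification pass, which is how engines of this kind raise decoding throughput without changing the output distribution of the greedy target model.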
New Turbo and Lite Endpoints
Together AI is introducing the new Turbo and Lite endpoints starting with Meta Llama 3. The endpoints are designed to balance performance, quality, and cost, so enterprises do not have to trade one off against the others. Together Turbo closely matches the quality of full-precision FP16 models, while Together Lite provides the most cost-effective and scalable Llama 3 models available.
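Developers reach these endpoints through Together's standard API. Below is a minimal sketch using the official `together` Python SDK; the model identifier is illustrative, so check Together's published model list for the exact Turbo and Lite names:

```python
from together import Together

client = Together()  # reads the TOGETHER_API_KEY environment variable

# Model ID is illustrative; substitute the exact Turbo or Lite name
# from Together's published model list.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "In one sentence, what is FP8 quantization?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```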
Turbo endpoints deliver fast FP8 inference while maintaining quality consistent with the FP16 reference models, and they outperform other FP8 solutions on AlpacaEval 2.0. Turbo endpoints are priced at $0.88 per million tokens for Llama 3 70B and $0.18 per million tokens for Llama 3 8B, making them significantly cheaper than GPT-4o.
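As a back-of-envelope illustration of what those prices mean in practice (the workload size below is a made-up example, not a figure from the announcement):

```python
# Published Together Turbo prices, in USD per 1M tokens.
TURBO_PRICE_PER_M = {
    "Llama 3 70B Turbo": 0.88,
    "Llama 3 8B Turbo": 0.18,
}

def monthly_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Cost in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_million

# Hypothetical workload: 200M tokens per month.
TOKENS = 200_000_000
for model, price in TURBO_PRICE_PER_M.items():
    print(f"{model}: ${monthly_cost(TOKENS, price):,.2f}/month")
# -> Llama 3 70B Turbo: $176.00/month
# -> Llama 3 8B Turbo: $36.00/month
```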
Together Lite endpoints use INT4 quantization to provide high-quality models at the lowest cost: Llama 3 8B Lite is priced at $0.10 per million tokens, 6x cheaper than GPT-4o-mini.
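For readers unfamiliar with INT4 quantization, the sketch below shows the generic idea: weights are rounded to 4-bit integers with a per-group scale, shrinking memory and bandwidth at a small cost in precision. This is a standard absmax scheme shown for illustration only; Together's quality-preserving recipe is not public and will differ in its details:

```python
import numpy as np

def quantize_int4_absmax(w: np.ndarray, group_size: int = 64):
    """Symmetric absmax INT4 quantization with one scale per group of
    weights. Values are mapped into the signed range [-7, 7]."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights from INT4 codes and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_int4_absmax(weights)
recovered = dequantize(q, scale)
print(f"mean absolute quantization error: {np.abs(weights - recovered).mean():.4f}")
```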
Adoption and Approval
More than 100,000 developers and companies, including Zomato, DuckDuckGo, and The Washington Post, are already leveraging the Together Inference Engine for their generative AI applications. Rinshul Chandra, COO of Food Delivery at Zomato, praised the engine for its high quality, speed, and accuracy.
Technological Innovation
Together Inference Engine 2.0 incorporates several technological advances, including FlashAttention-3, custom speculators (the draft models used for speculative decoding), and quality-preserving quantization techniques. These innovations underpin the engine’s performance and cost-effectiveness.
Future Outlook
Together AI plans to continue pushing the boundaries of AI acceleration. The company aims to ensure that the Together Inference Engine remains at the forefront of AI technology by expanding support for new models, technologies, and kernels.
Turbo and Lite endpoints for the Llama 3 models are available starting today, with plans to expand to other models soon. Visit the Together AI pricing page for more information.