ADOPTION NEWS

IBM Research unveils cost-effective AI inference through speculative decoding

By Crypto Flexs | June 24, 2024 | 3 Mins Read
IBM Research has announced a breakthrough in AI inference that combines speculative decoding with paged attention to improve the cost-performance of large language models (LLMs). According to IBM Research, the development is expected to make customer care chatbots more efficient and cost-effective.

In recent years, LLMs have improved chatbots' ability to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have held back broader adoption of AI. Speculative decoding has emerged as an optimization technique that accelerates AI inference by generating tokens faster, and it can improve the customer experience by cutting latency by 2-3x.

Despite these benefits, reducing latency typically comes with a trade-off: operating costs rise because throughput falls and fewer users can use the model at the same time. IBM Research addressed this trade-off by quadrupling throughput while halving the latency of the open-source Granite 20B code model.

Speculative Decoding: Efficiency in Token Generation

LLMs use a transformer architecture that is inefficient for text generation: a forward pass is typically required to process each previously generated token before a new token can be produced. Speculative decoding modifies this process so that several prospective tokens are evaluated at once. Once those tokens are verified, multiple tokens can be generated from a single forward pass, improving inference speed.
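The idea can be illustrated with a minimal, self-contained sketch. The toy draft_model, verify_with_base_model, acceptance rule, and token arithmetic below are illustrative assumptions, not IBM's actual implementation; they only show how a cheap speculator proposes several tokens that the large model then verifies in one pass.

```python
# Toy draft-and-verify speculative decoding. All "models" here are stand-ins.

def draft_model(prefix, k=4):
    """Cheap speculator: propose k candidate next tokens for the prefix."""
    return [(sum(prefix) + i) % 100 for i in range(1, k + 1)]

def verify_with_base_model(prefix, candidates):
    """One 'forward pass' of the large model: score the whole candidate block
    at once and keep the longest run of candidates it agrees with."""
    accepted = []
    for tok in candidates:
        if tok % 2 == 0:          # toy acceptance rule for illustration only
            accepted.append(tok)
        else:
            break                 # first disagreement ends the accepted run
    return accepted

def speculative_decode(prompt, max_new_tokens=16):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        candidates = draft_model(tokens)                       # cheap proposals
        accepted = verify_with_base_model(tokens, candidates)  # single verify pass
        if not accepted:
            # Fall back to a single token from the base model (a dummy here),
            # so progress is guaranteed even when no candidate is accepted.
            accepted = [(sum(tokens) * 7 + 1) % 100]
        tokens.extend(accepted)   # possibly several tokens per forward pass
    return tokens

print(speculative_decode([3, 14, 15], max_new_tokens=8))
```

The key point is that each loop iteration still costs roughly one forward pass of the large model, but it can yield several accepted tokens rather than exactly one.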

This technique can be implemented as a smaller, more efficient model or as part of the base model itself. By processing tokens in parallel, speculative decoding makes fuller use of each GPU and can double or triple inference speed. While researchers at DeepMind and Google relied on draft models when they first introduced speculative decoding, newer methods such as the Medusa speculator do not require an auxiliary model.
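As a rough illustration of the no-auxiliary-model approach, a Medusa-style speculator attaches extra lightweight prediction heads to the base model so that several future tokens are proposed from a single hidden state. The dimensions, random linear heads, and function names below are assumptions for illustration, not the actual Medusa or IBM code.

```python
# Sketch of Medusa-style speculation: extra heads on the base model's last
# hidden state each guess one of the next few tokens, so no draft model is needed.
import numpy as np

HIDDEN, VOCAB, NUM_HEADS = 64, 100, 4
rng = np.random.default_rng(0)
medusa_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(NUM_HEADS)]

def propose_future_tokens(last_hidden_state):
    """Each head predicts one of the next NUM_HEADS tokens from the same
    hidden state; the base model then verifies them in a single pass."""
    return [int(np.argmax(last_hidden_state @ W)) for W in medusa_heads]

print(propose_future_tokens(rng.standard_normal(HIDDEN)))
```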

IBM researchers tuned the Medusa speculator by conditioning future tokens on each other rather than on the model's next predicted token. This approach, combined with an efficient fine-tuning method that uses both large and small batches of text, aligns the speculator's responses closely with those of the LLM, significantly improving inference speed.

Paged Attention: Optimizing Memory Usage

Reducing LLM latency often reduces throughput because of the increased strain on GPU memory. Dynamic batching can alleviate this, but not when speculative decoding is also competing for memory. IBM researchers addressed the problem with paged attention, an optimization technique inspired by the paging concepts that operating systems use for virtual memory.

Conventional attention algorithms store key-value (KV) sequences in contiguous memory, which leads to fragmentation. Paged attention instead breaks these sequences into smaller blocks, or pages, that are accessed as needed. This frees memory by minimizing redundant computation and lets the speculator generate multiple candidates for each predicted word without duplicating the entire KV cache.
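A simplified sketch of the block-based bookkeeping behind paged attention is shown below. The block size, free-list allocator, and class and method names are illustrative assumptions rather than the actual implementation; the sketch only shows how pages are allocated on demand and shared between a sequence and its speculative candidates instead of copying the whole KV cache.

```python
# Toy paged (block-based) KV cache in the spirit of paged attention.

BLOCK_SIZE = 16  # tokens per page (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical pages
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id, kv_entry):
        """Record one token's KV entry, allocating a new page only when the
        current page is full (no large contiguous reservation up front)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # current page full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # A real cache would write kv_entry into GPU memory at this location.
        return table[-1], length % BLOCK_SIZE   # (physical block, offset)

    def fork(self, parent_id, child_id):
        """Let a speculative candidate share the parent's pages instead of
        duplicating the entire KV cache."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                               # 20 tokens -> only 2 pages used
    cache.append_token("request-1", kv_entry=None)
cache.fork("request-1", "request-1/candidate-0")  # no page duplication
print(cache.block_tables)
```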

What This Means for the Future

IBM has integrated speculative decoding and paged attention into its Granite 20B code model. It has also open-sourced its speculator on Hugging Face so that other developers can apply these techniques to their own LLMs. IBM plans to roll these optimizations out across all models on the watsonx platform to enhance enterprise AI applications.

Image source: Shutterstock


