The Math Behind Inference for Llama 3.1 405B


In a recent Twitter thread, Yangqing Jia, CEO of Lepton AI, provided a fascinating breakdown of the economics behind AI inference APIs. The analysis comes at a crucial time, when many companies are announcing API offerings for Llama 3.1 405B and sparking discussions about profitability in the AI industry.

Jia's analysis focuses on the often-overlooked aspects of AI inference, particularly the importance of considering both input and output tokens in pricing models. Here's a breakdown of his key points:

  1. Batched Output Speed: At a typical production concurrency of 10, a Llama 3.1 405B deployment can achieve an output throughput of approximately 300 tokens per second.
  2. Input Token Consideration: Jia emphasizes that input tokens, though often ignored in back-of-the-envelope estimates, account for most of the billable volume. With a typical input-to-output ratio of 10:1, the system processes about 3,000 input tokens per second alongside the 300 output tokens.
  3. Revenue Calculation: At Lepton's price of $2.8 per million tokens (billed on input and output alike), the combined 3,300 tokens per second works out to roughly 285 million tokens, or about $798.34 in revenue, per day; the sketch after this list works through the arithmetic.
  4. Hardware Costs: Using the example of an 8xH100 GPU setup at AWS on-demand prices, the daily cost would be around $670.08.
  5. Potential for Profitability: With revenue exceeding costs, Jia argues that profitability is possible, albeit with slim margins.
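To make the arithmetic concrete, here is a minimal Python sketch reproducing the thread's numbers. The throughput, input-to-output ratio, price, and hardware cost are taken from the points above; everything else is unit conversion.

```python
# Worked numbers from Jia's thread: daily revenue vs. hardware cost
# for one 8xH100 node serving Llama 3.1 405B.

SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

output_tps = 300                          # batched output tokens/sec at concurrency ~10
input_tps = output_tps * 10               # 10:1 input-to-output ratio -> 3,000 tokens/sec
total_tps = output_tps + input_tps        # 3,300 billable tokens/sec

price_per_million = 2.8                   # Lepton's price, $ per million tokens (input + output)
gpu_cost_per_day = 670.08                 # 8xH100 on-demand, $ per day (figure from the thread)

tokens_per_day = total_tps * SECONDS_PER_DAY           # ~285.1M tokens/day
revenue_per_day = tokens_per_day / 1e6 * price_per_million

print(f"tokens/day:  {tokens_per_day / 1e6:.1f}M")
print(f"revenue/day: ${revenue_per_day:.2f}")          # ~$798.34
print(f"cost/day:    ${gpu_cost_per_day:.2f}")
print(f"margin/day:  ${revenue_per_day - gpu_cost_per_day:.2f}")  # ~$128
```

The margin works out to roughly $128 per day per node, or about a 16% gross margin, which matches Jia's "slim margins" characterization.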

However, Jia is quick to point out several factors that can influence these calculations:

  • Traffic Variability: Consistent high concurrency is not guaranteed, and average throughput (and therefore revenue) falls with it; the sensitivity sketch after this list shows how quickly the margin erodes.
  • Pricing Model: Both input and output tokens are billed, a detail that can be easily misunderstood.
  • Hardware Costs: Serverless APIs need elastic capacity, which pushes providers toward higher-priced on-demand GPUs rather than cheaper long-term reservations.
  • Speculative Decoding: The technique's effectiveness diminishes at high concurrency, but automatically enabling it during low-concurrency periods can still improve performance.
  • Prompt Caching: Can provide additional efficiency for certain use cases.
  • Hardware Alternatives: Different accelerators such as the H200 and MI300 may alter the economics, but they do not change the conclusion that profitability is achievable.
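The traffic-variability caveat is worth quantifying. The rough sketch below makes one loudly simplified assumption that is not from Jia's thread: that delivered throughput scales linearly with average utilization of the concurrency-10 figure. Real serving curves are not linear, so treat this as illustration only.

```python
# Hypothetical sensitivity check: how the daily margin moves if average
# utilization falls below the concurrency-10 throughput Jia assumes.
# Linear scaling with utilization is a simplification for illustration;
# real batched throughput does not behave this cleanly.

PRICE_PER_MILLION = 2.8       # $ per million tokens, input + output
COST_PER_DAY = 670.08         # 8xH100 on-demand, $ per day
PEAK_TOKENS_PER_SEC = 3_300   # input + output tokens/sec at concurrency 10
SECONDS_PER_DAY = 86_400

for utilization in (1.0, 0.9, 0.8, 0.7):
    tokens = PEAK_TOKENS_PER_SEC * utilization * SECONDS_PER_DAY
    revenue = tokens / 1e6 * PRICE_PER_MILLION
    print(f"utilization {utilization:.0%}: margin ${revenue - COST_PER_DAY:+.2f}/day")
```

Under these assumptions, the node dips into the red once average utilization falls below roughly 84% of the concurrency-10 throughput, which is why traffic variability tops Jia's list of caveats.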

Jia's analysis provides a valuable perspective on the economics of AI inference APIs. It highlights the delicate balance between pricing, hardware costs, and optimization techniques that companies must navigate to achieve profitability.

While Jia's numbers suggest that profitability is possible, they also underscore the tight margins and the importance of efficient operations in this highly competitive field. As more companies enter the API market, it will be interesting to see how pricing models and technological optimizations evolve to maintain profitability while delivering high-quality AI services.