
LLM Observability

Token economics, cost optimization strategies, and observability tooling for production LLM deployments

LLM API Costs and Observability

Mechanism

When an application sends text to an LLM API, the text is divided into tokens. Tokens are the fundamental units of text processing in LLMs, representing common character sequences or word fragments. The provider charges for both the tokens sent to the model (input/prompt tokens) and tokens generated by the model (output/completion tokens). Total costs are then calculated by multiplying the number of tokens processed by the cost per token for each model used.
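
As a rough illustration, cost per request is just the input and output token counts multiplied by their respective rates; the model names and per-million-token prices below are placeholders, not current list prices.

```python
# Hypothetical per-million-token prices; substitute your provider's actual rates.
PRICES_PER_MILLION_TOKENS = {
    "example-small-model": {"input": 0.50, "output": 1.50},
    "example-large-model": {"input": 5.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call in dollars."""
    price = PRICES_PER_MILLION_TOKENS[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# 1,200 prompt tokens and 300 completion tokens on the large model:
print(request_cost("example-large-model", 1_200, 300))  # 0.0105
```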

Token Usage Optimization

Effective token optimization requires understanding how different components contribute to overall token consumption.

  • System Instructions: Define the model's behavior and constraints; typically sent with every request in a session
  • User Messages: The specific queries or inputs from end-users
  • Assistant Messages: Previous model responses that provide conversation context
  • Function Definitions: Descriptions of available tools/functions that can be called by the model

In a typical chat application, input tokens break down roughly as follows, according to a Langfuse analysis of 1.2M production LLM calls:

  • System instructions: 5-25% of input tokens
  • Message history: 60-90% of input tokens
  • Function definitions (when used): 10-40% of input tokens
  • Current user query: Often <5% of input tokens

Tokenization Considerations

  • Different models use different tokenizers (e.g., tiktoken and SentencePiece)
  • Unicode characters, symbols, and non-English text often tokenize differently than standard English
  • Code and JSON typically tokenize less efficiently than natural language
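
To see these differences concretely, you can count tokens directly. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding as an example; other models ship their own tokenizers, so counts will vary.

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is one common OpenAI encoding; other providers use different
# tokenizers (e.g., SentencePiece-based), so counts differ by model.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "json": '{"user_id": 12345, "active": true, "tags": ["alpha", "beta"]}',
    "code": "def add(a, b):\n    return a + b",
}

for name, text in samples.items():
    print(f"{name}: {len(text)} chars -> {len(enc.encode(text))} tokens")
```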

LLM Cost Debugging

Token Usage Profiling

Comprehensive token tracking across your application should include:

  • Instrumenting API calls to log token counts for both input and output (see the sketch after this list)
  • Categorizing usage by function, feature, endpoint, or user segment
  • Establishing baselines and alerting on anomalous increases
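
A minimal sketch of the instrumentation bullet above, assuming the OpenAI Python SDK, whose chat completion responses expose a `usage` object; adapt the field names and tags to your provider and logging stack.

```python
# Minimal instrumentation sketch, assuming the OpenAI Python SDK; the model
# name and tag values are illustrative.
import logging
import time

from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("llm.usage")

def tracked_completion(messages, model="gpt-4o-mini", feature="unknown", segment="unknown"):
    start = time.monotonic()
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage  # prompt_tokens / completion_tokens reported by the API
    # Structured log line that downstream dashboards can aggregate by
    # feature, endpoint, or user segment.
    logger.info(
        "llm_call model=%s feature=%s segment=%s prompt_tokens=%d completion_tokens=%d latency_ms=%.0f",
        model, feature, segment,
        usage.prompt_tokens, usage.completion_tokens,
        (time.monotonic() - start) * 1000,
    )
    return response
```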

Optimization Testing Framework

Develop systematic approaches to test optimization strategies:

  • Create benchmark prompts representing common scenarios
  • Implement A/B testing for different prompt structures and system instructions
  • Measure token efficiency (e.g., result quality divided by total tokens used; see the sketch after this list)
  • Establish baseline metrics and set target cost-per-request goals
  • Log and analyze the impact of optimization changes on both costs and application performance
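
One way to make the token-efficiency bullet concrete is a simple quality-per-token ratio for comparing prompt variants; the quality scores and token counts below are made-up benchmark outputs.

```python
# Hypothetical benchmark results for two prompt variants.
variants = [
    {"name": "verbose_system_prompt", "quality": 0.91, "total_tokens": 2400},
    {"name": "compressed_system_prompt", "quality": 0.89, "total_tokens": 1100},
]

def token_efficiency(quality: float, total_tokens: int) -> float:
    """Quality delivered per 1,000 tokens spent; higher is better."""
    return quality / (total_tokens / 1000)

for v in variants:
    print(v["name"], round(token_efficiency(v["quality"], v["total_tokens"]), 3))
# The compressed prompt gives up two points of quality for roughly 2x the efficiency.
```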

Deployment-Specific Optimization Strategies

Production Environment Optimizations

Implementing multi-level caching can reduce LLM API costs for applications with repetitive queries (a minimal cache sketch follows the list):

  • Request-level deduplication for identical prompts within short timeframes
  • Response caching with configurable TTL based on content volatility
  • Semantic caching using embeddings to find similar previous requests
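
A minimal sketch of the first two layers: exact-match deduplication plus TTL-based response caching. A semantic cache would swap the hash key for a nearest-neighbour lookup over embeddings of prior prompts.

```python
# Exact-match response cache with a configurable TTL. A semantic cache would
# replace the hash key with a similarity search over prompt embeddings.
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, cached_response)

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        entry = self._store.get(self._key(model, prompt))
        if entry and entry[0] > time.time():
            return entry[1]  # cache hit: no API call, no token cost
        return None

    def set(self, model: str, prompt: str, response) -> None:
        self._store[self._key(model, prompt)] = (time.time() + self.ttl, response)
```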

Batch Processing: Google's Vertex AI recommends grouping similar requests to amortize system prompt costs:

  • Combine multiple user queries into single API calls when appropriate
  • Process data in batches rather than individual items when possible
  • Use bulk embedding generation instead of individual embedding calls (see the sketch after this list)
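
The last bullet is the easiest to apply. The sketch below assumes the OpenAI Python SDK, whose embeddings endpoint accepts a list of inputs, so one request replaces N separate ones; the model name is illustrative.

```python
# Bulk embedding sketch, assuming the OpenAI Python SDK; results come back
# in the same order as the inputs.
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

vectors = embed_batch(["first document", "second document", "third document"])
```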

Development Environment Considerations

Implement a development mode that simulates LLM responses (a sketch follows the list):

  • Use cached responses from production for common development scenarios
  • Generate placeholder responses for new patterns to avoid development costs
  • Add intentional latency to simulate real API behavior
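
A sketch of such a development-mode shim, assuming an `LLM_MODE` environment variable and a hypothetical `call_real_api` production wrapper; the cached responses would come from your own production logs.

```python
# Development-mode shim: serve cached or placeholder responses locally,
# with artificial latency so timing behaviour stays realistic.
import os
import random
import time

# Hypothetical responses captured from production logs.
CACHED_RESPONSES = {
    "summarize the onboarding doc": "Here is a short summary of the onboarding doc...",
}

def call_real_api(prompt: str) -> str:
    raise NotImplementedError("wire up the production LLM client here")

def dev_completion(prompt: str) -> str:
    time.sleep(random.uniform(0.3, 1.2))  # intentional latency to mimic the real API
    return CACHED_RESPONSES.get(prompt, "[placeholder response - no API cost incurred]")

def completion(prompt: str) -> str:
    if os.getenv("LLM_MODE", "dev") == "dev":
        return dev_completion(prompt)
    return call_real_api(prompt)
```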

Token Budget Enforcement: OpenAI's enterprise best practices recommend setting stricter token limits in development (a budget-check sketch follows the list):

  • Enforce maximum context sizes below production limits
  • Alert developers when requests exceed token budgets
  • Create dashboards showing projected production costs of development patterns
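
A simple budget check of this kind might look as follows; it assumes tiktoken for counting and an illustrative 4,000-token development cap.

```python
# Development-time token budget check; the cap is illustrative and should sit
# comfortably below whatever the production context limit is.
import tiktoken

DEV_MAX_CONTEXT_TOKENS = 4_000

def enforce_token_budget(prompt: str, encoding_name: str = "cl100k_base") -> int:
    n_tokens = len(tiktoken.get_encoding(encoding_name).encode(prompt))
    if n_tokens > DEV_MAX_CONTEXT_TOKENS:
        raise ValueError(
            f"Prompt uses {n_tokens} tokens, over the dev budget of "
            f"{DEV_MAX_CONTEXT_TOKENS}; trim message history or function definitions."
        )
    return n_tokens
```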

LLM Observability and Monitoring Tools

LLM monitoring is the continuous tracking of performance metrics, usage patterns, and costs of large language models in production environments. LLM observability goes deeper, providing insights into why models behave as they do by collecting and analyzing detailed logs, traces, and metrics.

Observability Solutions

Langfuse

A specialized LLM observability platform that offers:

  • Automatic token usage calculation using appropriate tokenizers
  • Custom model definitions for self-hosted or fine-tuned models
  • Multi-tenancy support for tracking costs per customer or team
  • Cost inference based on model parameters and usage patterns

Humanloop

An enterprise-focused evaluation and monitoring platform offering:

  • Real-time cost monitoring dashboards with alerting capabilities
  • Evaluator frameworks for assessing model quality and performance
  • Automated detection of harmful outputs and anomalies
  • Integrated alert systems for cost spikes, performance degradation, or compliance issues
  • CI/CD integration for continuous performance verification

Elastic LLM Observability

Infrastructure-focused monitoring that includes:

  • Integration with popular model providers (OpenAI, Azure OpenAI, Amazon Bedrock)
  • Out-of-the-box dashboards for cost and performance metrics
  • OpenTelemetry integration for standardized observability
  • Anomaly detection to identify unusual usage patterns or inefficiencies

Tinybird

A real-time analytics platform with an open-source LLM cost tracking template that provides:

  • Real-time dashboards for tracking LLM costs, requests, tokens, and duration
  • Multi-dimensional filtering by model, provider, organization, project, and user
  • Integration with Python (via LiteLLM) and TypeScript (via Vercel AI SDK)

Other Considerations

  • Cost Allocation: Set up tagging or metadata systems to attribute costs to specific features, teams, or customers for accurate accounting.
  • Alerting Thresholds: Establish meaningful thresholds for cost spikes, latency issues, or error rates to prevent unexpected budget overruns.
  • Caching Strategy Monitoring: Track cache hit rates to ensure caching mechanisms are functioning efficiently and delivering the expected cost savings (a minimal counter sketch appears below).
  • Integration with Existing Tools: Connect LLM observability to the existing monitoring stack for holistic visibility.
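
For the caching point above, a minimal hit-rate counter is often enough to start; anything more elaborate can be pushed into your existing metrics backend.

```python
# Minimal cache hit-rate tracking; in practice these counters would be
# emitted to a metrics backend (Prometheus, Datadog, etc.).
class CacheStats:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```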