LLM Observability
Token economics, cost optimization strategies, and observability tooling for production LLM deployments
LLM API Costs and Observability
Mechanism
When an application sends text to an LLM API, the text is divided into tokens. Tokens are the fundamental units of text processing in LLMs, representing common character sequences or word fragments. The provider charges for both the tokens sent to the model (input/prompt tokens) and the tokens generated by the model (output/completion tokens). Total cost is then the number of tokens processed multiplied by the per-token price of each model used.
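To make the arithmetic concrete, here is a minimal sketch; the per-token prices are placeholders, not any provider's current rates.

```python
# Illustrative cost calculation. The prices below are placeholder values
# (USD per 1M tokens), not current rates for any real provider or model.
PRICES_PER_1M_TOKENS = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens * input price + output tokens * output price."""
    p = PRICES_PER_1M_TOKENS[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with a 1,200-token prompt and a 300-token completion:
print(estimate_cost("example-model", 1_200, 300))  # -> 0.0081
```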
Token Usage Optimization
Effective token optimization requires understanding how the different components of a request contribute to overall token consumption; the example payload after this list shows where each component appears.
- System Instructions: Define the model's behavior and constraints; typically sent with every request in a session
- User Messages: The specific queries or inputs from end-users
- Assistant Messages: Previous model responses that provide conversation context
- Function Definitions: Descriptions of available tools/functions that can be called by the model
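As a concrete illustration, the sketch below shows an OpenAI-style Chat Completions payload; every component shown is serialized into the prompt and billed as input tokens. The model name, message contents, and the lookup_account tool are hypothetical examples.

```python
# OpenAI-style Chat Completions request; all of these fields count as input tokens.
request = {
    "model": "gpt-4o-mini",
    "messages": [
        # System instructions: sent with every request in the session
        {"role": "system", "content": "You are a concise support assistant."},
        # Message history: prior user/assistant turns carried as context
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security > Reset."},
        # Current user query: often the smallest share of input tokens
        {"role": "user", "content": "And if I no longer have the recovery email?"},
    ],
    "tools": [
        # Function definitions: schemas are serialized into the prompt and billed too
        {
            "type": "function",
            "function": {
                "name": "lookup_account",
                "description": "Look up a customer account by email address.",
                "parameters": {
                    "type": "object",
                    "properties": {"email": {"type": "string"}},
                    "required": ["email"],
                },
            },
        }
    ],
}
```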
In a typical chat application, an analysis by Langfuse of 1.2M production LLM calls found the following breakdown of input tokens:
- System instructions: 5-25% of input tokens
- Message history: 60-90% of input tokens
- Function definitions (when used): 10-40% of input tokens
- Current user query: Often <5% of input tokens
Tokenization Considerations
- Different models use different tokenizers (e.g., tiktoken and SentencePiece).
- Unicode characters, symbols, and non-English text often require more tokens per character than standard English
- Code and JSON typically tokenize less efficiently than natural language (the tiktoken comparison below illustrates both points)
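Counting tokens directly is the quickest way to see these effects. The sketch below uses tiktoken's cl100k_base encoding (used by OpenAI's GPT-3.5/GPT-4 family); counts for other tokenizers will differ, and the sample strings are arbitrary.

```python
# Compare characters-per-token across text types (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english": "Please summarize the customer's last three support tickets.",
    "json": '{"customer_id": 48213, "tickets": [101, 102, 103], "priority": "high"}',
    "non_english": "お客様の直近のサポートチケットを要約してください。",
}

for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name}: {len(text)} chars -> {n_tokens} tokens")
```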
LLM Cost Debugging
Token Usage Profiling
Comprehensive token tracking across your application should include the following (see the instrumentation sketch after this list):
- Instrumenting API calls to log token counts for both input and output
- Categorizing usage by function, feature, endpoint, or user segment
- Establishing baselines and alerting on anomalous increases
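A minimal instrumentation sketch, assuming the OpenAI Python SDK (v1.x) and the standard logging module; the feature tag is a hypothetical label used to categorize usage downstream.

```python
# Log token counts for every call, tagged by feature for later aggregation.
import logging
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
log = logging.getLogger("llm.usage")

def tracked_completion(feature: str, **kwargs):
    response = client.chat.completions.create(**kwargs)
    usage = response.usage  # prompt_tokens, completion_tokens, total_tokens
    log.info(
        "feature=%s model=%s prompt_tokens=%d completion_tokens=%d",
        feature, kwargs.get("model"), usage.prompt_tokens, usage.completion_tokens,
    )
    return response
```

The logged counts can then be aggregated per feature, endpoint, or user segment to establish baselines and drive anomaly alerts.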
Optimization Testing Framework
Develop systematic approaches to test optimization strategies:
- Create benchmark prompts representing common scenarios
- Implement A/B testing for different prompt structures and system instructions
- Measure token efficiency (result quality divided by total tokens used); a comparison sketch follows this list
- Establish baseline metrics and set target cost-per-request goals
- Log and analyze the impact of optimization changes on both costs and application performance
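A sketch of the efficiency comparison, assuming quality scores come from your own evaluation harness (human rating, LLM-as-judge, etc.); the variants and numbers below are made up.

```python
# Compare prompt variants by quality delivered per 1K tokens consumed.
from dataclasses import dataclass

@dataclass
class RunResult:
    variant: str       # prompt/system-instruction variant under test
    quality: float     # 0..1 score from your evaluation harness (assumed)
    total_tokens: int  # input + output tokens reported by the API

def token_efficiency(r: RunResult) -> float:
    """Result quality per 1K tokens used."""
    return r.quality / (r.total_tokens / 1000)

results = [
    RunResult("baseline_prompt", quality=0.82, total_tokens=1850),
    RunResult("trimmed_history", quality=0.80, total_tokens=1100),
]
for r in sorted(results, key=token_efficiency, reverse=True):
    print(f"{r.variant}: {token_efficiency(r):.2f} quality per 1K tokens")
```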
Deployment-Specific Optimization Strategies
Production Environment Optimizations
Implementing multi-level caching can reduce LLM API costs for applications with repetitive queries (a minimal exact-match cache sketch follows this list):
- Request-level deduplication for identical prompts within short timeframes
- Response caching with configurable TTL based on content volatility
- Semantic caching using embeddings to find similar previous requests
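A minimal sketch of the first two layers, keyed on a hash of the request payload; the semantic layer would add an embedding-similarity lookup before falling through to the API and is omitted here. The call_api callable is a stand-in for your actual client.

```python
# Exact-match response cache with a TTL; identical requests within the TTL
# window are served from memory instead of hitting the API.
import hashlib, json, time

class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, request: dict) -> str:
        return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

    def get(self, request: dict) -> str | None:
        hit = self._store.get(self._key(request))
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, request: dict, response: str) -> None:
        self._store[self._key(request)] = (time.monotonic(), response)

cache = TTLCache(ttl_seconds=60)  # shorter TTL for more volatile content

def cached_call(request: dict, call_api) -> str:
    if (cached := cache.get(request)) is not None:
        return cached          # request-level deduplication
    response = call_api(request)
    cache.put(request, response)
    return response
```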
Batch Processing: Google's Vertex AI recommends grouping similar requests to amortize system prompt costs:
- Combine multiple user queries into single API calls when appropriate
- Process data in batches rather than individual items when possible
- Use bulk embedding generation instead of individual embedding calls, as in the sketch below
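A sketch of the embedding case, assuming the OpenAI Python SDK, whose embeddings endpoint accepts a list of inputs; the model name and documents are illustrative.

```python
# One batched request replaces len(documents) individual embedding calls.
from openai import OpenAI

client = OpenAI()
documents = ["first support ticket...", "second support ticket...", "third..."]

response = client.embeddings.create(model="text-embedding-3-small", input=documents)
vectors = [item.embedding for item in response.data]
```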
Development Environment Considerations
Response Simulation: Implement a development mode that simulates LLM responses (a stub sketch follows this list):
- Use cached responses from production for common development scenarios
- Generate placeholder responses for new patterns to avoid development costs
- Add intentional latency to simulate real API behavior
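A minimal stub sketch; the cache dictionary, its contents, and the latency range are assumptions for illustration.

```python
# Development-mode completion: replay cached production responses when
# available, otherwise return a placeholder, with artificial latency.
import random, time

CACHED_RESPONSES: dict[str, str] = {}  # e.g. loaded from exported production traces

def dev_completion(prompt: str) -> str:
    time.sleep(random.uniform(0.3, 1.5))      # simulate network/model latency
    if prompt in CACHED_RESPONSES:
        return CACHED_RESPONSES[prompt]       # replay a real production response
    return f"[dev placeholder response for: {prompt[:40]}...]"
```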
Token Budget Enforcement: OpenAI's enterprise best practices recommend setting stricter token limits in development (a simple guard sketch follows this list):
- Enforce maximum context sizes below production limits
- Alert developers when requests exceed token budgets
- Create dashboards showing projected production costs of development patterns
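A simple guard sketch using tiktoken; the budget value and the choice to raise rather than alert are arbitrary examples.

```python
# Fail fast in development when a prompt exceeds the (stricter) dev token budget.
import tiktoken

DEV_INPUT_TOKEN_BUDGET = 2_000  # deliberately below production limits (assumed)

def enforce_budget(prompt: str, budget: int = DEV_INPUT_TOKEN_BUDGET) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    if n_tokens > budget:
        raise ValueError(f"Prompt uses {n_tokens} tokens, over the {budget}-token dev budget")
    return n_tokens
```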
LLM Observability and Monitoring Tools
LLM monitoring is the continuous tracking of performance metrics, usage patterns, and costs of large language models in production environments. LLM observability goes deeper, providing insights into why models behave as they do by collecting and analyzing detailed logs, traces, and metrics.
Observability Solutions
Langfuse
A specialized LLM observability platform that offers:
- Automatic token usage calculation using appropriate tokenizers
- Custom model definitions for self-hosted or fine-tuned models
- Multi-tenancy support for tracking costs per customer or team
- Cost inference based on model parameters and usage patterns
Humanloop
An enterprise-focused evaluation and monitoring platform offering:
- Real-time cost monitoring dashboards with alerting capabilities
- Evaluator frameworks for assessing model quality and performance
- Automated detection of harmful outputs and anomalies
- Integrated alert systems for cost spikes, performance degradation, or compliance issues
- CI/CD integration for continuous performance verification
Elastic LLM Observability
Infrastructure-focused monitoring that includes:
- Integration with popular model providers (OpenAI, Azure OpenAI, Amazon Bedrock)
- Out-of-the-box dashboards for cost and performance metrics
- OpenTelemetry integration for standardized observability
- Anomaly detection to identify unusual usage patterns or inefficiencies
Tinybird
A real-time analytics platform with an open-source LLM cost tracking template that provides:
- Real-time dashboards for tracking LLM costs, requests, tokens, and duration
- Multi-dimensional filtering by model, provider, organization, project, and user
- Integration with Python (via LiteLLM) and TypeScript (via Vercel AI SDK)
Other Considerations
- Cost Allocation: Set up tagging or metadata systems to attribute costs to specific features, teams, or customers for accurate accounting (the aggregation sketch after this list shows the idea).
- Alerting Thresholds: Establish meaningful thresholds for cost spikes, latency issues, or error rates to prevent unexpected budget overruns.
- Caching Strategy Monitoring: Track cache hit rates to ensure caching mechanisms are functioning efficiently and providing expected cost savings.
- Integration with Existing Tools: Connect LLM observability to the existing monitoring stack for holistic visibility.
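An aggregation sketch covering the first three points; the record fields, teams, and thresholds are invented examples, and in practice these records would come from your observability backend.

```python
# Attribute cost by team, compute cache hit rate, and check an alert threshold.
from collections import defaultdict

records = [
    {"team": "search", "feature": "rerank", "cost_usd": 0.012, "cache_hit": True},
    {"team": "search", "feature": "rerank", "cost_usd": 0.015, "cache_hit": False},
    {"team": "support", "feature": "chat", "cost_usd": 0.031, "cache_hit": False},
]

cost_by_team: dict[str, float] = defaultdict(float)
hits = 0
for r in records:
    cost_by_team[r["team"]] += r["cost_usd"]
    hits += r["cache_hit"]

cache_hit_rate = hits / len(records)
DAILY_BUDGET_USD = 50.0  # example alerting threshold
if sum(cost_by_team.values()) > DAILY_BUDGET_USD:
    print("ALERT: daily LLM spend over budget")
print(dict(cost_by_team), f"cache hit rate={cache_hit_rate:.0%}")
```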