Token Counting without Latency
Problem
You need to count tokens in LLM requests for cost tracking and rate limiting, but token counting libraries can be slow and add significant latency to your requests, especially for large inputs or high-throughput scenarios.
Solution
Implement asynchronous token counting with caching and estimation for non-critical paths. This works because you can count tokens in parallel with the LLM request, cache counts for repeated inputs, and use fast estimation algorithms when exact counts aren’t needed.
Why This Matters
Token counting is a necessary evil in LLM applications. You need it for cost tracking, rate limiting, context window management, and billing, but accurate token counting requires running the model’s tokenizer, which can add 5-50ms of latency per request depending on input size. At 1000 requests per second, that overhead becomes significant.
The solution is to recognize that not all token counts need to be exact at the same time. Rate limiting needs a fast estimate immediately (is this request roughly within budget?), while billing needs exact counts eventually (how many tokens did we actually use?). By separating these concerns, you can use a character-ratio estimator (approximately 4 characters per token for English text) for immediate decisions and compute exact counts asynchronously in the background.
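As a rough illustration, a pre-flight budget check can apply the character ratio directly. This is a minimal sketch, not part of the recipe’s code below; the 4:1 ratio and the `maxTokens` budget are illustrative assumptions.

```go
// estimateTokens applies the rough 4-characters-per-token rule for English text.
// Good enough for an immediate rate-limit decision, never for billing.
func estimateTokens(text string) int {
    return len(text) / 4
}

// withinBudget is a hypothetical pre-flight check built on the estimate.
func withinBudget(text string, maxTokens int) bool {
    return estimateTokens(text) <= maxTokens
}
```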
The caching layer addresses a pattern common in production: system prompts, few-shot examples, and template prefixes repeat across requests. Caching their token counts eliminates redundant computation for the most common inputs. The sync.RWMutex makes concurrent access to the cache safe, using a read lock for cache hits (the hot path) and a write lock only when inserting new entries. The OTel spans distinguish between cached, estimated, and exact counts, which is valuable for monitoring cache hit rates and tuning the estimation ratio for your specific model and language distribution.
Code Example
```go
package main

import (
    "context"
    "fmt"
    "log"
    "sync"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"

    "github.com/lookatitude/beluga-ai/schema"
)

var tracer = otel.Tracer("beluga.llms.token_counting")

// TokenCounter provides fast token counting with caching and estimation.
type TokenCounter struct {
    cache        map[string]int
    mu           sync.RWMutex
    estimator    TokenEstimator
    exactCounter TokenCounterFunc
}

// TokenCounterFunc counts tokens exactly (may be slow).
type TokenCounterFunc func(text string) (int, error)

// TokenEstimator estimates tokens quickly (approximate).
type TokenEstimator func(text string) int

// NewTokenCounter creates a new token counter.
func NewTokenCounter(exactCounter TokenCounterFunc, estimator TokenEstimator) *TokenCounter {
    return &TokenCounter{
        cache:        make(map[string]int),
        estimator:    estimator,
        exactCounter: exactCounter,
    }
}

// CountTokens counts tokens with caching and async backfill.
func (tc *TokenCounter) CountTokens(ctx context.Context, text string, requireExact bool) (int, error) {
    ctx, span := tracer.Start(ctx, "token_counter.count")
    defer span.End()

    // Check cache first
    tc.mu.RLock()
    if count, exists := tc.cache[text]; exists {
        tc.mu.RUnlock()
        span.SetAttributes(
            attribute.Bool("token_count.cached", true),
            attribute.Int("token_count.value", count),
        )
        return count, nil
    }
    tc.mu.RUnlock()

    // If an exact count is not required, use fast estimation
    if !requireExact {
        estimated := tc.estimator(text)
        span.SetAttributes(
            attribute.Bool("token_count.estimated", true),
            attribute.Int("token_count.value", estimated),
        )

        // Compute and cache the exact count asynchronously
        go tc.cacheExactCount(text)

        return estimated, nil
    }

    // Count exactly (may be slow)
    count, err := tc.exactCounter(text)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return 0, err
    }

    // Cache the result
    tc.mu.Lock()
    tc.cache[text] = count
    tc.mu.Unlock()

    span.SetAttributes(
        attribute.Bool("token_count.exact", true),
        attribute.Int("token_count.value", count),
    )

    return count, nil
}

// cacheExactCount computes the exact count in the background and caches it.
func (tc *TokenCounter) cacheExactCount(text string) {
    count, err := tc.exactCounter(text)
    if err != nil {
        return
    }

    tc.mu.Lock()
    tc.cache[text] = count
    tc.mu.Unlock()
}

// CountTokensAsync delivers an immediate estimate followed by the exact count.
func (tc *TokenCounter) CountTokensAsync(ctx context.Context, text string) <-chan TokenCountResult {
    resultCh := make(chan TokenCountResult, 1)

    go func() {
        defer close(resultCh)

        // Use estimation for an immediate response
        estimated := tc.estimator(text)
        resultCh <- TokenCountResult{
            Count:     estimated,
            Estimated: true,
        }

        // Count exactly in the background
        exact, err := tc.exactCounter(text)
        if err == nil {
            tc.mu.Lock()
            tc.cache[text] = exact
            tc.mu.Unlock()

            resultCh <- TokenCountResult{
                Count:     exact,
                Estimated: false,
            }
        }
    }()

    return resultCh
}

// TokenCountResult represents a token count result.
type TokenCountResult struct {
    Count     int
    Estimated bool
}

// FastEstimator estimates tokens from character count using a per-model ratio.
func FastEstimator(model string) TokenEstimator {
    // Tokens per character: roughly 4 characters per token for English text.
    ratios := map[string]float64{
        "gpt-3.5-turbo": 0.25,
        "gpt-4":         0.25,
        "claude-3":      0.25,
    }

    ratio := ratios[model]
    if ratio == 0 {
        ratio = 0.25 // Default
    }

    return func(text string) int {
        return int(float64(len(text)) * ratio)
    }
}

// CountTokensForMessages counts tokens across a message array.
func (tc *TokenCounter) CountTokensForMessages(ctx context.Context, messages []schema.Message, requireExact bool) (int, error) {
    ctx, span := tracer.Start(ctx, "token_counter.count_messages")
    defer span.End()

    total := 0

    for i, msg := range messages {
        content := msg.GetContent()
        count, err := tc.CountTokens(ctx, content, requireExact)
        if err != nil {
            span.RecordError(err)
            return 0, fmt.Errorf("failed to count tokens for message %d: %w", i, err)
        }
        total += count
    }

    span.SetAttributes(
        attribute.Int("token_count.total", total),
        attribute.Int("message_count", len(messages)),
    )

    return total, nil
}

func main() {
    ctx := context.Background()

    exactCounter := func(text string) (int, error) {
        return len(text) / 4, nil // Simplified; use the model's tokenizer in practice
    }

    estimator := FastEstimator("gpt-3.5-turbo")
    counter := NewTokenCounter(exactCounter, estimator)

    text := "This is a test message for token counting"
    count, err := counter.CountTokens(ctx, text, false)
    if err != nil {
        log.Fatalf("Failed to count tokens: %v", err)
    }
    fmt.Printf("Token count: %d\n", count)
}
```
Explanation
- **Cache-first lookup** — Token counts are cached by text content in a `sync.RWMutex`-protected map. The read lock is acquired first for cache lookups (the common case), avoiding write-lock contention. Repeated inputs like system prompts, few-shot examples, and template prefixes are counted once and served from the cache thereafter, eliminating the most common source of redundant computation.
- **Estimation with async backfill** — When exact counts aren’t required (e.g., for rate limiting or pre-flight checks), the `FastEstimator` returns immediately using a character-to-token ratio. Meanwhile, a background goroutine computes the exact count and caches it for future use. The first request uses an estimate, but all subsequent requests for the same text get an exact cached count without any computation.
- **Async counting channel** — The `CountTokensAsync` method returns a channel that delivers two results: an immediate estimate followed by the exact count. Consumers can use the estimate for fast decisions (like rate limiting) and the exact count for accurate operations (like billing); see the usage sketch after this list. The channel is buffered with capacity 1 to prevent the goroutine from blocking when the consumer reads only one result.
- **Model-specific estimation** — The `FastEstimator` function accepts a model name and returns a tuned ratio. Different models tokenize differently (GPT-4 and Claude use roughly 4 characters per token for English, but this varies by language), so the ratio can be customized per model for better estimation accuracy.
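As a usage illustration, here is a minimal consumer sketch. It assumes the `TokenCounter` from the code example above is in scope; `checkThenBill` and the `budget` parameter are hypothetical names standing in for your own rate-limit and billing hooks.

```go
// checkThenBill reads the fast estimate for a budget decision, then waits for
// the exact count. "budget" is a hypothetical per-request token limit.
func checkThenBill(ctx context.Context, tc *TokenCounter, prompt string, budget int) (int, error) {
    results := tc.CountTokensAsync(ctx, prompt)

    estimate := <-results // immediate, approximate
    if estimate.Count > budget {
        return 0, fmt.Errorf("estimated %d tokens exceeds budget of %d", estimate.Count, budget)
    }

    // The exact count arrives once the tokenizer finishes; use it for billing.
    if exact, ok := <-results; ok {
        return exact.Count, nil
    }
    return estimate.Count, nil // exact counting failed; fall back to the estimate
}
```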
Testing
```go
import (
    "context"
    "testing"

    "github.com/stretchr/testify/require"
)

func TestTokenCounter_Caching(t *testing.T) {
    callCount := 0
    exactCounter := func(text string) (int, error) {
        callCount++
        return len(text) / 4, nil
    }

    estimator := FastEstimator("gpt-3.5-turbo")
    counter := NewTokenCounter(exactCounter, estimator)

    text := "test message"

    // First call - should invoke the exact counter
    count1, _ := counter.CountTokens(context.Background(), text, true)

    // Second call - should be served from the cache
    count2, _ := counter.CountTokens(context.Background(), text, true)

    require.Equal(t, count1, count2)
    require.Equal(t, 1, callCount) // Exact counter called only once
}
```
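A companion test for the estimation fast path is sketched below. It assumes the same test file and imports as above, and it deliberately avoids asserting on the background backfill, which would require extra synchronization.

```go
func TestTokenCounter_Estimation(t *testing.T) {
    exactCounter := func(text string) (int, error) {
        return len(text) / 4, nil
    }
    estimator := FastEstimator("gpt-3.5-turbo")
    counter := NewTokenCounter(exactCounter, estimator)

    text := "another test message"

    // With requireExact=false the estimator's value is returned immediately.
    count, err := counter.CountTokens(context.Background(), text, false)
    require.NoError(t, err)
    require.Equal(t, estimator(text), count)
}
```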
Variations
Model-Specific Counting
Use different counting strategies per model:
```go
type ModelTokenCounter struct {
    counters map[string]*TokenCounter
}
```
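A minimal sketch of the per-model dispatch follows; the `ForModel` method and the `"default"` fallback are illustrative assumptions, not part of the recipe above.

```go
// ForModel returns the counter registered for a model, falling back to a
// hypothetical "default" entry when the model is unknown.
func (m *ModelTokenCounter) ForModel(model string) *TokenCounter {
    if tc, ok := m.counters[model]; ok {
        return tc
    }
    return m.counters["default"]
}

// CountTokens delegates to the model-specific counter.
func (m *ModelTokenCounter) CountTokens(ctx context.Context, model, text string, requireExact bool) (int, error) {
    return m.ForModel(model).CountTokens(ctx, text, requireExact)
}
```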
Batch Counting
Count tokens for multiple texts in parallel:
```go
func (tc *TokenCounter) CountTokensBatch(ctx context.Context, texts []string) ([]int, error) {
    // Count in parallel
}
```
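One possible shape for the body, sketched with `golang.org/x/sync/errgroup`; the errgroup dependency, the choice of exact counts, and the unbounded fan-out are assumptions, and production code would likely cap concurrency.

```go
// CountTokensBatch counts each text concurrently; results keep input order.
// Requires golang.org/x/sync/errgroup.
func (tc *TokenCounter) CountTokensBatch(ctx context.Context, texts []string) ([]int, error) {
    counts := make([]int, len(texts))
    g, ctx := errgroup.WithContext(ctx)

    for i, text := range texts {
        i, text := i, text // capture loop variables (needed before Go 1.22)
        g.Go(func() error {
            count, err := tc.CountTokens(ctx, text, true) // exact; pass false for estimates
            if err != nil {
                return err
            }
            counts[i] = count
            return nil
        })
    }

    if err := g.Wait(); err != nil {
        return nil, err
    }
    return counts, nil
}
```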
Related Recipes
- Streaming Tool Calls — Handle streaming with tool calls
- S2S Voice Metrics — Track token usage metrics