SaaS Monitoring and Observability: Logs, Metrics, and Traces

Observability is the foundation of operational confidence in a SaaS application. Without it, you are debugging in the dark. This guide covers the three pillars of observability — logs, metrics, and traces — implemented for SaaS applications on Cloudflare Workers and TanStack Start, with practical patterns used at tanstackship.com for production monitoring, alerting, and root cause analysis.


The Three Pillars of Observability

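| Pillar  | What It Answers      | Storage                     | Retention   | Cost          |
|---------|----------------------|-----------------------------|-------------|---------------|
| Logs    | “What happened?”     | Workers Analytics Engine    | 7-30 days   | Low           |
| Metrics | “What is the trend?” | Analytics Engine + Grafana  | 3-12 months | Medium        |
| Traces  | “Why did it happen?” | Workers Trace (Tail Worker) | 1-7 days    | Low (sampled) |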

Logging: Structured Logging on the Edge

Log Format

// src/lib/logger.ts
type LogLevel = "debug" | "info" | "warn" | "error"

interface LogEntry {
  timestamp: string
  level: LogLevel
  message: string
  requestId: string
  userId?: string
  duration?: number
  error?: string
  stack?: string
  metadata?: Record<string, unknown>
}

function createLogger(request: Request) {
  // Reuse Cloudflare's Ray ID when present so entries line up with edge logs
  const requestId = request.headers.get("cf-ray") ?? crypto.randomUUID()
  const now = () => new Date().toISOString()
  return {
    info: (msg: string, meta?: Record<string, unknown>) =>
      log({ timestamp: now(), level: "info", message: msg, requestId, metadata: meta }),
    warn: (msg: string, meta?: Record<string, unknown>) =>
      log({ timestamp: now(), level: "warn", message: msg, requestId, metadata: meta }),
    error: (msg: string, error?: Error, meta?: Record<string, unknown>) =>
      log({
        timestamp: now(),
        level: "error",
        message: msg,
        requestId,
        error: error?.message,
        stack: error?.stack,
        metadata: meta,
      }),
    getRequestId: () => requestId,
  }
}

Sending Logs to Analytics Engine

// src/lib/logger.ts (continued): writing to Workers Analytics Engine
// `log` has no request context in scope, so the bindings are imported
// directly. This import belongs at the top of the file.
import { env } from "cloudflare:workers"

function log(entry: LogEntry) {
  // Write to Analytics Engine for queryable storage
  env.ANALYTICS.writeDataPoint({
    blobs: [
      entry.level,
      entry.message,
      entry.requestId,
      entry.userId ?? "",
      entry.error ?? "",
      JSON.stringify(entry.metadata ?? {}),
    ],
    doubles: [entry.duration ?? 0],
    indexes: [entry.requestId],
  })
  // Also output to Workers console for wrangler tail
  console.log(JSON.stringify(entry))
}

Metrics: What to Measure

SaaS Metrics Taxonomy

| Category        | Metrics                                                         | How to Measure             |
|-----------------|-----------------------------------------------------------------|----------------------------|
| Performance     | Request latency (p50/p95/p99), response size, cache hit rate    | Analytics Engine           |
| Business        | Signups, MRR, churn rate, conversion rate                       | Database queries           |
| Infrastructure  | Worker CPU time, memory, subrequest count                       | Workers Runtime API        |
| User experience | LCP, CLS, INP, TTFB                                             | Web Vitals API + Analytics |

Collecting Performance Metrics

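// server/middleware/metrics.ts
import { createServerFn } from "@tanstack/react-start"

export const metricsMiddleware = createServerFn({ method: "GET" }).handler(
  async ({ request, context }) => {
    const startTime = Date.now()
    // ... handle the request here, so the duration below covers real work ...

    // writeDataPoint returns immediately and does not return a promise,
    // so it needs neither await nor waitUntil
    context.env.ANALYTICS.writeDataPoint({
      blobs: [
        request.url,
        request.method,
        String(context.response?.status ?? 500),
        String(request.cf?.colo ?? ""), // edge location (a string code, so a blob)
      ],
      doubles: [
        Date.now() - startTime, // duration in ms
      ],
      indexes: [request.url.split("?")[0]], // URL without the query string
    })
  }
)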
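
Using the raw URL as the index means every user ID or record ID in a path creates its own series, which makes per-route latency hard to aggregate. The following is a minimal sketch of one way to collapse such segments before writing the data point. `normalizePath` and the `:id` placeholder are illustrative names, not part of TanStack Start or the Workers runtime, and the rule for what counts as an ID (all digits or a UUID) is an assumption you would tune to your own routes.

```typescript
// Hypothetical helper: collapse high-cardinality path segments so that
// /api/users/123 and /api/users/456 aggregate under one metric key.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
const NUMERIC_RE = /^\d+$/

function normalizePath(rawUrl: string): string {
  // Strip scheme and host, then the query string and fragment
  const withoutOrigin = rawUrl.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/?#]+/i, "")
  const pathname = withoutOrigin.split(/[?#]/)[0] || "/"

  const normalized = pathname
    .split("/")
    .map((segment) =>
      NUMERIC_RE.test(segment) || UUID_RE.test(segment) ? ":id" : segment
    )
    .join("/")

  // Drop a trailing slash, except for the root path itself
  return normalized.length > 1 && normalized.endsWith("/")
    ? normalized.slice(0, -1)
    : normalized
}
```

With this in place, the index could be written as `indexes: [normalizePath(request.url)]`, so the latency query groups by route rather than by individual URL.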

Querying Metrics

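// Query Analytics Engine for dashboard data.
// The ANALYTICS binding only writes data points, so reads go through the
// Analytics Engine SQL API over HTTP. CF_ACCOUNT_ID and CF_API_TOKEN are
// assumed names for the account ID and an API token with analytics read access.
import { createServerFn } from "@tanstack/react-start"

export async function queryAnalytics(
  env: { CF_ACCOUNT_ID: string; CF_API_TOKEN: string },
  sql: string
) {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.CF_ACCOUNT_ID}/analytics_engine/sql`,
    { method: "POST", headers: { Authorization: `Bearer ${env.CF_API_TOKEN}` }, body: sql }
  )
  if (!response.ok) throw new Error(`Analytics query failed: ${response.status}`)
  const result = (await response.json()) as { data: Record<string, unknown>[] }
  return result.data
}

export const getApiLatency = createServerFn({ method: "GET" }).handler(
  async ({ context }) =>
    // index1 is the path written as the index by the metrics collector
    queryAnalytics(context.env, `
      SELECT
        index1 AS path,
        quantile(0.50)(double1) AS p50,
        quantile(0.95)(double1) AS p95,
        quantile(0.99)(double1) AS p99,
        count() AS requests
      FROM analytics_engine_table
      WHERE timestamp > now() - INTERVAL '1' DAY
      GROUP BY index1
      ORDER BY p99 DESC
      LIMIT 20
    `)
)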
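
To make the p50/p95/p99 columns concrete, here is a small local illustration of what a percentile is, using the nearest-rank definition. It is only a sketch for intuition: the query engine computes quantiles in SQL and may use a different estimation method, and the sample latencies below are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least a
// fraction q of all samples are less than or equal to it.
function percentile(samples: number[], q: number): number {
  if (samples.length === 0) throw new Error("no samples")
  if (q <= 0 || q > 1) throw new Error("q must be in (0, 1]")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil(q * sorted.length) // 1-based rank
  return sorted[rank - 1]
}

// Ten illustrative request latencies in milliseconds, with one slow outlier
const latenciesMs = [12, 15, 18, 20, 22, 25, 30, 45, 120, 900]

const p50 = percentile(latenciesMs, 0.5) // typical request
const p95 = percentile(latenciesMs, 0.95) // slow tail
const p99 = percentile(latenciesMs, 0.99) // worst-case tail
const mean = latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length
```

The mean of these samples is 120.7 ms even though half of the requests finished in 22 ms or less, which is why the dashboards in this guide track percentiles rather than averages.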

Tracing: Distributed Request Tracing

Setting Up a Tail Worker

// tail-worker.ts: deployed as a separate Worker
export default {
  async tail(events: TraceItem[]) {
    for (const event of events) {
      if (event.duration && event.event.request) {
        // Record or forward the trace for this request here
      }
    }
  },
}
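
As a sketch of what the tail handler might do with each event, the function below reduces a batch to compact per-request summaries that can then be written to storage. `TailItemLike` is a deliberately reduced, hypothetical view of the runtime's tail event type that lists only the fields this example reads; `summarizeTailItems` and `TraceSummary` are illustrative names.

```typescript
// Reduced, hypothetical view of a tail event: only the fields read below.
interface TailItemLike {
  scriptName: string | null
  outcome: string // for example "ok" or "exception"
  event: { request?: { url: string; method: string } } | null
  logs: { level: string; message: unknown[] }[]
  exceptions: { name: string; message: string }[]
}

interface TraceSummary {
  script: string
  method: string
  url: string
  outcome: string
  errorLogCount: number
  exceptionNames: string[]
}

// Keep HTTP-triggered events only and reduce each to a small record
function summarizeTailItems(items: TailItemLike[]): TraceSummary[] {
  const summaries: TraceSummary[] = []
  for (const item of items) {
    const request = item.event?.request
    if (!request) continue // skip scheduled, queue, and other non-HTTP events
    summaries.push({
      script: item.scriptName ?? "unknown",
      method: request.method,
      url: request.url,
      outcome: item.outcome,
      errorLogCount: item.logs.filter((log) => log.level === "error").length,
      exceptionNames: item.exceptions.map((exception) => exception.name),
    })
  }
  return summaries
}
```

Inside `tail`, the result of `summarizeTailItems(events)` could be filtered to non-`ok` outcomes before storing, which keeps trace volume in line with the short retention window suggested for traces.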

Injecting Trace Context

// Pass trace context from server function to downstream services
export const handleWebhook = createServerFn({ method: "POST" }).handler(
  async ({ request, context }) => {
    const traceId = crypto.randomUUID()
    // Parse the incoming payload so the event type can be logged
    const event = (await request.json()) as { type: string }
    // Pass trace ID to downstream calls
    await fetch("https://api.stripe.com/v1/events", {
      headers: {
        "X-Trace-Id": traceId,
      },
    })
    // Log with trace context
    context.logger.info("Webhook received", { traceId, eventType: event.type })
  }
)
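
The custom `X-Trace-Id` header works between services you control. If you want the ID to be understood by third-party tracing tools as well, one alternative (not what the example above uses) is the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below parses and formats that header; the function and type names are illustrative.

```typescript
interface TraceContext {
  traceId: string // 32 lowercase hex characters
  parentId: string // 16 lowercase hex characters
  sampled: boolean // lowest bit of the flags byte
}

// Version 00 of the header: 00-<trace-id>-<parent-id>-<flags>
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(header: string | null): TraceContext | null {
  if (!header) return null
  const match = TRACEPARENT_RE.exec(header.trim())
  if (!match) return null
  const [, traceId, parentId, flags] = match
  // All-zero IDs are defined as invalid by the specification
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`
}
```

A handler would parse the incoming header, keep the `traceId`, generate a fresh 16-character `parentId` for its own span, and send the formatted value on downstream `fetch` calls. Falling back to a new trace ID when parsing returns `null` keeps malformed headers from breaking the request.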

Alerting: When to Wake Someone Up

Alert Severity Levels

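| Level    | Response Time | Example                        |
|----------|---------------|--------------------------------|
| Critical | 15 minutes    | p99 latency > 5s for 5 minutes |
| High     | 1 hour        | Error rate > 1% for 10 minutes |
| Medium   | 24 hours      | Disk usage > 80%               |
| Low      | 7 days        | Deprecated API calls detected  |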

Alert Configuration

// server/monitoring/alerts.ts
import { createServerFn } from "@tanstack/react-start"
// Hypothetical helper, defined elsewhere: sends SQL to the Analytics Engine
// SQL API and resolves to the result rows. The ANALYTICS binding itself only
// writes data points, so reads cannot go through it.
import { queryAnalytics } from "./query"

export const checkAlerts = createServerFn({ method: "GET" }).handler(
  async ({ context }) => {
    const alerts: { severity: string; message: string }[] = []
    // 1. Check error rate (blob1 holds the log level; blobs are numbered from 1)
    const [errorRow] = await queryAnalytics(context.env, `
      SELECT count() AS total,
             countIf(blob1 = 'error') AS errors
      FROM analytics_engine_table
      WHERE timestamp > now() - INTERVAL '5' MINUTE
    `)
    const rate = Number(errorRow?.errors ?? 0) /
                 Math.max(Number(errorRow?.total ?? 1), 1)
    if (rate > 0.01) {
      alerts.push({
        severity: "high",
        message: `Error rate ${(rate * 100).toFixed(1)}% > 1%`,
      })
    }
    // 2. Check latency
    const [latencyRow] = await queryAnalytics(context.env, `
      SELECT quantile(0.99)(double1) AS p99
      FROM analytics_engine_table
      WHERE timestamp > now() - INTERVAL '5' MINUTE
    `)
    const p99 = Number(latencyRow?.p99 ?? 0)
    if (p99 > 5000) {
      alerts.push({
        severity: "critical",
        message: `p99 latency ${p99}ms > 5000ms`,
      })
    }
    return alerts
  }
)
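
The severity table describes conditions that hold "for 5 minutes", while a single aggregate over the last five minutes can be tripped by one short spike. The following sketch shows one way to require a sustained breach, assuming you store one reading per minute somewhere the alert check can read it. `isSustainedBreach`, `MetricSample`, and the sample values are illustrative.

```typescript
interface MetricSample {
  timestampMs: number
  value: number
}

// True only when the trailing window holds at least `minSamples` readings
// and every one of them exceeds the threshold.
function isSustainedBreach(
  samples: MetricSample[],
  threshold: number,
  windowMs: number,
  minSamples: number,
  nowMs: number
): boolean {
  const windowStart = nowMs - windowMs
  const recent = samples.filter(
    (s) => s.timestampMs > windowStart && s.timestampMs <= nowMs
  )
  if (recent.length < minSamples) return false // not enough evidence yet
  return recent.every((s) => s.value > threshold)
}

const minute = 60_000
const now = 10 * minute
// One p99 reading per minute for the last five minutes (made-up values)
const readings: MetricSample[] = [5200, 6100, 5400, 7000, 5100].map(
  (value, i) => ({ timestampMs: now - (4 - i) * minute, value })
)
const shouldPage = isSustainedBreach(readings, 5000, 5 * minute, 5, now)
```

Requiring a minimum sample count means a gap in data does not page anyone by itself; whether missing data should alert separately is a design choice this sketch leaves open.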

Dashboard: Visualizing Your Observability Data

The ideal SaaS observability dashboard has three views:

Operational View (Real-Time)

┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Request Rate │ Error Rate   │ p50 Latency  │ p99 Latency  │
│   1,234/min  │   0.02%      │    45ms      │    210ms     │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ Active Users │ Signups (24h)│ Churn (30d)  │ MRR          │
│   847        │   23         │    3.2%      │  $12,450     │
└──────────────┴──────────────┴──────────────┴──────────────┘

Business View (Weekly)

Cohort Retention Table
Signup Funnel (Visit → Signup → Subscribe → Active)
Feature Adoption Heatmap
Top User Actions

Debug View (On-Demand)

Request Trace Explorer
Error Log Stream
Worker CPU Profile
Database Query Performance

Automating Incident Response

// server/monitoring/incident.ts
export const handleIncident = createServerFn( method: "POST" ).handler(
  async ( data :  data:  severity: string; message: string  ) => 
    // 1. Log the incident
    await logIncident(data)
    // 2. Notify (only for high+ criticality)
    if (["high", "critical"].includes(data.severity)) 
      await sendSlackNotification(
        `[$data.severity.toUpperCase()] $data.message`
      )
      await sendEmail(
        to: "ops@tanstackship.com",
        subject: `[$data.severity] $data.message`,
      )
    
    // 3. Auto-remediate where possible
    if (data.message.includes("rate limit")) 
      // Scale up rate limit window
      console.log("Auto-remediation: rate limit exceeded")
    
  
)
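
If the alert check runs on a schedule, the same incident can reach the handler every few minutes and flood Slack and email. One way to limit that is a cooldown per alert key, sketched below. The names are illustrative, and the in-memory `Map` is an assumption made to keep the example self-contained: Worker isolates are short-lived, so a real deployment would need to keep this state in durable storage.

```typescript
// When each alert key last produced a notification (in-memory for the sketch)
const lastNotifiedAt = new Map<string, number>()

// Build a stable key so repeats of the same alert with different readings
// ("p99 latency 5200ms" vs "p99 latency 6100ms") are treated as one incident.
function alertKey(severity: string, message: string): string {
  return `${severity}:${message.replace(/\d+(\.\d+)?/g, "#")}`
}

// Returns true if a notification should be sent now, and records the send
function shouldNotify(key: string, nowMs: number, cooldownMs: number): boolean {
  const last = lastNotifiedAt.get(key)
  if (last !== undefined && nowMs - last < cooldownMs) {
    return false // still inside the cooldown window, suppress the duplicate
  }
  lastNotifiedAt.set(key, nowMs)
  return true
}
```

In the handler, the notification step could be wrapped in `if (shouldNotify(alertKey(data.severity, data.message), Date.now(), 10 * 60_000))`, where the ten-minute cooldown is an arbitrary starting point.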

Observability Budget

Allocate your observability resources wisely:

| Component                 | % of Observability Budget | Why                      |
|---------------------------|---------------------------|--------------------------|
| Request logging (sampled) | 20%                       | Core operational data    |
| Error/exception logging   | 30%                       | Highest debugging value  |
| Business metrics          | 25%                       | Revenue and growth       |
| Performance metrics       | 15%                       | User experience          |
| Distributed traces        | 10%                       | Debugging complex issues |
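
The "sampled" request logging in the table can be implemented in several ways. A minimal sketch, assuming you want every warning and error kept and a fixed fraction of routine entries: hash the request ID so that all log lines for one request share the same keep-or-drop decision. The FNV-1a hash is a technique chosen for this example, not something the source prescribes, and `shouldLog` is an illustrative name.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error"

// 32-bit FNV-1a hash. Operates on UTF-16 code units, which is sufficient
// for ASCII request IDs such as Ray IDs or UUIDs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5 // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0 // convert to an unsigned 32-bit integer
}

// Always keep warnings and errors; keep roughly `sampleRate` of the rest.
// The decision is deterministic per request ID.
function shouldLog(
  level: LogLevel,
  requestId: string,
  sampleRate: number
): boolean {
  if (level === "warn" || level === "error") return true
  return fnv1a(requestId) / 0x100000000 < sampleRate
}
```

Because the decision depends only on the request ID, a sampled-in request keeps its complete log trail, which is more useful for debugging than a random subset of lines from many requests.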

Production Observability Checklist

  • [ ] All requests produce structured JSON logs
  • [ ] Logs include request ID, user ID, and duration
  • [ ] Error logs include stack traces
  • [ ] All critical business events are tracked as metrics
  • [ ] Worker performance metrics collected (CPU, memory, subrequests)
  • [ ] Alerts configured for p99 latency > 5s and error rate > 1%
  • [ ] Tail worker captures trace data for sampled requests
  • [ ] Dashboard covers operational, business, and debug views
  • [ ] Logs have automated retention and archival policy
  • [ ] Access to logs is audited and role-restricted

Conclusion

Observability is not optional for a production SaaS application. The three pillars — logs, metrics, and traces — each serve a different purpose, and a mature observability practice needs all three.

For TanStack Start on Cloudflare Workers, the native observability tooling (Workers Analytics Engine and Tail Workers) provides a solid foundation without requiring third-party services. You can collect, query, and alert on operational data entirely within the Cloudflare ecosystem.

The key is to start simple: structured logs first, then metrics for the most critical business and performance indicators, then traces for debugging complex issues. Add alerting only when you have enough data to set meaningful thresholds.

For a production SaaS with built-in observability infrastructure, see tanstackship.com.
