Observability is the foundation of operational confidence in a SaaS application. Without it, you are debugging in the dark. This guide covers the three pillars of observability — logs, metrics, and traces — implemented for SaaS applications on Cloudflare Workers and TanStack Start, with practical patterns used at tanstackship.com for production monitoring, alerting, and root cause analysis.
The Three Pillars of Observability
| Pillar | What It Answers | Storage | Retention | Cost |
|---|---|---|---|---|
| Logs | “What happened?” | Workers Analytics Engine | 7-30 days | Low |
| Metrics | “What is the trend?” | Analytics Engine + Grafana | 3-12 months | Medium |
| Traces | “Why did it happen?” | Tail Workers | 1-7 days | Low (sampled) |
Logging: Structured Logging on the Edge
Log Format
// src/lib/logger.ts
type LogLevel = "debug" | "info" | "warn" | "error"

interface LogEntry {
  timestamp: string
  level: LogLevel
  message: string
  requestId: string
  userId?: string
  duration?: number
  error?: string
  stack?: string
  metadata?: Record<string, unknown>
}

function createLogger(request: Request) {
  // Prefer Cloudflare's ray ID so entries line up with Cloudflare's own request records
  const requestId = request.headers.get("cf-ray") ?? crypto.randomUUID()
  // Fields shared by every entry; the timestamp is taken at call time
  const base = () => ({ timestamp: new Date().toISOString(), requestId })
  return {
    info: (msg: string, meta?: Record<string, unknown>) =>
      log({ ...base(), level: "info", message: msg, metadata: meta }),
    warn: (msg: string, meta?: Record<string, unknown>) =>
      log({ ...base(), level: "warn", message: msg, metadata: meta }),
    error: (msg: string, error?: Error, meta?: Record<string, unknown>) =>
      log({
        ...base(),
        level: "error",
        message: msg,
        error: error?.message,
        stack: error?.stack,
        metadata: meta,
      }),
    getRequestId: () => requestId,
  }
}
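To show how a logger of this shape behaves in isolation, here is a minimal sketch that adds a minimum level and a pluggable sink so it can be exercised without a Worker. The names `createJsonLogger`, `minLevel`, and `sink` are illustrative assumptions, not part of the original logger.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error"

interface LogEntry {
  timestamp: string
  level: LogLevel
  message: string
  requestId: string
  metadata?: Record<string, unknown>
}

// Numeric ranks so a level can be compared against the configured minimum.
const LEVEL_RANK: Record<LogLevel, number> = { debug: 0, info: 1, warn: 2, error: 3 }

// Hypothetical factory: `sink` receives each serialized entry (console.log by default).
function createJsonLogger(
  requestId: string,
  minLevel: LogLevel = "info",
  sink: (line: string) => void = console.log,
) {
  const emit = (level: LogLevel, message: string, metadata?: Record<string, unknown>) => {
    if (LEVEL_RANK[level] < LEVEL_RANK[minLevel]) return // below the minimum: drop
    const entry: LogEntry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      requestId,
      metadata,
    }
    sink(JSON.stringify(entry)) // one JSON object per line
  }
  return {
    debug: (msg: string, meta?: Record<string, unknown>) => emit("debug", msg, meta),
    info: (msg: string, meta?: Record<string, unknown>) => emit("info", msg, meta),
    warn: (msg: string, meta?: Record<string, unknown>) => emit("warn", msg, meta),
    error: (msg: string, meta?: Record<string, unknown>) => emit("error", msg, meta),
  }
}
```

Injecting the sink keeps the formatting logic testable and lets the same logger write to the console, a buffer, or a storage binding.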
Sending Logs to Analytics Engine
// src/lib/logger.ts — writing to Workers Analytics Engine
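function log(entry: LogEntry) {
  // Write to Analytics Engine for queryable storage.
  // Assumes `context` (carrying the ANALYTICS dataset binding) is in scope for this request.
  context.env.ANALYTICS.writeDataPoint({
    blobs: [
      entry.level, // blob1
      entry.message, // blob2
      entry.requestId, // blob3
      entry.userId ?? "", // blob4
      entry.error ?? "", // blob5
      JSON.stringify(entry.metadata ?? {}), // blob6
    ],
    doubles: [entry.duration ?? 0], // double1
    indexes: [entry.requestId], // index1 (a data point takes a single index)
  })
  // Also output to Workers console for wrangler tail
  console.log(JSON.stringify(entry))
}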
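Analytics Engine names columns by position: the first blob is read back as `blob1`, the second as `blob2`, and so on. A small sketch like the following keeps the write order and the query column names in one place so they cannot drift apart. `LOG_BLOBS`, `blobColumn`, and `toBlobs` are hypothetical helper names introduced here for illustration.

```typescript
// Order in which log fields are written to `blobs`; the position fixes the column name.
const LOG_BLOBS = ["level", "message", "requestId", "userId", "error", "metadata"] as const
type LogBlob = (typeof LOG_BLOBS)[number]

// Column name for a field when querying: the first blob is `blob1`, not `blob0`.
function blobColumn(field: LogBlob): string {
  return `blob${LOG_BLOBS.indexOf(field) + 1}`
}

// Build the blobs array in the declared order, filling missing fields with "".
function toBlobs(values: Partial<Record<LogBlob, string>>): string[] {
  return LOG_BLOBS.map((field) => values[field] ?? "")
}
```

A query can then interpolate `blobColumn("level")` instead of hard-coding a column number.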
Metrics: What to Measure
SaaS Metrics Taxonomy
| Category | Metrics | How to Measure |
|---|---|---|
| Performance | Request latency (p50/p95/p99), response size, cache hit rate | Analytics Engine |
| Business | Signups, MRR, churn rate, conversion rate | Database queries |
| Infrastructure | Worker CPU time, memory, subrequest count | Workers Runtime API |
| User experience | LCP, CLS, INP, TTFB | Web Vitals API + Analytics |
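Latency percentiles are easy to misread, so a worked example may help. The sketch below uses the nearest-rank definition, which is one common choice; the storage engine's own quantile function may interpolate differently.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples")
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]")
  const sorted = [...samples].sort((a, b) => a - b) // numeric sort, input left untouched
  const rank = Math.ceil((p / 100) * sorted.length) // 1-based rank
  return sorted[rank - 1]
}

// Ten request durations in milliseconds, with two slow outliers.
const durations = [12, 15, 20, 22, 30, 45, 60, 90, 250, 1200]
console.log(percentile(durations, 50)) // 30
console.log(percentile(durations, 95)) // 1200
```

The median here is 30 ms while p95 is 1200 ms, which is why tail percentiles are tracked separately from the typical case.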
Collecting Performance Metrics
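// server/middleware/metrics.ts
import { createServerFn } from "@tanstack/react-start"

export const metricsMiddleware = createServerFn({ method: "GET" }).handler(
  async ({ request, context }) => {
    const startTime = Date.now()
    // ... handle the request here, so the duration below covers real work ...
    // Record metrics once the response is ready. writeDataPoint() returns
    // immediately and does not need to be awaited or wrapped in waitUntil().
    context.env.ANALYTICS.writeDataPoint({
      blobs: [
        request.url, // blob1
        request.method, // blob2
        String(context.response?.status ?? 500), // blob3
        String(request.cf?.colo ?? ""), // blob4: edge location code (a string, so not a double)
      ],
      doubles: [
        Date.now() - startTime, // double1: duration in ms
      ],
      indexes: [new URL(request.url).pathname], // index1: path without origin or query string
    })
  }
)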
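For the recorded duration to mean anything, it has to be taken after the handler has finished, including when it throws. The following is a minimal, framework-independent sketch of that pattern; `withTiming`, `RequestMetric`, and the injectable `now` clock are assumptions made for illustration, and `record` stands in for whatever writes the data point.

```typescript
interface RequestMetric {
  path: string
  status: number
  durationMs: number
}

// Hypothetical wrapper: runs `handler`, then reports one metric whether it returned or threw.
async function withTiming(
  path: string,
  handler: () => Promise<{ status: number }>,
  record: (metric: RequestMetric) => void,
  now: () => number = Date.now,
): Promise<{ status: number }> {
  const start = now()
  let status = 500 // assume failure until the handler returns a response
  try {
    const response = await handler()
    status = response.status
    return response
  } finally {
    // Runs on both the success and the error path, after the work is done.
    record({ path, status, durationMs: now() - start })
  }
}
```

Passing the clock as a parameter is only there to make the timing deterministic in tests; production code would rely on the default.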
Querying Metrics
// Query Analytics Engine for dashboard data.
// Note: the Workers binding itself only provides writeDataPoint(); `ANALYTICS.query`
// is assumed here to be a project wrapper around the Analytics Engine SQL API.
export const getApiLatency = createServerFn({ method: "GET" }).handler(
  async ({ context }) => {
    const result = await context.env.ANALYTICS.query({
      sql: `
        SELECT
          index1 AS path,
          quantile(0.50)(double1) AS p50,
          quantile(0.95)(double1) AS p95,
          quantile(0.99)(double1) AS p99,
          count() AS requests
        FROM analytics_engine_table
        WHERE timestamp > now() - INTERVAL '1' DAY
        GROUP BY index1
        ORDER BY p99 DESC
        LIMIT 20
      `,
    })
    return result.rows
  }
)
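Reading data back from Analytics Engine goes through Cloudflare's SQL API over HTTP rather than through the dataset binding. The sketch below only builds the request; `buildSqlApiRequest` is a hypothetical helper, and the account ID and API token are placeholders the caller must supply (for example from Worker secrets).

```typescript
interface SqlApiRequest {
  url: string
  init: { method: "POST"; headers: Record<string, string>; body: string }
}

// Build (but do not send) a request for the Analytics Engine SQL API.
// The SQL text is sent as the raw request body with a bearer token.
function buildSqlApiRequest(accountId: string, apiToken: string, sql: string): SqlApiRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: sql.trim(),
    },
  }
}
```

The result can be passed straight to `fetch(url, init)`; keeping construction separate from sending makes the query text easy to log and test.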
Tracing: Distributed Request Tracing
Setting Up a Tail Worker
// tail-worker.ts — deployed as a separate Worker
export default {
  async tail(events: TraceEvent[]) {
    for (const event of events) {
      if (event.duration && event.event.request) {
        // ...
      }
    }
  },
}
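As an illustration of what a Tail Worker might do with each item, the sketch below reduces one trace item to a compact summary that could be forwarded to a log sink. `TraceLike` is a deliberately minimal local type modelled on the items a tail handler receives; it declares only the fields used here and is an assumption, not the runtime's full type. `summarizeTrace` is likewise a hypothetical name.

```typescript
// Minimal local shape; the real runtime type carries more fields.
interface TraceLike {
  scriptName: string | null
  outcome: string
  eventTimestamp: number | null
  event: { request?: { url: string; method: string } } | null
  logs: { level: string; message: unknown }[]
  exceptions: { name: string; message: string }[]
}

interface TraceSummary {
  script: string
  method: string
  path: string
  outcome: string
  errorCount: number
}

// Returns null for items that are not HTTP requests (for example scheduled events).
function summarizeTrace(item: TraceLike): TraceSummary | null {
  const request = item.event?.request
  if (!request) return null
  const errorLogs = item.logs.filter((log) => log.level === "error").length
  return {
    script: item.scriptName ?? "unknown",
    method: request.method,
    path: new URL(request.url).pathname, // drop origin and query string
    outcome: item.outcome,
    errorCount: item.exceptions.length + errorLogs,
  }
}
```

Summarizing before forwarding keeps the volume sent to the sink small, which matters when only a sampled fraction of requests is meant to be retained.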
Injecting Trace Context
// Pass trace context from server function to downstream services
export const handleWebhook = createServerFn({ method: "POST" }).handler(
  async ({ request, context }) => {
    const traceId = crypto.randomUUID()
    // Parse the webhook payload so its type can be logged
    const event = (await request.json()) as { type: string }
    // Pass trace ID to downstream calls
    await fetch("https://api.stripe.com/v1/events", {
      headers: {
        "X-Trace-Id": traceId,
      },
    })
    // Log with trace context
    context.logger.info("Webhook received", { traceId, eventType: event.type })
  }
)
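A custom `X-Trace-Id` header works between services you control. If downstream systems speak the W3C Trace Context standard, the same idea can be carried in a `traceparent` header instead. The sketch below formats and parses that header; it is an alternative to the custom header above, not what the original code does, and the function names are illustrative.

```typescript
interface TraceContext {
  traceId: string // 32 lowercase hex characters
  parentId: string // 16 lowercase hex characters (the calling span)
  sampled: boolean
}

// Layout: version "00", trace ID, parent ID, then 2 hex digits of flags.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`
}

// Returns null for anything that is not a well-formed version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim())
  if (!match) return null
  const [, traceId, parentId, flags] = match
  // All-zero IDs are defined as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 }
}
```

A 32-character trace ID can be derived from `crypto.randomUUID()` by removing the hyphens, so an existing UUID-based trace ID is straightforward to reuse.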
Alerting: When to Wake Someone Up
Alert Severity Levels
| Level | Response Time | Example |
|---|---|---|
| Critical | 15 minutes | p99 latency > 5s for 5 minutes |
| High | 1 hour | Error rate > 1% for 10 minutes |
| Medium | 24 hours | Disk usage > 80% |
| Low | 7 days | Deprecated API calls detected |
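The "for 5 minutes" part of these examples is what separates a sustained problem from a momentary spike. A minimal sketch of that check, assuming one aggregated sample per minute, might look like this; `Sample` and `breachedFor` are hypothetical names.

```typescript
interface Sample {
  minute: number // minutes since epoch, one sample per minute
  value: number // e.g. p99 latency in ms, or error rate
}

// True only when every one of the last `forMinutes` minutes has a sample above `threshold`.
function breachedFor(
  samples: Sample[],
  threshold: number,
  forMinutes: number,
  nowMinute: number,
): boolean {
  for (let m = nowMinute - forMinutes + 1; m <= nowMinute; m++) {
    const sample = samples.find((s) => s.minute === m)
    // A missing minute or a value at or below the threshold breaks the streak.
    if (!sample || sample.value <= threshold) return false
  }
  return true
}
```

Treating missing minutes as "not breached" is a design choice made here to avoid paging on gaps in the data; a stricter policy could alert on missing data separately.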
Alert Configuration
// server/monitoring/alerts.ts
import { createServerFn } from "@tanstack/react-start"

// Assumes `ANALYTICS.query` wraps the Analytics Engine SQL API;
// the Workers binding itself only provides writeDataPoint().
export const checkAlerts = createServerFn({ method: "GET" }).handler(
  async ({ context }) => {
    const alerts: { severity: string; message: string }[] = []
    // 1. Check error rate (the log level is the first blob, so it is read as blob1)
    const errorRate = await context.env.ANALYTICS.query({
      sql: `
        SELECT count() as total,
          countIf(blob1 = 'error') as errors
        FROM analytics_engine_table
        WHERE timestamp > now() - INTERVAL '5' MINUTE
      `,
    })
    const rate = Number(errorRate.rows[0]?.errors ?? 0) /
      Math.max(Number(errorRate.rows[0]?.total ?? 1), 1)
    if (rate > 0.01) {
      alerts.push({
        severity: "high",
        message: `Error rate ${(rate * 100).toFixed(1)}% > 1%`,
      })
    }
    // 2. Check latency
    const latency = await context.env.ANALYTICS.query({
      sql: `
        SELECT quantile(0.99)(double1) as p99
        FROM analytics_engine_table
        WHERE timestamp > now() - INTERVAL '5' MINUTE
      `,
    })
    const p99 = Number(latency.rows[0]?.p99 ?? 0)
    if (p99 > 5000) {
      alerts.push({
        severity: "critical",
        message: `p99 latency ${p99}ms > 5000ms`,
      })
    }
    return alerts
  }
)
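The threshold logic is easier to test when it is separated from the queries. The sketch below takes already-aggregated numbers for a window and applies the same two thresholds (1% error rate, 5000 ms p99); `WindowStats` and `evaluateAlerts` are hypothetical names introduced for illustration.

```typescript
interface WindowStats {
  total: number // requests in the window
  errors: number // requests logged at level "error"
  p99Ms: number // p99 latency in milliseconds
}

interface Alert {
  severity: "high" | "critical"
  message: string
}

// Pure function: no I/O, so thresholds can be unit-tested with plain numbers.
function evaluateAlerts(stats: WindowStats): Alert[] {
  const alerts: Alert[] = []
  const rate = stats.total > 0 ? stats.errors / stats.total : 0 // no traffic means no error rate
  if (rate > 0.01) {
    alerts.push({ severity: "high", message: `Error rate ${(rate * 100).toFixed(1)}% > 1%` })
  }
  if (stats.p99Ms > 5000) {
    alerts.push({ severity: "critical", message: `p99 latency ${stats.p99Ms}ms > 5000ms` })
  }
  return alerts
}
```

With this split, the server function only has to run the queries and pass the resulting numbers in.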
Dashboard: Visualizing Your Observability Data
The ideal SaaS observability dashboard has three views:
Operational View (Real-Time)
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Request Rate │ Error Rate │ p50 Latency │ p99 Latency │
│ 1,234/min │ 0.02% │ 45ms │ 210ms │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ Active Users │ Signups (24h)│ Churn (30d) │ MRR │
│ 847 │ 23 │ 3.2% │ $12,450 │
└──────────────┴──────────────┴──────────────┴──────────────┘
Business View (Weekly)
- Cohort Retention Table
- Signup Funnel (Visit → Signup → Subscribe → Active)
- Feature Adoption Heatmap
- Top User Actions
Debug View (On-Demand)
- Request Trace Explorer
- Error Log Stream
- Worker CPU Profile
- Database Query Performance
Automating Incident Response
// server/monitoring/incident.ts
export const handleIncident = createServerFn({ method: "POST" }).handler(
  async ({ data }: { data: { severity: string; message: string } }) => {
    // 1. Log the incident
    await logIncident(data)
    // 2. Notify (only for high and critical severity)
    if (["high", "critical"].includes(data.severity)) {
      await sendSlackNotification(
        `[${data.severity.toUpperCase()}] ${data.message}`
      )
      await sendEmail({
        to: "ops@tanstackship.com",
        subject: `[${data.severity}] ${data.message}`,
      })
    }
    // 3. Auto-remediate where possible
    if (data.message.includes("rate limit")) {
      // Placeholder: this only logs; widening the rate limit window would go here
      console.log("Auto-remediation: rate limit exceeded")
    }
  }
)
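Because the severity arrives as a free-form string, it may be worth validating it before deciding who gets notified. The following sketch maps a severity to notification channels and falls back to logging only for unknown values; `channelsFor`, `isSeverity`, and the channel names are assumptions made for illustration.

```typescript
const SEVERITIES = ["low", "medium", "high", "critical"] as const
type Severity = (typeof SEVERITIES)[number]
type Channel = "log" | "slack" | "email"

// Type guard for untrusted input such as a request body.
function isSeverity(value: string): value is Severity {
  return (SEVERITIES as readonly string[]).includes(value)
}

// Every incident is logged; only high and critical ones notify people.
function channelsFor(severity: string): Channel[] {
  if (!isSeverity(severity)) return ["log"] // unknown input: record it, page nobody
  return severity === "high" || severity === "critical"
    ? ["log", "slack", "email"]
    : ["log"]
}
```

Keeping the routing in one function means the notification policy can change without touching the code that sends messages.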
Observability Budget
A reasonable starting split of the observability budget, weighted toward the data with the highest debugging value:
| Component | % of Observability Budget | Why |
|---|---|---|
| Request logging (sampled) | 20% | Core operational data |
| Error/exception logging | 30% | Highest debugging value |
| Business metrics | 25% | Revenue and growth |
| Performance metrics | 15% | User experience |
| Distributed traces | 10% | Debugging complex issues |
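Sampled request logging can be made deterministic by hashing the request ID, so every log line for a given request is either kept or dropped together. The sketch below uses a 32-bit FNV-1a hash for bucketing and always keeps warnings and errors; the function names and the choice of hash are assumptions, not something the original prescribes.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: small, deterministic, fine for bucketing IDs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5 // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // multiply by the FNV prime, keep 32 bits unsigned
  }
  return hash >>> 0
}

// sampleRate is the fraction of requests to keep, from 0 (none) to 1 (all).
function shouldLog(requestId: string, level: string, sampleRate: number): boolean {
  if (level === "error" || level === "warn") return true // never sample away failures
  return fnv1a(requestId) / 0x100000000 < sampleRate // map the hash into [0, 1)
}
```

Hashing rather than calling `Math.random()` means the decision can be reproduced later from the request ID alone.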
Production Observability Checklist
- [ ] All requests produce structured JSON logs
- [ ] Logs include request ID, user ID, and duration
- [ ] Error logs include stack traces
- [ ] All critical business events are tracked as metrics
- [ ] Worker performance metrics collected (CPU, memory, subrequests)
- [ ] Alerts configured for p99 latency > 5s and error rate > 1%
- [ ] Tail worker captures trace data for sampled requests
- [ ] Dashboard covers operational, business, and debug views
- [ ] Logs have automated retention and archival policy
- [ ] Access to logs is audited and role-restricted
Conclusion
Observability is not optional for a production SaaS application. The three pillars — logs, metrics, and traces — each serve a different purpose, and a mature observability practice needs all three.
For TanStack Start on Cloudflare Workers, the native observability tooling (Workers Analytics Engine and Tail Workers) provides a solid foundation without requiring third-party services. You can collect, query, and alert on operational data entirely within the Cloudflare ecosystem.
The key is to start simple: structured logs first, then metrics for the most critical business and performance indicators, then traces for debugging complex issues. Add alerting only when you have enough data to set meaningful thresholds.
For a production SaaS with built-in observability infrastructure, see tanstackship.com.
Related Resources
- SaaS Security Best Practices
- Core Web Vitals Optimization Guide
- Real User Monitoring: Measuring Web Performance in Production
- Cloudflare D1: Production Database on the Edge

