Observability Without Cost Telemetry Is Broken Engineering

I’ve run production systems where we could tell you the p99 latency of any endpoint down to the microsecond, but couldn’t explain why our AWS bill jumped $40,000 in a single weekend. That disconnect — between operational visibility and financial reality — is where most observability strategies quietly fail.

The orthodox telemetry trinity (metrics, logs, traces) gives you performance. Error rates. Request volumes. Latency distributions that let you argue about whether 250 ms is acceptable for a search API. What it won’t tell you is that the microservice you just optimized for speed now costs $0.03 per invocation instead of $0.002, and at scale, that rounding error becomes someone’s quarterly budget.

Cost isn’t an operational afterthought. It’s a signal as essential as CPU saturation or memory pressure, yet we’ve architected it out of the feedback loop engineers actually use.

The Financial Lens Nobody Built Into Your Stack

New Relic’s engineering team — people who sell observability platforms — hit this wall internally. They had comprehensive telemetry across their infrastructure. They could profile individual functions. But when finance asked why cloud spend had grown 30% quarter over quarter, the observability data couldn’t answer. The metrics existed in parallel universes: CloudWatch showing instance hours, billing consoles showing dollar amounts, APM showing request counts. No correlation. No causality.

Their solution was to build what they called a “financial lens” — essentially cost telemetry as a first-class dimension in their existing monitoring stack. Not a separate FinOps dashboard that product engineers ignore. Not a monthly report from the infrastructure team. Real-time cost-per-operation data living alongside latency and error rates in the same Grafana panels developers already watch during incidents.

The mechanism: they correlated resource consumption metrics (CPU-hours, I/O operations, network egress) with their cloud provider’s pricing API rates. Every service emitted not just “processed 10,000 requests” but “processed 10,000 requests at $0.08 total cost.” When they shipped a feature, the deployment dashboard showed performance and financial impact simultaneously.
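
In practice the pattern is small. Here's a minimal sketch in Python, assuming a Prometheus-style metrics pipeline; the rate card values, metric names, and the `record_request` helper are illustrative, not anything New Relic has published:

```python
# Minimal sketch of emitting cost alongside request metrics.
# Rate card values and metric names are illustrative.
from prometheus_client import Counter

REQUESTS = Counter("service_requests_total", "Requests processed", ["service"])
COST_USD = Counter("service_cost_usd_total", "Estimated cost in USD", ["service", "component"])

# Hypothetical rate card, refreshed periodically from the provider's pricing API.
RATES = {
    "cpu_second": 0.0000125,   # $ per vCPU-second
    "gb_egress": 0.09,         # $ per GB network egress
    "io_request": 0.0000002,   # $ per I/O operation
}

def record_request(service: str, cpu_seconds: float, egress_gb: float, io_ops: int) -> None:
    """Record one request and its estimated cost, so dashboards can show both."""
    REQUESTS.labels(service=service).inc()
    COST_USD.labels(service=service, component="compute").inc(cpu_seconds * RATES["cpu_second"])
    COST_USD.labels(service=service, component="network").inc(egress_gb * RATES["gb_egress"])
    COST_USD.labels(service=service, component="io").inc(io_ops * RATES["io_request"])
```

Cost per request over any window is then just the ratio of the two counters, which is exactly the number you want sitting next to latency on a deployment dashboard.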

That visibility changed behavior. Engineers started evaluating architectural choices through a cost lens without needing MBA training. “Should we cache this aggressively?” became answerable with data: cache infrastructure costs $X/month, API calls saved cost $Y/month, net impact is measurable, not theoretical. They cut cost per GB by 60%, not through grand optimization initiatives but through hundreds of small decisions made visible.

Unit Economics for Infrastructure

The core pattern is treating cost as unit economics. Not aggregate spend — that’s finance’s job. Unit cost: dollars per request, dollars per active user, dollars per gigabyte processed. SLIs you can reason about operationally.

Consider a video transcoding service. Traditional observability tells you it processes 50 jobs per minute with a 2% error rate and an 8-second p95 completion time. Add cost telemetry and suddenly you know: each job costs $0.14 in compute, $0.03 in storage I/O, $0.02 in network egress. Now, when a PM wants to add 4K support, you can model the financial impact before writing code. When transcoding queues back up and you’re deciding whether to scale horizontally, you know exactly what each additional instance costs per job.
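
The modeling itself is trivial once the per-job numbers exist. A quick sketch using the figures above; the 4K compute multiplier is a hypothetical planning input, not a measurement:

```python
# Unit economics for the transcoding example above.
# Per-job cost components come from the text; the 4K compute multiplier
# is a made-up planning input you would validate with a prototype.
jobs_per_minute = 50
cost_per_job = {"compute": 0.14, "storage_io": 0.03, "egress": 0.02}

base_cost_per_job = sum(cost_per_job.values())               # $0.19 per job
daily_cost = base_cost_per_job * jobs_per_minute * 60 * 24   # ~$13,680/day at steady load

assumed_4k_compute_multiplier = 4.0  # hypothetical: 4K costs roughly 4x the compute of 1080p
projected_4k_cost_per_job = (
    cost_per_job["compute"] * assumed_4k_compute_multiplier
    + cost_per_job["storage_io"]
    + cost_per_job["egress"]
)
print(f"today: ${base_cost_per_job:.2f}/job, projected 4K: ${projected_4k_cost_per_job:.2f}/job")
```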

The tooling exists, but integration is fractured. Modern APMs — Datadog, Dynatrace, New Relic — can ingest cost metrics, but they’re not wired this way out of the box. You’re stitching together AWS Cost Explorer APIs, CloudWatch metrics, and your service telemetry through custom exporters. Kubernetes environments have Kubecost or OpenCost, which instrument pods and namespaces with spend data. Service meshes like Istio can propagate cost-center tags through request headers. But none of this happens automatically. It’s plumbing you build.

For AWS, you’re typically scraping Cost and Usage Reports — those massive gzipped CSVs that land in S3 hourly. You parse them, join on resource tags (which means you’ve tagged everything — a discipline most organizations lack), then push cost metrics into whatever time-series database feeds your dashboards. The latency is real: CUR data lags by hours, sometimes a day. Not ideal for incident response, but sufficient for architectural decisions and capacity planning.
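
A hedged sketch of that plumbing, assuming the default CUR CSV schema and a `team` cost-allocation tag; the bucket, key, and tag column are illustrative and depend on your report definition:

```python
# Sketch: aggregate CUR line items by team tag, then push them into a
# time-series store. Column names assume the default CUR schema; adjust
# to your report definition.
import csv
import gzip
import io
from collections import defaultdict

import boto3

s3 = boto3.client("s3")

def cost_by_team(bucket: str, key: str) -> dict[str, float]:
    """Sum unblended cost per team tag for one CUR report object."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = io.BytesIO(obj["Body"].read())
    totals: dict[str, float] = defaultdict(float)
    with gzip.open(body, mode="rt", newline="") as f:
        for row in csv.DictReader(f):
            team = row.get("resourceTags/user:team") or "untagged"
            totals[team] += float(row.get("lineItem/UnblendedCost") or 0.0)
    return dict(totals)

# From here you push the totals into Prometheus (via a pushgateway or a
# custom exporter), InfluxDB, or whatever feeds your dashboards.
```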

Netflix’s Platform DSE team built exactly this with their Cloud Efficiency platform. They wanted service owners—not a centralized FinOps team—to understand their financial footprint. So they constructed a data pipeline that attributed every dollar of AWS spend to a specific service owner through propagated tags. Their engineering culture expects teams to operate within cost constraints the same way they operate within latency budgets. When a deployment increases spend by 15%, the responsible team knows immediately, sees it in their dashboards, and adjusts before it becomes a runaway problem.

Where the Pattern Fractures

The anti-pattern I see most often is siloed visibility. Finance gets billing dashboards. SREs get operational dashboards. Developers get APM traces. Nobody sees the intersection where cost and performance influence each other.

You debug a performance issue — say, slow database queries. The fix is to add an index. Query time drops from 800 ms to 40 ms. Victory. Except the database is now using 30% more storage for that index, and your storage tier bills by the gigabyte-month. If you’re on a flat-rate hosting plan, maybe that cost is absorbed. If you’re on Aurora with per-I/O-request billing, or Cosmos DB with request-unit pricing, you’ve just traded latency for dollars. Without cost telemetry, you won’t notice until the bill arrives.

Or consider autoscaling. Your HPA (Horizontal Pod Autoscaler) kicks in at 70% CPU utilization. Operationally sound. But if your pods are running on Spot instances with wildly variable pricing, or if they’re using EBS volumes that bill separately, scaling up might cost 3× during peak pricing windows versus off-peak. Cost-aware autoscaling would factor in current instance pricing from the provider’s rate card, not just CPU thresholds. Some teams at AWS do this internally — their autoscaling logic queries Spot pricing APIs and adjusts thresholds dynamically. Most of us don’t have that sophistication.
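
For the rest of us, even a crude version of the idea is useful. A sketch, assuming EC2 Spot instances and a hypothetical on-demand reference price; the threshold policy is illustrative, not anyone's production logic:

```python
# Sketch: nudge a CPU scale-out threshold based on current Spot pricing.
# The on-demand reference price and the threshold policy are illustrative.
import datetime

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ON_DEMAND_PRICE = 0.17  # $/hour, hypothetical reference for the instance type

def current_spot_price(instance_type: str = "m5.xlarge") -> float:
    """Cheapest Spot price seen for this instance type over the last hour."""
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
    )
    prices = [float(p["SpotPrice"]) for p in resp["SpotPriceHistory"]]
    return min(prices) if prices else ON_DEMAND_PRICE

def cpu_scale_out_threshold() -> int:
    """Scale earlier when capacity is cheap, later when it is expensive."""
    ratio = current_spot_price() / ON_DEMAND_PRICE
    if ratio < 0.4:
        return 60   # cheap capacity: scale out sooner
    if ratio > 0.8:
        return 85   # expensive capacity: tolerate more load first
    return 70       # the default threshold from the HPA example above
```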

Alerting without cost dimensions misses failure modes. Your error rate is fine. Latency is stable. But egress costs just doubled because a misconfigured service is downloading the same 200 GB dataset on every request instead of caching it. Traditional observability sees nothing wrong — the system is performing as designed. Cost telemetry would trigger immediately: “egress cost per request exceeded threshold.”

Building It on Monday Morning

Start with tagging discipline. If you can’t attribute infrastructure to teams, services, or cost centers through tags, none of this works. Every resource — EC2 instances, S3 buckets, Lambda functions, GCS buckets — needs consistent, hierarchical tags: team name, service name, environment, and cost center if your organization uses them. This is boring work. It’s also foundational. Netflix has automated tag enforcement through their Spinnaker pipelines; resources without proper tags can’t deploy.
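
Short of pipeline-level enforcement, even a scheduled audit surfaces the gap. A minimal sketch against AWS, assuming a hypothetical four-tag schema:

```python
# Sketch: audit EC2 instances for a required tag set. The schema
# (team, service, env, cost-center) is illustrative; adapt to yours.
import boto3

REQUIRED_TAGS = {"team", "service", "env", "cost-center"}

def untagged_instances(region: str = "us-east-1") -> list[str]:
    """Return instance IDs missing one or more required tags."""
    ec2 = boto3.client("ec2", region_name=region)
    missing: list[str] = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"] for t in inst.get("Tags", [])}
                if not REQUIRED_TAGS <= tags:
                    missing.append(inst["InstanceId"])
    return missing
```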

Next, instrument your observability pipeline to accept cost metrics. If you’re using Prometheus, you need exporters that scrape billing data. For AWS, YACE (Yet Another CloudWatch Exporter) gets CloudWatch metrics into Prometheus format, but pulling Cost Explorer data typically means a small custom exporter against its API. For GCP, similar exporters exist around its billing export. For Kubernetes, deploy Kubecost or OpenCost — both project cost data at the pod, namespace, and label level. These tools use the cloud provider’s pricing API to estimate costs in near real time (within minutes, not days).
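
If you do end up writing the Cost Explorer piece yourself, it isn't much code. A minimal sketch, assuming a `team` cost-allocation tag and the `prometheus_client` library; the metric name and port are arbitrary, and Cost Explorer data is daily-granularity and delayed, so treat it as a planning signal rather than an incident one:

```python
# Sketch of a tiny Cost Explorer -> Prometheus exporter.
import datetime
import time

import boto3
from prometheus_client import Gauge, start_http_server

ce = boto3.client("ce")
DAILY_COST = Gauge("aws_daily_cost_usd", "Yesterday's unblended cost by team tag", ["team"])

def refresh() -> None:
    today = datetime.date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(today - datetime.timedelta(days=1)), "End": str(today)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        # Tag group keys come back as "team$<value>"; empty value means untagged.
        team = group["Keys"][0].split("$", 1)[-1] or "untagged"
        DAILY_COST.labels(team=team).set(float(group["Metrics"]["UnblendedCost"]["Amount"]))

if __name__ == "__main__":
    start_http_server(9109)  # expose /metrics for Prometheus to scrape
    while True:
        refresh()
        time.sleep(3600)
```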

Then build SLIs: “cost per transaction” for user-facing services, “cost per terabyte processed” for data pipelines, “cost per build” for CI/CD systems. Treat these like any other service-level indicator: track them, visualize them, alert on anomalies. When cost per transaction spikes 40%, that’s as much an incident as when error rate spikes 40%.
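
The SLI itself is just a ratio of two series you're already collecting. A sketch, with a made-up SLO value:

```python
# Sketch: cost-per-transaction SLI from the request and cost counters
# emitted earlier. The SLO value is hypothetical.
COST_PER_TXN_SLO = 0.002  # $ per transaction, illustrative budget

def cost_per_transaction(cost_usd_window: float, transactions_window: float) -> float:
    """Cost divided by transactions over the same time window."""
    return cost_usd_window / max(transactions_window, 1.0)

def breaches_slo(cost_usd_window: float, transactions_window: float) -> bool:
    return cost_per_transaction(cost_usd_window, transactions_window) > COST_PER_TXN_SLO
```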

The harder part is propagating context. For cost attribution to work across distributed systems, you need request-scoped metadata that flows through every hop. Istio and Linkerd can inject custom headers that carry cost-center or team identifiers. OpenTelemetry context propagation can include custom attributes — you’re essentially treating cost metadata as distributed tracing baggage. When a request hits your API gateway, tag it with the customer account ID. When it spawns background jobs, propagate that tag. When those jobs write to S3 or query DynamoDB, the cost of those operations is attributed back to the originating customer or feature.
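
With OpenTelemetry, the mechanics look roughly like this. A sketch using the baggage API; the `cost.center` key is an internal convention you would choose, not a standard attribute:

```python
# Sketch: carry a cost attribution key as OpenTelemetry baggage.
from opentelemetry import baggage, context
from opentelemetry.baggage.propagation import W3CBaggagePropagator

propagator = W3CBaggagePropagator()

def tag_request(customer_account_id: str):
    """At the API gateway: attach the attribution key to the current context.
    Returns a token to detach when the request finishes."""
    ctx = baggage.set_baggage("cost.center", customer_account_id)
    return context.attach(ctx)

def outgoing_headers() -> dict:
    """When calling a downstream service or enqueueing a job, propagate it."""
    headers: dict[str, str] = {}
    propagator.inject(headers)
    return headers

def attribution_from_headers(headers: dict) -> str:
    """In the downstream worker: recover the attribution key."""
    ctx = propagator.extract(headers)
    return str(baggage.get_baggage("cost.center", ctx) or "unattributed")
```

Downstream, the recovered key becomes a label on whatever cost metric the worker emits, which is what closes the loop back to a specific customer or feature.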

Cost anomaly detection is underutilized. Most teams wait for the monthly bill. Build statistical baselines instead: “This service typically costs $200–$250 per day; today it’s at $780 by noon.” Tools like CloudHealth and Apptio offer this, but you can roll your own with Prometheus recording rules and alerting. Track rolling 7-day cost averages, set thresholds at two or three standard deviations, and alert when exceeded. Then route cost alerts to the same incident response channels as performance alerts. A runaway Kubernetes job churning through compute is an operational incident, not a finance problem to reconcile later.
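
Rolling your own really is a few lines once daily totals exist. A sketch of the baseline check, reusing the numbers from the example above:

```python
# Sketch: rolling-baseline cost anomaly check. Daily totals would come
# from your cost exporter; the 3-sigma threshold mirrors the text above.
from statistics import mean, stdev

def cost_anomaly(daily_costs_last_7d: list[float], cost_so_far_today: float,
                 fraction_of_day_elapsed: float, n_sigma: float = 3.0) -> bool:
    """Flag if today's projected spend is an outlier versus the 7-day baseline."""
    projected_today = cost_so_far_today / max(fraction_of_day_elapsed, 0.05)
    baseline = mean(daily_costs_last_7d)
    spread = stdev(daily_costs_last_7d) or baseline * 0.05  # floor for flat baselines
    return projected_today > baseline + n_sigma * spread

# The example from the text: a $200-$250/day baseline, $780 by noon.
history = [210, 225, 240, 205, 250, 230, 220]
print(cost_anomaly(history, cost_so_far_today=780, fraction_of_day_elapsed=0.5))  # True
```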

Trade-Offs and Honest Limitations

This approach adds complexity. You’re maintaining additional exporters, managing more metrics, and joining datasets from different sources with different latencies. Billing APIs are slow and sometimes incomplete — AWS CUR is comprehensive but delayed; the Cost Explorer API is faster but less granular. You’re approximating, not calculating to the penny. For reserved instances and savings plans, attribution gets messy; the discount applies at the account level, but you’re trying to attribute at the service level.

Smaller teams might not have the scale to justify this overhead. If your entire infrastructure costs $2,000/month, the engineering time to build cost telemetry probably isn’t worth it. Just watch the bill manually. This pattern matters when spend is five or six figures monthly, when architectural decisions meaningfully move the budget, and when multiple teams share infrastructure and need fair attribution.

There’s also a cultural shift. Engineers accustomed to treating performance as the only constraint resist cost-aware design. “Optimize for correctness and speed; let finance worry about the bill.” That worked when infrastructure was CapEx and relatively static. It doesn’t work in cloud environments where every API call has a price and scaling decisions are made in milliseconds by automated systems. Cost telemetry doesn’t mean nickel-and-diming every decision; it means making informed trade-offs.

What Gets Productized

SaaS platforms have emerged around this gap. CloudZero packages “cost intelligence” as a service — they ingest your cloud billing data, correlate it with your observability metrics, and provide dashboards showing unit economics. Datadog now offers cloud cost monitoring that integrates with their existing APM, so you get cost per service alongside latency per service in one interface. Apptio and CloudHealth have pivoted from finance-focused FinOps tools toward real-time engineering integrations.

For consultancies, this is ripe territory. Most organizations know their cloud bill is too high but can’t pinpoint why. An engagement that instruments their observability stack with cost telemetry, builds dashboards for engineering teams, and trains SREs on cost-aware operations is sellable. It’s not just spreadsheets and recommendations — it’s engineering work that changes system behavior.

Internal platforms become competitive advantages. Netflix doesn’t buy this capability; they built it because their scale and culture demand it. New Relic built it as a byproduct of their own infrastructure challenges. For companies at that scale, cost observability isn’t optional — it’s how you prevent runaway spending from outpacing revenue growth.

The Synthesis Nobody Wants to Hear

Observability without cost visibility is incomplete. You’re flying with one instrument panel dark. The irony is that cost is often the most actionable signal. Performance optimization requires deep technical changes, often with uncertain payoff. Cost optimization frequently has clear ROI and deterministic outcomes. “We’re spending $15,000/month on this RDS instance; migrating to Aurora Serverless would cost $4,000/month with acceptable performance trade-offs.” That’s a decision you can make Monday morning if you have the visibility to make it.

But most organizations won’t prioritize this until the pain is acute. A surprise six-figure bill. A CFO asking pointed questions about cloud efficiency. A board presentation where margins are squeezed by infrastructure spend. By then, you’re fixing the problem reactively, under pressure, with incomplete data.

Build cost telemetry before you need it. Instrument it like you instrument performance. Alert on it like you alert on errors. Make engineers responsible for the financial impact of their architectural choices — not through bureaucratic approval processes, but through transparent data in the tools they already use. That’s when observability actually observes the system as it exists: performing operations that cost money, where both dimensions matter equally.
