A Framework for Securing Open-Source Observability at the Edge

The Edge Observability Security Challenge

Deploying an open-source observability solution to distributed retail edge locations creates a fundamental security challenge. With thousands of locations processing sensitive data such as payments and customers’ personally identifiable information (PII), every telemetry component running on the edge becomes a potential entry point for attackers. Edge environments have limited physical security, bandwidth that is shared with business-critical application traffic, and no on-site technical staff for incident response.

Traditional centralized monitoring security models do not fit these conditions because they assume abundant resources, dedicated security teams, and controlled physical environments. None of these exists at the edge.

This article explores how to secure an observability framework built on OpenTelemetry (OTel), a Cloud Native Computing Foundation (CNCF) project. It covers metrics, distributed tracing, and logging through Fluent Bit and Fluentd.

Securing OTel Metrics

Mutual Transport Layer Security (TLS)

Metrics are secured through mutual TLS (mTLS) authentication, where both the client and the server must prove their identity with certificates before communication can be established. This ensures that only trusted systems exchange telemetry. Unlike traditional Prometheus deployments that expose unauthenticated Hypertext Transfer Protocol (HTTP) endpoints for every service, OTel’s push model allows us to require mTLS for all connections to the collector (see Figure 1).

Figure 1: Multi-stage security through PII removal, mTLS communication, and 95% volume reduction

Security configuration, otel-config.yaml

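receivers:
  # Server-side TLS only: the collector presents server.crt; clients are not verified.
  otlp:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          cert_file: server.crt
          key_file: server.key
  # Mutual TLS: client_ca_file makes the collector require and verify client certificates.
  # Both receivers bind the same endpoint, so enable only one of them in a pipeline.
  otlp/mtls:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          client_ca_file: client.pem
          cert_file: server.crt
          key_file: server.key

exporters:
  # Client side of mTLS: ca_file verifies the server; cert_file and key_file identify this client.
  otlp:
    endpoint: myserver.local:55690
    tls:
      ca_file: ca.crt
      cert_file: client.crt
      key_file: client-tss2.key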
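The collector itself is configured through the YAML above. To illustrate what the client_ca_file setting changes conceptually, the following Python sketch builds a server-side TLS context that refuses any client without a certificate. The function name and its optional parameters are hypothetical and are not part of OTel; the file paths are only loaded when supplied.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server context that requires a client certificate (mTLS)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This is the step that turns one-way TLS into mutual TLS:
    # the handshake fails unless the client presents a trusted certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # Server identity, analogous to cert_file / key_file in the receiver.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # CA used to validate client certificates, analogous to client_ca_file.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx
```

Without the verify_mode line, the same context would still encrypt traffic but would accept any client, which is the difference between the otlp and otlp/mtls receivers.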

Multi-Stage PII Removal for Metrics

Metrics often capture sensitive data by accident through labels and attributes. A customer identifier (ID) in a label, or a credit card number in a database query attribute, can turn compliant metrics into a regulatory violation. Multi-stage PII removal addresses this problem with defense in depth at the data level.

Stage 1: Application-level filtering.

The first stage happens at the application level, where developers use OTel Software Development Kit (SDK) instrumentation that hashes user identifiers with the SHA-256 algorithm before creating metrics. Uniform Resource Locators (URLs) are scrubbed of query parameters such as tokens and session IDs before they become span attributes.
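A minimal sketch of this application-level stage, using only the Python standard library, might look like the following. The helper names and the SENSITIVE_PARAMS set are illustrative assumptions; a real service would call helpers like these before passing values to the OTel SDK.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameter names treated as sensitive.
SENSITIVE_PARAMS = {"token", "access_token", "session_id", "sessionid", "api_key"}


def hash_user_id(user_id: str) -> str:
    """Return the hex SHA-256 digest of a user identifier."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def scrub_url(url: str) -> str:
    """Drop sensitive query parameters and keep the rest of the URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, scrub_url("https://shop.example/cart?item=42&token=abc123") keeps the item parameter and drops the token, so the value stored in telemetry no longer contains a credential.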

Stage 2: Collector-level processing.

The second stage occurs in the OTel Collector’s attributes processor. It implements three patterns: complete deletion for high-risk PII, one-way hashing of identifiers using SHA-256 with a cryptographic salt, and regex-based cleanup of complex data.
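In production these three patterns are expressed in the collector's configuration rather than in application code. The Python sketch below only models their effect on a dictionary of attributes so that the behavior is easy to see. The key names, the salt value, and the card-like regex are all assumptions chosen for illustration.

```python
import hashlib
import re

# All values below are illustrative assumptions, not the article's real rules.
SALT = b"example-salt"                          # would come from a secret store
DELETE_KEYS = {"user.email", "card.number"}     # pattern 1: delete outright
HASH_KEYS = {"customer.id"}                     # pattern 2: salted one-way hash
CARD_LIKE = re.compile(r"\b\d{13,19}\b")        # pattern 3: regex cleanup


def process_attributes(attrs: dict) -> dict:
    """Apply delete, hash, and regex-mask rules to a dict of attributes."""
    cleaned = {}
    for key, value in attrs.items():
        if key in DELETE_KEYS:
            continue  # high-risk PII never leaves this stage
        if key in HASH_KEYS:
            digest = hashlib.sha256(SALT + str(value).encode("utf-8"))
            cleaned[key] = digest.hexdigest()
        elif isinstance(value, str):
            cleaned[key] = CARD_LIKE.sub("[REDACTED]", value)
        else:
            cleaned[key] = value
    return cleaned
```

Hashing keeps identifiers usable for grouping and correlation, because the same input always maps to the same digest, while the salt makes it harder to reverse the digest with a precomputed table.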

Stage 3: Backend-level scanning.

The third stage provides backend-level scanning where centralized systems perform data loss prevention (DLP) scanning to detect any PII that reached storage, triggering alerts for immediate remediation. When the backend scanner detects PII, it generates an alert indicating the edge filters need updating, creating a feedback loop that continuously improves protection.
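The backend scan and its feedback loop can be sketched as follows. This is a simplified stand-in for a real DLP product: it checks stored text for email addresses and for digit runs that pass the Luhn checksum, and returns findings that an alerting system could route back to the owners of the edge filters. The Finding type and the function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_CANDIDATE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


@dataclass
class Finding:
    record_id: str
    kind: str


def scan_record(record_id: str, text: str) -> List[Finding]:
    """Return one finding per PII category detected in a stored record."""
    findings = []
    if EMAIL.search(text):
        findings.append(Finding(record_id, "email"))
    if any(luhn_valid(m.group()) for m in CARD_CANDIDATE.finditer(text)):
        findings.append(Finding(record_id, "card_number"))
    return findings
```

Any non-empty result means an earlier stage let something through, which is the signal that the edge filters need a new rule.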

Aggressive Metric Filtering

Security is not only about encryption and authentication, but also about removing unnecessary data. Transmitting less data reduces the attack surface, minimizes exposure windows, and makes anomaly detection easier. Hundreds of metrics may be available out of the box, but filtering and forwarding only the metrics that are needed can reduce metric volume by up to 95%. This saves compute resources and network bandwidth and reduces management overhead.
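The allowlist idea and the resulting volume arithmetic can be sketched in a few lines. The metric names in ALLOWED_METRICS are hypothetical examples; the point is that anything not explicitly listed is dropped by default.

```python
# Hypothetical allowlist; everything else is dropped by default.
ALLOWED_METRICS = {
    "http.server.duration",
    "pos.transaction.count",
    "system.cpu.utilization",
}


def filter_metrics(metric_names):
    """Keep only metrics that appear on the allowlist."""
    return [name for name in metric_names if name in ALLOWED_METRICS]


def reduction_percent(before: int, after: int) -> float:
    """Percentage of metric series removed by filtering."""
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before
```

If 20 series arrive and only one is on the allowlist, reduction_percent(20, 1) evaluates to 95.0, which matches the order of reduction described above.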

Resource Limits as Security Controls

The OTel Collector sets strict resource limits that prevent denial-of-service attacks.

Resource	Limit	Protection against
Memory	500MB hard cap	Out-of-memory attacks
Rate limiting	1,000 spans/sec/service	Telemetry flooding attacks
Connections	100 concurrent streams	Connection exhaustion

These limits ensure that even when an attack happens, the collector maintains stable operation and continues to collect required telemetry from applications.
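The per-service rate limit in the table can be modeled with a token bucket. The sketch below is not the collector's own implementation; it is an assumed, simplified model in which each service gets its own bucket that refills at 1,000 tokens per second. The class name and the injectable clock are hypothetical conveniences that make the behavior testable.

```python
import time


class SpanRateLimiter:
    """Token bucket per service: a flood from one service cannot starve others."""

    def __init__(self, rate_per_sec=1000.0, burst=1000.0, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._tokens = {}   # service name -> tokens currently available
        self._last = {}     # service name -> time of last refill

    def allow(self, service: str) -> bool:
        """Return True if one more span from this service may be accepted."""
        now = self.clock()
        tokens = self._tokens.get(service, self.burst)
        last = self._last.get(service, now)
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        self._last[service] = now
        if tokens >= 1.0:
            self._tokens[service] = tokens - 1.0
            return True
        self._tokens[service] = tokens
        return False
```

Because the buckets are keyed by service, a compromised or misbehaving service that emits 1,500 spans in an instant has 500 of them rejected while spans from other services continue to be accepted.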

Distributed Tracing Security

Trace Context Propagation Without PII

Distributed traces can be secured through the W3C Trace Context standard, which propagates trace context without exposing sensitive data. The traceparent header carries only a version, a trace ID, a parent span ID, and trace flags. No business data, user identifiers, or secrets are allowed (see Figure 1).
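Because the traceparent format is fixed (version, trace ID, parent span ID, and flags, all lowercase hex), a service can validate incoming headers strictly and reject anything that tries to carry extra data. The following sketch handles version 00 of the format; the function name is hypothetical.

```python
import re

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00)
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A header such as 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 parses cleanly, while a header with an appended user identifier fails the pattern and is discarded.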

Critical Rule Often Violated

Never put PII in baggage. Baggage is transmitted in HTTP headers across every service hop, creating multiple exposure opportunities through network monitoring, log files, and services that accidentally log baggage.
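One way to enforce this rule is to filter the baggage header against an allowlist of keys before forwarding a request. The sketch below is an assumption about how such a guard could look; the key names are hypothetical, and the parsing covers only the comma-separated key=value members of the header.

```python
# Hypothetical allowlist of baggage keys that are known to be free of PII.
ALLOWED_BAGGAGE_KEYS = {"tenant.tier", "request.priority"}


def filter_baggage(header: str) -> str:
    """Keep only allowlisted members of a W3C baggage header value."""
    kept = []
    for member in header.split(","):
        member = member.strip()
        if not member:
            continue
        # Each member is key=value, optionally followed by ;properties.
        key = member.split("=", 1)[0].strip()
        if key in ALLOWED_BAGGAGE_KEYS:
            kept.append(member)
    return ",".join(kept)
```

An allowlist is used instead of a denylist so that a new, unreviewed key is dropped by default instead of being propagated to every downstream service.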

Span Attribute Cleaning at Source

Span attributes should be cleaned before they are attached to a span, because the application can no longer modify a span once it has ended. Common mistakes that expose PII include capturing full URLs with authentication tokens in query parameters, adding database queries that contain customer names or account numbers, capturing HTTP headers with cookies or authorization tokens, and logging error messages that contain sensitive data submitted by users. Implementing filter logic at the application level removes or hashes sensitive data before spans are created.

Security-Aware Sampling Strategy

Reducing normal-operation traces by up to 90% supports the General Data Protection Regulation (GDPR) principle of data minimization, while security-relevant events retain 100% visibility.

The following sampling approach serves both performance and security by intelligently deciding which traces to keep based on their value.

Trace type	Sample rate	Rationale
Error spans	100%	Potential security incidents require full investigation
High-value transactions	100%	Fraud detection and compliance requirements
Authentication/authorization	100%	Security-critical paths need complete visibility
Normal operations	10-20%	Maintains statistical validity while minimizing data collection
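The table can be read as a decision function. The sketch below is one assumed way to express it: security-relevant traces are always kept, and normal traces are kept at a configurable rate using a deterministic value derived from the trace ID, so every component reaches the same decision for the same trace. The function and parameter names are hypothetical. Note that a rule based on errors implies the decision is made after the trace is complete, as in tail-based sampling.

```python
def keep_trace(
    trace_id: str,
    *,
    has_error: bool,
    high_value: bool,
    is_auth: bool,
    normal_rate: float = 0.10,
) -> bool:
    """Decide whether to keep a trace, following the sampling table."""
    # Errors, high-value transactions, and auth paths are always kept.
    if has_error or high_value or is_auth:
        return True
    # Map the low 64 bits of the 32-hex-character trace ID onto [0, 1].
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < normal_rate
```

Setting normal_rate anywhere between 0.10 and 0.20 reproduces the range in the last row of the table.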

Logging Security With Fluent Bit and Fluentd

Real-Time PII Masking

Application logs carry the highest risk of any telemetry type because they contain unstructured text that may include anything developers print. Masking PII in real time, before logs leave the pod, is the most critical security control in the entire observability stack. Scanning and masking take microseconds and add minimal overhead to log processing. If developers accidentally log sensitive data, it is caught before network transmission (see Figure 2).

Figure 2: Logging security enabled through two-stage DLP, Real-Time Masking in microseconds, TLS 1.2+ End-to-End, Rate Limiting, and Zero Log-Based PII Leaks
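The masking step itself is a sequence of regex substitutions applied to each log line. The article's pipeline performs this inside Fluent Bit; the Python sketch below is only a stand-in that shows the logic, and the three patterns and replacement labels are illustrative assumptions, not a complete PII rule set.

```python
import re

# Illustrative patterns only; a production rule set would be broader.
MASKS = [
    (re.compile(r"\b\d{13,19}\b"), "[CARD]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b(bearer)\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE), r"\1 [TOKEN]"),
]


def mask_line(line: str) -> str:
    """Replace card-like numbers, email addresses, and bearer tokens."""
    for pattern, replacement in MASKS:
        line = pattern.sub(replacement, line)
    return line
```

Because the patterns are compiled once and applied per line, the cost per record stays small, which is what makes masking before transmission practical at the edge.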

Security configuration, fluent-bit.conf

pipeline:
  inputs:
    - name: http
      port: 9999
      tls: on
      # tls.verify: off skips certificate validation; suitable only for testing with self-signed certificates
      tls.verify: off
      tls.cert_file: self_signed.crt
      tls.key_file: self_signed.key
  outputs:
    - name: forward
      match: '*'
      host: x.x.x.x
      port: 24224
      tls: on
      tls.verify: off
      tls.ca_file: '/etc/certs/fluent.crt'
      tls.vhost: 'fluent.example.com'

Security configuration, fluentd.conf

<transport tls>
  cert_path /root/cert.crt
  private_key_path /root/cert.key
  # client_cert_auth true requires clients such as Fluent Bit to present a certificate
  client_cert_auth true
  ca_cert_path /root/ca.crt
</transport>

Secondary DLP Layer

Fluentd provides secondary DLP scanning with a different set of regex patterns designed to catch what Fluent Bit missed. These cover private keys, newly identified PII patterns, other sensitive data, and context-based detection.

Encryption and Authentication for Log Transit

Logs in transit are protected with TLS 1.2 or higher and mutual authentication. Fluent Bit authenticates to Fluentd using certificates, and Fluentd authenticates to Splunk using tokens. This approach prevents network attacks that could capture logs in transit, man-in-the-middle attacks that could modify logs, and unauthorized log injection.

Rate Limiting as Attack Prevention

Preventing log flooding avoids both performance and security issues. An attacker who generates a massive volume of logs can hide malicious activity in the noise, consume all disk space and cause denial of service, overwhelm centralized log systems, or drive up cloud costs until logging is disabled to save money. Rate limiting at 10,000 logs per minute per namespace prevents these attacks.
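A per-namespace limit of this kind can be modeled with a fixed one-minute window counter. The sketch below is an assumed illustration of the logic, not the Fluent Bit or Fluentd implementation; the class name and the injectable clock are hypothetical.

```python
import time


class NamespaceLogLimiter:
    """Fixed-window counter: at most `limit_per_minute` records per namespace."""

    def __init__(self, limit_per_minute=10_000, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock
        self._windows = {}  # namespace -> (window index, records accepted)

    def allow(self, namespace: str) -> bool:
        """Return True if one more record from this namespace may pass."""
        window = int(self.clock() // 60)
        start, count = self._windows.get(namespace, (window, 0))
        if start != window:
            # A new minute has begun, so the counter resets.
            start, count = window, 0
        if count >= self.limit:
            self._windows[namespace] = (start, count)
            return False
        self._windows[namespace] = (start, count + 1)
        return True
```

Counting per namespace means a flood from one workload is dropped without silencing logs from the others, so malicious activity elsewhere stays visible.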

Security Comparison: Three Telemetry Types

Aspect	Metrics (OTel)	Traces (OTel)	Logs (Fluent Bit/Fluentd)
Primary Risk	PII in labels/attributes	PII in span attributes/baggage	Unstructured text with any PII
Authentication	mTLS with 30-day cert rotation	mTLS for trace export	TLS 1.2+ with mutual auth
PII Removal	3-stage: App -> Collector -> Backend	2-stage: App -> Backend DLP	3-stage: Fluent Bit -> Fluentd -> Backend
Data Minimization	95% volume reduction via filtering	80-90% via smart sampling	Rate limiting + filtering
Attack Prevention	Resource limits (memory, rate, connections)	Immutable spans + sampling	Rate limiting + buffer encryption
Compliance Feature	Allowlist-based metric forwarding	100% sampling for security events	Real-time regex-based masking
Key Control	Attribute processor in collector	Cleaning before span creation	Lua scripts in sidecar

Key Outcomes

Secured open-source observability across distributed retail edge locations
Achieved full Payment Card Industry Data Security Standard (PCI DSS) and GDPR compliance
Reduced bandwidth consumption by 96%
Minimized attack surface while maintaining complete visibility

Conclusion

Securing a CNCF-based observability framework at the retail edge is both achievable and essential. By implementing comprehensive security across OTel metrics, distributed tracing, and Fluent Bit/Fluentd logging, organizations can minimize their exposure to security incidents while maintaining complete visibility across distributed locations.