A Framework for Securing Open-Source Observability at the Edge

The Edge Observability Security Challenge 

Deploying an open-source observability solution to distributed retail edge locations creates a fundamental security challenge. With thousands of locations processing sensitive data like payments and customers’ personally identifiable information (PII), every telemetry component running on the edge becomes a potential entry point for attackers. Edge environments operate in spaces where there is limited physical security, bandwidth constraints shared with business-critical application traffic, and no technical staff on-site for incident response. 

Traditional centralized monitoring security models therefore do not fit these conditions: they assume abundant resources, dedicated security teams, and controlled physical environments, none of which exist at the edge.

This article explores how to secure an OpenTelemetry (OTel) based observability framework from the Cloud Native Computing Foundation (CNCF). It covers metrics, distributed tracing, and logging through Fluent Bit and Fluentd.  

Securing OTel Metrics

Mutual Transport Layer Security (mTLS)

Metrics are secured through mutual TLS (mTLS) authentication, in which both the client and the server must prove their identity with certificates before communication is established. This ensures that only trusted systems exchange data. Unlike traditional Prometheus deployments that expose unauthenticated Hypertext Transfer Protocol (HTTP) endpoints for every service, OTel’s push model allows us to require mTLS for all connections to the collector (see Figure 1).

Figure 1: OpenTelemetry security architecture. Multi-stage security through PII removal, mTLS communication, and 95% volume reduction

Security configuration, otel-config.yaml 

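receivers:
  # TLS only: the collector presents server.crt; clients are not authenticated.
  otlp:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          cert_file: server.crt
          key_file: server.key
  # mTLS: client_ca_file makes the collector require and verify client certificates.
  # Both receivers bind the same endpoint, so enable only one of them in a pipeline.
  otlp/mtls:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          client_ca_file: client.pem
          cert_file: server.crt
          key_file: server.key

exporters:
  # Client side of mTLS: verify the server against ca.crt and present client.crt.
  otlp:
    endpoint: myserver.local:55690
    tls:
      ca_file: ca.crt
      cert_file: client.crt
      key_file: client-tss2.key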

Multi-Stage PII Removal for Metrics 

Metrics often capture sensitive data by accident through labels and attributes. A customer identifier (ID) in a label, or a credit card number in a database query attribute, can turn compliant metrics into a regulatory violation. Multi-stage PII removal addresses this problem with defense in depth at the data level.

Stage 1: Application-level filtering.

The first stage happens at the application level, where developers use OTel Software Development Kit (SDK) instrumentation that hashes user identifiers with the SHA-256 algorithm before creating metrics. Uniform Resource Locators (URLs) are scanned to remove query parameters such as tokens and session IDs before they become span attributes.
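A minimal sketch of this stage in Python, assuming the cleaning runs in application code before any value is handed to the SDK. The function names and the parameter deny-list are hypothetical and not part of the OTel SDK.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list of query parameters; adjust it to the application.
SENSITIVE_PARAMS = {"token", "session_id", "access_token", "api_key"}


def hash_user_id(user_id: str) -> str:
    """Return the SHA-256 hex digest of a user identifier."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def scrub_url(url: str) -> str:
    """Drop sensitive query parameters and keep the rest of the URL intact."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

The hashed ID and the scrubbed URL, rather than the raw values, would then be used as metric labels or span attributes.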

Stage 2: Collector-level processing.

The second stage occurs in the OTel Collector’s attributes processor. It implements three patterns: complete deletion for high-risk PII, one-way hashing of identifiers using SHA-256 with a cryptographic salt, and regex-based cleanup of complex data.
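The three patterns can be illustrated outside the collector with a short Python sketch. This is not collector configuration. The attribute names, the email pattern, and the way the salt is supplied are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical rule sets; the attribute names are illustrative only.
DELETE_KEYS = {"card.number", "user.ssn"}
HASH_KEYS = {"user.id", "customer.id"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_attributes(attrs: dict, salt: bytes) -> dict:
    """Apply delete, salted-hash, and regex rules to one attribute map."""
    cleaned = {}
    for key, value in attrs.items():
        if key in DELETE_KEYS:
            continue  # pattern 1: complete deletion of high-risk PII
        if key in HASH_KEYS:
            # pattern 2: one-way SHA-256 hash of salt + value
            digest = hashlib.sha256(salt + str(value).encode("utf-8"))
            cleaned[key] = digest.hexdigest()
        elif isinstance(value, str):
            # pattern 3: regex cleanup of free-form strings
            cleaned[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            cleaned[key] = value
    return cleaned
```

In practice the salt would be loaded from a secret store rather than hard-coded, so that hashed identifiers cannot be recomputed by anyone who sees the telemetry.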

Stage 3: Backend-level scanning.

The third stage provides backend-level scanning where centralized systems perform data loss prevention (DLP) scanning to detect any PII that reached storage, triggering alerts for immediate remediation. When the backend scanner detects PII, it generates an alert indicating the edge filters need updating, creating a feedback loop that continuously improves protection.  

Aggressive Metric Filtering 

Security is not only about encryption and authentication, but also about removing unnecessary data. Transmitting less data reduces the attack surface, minimizes exposure windows, and makes anomaly detection easier. Hundreds of metrics may be available out of the box, but filtering and forwarding only the needed ones can reduce metric volume by up to 95%. This saves compute resources and network bandwidth and eases management bottlenecks.
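An allowlist is one way to express "forward only what is needed". The sketch below assumes metrics arrive as dictionaries with a `name` key. The metric names and the wildcard patterns are examples, not values from the deployment described here.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist; anything that matches no pattern is dropped.
ALLOWED_METRICS = (
    "http.server.duration",
    "pos.transaction.*",
    "system.cpu.utilization",
)


def is_allowed(name: str) -> bool:
    """Return True when the metric name matches an allowlist pattern."""
    return any(fnmatchcase(name, pattern) for pattern in ALLOWED_METRICS)


def filter_metrics(metrics: list[dict]) -> list[dict]:
    """Keep only allowlisted metrics, preserving their order."""
    return [metric for metric in metrics if is_allowed(metric["name"])]
```

Defaulting to "drop" means a newly introduced metric is not forwarded until someone reviews it and adds it to the list.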

Resource Limits as Security Controls 

The OTel Collector sets strict resource limits that prevent denial-of-service attacks. 

| Resource | Limit | Protection against |
| --- | --- | --- |
| Memory | 500 MB hard cap | Out-of-memory attacks |
| Rate limiting | 1,000 spans/sec/service | Telemetry flooding attacks |
| Connections | 100 concurrent streams | Connection exhaustion |

These limits ensure that even when an attack happens, the collector maintains stable operation and continues to collect required telemetry from applications. 
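The per-service span limit can be pictured as a token bucket. The following Python sketch illustrates the idea only. It is not how the collector implements its limits, and the class name and the injectable clock are assumptions made so the behavior is easy to test.

```python
import time
from collections import defaultdict


class SpanRateLimiter:
    """Token bucket per service: refills `rate` tokens per second up to `burst`."""

    def __init__(self, rate: float = 1000.0, burst: float = 1000.0,
                 clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._tokens = defaultdict(lambda: burst)  # each service starts full
        self._last = {}  # last time each service was seen

    def allow(self, service: str) -> bool:
        """Return True if one more span from `service` may pass right now."""
        now = self.clock()
        elapsed = now - self._last.get(service, now)
        self._last[service] = now
        tokens = min(self.burst, self._tokens[service] + elapsed * self.rate)
        if tokens >= 1.0:
            self._tokens[service] = tokens - 1.0
            return True
        self._tokens[service] = tokens
        return False
```

Because each service has its own bucket, one flooding service exhausts only its own budget and the others keep reporting.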

Distributed Tracing Security 

Trace Context Propagation Without PII 

Distributed traces can be secured through the W3C Trace Context standard, which propagates trace context without exposing sensitive data. The traceparent header carries only a version, a trace ID, a parent span ID, and trace flags. No business data, user identifiers, or secrets are allowed (see Figure 1).
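To make the header format concrete, here is a small Python sketch that builds and validates a version `00` traceparent value (`version-traceid-parentid-flags`, all lowercase hex). The helper names are hypothetical. A real service would normally rely on the OTel SDK's propagator instead of hand-rolling this.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value from random IDs; nothing else is embedded."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def is_valid_traceparent(header: str) -> bool:
    """Accept only well-formed values with non-zero trace and parent IDs."""
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        return False
    trace_id, span_id, _flags = match.groups()
    return trace_id != "0" * 32 and span_id != "0" * 16
```

Rejecting anything that does not match the fixed layout also rejects attempts to append extra data, such as a user name, to the header.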

Critical Rule Often Violated 

Never put PII in baggage. Baggage is transmitted in HTTP headers across every service hop, creating multiple exposure opportunities through network monitoring, log files, and services that accidentally log baggage. 

Span Attribute Cleaning at Source 

Span attributes must be cleaned before span creation because they are immutable once created. Common mistakes that expose PII include capturing full URLs with authentication tokens in query parameters, adding database queries containing customer names or account numbers, capturing HTTP headers with cookies or authorization tokens, and logging error messages with sensitive data that users submitted. Implementing filter logic at the application level removes or hashes sensitive data before spans are created.  
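Two of the mistakes listed above, raw HTTP headers and database queries with literal values, can be handled with small helpers that run before attributes are attached to a span. This is a sketch under assumptions: the header list and the literal-matching regex are illustrative and will not cover every SQL dialect.

```python
import re

# Headers whose values should never reach a span attribute (illustrative list).
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "proxy-authorization"}

# Matches single-quoted string literals and standalone integers.
SQL_LITERAL_RE = re.compile(r"'(?:[^']|'')*'|\b\d+\b")


def safe_headers(headers: dict) -> dict:
    """Replace the values of sensitive headers with a fixed marker."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def safe_statement(sql: str) -> str:
    """Replace string and numeric literals with '?' so only the query shape remains."""
    return SQL_LITERAL_RE.sub("?", sql)
```

For example, `safe_statement("SELECT * FROM accounts WHERE name = 'Ada Lovelace' AND id = 1042")` yields `SELECT * FROM accounts WHERE name = ? AND id = ?`, which still identifies the slow query without naming the customer.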

Security-Aware Sampling Strategy 

Sampling away 80-90% of normal-operation traces supports the General Data Protection Regulation (GDPR) principle of data minimization while maintaining 100% visibility for security-relevant events.

The following sampling approach serves both performance and security by intelligently deciding which traces to keep based on their value. 

| Trace type | Sample rate | Rationale |
| --- | --- | --- |
| Error spans | 100% | Potential security incidents require full investigation |
| High-value transactions | 100% | Fraud detection and compliance requirements |
| Authentication/authorization | 100% | Security-critical paths need complete visibility |
| Normal operations | 10-20% | Maintains statistical validity while minimizing data collection |
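The table can be read as a decision function. The Python sketch below assumes each finished trace is summarized as a dictionary with hypothetical `error`, `amount`, and `category` keys, and that the trace ID is a hex string. The high-value threshold is invented for the example. Keeping every error implies the decision is made after the trace completes, which is tail-based sampling.

```python
NORMAL_SAMPLE_RATE = 0.10      # lower end of the 10-20% range in the table
HIGH_VALUE_THRESHOLD = 500.0   # hypothetical cutoff for "high-value"
SECURITY_CATEGORIES = {"authentication", "authorization"}


def keep_probability(trace: dict) -> float:
    """Map a trace summary to the sample rate from the table."""
    if trace.get("error"):
        return 1.0
    if trace.get("amount", 0.0) >= HIGH_VALUE_THRESHOLD:
        return 1.0
    if trace.get("category") in SECURITY_CATEGORIES:
        return 1.0
    return NORMAL_SAMPLE_RATE


def should_keep(trace_id: str, trace: dict) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    # Map the last 8 hex characters of the trace ID onto [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x100000000
    return bucket < keep_probability(trace)
```

Deriving the decision from the trace ID rather than a random draw means every collector that sees the same trace reaches the same verdict.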

Logging Security With Fluent Bit and Fluentd 

Real-Time PII Masking 

Application logs carry the highest risk of any telemetry type because they contain unstructured text that may include anything developers print. Real-time masking of PII before logs leave the pod is the most critical security control in the entire observability stack. The scanning and masking happen in microseconds, adding minimal overhead to log processing. If developers accidentally log sensitive data, it’s caught before network transmission (see Figure 2).

Figure 2: Fluent Bit and Fluentd security architecture. Logging security enabled through two-stage DLP, real-time masking in microseconds, TLS 1.2+ end-to-end, rate limiting, and zero log-based PII leaks
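The masking logic itself is a sequence of regex substitutions applied to each log line. The sketch below shows that logic in Python for readability. It is not the sidecar filter used in this deployment, and the three patterns are simplified assumptions that would need tuning against real log formats.

```python
import re

# Illustrative patterns only; order matters because each pass sees the
# output of the previous one.
MASKS = [
    # 13 to 16 digits, optionally separated by single spaces or hyphens
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # US Social Security number layout
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def mask_line(line: str) -> str:
    """Return the log line with every matching PII pattern replaced."""
    for pattern, replacement in MASKS:
        line = pattern.sub(replacement, line)
    return line
```

A line such as `card 4111 1111 1111 1111 declined for jane.doe@example.com` becomes `card [CARD] declined for [EMAIL]`. Broad patterns like the card rule can also mask long numeric IDs, which is usually an acceptable trade-off at this stage.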

Security configuration, fluent-bit.conf 

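pipeline:
  inputs:
    - name: http
      port: 9999
      tls: on
      # tls.verify: off skips peer certificate validation; suitable only for
      # testing with self-signed certificates.
      tls.verify: off
      tls.cert_file: self_signed.crt
      tls.key_file: self_signed.key

  outputs:
    - name: forward
      match: '*'
      host: x.x.x.x
      port: 24224
      tls: on
      # With verification off, the CA file below is not used to validate Fluentd.
      tls.verify: off
      tls.ca_file: '/etc/certs/fluent.crt'
      tls.vhost: 'fluent.example.com'

Security configuration, fluentd.conf

<transport tls>
  cert_path /root/cert.crt
  private_key_path /root/cert.key
  # Require and verify client certificates against ca_cert_path.
  client_cert_auth true
  ca_cert_path /root/ca.crt
</transport>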

Secondary DLP Layer 

Fluentd provides secondary DLP scanning with different regex patterns designed to catch what Fluent Bit missed. These patterns target private keys, newly identified PII formats, other sensitive data, and context-based detection.

Encryption and Authentication for Log Transit 

Logs in transit are secured with TLS 1.2 or higher and mutual authentication. Fluent Bit authenticates to Fluentd using certificates, and Fluentd authenticates to Splunk using tokens. This approach prevents network attacks that could capture logs in transit, man-in-the-middle attacks that could modify logs, and unauthorized log injection.

Rate Limiting as Attack Prevention 

Preventing log flooding avoids both performance and security issues. An attacker generating a massive volume of logs can hide malicious activity in the noise, consume all disk space and cause denial of service, overwhelm centralized log systems, or increase cloud costs until logging is disabled to save money. Rate limiting at 10,000 logs per minute per namespace prevents these attacks.

Security Comparison: Three Telemetry Types 

| Aspect | Metrics (OTel) | Traces (OTel) | Logs (Fluent Bit/Fluentd) |
| --- | --- | --- | --- |
| Primary Risk | PII in labels/attributes | PII in span attributes/baggage | Unstructured text with any PII |
| Authentication | mTLS with 30-day cert rotation | mTLS for trace export | TLS 1.2+ with mutual auth |
| PII Removal | 3-stage: App -> Collector -> Backend | 2-stage: App -> Backend DLP | 3-stage: Fluent Bit -> Fluentd -> Backend |
| Data Minimization | 95% volume reduction via filtering | 80-90% via smart sampling | Rate limiting + filtering |
| Attack Prevention | Resource limits (memory, rate, connections) | Immutable spans + sampling | Rate limiting + buffer encryption |
| Compliance Feature | Allowlist-based metric forwarding | 100% sampling for security events | Real-time regex-based masking |
| Key Control | Attribute processor in collector | Cleaning before span creation | Lua scripts in sidecar |

 Key Outcomes 

  • Secured open-source observability across distributed retail edge locations
  • Achieved full Payment Card Industry Data Security Standard (PCI DSS) and GDPR compliance
  • Reduced bandwidth consumption by 96% 
  • Minimized attack surface while maintaining complete visibility 

Conclusion 

Securing a Cloud Native Computing Foundation-based observability framework at the retail edge is both achievable and essential. By implementing comprehensive security across OTel metrics, distributed tracing, and Fluent Bit/Fluentd logging, organizations can achieve zero security incidents while maintaining complete visibility across distributed locations.
