Self Hosted Grafana OTel Stack

How to set up self-hosted OpenTelemetry using Grafana, Alloy, Loki, Tempo and Prometheus.

In past articles, I’ve shown how to set up OpenTelemetry in various programming languages and frameworks. But what is all of this worth if you have nowhere to send your telemetry data to? In this article, I will show you how to set up a very basic OpenTelemetry stack using Grafana Open Source products to test and experiment with OpenTelemetry at home or in other self-hosted environments. One important disclaimer though: This setup is by no means production-ready, since it is missing persistence, high availability, and security hardening. It is purely meant for learning and experimenting with OpenTelemetry.

Bill of Materials

OpenTelemetry consists of three types of data: Logs, Metrics and Traces. Each of these benefit from specialized storage solutions allowing efficient storage and querying of the data.

Metrics: For storing metrics we will use Prometheus, which is a very popular open-source monitoring and alerting toolkit. It is widely used in the industry and has great integration with OpenTelemetry. Particularly as a homelabber you might already be familiar with Prometheus as it is often used for monitoring Kubernetes clusters and other infrastructure.
Logs: For logs we will use Loki, which is a log aggregation system designed to be cost-effective and easy to operate. It is optimized for storing and querying logs from applications and infrastructure.
Traces: For traces we will use Tempo, which is a distributed tracing backend that is designed to be easy to operate and scale. It is optimized for storing and querying traces from applications and services.

Apart from these storage backends, we also need an ingestion point to receive the OpenTelemetry data and a tool to visualize and analyze the data. Fortunately, Grafana Labs also provides solutions for both of these needs:

Collector: For receiving the OpenTelemetry data we will use Grafana Alloy, which exposes an OTLP endpoint for ingesting OpenTelemetry data and can process and forward the data to the matching storage backends.
Visualization: For visualizing and analyzing the data we will use the highly popular Grafana, which is a powerful and flexible dashboarding tool that integrates with all of the above storage backends.

The components will then be arranged like this:

Grafana OTel Stack

Setting up Prometheus

Setting up Prometheus is fairly straightforward, as in this case we don’t need any of the scraping capabilities of Prometheus, since all data will be pushed to it via the OpenTelemetry Collector.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: otel
  labels:
    app: prometheus
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v3.9.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--storage.tsdb.retention.time=24h"
            - "--web.enable-remote-write-receiver"
            - "--enable-feature=native-histograms"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf

        - name: prometheus-storage-volume
          emptyDir: {}

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: otel
data:
  prometheus.yml: |-
    global:
      scrape_interval: 0s

    scrape_configs: []

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: otel
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090

The important part here is the --web.enable-remote-write-receiver flag, which enables Prometheus to receive data via the remote write protocol, which is what the OpenTelemetry Collector will use to send metrics data to Prometheus.

Other than that we set the path to our configuration file and the storage path. For simplicity we use an emptyDir volume for storage, but in a real setup you would want to use a persistent volume.

I’ve limited the retention time to 24 hours for this example, but you can adjust this as needed.

The feature flag --enable-feature=native-histograms is required to enable histogram support in Prometheus, which is needed for some OpenTelemetry metrics.

The configuration file for Prometheus is very basic since we don’t need to scrape any targets. In fact, we can disable scraping altogether by setting the scrape_interval to 0s.

Setting up Loki

Setting up a single-instance Loki requires some configuration that disables certain replication and high availability features, which are not needed in this simple setup.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      volumes:
        - name: loki-config
          configMap:
            name: loki
        - name: loki-storage
          emptyDir: {}
      containers:
        - name: loki
          image: grafana/loki:3.6.4
          args:
            - -config.file=/etc/loki/loki.yaml
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "256Mi"
              cpu: "100m"
          volumeMounts:
            - name: loki-config
              mountPath: /etc/loki
            - name: loki-storage
              mountPath: /loki
          livenessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 45
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          ports:
            - containerPort: 3100
              name: http
              protocol: TCP
            - containerPort: 9096
              name: grpc
              protocol: TCP

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki
  namespace: otel
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100
      grpc_listen_port: 9096
      log_level: info

    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory

    schema_config:
      configs:
        - from: 2024-01-01
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h

    limits_config:
      retention_period: 168h  # 7 days
      max_query_lookback: 168h

    compactor:
      working_directory: /loki/compactor
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      delete_request_store: filesystem

    analytics:
      reporting_enabled: false

apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: otel
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 3100
      protocol: TCP
      targetPort: 3100
    - name: grpc
      port: 9096
      protocol: TCP
      targetPort: 9096
  selector:
    app: loki

Again, we are mounting an emptyDir volume for storage, but in a real setup you would want to use a persistent volume.

I won’t go into too much detail about the configuration here, but the important parts are that we disable authentication, set the storage backend to filesystem, and configure retention settings. The service exposes both the HTTP and gRPC ports.

Setting up Tempo

Setting up Tempo is similar to Loki; we need to provide a configuration that disables high availability features and sets the storage backend to filesystem.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      volumes:
        - name: tempo-config
          configMap:
            name: tempo
        - name: tempo-storage
          emptyDir: {}
      containers:
        - name: tempo
          image: grafana/tempo:2.9.1
          args:
            - -config.file=/conf/tempo.yaml
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          volumeMounts:
            - name: tempo-config
              mountPath: /conf
            - name: tempo-storage
              mountPath: /var/tempo
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 20
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          ports:
            - containerPort: 3100
              name: prom-metrics
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http

apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo
  namespace: otel
data:
  overrides.yaml: |
    overrides:
      {}
  tempo.yaml: |
    multitenancy_enabled: false
    stream_over_http_enabled: true
    usage_report:
      reporting_enabled: false
    compactor:
      compaction:
        block_retention: 24h
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
    ingester:
      {}
    server:
      http_listen_port: 3100
    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    metrics_generator:
      registry:
        collection_interval: 5s
        external_labels:
          source: tempo
          cluster: k3s-home
      storage:
        path: /var/tempo/generator/wal
        remote_write:
          - url: http://prometheus:9090/api/v1/write
            send_exemplars: true
      processor:
        span_metrics:
          dimensions:
            - http.method
            - http.target
            - http.status_code
            - service.version
        service_graphs:
          dimensions:
            - http.method
            - http.target
            - http.status_code
            - service.version
          enable_virtual_node_label: true
          peer_attributes:
            - peer.service
            - db.name
            - db.system
            - db.system.name
            - server.address
        local_blocks:
          # Required for TraceQL queries with | rate()
          # Allows metrics-generator to read from completed blocks
          flush_to_storage: true
      traces_storage:
        path: /var/tempo/generator/traces
    querier:
          {}
    query_frontend:
          {}
    overrides:
      defaults:
        metrics_generator:
          processors: [service-graphs, span-metrics, local-blocks] # enables metrics generator
          generate_native_histograms: both

apiVersion: v1
kind: Service
metadata:
  name: tempo
  namespace: otel
spec:
  type: ClusterIP
  ports:
    - name: tempo-prom-metrics
      port: 3100
      protocol: TCP
      targetPort: 3100
    - name: grpc-tempo-otlp
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: tempo-otlp-http
      port: 4318
      protocol: TCP
      targetPort: 4318
  selector:
    app: tempo

Again, we are limiting the retention time to 24 hours. Additionally, we enable the metrics generator, which allows us to generate metrics from trace data, which is needed for the best experience in Grafana. These metrics are generated based on the spans ingested into Tempo and directly written to Prometheus via remote write.

Setting up Grafana Alloy

With all storage backends in place, we can now set up the OpenTelemetry Collector using Grafana Alloy. The collector will receive all OpenTelemetry data and forward it to the matching storage backends.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alloy
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      labels:
        app: alloy
    spec:
      volumes:
        - name: config
          configMap:
            name: alloy
      containers:
        - name: alloy
          image: grafana/alloy:v1.12.2
          args:
            - run
            - /etc/alloy/config.alloy
            - --storage.path=/tmp/alloy
            - --server.http.listen-addr=0.0.0.0:12345
            - --server.http.ui-path-prefix=/
            - --stability.level=generally-available
          resources:
            limits:
              memory: "256Mi"
              cpu: "200m"
            requests:
              memory: "128Mi"
              cpu: "100m"
          ports:
            - containerPort: 12345
              name: http
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 12345
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
          volumeMounts:
            - name: config
              mountPath: /etc/alloy

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy
  namespace: otel
data:
  config.alloy: |
    logging {
      level  = "info"
      format = "logfmt"
    }

    // ========================================
    // OTEL - Traces, Metrics & Logs
    // ========================================
    otelcol.receiver.otlp "otlp" {
      grpc {
        endpoint = "0.0.0.0:4317"
      }

      http {
        endpoint = "0.0.0.0:4318"
      }

      output {
        traces = [otelcol.exporter.otlp.tempo.input]
        metrics = [otelcol.exporter.prometheus.prom.input]
        logs = [otelcol.exporter.loki.loki.input]
      }
    }

    // ========================================
    // Metric and Trace Exporters
    // ========================================
    // Note: Span metrics and service graphs are generated in Tempo's
    // metrics-generator for better integration with TraceQL queries

    otelcol.exporter.prometheus "prom" {
      forward_to = [prometheus.remote_write.default.receiver]
    }

    prometheus.remote_write "default" {
      endpoint {
        url = "http://prometheus:9090/api/v1/write"
      }
    }

    otelcol.exporter.otlp "tempo" {
      client {
        endpoint = "tempo:4317"
        tls {
          insecure = true
        }
      }
    }

    // ========================================
    // Logs - OTLP to Loki
    // ========================================
    otelcol.exporter.loki "loki" {
      forward_to = [loki.write.loki.receiver]
    }

    loki.write "loki" {
      endpoint {
        url = "http://loki:3100/loki/api/v1/push"
      }
    }

apiVersion: v1
kind: Service
metadata:
  name: alloy
  namespace: otel
spec:
  ports:
    - name: http
      port: 12345
      targetPort: 12345
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
  selector:
    app: alloy

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alloy
  namespace: otel
  labels:
    name: alloy
spec:
  rules:
    - host: alloy.example.com
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: alloy
                port:
                  number: 12345
    - host: otlp.alloy.example.com
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: alloy
                port:
                  number: 4318

Since Alloy is just the collector, it doesn’t need any persistent storage, so we can skip that here.

As you can see in the configuration, it pretty much describes a graph of how data flows through the collector. We define an OTLP receiver that listens on both gRPC and HTTP ports for incoming OpenTelemetry data. The received traces, metrics, and logs are then forwarded to the respective exporters, which send the data to Tempo, Prometheus, and Loki. The ingress configuration allows receiving OTLP traffic from outside the cluster.

Configuring Grafana to use the data sources

With all components set up, we can now configure Grafana to use the data sources. I’ll skip explaining how to deploy Grafana itself and assume we already have a running Grafana instance.

In order to explore our OpenTelemetry data, we need to add the data sources for Prometheus, Loki, and Tempo in Grafana.

Adding Prometheus Data Source

Adding the Prometheus data source is straightforward. In Grafana, navigate to Connections > Data Sources and click on Add data source. Select Prometheus from the list and configure it as follows:

Prometheus connection setting

All other settings can be left on default.

Adding Loki Data Source

When adding Loki as a data source we also want to configure some correlation settings.

Loki connection setting Loki correlation settings

Adding Tempo Data Source

For Tempo, it’s just the same: we configure the URL and some correlation settings as well, so we can jump from traces to logs and back. It’s advisable to set some span start and end time shift so that any clock skew doesn’t kill the correlation:

Tempo connection setting Tempo correlation settings

Harvesting the fruits

With all data sources configured, we can now start exploring our OpenTelemetry data in Grafana. We can create dashboards that visualize traces, metrics, and logs all in one place. The correlation settings we configured earlier will allow us to easily jump between related data points, making it much easier to troubleshoot issues and gain insights into our applications.

Grafana’s Drilldown feature gives us easy entry points into our OpenTelemetry data. With the metrics-generator we configured for Tempo, we can now also see metrics and statistics derived from traces directly in Grafana, as well as Service Maps.

Grafana Screenshot of Drilldown