
Self Hosted Grafana OTel Stack

How to set up self-hosted OpenTelemetry using Grafana, Alloy, Loki, Tempo and Prometheus.

In past articles, I’ve shown how to set up OpenTelemetry in various programming languages and frameworks. But what is all of that worth if you have nowhere to send your telemetry data? In this article, I will show you how to set up a very basic OpenTelemetry stack using Grafana open-source products to test and experiment with OpenTelemetry at home or in other self-hosted environments. One important disclaimer, though: this setup is by no means production-ready, since it is missing persistence, high availability, and security hardening. It is purely meant for learning and experimenting with OpenTelemetry.

Bill of Materials

OpenTelemetry deals with three types of telemetry data: Logs, Metrics, and Traces. Each of these benefits from a specialized storage solution that allows efficient storage and querying of the data.

  • Metrics: For storing metrics we will use Prometheus, a very popular open-source monitoring and alerting toolkit. It is widely used in the industry and has great integration with OpenTelemetry. Especially as a homelabber, you might already be familiar with Prometheus, as it is often used to monitor Kubernetes clusters and other infrastructure.
  • Logs: For logs we will use Loki, which is a log aggregation system designed to be cost-effective and easy to operate. It is optimized for storing and querying logs from applications and infrastructure.
  • Traces: For traces we will use Tempo, which is a distributed tracing backend that is designed to be easy to operate and scale. It is optimized for storing and querying traces from applications and services.

Apart from these storage backends, we also need an ingestion point to receive the OpenTelemetry data and a tool to visualize and analyze the data. Fortunately, Grafana Labs also provides solutions for both of these needs:

  • Collector: For receiving the OpenTelemetry data we will use Grafana Alloy, which exposes an OTLP endpoint for ingesting OpenTelemetry data and can process and forward the data to the matching storage backends.
  • Visualization: For visualizing and analyzing the data we will use the highly popular Grafana, which is a powerful and flexible dashboarding tool that integrates with all of the above storage backends.

The components will then be arranged like this:

[Diagram: Grafana OTel Stack architecture]

Setting up Prometheus

Setting up Prometheus is fairly straightforward: in this case we don’t need any of its scraping capabilities, since all data will be pushed to it by the OpenTelemetry Collector.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: otel
  labels:
    app: prometheus
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v3.9.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--storage.tsdb.retention.time=24h"
            - "--web.enable-remote-write-receiver"
            - "--enable-feature=native-histograms"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir: {}

The important part here is the --web.enable-remote-write-receiver flag, which enables Prometheus to accept metrics pushed via the remote write protocol. This is how the OpenTelemetry Collector will send metrics data to Prometheus.

Other than that, we set the path to our configuration file and the storage path. For simplicity, we use an emptyDir volume for storage, but in a real setup you would want to use a persistent volume.
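If you do want the data to survive pod restarts, the emptyDir can be swapped for a PersistentVolumeClaim. A minimal sketch, where the claim name and size are assumptions to adapt:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage # hypothetical name, reference it from the Deployment's volumes
  namespace: otel
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi # assumed size, adjust to your retention needs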

I’ve limited the retention time to 24 hours for this example, but you can adjust this as needed.

The feature flag --enable-feature=native-histograms enables native histogram support in Prometheus, which is needed to store some OpenTelemetry metrics, such as exponential histograms.

The configuration file for Prometheus is very basic since we don’t need to scrape any targets. In fact, we can leave out the scrape_configs section entirely, so Prometheus scrapes nothing and only receives data through the remote write endpoint.
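A minimal sketch of what the prometheus.yml in the prometheus-server-conf ConfigMap can look like (retention and storage are already configured via the command-line flags above):

global:
  scrape_interval: 1m # effectively unused, since no scrape_configs are defined

# No scrape_configs section: all metrics arrive via the remote write receiver.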

Setting up Loki

Setting up a single-instance Loki requires some configuration that disables certain replication and high availability features, which are not needed in this simple setup.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      volumes:
        - name: loki-config
          configMap:
            name: loki
        - name: loki-storage
          emptyDir: {}
      containers:
        - name: loki
          image: grafana/loki:3.6.4
          args:
            - -config.file=/etc/loki/loki.yaml
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "256Mi"
              cpu: "100m"
          volumeMounts:
            - name: loki-config
              mountPath: /etc/loki
            - name: loki-storage
              mountPath: /loki
          livenessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 45
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          ports:
            - containerPort: 3100
              name: http
              protocol: TCP
            - containerPort: 9096
              name: grpc
              protocol: TCP

Again, we are mounting an emptyDir volume for storage, but in a real setup you would want to use a persistent volume.

I won’t go into too much detail about the configuration here, but the important parts are that we disable authentication, set the storage backend to filesystem, and configure retention settings. The accompanying Kubernetes Service (not shown) exposes both the HTTP port (3100) and the gRPC port (9096).
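For reference, here is a sketch of what a single-instance loki.yaml along those lines might look like. The directory paths and the schema start date are assumptions you should adapt to your environment:

auth_enabled: false # disable multi-tenancy/authentication

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /loki
  replication_factor: 1 # single instance, no replication
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  ring:
    kvstore:
      store: inmemory # no external ring store needed for a single instance

schema_config:
  configs:
    - from: 2024-01-01 # assumed start date
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 24h # matches the 24h retention used elsewhere in this setup

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem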

Setting up Tempo

Setting up Tempo is similar to Loki; we need to provide a configuration that disables high availability features and sets the storage backend to filesystem.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      volumes:
        - name: tempo-config
          configMap:
            name: tempo
        - name: tempo-storage
          emptyDir: {}
      containers:
        - name: tempo
          image: grafana/tempo:2.9.1
          args:
            - -config.file=/conf/tempo.yaml
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          volumeMounts:
            - name: tempo-config
              mountPath: /conf
            - name: tempo-storage
              mountPath: /var/tempo
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /ready
              port: 3100
            initialDelaySeconds: 20
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          ports:
            - containerPort: 3100
              name: prom-metrics
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http

Again, we are limiting the retention time to 24 hours. Additionally, we enable the metrics generator, which derives metrics from the spans ingested into Tempo and writes them directly to Prometheus via remote write. These trace-derived metrics are needed for the best experience in Grafana, for example for service maps.
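A sketch of a matching tempo.yaml; the Prometheus remote write URL is an assumption based on the service names used in this article:

server:
  http_listen_port: 3100

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"

storage:
  trace:
    backend: local # filesystem storage
    local:
      path: /var/tempo/blocks
    wal:
      path: /var/tempo/wal

compactor:
  compaction:
    block_retention: 24h # limit trace retention to 24 hours

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus.otel.svc.cluster.local:9090/api/v1/write # assumed service name
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics] # enable metrics generation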

Setting up Grafana Alloy

With all storage backends in place, we can now set up the OpenTelemetry Collector using Grafana Alloy. The collector will receive all OpenTelemetry data and forward it to the matching storage backends.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alloy
  namespace: otel
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      labels:
        app: alloy
    spec:
      volumes:
        - name: config
          configMap:
            name: alloy
      containers:
        - name: alloy
          image: grafana/alloy:v1.12.2
          args:
            - run
            - /etc/alloy/config.alloy
            - --storage.path=/tmp/alloy
            - --server.http.listen-addr=0.0.0.0:12345
            - --server.http.ui-path-prefix=/
            - --stability.level=generally-available
          resources:
            limits:
              memory: "256Mi"
              cpu: "200m"
            requests:
              memory: "128Mi"
              cpu: "100m"
          ports:
            - containerPort: 12345
              name: http
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 12345
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
          volumeMounts:
            - name: config
              mountPath: /etc/alloy

Since Alloy is just the collector, it doesn’t need any persistent storage, so we can skip that here.

The Alloy configuration pretty much describes a graph of how data flows through the collector: an OTLP receiver listens on both gRPC and HTTP ports for incoming OpenTelemetry data, and the received traces, metrics, and logs are forwarded to the respective exporters, which send the data on to Tempo, Prometheus, and Loki. An additional Ingress (not shown here) allows receiving OTLP traffic from outside the cluster.
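A sketch of what such a config.alloy in the alloy ConfigMap could look like; the in-cluster service URLs are assumptions based on the names used above:

// Receive OTLP data on gRPC (4317) and HTTP (4318).
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Convert OTLP metrics and push them to Prometheus via remote write.
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus.otel.svc.cluster.local:9090/api/v1/write" // assumed service URL
  }
}

// Convert OTLP logs to Loki log entries and push them to Loki.
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki.otel.svc.cluster.local:3100/loki/api/v1/push" // assumed service URL
  }
}

// Forward traces to Tempo's OTLP gRPC endpoint.
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.otel.svc.cluster.local:4317" // assumed service URL
    tls {
      insecure = true // fine for this lab setup, not for production
    }
  }
}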

Configuring Grafana to use the data sources

With all components set up, we can now configure Grafana to use the data sources. I’ll skip explaining how to deploy Grafana itself and assume we already have a running Grafana instance.

In order to explore our OpenTelemetry data, we need to add the data sources for Prometheus, Loki, and Tempo in Grafana.

Adding Prometheus Data Source

Adding the Prometheus data source is straightforward. In Grafana, navigate to Connections > Data Sources and click on Add data source. Select Prometheus from the list and configure it as follows:

[Screenshot: Prometheus connection settings]

All other settings can be left at their defaults.
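If you prefer provisioning data sources declaratively instead of clicking through the UI, a minimal sketch (the in-cluster URL and uid are assumptions):

apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.otel.svc.cluster.local:9090 # assumed in-cluster service URL
    isDefault: true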

Adding Loki Data Source

When adding Loki as a data source, we also want to configure some correlation settings, so that trace IDs found in log lines become links into Tempo.

[Screenshot: Loki connection settings]
[Screenshot: Loki correlation settings]
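Provisioned declaratively, this could look roughly like the following sketch; the trace_id log format and the Tempo data source uid are assumptions:

apiVersion: 1
datasources:
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki.otel.svc.cluster.local:3100 # assumed in-cluster service URL
    jsonData:
      derivedFields:
        # Turn trace IDs found in log lines into links to the Tempo data source.
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"' # assumed log line format
          url: "$${__value.raw}" # $$ escapes env interpolation in provisioning files
          datasourceUid: tempo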

Adding Tempo Data Source

For Tempo, it’s just the same: we configure the URL and some correlation settings as well, so we can jump from traces to logs and back. It’s advisable to set a span start and end time shift so that clock skew doesn’t break the correlation:

[Screenshot: Tempo connection settings]
[Screenshot: Tempo correlation settings]
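As a provisioning sketch with the time shifts mentioned above (the loki and prometheus uids are assumptions matching the earlier examples):

apiVersion: 1
datasources:
  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo.otel.svc.cluster.local:3100 # assumed in-cluster service URL
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-5m" # widen the search window to tolerate clock skew
        spanEndTimeShift: "5m"
        filterByTraceID: true
      serviceMap:
        datasourceUid: prometheus # uses the metrics-generator data for service graphs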

Harvesting the fruits

With all data sources configured, we can now start exploring our OpenTelemetry data in Grafana. We can create dashboards that visualize traces, metrics, and logs all in one place. The correlation settings we configured earlier will allow us to easily jump between related data points, making it much easier to troubleshoot issues and gain insights into our applications.
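To verify the pipeline end to end, one option is the telemetrygen utility from the OpenTelemetry Collector contrib repository. A quick smoke test, assuming Go is installed and Alloy’s OTLP gRPC port is forwarded to the local machine:

# Forward Alloy's OTLP gRPC port to localhost.
kubectl -n otel port-forward deploy/alloy 4317:4317

# Send a handful of test traces (telemetrygen logs / metrics work analogously).
go run github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest \
  traces --otlp-endpoint localhost:4317 --otlp-insecure --traces 10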

Grafana’s Drilldown feature gives us easy entry points into our OpenTelemetry data. With the metrics generator we configured for Tempo, we can now also see metrics and statistics derived from traces directly in Grafana, as well as service maps.

[Screenshot: Grafana Drilldown]