OCM Multi-Cluster Monitoring

This monitoring stack runs on top of OCM and collects metrics from all clusters (hub and spokes) into a single Prometheus/Grafana instance on the hub.

Architecture

graph TB
  subgraph Docker["Docker Host — k3d Docker Network (172.18.0.0/16)"]
    subgraph Hub["k3d-ocm-hub"]
      Prom["kube-prometheus-stack<br/>(Prometheus + Grafana)"]
      Ingress["Traefik Ingress<br/>*.100.106.163.111.nip.io"]
    end

    subgraph Spoke1["k3d-ocm-spoke-1 (172.18.0.4)"]
      NE1["node-exporter<br/>hostNetwork:9100"]
      KSM1["kube-state-metrics<br/>NodePort:30101"]
    end

    subgraph Spoke2["k3d-ocm-spoke-2 (172.18.0.5)"]
      NE2["node-exporter<br/>hostNetwork:9100"]
      KSM2["kube-state-metrics<br/>NodePort:30101"]
    end

    subgraph LB["MetalLB Pool<br/>172.18.0.200-210"]
      TL["Traefik LB<br/>172.18.0.200:80"]
    end
  end

  Ingress --> TL
  Prom -->|scrape 172.18.0.4:9100| NE1
  Prom -->|scrape 172.18.0.4:30101| KSM1
  Prom -->|scrape 172.18.0.5:9100| NE2
  Prom -->|scrape 172.18.0.5:30101| KSM2

  classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
  classDef spoke fill:#0d47a1,color:#fff,stroke:#1565c0
  classDef lb fill:#e65100,color:#fff,stroke:#ef6c00
  class Hub hub
  class Spoke1,Spoke2 spoke
  class TL,Ingress lb
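
| Component | Hub | Spoke-1 | Spoke-2 |
| --- | --- | --- | --- |
| Prometheus | ✔ | ✘ | ✘ |
| Grafana | ✔ (ClusterIP) | ✘ | ✘ |
| Alertmanager | ✔ | ✘ | ✘ |
| Node-Exporter | ✔ | ✔ (hostNetwork:9100) | ✔ (hostNetwork:9100) |
| Kube-State-Metrics | ✔ | ✔ (NodePort:30101) | ✔ (NodePort:30101) |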

How it works

  • Hub runs the full kube-prometheus-stack via Helm: Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics
  • Spokes run only the exporter components of kube-prometheus-stack: node-exporter (hostNetwork:9100) and kube-state-metrics (NodePort:30101). They are installed directly with helm upgrade --install, using the kubeconfig returned by k3d kubeconfig get
  • Scraping: Hub Prometheus scrapes the spoke exporters over the shared Docker network using the spoke containers' IPs (172.18.0.4, 172.18.0.5). The spoke targets are listed under additionalScrapeConfigs
  • Ingress: The Traefik ingress controller, reachable at MetalLB IP 172.18.0.200, serves the nip.io domains
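
The spoke installation described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the contents of the Makefile: the function name deploy_spoke_exporters, the prometheus-community repo alias, and the temporary kubeconfig handling are all hypothetical. The release name spoke-monitoring is taken from the NodePort service selector shown later on this page.

```shell
#!/bin/sh
# Sketch only: deploy_spoke_exporters is a hypothetical helper, not the real Makefile target.
# Assumes the chart repo was added under the alias "prometheus-community".

deploy_spoke_exporters() {
  cluster="$1"                      # k3d cluster name, e.g. ocm-spoke-1
  kubeconfig="$(mktemp)"

  # k3d prints the cluster's kubeconfig to stdout
  k3d kubeconfig get "$cluster" > "$kubeconfig" || { rm -f "$kubeconfig"; return 1; }

  # Install the chart with spoke-values.yaml, which disables everything but the exporters
  helm upgrade --install spoke-monitoring prometheus-community/kube-prometheus-stack \
    --kubeconfig "$kubeconfig" \
    --namespace monitoring --create-namespace \
    -f ocm/configs/monitoring/spoke-values.yaml || { rm -f "$kubeconfig"; return 1; }

  # Expose kube-state-metrics on NodePort 30101
  kubectl --kubeconfig "$kubeconfig" apply \
    -f ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml
  status=$?

  rm -f "$kubeconfig"
  return "$status"
}

# Usage: deploy_spoke_exporters ocm-spoke-1 && deploy_spoke_exporters ocm-spoke-2
```

Writing the kubeconfig to a temporary file keeps the spoke credentials out of the default kubeconfig, so the hub context is never switched by accident.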

Why not ArgoCD for spoke exporters?

The ArgoCD agent model requires each Application to live in a per-spoke namespace (e.g., ocm-spoke-1, ocm-spoke-2). The ApplicationSet clusterDecisionResource generator, however, creates Applications in the ApplicationSet's own namespace (argocd), not in the spoke namespaces. Installing the exporters directly with Helm is simpler and more reliable for this use case.

Quick Start

# After OCM is deployed:
make ocm-demo

# Deploy monitoring:
make ocm-deploy-monitoring

Accessing the Monitoring Stack

Via nip.io domains

Access from the Docker host machine — the k3d-proxy nginx forwards port 80 to Traefik:

| Service | URL |
| --- | --- |
| Grafana | http://grafana.100.106.163.111.nip.io |
| Prometheus | http://prometheus.100.106.163.111.nip.io |
| Alertmanager | http://alertmanager.100.106.163.111.nip.io |

Grafana credentials: admin / prom-operator

Via kubectl port-forward

# Grafana
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-grafana 3000:80

# Prometheus
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090

# Alertmanager
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093:9093

Prometheus API

The Prometheus API is accessible at http://prometheus.100.106.163.111.nip.io/api/v1/:

# List all active targets
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'

# Query metrics
curl -s 'http://prometheus.100.106.163.111.nip.io/api/v1/query?query=up' | jq '.data.result[] | {job: .metric.job, cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'

Configuration Files

Hub Helm Values

ocm/configs/monitoring/hub-values.yaml:

grafana:
  enabled: true
  adminPassword: prom-operator
  defaultDashboardsEnabled: true
  defaultDashboardsTimezone: "asia/kolkata"
  service:
    type: ClusterIP
  grafana.ini:
    server:
      root_url: http://grafana.100.106.163.111.nip.io

prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true
    enableFeatures:
      - remote-write-receiver
    retention: 7d
    resources:
      requests:
        cpu: 200m
        memory: 1Gi
    additionalScrapeConfigs:
      - job_name: spoke-1-node-exporter
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE1_IP:9100
            labels:
              cluster: ocm-spoke-1
      - job_name: spoke-2-node-exporter
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE2_IP:9100
            labels:
              cluster: ocm-spoke-2
      - job_name: spoke-1-kube-state-metrics
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE1_IP:30101
            labels:
              cluster: ocm-spoke-1
      - job_name: spoke-2-kube-state-metrics
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE2_IP:30101
            labels:
              cluster: ocm-spoke-2

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

The REPLACE_WITH_SPOKE*_IP placeholders are resolved at deploy time by make ocm-deploy-monitoring using docker inspect to get the spoke containers' Docker network IPs.
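
A minimal sketch of how that substitution might look is given below. The helper names container_ip and render_hub_values are hypothetical, and the container names in the usage comment (k3d-ocm-spoke-1-server-0, k3d-ocm-spoke-2-server-0) are an assumption based on k3d's usual node naming; the real Makefile target may do this differently.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical; the real Makefile target may differ.

# Print the first Docker network IP of a container.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' "$1" \
    | awk '{print $1}'
}

# Read a values file on stdin and write it to stdout with both placeholders replaced.
# $1 = spoke-1 IP, $2 = spoke-2 IP
render_hub_values() {
  sed -e "s/REPLACE_WITH_SPOKE1_IP/$1/g" \
      -e "s/REPLACE_WITH_SPOKE2_IP/$2/g"
}

# Usage (container names are assumed):
#   render_hub_values "$(container_ip k3d-ocm-spoke-1-server-0)" \
#                     "$(container_ip k3d-ocm-spoke-2-server-0)" \
#     < ocm/configs/monitoring/hub-values.yaml > /tmp/hub-values.rendered.yaml
```

Rendering to a separate file leaves the placeholders in the checked-in values file, so the step can be repeated whenever the containers are recreated and their IPs change.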

Hub Ingress

ocm/configs/monitoring/hub-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hub-monitoring
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  ingressClassName: traefik
  rules:
    - host: grafana.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-grafana
                port:
                  number: 80
    - host: prometheus.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-kube-prom-prometheus
                port:
                  number: 9090
    - host: alertmanager.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-kube-prom-alertmanager
                port:
                  number: 9093

Spoke Helm Values

ocm/configs/monitoring/spoke-values.yaml:

prometheus:
  enabled: false
alertmanager:
  enabled: false
grafana:
  enabled: false
nodeExporter:
  enabled: true
  hostNetwork: true
kubeStateMetrics:
  enabled: true

Only nodeExporter and kubeStateMetrics are enabled. The other components are disabled to keep the spoke deployment minimal.

Spoke Kube-State-Metrics NodePort

ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics-nodeport
  namespace: monitoring
spec:
  type: NodePort
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30101
  selector:
    app.kubernetes.io/instance: spoke-monitoring
    app.kubernetes.io/name: kube-state-metrics

A separate NodePort service is required because the kube-prometheus-stack chart does not support setting nodePort on the kube-state-metrics service. The selector matches the pods created by the Helm release.

Pre-Installed Grafana Dashboards

The kube-prometheus-stack ships with a comprehensive set of Grafana dashboards, auto-loaded via the Grafana sidecar container:

| Dashboard | Description |
| --- | --- |
| Kubernetes / API Server | API server metrics |
| Kubernetes / Cluster Total | Overall cluster resource usage |
| Kubernetes / Controller Manager | Controller manager metrics |
| Kubernetes / CoreDNS | DNS performance and errors |
| Kubernetes / Kubelet | Kubelet metrics |
| Kubernetes / Namespace by Pod | Per-namespace pod counts |
| Kubernetes / Namespace by Workload | Per-namespace workload counts |
| Kubernetes / Node Cluster Resource Usage | Cluster-wide node resource usage |
| Kubernetes / Node Resource Usage | Per-node resource usage |
| Kubernetes / Nodes | Node-level metrics |
| Kubernetes / Persistent Volumes Usage | PV usage metrics |
| Kubernetes / Pod Total | Pod counts across the cluster |
| Kubernetes / Proxy | kube-proxy metrics |
| Kubernetes / Scheduler | Scheduler metrics |
| Kubernetes / Workload Total | Workload counts |
| Kubernetes / Networking | Network I/O and packets |
| Prometheus / Overview | Prometheus self-metrics |

Prometheus Scrape Targets

Hub Targets

| Job | Instance | Description |
| --- | --- | --- |
| apiserver | 172.18.0.2:6443 | k3s API |
| kubelet | 172.18.0.2:10250 | kubelet metrics/cadvisor/probes |
| coredns | 10.42.0.X:9153 | CoreDNS metrics |
| node-exporter | 172.18.0.2:9100 | Hub node metrics |
| kube-state-metrics | 10.42.0.X:8080 | Hub KSM |
| prometheus-stack-grafana | 10.42.0.X:3000 | Grafana self-metrics |
| prometheus-stack-kube-prom-operator | 10.42.0.X:10250 | Operator metrics |
| prometheus-stack-kube-prom-prometheus | 10.42.0.X:9090 | Prometheus self-metrics |
| prometheus-stack-kube-prom-alertmanager | 10.42.0.X:9093 | Alertmanager metrics |

Spoke Targets (additional scrape config)

| Job | Instance | Exposure | Labels |
| --- | --- | --- | --- |
| spoke-1-node-exporter | 172.18.0.4:9100 | hostNetwork | cluster: ocm-spoke-1 |
| spoke-1-kube-state-metrics | 172.18.0.4:30101 | NodePort | cluster: ocm-spoke-1 |
| spoke-2-node-exporter | 172.18.0.5:9100 | hostNetwork | cluster: ocm-spoke-2 |
| spoke-2-kube-state-metrics | 172.18.0.5:30101 | NodePort | cluster: ocm-spoke-2 |

The cluster label allows filtering and grouping metrics by cluster in Grafana.
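
The same label can be used directly in PromQL. The helper below is a small sketch for trying such queries from the command line; prom_query is a hypothetical name, and PROM_URL simply defaults to the nip.io host used elsewhere on this page.

```shell
#!/bin/sh
# Sketch only: prom_query is a hypothetical helper.
PROM_URL="${PROM_URL:-http://prometheus.100.106.163.111.nip.io}"

# Run an instant query against the Prometheus HTTP API.
# -G with --data-urlencode sends the expression as an encoded query string,
# so braces and quotes in PromQL reach the server intact.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'count by (cluster) (up == 1)'        # healthy targets per cluster
#   prom_query 'node_load1{cluster="ocm-spoke-1"}'   # load average of one spoke
```

Hub targets do not carry a cluster label in this setup, so in the first query they are grouped under an empty cluster value.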

Verification

# Hub monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get pods

# Spoke monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-2 -n monitoring get pods

# Prometheus targets — should all be UP
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, cluster: (.labels.cluster // "hub"), instance: .labels.instance, health: .health}'

# Query up metric across clusters (-G with --data-urlencode keeps curl from globbing the braces)
curl -sG 'http://prometheus.100.106.163.111.nip.io/api/v1/query' \
  --data-urlencode 'query=up{cluster=~"ocm-spoke-.+"}' | \
  jq '.data.result[] | {cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'

# Grafana health check
curl -sI http://grafana.100.106.163.111.nip.io/ | head -2
# Expected: HTTP/1.1 302 Found, redirecting to /login

Troubleshooting

Port 80 nip.io returns 404

The k3d-proxy nginx container has a default HTTP server on port 80 that can interfere with the stream proxy to Traefik. Fix:

docker exec k3d-ocm-hub-serverlb sh -c 'mv /etc/nginx/conf.d/default.conf /etc/nginx/conf.d/default.conf.bak && nginx -s reload'

After this, requests with the correct Host header are forwarded to Traefik.

Direct access via Traefik LB

If nip.io domains don't work, access services directly via the Traefik LoadBalancer IP with the correct Host header:

# Grafana
curl -sI -H "Host: grafana.100.106.163.111.nip.io" http://172.18.0.200/

# Prometheus
curl -s -H "Host: prometheus.100.106.163.111.nip.io" http://172.18.0.200/api/v1/targets

Spoke targets show DOWN

  • node-exporter DOWN: Verify the pod is running with hostNetwork: true and that port 9100 is listening. Run curl -s http://172.18.0.4:9100/metrics | head from a hub pod or from inside the hub node container
  • kube-state-metrics DOWN: Verify the NodePort service exists and the selector matches the pods:
    kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring describe svc kube-state-metrics-nodeport
    
    The selector must match app.kubernetes.io/instance: spoke-monitoring and app.kubernetes.io/name: kube-state-metrics

Grafana dashboards not loading

Check the Grafana sidecar logs:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring logs deploy/prometheus-stack-grafana -c grafana-sc-dashboard

SSO / OIDC — Dex

Dex provides OIDC-based single sign-on for Grafana and ArgoCD on the OCM hub. It acts as the identity gateway, backed by a static password database and an OpenLDAP directory.

Architecture

graph LR
  subgraph Hub["k3d-ocm-hub"]
    Dex["Dex<br/>dex:5556"]
    LDAP["OpenLDAP<br/>ldap.default.svc:389"]
    Grafana["Grafana<br/>grafana.100.106.163.111.nip.io"]
    ArgoCD["ArgoCD<br/>argocd.100.106.163.111.nip.io"]
  end

  User -->|"[email protected] / password"| Dex
  User -->|"[email protected] / babayaga"| LDAP
  Dex --> LDAP
  Grafana -->|generic_oauth| Dex
  ArgoCD -->|oidc| Dex

  classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
  class Hub hub

Default Credentials

Dex / SSO Login

| Username | Email | Password | Provider | Groups |
| --- | --- | --- | --- | --- |
| admin | [email protected] | password | Static (Dex) | |
| john | [email protected] | babayaga | LDAP | admins |
| tony | [email protected] | ironman | LDAP | developers |

Direct Service Credentials

| Service | Username | Password | Notes |
| --- | --- | --- | --- |
| Grafana (local) | admin | prom-operator | Used when bypassing SSO |
| LDAP (admin bind) | cn=admin,dc=example,dc=com | admin | Used by Dex LDAP connector |

  • The static user is defined directly in Dex's staticPasswords with a bcrypt hash
  • LDAP users are stored in OpenLDAP under dc=example,dc=com. Dex connects to ldap.default.svc:389 and proxies authentication to it. Group membership (admins/developers) flows through to ArgoCD for RBAC

Grafana OIDC

Grafana is configured with auth.generic_oauth pointing to Dex:

# In hub-values.yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      allow_sign_up: true
      name: Dex
      scopes: openid email profile groups offline_access
      auth_url: http://dex.100.106.163.111.nip.io/dex/auth
      token_url: http://dex.100.106.163.111.nip.io/dex/token
      api_url: http://dex.100.106.163.111.nip.io/dex/userinfo
      client_id: grafana
      client_secret: SrEzVU2WVqhIJiJsenDAONnDcira5F1DRfFW64UI

The Grafana login page shows a "Sign in with Dex" button alongside the standard admin login.
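
One way to confirm that the three URLs above match what Dex actually advertises is to read its OIDC discovery document. The sketch below assumes the issuer is http://dex.100.106.163.111.nip.io/dex, as in the ArgoCD connector config on this page; dex_discovery and dex_advertises are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: dex_discovery and dex_advertises are hypothetical helpers.
DEX_ISSUER="${DEX_ISSUER:-http://dex.100.106.163.111.nip.io/dex}"

# Fetch the OIDC discovery document served under the issuer URL.
dex_discovery() {
  curl -s "$DEX_ISSUER/.well-known/openid-configuration"
}

# Succeed if the discovery document contains the given endpoint URL as a quoted value.
dex_advertises() {
  dex_discovery | grep -qF "\"$1\""
}

# Usage:
#   dex_advertises "$DEX_ISSUER/auth"     && echo "auth_url ok"
#   dex_advertises "$DEX_ISSUER/token"    && echo "token_url ok"
#   dex_advertises "$DEX_ISSUER/userinfo" && echo "api_url ok"
```

A mismatch here usually means grafana.ini points at a different host or path than the issuer Dex was started with.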

ArgoCD SSO

ArgoCD is configured as a Dex OIDC client (argocd) with group-based RBAC:

# Dex static client
staticClients:
  - id: argocd
    redirectURIs:
      - 'https://argocd.100.106.163.111.nip.io/auth/callback'
    name: 'ArgoCD'
    secret: YWdvY2Qtc2VjcmV0

ArgoCD's argocd-cm ConfigMap is patched with a Dex connector:

data:
  dex.config: |
    connectors:
      - type: oidc
        id: dex
        name: Dex
        config:
          issuer: http://dex.100.106.163.111.nip.io/dex
          clientID: argocd
          clientSecret: YWdvY2Qtc2VjcmV0
          insecureEnableGroups: true

RBAC is configured via argocd-rbac-cm:

data:
  policy.default: role:readonly
  policy.csv: |
    g, admins, role:admin
    g, developers, role:readonly
  scopes: "[groups, email]"

  • Authenticated users get read-only access by default
  • LDAP admins group members get admin role
  • LDAP developers group members get read-only (explicit, same as default)

Quick Start

# Deploy Dex + LDAP + SSO configs on the OCM hub:
make ocm-deploy-dex

# Deploy ArgoCD ingress (if not already done):
make ocm-deploy-argocd-ingress

Access

| Service | URL | Auth Method |
| --- | --- | --- |
| Dex | http://dex.100.106.163.111.nip.io/dex | |
| Grafana | http://grafana.100.106.163.111.nip.io | Dex OIDC + admin/local |
| ArgoCD | https://argocd.100.106.163.111.nip.io | Dex OIDC + admin/local |
| Dex Health | http://dex.100.106.163.111.nip.io/dex/healthz | |

Configuration Files

| File | Purpose |
| --- | --- |
| ocm/configs/monitoring/dex-ocm.yaml | Dex deployment, ConfigMap, Service, Ingress, RBAC |
| ocm/configs/monitoring/ldap-ocm.yaml | OpenLDAP deployment with bootstrap users and groups |
| ocm/configs/monitoring/argocd-ingress.yaml | Traefik IngressRouteTCP (TLS passthrough) for argocd.100.106.163.111.nip.io |

Troubleshooting

Dex health check fails: Verify the Dex pod is running and the Ingress is in place:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get ingress
curl -s http://dex.100.106.163.111.nip.io/dex/healthz

Grafana OIDC button missing: Check the grafana.ini has the OIDC section:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get cm prometheus-stack-grafana -o jsonpath='{.data.grafana\.ini}' | grep -A2 'generic_oauth'

ArgoCD OIDC login fails: Check the Dex connector config in argocd-cm:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n argocd get cm argocd-cm -o jsonpath='{.data.dex\.config}'

The dex pod itself can be checked for OIDC errors:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex logs deployment/dex | tail -20

LDAP CrashLoopBackOff: Two fixes were needed:

  1. K8s service env override — A Kubernetes Service named ldap in the same namespace injects LDAP_PORT=tcp://CLUSTER_IP:389, overriding the osixia image's LDAP_PORT=389. This breaks the slapd listen URL (ldap://HOSTNAME:tcp://IP:389). Fix: explicitly set LDAP_PORT=389 and LDAPS_PORT=636 in the deployment env.

  2. ConfigMap read-only chown — The osixia image tries to chown ConfigMap-mounted bootstrap LDIF files (read-only filesystem). Fix: use an init container to copy LDIF files from the ConfigMap volume to an emptyDir volume, mount the emptyDir at the bootstrap path. Also set LDAP_REMOVE_CONFIG_AFTER_SETUP=false to prevent the post-setup cleanup from trying to delete the mounted emptyDir.
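
The first fix can be illustrated with a short sketch. The document applies it in the Deployment manifest; the kubectl set env call below is only an imperative equivalent for experimenting on a running cluster. Both helper names are hypothetical, while the namespace default and the deployment name ldap are taken from the verification commands on this page.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical; the documented fix lives in the Deployment manifest.

# Return 0 if a value looks like a Kubernetes service-link variable (tcp://IP:PORT).
is_service_link_value() {
  case "$1" in
    tcp://*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Pin the ports explicitly so an injected LDAP_PORT=tcp://... cannot take effect.
# Extra arguments (for example --kubeconfig PATH) are passed through to kubectl.
pin_ldap_ports() {
  kubectl "$@" -n default set env deployment/ldap \
    LDAP_PORT=389 LDAPS_PORT=636 LDAP_REMOVE_CONFIG_AFTER_SETUP=false
}

# Usage:
#   is_service_link_value "tcp://10.43.12.7:389" && echo "LDAP_PORT was injected by a Service"
#   pin_ldap_ports --kubeconfig ocm/kubeconfigs/k3d/ocm-hub
```

Setting enableServiceLinks: false in the pod spec is another standard Kubernetes way to stop Service variables from being injected; whether the original deployment uses it is not stated here.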

Verify LDAP is healthy:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default get pods -l app=ldap
# Query all entries
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
  ldapsearch -x -H ldap://localhost:389 -b dc=example,dc=com -D "cn=admin,dc=example,dc=com" -w admin -LLL dn
# Test user authentication
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
  ldapwhoami -x -H ldap://localhost:389 -D "cn=john,ou=People,dc=example,dc=com" -w babayaga

Files Reference

| File | Purpose |
| --- | --- |
| ocm/configs/monitoring/hub-values.yaml | Helm values for hub; includes placeholders for the spoke Docker IPs |
| ocm/configs/monitoring/hub-ingress.yaml | Ingress for grafana/prometheus/alertmanager on *.100.106.163.111.nip.io |
| ocm/configs/monitoring/spoke-values.yaml | Helm values for spoke exporters (node-exporter + kube-state-metrics only) |
| ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml | NodePort service for spoke kube-state-metrics |
| ocm/configs/monitoring/dex-ocm.yaml | Dex deployment, ConfigMap, Service, Ingress, RBAC |
| ocm/configs/monitoring/ldap-ocm.yaml | OpenLDAP deployment with bootstrap users and groups |
| ocm/configs/monitoring/argocd-ingress.yaml | Traefik IngressRouteTCP (TLS passthrough) for argocd.100.106.163.111.nip.io |
| ocm/configs/monitoring/appset-spoke-exporters.yaml | (Reference) Abandoned ApplicationSet approach |
| ocm/configs/monitoring/generator-configmap.yaml | (Reference) clusterDecisionResource generator ConfigMap |
| ocm/configs/monitoring/placement-spoke-monitoring.yaml | (Reference) OCM Placement |