OCM Multi-Cluster Monitoring

A monitoring stack deployed on top of Open Cluster Management (OCM) that collects metrics from all clusters (hub and spokes) into a single Prometheus/Grafana instance on the hub.

Architecture

graph TB
  subgraph Docker["Docker Host — k3d Docker Network (172.18.0.0/16)"]
    subgraph Hub["k3d-ocm-hub"]
      Prom["kube-prometheus-stack<br/>(Prometheus + Grafana)"]
      Ingress["Traefik Ingress<br/>*.100.106.163.111.nip.io"]
    end

    subgraph Spoke1["k3d-ocm-spoke-1 (172.18.0.4)"]
      NE1["node-exporter<br/>hostNetwork:9100"]
      KSM1["kube-state-metrics<br/>NodePort:30101"]
    end

    subgraph Spoke2["k3d-ocm-spoke-2 (172.18.0.5)"]
      NE2["node-exporter<br/>hostNetwork:9100"]
      KSM2["kube-state-metrics<br/>NodePort:30101"]
    end

    subgraph LB["MetalLB Pool<br/>172.18.0.200-210"]
      TL["Traefik LB<br/>172.18.0.200:80"]
    end
  end

  Ingress --> TL
  Prom -->|scrape 172.18.0.4:9100| NE1
  Prom -->|scrape 172.18.0.4:30101| KSM1
  Prom -->|scrape 172.18.0.5:9100| NE2
  Prom -->|scrape 172.18.0.5:30101| KSM2

  classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
  classDef spoke fill:#0d47a1,color:#fff,stroke:#1565c0
  classDef lb fill:#e65100,color:#fff,stroke:#ef6c00
  class Hub hub
  class Spoke1,Spoke2 spoke
  class TL,Ingress lb

Component	Hub	Spoke-1	Spoke-2
Prometheus	✔	—	—
Grafana	✔ (ClusterIP)	—	—
Alertmanager	✔	—	—
Node-Exporter	✔	✔ (hostNetwork:9100)	✔ (hostNetwork:9100)
Kube-State-Metrics	✔	✔ (NodePort:30101)	✔ (NodePort:30101)

How it works

Hub runs the full kube-prometheus-stack via Helm: Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics.
Spokes run only the exporter components of kube-prometheus-stack: node-exporter (hostNetwork:9100) and kube-state-metrics (NodePort:30101). They are installed directly with helm upgrade --install, using kubeconfigs obtained from k3d kubeconfig get.
Scraping: Hub Prometheus scrapes the spoke exporters over the shared Docker network using the spoke containers' IPs (172.18.0.4, 172.18.0.5). The spoke targets are defined in Prometheus additionalScrapeConfigs.
Ingress: The Traefik ingress controller, exposed at MetalLB IP 172.18.0.200, serves the nip.io domains.

Why not ArgoCD for spoke exporters?

The ArgoCD agent model requires Applications in spoke-namespaced scopes (e.g., ocm-spoke-1, ocm-spoke-2). The ApplicationSet clusterDecisionResource generator creates Applications in the appset's own namespace (argocd), not the spoke namespaces. Direct Helm is simpler and more reliable for this use case.

Quick Start

# After OCM is deployed:
make ocm-demo

# Deploy monitoring:
make ocm-deploy-monitoring

Accessing the Monitoring Stack

Via nip.io domains

Access from the Docker host machine — the k3d-proxy nginx forwards port 80 to Traefik:

Service	URL
Grafana	`http://grafana.100.106.163.111.nip.io`
Prometheus	`http://prometheus.100.106.163.111.nip.io`
Alertmanager	`http://alertmanager.100.106.163.111.nip.io`

Grafana credentials: admin / prom-operator

Via kubectl port-forward

# Grafana
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-grafana 3000:80

# Prometheus
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090

# Alertmanager
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093:9093

Prometheus API

The Prometheus API is accessible at http://prometheus.100.106.163.111.nip.io/api/v1/:

# List all active targets
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'

# Query metrics
curl -s 'http://prometheus.100.106.163.111.nip.io/api/v1/query?query=up' | jq '.data.result[] | {job: .metric.job, cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'

Configuration Files

Hub Helm Values

ocm/configs/monitoring/hub-values.yaml:

grafana:
  enabled: true
  adminPassword: prom-operator
  defaultDashboardsEnabled: true
  defaultDashboardsTimezone: "asia/kolkata"
  service:
    type: ClusterIP
  grafana.ini:
    server:
      root_url: http://grafana.100.106.163.111.nip.io

prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true
    enableFeatures:
      - remote-write-receiver
    retention: 7d
    resources:
      requests:
        cpu: 200m
        memory: 1Gi
    additionalScrapeConfigs:
      - job_name: spoke-1-node-exporter
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE1_IP:9100
            labels:
              cluster: ocm-spoke-1
      - job_name: spoke-2-node-exporter
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE2_IP:9100
            labels:
              cluster: ocm-spoke-2
      - job_name: spoke-1-kube-state-metrics
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE1_IP:30101
            labels:
              cluster: ocm-spoke-1
      - job_name: spoke-2-kube-state-metrics
        static_configs:
          - targets:
              - REPLACE_WITH_SPOKE2_IP:30101
            labels:
              cluster: ocm-spoke-2

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

The REPLACE_WITH_SPOKE*_IP placeholders are resolved at deploy time by make ocm-deploy-monitoring using docker inspect to get the spoke containers' Docker network IPs.
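The substitution step could look roughly like the sketch below. It is not the Makefile's actual recipe: the function names, the output path, and the container names (which follow k3d's usual `k3d-<cluster>-server-0` pattern) are assumptions made for illustration.

```shell
# Sketch only: names below are assumptions, not taken from the Makefile.

# Print the Docker network IP of a container.
# Assumes the container is attached to a single Docker network.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Replace both placeholders in a values template read from stdin.
# $1 = spoke-1 IP, $2 = spoke-2 IP
render_hub_values() {
  sed -e "s/REPLACE_WITH_SPOKE1_IP/$1/g" \
      -e "s/REPLACE_WITH_SPOKE2_IP/$2/g"
}

# Possible use (output path is hypothetical):
# render_hub_values "$(container_ip k3d-ocm-spoke-1-server-0)" \
#                   "$(container_ip k3d-ocm-spoke-2-server-0)" \
#   < ocm/configs/monitoring/hub-values.yaml > /tmp/hub-values.rendered.yaml
```

Rendering to a separate file keeps the checked-in template unchanged, so the placeholders can be resolved again if the containers receive different IPs after a cluster restart.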

Hub Ingress

ocm/configs/monitoring/hub-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hub-monitoring
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  ingressClassName: traefik
  rules:
    - host: grafana.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-grafana
                port:
                  number: 80
    - host: prometheus.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-kube-prom-prometheus
                port:
                  number: 9090
    - host: alertmanager.100.106.163.111.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-kube-prom-alertmanager
                port:
                  number: 9093

Spoke Helm Values

ocm/configs/monitoring/spoke-values.yaml:

prometheus:
  enabled: false
alertmanager:
  enabled: false
grafana:
  enabled: false
nodeExporter:
  enabled: true
  hostNetwork: true
kubeStateMetrics:
  enabled: true

Only nodeExporter and kubeStateMetrics are enabled. The other components are disabled to keep the spoke deployment minimal.
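A spoke install with these values might be run as in the sketch below. The release name `spoke-monitoring` is inferred from the `app.kubernetes.io/instance` selector of the NodePort service; the chart reference `prometheus-community/kube-prometheus-stack`, the flags, and the function name are assumptions about what the Makefile target does, not a copy of it.

```shell
# Sketch only: chart reference and flags are assumptions.
# HELM can be overridden (for example HELM=echo) to preview the command.
deploy_spoke_exporters() {
  # $1 = path to the spoke's kubeconfig file
  "${HELM:-helm}" upgrade --install spoke-monitoring \
    prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --kubeconfig "$1" \
    --values ocm/configs/monitoring/spoke-values.yaml
}

# Possible use for both spokes:
# for c in ocm-spoke-1 ocm-spoke-2; do
#   deploy_spoke_exporters "ocm/kubeconfigs/k3d/$c"
# done
```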

Spoke Kube-State-Metrics NodePort

ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics-nodeport
  namespace: monitoring
spec:
  type: NodePort
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30101
  selector:
    app.kubernetes.io/instance: spoke-monitoring
    app.kubernetes.io/name: kube-state-metrics

A separate NodePort service is required because the kube-prometheus-stack chart does not support setting nodePort on the kube-state-metrics service. The selector matches the pods created by the Helm release.

Pre-Installed Grafana Dashboards

The kube-prometheus-stack ships with a comprehensive set of Grafana dashboards, auto-loaded via the Grafana sidecar container:

Dashboard	Description
Kubernetes / API Server	API server metrics
Kubernetes / Cluster Total	Overall cluster resource usage
Kubernetes / Controller Manager	Controller manager metrics
Kubernetes / CoreDNS	DNS performance and errors
Kubernetes / Kubelet	Kubelet metrics
Kubernetes / Namespace by Pod	Per-namespace pod counts
Kubernetes / Namespace by Workload	Per-namespace workload counts
Kubernetes / Node Cluster Resource Usage	Cluster-wide node resource usage
Kubernetes / Node Resource Usage	Per-node resource usage
Kubernetes / Nodes	Node-level metrics
Kubernetes / Persistent Volumes Usage	PV usage metrics
Kubernetes / Pod Total	Pod counts across the cluster
Kubernetes / Proxy	kube-proxy metrics
Kubernetes / Scheduler	Scheduler metrics
Kubernetes / Workload Total	Workload counts
Kubernetes / Networking	Network I/O and packets
Prometheus / Overview	Prometheus self-metrics

Prometheus Scrape Targets

Hub Targets

Job	Instance	Port
`apiserver`	`172.18.0.2:6443`	k3s API
`kubelet`	`172.18.0.2:10250`	kubelet metrics/cadvisor/probes
`coredns`	`10.42.0.X:9153`	CoreDNS metrics
`node-exporter`	`172.18.0.2:9100`	Hub node metrics
`kube-state-metrics`	`10.42.0.X:8080`	Hub KSM
`prometheus-stack-grafana`	`10.42.0.X:3000`	Grafana self-metrics
`prometheus-stack-kube-prom-operator`	`10.42.0.X:10250`	Operator metrics
`prometheus-stack-kube-prom-prometheus`	`10.42.0.X:9090`	Prometheus self-metrics
`prometheus-stack-kube-prom-alertmanager`	`10.42.0.X:9093`	Alertmanager metrics

Spoke Targets (additional scrape config)

Job	Instance	Port	Labels
`spoke-1-node-exporter`	`172.18.0.4:9100`	hostNetwork	`cluster: ocm-spoke-1`
`spoke-1-kube-state-metrics`	`172.18.0.4:30101`	NodePort	`cluster: ocm-spoke-1`
`spoke-2-node-exporter`	`172.18.0.5:9100`	hostNetwork	`cluster: ocm-spoke-2`
`spoke-2-kube-state-metrics`	`172.18.0.5:30101`	NodePort	`cluster: ocm-spoke-2`

The cluster label allows filtering and grouping metrics by cluster in Grafana.
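The same label can be used from the command line. The helper below is a small sketch for running instant queries against the hub Prometheus; the function name and the `PROM_URL` and `CURL` override variables are illustrative, not part of the project.

```shell
# Sketch: run an instant PromQL query against the hub Prometheus.
# PROM_URL defaults to the nip.io address used in this document.
# CURL can be overridden to preview the request without sending it.
prom_query() {
  # -G with --data-urlencode keeps braces, quotes and '+' in PromQL intact.
  "${CURL:-curl}" -sG \
    --data-urlencode "query=$1" \
    "${PROM_URL:-http://prometheus.100.106.163.111.nip.io}/api/v1/query"
}

# Count healthy scrape targets per cluster label. Hub targets carry no
# cluster label, so they are grouped under an empty label value:
# prom_query 'count by (cluster) (up == 1)'
```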

Verification

# Hub monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get pods

# Spoke monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-2 -n monitoring get pods

# Prometheus targets — should all be UP
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, cluster: (.labels.cluster // "hub"), instance: .labels.instance, health: .health}'

# Query up metric across clusters
curl -sG http://prometheus.100.106.163.111.nip.io/api/v1/query \
  --data-urlencode 'query=up{cluster=~"ocm-spoke-.+"}' | \
  jq '.data.result[] | {cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'

# Grafana health check
curl -sI http://grafana.100.106.163.111.nip.io/ | head -2
# Expected: HTTP/1.1 302 Found (redirect to /login)
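For scripted checks, the target listing above can be reduced to a pass/fail result. The sketch below assumes the targets are first flattened to tab-separated `job`, `cluster`, `health` lines; the function name and that layout are illustrative choices.

```shell
# Sketch: fail when any scrape target is not healthy.
# Reads tab-separated "job<TAB>cluster<TAB>health" lines on stdin.
report_unhealthy() {
  awk -F '\t' '
    $3 != "up" { printf "DOWN: job=%s cluster=%s\n", $1, $2; bad = 1 }
    END { exit bad + 0 }
  '
}

# Possible use with the targets endpoint:
# curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets |
#   jq -r '.data.activeTargets[] | [.labels.job, (.labels.cluster // "hub"), .health] | @tsv' |
#   report_unhealthy
```

The function exits non-zero when at least one target reports a health other than `up`, which makes it usable as a gate in a Makefile or CI step.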

Troubleshooting

Port 80 nip.io returns 404

The k3d-proxy nginx container has a default HTTP server on port 80 that can interfere with the stream proxy to Traefik. Fix:

docker exec k3d-ocm-hub-serverlb sh -c 'mv /etc/nginx/conf.d/default.conf /etc/nginx/conf.d/default.conf.bak && nginx -s reload'

After this, requests with the correct Host header are forwarded to Traefik.

Direct access via Traefik LB

If nip.io domains don't work, access services directly via the Traefik LoadBalancer IP with the correct Host header:

# Grafana
curl -sI -H "Host: grafana.100.106.163.111.nip.io" http://172.18.0.200/

# Prometheus
curl -s -H "Host: prometheus.100.106.163.111.nip.io" http://172.18.0.200/api/v1/targets

Spoke targets show DOWN

node-exporter DOWN: Verify that the pod is running with hostNetwork: true and that port 9100 is listening. From a hub pod or the hub container, run curl -s http://172.18.0.4:9100/metrics | head.
kube-state-metrics DOWN: Verify that the NodePort service exists and that its selector matches the pods:
```
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring describe svc kube-state-metrics-nodeport
```
The selector must match app.kubernetes.io/instance: spoke-monitoring and app.kubernetes.io/name: kube-state-metrics.
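A selector mismatch also shows up as a Service with no endpoint addresses. The check below is a sketch of that idea; the function name is illustrative.

```shell
# Sketch: a Service whose selector matches no ready pods has no
# endpoint addresses, so an empty result points at the selector.
ksm_nodeport_has_endpoints() {
  # $1 = path to the spoke's kubeconfig file
  ips="$(kubectl --kubeconfig "$1" -n monitoring get endpoints \
    kube-state-metrics-nodeport \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}

# Possible use:
# ksm_nodeport_has_endpoints ocm/kubeconfigs/k3d/ocm-spoke-1 \
#   || echo "kube-state-metrics-nodeport has no ready endpoints"
```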

Grafana dashboards not loading

Check the Grafana sidecar logs:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring logs deploy/prometheus-stack-grafana -c grafana-sc-dashboard

SSO / OIDC — Dex

Dex provides OIDC-based single sign-on for Grafana and ArgoCD on the OCM hub. It acts as the identity gateway, backed by a static password database and an OpenLDAP directory.

Architecture

graph LR
  subgraph Hub["k3d-ocm-hub"]
    Dex["Dex<br/>dex:5556"]
    LDAP["OpenLDAP<br/>ldap.default.svc:389"]
    Grafana["Grafana<br/>grafana.100.106.163.111.nip.io"]
    ArgoCD["ArgoCD<br/>argocd.100.106.163.111.nip.io"]
  end

  User -->|"[email protected] / password"| Dex
  User -->|"[email protected] / babayaga"| LDAP
  Dex --> LDAP
  Grafana -->|generic_oauth| Dex
  ArgoCD -->|oidc| Dex

  classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
  class Hub hub

Default Credentials

Dex / SSO Login

Username	Email	Password	Provider	Groups
admin	`[email protected]`	`password`	Static (Dex)	—
john	`[email protected]`	`babayaga`	LDAP	admins
tony	`[email protected]`	`ironman`	LDAP	developers

Direct Service Credentials

Service	Username	Password	Notes
Grafana (local)	`admin`	`prom-operator`	Used when bypassing SSO
LDAP (admin bind)	`cn=admin,dc=example,dc=com`	`admin`	Used by Dex LDAP connector

The static user is defined directly in Dex's staticPasswords with a bcrypt hash.
LDAP users are stored in OpenLDAP under dc=example,dc=com. Dex connects to ldap.default.svc:389 and acts as a proxy for LDAP logins. Group membership (admins/developers) flows through to ArgoCD for RBAC.

Grafana OIDC

Grafana is configured with auth.generic_oauth pointing to Dex:

# In hub-values.yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      allow_sign_up: true
      name: Dex
      scopes: openid email profile groups offline_access
      auth_url: http://dex.100.106.163.111.nip.io/dex/auth
      token_url: http://dex.100.106.163.111.nip.io/dex/token
      api_url: http://dex.100.106.163.111.nip.io/dex/userinfo
      client_id: grafana
      client_secret: SrEzVU2WVqhIJiJsenDAONnDcira5F1DRfFW64UI

The Grafana login page shows a "Sign in with Dex" button alongside the standard admin login.

ArgoCD SSO

ArgoCD is configured as a Dex OIDC client (argocd) with group-based RBAC:

# Dex static client
staticClients:
  - id: argocd
    redirectURIs:
      - 'https://argocd.100.106.163.111.nip.io/auth/callback'
    name: 'ArgoCD'
    secret: YWdvY2Qtc2VjcmV0

ArgoCD's argocd-cm ConfigMap is patched with a Dex connector:

data:
  dex.config: |
    connectors:
      - type: oidc
        id: dex
        name: Dex
        config:
          issuer: http://dex.100.106.163.111.nip.io/dex
          clientID: argocd
          clientSecret: YWdvY2Qtc2VjcmV0
          insecureEnableGroups: true

RBAC is configured via argocd-rbac-cm:

data:
  policy.default: role:readonly
  policy.csv: |
    g, admins, role:admin
    g, developers, role:readonly
  scopes: "[groups, email]"

Authenticated users get read-only access by default
LDAP admins group members get admin role
LDAP developers group members get read-only (explicit, same as default)
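The three rules above can be summarized as a small lookup. The helper below is purely illustrative and hypothetical: ArgoCD evaluates the real policy itself, and this function only mirrors the outcome for the two groups defined in this setup.

```shell
# Illustration only: mirrors the group-to-role outcome of the policy above.
# effective_role is a hypothetical helper, not an ArgoCD command.
effective_role() {
  # Arguments: the user's group names, if any.
  for group in "$@"; do
    if [ "$group" = "admins" ]; then
      echo "role:admin"
      return 0
    fi
  done
  # developers and users without a mapped group end up with read-only access.
  echo "role:readonly"
}

effective_role admins        # role:admin
effective_role developers    # role:readonly
effective_role               # role:readonly (policy.default)
```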

Quick Start

# Deploy Dex + LDAP + SSO configs on the OCM hub:
make ocm-deploy-dex

# Deploy ArgoCD ingress (if not already done):
make ocm-deploy-argocd-ingress

Access

Service	URL	Auth Method
Dex	`http://dex.100.106.163.111.nip.io/dex`	—
Grafana	`http://grafana.100.106.163.111.nip.io`	Dex OIDC + admin/local
ArgoCD	`https://argocd.100.106.163.111.nip.io`	Dex OIDC + admin/local
Dex Health	`http://dex.100.106.163.111.nip.io/dex/healthz`	—

Configuration Files

File	Purpose
`dex/dex-k8s.yaml`	Dex OIDC provider (single source for all clients)
`dex/ldap/k8s/`	OpenLDAP with bootstrap users and groups
`ingress/argocd-ingress.yaml`	Standard Ingress for argocd.100.106.163.111.nip.io

Troubleshooting

Dex health check fails

Verify the Dex pod is running and the Ingress is in place:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get ingress
curl -s http://dex.100.106.163.111.nip.io/dex/healthz

Grafana OIDC button missing

Check that grafana.ini contains the generic_oauth section:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get cm prometheus-stack-grafana -o jsonpath='{.data.grafana\.ini}' | grep -A2 'generic_oauth'

ArgoCD OIDC login fails

Check the Dex connector config in argocd-cm:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n argocd get cm argocd-cm -o jsonpath='{.data.dex\.config}'

Check the Dex pod logs for OIDC errors:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex logs deployment/dex | tail -20

LDAP CrashLoopBackOff — Two fixes

K8s service env override: A Kubernetes Service named ldap in the same namespace causes Kubernetes to inject LDAP_PORT=tcp://CLUSTER_IP:389 into the pod, overriding the osixia image's LDAP_PORT=389. This breaks the slapd listen URL (ldap://HOSTNAME:tcp://IP:389). Fix: explicitly set LDAP_PORT=389 and LDAPS_PORT=636 in the deployment env.

ConfigMap read-only chown: The osixia image tries to chown the ConfigMap-mounted bootstrap LDIF files, which sit on a read-only filesystem. Fix: use an init container to copy the LDIF files from the ConfigMap volume to an emptyDir volume, and mount the emptyDir at the bootstrap path. Also set LDAP_REMOVE_CONFIG_AFTER_SETUP=false to prevent the post-setup cleanup from trying to delete the mounted emptyDir.
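The first failure is easy to reproduce in isolation. In the sketch below, `listen_url` is a hypothetical stand-in for the image's startup logic, and the host name and cluster IP are made-up example values.

```shell
# Illustration: how the injected variable corrupts the listen URL.
# listen_url is a hypothetical stand-in for the image's startup logic.
listen_url() {
  # $1 = hostname, $2 = value of LDAP_PORT
  printf 'ldap://%s:%s\n' "$1" "$2"
}

listen_url ldap-0 389                      # ldap://ldap-0:389
listen_url ldap-0 tcp://10.43.0.10:389     # ldap://ldap-0:tcp://10.43.0.10:389

# One possible way to pin the values on an existing Deployment:
# kubectl -n default set env deployment/ldap LDAP_PORT=389 LDAPS_PORT=636
```

Setting the variables in the Deployment manifest itself is the more durable option, since a value patched in with kubectl set env is lost the next time the manifest is applied.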

Verify LDAP is healthy:

kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default get pods -l app=ldap
# Query all entries
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
  ldapsearch -x -H ldap://localhost:389 -b dc=example,dc=com -D "cn=admin,dc=example,dc=com" -w admin -LLL dn
# Test user authentication
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
  ldapwhoami -x -H ldap://localhost:389 -D "cn=john,ou=People,dc=example,dc=com" -w babayaga

Files Reference

File	Purpose
`ocm/configs/monitoring/hub-values.yaml`	Helm values for hub — includes placeholder spoke Docker IPs
`ocm/configs/monitoring/hub-ingress.yaml`	Ingress for grafana/prometheus/alertmanager on `*.100.106.163.111.nip.io`
`ocm/configs/monitoring/spoke-values.yaml`	Helm values for spoke exporters (node-exporter + kube-state-metrics only)
`ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml`	NodePort service for spoke kube-state-metrics
`dex/dex-k8s.yaml`	Dex OIDC provider (single source for all clients)
`dex/ldap/k8s/`	OpenLDAP with bootstrap users and groups
`ingress/argocd-ingress.yaml`	Standard Ingress for argocd.100.106.163.111.nip.io
`ocm/configs/monitoring/appset-spoke-exporters.yaml`	(Reference) Abandoned ApplicationSet approach
`ocm/configs/monitoring/generator-configmap.yaml`	(Reference) clusterDecisionResource generator ConfigMap
`ocm/configs/monitoring/placement-spoke-monitoring.yaml`	(Reference) OCM Placement
