OCM Multi-Cluster Monitoring
A monitoring stack deployed on top of OCM to collect metrics from all clusters (hub + spokes) into a single Prometheus/Grafana instance on the hub.
Architecture
graph TB
subgraph Docker["Docker Host — k3d Docker Network (172.18.0.0/16)"]
subgraph Hub["k3d-ocm-hub"]
Prom["kube-prometheus-stack<br/>(Prometheus + Grafana)"]
Ingress["Traefik Ingress<br/>*.100.106.163.111.nip.io"]
end
subgraph Spoke1["k3d-ocm-spoke-1 (172.18.0.4)"]
NE1["node-exporter<br/>hostNetwork:9100"]
KSM1["kube-state-metrics<br/>NodePort:30101"]
end
subgraph Spoke2["k3d-ocm-spoke-2 (172.18.0.5)"]
NE2["node-exporter<br/>hostNetwork:9100"]
KSM2["kube-state-metrics<br/>NodePort:30101"]
end
subgraph LB["MetalLB Pool<br/>172.18.0.200-210"]
TL["Traefik LB<br/>172.18.0.200:80"]
end
end
Ingress --> TL
Prom -->|scrape 172.18.0.4:9100| NE1
Prom -->|scrape 172.18.0.4:30101| KSM1
Prom -->|scrape 172.18.0.5:9100| NE2
Prom -->|scrape 172.18.0.5:30101| KSM2
classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
classDef spoke fill:#0d47a1,color:#fff,stroke:#1565c0
classDef lb fill:#e65100,color:#fff,stroke:#ef6c00
class Hub hub
class Spoke1,Spoke2 spoke
class TL,Ingress lb
| Component | Hub | Spoke-1 | Spoke-2 |
|---|---|---|---|
| Prometheus | ✔ | — | — |
| Grafana | ✔ (ClusterIP) | — | — |
| Alertmanager | ✔ | — | — |
| Node-Exporter | ✔ | ✔ (hostNetwork:9100) | ✔ (hostNetwork:9100) |
| Kube-State-Metrics | ✔ | ✔ (NodePort:30101) | ✔ (NodePort:30101) |
How it works
- Hub runs the full `kube-prometheus-stack` via Helm: Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics
- Spokes run only the exporter components of `kube-prometheus-stack`: node-exporter (hostNetwork:9100) and kube-state-metrics (NodePort:30101). These are deployed via direct `helm upgrade --install` using `k3d kubeconfig get`
- Scraping: Hub Prometheus scrapes spoke exporters over the shared Docker network using container IPs (`172.18.0.4`, `172.18.0.5`). Prometheus is configured with `additionalScrapeConfigs` for the spoke targets
- Ingress: Traefik ingress controller at MetalLB IP `172.18.0.200` serves nip.io domains
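The spoke deployment described above could be scripted roughly as follows. This is a minimal sketch, not the project's actual Makefile recipe. The chart reference `prometheus-community/kube-prometheus-stack`, the scratch kubeconfig path, and the function name `deploy_spoke_exporters` are assumptions. The release name `spoke-monitoring` is inferred from the NodePort service selector, and the argument is assumed to be the k3d cluster name.

```shell
# Sketch: install only the exporter components on one spoke cluster.
# Assumes the Helm repo alias "prometheus-community" has already been added.
deploy_spoke_exporters() {
  cluster="$1"                                        # e.g. ocm-spoke-1 (assumed k3d cluster name)
  kubeconfig="${TMPDIR:-/tmp}/${cluster}.kubeconfig"  # hypothetical scratch path

  # Export the spoke kubeconfig from k3d.
  k3d kubeconfig get "$cluster" > "$kubeconfig" || return 1

  # Install or upgrade the chart with the spoke values (exporters only).
  helm upgrade --install spoke-monitoring \
    prometheus-community/kube-prometheus-stack \
    --kubeconfig "$kubeconfig" \
    --namespace monitoring --create-namespace \
    -f ocm/configs/monitoring/spoke-values.yaml || return 1

  # Expose kube-state-metrics on NodePort 30101.
  kubectl --kubeconfig "$kubeconfig" apply \
    -f ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml
}
```

Calling `deploy_spoke_exporters ocm-spoke-1` and then `deploy_spoke_exporters ocm-spoke-2` would cover both spokes. `helm upgrade --install` is idempotent, so the function can be re-run after changing the values file.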
Why not ArgoCD for spoke exporters?
The ArgoCD agent model requires Applications in spoke-namespaced scopes (e.g., ocm-spoke-1, ocm-spoke-2). The ApplicationSet clusterDecisionResource generator creates Applications in the appset's own namespace (argocd), not the spoke namespaces. Direct Helm is simpler and more reliable for this use case.
Quick Start
Accessing the Monitoring Stack
Via nip.io domains
Access from the Docker host machine — the k3d-proxy nginx forwards port 80 to Traefik:
| Service | URL |
|---|---|
| Grafana | http://grafana.100.106.163.111.nip.io |
| Prometheus | http://prometheus.100.106.163.111.nip.io |
| Alertmanager | http://alertmanager.100.106.163.111.nip.io |
Grafana credentials: admin / prom-operator
Via kubectl port-forward
# Grafana
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-grafana 3000:80
# Prometheus
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090
# Alertmanager
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093:9093
Prometheus API
The Prometheus API is accessible at http://prometheus.100.106.163.111.nip.io/api/v1/:
# List all active targets
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
# Query metrics
curl -s 'http://prometheus.100.106.163.111.nip.io/api/v1/query?query=up' | jq '.data.result[] | {job: .metric.job, cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'
Configuration Files
Hub Helm Values
ocm/configs/monitoring/hub-values.yaml:
grafana:
enabled: true
adminPassword: prom-operator
defaultDashboardsEnabled: true
defaultDashboardsTimezone: "asia/kolkata"
service:
type: ClusterIP
grafana.ini:
server:
root_url: http://grafana.100.106.163.111.nip.io
prometheus:
prometheusSpec:
enableRemoteWriteReceiver: true
enableFeatures:
- remote-write-receiver
retention: 7d
resources:
requests:
cpu: 200m
memory: 1Gi
additionalScrapeConfigs:
- job_name: spoke-1-node-exporter
static_configs:
- targets:
- REPLACE_WITH_SPOKE1_IP:9100
labels:
cluster: ocm-spoke-1
- job_name: spoke-2-node-exporter
static_configs:
- targets:
- REPLACE_WITH_SPOKE2_IP:9100
labels:
cluster: ocm-spoke-2
- job_name: spoke-1-kube-state-metrics
static_configs:
- targets:
- REPLACE_WITH_SPOKE1_IP:30101
labels:
cluster: ocm-spoke-1
- job_name: spoke-2-kube-state-metrics
static_configs:
- targets:
- REPLACE_WITH_SPOKE2_IP:30101
labels:
cluster: ocm-spoke-2
nodeExporter:
enabled: true
kubeStateMetrics:
enabled: true
The REPLACE_WITH_SPOKE*_IP placeholders are resolved at deploy time by make ocm-deploy-monitoring using docker inspect to get the spoke containers' Docker network IPs.
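The substitution step might look like the sketch below. It is an illustration under stated assumptions, not the contents of the `ocm-deploy-monitoring` target: the function names are hypothetical, and the container names in the usage comment follow k3d's usual `k3d-<cluster>-server-0` pattern rather than anything confirmed here.

```shell
# Print a container's Docker network IP.
# Assumes the container is attached to a single network; with several
# networks the template would concatenate all addresses.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Read a values template on stdin and write it to stdout with both
# placeholders replaced by the given IPs ($1 = spoke 1, $2 = spoke 2).
render_hub_values() {
  sed -e "s/REPLACE_WITH_SPOKE1_IP/$1/g" \
      -e "s/REPLACE_WITH_SPOKE2_IP/$2/g"
}

# Usage (container names and output path are assumed):
#   render_hub_values "$(container_ip k3d-ocm-spoke-1-server-0)" \
#                     "$(container_ip k3d-ocm-spoke-2-server-0)" \
#     < ocm/configs/monitoring/hub-values.yaml > /tmp/hub-values.rendered.yaml
```

Rendering to a separate file keeps the committed `hub-values.yaml` free of host-specific addresses.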
Hub Ingress
ocm/configs/monitoring/hub-ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: hub-monitoring
namespace: monitoring
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
ingressClassName: traefik
rules:
- host: grafana.100.106.163.111.nip.io
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-stack-grafana
port:
number: 80
- host: prometheus.100.106.163.111.nip.io
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-stack-kube-prom-prometheus
port:
number: 9090
- host: alertmanager.100.106.163.111.nip.io
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-stack-kube-prom-alertmanager
port:
number: 9093
Spoke Helm Values
ocm/configs/monitoring/spoke-values.yaml:
prometheus:
enabled: false
alertmanager:
enabled: false
grafana:
enabled: false
nodeExporter:
enabled: true
hostNetwork: true
kubeStateMetrics:
enabled: true
Only nodeExporter and kubeStateMetrics are enabled. The other components are disabled to keep the spoke deployment minimal.
Spoke Kube-State-Metrics NodePort
ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml:
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics-nodeport
namespace: monitoring
spec:
type: NodePort
ports:
- port: 8080
targetPort: 8080
nodePort: 30101
selector:
app.kubernetes.io/instance: spoke-monitoring
app.kubernetes.io/name: kube-state-metrics
A separate NodePort service is required because the kube-prometheus-stack chart does not support setting nodePort on the kube-state-metrics service. The selector matches the pods created by the Helm release.
Pre-Installed Grafana Dashboards
The kube-prometheus-stack ships with a comprehensive set of Grafana dashboards, auto-loaded via the Grafana sidecar container:
| Dashboard | Description |
|---|---|
| Kubernetes / API Server | API server metrics |
| Kubernetes / Cluster Total | Overall cluster resource usage |
| Kubernetes / Controller Manager | Controller manager metrics |
| Kubernetes / CoreDNS | DNS performance and errors |
| Kubernetes / Kubelet | Kubelet metrics |
| Kubernetes / Namespace by Pod | Per-namespace pod counts |
| Kubernetes / Namespace by Workload | Per-namespace workload counts |
| Kubernetes / Node Cluster Resource Usage | Cluster-wide node resource usage |
| Kubernetes / Node Resource Usage | Per-node resource usage |
| Kubernetes / Nodes | Node-level metrics |
| Kubernetes / Persistent Volumes Usage | PV usage metrics |
| Kubernetes / Pod Total | Pod counts across the cluster |
| Kubernetes / Proxy | kube-proxy metrics |
| Kubernetes / Scheduler | Scheduler metrics |
| Kubernetes / Workload Total | Workload counts |
| Kubernetes / Networking | Network I/O and packets |
| Prometheus / Overview | Prometheus self-metrics |
Prometheus Scrape Targets
Hub Targets
| Job | Instance | Description |
|---|---|---|
| apiserver | 172.18.0.2:6443 | k3s API |
| kubelet | 172.18.0.2:10250 | kubelet metrics/cadvisor/probes |
| coredns | 10.42.0.X:9153 | CoreDNS metrics |
| node-exporter | 172.18.0.2:9100 | Hub node metrics |
| kube-state-metrics | 10.42.0.X:8080 | Hub KSM |
| prometheus-stack-grafana | 10.42.0.X:3000 | Grafana self-metrics |
| prometheus-stack-kube-prom-operator | 10.42.0.X:10250 | Operator metrics |
| prometheus-stack-kube-prom-prometheus | 10.42.0.X:9090 | Prometheus self-metrics |
| prometheus-stack-kube-prom-alertmanager | 10.42.0.X:9093 | Alertmanager metrics |
Spoke Targets (additional scrape config)
| Job | Instance | Exposure | Labels |
|---|---|---|---|
| spoke-1-node-exporter | 172.18.0.4:9100 | hostNetwork | cluster: ocm-spoke-1 |
| spoke-1-kube-state-metrics | 172.18.0.4:30101 | NodePort | cluster: ocm-spoke-1 |
| spoke-2-node-exporter | 172.18.0.5:9100 | hostNetwork | cluster: ocm-spoke-2 |
| spoke-2-kube-state-metrics | 172.18.0.5:30101 | NodePort | cluster: ocm-spoke-2 |
The cluster label allows filtering and grouping metrics by cluster in Grafana.
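Outside Grafana, the same label can be used in a PromQL selector through the Prometheus HTTP API. The helpers below are a hedged sketch: the function names and the `PROM_URL` variable are hypothetical. `curl -G --data-urlencode` is used so that the braces and quotes in the selector are URL-encoded rather than interpreted by curl.

```shell
# Base URL of the hub Prometheus; override PROM_URL to use a port-forward.
PROM_URL="${PROM_URL:-http://prometheus.100.106.163.111.nip.io}"

# Build a PromQL selector that restricts a metric to one cluster label value.
promql_for_cluster() {
  metric="$1"; cluster="$2"
  printf '%s{cluster="%s"}' "$metric" "$cluster"
}

# Run an instant query against the Prometheus HTTP API.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$(promql_for_cluster up ocm-spoke-1)" | jq '.data.result'
```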
Verification
# Hub monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get pods
# Spoke monitoring pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-2 -n monitoring get pods
# Prometheus targets — should all be UP
curl -s http://prometheus.100.106.163.111.nip.io/api/v1/targets | \
jq '.data.activeTargets[] | {job: .labels.job, cluster: (.labels.cluster // "hub"), instance: .labels.instance, health: .health}'
# Query up metric across clusters
curl -sG http://prometheus.100.106.163.111.nip.io/api/v1/query --data-urlencode 'query=up{cluster=~"ocm-spoke-.+"}' | \
jq '.data.result[] | {cluster: .metric.cluster, instance: .metric.instance, value: .value[1]}'
# Grafana health check
curl -sI http://grafana.100.106.163.111.nip.io/ | head -2
# Expected: HTTP/1.1 302 Found → redirects to /login
Troubleshooting
Port 80 nip.io returns 404
The k3d-proxy nginx container has a default HTTP server on port 80 that can interfere with the stream proxy to Traefik. Fix:
docker exec k3d-ocm-hub-serverlb sh -c 'mv /etc/nginx/conf.d/default.conf /etc/nginx/conf.d/default.conf.bak && nginx -s reload'
After this, requests with the correct Host header are forwarded to Traefik.
Direct access via Traefik LB
If nip.io domains don't work, access services directly via the Traefik LoadBalancer IP with the correct Host header:
# Grafana
curl -sI -H "Host: grafana.100.106.163.111.nip.io" http://172.18.0.200/
# Prometheus
curl -s -H "Host: prometheus.100.106.163.111.nip.io" http://172.18.0.200/api/v1/targets
Spoke targets show DOWN
- node-exporter DOWN: Verify the pod is running with `hostNetwork: true` and port 9100 is listening. Check `curl -s http://172.18.0.4:9100/metrics | head` from a hub pod or the hub container
- kube-state-metrics DOWN: Verify the NodePort service exists and the selector matches the pods:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-spoke-1 -n monitoring describe svc kube-state-metrics-nodeport
The selector must match `app.kubernetes.io/instance: spoke-monitoring` and `app.kubernetes.io/name: kube-state-metrics`
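To check all four spoke endpoints in one pass, a small probe loop can help. This is a sketch with assumptions: the function name and the `SPOKE_TARGETS` variable are hypothetical, the default addresses are the ones shown in the architecture diagram, and it must run from somewhere that can reach the k3d Docker network (for example the Docker host on Linux).

```shell
# Spoke exporter endpoints; override if Docker assigns different IPs.
SPOKE_TARGETS="${SPOKE_TARGETS:-172.18.0.4:9100 172.18.0.4:30101 172.18.0.5:9100 172.18.0.5:30101}"

# Probe each endpoint's /metrics path and print UP or DOWN per target.
# Returns 1 if any target failed, 0 otherwise.
probe_spoke_targets() {
  failed=0
  for target in $SPOKE_TARGETS; do
    if curl -sf -o /dev/null --max-time 5 "http://$target/metrics"; then
      echo "UP   $target"
    else
      echo "DOWN $target"
      failed=1
    fi
  done
  return "$failed"
}
```

A target that is DOWN here but reachable from inside the spoke cluster points at the exposure layer (hostNetwork or the NodePort service) rather than at the exporter itself.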
Grafana dashboards not loading
Check the Grafana sidecar logs:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring logs deploy/prometheus-stack-grafana -c grafana-sc-dashboard
SSO / OIDC — Dex
Dex provides OIDC-based single sign-on for Grafana and ArgoCD on the OCM hub. It acts as the identity gateway, backed by a static password database and an OpenLDAP directory.
Architecture
graph LR
subgraph Hub["k3d-ocm-hub"]
Dex["Dex<br/>dex:5556"]
LDAP["OpenLDAP<br/>ldap.default.svc:389"]
Grafana["Grafana<br/>grafana.100.106.163.111.nip.io"]
ArgoCD["ArgoCD<br/>argocd.100.106.163.111.nip.io"]
end
User -->|"[email protected] / password"| Dex
User -->|"[email protected] / babayaga"| LDAP
Dex --> LDAP
Grafana -->|generic_oauth| Dex
ArgoCD -->|oidc| Dex
classDef hub fill:#1b5e20,color:#fff,stroke:#2e7d32
class Hub hub
Default Credentials
Dex / SSO Login
| Username | Email | Password | Provider | Groups |
|---|---|---|---|---|
| admin | [email protected] | password | Static (Dex) | - |
| john | [email protected] | babayaga | LDAP | admins |
| tony | [email protected] | ironman | LDAP | developers |
Direct Service Credentials
| Service | Username | Password | Notes |
|---|---|---|---|
| Grafana (local) | admin | prom-operator | Used when bypassing SSO |
| LDAP (admin bind) | cn=admin,dc=example,dc=com | admin | Used by Dex LDAP connector |
- Static user is defined directly in Dex's `staticPasswords` with a bcrypt hash
- LDAP users are stored in OpenLDAP under `dc=example,dc=com`. Dex connects to `ldap.default.svc:389` as a proxy. Group membership (admins/developers) flows through to ArgoCD for RBAC
Grafana OIDC
Grafana is configured with auth.generic_oauth pointing to Dex:
# In hub-values.yaml
grafana:
grafana.ini:
auth.generic_oauth:
enabled: true
allow_sign_up: true
name: Dex
scopes: openid email profile groups offline_access
auth_url: http://dex.100.106.163.111.nip.io/dex/auth
token_url: http://dex.100.106.163.111.nip.io/dex/token
api_url: http://dex.100.106.163.111.nip.io/dex/userinfo
client_id: grafana
client_secret: SrEzVU2WVqhIJiJsenDAONnDcira5F1DRfFW64UI
The Grafana login page shows a "Sign in with Dex" button alongside the standard admin login.
ArgoCD SSO
ArgoCD is configured as a Dex OIDC client (argocd) with group-based RBAC:
# Dex static client
staticClients:
- id: argocd
redirectURIs:
- 'https://argocd.100.106.163.111.nip.io/auth/callback'
name: 'ArgoCD'
secret: YWdvY2Qtc2VjcmV0
ArgoCD's argocd-cm ConfigMap is patched with a Dex connector:
data:
dex.config: |
connectors:
- type: oidc
id: dex
name: Dex
config:
issuer: http://dex.100.106.163.111.nip.io/dex
clientID: argocd
clientSecret: YWdvY2Qtc2VjcmV0
insecureEnableGroups: true
RBAC is configured via argocd-rbac-cm:
data:
policy.default: role:readonly
policy.csv: |
g, admins, role:admin
g, developers, role:readonly
scopes: "[groups, email]"
- Authenticated users get read-only access by default
- LDAP `admins` group members get admin role
- LDAP `developers` group members get read-only (explicit, same as default)
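The effect of these rules can be illustrated with a small lookup. This is only a model of how the mapping resolves for the three cases listed, not how ArgoCD evaluates policies internally. The function name is hypothetical.

```shell
# Return the ArgoCD role a user would end up with under the policy above,
# given the group names from the OIDC token as arguments.
argocd_role_for_groups() {
  role="role:readonly"                    # policy.default
  for group in "$@"; do
    case "$group" in
      admins)     role="role:admin" ;;    # g, admins, role:admin
      developers) : ;;                    # g, developers, role:readonly (same as default)
    esac
  done
  echo "$role"
}

argocd_role_for_groups admins        # prints role:admin
argocd_role_for_groups developers    # prints role:readonly
argocd_role_for_groups               # prints role:readonly
```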
Quick Start
# Deploy Dex + LDAP + SSO configs on the OCM hub:
make ocm-deploy-dex
# Deploy ArgoCD ingress (if not already done):
make ocm-deploy-argocd-ingress
Access
| Service | URL | Auth Method |
|---|---|---|
| Dex | http://dex.100.106.163.111.nip.io/dex | - |
| Grafana | http://grafana.100.106.163.111.nip.io | Dex OIDC + admin/local |
| ArgoCD | https://argocd.100.106.163.111.nip.io | Dex OIDC + admin/local |
| Dex Health | http://dex.100.106.163.111.nip.io/dex/healthz | - |
Configuration Files
| File | Purpose |
|---|---|
| ocm/configs/monitoring/dex-ocm.yaml | Dex deployment, ConfigMap, Service, Ingress, RBAC |
| ocm/configs/monitoring/ldap-ocm.yaml | OpenLDAP deployment with bootstrap users and groups |
| ocm/configs/monitoring/argocd-ingress.yaml | Traefik IngressRouteTCP (TLS passthrough) for argocd.100.106.163.111.nip.io |
Troubleshooting
Dex health check fails: Verify the Dex pod is running and the Ingress is in place:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get pods
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n dex get ingress
curl -s http://dex.100.106.163.111.nip.io/dex/healthz
Grafana OIDC button missing: Check that grafana.ini contains the generic_oauth section:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n monitoring get cm prometheus-stack-grafana -o jsonpath='{.data.grafana\.ini}' | grep -A2 'generic_oauth'
ArgoCD OIDC login fails: Check the Dex connector config in argocd-cm:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n argocd get cm argocd-cm -o jsonpath='{.data.dex\.config}'
The Dex pod logs can also be checked for OIDC errors.
LDAP CrashLoopBackOff: Two fixes were needed:
- K8s service env override: A Kubernetes Service named `ldap` in the same namespace injects `LDAP_PORT=tcp://CLUSTER_IP:389`, overriding the osixia image's `LDAP_PORT=389`. This breaks the slapd listen URL (`ldap://HOSTNAME:tcp://IP:389`). Fix: explicitly set `LDAP_PORT=389` and `LDAPS_PORT=636` in the deployment env.
- ConfigMap read-only chown: The osixia image tries to `chown` ConfigMap-mounted bootstrap LDIF files (read-only filesystem). Fix: use an init container to copy LDIF files from the ConfigMap volume to an emptyDir volume, then mount the emptyDir at the bootstrap path. Also set `LDAP_REMOVE_CONFIG_AFTER_SETUP=false` to prevent the post-setup cleanup from trying to delete the mounted emptyDir.
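The first failure is easier to see with the values written out. The snippet below is a simplified model of a startup script that assembles the listen URL from a hostname and `LDAP_PORT`. It is not the osixia image's actual code, and the hostname and cluster IP are made-up examples.

```shell
# Simplified model: build the slapd listen URL from a host and a port value.
slapd_listen_url() {
  host="$1"; port="$2"
  echo "ldap://${host}:${port}"
}

# Value Kubernetes injects for a Service named "ldap" (example cluster IP):
slapd_listen_url ldap-0 "tcp://10.43.0.10:389"   # prints ldap://ldap-0:tcp://10.43.0.10:389
# Value after setting LDAP_PORT=389 explicitly in the deployment env:
slapd_listen_url ldap-0 "389"                    # prints ldap://ldap-0:389
```

Renaming the Service so that it no longer produces an `LDAP_PORT` variable would avoid the collision as well, but setting the variable explicitly keeps the `ldap.default.svc` hostname that Dex already uses.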
Verify LDAP is healthy:
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default get pods -l app=ldap
# Query all entries
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
ldapsearch -x -H ldap://localhost:389 -b dc=example,dc=com -D "cn=admin,dc=example,dc=com" -w admin -LLL dn
# Test user authentication
kubectl --kubeconfig ocm/kubeconfigs/k3d/ocm-hub -n default exec deployment/ldap -- \
ldapwhoami -x -H ldap://localhost:389 -D "cn=john,ou=People,dc=example,dc=com" -w babayaga
Files Reference
| File | Purpose |
|---|---|
| ocm/configs/monitoring/hub-values.yaml | Helm values for hub; includes placeholder spoke Docker IPs |
| ocm/configs/monitoring/hub-ingress.yaml | Ingress for grafana/prometheus/alertmanager on *.100.106.163.111.nip.io |
| ocm/configs/monitoring/spoke-values.yaml | Helm values for spoke exporters (node-exporter + kube-state-metrics only) |
| ocm/configs/monitoring/spoke-kube-state-metrics-nodeport.yaml | NodePort service for spoke kube-state-metrics |
| ocm/configs/monitoring/dex-ocm.yaml | Dex deployment, ConfigMap, Service, Ingress, RBAC |
| ocm/configs/monitoring/ldap-ocm.yaml | OpenLDAP deployment with bootstrap users and groups |
| ocm/configs/monitoring/argocd-ingress.yaml | Traefik Ingress for argocd.100.106.163.111.nip.io |
| ocm/configs/monitoring/appset-spoke-exporters.yaml | (Reference) Abandoned ApplicationSet approach |
| ocm/configs/monitoring/generator-configmap.yaml | (Reference) clusterDecisionResource generator ConfigMap |
| ocm/configs/monitoring/placement-spoke-monitoring.yaml | (Reference) OCM Placement |