432 changes: 419 additions & 13 deletions products/kubernetes-operator/guides/configuration.mdx

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions products/kubernetes-operator/guides/introduction.mdx
@@ -169,10 +169,10 @@ To completely remove storage:
kubectl delete clickhousecluster my-cluster

# Wait for pods to terminate
-kubectl wait --for=delete pod -l app.kubernetes.io/instance=my-cluster
+kubectl wait --for=delete pod -l app.kubernetes.io/instance=my-cluster-clickhouse

# Delete PVCs
-kubectl delete pvc -l app.kubernetes.io/instance=sample-cluster
+kubectl delete pvc -l app.kubernetes.io/instance=my-cluster-clickhouse
```

## Default configuration highlights {#default-configuration-highlights}
378 changes: 378 additions & 0 deletions products/kubernetes-operator/guides/monitoring.mdx
@@ -0,0 +1,378 @@
---
position: 3
slug: /clickhouse-operator/guides/monitoring
title: Monitoring the ClickHouse Operator
keywords: ['kubernetes', 'prometheus', 'monitoring', 'metrics']
description: 'How to scrape, secure, and use the operator metrics and health endpoints.'
doc_type: 'guide'
---

The operator exposes Prometheus-compatible metrics and Kubernetes health probes so that you can observe its reconciliation activity, detect stalled controllers, and alert on failures.

This guide covers what the operator exposes, how to scrape it, and which queries are useful day to day.

<Note>
This guide is about the **operator process itself** (the controller manager). For ClickHouse server metrics (queries, parts, replication lag), scrape the [Prometheus endpoint in ClickHouse](/reference/settings/server-settings/settings#prometheus) separately.
</Note>

## Endpoints {#endpoints}

The operator process exposes two HTTP endpoints inside the manager pod:

| Endpoint | Default port | Path | Purpose |
|---|---|---|---|
| Metrics | `8080` (Helm) / `0` disabled (binary default) | `/metrics` | Prometheus exposition format |
| Health probe | `8081` | `/healthz`, `/readyz` | Kubernetes liveness and readiness |

The metrics endpoint is **off by default** when running the operator binary directly (`--metrics-bind-address=0`). The Helm chart turns it on with `metrics.enable: true` and `metrics.port: 8080`.

The health probe endpoint is always on; the deployment template wires `/healthz` and `/readyz` to the pod's liveness and readiness probes on port `8081`.

## Operator binary flags {#operator-binary-flags}

The relevant `manager` flags (defined in [`cmd/main.go`](https://github.com/ClickHouse/clickhouse-operator/blob/main/cmd/main.go)):

| Flag | Default | Description |
|---|---|---|
| `--metrics-bind-address` | `0` (disabled) | Bind address for the metrics endpoint. Set to `:8443` for HTTPS or `:8080` for HTTP. Leave as `0` to disable the metrics server. |
| `--metrics-secure` | `true` | Serve metrics over HTTPS with authn/authz. Set to `false` for plain HTTP. |
| `--metrics-cert-path` | empty | Directory containing TLS cert files (`tls.crt`, `tls.key`) for the metrics server. |
| `--metrics-cert-name` | `tls.crt` | Cert file name inside `--metrics-cert-path`. |
| `--metrics-cert-key` | `tls.key` | Key file name inside `--metrics-cert-path`. |
| `--enable-http2` | `false` | Enable HTTP/2 for the metrics **and webhook** servers. Off by default to mitigate CVE-2023-44487 / CVE-2023-39325. |
| `--leader-elect` | `false` (binary) / `true` (Helm chart) | Enable leader election so only one replica reconciles at a time. The Helm chart sets this flag in `manager.args` by default. |
| `--health-probe-bind-address` | `:8081` | Bind address for `/healthz` and `/readyz`. |

<Note>
The `8443` (HTTPS) / `8080` (HTTP) convention in the flag's help text is only a hint. The Helm chart serves HTTPS on `8080` because it sets both `metrics.port: 8080` and `metrics.secure: true`. There is no port-based mode detection — `--metrics-secure` is what selects HTTPS or HTTP.
</Note>
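
The rule in the note can be written down as a small helper. This is an illustrative sketch only: `scrape_url` and its parameters are hypothetical names, not part of the operator, and the code assumes the bind address is either `0` or ends in `:<port>`.

```python
from typing import Optional


def scrape_url(host: str, bind_address: str, secure: bool) -> Optional[str]:
    """Return the URL a scraper should use, or None when metrics are disabled.

    bind_address mirrors --metrics-bind-address and secure mirrors
    --metrics-secure. The port never influences the scheme.
    """
    if bind_address == "0":
        return None  # metrics server is disabled
    port = bind_address.rsplit(":", 1)[1]
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}/metrics"


# Helm defaults: HTTPS on 8080, even though 8080 is the "HTTP" port in the hint.
print(scrape_url("metrics.example.svc", ":8080", True))
```

The host name in the usage line is a placeholder; substitute the metrics Service of your install.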

## Enable metrics via Helm {#enable-metrics-via-helm}

The chart already creates a `Service` for the metrics port and, optionally, a `ServiceMonitor` for prometheus-operator.

The metrics endpoint itself is on by default (`metrics.enable: true`, port `8080`, served over HTTPS via `metrics.secure: true`). The only setting you typically need to flip is `prometheus.enable` to have the chart create a `ServiceMonitor` for you:

```yaml
# values.yaml (minimal override)
prometheus:
  enable: true
```

If you do not use cert-manager, also set `certManager.enable: false`. The ServiceMonitor then scrapes with `insecureSkipVerify: true` and relies on bearer-token authentication only.

The full set of metrics-related defaults is:

```yaml
metrics:
  enable: true
  port: 8080
  secure: true                 # HTTPS with authn/authz enforced on every scrape

certManager:
  enable: true                 # Issues the metrics server certificate

prometheus:
  enable: false                # Set to true to render the ServiceMonitor
  scraping_annotations: false  # Alternative: prometheus.io/scrape pod annotations
```

Apply:

```bash
helm upgrade --install clickhouse-operator \
  oci://ghcr.io/clickhouse/clickhouse-operator-helm \
  -n clickhouse-operator-system --create-namespace \
  -f values.yaml
```

After install the chart creates:

- `Service/<resource-prefix>-metrics-service` — exposes port `8080` (HTTPS when `metrics.secure: true`).
- `ServiceMonitor/<resource-prefix>-controller-manager-metrics-monitor` — when `prometheus.enable: true`.
- `ClusterRole/<resource-prefix>-metrics-reader` — non-resource URL `/metrics` with `get` verb.

## Securing the metrics endpoint {#securing-the-metrics-endpoint}

When `metrics.secure: true` the metrics server enforces TLS **and** Kubernetes authentication/authorization on every scrape. Scrapers must:

1. Present a valid Kubernetes bearer token.
2. Authenticate as a ServiceAccount bound to a ClusterRole that grants `get` on the non-resource URL `/metrics`.

The chart ships such a ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: clickhouse-operator-metrics-reader
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
```

Bind it to the ServiceAccount used by your scraper (typically Prometheus):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-clickhouse-operator-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: clickhouse-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: <prometheus-sa>
    namespace: <prometheus-namespace>
```

<Warning>
If you see `401 Unauthorized` or `403 Forbidden` from the metrics endpoint, the scraper reaches the endpoint over HTTPS but either sends no valid Kubernetes bearer token (`401`) or authenticates as a ServiceAccount that lacks the binding above (`403`). Disabling security by setting `metrics.secure: false` is **not recommended** in shared clusters because anyone with network reachability to the pod could scrape the endpoint.
</Warning>
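
To make the two requirements concrete, here is a minimal sketch of an in-cluster scraper built on the Python standard library. The function names are hypothetical and the URL is supplied by the caller; the token path is the usual in-pod ServiceAccount token location. TLS verification is controlled by the `ssl.SSLContext` the caller passes in.

```python
import ssl
import urllib.error
import urllib.request

# Usual in-pod location of the ServiceAccount token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_request(url: str, token: str) -> urllib.request.Request:
    """Attach the bearer token that the secure metrics server expects."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def explain_status(status: int) -> str:
    """Map the HTTP status of a scrape to its most likely cause."""
    if status == 200:
        return "ok"
    if status == 401:
        return "authentication failed: missing or invalid bearer token"
    if status == 403:
        return "authorization failed: ServiceAccount lacks get on /metrics"
    return f"unexpected status {status}"


def scrape(url: str, context: ssl.SSLContext, token_path: str = TOKEN_PATH) -> str:
    """Fetch /metrics and return the body; raise with a readable cause on failure."""
    with open(token_path, encoding="utf-8") as f:
        token = f.read().strip()
    try:
        with urllib.request.urlopen(build_request(url, token), context=context) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        raise RuntimeError(explain_status(err.code)) from err
```

A caller that trusts the cert-manager CA would build the context with `ssl.create_default_context(cafile=...)`; the CA file location depends on how the secret is mounted and is not specified here.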

## ServiceMonitor reference {#servicemonitor-reference}

The chart renders a ServiceMonitor of this shape when `prometheus.enable: true`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <release>-controller-manager-metrics-monitor
  namespace: <operator-namespace>
  labels:
    control-plane: controller-manager
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - path: /metrics
      port: https   # "http" when metrics.secure: false
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        serverName: <release>-metrics-service.<operator-namespace>.svc
        ca:
          secret:
            name: metrics-server-cert
            key: ca.crt
        cert:
          secret:
            name: metrics-server-cert
            key: tls.crt
        keySecret:
          name: metrics-server-cert
          key: tls.key
```

If your cluster does not run cert-manager, set `tlsConfig.insecureSkipVerify: true` and rely on bearer-token authentication only. The chart already does this when `certManager.enable: false`.

## Standalone Prometheus example {#standalone-prometheus-example}

If you do not use kube-prometheus-stack, the repository ships a self-contained example at [`examples/prometheus_secure_metrics_scraper.yaml`](https://github.com/ClickHouse/clickhouse-operator/blob/main/examples/prometheus_secure_metrics_scraper.yaml). It creates a ServiceAccount, the necessary RBAC, and a `Prometheus` CR that selects the operator's ServiceMonitor.

## Health probe endpoints {#health-probe-endpoints}

| Path | Used by | Returns |
|---|---|---|
| `/healthz` | Kubernetes liveness probe | `200 OK` as long as the probe server is listening. |
| `/readyz` | Kubernetes readiness probe | `200 OK` as long as the probe server is listening. |

Both endpoints are registered with the same trivial ping check (`healthz.Ping` from `sigs.k8s.io/controller-runtime`). A failing probe therefore means "the manager process is not serving HTTP on `:8081`" — not "controllers are unhealthy". To detect controller-level problems, use the [reconciliation metrics](#reconciliation-activity) instead.

Both endpoints are served on port `8081` by default. They are wired to the deployment as:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

A repeatedly failing probe usually means the probe server itself never came up — for example, the manager exited early during startup. Check the manager logs for `unable to start manager`, RBAC failures, or `cache did not sync` errors.
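
For a manual check of the same endpoints, the sketch below probes both paths over plain HTTP. It assumes the probe port has been made reachable locally, for example through a port-forward to `8081`; the helper names and the base URL are hypothetical.

```python
import urllib.error
import urllib.request


def probe(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Return True when the probe path answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status: treat as failing.
        return False


def check_probes(base_url: str) -> dict:
    """Probe both endpoints; they share the same ping check, so they should agree."""
    return {path: probe(base_url, path) for path in ("/healthz", "/readyz")}
```

With a port-forward in place, `check_probes("http://127.0.0.1:8081")` returning `False` for both paths points at the manager process rather than at an individual controller.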

## Metrics catalog {#metrics-catalog}

The operator does not register custom Prometheus collectors. Everything below is exposed by the underlying `controller-runtime` and `client-go` libraries. The most useful series, grouped by purpose:

### Reconciliation activity {#reconciliation-activity}

| Metric | Type | Labels |
|---|---|---|
| `controller_runtime_reconcile_total` | counter | `controller`, `result` (`success` / `error` / `requeue` / `requeue_after`) |
| `controller_runtime_reconcile_errors_total` | counter | `controller` |
| `controller_runtime_reconcile_time_seconds_bucket` | histogram | `controller` |
| `controller_runtime_active_workers` | gauge | `controller` |
| `controller_runtime_max_concurrent_reconciles` | gauge | `controller` |

The `controller` label is derived by `controller-runtime` from the resource type registered with `For(...)`. With the current code in `internal/controller/clickhouse` and `internal/controller/keeper` this resolves to `clickhousecluster` and `keepercluster` respectively. If you have customized the operator, verify with a one-time scrape of `/metrics`.
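
One way to do that verification is to parse a scrape and list the `controller` and `result` values that actually appear. The sketch below is a simplified parser for this one metric, not a full implementation of the exposition format, and the sample text is made up for illustration.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def reconcile_totals(exposition: str) -> dict:
    """Sum controller_runtime_reconcile_total per (controller, result)."""
    totals = defaultdict(float)
    for line in exposition.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group(1) != "controller_runtime_reconcile_total":
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        key = (labels.get("controller", ""), labels.get("result", ""))
        totals[key] += float(match.group(3))
    return dict(totals)


# Made-up sample in the exposition format; the values are illustrative only.
SAMPLE = '''\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="clickhousecluster",result="success"} 42
controller_runtime_reconcile_total{controller="clickhousecluster",result="error"} 3
controller_runtime_reconcile_total{controller="keepercluster",result="success"} 7
workqueue_depth{controller="keepercluster",name="keepercluster"} 0
'''

for (controller, result), value in sorted(reconcile_totals(SAMPLE).items()):
    print(controller, result, value)
```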

### Work queue {#work-queue}

| Metric | Type | Labels |
|---|---|---|
| `workqueue_depth` | gauge | `name`, `controller`, `priority` |
| `workqueue_adds_total` | counter | `name`, `controller` |
| `workqueue_retries_total` | counter | `name`, `controller` |
| `workqueue_unfinished_work_seconds` | gauge | `name`, `controller` |
| `workqueue_longest_running_processor_seconds` | gauge | `name`, `controller` |
| `workqueue_queue_duration_seconds_bucket` | histogram | `name`, `controller` |
| `workqueue_work_duration_seconds_bucket` | histogram | `name`, `controller` |

The `name` and `controller` labels carry the same value (the controller name).

### API server traffic {#api-server-traffic}

| Metric | Type | Labels |
|---|---|---|
| `rest_client_requests_total` | counter | `code`, `method`, `host` |

### Leader election {#leader-election}

| Metric | Type | Labels |
|---|---|---|
| `leader_election_master_status` | gauge | `name` (= `d4ceba06.clickhouse.com`) |

The Helm chart enables `--leader-elect` by default, so this metric is present in standard Helm installs. When running the binary directly without the flag, the metric is absent.

### Runtime {#runtime}

Standard Go process and runtime collectors — `go_goroutines`, `go_memstats_*`, `process_cpu_seconds_total`, `process_resident_memory_bytes`, etc.

## Useful PromQL queries {#useful-promql-queries}

### Health overview

```promql
# Reconciliation rate per controller
sum by (controller) (rate(controller_runtime_reconcile_total[5m]))

# Error rate per controller (alert if > 0 sustained)
sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))

# p99 reconcile latency
histogram_quantile(
  0.99,
  sum by (le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket[5m]))
)
```
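
To build intuition for what the p99 query returns, the sketch below reimplements the bucket interpolation in simplified form. It works on cumulative `(le, count)` pairs, assumes positive bounds and a final `+Inf` bucket, and uses hypothetical counts; PromQL applies the same idea to per-second rates rather than raw counts.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Finds the bucket that contains the target rank, then interpolates
    linearly between that bucket's lower and upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for one controller (le, count).
buckets = [(1.0, 40), (5.0, 80), (30.0, 98), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))   # 2.0
print(histogram_quantile(0.99, buckets))  # 30.0
```

The second result shows a practical limit: once the rank lands in the `+Inf` bucket, the estimate is capped at the highest finite bucket bound, so a reported p99 can understate the real latency.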

### Backlog detection

```promql
# Pending items in the work queue — a sustained value > 0 indicates a backlog,
# but short spikes during large reconciles are normal.
avg_over_time(workqueue_depth[10m])

# Reconciles that have been running for a long time
workqueue_longest_running_processor_seconds > 60
```

### Throttling and API pressure

```promql
# Failing requests to the API server (any 4xx/5xx); throttling shows up as code="429"
sum by (code, host) (rate(rest_client_requests_total{code=~"4..|5.."}[5m]))
```

### Leader status (HA deployment)

```promql
# Should be exactly 1 across the replica set (Helm install enables --leader-elect by default)
sum(leader_election_master_status{name="d4ceba06.clickhouse.com"})
```

## Suggested alerts {#suggested-alerts}

Starting point for a PrometheusRule (tune thresholds for your environment):

```yaml
groups:
  - name: clickhouse-operator
    rules:
      - alert: ClickHouseOperatorReconcileErrors
        # > 0.1 errors/s sustained = > ~6 errors/min, filters transient conflicts.
        expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m])) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'ClickHouse operator is failing to reconcile {{ $labels.controller }}'

      - alert: ClickHouseOperatorWorkqueueBacklog
        # avg_over_time avoids alerting on transient bursts during large reconciles.
        expr: avg_over_time(workqueue_depth[10m]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Operator work queue backlog sustained for 30m'

      - alert: ClickHouseOperatorReconcileSlow
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket[10m]))
          ) > 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'p99 reconcile latency for {{ $labels.controller }} > 30s'

      - alert: ClickHouseOperatorNoLeader
        expr: absent(leader_election_master_status{name="d4ceba06.clickhouse.com"}) == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'No leader for the ClickHouse operator (HA deployment)'
```

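The last rule is only meaningful when leader election is enabled. Because `absent()` matches only when the series is missing altogether, it mainly catches the case where no manager pod is being scraped; to check that a replica actually holds the lease, use the `sum(leader_election_master_status{...})` query from the leader status section.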
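
The threshold arithmetic behind `ClickHouseOperatorReconcileErrors` can be checked by hand. The sketch below computes an average per-second rate from two counter samples; it is a simplification that ignores counter resets and the window-edge extrapolation performed by PromQL's `rate()`, and the sample values are hypothetical.

```python
def per_second_rate(first: tuple, last: tuple) -> float:
    """Average per-second increase between two (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


THRESHOLD = 0.1  # errors per second, as in ClickHouseOperatorReconcileErrors

# Hypothetical samples of controller_runtime_reconcile_errors_total, 5 minutes apart.
rate = per_second_rate((0, 120), (300, 156))
print(rate)              # 36 errors over 300 s
print(rate * 60)         # the same rate expressed per minute
print(rate > THRESHOLD)  # the alert expression matches when this is True
```

Here 36 errors in five minutes is roughly 7 per minute, just above the 6 per minute that the `0.1` threshold corresponds to; the alert still fires only if the condition holds for the full `for: 15m` period.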

## Verifying the setup {#verifying-the-setup}

A quick end-to-end check, assuming the chart was installed in `clickhouse-operator-system`:

```bash
NS=clickhouse-operator-system

# The metrics Service exists and selects the manager pod
kubectl -n $NS get svc -l control-plane=controller-manager

# The ServiceMonitor exists (only with prometheus.enable=true)
kubectl -n $NS get servicemonitor -l control-plane=controller-manager

# Manager pod is Ready (readiness probe answers)
kubectl -n $NS get pod -l control-plane=controller-manager

# Direct scrape from inside the cluster (with the metrics-reader binding)
kubectl -n $NS run curl-metrics --rm -it --restart=Never \
  --image=curlimages/curl:8.10.1 -- sh -c '
    TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
    curl -sk -H "Authorization: Bearer $TOKEN" \
      https://<release>-metrics-service.'$NS'.svc:8080/metrics \
      | head -20
  '
```

If the scrape returns metrics in the Prometheus exposition format, the endpoint and RBAC are correctly wired.

## Related guides {#related-guides}

- [Installation](/products/kubernetes-operator/install/helm) — Helm values relevant to monitoring.
- [Configuration](/products/kubernetes-operator/guides/configuration) — TLS configuration shared with the metrics server.