Commit graph

27 commits

Author SHA1 Message Date
b1a00d0395
fix(observability): 🐛 add missing simple-user-secret to hub observability stack
The hub's VMUser (vmauth.yaml) references simple-user-secret via passwordRef,
but the Secret was never added to the hub's manifests. Without this Secret,
the VM operator cannot reconcile the VMUser into the vmauth config, causing
ALL requests to fall through to the unauthorizedUser catch-all (vmsingle).

Result: Vector log shipping to VictoriaLogs was broken — vmauth routed
/insert/elasticsearch/_bulk to vmsingle instead of vlogs-victorialogs.
2026-06-19 15:28:14 +02:00
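
For readers unfamiliar with the failure mode described above, the relationship between the VMUser and its Secret can be sketched roughly as follows. Only the Secret name simple-user-secret, the passwordRef field, the /insert/elasticsearch path prefix and the vlogs-victorialogs target come from the commit message. The namespace, user name, key name, password value, port and path pattern are assumptions made for illustration and are not taken from the repository.

```yaml
# Hypothetical sketch, not the repository's actual manifests.
apiVersion: v1
kind: Secret
metadata:
  name: simple-user-secret          # name referenced by the VMUser's passwordRef
  namespace: observability          # assumed namespace
type: Opaque
stringData:
  password: change-me               # placeholder; the key name is an assumption
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMUser
metadata:
  name: simple-user                 # assumed name
  namespace: observability          # assumed namespace
spec:
  username: simple-user             # assumed username
  passwordRef:
    name: simple-user-secret        # must match an existing Secret in the same namespace
    key: password                   # must match a key inside that Secret
  targetRefs:
    - static:
        url: http://vlogs-victorialogs:9428   # assumed service URL and port
      paths:
        - /insert/elasticsearch/.*            # assumed pattern covering /insert/elasticsearch/_bulk
```

If the Secret is absent, the operator has nothing to resolve passwordRef against, which matches the symptom in the message: the user never reaches the generated vmauth config and requests fall through to the unauthorizedUser route.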
4591ee7b14
feat(observability): 🗂️ organize dashboards into Grafana folders
Assigns folder field to all GrafanaDashboard CRs:
- EDP / Overview: platform-overview
- EDP / Applications: forgejo, argocd-operational, garm, argocd
- EDP / Operations: cronjob-monitoring, ingress-nginx, victoria-logs
2026-06-19 14:46:41 +02:00
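
A minimal sketch of what assigning the folder field on one of these CRs might look like, assuming the grafana-operator v1beta1 API. The dashboard name and folder title come from the commit message; the instanceSelector label and the inline JSON body are placeholders, not the repository's content.

```yaml
# Hypothetical sketch of a GrafanaDashboard CR with a folder assignment.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: platform-overview
spec:
  folder: "EDP / Overview"          # folder title from the commit message
  instanceSelector:
    matchLabels:
      dashboards: grafana           # assumed label on the Grafana instance
  json: |
    {"title": "EDP Platform Overview", "panels": []}
```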
b6fbd3f6eb
feat(observability): add VictoriaLogs log panels to platform, forgejo, argocd dashboards 2026-06-19 13:34:12 +02:00
076b2a16c9
fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM
- Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use
  type-only datasource so grafana-operator resolves default prometheus DS
- Replace grafanaCom id:14279 cronjob dashboard with inline custom version
  supporting cluster_environment variable (dev/edp/observability)
- Add new GARM runners dashboard (edp-garm) ready for when GARM metrics
  are scraped; uses or vector(0) guards so panels show 0 not empty

Note: cluster_environment values confirmed as dev/edp/observability (no benchmark).
GARM metrics not yet present in VictoriaMetrics (0 series found).
2026-06-19 13:11:42 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
949529eb5c
feat(observability): add cluster_environment dropdown to Forgejo and platform-overview dashboards
- Replace grafanaCom import (17802) with custom inline Forgejo dashboard
  containing cluster_environment query variable (refresh=2, label=Environment)
- Add label, refresh=2, sort=1 to platform-overview cluster_environment variable
- ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)
2026-06-19 12:50:32 +02:00
c2528f6f69
feat(observability): add platform grafana dashboard CRs
- Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802)
- Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993)
- Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279)
- Add platform-overview.yaml: custom EDP Platform Overview inline dashboard
  (platform health, forgejo stats, resource usage, backup status rows)
- Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698
2026-06-19 12:47:44 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
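
The two fixes above can be sketched as values fragments. This is an illustration under assumptions, not the actual diff: the group names, the create key and the cluster_environment=dev label come from the commit message, while the surrounding key paths assume the layout of the victoria-metrics-k8s-stack chart and may be nested differently in this repository's values files.

```yaml
# Hub values (sketch): rule groups are toggled with `create`, not `enabled`.
defaultRules:
  groups:
    kubernetesSystemControllerManager:
      create: false                 # previously written as `enabled: false`
    kubeScheduler:
      create: false
    kubernetesSystemScheduler:
      create: false
---
# Dev values (sketch): give the local vmagent the same label that the
# vm-client-stack vmagent adds when shipping to the hub.
vmagent:
  spec:
    externalLabels:
      cluster_environment: dev      # key path under vmagent is assumed
```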
Martin McCaffery
63cdb926b9
fix(sustainability-rules): remove Kepler energy rules since Kepler is incompatible 2026-06-02 16:12:22 +01:00
Martin McCaffery
f98f53a5a0
revert(kepler): remove Kepler, incompatible with OTC CCE proc mount restrictions 2026-06-02 16:12:06 +01:00
Martin McCaffery
b5594a8017
feat(observability): add sustainability metrics, Kepler, 6-month retention, GARM scrape 2026-06-02 15:51:26 +01:00
Martin McCaffery
b98486f445
fix: argocd metrics port name, coredns metrics via headless service 2026-06-02 12:13:38 +01:00
Martin McCaffery
07261b081e
upgrade victoria-metrics-k8s-stack 0.48.1 -> 0.81.0 with values migration 2026-06-02 09:51:49 +01:00
Martin McCaffery
da0ccbd1b5
fix(observability): enable ArgoCD/CoreDNS scraping, add cluster label, fix node dashboard 2026-06-01 16:47:31 +01:00
Martin McCaffery
e89d48c2a5
Upgrade Grafana to 12.4.0 and add auth.jwt config for useKubeAuth 2026-06-01 13:16:37 +01:00
Martin McCaffery
3b31475552
Fix grafana-operator chart version tag (no v prefix) 2026-06-01 13:02:49 +01:00
Martin McCaffery
1686764b39
Upgrade grafana-operator to v5.23.0 and enable useKubeAuth 2026-06-01 12:58:14 +01:00
Automated pipeline
464a9eb22e
Automated upload for observability.buildth.ing 2026-03-04 09:55:46 +00:00
Manuel Ganter
c824cd32ed
disabled scrape for kubescheduler 2025-10-22 16:20:27 +02:00
richardrobertreitz
218a1cbff8
chore(alerts): disabled bogus alerts related to kubecontrollermanager and kubescheduler 2025-10-21 08:48:57 +00:00
1696a6f42c
Update otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/values.yaml 2025-10-14 11:41:30 +00:00
Automated pipeline
2820a37e00
Automated upload for observability.buildth.ing 2025-08-12 12:40:19 +00:00
Automated pipeline
bf54e7fe38
Automated upload for observability.buildth.ing 2025-08-12 08:31:20 +00:00
Automated pipeline
625f2e0005
Initial upload 2025-07-21 12:52:28 +00:00
Automated pipeline
fe696adecc
Initial upload 2025-07-21 08:08:22 +00:00
Automated pipeline
fdeb8363b6
Initial upload 2025-06-30 08:02:54 +00:00