43 commits

Author SHA1 Message Date
b6fbd3f6eb
feat(observability): add VictoriaLogs log panels to platform, forgejo, argocd dashboards 2026-06-19 13:34:12 +02:00
076b2a16c9
fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM
- Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use
  type-only datasource so grafana-operator resolves default prometheus DS
- Replace grafanaCom id:14279 cronjob dashboard with inline custom version
  supporting cluster_environment variable (dev/edp/observability)
- Add new GARM runners dashboard (edp-garm) ready for when GARM metrics
  are scraped; uses or vector(0) guards so panels show 0 not empty

Note: cluster_environment values confirmed as dev/edp/observability (no benchmark).
GARM metrics not yet present in VictoriaMetrics (0 series found).
2026-06-19 13:11:42 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
949529eb5c
feat(observability): add cluster_environment dropdown to Forgejo and platform-overview dashboards
- Replace grafanaCom import (17802) with custom inline Forgejo dashboard
  containing cluster_environment query variable (refresh=2, label=Environment)
- Add label, refresh=2, sort=1 to platform-overview cluster_environment variable
- ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)
2026-06-19 12:50:32 +02:00
c2528f6f69
feat(observability): add platform grafana dashboard CRs
- Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802)
- Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993)
- Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279)
- Add platform-overview.yaml: custom EDP Platform Overview inline dashboard
  (platform health, forgejo stats, resource usage, backup status rows)
- Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698
2026-06-19 12:47:44 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
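
To illustrate the key change described in commit 0316eefa43, the values fragment below sketches how the three rule groups might be disabled. The group names and the `create` key come from the commit message. The `defaultRules.groups` nesting is an assumption about the chart's values layout and should be checked against the chart version in use.

```yaml
# Sketch only: the nesting under defaultRules.groups is assumed, not confirmed by the commit.
defaultRules:
  groups:
    kubernetesSystemControllerManager:
      create: false   # the commit reports that 'enabled: false' was the wrong key
    kubeScheduler:
      create: false
    kubernetesSystemScheduler:
      create: false
```

With these groups not created, KubeControllerManagerDown and KubeSchedulerDown should no longer be generated for a managed control plane that cannot be scraped.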
32e998df5b
fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200
Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when
rclone sync took >22m (vs 13-16s prior days). Likely triggered by
significant new data in OBS bucket. 2h window accommodates large
incremental syncs while BackupJobTooSlow alert still fires at 5m.
2026-06-19 12:35:41 +02:00
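
The deadline change in commit 32e998df5b could look roughly like the CronJob fragment below. Only the `activeDeadlineSeconds` values (1350 and 7200) come from the commit message. The resource name, schedule, image, and arguments are hypothetical placeholders, and the fragment assumes the deadline is set at the Job level rather than on the pod.

```yaml
# Sketch only: name, schedule, image and args are placeholders, not taken from the repository.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-backup                    # hypothetical name
spec:
  schedule: "0 1 * * *"              # hypothetical schedule
  jobTemplate:
    spec:
      activeDeadlineSeconds: 7200    # 2 h; previously 1350 (22.5 min)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rclone           # hypothetical container name
              image: rclone/rclone   # hypothetical image reference
              args: ["sync", "source:bucket", "dest:bucket"]  # hypothetical remotes
```

A Job-level deadline terminates the Job's running pods once the limit is exceeded and marks the Job failed with reason DeadlineExceeded, which matches the failure the commit describes.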
Martin McCaffery
63cdb926b9
fix(sustainability-rules): remove Kepler energy rules since Kepler is incompatible 2026-06-02 16:12:22 +01:00
Martin McCaffery
f98f53a5a0
revert(kepler): remove Kepler, incompatible with OTC CCE proc mount restrictions 2026-06-02 16:12:06 +01:00
Martin McCaffery
b5594a8017
feat(observability): add sustainability metrics, Kepler, 6-month retention, GARM scrape 2026-06-02 15:51:26 +01:00
Martin McCaffery
3be56f5a07
fix(vm-client): add nodename-to-IP metricRelabelConfig for node-exporter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 14:58:36 +01:00
Martin McCaffery
b98486f445
fix: argocd metrics port name, coredns metrics via headless service 2026-06-02 12:13:38 +01:00
Martin McCaffery
eca54cb19c
fix(vm-client): use in-cluster VMSingle URL for remote write 2026-06-02 12:03:44 +01:00
Martin McCaffery
71a8fef501
fix(vm-client): create missing manifests directory 2026-06-02 11:59:42 +01:00
Martin McCaffery
d0b0c85cf8
fix: add ServerSideApply for argocd CRDs, remove deprecated vector playground field 2026-06-02 09:57:05 +01:00
Martin McCaffery
07261b081e
upgrade victoria-metrics-k8s-stack 0.48.1 -> 0.81.0 with values migration 2026-06-02 09:51:49 +01:00
Martin McCaffery
07d08e5839
upgrade chart versions: argocd, dex, cloudnative-pg, cert-manager, ingress-nginx, vector, metrics-server 2026-06-02 09:50:04 +01:00
Martin McCaffery
342870fa03
fix(vm-client): add cluster external label for dashboard variable resolution 2026-06-02 09:30:24 +01:00
Martin McCaffery
da0ccbd1b5
fix(observability): enable ArgoCD/CoreDNS scraping, add cluster label, fix node dashboard 2026-06-01 16:47:31 +01:00
Martin McCaffery
3212016398
fix(vector): use in-cluster endpoint for VictoriaLogs log shipping 2026-06-01 16:47:24 +01:00
Martin McCaffery
e89d48c2a5
Upgrade Grafana to 12.4.0 and add auth.jwt config for useKubeAuth 2026-06-01 13:16:37 +01:00
Martin McCaffery
3b31475552
Fix grafana-operator chart version tag (no v prefix) 2026-06-01 13:02:49 +01:00
Martin McCaffery
1686764b39
Upgrade grafana-operator to v5.23.0 and enable useKubeAuth 2026-06-01 12:58:14 +01:00
d4b54c854f
fix: increased pvc size due to out of disk space error 2026-05-11 10:56:01 +02:00
Automated pipeline
464a9eb22e Automated upload for observability.buildth.ing 2026-03-04 09:55:46 +00:00
Manuel Ganter
cc1de3d920
removed scrape from vm-client-stack 2025-10-24 14:07:35 +02:00
Manuel Ganter
c824cd32ed
disabled scrape for kubescheduler 2025-10-22 16:20:27 +02:00
richardrobertreitz
218a1cbff8 chore(alerts): disabled bogus alerts related to kubecontrollermanager and kubescheduler 2025-10-21 08:48:57 +00:00
1696a6f42c Update otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/values.yaml 2025-10-14 11:41:30 +00:00
9613a24413 Update otc/observability.buildth.ing/stacks/forgejo/forgejo-server/values.yaml 2025-10-02 09:56:43 +00:00
d759216360 revert e2386b2b78
revert Update otc/observability.buildth.ing/stacks/forgejo/forgejo-server/values.yaml
2025-10-02 08:59:46 +00:00
e2386b2b78 Update otc/observability.buildth.ing/stacks/forgejo/forgejo-server/values.yaml 2025-10-02 08:53:18 +00:00
ba218b905a revert 977b8b5223
revert Update otc/observability.buildth.ing/stacks/forgejo/forgejo-server/values.yaml
2025-10-02 08:41:34 +00:00
977b8b5223 Update otc/observability.buildth.ing/stacks/forgejo/forgejo-server/values.yaml 2025-10-02 08:20:01 +00:00
Automated pipeline
2820a37e00 Automated upload for observability.buildth.ing 2025-08-12 12:40:19 +00:00
Automated pipeline
bf54e7fe38 Automated upload for observability.buildth.ing 2025-08-12 08:31:20 +00:00
Automated pipeline
7d2c2a7efb Automated upload for observability.buildth.ing 2025-08-01 09:28:22 +00:00
4015e9eec0
fix(helm): 🔧 Update image fullOverride paths in values.yaml
Corrects the image `fullOverride` paths across multiple `values.yaml` files to ensure they point to the correct repositories.

This change addresses potential issues with image retrieval in the deployment process by updating outdated paths to the new ones.

No functional changes were introduced beyond the path updates.
2025-07-28 16:02:17 +02:00
Richard Robert Reitz
d2b997d31e chore(pipeline): Pulling forgejo-runner catthehacker image from ghcr.io 2025-07-28 15:10:42 +02:00
Automated pipeline
625f2e0005 Initial upload 2025-07-21 12:52:28 +00:00
Automated pipeline
fe696adecc Initial upload 2025-07-21 08:08:22 +00:00
Automated pipeline
fdeb8363b6 Initial upload 2025-06-30 08:02:54 +00:00