Commit graph

604 commits

Author SHA1 Message Date
b1a00d0395
fix(observability): 🐛 add missing simple-user-secret to hub observability stack
The hub's VMUser (vmauth.yaml) references simple-user-secret via passwordRef,
but the Secret was never added to the hub's manifests. Without this Secret,
the VM operator cannot reconcile the VMUser into the vmauth config, causing
ALL requests to fall through to the unauthorizedUser catch-all (vmsingle).

Result: Vector log shipping to VictoriaLogs was broken — vmauth routed
/insert/elasticsearch/_bulk to vmsingle instead of vlogs-victorialogs.
2026-06-19 15:28:14 +02:00
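For orientation, the relationship this fix restores can be sketched as below. This is a minimal illustration, not the repository's actual manifests: the namespace, the Secret key name, the username, the path pattern, and the target port are all assumptions; only the Secret name, the passwordRef mechanism, and the vlogs-victorialogs service name come from the commit message.

```yaml
# Hypothetical sketch: the Secret that the VMUser's passwordRef points at.
apiVersion: v1
kind: Secret
metadata:
  name: simple-user-secret        # name taken from the commit message
  namespace: observability        # assumed namespace
type: Opaque
stringData:
  password: change-me             # assumed key name, placeholder value
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMUser
metadata:
  name: simple-user               # assumed name
  namespace: observability        # assumed namespace
spec:
  username: simple-user           # assumed username
  passwordRef:
    name: simple-user-secret      # must match an existing Secret, or the
    key: password                 # operator cannot reconcile this user
  targetRefs:
    - static:
        url: http://vlogs-victorialogs:9428   # assumed port
      paths:
        - /insert/elasticsearch/.*            # assumed path pattern
```

If the Secret is absent, the user never reaches the generated vmauth config, which matches the symptom described above: requests fall through to the unauthorizedUser route.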
4591ee7b14
feat(observability): 🗂️ organize dashboards into Grafana folders
Assigns folder field to all GrafanaDashboard CRs:
- EDP / Overview: platform-overview
- EDP / Applications: forgejo, argocd-operational, garm, argocd
- EDP / Operations: cronjob-monitoring, ingress-nginx, victoria-logs
2026-06-19 14:46:41 +02:00
7f5c680e19
fix(observability): 🐛 enable GARM unauthenticated metrics + ArgoCD metrics on all instances
- GARM dev.t09.de: set garm.metrics.disableAuth=true to unblock Prometheus scraping (was 401)
- ArgoCD dev.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD edp.buildth.ing: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD benchmark.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- observability.buildth.ing already had metrics enabled (no change needed)
2026-06-19 13:36:26 +02:00
b6fbd3f6eb
feat(observability): add VictoriaLogs log panels to platform, forgejo, argocd dashboards 2026-06-19 13:34:12 +02:00
bcf583a055
fix(observability): 🐛 fix Vector log shipping URL on all clusters
Restores missing '.buildth.ing' domain segment in Vector elasticsearch
endpoint for benchmark, dev, and edp instances.

Template source uses {{{ .Env.DOMAIN_O12Y }}} (correct); instances
were mis-hydrated, omitting the '.buildth.ing' suffix.
2026-06-19 13:32:23 +02:00
238ef71630
fix(observability): 🐛 fix remote write URL and add manifests for benchmark + edp clients
- Fix broken remote write URL (o12y.observability./ → o12y.observability.buildth.ing/)
- Create manifests/ dirs with .gitkeep for benchmark.t09.de and edp.buildth.ing
- Copy forgejo-scrape.yaml VMServiceScrape manifest to both instances
2026-06-19 13:23:50 +02:00
076b2a16c9
fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM
- Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use
  type-only datasource so grafana-operator resolves default prometheus DS
- Replace grafanaCom id:14279 cronjob dashboard with inline custom version
  supporting cluster_environment variable (dev/edp/observability)
- Add new GARM runners dashboard (edp-garm) ready for when GARM metrics
  are scraped; uses 'or vector(0)' guards so panels show 0 instead of no data

Note: cluster_environment values confirmed as dev/edp/observability (no benchmark).
GARM metrics not yet present in VictoriaMetrics (0 series found).
2026-06-19 13:11:42 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
949529eb5c
feat(observability): add cluster_environment dropdown to Forgejo and platform-overview dashboards
- Replace grafanaCom import (17802) with custom inline Forgejo dashboard
  containing cluster_environment query variable (refresh=2, label=Environment)
- Add label, refresh=2, sort=1 to platform-overview cluster_environment variable
- ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)
2026-06-19 12:50:32 +02:00
c2528f6f69
feat(observability): add platform grafana dashboard CRs
- Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802)
- Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993)
- Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279)
- Add platform-overview.yaml: custom EDP Platform Overview inline dashboard
  (platform health, forgejo stats, resource usage, backup status rows)
- Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698
2026-06-19 12:47:44 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used the wrong key 'enabled: false'; the chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
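The two changes in this commit might look roughly like the following Helm values. This is a sketch under assumptions: the group names, the 'create' key, and the cluster_environment=dev label come from the commit message, while the exact nesting (for example under a subchart key) and the vmagent.spec path are assumed from the usual victoria-metrics-k8s-stack values layout.

```yaml
# Hub values (sketch): disable rule groups that target the control plane,
# which OTC CCE does not expose for scraping.
defaultRules:
  groups:
    kubernetesSystemControllerManager:
      create: false               # 'enabled: false' is ignored by the chart
    kubeScheduler:
      create: false
    kubernetesSystemScheduler:
      create: false
---
# Dev values (sketch): label locally evaluated metrics the same way the
# vm-client-stack vmagent labels metrics shipped to the hub.
vmagent:
  spec:
    externalLabels:
      cluster_environment: dev
```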
32e998df5b
fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200
Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when
rclone sync took >22m (vs 13-16s prior days). Likely triggered by
significant new data in OBS bucket. 2h window accommodates large
incremental syncs while BackupJobTooSlow alert still fires at 5m.
2026-06-19 12:35:41 +02:00
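As an illustration of where this setting lives, a CronJob with the new deadline could be sketched as follows. It assumes the deadline is set on the Job template rather than on the pod spec; the namespace, schedule, image, and rclone arguments are hypothetical placeholders, not values from the repository.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-backup                 # name taken from the commit subject
  namespace: gitea                # assumed namespace
spec:
  schedule: "0 1 * * *"           # hypothetical schedule
  jobTemplate:
    spec:
      # Was 1350 (22.5 minutes). 7200 seconds is 2 hours; once exceeded,
      # the Job is terminated with reason DeadlineExceeded.
      activeDeadlineSeconds: 7200
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rclone
              image: rclone/rclone                       # placeholder image
              args: ["sync", "src:bucket", "dst:bucket"] # placeholder remotes
```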
59eed97263
fix(observability-client): 🐛 fix remote write URL and add missing manifests dir
- Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing
- Create manifests/ directory with .gitkeep for ArgoCD source path
2026-06-19 11:41:26 +02:00
369961a940
fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev
- Enable VMAgent (was disabled → no metrics scraped)
- Remove disable_login from Grafana config; add security block so operator can auth via API
- Disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev)
2026-06-19 10:44:34 +02:00
d83945413d
fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest
Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not
VLSingle. The VLSingle kind was introduced in a newer operator version
and is not registered in this chart release. Change the kind to VLogs, which
has identical spec fields (retentionPeriod, removePvcAfterDelete,
storage, storageMetadata, and resources are all supported).
2026-06-19 10:20:19 +02:00
ef4a1d7ce2
fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator
Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD
sync. Dev cluster nodes are at 99% CPU / pod limit — hook pod cannot be
scheduled, blocking the entire sync indefinitely.

Disabling cleanup.enabled prevents the hook Job from being created.
CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.
2026-06-19 09:58:55 +02:00
29c0a59734
fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions
VLSingle CRD missing at sync time — ArgoCD pre-validates all resources
before applying any, causing 'synchronization tasks not valid' on CRs
whose CRDs are created by the operator in the same sync wave.
SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs,
unblocking the CRD bootstrap deadlock.
2026-06-19 09:56:24 +02:00
a52a6691a8
fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy
Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app.
Adds prune: true and RespectIgnoreDifferences=true to prevent sync
failures when CRs are applied before CRDs are established.
2026-06-19 09:52:01 +02:00
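Taken together, the sync settings named in this commit and in 29c0a59734 could be expressed on an ArgoCD Application roughly as below. This is a sketch, not the repository's manifest: the Application name, project, source, and destination are hypothetical placeholders, and placing prune under 'automated' is an assumption.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: victoria-metrics-k8s-stack   # assumed name
  namespace: argocd
spec:
  project: default                   # placeholder
  source:
    repoURL: https://example.invalid/stacks.git   # placeholder
    path: stacks/observability                    # placeholder
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: observability         # assumed namespace
  syncPolicy:
    automated:
      prune: true                    # remove resources deleted from Git
    syncOptions:
      - RespectIgnoreDifferences=true
      # Skip the dry-run for custom resources whose CRD is not yet
      # installed, so CRs and the operator that creates their CRDs can
      # live in the same sync.
      - SkipDryRunOnMissingResource=true
```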
Martin McCaffery
9ed3ff50d2
bump(benchmark): ci-sizer-collector sidecar 0.9.0 → 0.9.7 to pick up host-resolved kernel_peak + cgroup_path_count diagnostic 2026-06-17 11:38:55 +02:00
57ee5afa62
feat(observability): add VMServiceScrapes + migrate VLogs → VLSingle
- Migrate VLogs CRD to VLSingle (operator.victoriametrics.com/v1beta1)
- Add VMServiceScrape for Forgejo (gitea ns, port http, /metrics)
- Add VMServiceScrape for ArgoCD (argocd ns, port http-metrics)
- Add VMServiceScrape for GARM (garm ns, port metrics)
- Add VMServiceScrape for CoreDNS (kube-system ns, k8s-app: kube-dns)

Ref: IPCEICIS-4618, IPCEICIS-5066
2026-06-15 21:05:22 +02:00
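A VMServiceScrape of the kind listed above for Forgejo might look like this. The target namespace, port name, and path come from the commit message; the object's own name and namespace and the selector labels are assumptions for illustration.

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: forgejo                   # assumed name
  namespace: observability        # assumed namespace
spec:
  namespaceSelector:
    matchNames:
      - gitea                     # Forgejo runs in the gitea namespace
  selector:
    matchLabels:
      app.kubernetes.io/name: forgejo   # hypothetical Service label
  endpoints:
    - port: http                  # named Service port
      path: /metrics
```

The ArgoCD, GARM, and CoreDNS scrapes would follow the same shape with their own namespace, selector, and port name.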
7949cabb29
fix(garm): ⬆️ update to v0.1.7-forgejo-24 (fresh multi-arch build)
Build completed successfully. Fixes exec format error from -23.
Dropped stale NOTE warning — image is clean amd64.
2026-06-12 13:42:23 +02:00
8939b4f32b
fix(secrets-backup): 🔄 sync simplified manifest from template
Remove client-side openssl encryption. OBS SSE-KMS handles encryption at rest.
Updated: no apk add openssl, no openssl enc step, no secrets-backup-config Secret,
upload .tar.gz directly. Image tag bumped to 1.0.1 (built without openssl).

Ref: IPCEICIS-9317
2026-06-12 13:12:20 +02:00
900c1f6c80
fix(dev): 🐛 revert automated-upload damage — restore working image pins + OIDC secrets
Automated upload (95deeef) overwrote 5 manually-pinned values:

- forgejo-server: restore workflow-webhook-20260305 (DB has v15a/v15b
  migrations; rolling back to 14.0.2-edp1-rootless WILL break the DB)
- garm: restore v0.1.7-forgejo-22 (v0.1.7-forgejo-23 has exec format
  error — wrong arch build, crashes on OTC CCE amd64 nodes)
- sizer-receiver/secret.yaml: re-add sizer-oidc-client secret (deleted
  by upload; causes OIDC auth failure on every sizer-receiver login)
- dex/manifests/dex-sizer-client.yaml: re-add (deleted by upload;
  dex cannot resolve sizer OIDC client without this secret)
- dex.yaml: restore manifests source block (removed by upload;
  without it ArgoCD never deploys the dex/manifests/ directory)

backup-alerts.yaml (new VMRule from automated upload) is kept as-is.
2026-06-12 10:11:00 +02:00
Automated pipeline
95deeef6a0
Automated upload for dev.t09.de 2026-06-12 07:46:00 +00:00
9bbcf4efca
fix(secrets-backup): 🐛 add openssl install + upgrade image to 1.32.0
alpine/k8s:1.28.0 does not ship openssl. Script calls openssl enc
on line 116 causing exit 127 on every run since initial deploy.

Fix:
- apk add --no-cache openssl at script start (defensive, idempotent)
- upgrade image 1.28.0 -> 1.32.0 (kubectl client was 5 minor versions
  behind cluster v1.33, outside supported skew of +/-1)
2026-06-12 09:32:48 +02:00
cf8271fd86
revert(ci-sizer): 🔥 revert image pin — no versioned images in registry
GoReleaser config uses 'dockers_v2' (invalid key, should be 'dockers')
so versioned container images were never pushed. Only :latest exists.
Reverting to :latest until CI pipeline is fixed to publish version tags.

Refs: IPCEICIS-9326
2026-06-08 18:12:56 +02:00
f4aa470894
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.1 for dev
v0.8.2 does not exist — tags go v0.8.1 → v0.8.3.
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.1 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:08:04 +02:00
3fdfda9da7
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.2 for dev
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.2 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:06:00 +02:00
69839f767b
fix(ci-sizer): 🐛 set RECEIVER_ALLOWED_ORG=giteaAdmin for dev
Dev Forgejo repos live under giteaAdmin user, not DevFW org.
Prod will use DevFW-CICD (template default). Dev needs explicit override.
2026-06-08 18:00:47 +02:00
925c7416b3
fix(ci-sizer): 🐛 revert RECEIVER_ALLOWED_ORG to DevFW for dev env
Template default is DevFW-CICD (prod), but dev Forgejo uses DevFW org.
Hydration overwrote the correct value today.
2026-06-08 17:51:14 +02:00
bd82384eb1
fix(dex): 🔐 correct sizer client secret to match sizer-oidc-client
The deploy hydration created dex-sizer-client with wrong value.
Reverting to the original shared secret that sizer expects
(73eda906... - active for 81 days before hydration overwrote it).

Changes:
- sizer-oidc-client: restore correct shared secret
- dex-sizer-client: add managed manifest to prevent future drift
- dex.yaml: add manifests source for ArgoCD to sync the secret

Broken by stacks rehydration pipeline run.
2026-06-08 17:11:10 +02:00
967edf0382
fix(ci-sizer): 🔐 align OIDC client secret with dex config
Secret mismatch caused infinite login loop on sizer.dev.t09.de.
Added sizer-oidc-client secret manifest to GitOps so ArgoCD manages it.
Value now matches dex-runner-sizer-client (dex side).
2026-06-08 17:00:38 +02:00
Daniel.Sy
9a7544418c
fix(forgejo): 🐛 use workflow-webhook image matching DB migration level (v15a/v15b)
DB was migrated to v15 schema by this image in March.
The 14.0.2-edp1-rootless image cannot start against it.
Today's automated pipeline sync triggered pod restart, exposing the mismatch.
2026-06-08 14:11:31 +00:00
Daniel.Sy
a047be3aae
fix(garm): ⬇️ rollback to v0.1.7-forgejo-22 — -23 has exec format error (wrong arch) 2026-06-08 14:11:05 +00:00
Automated pipeline
422f568c8e
Automated upload for dev.t09.de 2026-06-08 12:15:27 +00:00
Martin McCaffery
011f436fb7
feat(benchmark.t09.de/garm): bump ci-sizer-collector 0.8.3 → 0.9.0 (kernel-peak + cgroup-v1 limit fallback) 2026-06-03 15:01:09 +01:00
Martin McCaffery
14873b7941
fix(garm): bump dev+benchmark to garm-helm v0.0.17 (template-robust readToken); drop now-redundant explicit fields on benchmark 2026-06-02 16:21:51 +01:00
Martin McCaffery
63cdb926b9
fix(sustainability-rules): remove Kepler energy rules since Kepler is incompatible 2026-06-02 16:12:22 +01:00
Martin McCaffery
f98f53a5a0
revert(kepler): remove Kepler, incompatible with OTC CCE proc mount restrictions 2026-06-02 16:12:06 +01:00
Martin McCaffery
608439697b
fix(benchmark.t09.de/garm): pin ci-sizer-collector to 0.8.3 (latest tagged release, avoid :latest drift during long runs) 2026-06-02 16:08:35 +01:00
Martin McCaffery
b5594a8017
feat(observability): add sustainability metrics, Kepler, 6-month retention, GARM scrape 2026-06-02 15:51:26 +01:00
Martin McCaffery
bbdca11f00
fix(benchmark.t09.de/garm): bump ci-sizer-collector to :latest (0.0.4 tag doesn't exist in registry, was unreachable until sizer integration was restored) 2026-06-02 15:42:10 +01:00
Martin McCaffery
3be56f5a07
fix(vm-client): add nodename-to-IP metricRelabelConfig for node-exporter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 14:58:36 +01:00
Martin McCaffery
e2469e7843
fix(benchmark.t09.de/garm): explicit sizer readToken mountPath/key/fileName (chart defaults not deep-merging, was rendering broken %!s(<nil>) path that crashed sizer consultation) 2026-06-02 14:38:41 +01:00
Martin McCaffery
b98486f445
fix: argocd metrics port name, coredns metrics via headless service 2026-06-02 12:13:38 +01:00
Martin McCaffery
eca54cb19c
fix(vm-client): use in-cluster VMSingle URL for remote write 2026-06-02 12:03:44 +01:00
Martin McCaffery
71a8fef501
fix(vm-client): create missing manifests directory 2026-06-02 11:59:42 +01:00
Martin McCaffery
e95fa403e9
fix(benchmark.t09.de/garm): wire sizer baseUrl + readToken so edge-connect-k8s provider actually applies sizer recommendations (was silently no-op) 2026-06-02 11:56:11 +01:00
Martin McCaffery
d0b0c85cf8
fix: add ServerSideApply for argocd CRDs, remove deprecated vector playground field 2026-06-02 09:57:05 +01:00