Commit graph

639 commits

Author SHA1 Message Date
238ef71630
fix(observability): 🐛 fix remote write URL and add manifests for benchmark + edp clients
- Fix broken remote write URL (o12y.observability./ → o12y.observability.buildth.ing/)
- Create manifests/ dirs with .gitkeep for benchmark.t09.de and edp.buildth.ing
- Copy forgejo-scrape.yaml VMServiceScrape manifest to both instances
2026-06-19 13:23:50 +02:00
076b2a16c9
fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM
- Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use
  type-only datasource so grafana-operator resolves default prometheus DS
- Replace grafanaCom id:14279 cronjob dashboard with inline custom version
  supporting cluster_environment variable (dev/edp/observability)
- Add new GARM runners dashboard (edp-garm) ready for when GARM metrics
  are scraped; uses or vector(0) guards so panels show 0 not empty

Note: cluster_environment values confirmed as dev/edp/observability (no benchmark).
GARM metrics not yet present in VictoriaMetrics (0 series found).
2026-06-19 13:11:42 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
949529eb5c
feat(observability): add cluster_environment dropdown to Forgejo and platform-overview dashboards
- Replace grafanaCom import (17802) with custom inline Forgejo dashboard
  containing cluster_environment query variable (refresh=2, label=Environment)
- Add label, refresh=2, sort=1 to platform-overview cluster_environment variable
- ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)
2026-06-19 12:50:32 +02:00
c2528f6f69
feat(observability): add platform grafana dashboard CRs
- Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802)
- Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993)
- Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279)
- Add platform-overview.yaml: custom EDP Platform Overview inline dashboard
  (platform health, forgejo stats, resource usage, backup status rows)
- Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698
2026-06-19 12:47:44 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
32e998df5b
fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200
Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when
rclone sync took >22m (vs 13-16s prior days). Likely triggered by
significant new data in OBS bucket. 2h window accommodates large
incremental syncs while BackupJobTooSlow alert still fires at 5m.
2026-06-19 12:35:41 +02:00
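The durations quoted in the message above can be sanity-checked with a few lines of Python. This is only an illustrative sketch: the helper name `fmt_seconds` is made up for this note and is not part of the repository.

```python
def fmt_seconds(seconds: int) -> str:
    """Render a duration in seconds as a short string such as '22.5m' or '2h'."""
    hours, rem = divmod(seconds, 3600)
    minutes = rem / 60
    if hours:
        # Drop the minutes part when the value is a whole number of hours.
        return f"{hours}h" if minutes == 0 else f"{hours}h{minutes:g}m"
    return f"{minutes:g}m"


old_deadline = 1350  # previous activeDeadlineSeconds from the commit subject
new_deadline = 7200  # new activeDeadlineSeconds from the commit subject

print(fmt_seconds(old_deadline))  # 22.5m
print(fmt_seconds(new_deadline))  # 2h
```

Both values agree with the "22.5m" and "2h" figures stated in the commit body.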
59eed97263
fix(observability-client): 🐛 fix remote write URL and add missing manifests dir
- Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing
- Create manifests/ directory with .gitkeep for ArgoCD source path
2026-06-19 11:41:26 +02:00
369961a940
fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev
- Enable VMAgent (was disabled → no metrics scraped)
- Remove disable_login from Grafana config; add security block so operator can auth via API
- Disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev)
2026-06-19 10:44:34 +02:00
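The trailing-dot hostname described in the two commits above (o12y.observability. where o12y.observability.buildth.ing was intended) is the kind of defect a small pre-merge check could catch. The sketch below is hypothetical: the function name `has_truncated_host` and the `https://` scheme in the sample URLs are assumptions, since the log shows only the host part.

```python
from urllib.parse import urlsplit


def has_truncated_host(url: str) -> bool:
    """Return True when the URL's host is empty or ends with a dot.

    A trailing dot is treated here as a sign that a domain suffix was
    dropped during templating, which is how the commits describe the bug.
    """
    host = urlsplit(url).hostname or ""
    return host == "" or host.endswith(".")


# Sample values; the scheme is assumed, the hosts come from the commit messages.
broken = "https://o12y.observability./"
fixed = "https://o12y.observability.buildth.ing/"

print(has_truncated_host(broken))  # True
print(has_truncated_host(fixed))   # False
```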
d83945413d
fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest
Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not
VLSingle. The VLSingle kind was introduced in a newer operator version
and is not registered in this chart release. Changing to VLogs which
has identical spec fields (retentionPeriod, removePvcAfterDelete,
storage, storageMetadata, resources all supported).
2026-06-19 10:20:19 +02:00
ef4a1d7ce2
fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator
Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD
sync. Dev cluster nodes are at 99% CPU / pod limit — hook pod cannot be
scheduled, blocking the entire sync indefinitely.

Disabling cleanup.enabled prevents the hook Job from being created.
CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.
2026-06-19 09:58:55 +02:00
29c0a59734
fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions
VLSingle CRD missing at sync time — ArgoCD pre-validates all resources
before applying any, causing 'synchronization tasks not valid' on CRs
whose CRDs are created by the operator in the same sync wave.
SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs,
unblocking the CRD bootstrap deadlock.
2026-06-19 09:56:24 +02:00
a52a6691a8
fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy
Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app.
Adds prune: true and RespectIgnoreDifferences=true to prevent sync
failures when CRs are applied before CRDs are established.
2026-06-19 09:52:01 +02:00
Martin McCaffery
9ed3ff50d2
bump(benchmark): ci-sizer-collector sidecar 0.9.0 → 0.9.7 to pick up host-resolved kernel_peak + cgroup_path_count diagnostic 2026-06-17 11:38:55 +02:00
57ee5afa62
feat(observability): add VMServiceScrapes + migrate VLogs → VLSingle
- Migrate VLogs CRD to VLSingle (operator.victoriametrics.com/v1beta1)
- Add VMServiceScrape for Forgejo (gitea ns, port http, /metrics)
- Add VMServiceScrape for ArgoCD (argocd ns, port http-metrics)
- Add VMServiceScrape for GARM (garm ns, port metrics)
- Add VMServiceScrape for CoreDNS (kube-system ns, k8s-app: kube-dns)

Ref: IPCEICIS-4618, IPCEICIS-5066
2026-06-15 21:05:22 +02:00
7949cabb29
fix(garm): ⬆️ update to v0.1.7-forgejo-24 (fresh multi-arch build)
Build completed successfully. Fixes exec format error from -23.
Dropped stale NOTE warning — image is clean amd64.
2026-06-12 13:42:23 +02:00
8939b4f32b
fix(secrets-backup): 🔄 sync simplified manifest from template
Remove client-side openssl encryption. OBS SSE-KMS handles encryption at rest.
Updated: no apk add openssl, no openssl enc step, no secrets-backup-config Secret,
upload .tar.gz directly. Image tag bumped to 1.0.1 (built without openssl).

Ref: IPCEICIS-9317
2026-06-12 13:12:20 +02:00
900c1f6c80
fix(dev): 🐛 revert automated-upload damage — restore working image pins + OIDC secrets
Automated upload (95deeef) overwrote 5 manually-pinned values:

- forgejo-server: restore workflow-webhook-20260305 (DB has v15a/v15b
  migrations; rolling back to 14.0.2-edp1-rootless WILL break the DB)
- garm: restore v0.1.7-forgejo-22 (v0.1.7-forgejo-23 has exec format
  error — wrong arch build, crashes on OTC CCE amd64 nodes)
- sizer-receiver/secret.yaml: re-add sizer-oidc-client secret (deleted
  by upload; causes OIDC auth failure on every sizer-receiver login)
- dex/manifests/dex-sizer-client.yaml: re-add (deleted by upload;
  dex cannot resolve sizer OIDC client without this secret)
- dex.yaml: restore manifests source block (removed by upload;
  without it ArgoCD never deploys the dex/manifests/ directory)

backup-alerts.yaml (new VMRule from automated upload) is kept as-is.
2026-06-12 10:11:00 +02:00
Automated pipeline
95deeef6a0
Automated upload for dev.t09.de 2026-06-12 07:46:00 +00:00
9bbcf4efca
fix(secrets-backup): 🐛 add openssl install + upgrade image to 1.32.0
alpine/k8s:1.28.0 does not ship openssl. Script calls openssl enc
on line 116 causing exit 127 on every run since initial deploy.

Fix:
- apk add --no-cache openssl at script start (defensive, idempotent)
- upgrade image 1.28.0 -> 1.32.0 (kubectl client was 5 minor versions
  behind cluster v1.33, outside supported skew of +/-1)
2026-06-12 09:32:48 +02:00
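The version-skew reasoning in the message above can be written as a short check. This is a sketch under stated assumptions: the helper names are invented for illustration, both versions are assumed to share the same major version, and the allowed skew of 1 is taken from the commit message.

```python
def minor_skew(client: str, server: str) -> int:
    """Return server minor minus client minor for strings like '1.28.0' or 'v1.33'."""

    def minor(version: str) -> int:
        # Strip an optional leading 'v', then take the second dotted component.
        return int(version.lstrip("v").split(".")[1])

    return minor(server) - minor(client)


def within_supported_skew(client: str, server: str, allowed: int = 1) -> bool:
    """True when the client is within +/- `allowed` minor versions of the server."""
    return abs(minor_skew(client, server)) <= allowed


print(minor_skew("1.28.0", "v1.33"))             # 5
print(within_supported_skew("1.28.0", "v1.33"))  # False
print(within_supported_skew("1.32.0", "v1.33"))  # True
```

With the values from the commit, the old image is 5 minor versions behind the cluster, while 1.32.0 falls inside the +/-1 window.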
cf8271fd86
revert(ci-sizer): 🔥 revert image pin — no versioned images in registry
GoReleaser config uses 'dockers_v2' (invalid key, should be 'dockers')
so versioned container images were never pushed. Only :latest exists.
Reverting to :latest until CI pipeline is fixed to publish version tags.

Refs: IPCEICIS-9326
2026-06-08 18:12:56 +02:00
f4aa470894
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.1 for dev
v0.8.2 does not exist — tags go v0.8.1 → v0.8.3.
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.1 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:08:04 +02:00
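The pin choice in the message above (fall back to the newest tag older than the breaking release, skipping tags that do not exist) can be sketched as follows. The function names are hypothetical, and the tag list holds only the two tags the commit message names.

```python
from typing import Optional


def parse(tag: str) -> tuple:
    """Turn 'v0.8.1' into (0, 8, 1) so tags compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def latest_below(available: list, breaking: str) -> Optional[str]:
    """Return the newest available tag strictly older than `breaking`, or None."""
    candidates = [tag for tag in available if parse(tag) < parse(breaking)]
    return max(candidates, key=parse) if candidates else None


# Tags as described in the commit message: v0.8.2 was never published.
available_tags = ["v0.8.1", "v0.8.3"]

print(latest_below(available_tags, "v0.8.3"))  # v0.8.1
```

Selecting from the tags that actually exist, rather than decrementing the patch number, avoids the non-existent v0.8.2 pin made in the earlier commit 3fdfda9da7.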
3fdfda9da7
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.2 for dev
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.2 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:06:00 +02:00
69839f767b
fix(ci-sizer): 🐛 set RECEIVER_ALLOWED_ORG=giteaAdmin for dev
Dev Forgejo repos live under giteaAdmin user, not DevFW org.
Prod will use DevFW-CICD (template default). Dev needs explicit override.
2026-06-08 18:00:47 +02:00
925c7416b3
fix(ci-sizer): 🐛 revert RECEIVER_ALLOWED_ORG to DevFW for dev env
Template default is DevFW-CICD (prod), but dev Forgejo uses DevFW org.
Hydration overwrote the correct value today.
2026-06-08 17:51:14 +02:00
bd82384eb1
fix(dex): 🔐 correct sizer client secret to match sizer-oidc-client
The deploy hydration created dex-sizer-client with wrong value.
Reverting to the original shared secret that sizer expects
(73eda906... - active for 81 days before hydration overwrote it).

Changes:
- sizer-oidc-client: restore correct shared secret
- dex-sizer-client: add managed manifest to prevent future drift
- dex.yaml: add manifests source for ArgoCD to sync the secret

Broken by stacks rehydration pipeline run.
2026-06-08 17:11:10 +02:00
967edf0382
fix(ci-sizer): 🔐 align OIDC client secret with dex config
Secret mismatch caused infinite login loop on sizer.dev.t09.de.
Added sizer-oidc-client secret manifest to GitOps so ArgoCD manages it.
Value now matches dex-runner-sizer-client (dex side).
2026-06-08 17:00:38 +02:00
Daniel.Sy
9a7544418c
fix(forgejo): 🐛 use workflow-webhook image matching DB migration level (v15a/v15b)
DB was migrated to v15 schema by this image in March.
The 14.0.2-edp1-rootless image cannot start against it.
Today's automated pipeline sync triggered pod restart, exposing the mismatch.
2026-06-08 14:11:31 +00:00
Daniel.Sy
a047be3aae
fix(garm): ⬇️ rollback to v0.1.7-forgejo-22 — -23 has exec format error (wrong arch) 2026-06-08 14:11:05 +00:00
Automated pipeline
422f568c8e
Automated upload for dev.t09.de 2026-06-08 12:15:27 +00:00
Martin McCaffery
011f436fb7
feat(benchmark.t09.de/garm): bump ci-sizer-collector 0.8.3 → 0.9.0 (kernel-peak + cgroup-v1 limit fallback) 2026-06-03 15:01:09 +01:00
Martin McCaffery
14873b7941
fix(garm): bump dev+benchmark to garm-helm v0.0.17 (template-robust readToken); drop now-redundant explicit fields on benchmark 2026-06-02 16:21:51 +01:00
Martin McCaffery
63cdb926b9
fix(sustainability-rules): remove Kepler energy rules since Kepler is incompatible 2026-06-02 16:12:22 +01:00
Martin McCaffery
f98f53a5a0
revert(kepler): remove Kepler, incompatible with OTC CCE proc mount restrictions 2026-06-02 16:12:06 +01:00
Martin McCaffery
608439697b
fix(benchmark.t09.de/garm): pin ci-sizer-collector to 0.8.3 (latest tagged release, avoid :latest drift during long runs) 2026-06-02 16:08:35 +01:00
Martin McCaffery
b5594a8017
feat(observability): add sustainability metrics, Kepler, 6-month retention, GARM scrape 2026-06-02 15:51:26 +01:00
Martin McCaffery
bbdca11f00
fix(benchmark.t09.de/garm): bump ci-sizer-collector to :latest (0.0.4 tag doesn't exist in registry, was unreachable until sizer integration was restored) 2026-06-02 15:42:10 +01:00
Martin McCaffery
3be56f5a07
fix(vm-client): add nodename-to-IP metricRelabelConfig for node-exporter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-02 14:58:36 +01:00
Martin McCaffery
e2469e7843
fix(benchmark.t09.de/garm): explicit sizer readToken mountPath/key/fileName (chart defaults not deep-merging, was rendering broken %!s(<nil>) path that crashed sizer consultation) 2026-06-02 14:38:41 +01:00
Martin McCaffery
b98486f445
fix: argocd metrics port name, coredns metrics via headless service 2026-06-02 12:13:38 +01:00
Martin McCaffery
eca54cb19c
fix(vm-client): use in-cluster VMSingle URL for remote write 2026-06-02 12:03:44 +01:00
Martin McCaffery
71a8fef501
fix(vm-client): create missing manifests directory 2026-06-02 11:59:42 +01:00
Martin McCaffery
e95fa403e9
fix(benchmark.t09.de/garm): wire sizer baseUrl + readToken so edge-connect-k8s provider actually applies sizer recommendations (was silently no-op) 2026-06-02 11:56:11 +01:00
Martin McCaffery
d0b0c85cf8
fix: add ServerSideApply for argocd CRDs, remove deprecated vector playground field 2026-06-02 09:57:05 +01:00
Martin McCaffery
07261b081e
upgrade victoria-metrics-k8s-stack 0.48.1 -> 0.81.0 with values migration 2026-06-02 09:51:49 +01:00
Martin McCaffery
07d08e5839
upgrade chart versions: argocd, dex, cloudnative-pg, cert-manager, ingress-nginx, vector, metrics-server 2026-06-02 09:50:04 +01:00
Martin McCaffery
342870fa03
fix(vm-client): add cluster external label for dashboard variable resolution 2026-06-02 09:30:24 +01:00
Martin McCaffery
da0ccbd1b5
fix(observability): enable ArgoCD/CoreDNS scraping, add cluster label, fix node dashboard 2026-06-01 16:47:31 +01:00
Martin McCaffery
3212016398
fix(vector): use in-cluster endpoint for VictoriaLogs log shipping 2026-06-01 16:47:24 +01:00