stacks-instances

Author	SHA1	Message	Date
Daniel Sy	0a249820de	fix(observability): 🐛 fix ArgoCD scrape port name http-metrics not metrics	2026-06-19 16:11:15 +02:00
Daniel Sy	f3931dc550	fix(observability): 🐛 add ArgoCD + GARM VMServiceScrapes to dev client stack	2026-06-19 16:07:27 +02:00
Daniel Sy	8488de0c6f	fix(observability): 🐛 use plaintext password in hub VMUser to unblock operator reconciliation. The hub VMUser was using passwordRef pointing to simple-user-secret, but that Secret was not present in the cluster (only exists in git now via the previous commit). VM operator skips VMUser reconciliation when passwordRef cannot resolve, leaving vmauth with only the unauthorizedUser catch-all (vmsingle). Switching to inline password ensures immediate operator reconciliation without waiting for Secret deployment. The simple-user-secret.yaml manifest is kept for Vector's credential reference.	2026-06-19 15:45:55 +02:00
Daniel Sy	b1a00d0395	fix(observability): 🐛 add missing simple-user-secret to hub observability stack. The hub's VMUser (vmauth.yaml) references simple-user-secret via passwordRef, but the Secret was never added to the hub's manifests. Without this Secret, the VM operator cannot reconcile the VMUser into the vmauth config, causing ALL requests to fall through to the unauthorizedUser catch-all (vmsingle). Result: Vector log shipping to VictoriaLogs was broken; vmauth routed /insert/elasticsearch/_bulk to vmsingle instead of vlogs-victorialogs.	2026-06-19 15:28:14 +02:00
Daniel Sy	4591ee7b14	feat(observability): 🗂️ organize dashboards into Grafana folders. Assigns folder field to all GrafanaDashboard CRs: EDP / Overview: platform-overview; EDP / Applications: forgejo, argocd-operational, garm, argocd; EDP / Operations: cronjob-monitoring, ingress-nginx, victoria-logs.	2026-06-19 14:46:41 +02:00
Daniel Sy	7f5c680e19	fix(observability): 🐛 enable GARM unauthenticated metrics + ArgoCD metrics on all instances. GARM dev.t09.de: set garm.metrics.disableAuth=true to unblock Prometheus scraping (was 401); ArgoCD dev.t09.de: add controller/server/repoServer/applicationSet metrics blocks; ArgoCD edp.buildth.ing: add controller/server/repoServer/applicationSet metrics blocks; ArgoCD benchmark.t09.de: add controller/server/repoServer/applicationSet metrics blocks; observability.buildth.ing already had metrics enabled (no change needed).	2026-06-19 13:36:26 +02:00
Daniel Sy	b6fbd3f6eb	feat(observability): ✨ add VictoriaLogs log panels to platform, forgejo, argocd dashboards	2026-06-19 13:34:12 +02:00
Daniel Sy	bcf583a055	fix(observability): 🐛 fix Vector log shipping URL on all clusters. Restores missing '.buildth.ing' domain segment in Vector elasticsearch endpoint for benchmark, dev, and edp instances. Template source uses {{{ .Env.DOMAIN_O12Y }}} (correct); instances were mis-hydrated, omitting the TLD suffix.	2026-06-19 13:32:23 +02:00
Daniel Sy	238ef71630	fix(observability): 🐛 fix remote write URL and add manifests for benchmark + edp clients. Fix broken remote write URL (o12y.observability./ → o12y.observability.buildth.ing/); create manifests/ dirs with .gitkeep for benchmark.t09.de and edp.buildth.ing; copy forgejo-scrape.yaml VMServiceScrape manifest to both instances.	2026-06-19 13:23:50 +02:00
Daniel Sy	076b2a16c9	fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM. Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview, use type-only datasource so grafana-operator resolves default prometheus DS; replace grafanaCom id:14279 cronjob dashboard with inline custom version supporting cluster_environment variable (dev/edp/observability); add new GARM runners dashboard (edp-garm) ready for when GARM metrics are scraped, uses or vector(0) guards so panels show 0 not empty. Note: cluster_environment values confirmed as dev/edp/observability (no benchmark). GARM metrics not yet present in VictoriaMetrics (0 series found).	2026-06-19 13:11:42 +02:00
Daniel Sy	6ea1e798d2	fix(observability): 🐛 add missing manifests to instance stacks. backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack; forgejo-scrape.yaml → dev.t09.de vm-client-stack.	2026-06-19 13:06:24 +02:00
Daniel Sy	91db8038e6	feat(observability): ✨ custom ArgoCD dashboard with cluster_environment filter	2026-06-19 13:02:48 +02:00
Daniel Sy	949529eb5c	feat(observability): ✨ add cluster_environment dropdown to Forgejo and platform-overview dashboards. Replace grafanaCom import (17802) with custom inline Forgejo dashboard containing cluster_environment query variable (refresh=2, label=Environment); add label, refresh=2, sort=1 to platform-overview cluster_environment variable; ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable).	2026-06-19 12:50:32 +02:00
Daniel Sy	c2528f6f69	feat(observability): ✨ add platform grafana dashboard CRs. Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802); add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993); add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279); add platform-overview.yaml: custom EDP Platform Overview inline dashboard (platform health, forgejo stats, resource usage, backup status rows); fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698.	2026-06-19 12:47:44 +02:00
Daniel Sy	0316eefa43	fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label. Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and kubernetesSystemScheduler used wrong key 'enabled: false'; chart expects 'create: false'. This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives because OTC CCE managed k8s does not expose control plane for scraping. Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local vmalert had no cluster_environment label on kube_job_status_failed metrics. Added cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.	2026-06-19 12:42:21 +02:00
Daniel Sy	32e998df5b	fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200. Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when rclone sync took >22m (vs 13-16s prior days). Likely triggered by significant new data in OBS bucket. 2h window accommodates large incremental syncs while BackupJobTooSlow alert still fires at 5m.	2026-06-19 12:35:41 +02:00
Daniel Sy	59eed97263	fix(observability-client): 🐛 fix remote write URL and add missing manifests dir. Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing; create manifests/ directory with .gitkeep for ArgoCD source path.	2026-06-19 11:41:26 +02:00
Daniel Sy	369961a940	fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev. Enable VMAgent (was disabled → no metrics scraped); remove disable_login from Grafana config and add security block so operator can auth via API; disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev).	2026-06-19 10:44:34 +02:00
Daniel Sy	d83945413d	fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest. Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not VLSingle. The VLSingle kind was introduced in a newer operator version and is not registered in this chart release. Changing to VLogs which has identical spec fields (retentionPeriod, removePvcAfterDelete, storage, storageMetadata, resources all supported).	2026-06-19 10:20:19 +02:00
Daniel Sy	ef4a1d7ce2	fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator. Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD sync. Dev cluster nodes are at 99% CPU / pod limit; hook pod cannot be scheduled, blocking the entire sync indefinitely. Disabling cleanup.enabled prevents the hook Job from being created. CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.	2026-06-19 09:58:55 +02:00
Daniel Sy	29c0a59734	fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions. VLSingle CRD missing at sync time: ArgoCD pre-validates all resources before applying any, causing 'synchronization tasks not valid' on CRs whose CRDs are created by the operator in the same sync wave. SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs, unblocking the CRD bootstrap deadlock.	2026-06-19 09:56:24 +02:00
Daniel Sy	a52a6691a8	fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy. Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app. Adds prune: true and RespectIgnoreDifferences=true to prevent sync failures when CRs are applied before CRDs are established.	2026-06-19 09:52:01 +02:00
Martin McCaffery	9ed3ff50d2	bump(benchmark): ci-sizer-collector sidecar 0.9.0 → 0.9.7 to pick up host-resolved kernel_peak + cgroup_path_count diagnostic	2026-06-17 11:38:55 +02:00
Daniel Sy	57ee5afa62	feat(observability): ✨ add VMServiceScrapes + migrate VLogs → VLSingle. Migrate VLogs CRD to VLSingle (operator.victoriametrics.com/v1beta1); add VMServiceScrape for Forgejo (gitea ns, port http, /metrics); add VMServiceScrape for ArgoCD (argocd ns, port http-metrics); add VMServiceScrape for GARM (garm ns, port metrics); add VMServiceScrape for CoreDNS (kube-system ns, k8s-app: kube-dns). Ref: IPCEICIS-4618, IPCEICIS-5066	2026-06-15 21:05:22 +02:00
Daniel Sy	7949cabb29	fix(garm): ⬆️ update to v0.1.7-forgejo-24 (fresh multi-arch build). Build completed successfully. Fixes exec format error from -23. Dropped stale NOTE warning; image is clean amd64.	2026-06-12 13:42:23 +02:00
Daniel Sy	8939b4f32b	fix(secrets-backup): 🔄 sync simplified manifest from template. Remove client-side openssl encryption. OBS SSE-KMS handles encryption at rest. Updated: no apk add openssl, no openssl enc step, no secrets-backup-config Secret, upload .tar.gz directly. Image tag bumped to 1.0.1 (built without openssl). Ref: IPCEICIS-9317	2026-06-12 13:12:20 +02:00
Daniel Sy	900c1f6c80	fix(dev): 🐛 revert automated-upload damage: restore working image pins + OIDC secrets. Automated upload (`95deeef`) overwrote 5 manually-pinned values: forgejo-server: restore workflow-webhook-20260305 (DB has v15a/v15b migrations; rolling back to 14.0.2-edp1-rootless WILL break the DB); garm: restore v0.1.7-forgejo-22 (v0.1.7-forgejo-23 has exec format error: wrong arch build, crashes on OTC CCE amd64 nodes); sizer-receiver/secret.yaml: re-add sizer-oidc-client secret (deleted by upload; causes OIDC auth failure on every sizer-receiver login); dex/manifests/dex-sizer-client.yaml: re-add (deleted by upload; dex cannot resolve sizer OIDC client without this secret); dex.yaml: restore manifests source block (removed by upload; without it ArgoCD never deploys the dex/manifests/ directory). backup-alerts.yaml (new VMRule from automated upload) is kept as-is.	2026-06-12 10:11:00 +02:00
Automated pipeline	95deeef6a0	Automated upload for dev.t09.de	2026-06-12 07:46:00 +00:00
Daniel Sy	9bbcf4efca	fix(secrets-backup): 🐛 add openssl install + upgrade image to 1.32.0. alpine/k8s:1.28.0 does not ship openssl. Script calls openssl enc on line 116 causing exit 127 on every run since initial deploy. Fix: apk add --no-cache openssl at script start (defensive, idempotent); upgrade image 1.28.0 -> 1.32.0 (kubectl client was 5 minor versions behind cluster v1.33, outside supported skew of +/-1).	2026-06-12 09:32:48 +02:00
Daniel Sy	cf8271fd86	revert(ci-sizer): 🔥 revert image pin: no versioned images in registry. GoReleaser config uses 'dockers_v2' (invalid key, should be 'dockers') so versioned container images were never pushed. Only :latest exists. Reverting to :latest until CI pipeline is fixed to publish version tags. Refs: IPCEICIS-9326	2026-06-08 18:12:56 +02:00
Daniel Sy	f4aa470894	fix(ci-sizer): 📌 pin sizer-receiver to v0.8.1 for dev. v0.8.2 does not exist; tags go v0.8.1 → v0.8.3. v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where repos are under giteaAdmin but OIDC org resolves differently. Pin to v0.8.1 until IPCEICIS-9326 fixes multi-env org support.	2026-06-08 18:08:04 +02:00
Daniel Sy	3fdfda9da7	fix(ci-sizer): 📌 pin sizer-receiver to v0.8.2 for dev. v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where repos are under giteaAdmin but OIDC org resolves differently. Pin to v0.8.2 until IPCEICIS-9326 fixes multi-env org support.	2026-06-08 18:06:00 +02:00
Daniel Sy	69839f767b	fix(ci-sizer): 🐛 set RECEIVER_ALLOWED_ORG=giteaAdmin for dev. Dev Forgejo repos live under giteaAdmin user, not DevFW org. Prod will use DevFW-CICD (template default). Dev needs explicit override.	2026-06-08 18:00:47 +02:00
Daniel Sy	925c7416b3	fix(ci-sizer): 🐛 revert RECEIVER_ALLOWED_ORG to DevFW for dev env. Template default is DevFW-CICD (prod), but dev Forgejo uses DevFW org. Hydration overwrote the correct value today.	2026-06-08 17:51:14 +02:00
Daniel Sy	bd82384eb1	fix(dex): 🔐 correct sizer client secret to match sizer-oidc-client. The deploy hydration created dex-sizer-client with wrong value. Reverting to the original shared secret that sizer expects (73eda906... - active for 81 days before hydration overwrote it). Changes: sizer-oidc-client: restore correct shared secret; dex-sizer-client: add managed manifest to prevent future drift; dex.yaml: add manifests source for ArgoCD to sync the secret. Broken by stacks rehydration pipeline run.	2026-06-08 17:11:10 +02:00
Daniel Sy	967edf0382	fix(ci-sizer): 🔐 align OIDC client secret with dex config. Secret mismatch caused infinite login loop on sizer.dev.t09.de. Added sizer-oidc-client secret manifest to GitOps so ArgoCD manages it. Value now matches dex-runner-sizer-client (dex side).	2026-06-08 17:00:38 +02:00
Daniel.Sy	9a7544418c	fix(forgejo): 🐛 use workflow-webhook image matching DB migration level (v15a/v15b). DB was migrated to v15 schema by this image in March. The 14.0.2-edp1-rootless image cannot start against it. Today's automated pipeline sync triggered pod restart, exposing the mismatch.	2026-06-08 14:11:31 +00:00
Daniel.Sy	a047be3aae	fix(garm): ⬇️ rollback to v0.1.7-forgejo-22 — -23 has exec format error (wrong arch)	2026-06-08 14:11:05 +00:00
Automated pipeline	422f568c8e	Automated upload for dev.t09.de	2026-06-08 12:15:27 +00:00
Martin McCaffery	011f436fb7	feat(benchmark.t09.de/garm): bump ci-sizer-collector 0.8.3 → 0.9.0 (kernel-peak + cgroup-v1 limit fallback)	2026-06-03 15:01:09 +01:00
Martin McCaffery	14873b7941	fix(garm): bump dev+benchmark to garm-helm v0.0.17 (template-robust readToken); drop now-redundant explicit fields on benchmark	2026-06-02 16:21:51 +01:00
Martin McCaffery	63cdb926b9	fix(sustainability-rules): remove Kepler energy rules since Kepler is incompatible	2026-06-02 16:12:22 +01:00
Martin McCaffery	f98f53a5a0	revert(kepler): remove Kepler, incompatible with OTC CCE proc mount restrictions	2026-06-02 16:12:06 +01:00
Martin McCaffery	608439697b	fix(benchmark.t09.de/garm): pin ci-sizer-collector to 0.8.3 (latest tagged release, avoid :latest drift during long runs)	2026-06-02 16:08:35 +01:00
Martin McCaffery	b5594a8017	feat(observability): add sustainability metrics, Kepler, 6-month retention, GARM scrape	2026-06-02 15:51:26 +01:00
Martin McCaffery	bbdca11f00	fix(benchmark.t09.de/garm): bump ci-sizer-collector to :latest (0.0.4 tag doesn't exist in registry, was unreachable until sizer integration was restored)	2026-06-02 15:42:10 +01:00
Martin McCaffery	3be56f5a07	fix(vm-client): add nodename-to-IP metricRelabelConfig for node-exporter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-02 14:58:36 +01:00
Martin McCaffery	e2469e7843	fix(benchmark.t09.de/garm): explicit sizer readToken mountPath/key/fileName (chart defaults not deep-merging, was rendering broken %!s(<nil>) path that crashed sizer consultation)	2026-06-02 14:38:41 +01:00
Martin McCaffery	b98486f445	fix: argocd metrics port name, coredns metrics via headless service	2026-06-02 12:13:38 +01:00
Martin McCaffery	eca54cb19c	fix(vm-client): use in-cluster VMSingle URL for remote write	2026-06-02 12:03:44 +01:00

647 commits