Commit graph

105 commits

Author SHA1 Message Date
70939149ea
feat(observability): add read routes to vmauth for dev.t09.de instance 2026-06-19 16:37:37 +02:00
0a249820de
fix(observability): 🐛 fix ArgoCD scrape port name http-metrics not metrics 2026-06-19 16:11:15 +02:00
f3931dc550
fix(observability): 🐛 add ArgoCD + GARM VMServiceScrapes to dev client stack 2026-06-19 16:07:27 +02:00
7f5c680e19
fix(observability): 🐛 enable GARM unauthenticated metrics + ArgoCD metrics on all instances
- GARM dev.t09.de: set garm.metrics.disableAuth=true to unblock Prometheus scraping (was 401)
- ArgoCD dev.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD edp.buildth.ing: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD benchmark.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- observability.buildth.ing already had metrics enabled (no change needed)
2026-06-19 13:36:26 +02:00
bcf583a055
fix(observability): 🐛 fix Vector log shipping URL on all clusters
Restores missing '.buildth.ing' domain segment in Vector elasticsearch
endpoint for benchmark, dev, and edp instances.

Template source uses {{{ .Env.DOMAIN_O12Y }}} (correct) — instances
were mis-hydrated, omitting the TLD suffix.
2026-06-19 13:32:23 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
32e998df5b
fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200
Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when
rclone sync took >22m (vs 13-16s prior days). Likely triggered by
significant new data in OBS bucket. 2h window accommodates large
incremental syncs while BackupJobTooSlow alert still fires at 5m.
2026-06-19 12:35:41 +02:00
59eed97263
fix(observability-client): 🐛 fix remote write URL and add missing manifests dir
- Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing
- Create manifests/ directory with .gitkeep for ArgoCD source path
2026-06-19 11:41:26 +02:00
369961a940
fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev
- Enable VMAgent (was disabled → no metrics scraped)
- Remove disable_login from Grafana config; add security block so operator can auth via API
- Disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev)
2026-06-19 10:44:34 +02:00
d83945413d
fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest
Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not
VLSingle. The VLSingle kind was introduced in a newer operator version
and is not registered in this chart release. Changing to VLogs which
has identical spec fields (retentionPeriod, removePvcAfterDelete,
storage, storageMetadata, resources all supported).
2026-06-19 10:20:19 +02:00
ef4a1d7ce2
fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator
Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD
sync. Dev cluster nodes are at 99% CPU / pod limit — hook pod cannot be
scheduled, blocking the entire sync indefinitely.

Disabling cleanup.enabled prevents the hook Job from being created.
CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.
2026-06-19 09:58:55 +02:00
29c0a59734
fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions
VLSingle CRD missing at sync time — ArgoCD pre-validates all resources
before applying any, causing 'synchronization tasks not valid' on CRs
whose CRDs are created by the operator in the same sync wave.
SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs,
unblocking the CRD bootstrap deadlock.
2026-06-19 09:56:24 +02:00
a52a6691a8
fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy
Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app.
Adds prune: true and RespectIgnoreDifferences=true to prevent sync
failures when CRs are applied before CRDs are established.
2026-06-19 09:52:01 +02:00
57ee5afa62
feat(observability): add VMServiceScrapes + migrate VLogs → VLSingle
- Migrate VLogs CRD to VLSingle (operator.victoriametrics.com/v1beta1)
- Add VMServiceScrape for Forgejo (gitea ns, port http, /metrics)
- Add VMServiceScrape for ArgoCD (argocd ns, port http-metrics)
- Add VMServiceScrape for GARM (garm ns, port metrics)
- Add VMServiceScrape for CoreDNS (kube-system ns, k8s-app: kube-dns)

Ref: IPCEICIS-4618, IPCEICIS-5066
2026-06-15 21:05:22 +02:00
7949cabb29
fix(garm): ⬆️ update to v0.1.7-forgejo-24 (fresh multi-arch build)
Build completed successfully. Fixes exec format error from -23.
Dropped stale NOTE warning — image is clean amd64.
2026-06-12 13:42:23 +02:00
8939b4f32b
fix(secrets-backup): 🔄 sync simplified manifest from template
Remove client-side openssl encryption. OBS SSE-KMS handles encryption at rest.
Updated: no apk add openssl, no openssl enc step, no secrets-backup-config Secret,
upload .tar.gz directly. Image tag bumped to 1.0.1 (built without openssl).

Ref: IPCEICIS-9317
2026-06-12 13:12:20 +02:00
900c1f6c80
fix(dev): 🐛 revert automated-upload damage — restore working image pins + OIDC secrets
Automated upload (95deeef) overwrote 5 manually-pinned values:

- forgejo-server: restore workflow-webhook-20260305 (DB has v15a/v15b
  migrations; rolling back to 14.0.2-edp1-rootless WILL break the DB)
- garm: restore v0.1.7-forgejo-22 (v0.1.7-forgejo-23 has exec format
  error — wrong arch build, crashes on OTC CCE amd64 nodes)
- sizer-receiver/secret.yaml: re-add sizer-oidc-client secret (deleted
  by upload; causes OIDC auth failure on every sizer-receiver login)
- dex/manifests/dex-sizer-client.yaml: re-add (deleted by upload;
  dex cannot resolve sizer OIDC client without this secret)
- dex.yaml: restore manifests source block (removed by upload;
  without it ArgoCD never deploys the dex/manifests/ directory)

backup-alerts.yaml (new VMRule from automated upload) is kept as-is.
2026-06-12 10:11:00 +02:00
Automated pipeline
95deeef6a0 Automated upload for dev.t09.de 2026-06-12 07:46:00 +00:00
9bbcf4efca
fix(secrets-backup): 🐛 add openssl install + upgrade image to 1.32.0
alpine/k8s:1.28.0 does not ship openssl. Script calls openssl enc
on line 116 causing exit 127 on every run since initial deploy.

Fix:
- apk add --no-cache openssl at script start (defensive, idempotent)
- upgrade image 1.28.0 -> 1.32.0 (kubectl client was 5 minor versions
  behind cluster v1.33, outside supported skew of +/-1)
2026-06-12 09:32:48 +02:00
cf8271fd86
revert(ci-sizer): 🔥 revert image pin — no versioned images in registry
GoReleaser config uses 'dockers_v2' (invalid key, should be 'dockers')
so versioned container images were never pushed. Only :latest exists.
Reverting to :latest until CI pipeline is fixed to publish version tags.

Refs: IPCEICIS-9326
2026-06-08 18:12:56 +02:00
f4aa470894
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.1 for dev
v0.8.2 does not exist — tags go v0.8.1 → v0.8.3.
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.1 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:08:04 +02:00
3fdfda9da7
fix(ci-sizer): 📌 pin sizer-receiver to v0.8.2 for dev
v0.8.3 introduced RequireOrgMatch middleware that breaks dev env where
repos are under giteaAdmin but OIDC org resolves differently.
Pin to v0.8.2 until IPCEICIS-9326 fixes multi-env org support.
2026-06-08 18:06:00 +02:00
69839f767b
fix(ci-sizer): 🐛 set RECEIVER_ALLOWED_ORG=giteaAdmin for dev
Dev Forgejo repos live under giteaAdmin user, not DevFW org.
Prod will use DevFW-CICD (template default). Dev needs explicit override.
2026-06-08 18:00:47 +02:00
925c7416b3
fix(ci-sizer): 🐛 revert RECEIVER_ALLOWED_ORG to DevFW for dev env
Template default is DevFW-CICD (prod), but dev Forgejo uses DevFW org.
Hydration overwrote the correct value today.
2026-06-08 17:51:14 +02:00
bd82384eb1
fix(dex): 🔐 correct sizer client secret to match sizer-oidc-client
The deploy hydration created dex-sizer-client with wrong value.
Reverting to the original shared secret that sizer expects
(73eda906... - active for 81 days before hydration overwrote it).

Changes:
- sizer-oidc-client: restore correct shared secret
- dex-sizer-client: add managed manifest to prevent future drift
- dex.yaml: add manifests source for ArgoCD to sync the secret

Broken by stacks rehydration pipeline run.
2026-06-08 17:11:10 +02:00
967edf0382
fix(ci-sizer): 🔐 align OIDC client secret with dex config
Secret mismatch caused infinite login loop on sizer.dev.t09.de.
Added sizer-oidc-client secret manifest to GitOps so ArgoCD manages it.
Value now matches dex-runner-sizer-client (dex side).
2026-06-08 17:00:38 +02:00
Daniel.Sy
9a7544418c fix(forgejo): 🐛 use workflow-webhook image matching DB migration level (v15a/v15b)
DB was migrated to v15 schema by this image in March.
The 14.0.2-edp1-rootless image cannot start against it.
Today's automated pipeline sync triggered pod restart, exposing the mismatch.
2026-06-08 14:11:31 +00:00
Daniel.Sy
a047be3aae fix(garm): ⬇️ rollback to v0.1.7-forgejo-22 — -23 has exec format error (wrong arch) 2026-06-08 14:11:05 +00:00
Automated pipeline
422f568c8e Automated upload for dev.t09.de 2026-06-08 12:15:27 +00:00
Martin McCaffery
14873b7941
fix(garm): bump dev+benchmark to garm-helm v0.0.17 (template-robust readToken); drop now-redundant explicit fields on benchmark 2026-06-02 16:21:51 +01:00
a7bc25603c
Added DevFW-CICD users as admins 2026-05-19 14:01:18 +02:00
75e4a2384b
fix(ci-sizer): 🐛 align GARM_URL with template output
Use short service DNS (garm.garm.svc:80) instead of FQDN
(garm.garm.svc.cluster.local:80) to match what the stack template
now generates.

Ref: IPCEICIS-6886
2026-05-18 10:26:23 +02:00
manuel.ganter
c5191ea18a Update otc/dev.t09.de/stacks/forgejo/forgejo-server/manifests/forgejo-ingress.yaml 2026-05-05 08:29:25 +00:00
556a784beb
fix(stacks-instances): 🚑 add ci-sizer registry entry for dev.t09.de
Create ci-sizer-reg ArgoCD app-of-apps to manage the sizer-receiver
after migration from garm namespace. Restores sizer.dev.t09.de ingress.
2026-04-29 10:41:29 +02:00
2e90240c81
refactor(stacks-instances): 🚚 migrate sizer-receiver to ci-sizer namespace (dev.t09.de)
Move sizer-receiver from stacks/garm/ to stacks/ci-sizer/ for
dev.t09.de only. edp.buildth.ing stays in garm (not deployed yet).
2026-04-29 10:36:14 +02:00
9d042eee1c
chore: ⬆️ bump garm image to v0.1.7-forgejo-22 on dev.t09.de 2026-04-28 10:11:09 +02:00
bc96d8d7aa
chore(garm): ⬆️ bump garm-forgejo to v0.1.7-forgejo-21 2026-04-24 15:47:23 +02:00
e65abf162e
chore(garm): ⬆️ bump garm-forgejo to v0.1.7-forgejo-20 2026-04-24 14:51:58 +02:00
a9dcf29f7a
chore(garm): ⬆️ bump garm-forgejo to v0.1.7-forgejo-19 2026-04-24 13:41:52 +02:00
b72e2049e3
chore: bump garm image to v0.1.7-forgejo-18 for dev.t09.de 2026-04-22 13:19:32 +02:00
4cea4ffde7
chore: bump garm to v0.1.7-forgejo-17 (activeDeadlineSeconds) 2026-04-21 17:15:41 +02:00
0b13b89640
chore(garm): ⬆️ bump garm-helm to v0.0.15 (startup probe fix) 2026-04-21 16:27:25 +02:00
4aa8973c91
chore(garm): ⬆️ bump garm-helm chart to v0.0.14 2026-04-21 16:03:06 +02:00
c682c48be0
chore: bump garm image to v0.1.7-forgejo-16 2026-04-21 15:53:50 +02:00
61721097d6
chore(sizer): 🔧 rename forgejo-runner-sizer to ci-sizer in deployment configs
- Update container image names to ci-sizer-{receiver,collector}
- Update Dex OIDC client ID and name to ci-sizer
- Template allowed-org as SIZER_ALLOWED_ORG variable
2026-04-21 14:16:39 +02:00
487e1ac15c
chore(garm): ⬆️ bump garm to v0.1.7-forgejo-15 2026-04-20 17:32:22 +02:00
2af607e949
chore(garm): ⬆️ bump garm to v0.1.7-forgejo-14, add CPU sizing mode env vars 2026-04-20 16:08:12 +02:00
f2c885cd84
fix(sizer): 🔧 sync gitops with live deployment — add OIDC config, remove legacy Forgejo tokens 2026-04-16 15:05:53 +02:00