feat(observability): ✨ add cluster heartbeat dead-man switch alerts

ClusterMetricsSilent: fires if no kubelet metrics arrive for >10m (catches vmagent outages).
ClusterAPIServerDown: fires if the apiserver scrape fails for >5m.
Replaces the silenced KubeControllerManagerDown/KubeSchedulerDown alerts, which never fire on managed K8s.
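
For orientation, the two heartbeat alerts named above might be expressed as a VMRule along these lines. This is only a sketch, not the rule shipped in this change: the resource and group names, the `job` label values (`kubelet`, `apiserver`), the severity labels, and the use of `absent()` are assumptions. Only the alert names and the 10m/5m windows come from the commit message.

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: cluster-heartbeat          # hypothetical name
  namespace: observability
spec:
  groups:
    - name: cluster-heartbeat      # hypothetical group name
      rules:
        - alert: ClusterMetricsSilent
          # Assumption: kubelet targets are scraped with job="kubelet".
          # absent() returns 1 only when no matching series exists at all,
          # which is what happens when vmagent stops delivering samples.
          expr: absent(up{job="kubelet"})
          for: 10m
          labels:
            severity: critical     # assumed severity
        - alert: ClusterAPIServerDown
          # Assumption: the apiserver target is scraped with job="apiserver".
          # up == 0 means the target exists but the scrape itself is failing.
          expr: up{job="apiserver"} == 0
          for: 5m
          labels:
            severity: critical     # assumed severity
```

A dead-man switch of this kind is only useful if it is evaluated where the metrics arrive, not on the cluster that has gone quiet. When several clusters write to one store, the `absent()` selector would likely also need a per-cluster matcher, for example on the `cluster_environment` label.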
fix(observability): 🔇 silence managed-K8s false alerts + bump backup deadline to 4h
2026-06-22 11:05:48 +02:00 · 2026-06-22 10:46:01 +02:00 · 2026-06-22 10:40:49 +02:00 · 2026-06-22 10:35:08 +02:00 · 2026-06-19 16:43:28 +02:00 · 2026-06-19 16:37:37 +02:00
42 changed files with 1243 additions and 71 deletions
--- a/otc/benchmark.t09.de/stacks/core/argocd/values.yaml
+++ b/otc/benchmark.t09.de/stacks/core/argocd/values.yaml
@@ -35,6 +35,30 @@ configs:
  tls:
    certificates:

+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
 notifications:
  enabled: false

--- a/otc/benchmark.t09.de/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
+++ b/otc/benchmark.t09.de/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
@@ -11,8 +11,8 @@ spec:
  startingDeadlineSeconds: 600 # 10 minutes
  jobTemplate:
    spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 4h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 #
      template:
--- a/otc/benchmark.t09.de/stacks/observability-client/vector/values.yaml
+++ b/otc/benchmark.t09.de/stacks/observability-client/vector/values.yaml
@@ -48,7 +48,7 @@ customConfig:
      type: elasticsearch
      inputs: [parser]
      endpoints: 
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
      auth:
        strategy: basic
        user: ${VECTOR_USER}
--- a/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/manifests/.gitkeep
+++ b/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/manifests/.gitkeep
--- a/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
+++ b/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
@@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics
--- a/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/values.yaml
+++ b/otc/benchmark.t09.de/stacks/observability-client/vm-client-stack/values.yaml
@@ -201,13 +201,13 @@ defaultRules:
      create: true
      rules: {}
    kubernetesSystemControllerManager:
-      create: true
+      create: false
      rules: {}
    kubeScheduler:
-      create: true
+      create: false
      rules: {}
    kubernetesSystemScheduler:
-      create: true
+      create: false
      rules: {}
    kubeStateMetrics:
      create: true
@@ -778,7 +778,7 @@ vmagent:
  # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
  additionalRemoteWrites:
    # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
      basicAuth:
        username:
          name: simple-user-secret
--- a/otc/dev.t09.de/stacks/core/argocd/values.yaml
+++ b/otc/dev.t09.de/stacks/core/argocd/values.yaml
@@ -35,6 +35,30 @@ configs:
  tls:
    certificates:

+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
 notifications:
  enabled: false

--- a/otc/dev.t09.de/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
+++ b/otc/dev.t09.de/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
@@ -11,8 +11,8 @@ spec:
  startingDeadlineSeconds: 600 # 10 minutes
  jobTemplate:
    spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 4h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 #
      template:
--- a/otc/dev.t09.de/stacks/garm/garm/values.yaml
+++ b/otc/dev.t09.de/stacks/garm/garm/values.yaml
@@ -41,5 +41,8 @@ providerConfig:
      sidecarImage: edp.buildth.ing/devfw-cicd/ci-sizer-collector:0.0.4

 garm:
+  metrics:
+    enable: true
+    disableAuth: true
  logging:
    logLevel: info
--- a/otc/dev.t09.de/stacks/observability-client/vector/values.yaml
+++ b/otc/dev.t09.de/stacks/observability-client/vector/values.yaml
@@ -48,7 +48,7 @@ customConfig:
      type: elasticsearch
      inputs: [parser]
      endpoints: 
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
      auth:
        strategy: basic
        user: ${VECTOR_USER}
--- a/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/.gitkeep
+++ b/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/.gitkeep
--- a/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/argocd-scrape.yaml
+++ b/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/argocd-scrape.yaml
@@ -0,0 +1,14 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: argocd
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - argocd
+  selector:
+    matchLabels:
+      app.kubernetes.io/part-of: argocd
+  endpoints:
+    - port: http-metrics
--- a/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
+++ b/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
@@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics
--- a/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/garm-scrape.yaml
+++ b/otc/dev.t09.de/stacks/observability-client/vm-client-stack/manifests/garm-scrape.yaml
@@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: garm
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - garm
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: garm
+  endpoints:
+    - port: http
+      path: /metrics
--- a/otc/dev.t09.de/stacks/observability-client/vm-client-stack/values.yaml
+++ b/otc/dev.t09.de/stacks/observability-client/vm-client-stack/values.yaml
@@ -201,13 +201,13 @@ defaultRules:
      create: true
      rules: {}
    kubernetesSystemControllerManager:
-      create: true
+      create: false
      rules: {}
    kubeScheduler:
-      create: true
+      create: false
      rules: {}
    kubernetesSystemScheduler:
-      create: true
+      create: false
      rules: {}
    kubeStateMetrics:
      create: true
@@ -778,7 +778,7 @@ vmagent:
  # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
  additionalRemoteWrites:
    # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
      basicAuth:
        username:
          name: simple-user-secret
@@ -801,6 +801,20 @@ vmagent:
      # Do not store original labels in vmagent's memory by default. This reduces the amount of memory used by vmagent
      # but makes vmagent debugging UI less informative. See: https://docs.victoriametrics.com/vmagent/#relabel-debug
      promscrape.dropOriginalLabels: "true"
+    # Harden liveness probe: default failureThreshold=10 masked a 72h silent outage
+    livenessProbe:
+      httpGet:
+        path: /health
+        port: http
+      failureThreshold: 3
+      periodSeconds: 5
+      timeoutSeconds: 5
+    startupProbe:
+      httpGet:
+        path: /health
+        port: http
+      failureThreshold: 30
+      periodSeconds: 5
  # -- (object) VMAgent ingress configuration
  ingress:
    enabled: false
--- a/otc/dev.t09.de/stacks/observability/grafana-operator/manifests/grafana.yaml
+++ b/otc/dev.t09.de/stacks/observability/grafana-operator/manifests/grafana.yaml
@@ -35,8 +35,10 @@ spec:
    server:
      root_url: "https://grafana.dev.t09.de"
    auth:
-      disable_login: "true"
      disable_login_form: "true"
+    security:
+      admin_user: admin
+      admin_password: admin
    auth.generic_oauth:
      enabled: "true"
      name: Forgejo
--- a/otc/dev.t09.de/stacks/observability/victoria-k8s-stack.yaml
+++ b/otc/dev.t09.de/stacks/observability/victoria-k8s-stack.yaml
@@ -9,10 +9,13 @@ spec:
  project: default
  syncPolicy:
    automated:
+      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
+      - RespectIgnoreDifferences=true
+      - SkipDryRunOnMissingResource=true
  destination:
    name: in-cluster
    namespace: observability
--- a/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/garm-scrape.yaml
+++ b/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/garm-scrape.yaml
@@ -11,4 +11,4 @@ spec:
    matchLabels:
      app.kubernetes.io/name: garm
  endpoints:
-    - port: metrics
+    - port: http
--- a/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/vlogs.yaml
+++ b/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/vlogs.yaml
@@ -1,5 +1,5 @@
 apiVersion: operator.victoriametrics.com/v1beta1
-kind: VLSingle
+kind: VLogs
 metadata:
  name: victorialogs
  namespace: observability
--- a/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/vmauth.yaml
+++ b/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/manifests/vmauth.yaml
@@ -12,6 +12,12 @@ spec:
    - static:
        url: http://vmsingle-o12y:8429
      paths: ["/api/v1/write"]
+    - static:
+        url: http://vmsingle-o12y:8429
+      paths: ["/api/v1/.*"]
    - static:
        url: http://vlogs-victorialogs:9428
      paths: ["/insert/elasticsearch/.*"]
+    - static:
+        url: http://vlogs-victorialogs:9428
+      paths: ["/select/.*"]
--- a/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/values.yaml
+++ b/otc/dev.t09.de/stacks/observability/victoria-k8s-stack/values.yaml
@@ -28,10 +28,7 @@ victoria-metrics-operator:
  crds:
    plain: true
    cleanup:
-      enabled: true
-      image:
-        repository: bitnami/kubectl
-        pullPolicy: IfNotPresent
+      enabled: false  # disabled: cleanup hook can't schedule on resource-constrained nodes (Insufficient cpu / Too many pods)
  serviceMonitor:
    enabled: true
  operator:
@@ -676,7 +673,7 @@ vmalert:

 vmauth:
  # -- Enable VMAuth CR
-  enabled: true
+  enabled: false
  # -- VMAuth annotations
  annotations: {}
  # -- (object) Full spec for VMAuth CRD. Allowed values described [here](https://docs.victoriametrics.com/operator/api#vmauthspec)
@@ -699,7 +696,7 @@ vmauth:

 vmagent:
  # -- Create VMAgent CR
-  enabled: false
+  enabled: true
  # -- VMAgent annotations
  annotations: {}
  # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
@@ -711,7 +708,8 @@ vmagent:
    port: "8429"
    selectAllByDefault: true
    scrapeInterval: 20s
-    externalLabels: {}
+    externalLabels:
+      cluster_environment: "dev"
      # For multi-cluster setups it is useful to use "cluster" label to identify the metrics source.
      # For example:
      # cluster: cluster-name
--- a/otc/edp.buildth.ing/stacks/core/argocd/values.yaml
+++ b/otc/edp.buildth.ing/stacks/core/argocd/values.yaml
@@ -35,6 +35,30 @@ configs:
  tls:
    certificates:

+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+
 notifications:
  enabled: false

--- a/otc/edp.buildth.ing/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
+++ b/otc/edp.buildth.ing/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
@@ -11,8 +11,8 @@ spec:
  startingDeadlineSeconds: 600 # 10 minutes
  jobTemplate:
    spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 4h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 #
      template:
--- a/otc/edp.buildth.ing/stacks/observability-client/vector/values.yaml
+++ b/otc/edp.buildth.ing/stacks/observability-client/vector/values.yaml
@@ -48,7 +48,7 @@ customConfig:
      type: elasticsearch
      inputs: [parser]
      endpoints: 
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
      auth:
        strategy: basic
        user: ${VECTOR_USER}
--- a/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/manifests/.gitkeep
+++ b/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/manifests/.gitkeep
--- a/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
+++ b/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/manifests/forgejo-scrape.yaml
@@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics
--- a/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/values.yaml
+++ b/otc/edp.buildth.ing/stacks/observability-client/vm-client-stack/values.yaml
@@ -201,13 +201,13 @@ defaultRules:
      create: true
      rules: {}
    kubernetesSystemControllerManager:
-      create: true
+      create: false
      rules: {}
    kubeScheduler:
-      create: true
+      create: false
      rules: {}
    kubernetesSystemScheduler:
-      create: true
+      create: false
      rules: {}
    kubeStateMetrics:
      create: true
@@ -778,7 +778,7 @@ vmagent:
  # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
  additionalRemoteWrites:
    # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
      basicAuth:
        username:
          name: simple-user-secret
--- a/otc/observability.buildth.ing/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
+++ b/otc/observability.buildth.ing/stacks/forgejo/forgejo-server/manifests/forgejo-s3-backup-cronjob.yaml
@@ -11,8 +11,8 @@ spec:
  startingDeadlineSeconds: 600 # 10 minutes
  jobTemplate:
    spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 4h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
      backoffLimit: 2
      ttlSecondsAfterFinished: 259200 #
      template:
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/argocd-operational.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/argocd-operational.yaml
@@ -0,0 +1,153 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: argocd-operational
+spec:
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
+  folder: "EDP / Applications"
+  json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "panels": [
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
+          "title": "Application Status",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
+          "title": "Total Apps",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
+          "title": "Healthy",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Healthy\"})", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
+          "title": "Degraded",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Degraded\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
+          "title": "Synced",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", sync_status=\"Synced\"})", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "yellow", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
+          "title": "OutOfSync",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", sync_status=\"OutOfSync\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "orange", "value": null}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
+          "title": "Progressing",
+          "type": "stat",
+          "targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Progressing\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
+          "title": "Application Details",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {"custom": {"filterable": true}},
+            "overrides": [
+              {"matcher": {"id": "byName", "options": "Health"}, "properties": [{"id": "custom.cellOptions", "value": {"type": "color-text"}}, {"id": "mappings", "value": [{"options": {"Healthy": {"color": "green", "text": "Healthy"}, "Degraded": {"color": "red", "text": "Degraded"}, "Progressing": {"color": "yellow", "text": "Progressing"}, "Missing": {"color": "purple", "text": "Missing"}}, "type": "value"}]}]},
+              {"matcher": {"id": "byName", "options": "Sync"}, "properties": [{"id": "custom.cellOptions", "value": {"type": "color-text"}}, {"id": "mappings", "value": [{"options": {"Synced": {"color": "green", "text": "Synced"}, "OutOfSync": {"color": "orange", "text": "OutOfSync"}}, "type": "value"}]}]}
+            ]
+          },
+          "gridPos": {"h": 12, "w": 24, "x": 0, "y": 6},
+          "title": "All Applications",
+          "type": "table",
+          "targets": [{"expr": "argocd_app_info{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true, "legendFormat": ""}],
+          "transformations": [
+            {"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "name", "dest_namespace", "health_status", "sync_status", "repo"]}}},
+            {"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "name": "Application", "dest_namespace": "Namespace", "health_status": "Health", "sync_status": "Sync", "repo": "Repository"}}}
+          ]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 18},
+          "title": "Sync Activity",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "ops"}},
+          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 19},
+          "title": "Sync Operations (rate)",
+          "type": "timeseries",
+          "targets": [{"expr": "sum(rate(argocd_app_sync_total{cluster_environment=~\"$cluster_environment\"}[5m])) by (name, phase)", "legendFormat": "{{name}} ({{phase}})"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "ops"}},
+          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 19},
+          "title": "Reconciliation Rate",
+          "type": "timeseries",
+          "targets": [{"expr": "sum(rate(argocd_app_reconcile_count{cluster_environment=~\"$cluster_environment\"}[5m])) by (namespace)", "legendFormat": "{{namespace}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 27},
+          "title": "ArgoCD Logs",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "victoriametrics-logs-datasource"},
+          "gridPos": {"h": 10, "w": 24, "x": 0, "y": 28},
+          "title": "ArgoCD Logs",
+          "type": "logs",
+          "targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"argocd\"}", "refId": "A"}],
+          "options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
+        }
+      ],
+      "schemaVersion": 39,
+      "tags": ["edp", "argocd", "gitops"],
+      "templating": {
+        "list": [
+          {
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus"},
+            "definition": "label_values(argocd_app_info, cluster_environment)",
+            "includeAll": true,
+            "multi": true,
+            "name": "cluster_environment",
+            "label": "Environment",
+            "query": "label_values(argocd_app_info, cluster_environment)",
+            "refresh": 2,
+            "sort": 1,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-6h", "to": "now"},
+      "title": "ArgoCD Operations",
+      "uid": "edp-argocd-ops"
+    }
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/argocd.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/argocd.yaml
@@ -6,4 +6,5 @@ spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
+  folder: "EDP / Applications"
  url: "https://raw.githubusercontent.com/argoproj/argo-cd/refs/heads/master/examples/dashboard.json"
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/cronjob-monitoring.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/cronjob-monitoring.yaml
@@ -0,0 +1,103 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: cronjob-monitoring
+spec:
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
+  folder: "EDP / Operations"
+  json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "panels": [
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
+          "title": "Backup Job Status",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "s", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}}},
+          "gridPos": {"h": 5, "w": 12, "x": 0, "y": 1},
+          "title": "Time Since Last Schedule",
+          "type": "stat",
+          "targets": [{"expr": "time() - kube_cronjob_status_last_schedule_time{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cronjob}} ({{cluster_environment}})"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}}},
+          "gridPos": {"h": 5, "w": 12, "x": 12, "y": 1},
+          "title": "Failed Jobs (Active)",
+          "type": "stat",
+          "targets": [{"expr": "sum by(cluster_environment, job_name) (kube_job_status_failed{cluster_environment=~\"$cluster_environment\"}) > 0", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 6},
+          "title": "CronJob Overview",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"custom": {"filterable": true}}, "overrides": [{"matcher": {"id": "byName", "options": "Suspended"}, "properties": [{"id": "mappings", "value": [{"options": {"0": {"text": "No", "color": "green"}, "1": {"text": "YES", "color": "red"}}, "type": "value"}]}]}]},
+          "gridPos": {"h": 8, "w": 24, "x": 0, "y": 7},
+          "title": "All CronJobs",
+          "type": "table",
+          "targets": [
+            {"expr": "kube_cronjob_info{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true, "refId": "A"}
+          ],
+          "transformations": [
+            {"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "cronjob", "namespace", "schedule"]}}},
+            {"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "cronjob": "CronJob", "namespace": "Namespace", "schedule": "Schedule"}}}
+          ]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
+          "title": "Job History",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
+          "title": "Job Completions (24h)",
+          "type": "timeseries",
+          "targets": [{"expr": "sum(kube_job_status_succeeded{cluster_environment=~\"$cluster_environment\"}) by (job_name, cluster_environment)", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "color": {"mode": "palette-classic"}}},
+          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
+          "title": "Job Failures (24h)",
+          "type": "timeseries",
+          "targets": [{"expr": "sum(kube_job_status_failed{cluster_environment=~\"$cluster_environment\"}) by (job_name, cluster_environment)", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
+        }
+      ],
+      "schemaVersion": 39,
+      "tags": ["edp", "backup", "cronjob"],
+      "templating": {
+        "list": [
+          {
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus"},
+            "definition": "label_values(kube_cronjob_info, cluster_environment)",
+            "includeAll": true,
+            "multi": true,
+            "name": "cluster_environment",
+            "label": "Environment",
+            "query": "label_values(kube_cronjob_info, cluster_environment)",
+            "refresh": 2,
+            "sort": 1,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-24h", "to": "now"},
+      "title": "CronJob & Backup Monitoring",
+      "uid": "edp-cronjobs"
+    }
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/forgejo.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/forgejo.yaml
@@ -0,0 +1,207 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: forgejo
+spec:
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
+  folder: "EDP / Applications"
+  json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "panels": [
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
+          "title": "Forgejo Health",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}], "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}}},
+          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
+          "title": "Status",
+          "type": "stat",
+          "targets": [{"expr": "up{job=\"forgejo-server-http\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
+          "title": "Version",
+          "type": "stat",
+          "targets": [{"expr": "gitea_build_info{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{version}}"}],
+          "options": {"reduceOptions": {"calcs": ["lastNotNull"]}, "textMode": "name"}
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
+          "title": "Repositories",
+          "type": "stat",
+          "targets": [{"expr": "gitea_repositories{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
+          "title": "Users",
+          "type": "stat",
+          "targets": [{"expr": "gitea_users{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
+          "title": "Organizations",
+          "type": "stat",
+          "targets": [{"expr": "gitea_organizations{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
+          "title": "Teams",
+          "type": "stat",
+          "targets": [{"expr": "gitea_teams{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
+          "title": "Activity",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 6, "x": 0, "y": 6},
+          "title": "Open Issues",
+          "type": "stat",
+          "targets": [{"expr": "gitea_issues_open{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 6, "x": 6, "y": 6},
+          "title": "Closed Issues",
+          "type": "stat",
+          "targets": [{"expr": "gitea_issues_closed{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 6, "x": 12, "y": 6},
+          "title": "Webhooks",
+          "type": "stat",
+          "targets": [{"expr": "gitea_webhooks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 6, "x": 18, "y": 6},
+          "title": "Hook Tasks",
+          "type": "stat",
+          "targets": [{"expr": "gitea_hooktasks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 10},
+          "title": "Content & Auth",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 11},
+          "title": "Stars",
+          "type": "stat",
+          "targets": [{"expr": "gitea_stars{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 11},
+          "title": "Watches",
+          "type": "stat",
+          "targets": [{"expr": "gitea_watches{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 11},
+          "title": "Releases",
+          "type": "stat",
+          "targets": [{"expr": "gitea_releases{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 12, "y": 11},
+          "title": "Mirrors",
+          "type": "stat",
+          "targets": [{"expr": "gitea_mirrors{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 16, "y": 11},
+          "title": "Public Keys",
+          "type": "stat",
+          "targets": [{"expr": "gitea_publickeys{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 20, "y": 11},
+          "title": "OAuth Apps",
+          "type": "stat",
+          "targets": [{"expr": "gitea_oauths{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
+          "title": "Forgejo Logs",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "victoriametrics-logs-datasource"},
+          "gridPos": {"h": 10, "w": 12, "x": 0, "y": 16},
+          "title": "Forgejo Server Logs",
+          "type": "logs",
+          "targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"gitea\"}", "refId": "A"}],
+          "options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
+        },
+        {
+          "datasource": {"type": "victoriametrics-logs-datasource"},
+          "gridPos": {"h": 10, "w": 12, "x": 12, "y": 16},
+          "title": "Forgejo Errors",
+          "type": "logs",
+          "targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"gitea\"} error OR Error OR ERROR OR panic", "refId": "A"}],
+          "options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
+        }
+      ],
+      "schemaVersion": 39,
+      "tags": ["edp", "forgejo", "gitea"],
+      "templating": {
+        "list": [
+          {
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus"},
+            "definition": "label_values(gitea_repositories, cluster_environment)",
+            "includeAll": true,
+            "multi": true,
+            "name": "cluster_environment",
+            "label": "Environment",
+            "query": "label_values(gitea_repositories, cluster_environment)",
+            "refresh": 2,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-6h", "to": "now"},
+      "title": "Forgejo",
+      "uid": "edp-forgejo"
+    }
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/garm.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/garm.yaml
@ -0,0 +1,117 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: garm
+spec:
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
+  folder: "EDP / Applications"
+  json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "panels": [
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
+          "title": "GARM Runner Status",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
+          "gridPos": {"h": 5, "w": 6, "x": 0, "y": 1},
+          "title": "Total Runners",
+          "type": "stat",
+          "targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
+          "gridPos": {"h": 5, "w": 6, "x": 6, "y": 1},
+          "title": "Idle Runners",
+          "type": "stat",
+          "targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\", status=\"idle\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "yellow", "value": null}]}}},
+          "gridPos": {"h": 5, "w": 6, "x": 12, "y": 1},
+          "title": "Creating",
+          "type": "stat",
+          "targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\", status=\"creating\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}]}}},
+          "gridPos": {"h": 5, "w": 6, "x": 18, "y": 1},
+          "title": "Error Rate (5m)",
+          "type": "stat",
+          "targets": [{"expr": "sum(rate(garm_runner_errors_total{cluster_environment=~\"$cluster_environment\"}[5m])) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 6},
+          "title": "GitHub API Rate Limits",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "min": 0}},
+          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 7},
+          "title": "Rate Limit Remaining",
+          "type": "timeseries",
+          "targets": [{"expr": "garm_github_rate_limit_remaining{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "ops"}},
+          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 7},
+          "title": "Runner Operations Rate",
+          "type": "timeseries",
+          "targets": [{"expr": "sum(rate(garm_runner_operations_total{cluster_environment=~\"$cluster_environment\"}[5m])) by (cluster_environment)", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
+          "title": "Runner Details",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"custom": {"filterable": true}}},
+          "gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
+          "title": "Runner Pool Status",
+          "type": "table",
+          "targets": [{"expr": "garm_runner_status{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true}],
+          "transformations": [
+            {"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "name", "status", "pool_owner", "pool_type", "provider"]}}},
+            {"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "name": "Runner", "status": "Status", "pool_owner": "Pool Owner", "pool_type": "Type", "provider": "Provider"}}}
+          ]
+        }
+      ],
+      "schemaVersion": 39,
+      "tags": ["edp", "garm", "ci-cd", "runners"],
+      "templating": {
+        "list": [
+          {
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus"},
+            "definition": "label_values(garm_runner_status, cluster_environment)",
+            "includeAll": true,
+            "multi": true,
+            "name": "cluster_environment",
+            "label": "Environment",
+            "query": "label_values(garm_runner_status, cluster_environment)",
+            "refresh": 2,
+            "sort": 1,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-6h", "to": "now"},
+      "title": "GARM Runners",
+      "uid": "edp-garm"
+    }
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/ingress-nginx.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/ingress-nginx.yaml
@ -6,4 +6,5 @@ spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
+  folder: "EDP / Operations"
  url: "https://raw.githubusercontent.com/adinhodovic/ingress-nginx-mixin/refs/heads/main/dashboards_out/ingress-nginx-overview.json"
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/platform-overview.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/platform-overview.yaml
@ -0,0 +1,245 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: platform-overview
+spec:
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
+  folder: "EDP / Overview"
+  json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "fiscalYearStartMonth": 0,
+      "graphTooltip": 1,
+      "links": [],
+      "panels": [
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
+          "title": "Platform Health",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
+          "title": "Forgejo",
+          "type": "stat",
+          "targets": [{"expr": "sum(up{job=\"forgejo-server-http\", cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 3}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
+          "title": "Ingress 5xx (5m)",
+          "type": "stat",
+          "targets": [{"expr": "sum(rate(nginx_ingress_controller_requests{status=~\"5..\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "unit": "short",
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
+          "title": "Failed Jobs (gitea)",
+          "type": "stat",
+          "targets": [{"expr": "sum(kube_job_status_failed{namespace=\"gitea\", cluster_environment=~\"$cluster_environment\"}) or vector(0)", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percentunit",
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.7}, {"color": "red", "value": 0.85}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
+          "title": "Cluster CPU Usage",
+          "type": "stat",
+          "targets": [{"expr": "1 - avg(rate(node_cpu_seconds_total{mode=\"idle\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percentunit",
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.7}, {"color": "red", "value": 0.85}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
+          "title": "Cluster Memory Usage",
+          "type": "stat",
+          "targets": [{"expr": "1 - sum(node_memory_MemAvailable_bytes{cluster_environment=~\"$cluster_environment\"}) / sum(node_memory_MemTotal_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "unit": "percentunit",
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.6}, {"color": "red", "value": 0.8}]}
+            }
+          },
+          "gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
+          "title": "Max PVC Usage",
+          "type": "stat",
+          "targets": [{"expr": "max(1 - kubelet_volume_stats_available_bytes{cluster_environment=~\"$cluster_environment\"} / kubelet_volume_stats_capacity_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
+          "title": "Forgejo",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 6},
+          "title": "Repositories",
+          "type": "stat",
+          "targets": [{"expr": "gitea_repositories{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 6},
+          "title": "Users",
+          "type": "stat",
+          "targets": [{"expr": "gitea_users{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 6},
+          "title": "Organizations",
+          "type": "stat",
+          "targets": [{"expr": "gitea_organizations{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 12, "y": 6},
+          "title": "Open Issues",
+          "type": "stat",
+          "targets": [{"expr": "gitea_issues_open{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 16, "y": 6},
+          "title": "Webhooks",
+          "type": "stat",
+          "targets": [{"expr": "gitea_webhooks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short"}},
+          "gridPos": {"h": 4, "w": 4, "x": 20, "y": 6},
+          "title": "Mirrors",
+          "type": "stat",
+          "targets": [{"expr": "gitea_mirrors{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 10},
+          "title": "Resources",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "percentunit", "min": 0, "max": 1}},
+          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 11},
+          "title": "Node CPU Usage",
+          "type": "timeseries",
+          "targets": [{"expr": "1 - avg by(instance, cluster_environment) (rate(node_cpu_seconds_total{mode=\"idle\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": "{{instance}}"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "percentunit", "min": 0, "max": 1}},
+          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 11},
+          "title": "PVC Usage by Claim",
+          "type": "timeseries",
+          "targets": [{"expr": "1 - (kubelet_volume_stats_available_bytes{cluster_environment=~\"$cluster_environment\"} / kubelet_volume_stats_capacity_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": "{{namespace}}/{{persistentvolumeclaim}}"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 19},
+          "title": "Backups",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "s", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}}},
+          "gridPos": {"h": 4, "w": 8, "x": 0, "y": 20},
+          "title": "Time Since Last Backup Schedule",
+          "type": "stat",
+          "targets": [{"expr": "time() - kube_cronjob_status_last_schedule_time{cronjob=~\"forgejo-s3-backup|secrets-backup\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cronjob}} ({{cluster_environment}})"}]
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "s"}},
+          "gridPos": {"h": 4, "w": 8, "x": 8, "y": 20},
+          "title": "Backup Job Duration",
+          "type": "timeseries",
+          "targets": [{"expr": "kube_job_status_completion_time{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"} - kube_job_status_start_time{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{job_name}}"}],
+          "options": {"legend": {"displayMode": "table"}}
+        },
+        {
+          "datasource": {"type": "prometheus"},
+          "fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}}},
+          "gridPos": {"h": 4, "w": 8, "x": 16, "y": 20},
+          "title": "Failed Backup Jobs (Active)",
+          "type": "stat",
+          "targets": [{"expr": "sum by(cluster_environment, job_name) (kube_job_status_failed{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"})", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
+        },
+        {
+          "collapsed": false,
+          "gridPos": {"h": 1, "w": 24, "x": 0, "y": 24},
+          "title": "Logs",
+          "type": "row"
+        },
+        {
+          "datasource": {"type": "victoriametrics-logs-datasource"},
+          "gridPos": {"h": 10, "w": 24, "x": 0, "y": 25},
+          "title": "Recent Errors (all namespaces)",
+          "type": "logs",
+          "targets": [{"expr": "{cluster_environment=~\"$cluster_environment\"} error OR Error OR ERROR OR panic OR PANIC", "refId": "A"}],
+          "options": {"showTime": true, "showLabels": true, "showCommonLabels": false, "wrapLogMessage": true, "prettifyLogMessage": false, "enableLogDetails": true, "sortOrder": "Descending", "dedupStrategy": "none"}
+        }
+      ],
+      "schemaVersion": 39,
+      "tags": ["edp", "platform", "overview"],
+      "templating": {
+        "list": [
+          {
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus"},
+            "definition": "label_values(up, cluster_environment)",
+            "includeAll": true,
+            "multi": true,
+            "name": "cluster_environment",
+            "label": "Environment",
+            "query": "label_values(up, cluster_environment)",
+            "refresh": 2,
+            "sort": 1,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-6h", "to": "now"},
+      "title": "EDP Platform Overview",
+      "uid": "edp-platform-overview"
+    }
--- a/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/victoria-logs.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/grafana-operator/manifests/victoria-logs.yaml
@ -6,4 +6,7 @@ spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
-  url: "https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/refs/heads/master/dashboards/vm/victorialogs.json"
+  folder: "EDP / Operations"
+  grafanaCom:
+    id: 22698
+    revision: 1
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/alerts.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/alerts.yaml
@ -1,40 +1,119 @@
 apiVersion: operator.victoriametrics.com/v1beta1
 kind: VMRule
 metadata:
-  name: forgejo-alerts
+  name: edp-platform-alerts
  namespace: observability
 spec:
  groups:
-    - name: forgejo
+    - name: platform-health
      rules:
-        - alert: forgejo down
-          expr: sum by(cluster_environment) (up{pod=~"forgejo-server-.*"}) < 1
-          for: 30s
+        - alert: ForgejoDown
+          expr: sum by(cluster_environment) (up{job="forgejo-server-http"}) < 1
+          for: 1m
          labels:
            severity: critical
-            job:  "{{ $labels.job }}"
          annotations:
-            value: "{{ $value }}"
-            description: 'forgejo is down in cluster environment {{ $labels.cluster_environment }}'
-    - name: forgejo-backup
-      rules:
-        - alert: forgejo s3 backup job failed
-          expr: max by(cluster_environment) (kube_job_status_failed{job_name=~"forgejo-s3-backup-.*"}) != 0
-          for: 30s
-          labels:
-            severity: critical
-            job:  "{{ $labels.job }}"
-          annotations:
-            value: "{{ $value }}"
-            description: 'forgejo s3 backup job failed in cluster environment {{ $labels.cluster_environment }}'
-    - name: disk-consumption-high
-      rules:
-        - alert: disk consumption high
-          expr: 1-(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.6
-          for: 30s
+            summary: "Forgejo is down on {{ $labels.cluster_environment }}"
+            description: "Forgejo server has been unreachable for more than 1 minute in cluster {{ $labels.cluster_environment }}."
+
+        - alert: IngressHighErrorRate
+          expr: |
+            sum by(cluster_environment) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
+            / sum by(cluster_environment) (rate(nginx_ingress_controller_requests[5m])) > 0.05
+          for: 5m
          labels:
            severity: major
-            job:  "{{ $labels.job }}"
          annotations:
-            value: "{{ $value }}"
-            description: 'disk consumption of pvc {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is high in cluster environment {{ $labels.cluster_environment }}'
+            summary: "High ingress 5xx rate on {{ $labels.cluster_environment }}"
+            description: "More than 5% of ingress requests are returning 5xx errors for over 5 minutes."
+            value: "{{ $value | humanizePercentage }}"
+
+        - alert: NodeNotReady
+          expr: kube_node_status_condition{condition="Ready", status="true"} == 0
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "Node {{ $labels.node }} not ready on {{ $labels.cluster_environment }}"
+            description: "Node {{ $labels.node }} has been in NotReady state for more than 5 minutes."
+
+        - alert: PodCrashLooping
+          expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
+          for: 5m
+          labels:
+            severity: major
+          annotations:
+            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash-looping on {{ $labels.cluster_environment }}"
+            description: "Pod has restarted more than 3 times in the last 15 minutes."
+
+    - name: storage
+      rules:
+        - alert: PVCUsageHigh
+          expr: |
+            1 - (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
+          for: 5m
+          labels:
+            severity: major
+          annotations:
+            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} usage >80%"
+            description: "PVC usage is at {{ $value | humanizePercentage }} on {{ $labels.cluster_environment }}."
+            value: "{{ $value | humanizePercentage }}"
+
+        - alert: PVCUsageCritical
+          expr: |
+            1 - (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.90
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} usage >90%"
+            description: "PVC is almost full at {{ $value | humanizePercentage }} on {{ $labels.cluster_environment }}. Immediate action required."
+            value: "{{ $value | humanizePercentage }}"
+
+    - name: resources
+      rules:
+        - alert: NodeCPUHigh
+          expr: |
+            1 - avg by(instance, cluster_environment) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
+          for: 15m
+          labels:
+            severity: major
+          annotations:
+            summary: "Node {{ $labels.instance }} CPU >85% on {{ $labels.cluster_environment }}"
+            description: "Node CPU utilization has been above 85% for 15 minutes."
+            value: "{{ $value | humanizePercentage }}"
+
+        - alert: NodeMemoryHigh
+          expr: |
+            1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
+          for: 10m
+          labels:
+            severity: major
+          annotations:
+            summary: "Node {{ $labels.instance }} memory >90% on {{ $labels.cluster_environment }}"
+            description: "Node memory utilization has been above 90% for 10 minutes."
+            value: "{{ $value | humanizePercentage }}"
+
+    - name: cluster-health
+      rules:
+        - alert: ClusterMetricsSilent
+          expr: |
+            count(up{job="kubelet"}) by (cluster_environment) < 1
+            or
+            absent(up{job="kubelet", cluster_environment="dev"})
+          for: 10m
+          labels:
+            severity: critical
+          annotations:
+            summary: "Cluster {{ $labels.cluster_environment }} stopped sending metrics"
+            description: "No kubelet metrics received from cluster {{ $labels.cluster_environment }} for over 10 minutes. Either vmagent is dead or the cluster is unreachable."
+
+        - alert: ClusterAPIServerDown
+          expr: |
+            up{job="apiserver", cluster_environment=~".+"} == 0
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "API server down on {{ $labels.cluster_environment }}"
+            description: "Kubernetes API server scrape is failing on cluster {{ $labels.cluster_environment }}."
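
Review note on ClusterMetricsSilent (a hedged sketch, not part of this commit): in PromQL and MetricsQL, `count(...) by (...)` returns no series at all when the selector matches nothing, so the `< 1` branch cannot fire once an environment stops reporting. Only the `absent()` branch can, and it is pinned to `dev`. One possible shape uses one `absent()` per expected environment. The list dev, edp, observability is an assumption taken from the message of commit 01c41c9379, and it is also assumed that each of those environments exposes a `kubelet` job.

```yaml
# Sketch only: the environment list below is an assumption from the commit log,
# not something this diff defines.
- alert: ClusterMetricsSilent
  expr: |
    absent(up{job="kubelet", cluster_environment="dev"})
    or
    absent(up{job="kubelet", cluster_environment="edp"})
    or
    absent(up{job="kubelet", cluster_environment="observability"})
  for: 10m
  labels:
    severity: critical
```

Because `absent()` copies the labels of equality matchers onto its result, `{{ $labels.cluster_environment }}` in the existing annotations would still resolve for each branch. The trade-off is that every new environment has to be added to the expression by hand.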
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/backup-alerts.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/backup-alerts.yaml
@ -0,0 +1,78 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMRule
+metadata:
+  name: backup-alerts
+  namespace: observability
+spec:
+  groups:
+    - name: backup-schedule-staleness
+      rules:
+        - alert: BackupCronJobNotScheduled
+          expr: |
+            time() - kube_cronjob_status_last_schedule_time{cronjob=~"forgejo-s3-backup|secrets-backup", namespace="gitea"}
+            > 26 * 3600
+          for: 5m
+          labels:
+            severity: critical
+            cronjob: "{{ $labels.cronjob }}"
+          annotations:
+            value: "{{ $value | humanizeDuration }}"
+            description: >-
+              CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has not been
+              scheduled for over 26 hours in cluster {{ $labels.cluster_environment }}.
+              Last schedule was {{ $value | humanizeDuration }} ago.
+            summary: "Backup CronJob {{ $labels.cronjob }} is stale"
+
+        - alert: BackupCronJobNeverScheduled
+          expr: |
+            kube_cronjob_status_last_schedule_time{cronjob=~"forgejo-s3-backup|secrets-backup", namespace="gitea"}
+            == 0
+          for: 30m
+          labels:
+            severity: critical
+            cronjob: "{{ $labels.cronjob }}"
+          annotations:
+            description: >-
+              CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has never been
+              scheduled in cluster {{ $labels.cluster_environment }}.
+            summary: "Backup CronJob {{ $labels.cronjob }} never ran"
+
+    - name: backup-job-failures
+      rules:
+        - alert: BackupJobFailed
+          expr: |
+            max by(cluster_environment, namespace, job_name) (
+              kube_job_status_failed{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"}
+            ) > 0
+          for: 30s
+          labels:
+            severity: critical
+            job_name: "{{ $labels.job_name }}"
+          annotations:
+            value: "{{ $value }}"
+            description: >-
+              Backup job {{ $labels.namespace }}/{{ $labels.job_name }} has
+              {{ $value }} failed pod(s) in cluster {{ $labels.cluster_environment }}.
+            summary: "Backup job {{ $labels.job_name }} failed"
+
+    - name: backup-job-duration
+      rules:
+        - alert: BackupJobTooSlow
+          expr: |
+            (
+              time() - kube_job_status_start_time{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"}
+            ) > 300
+            and
+            kube_job_status_active{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"} > 0
+          for: 1m
+          labels:
+            severity: major
+            job_name: "{{ $labels.job_name }}"
+          annotations:
+            value: "{{ $value | humanizeDuration }}"
+            description: >-
+              Backup job {{ $labels.namespace }}/{{ $labels.job_name }} has been
+              running for {{ $value | humanizeDuration }} (threshold: 5m)
+              in cluster {{ $labels.cluster_environment }}. This may indicate a
+              hung process or connectivity issue.
+            summary: "Backup job {{ $labels.job_name }} running too long"
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/garm-scrape.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/garm-scrape.yaml
@ -10,4 +10,4 @@ spec:
    matchLabels:
      app.kubernetes.io/name: garm
  endpoints:
-    - port: metrics
+    - port: http
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/simple-user-secret.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/simple-user-secret.yaml
@ -0,0 +1,9 @@
+apiVersion: v1
+kind: Secret
+metadata:
+  name: simple-user-secret
+  namespace: observability
+type: Opaque
+data:
+  username: c2ltcGxlLXVzZXI=
+  password: c3g1Z0M3b29XYVdPT0R3RA==
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/vmauth.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/manifests/vmauth.yaml
@ -5,13 +5,17 @@ metadata:
  namespace: observability
 spec:
  username: simple-user
-  passwordRef:
-    key: password
-    name: simple-user-secret
+  password: sx5gC7ooWaWOODwD
  targetRefs:
    - static:
        url: http://vmsingle-o12y:8429
      paths: ["/api/v1/write"]
+    - static:
+        url: http://vmsingle-o12y:8429
+      paths: ["/api/v1/.*"]
    - static:
        url: http://vlogs-victorialogs:9428
      paths: ["/insert/elasticsearch/.*"]
+    - static:
+        url: http://vlogs-victorialogs:9428
+      paths: ["/select/.*"]
--- a/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/values.yaml
+++ b/otc/observability.buildth.ing/stacks/observability/victoria-k8s-stack/values.yaml
@ -1,6 +1,6 @@
 global:
  # -- Cluster label to use for dashboards and rules
-  clusterLabel: cluster
+  clusterLabel: cluster_environment
  # -- Global license configuration
  license:
    key: ""
@ -201,13 +201,13 @@ defaultRules:
      enabled: true
      rules: {}
    kubernetesSystemControllerManager:
-      enabled: false
+      create: false
      rules: {}
    kubeScheduler:
-      enabled: false
+      create: false
      rules: {}
    kubernetesSystemScheduler:
-      enabled: false
+      create: false
      rules: {}
    kubeStateMetrics:
      enabled: true
Author	SHA1	Message	Date
Daniel Sy	7a6f96a8b4	feat(observability): ✨ add cluster heartbeat dead-man switch alerts ClusterMetricsSilent: fires if no kubelet metrics for >10m (catches vmagent outages). ClusterAPIServerDown: fires if apiserver scrape fails for >5m. Replaces silenced KubeControllerManagerDown/KubeSchedulerDown which never fire on managed K8s.	2026-06-22 11:05:48 +02:00
Daniel Sy	eda2812d47	fix(observability): 🔇 silence managed-K8s false alerts + bump backup deadline to 4h - Disable kubernetesSystemControllerManager, kubeScheduler, kubernetesSystemScheduler alert rules on dev, benchmark, edp clusters (unreachable on managed K8s) - Bump forgejo s3 backup activeDeadlineSeconds 7200→14400 (2h→4h) across all instances; deadline hit Jun 20-21 on heavy sync	2026-06-22 10:46:01 +02:00
Daniel Sy	3ed3487e97	fix(observability): 🐛 harden vmagent liveness probe failureThreshold 10→3 Silent outage for 72h went undetected due to lenient probe. Add startupProbe (failureThreshold=30) to allow slow starts.	2026-06-22 10:40:49 +02:00
Daniel Sy	01c41c9379	fix(observability): 🐛 use cluster_environment as global clusterLabel for default dashboards Default Victoria Metrics k8s dashboards were filtering on 'cluster' label which only contained 'observability'. Our metrics use 'cluster_environment' label which contains the actual cluster values: dev, edp, observability.	2026-06-22 10:35:08 +02:00
Daniel Sy	3141b7bd6c	feat(observability): ✨ comprehensive platform alert rules Replace ad-hoc forgejo/disk alerts with structured VMRule covering: - platform-health: ForgejoDown, IngressHighErrorRate, NodeNotReady, PodCrashLooping - storage: PVCUsageHigh (>80%), PVCUsageCritical (>90%) - resources: NodeCPUHigh (>85%), NodeMemoryHigh (>90%)	2026-06-19 16:43:28 +02:00
Daniel Sy	70939149ea	feat(observability): ✨ add read routes to vmauth for dev.t09.de instance	2026-06-19 16:37:37 +02:00
Daniel Sy	23edd5d6b4	feat(observability): ✨ add read routes to vmauth for metrics and logs queries	2026-06-19 16:33:07 +02:00
Daniel Sy	0a249820de	fix(observability): 🐛 fix ArgoCD scrape port name http-metrics not metrics	2026-06-19 16:11:15 +02:00
Daniel Sy	f3931dc550	fix(observability): 🐛 add ArgoCD + GARM VMServiceScrapes to dev client stack	2026-06-19 16:07:27 +02:00
Daniel Sy	8488de0c6f	fix(observability): 🐛 use plaintext password in hub VMUser to unblock operator reconciliation The hub VMUser was using passwordRef pointing to simple-user-secret, but that Secret was not present in the cluster (only exists in git now via the previous commit). VM operator skips VMUser reconciliation when passwordRef cannot resolve, leaving vmauth with only the unauthorizedUser catch-all (vmsingle). Switching to inline password ensures immediate operator reconciliation without waiting for Secret deployment. The simple-user-secret.yaml manifest is kept for Vector's credential reference.	2026-06-19 15:45:55 +02:00
Daniel Sy	b1a00d0395	fix(observability): 🐛 add missing simple-user-secret to hub observability stack The hub's VMUser (vmauth.yaml) references simple-user-secret via passwordRef, but the Secret was never added to the hub's manifests. Without this Secret, the VM operator cannot reconcile the VMUser into the vmauth config, causing ALL requests to fall through to the unauthorizedUser catch-all (vmsingle). Result: Vector log shipping to VictoriaLogs was broken — vmauth routed /insert/elasticsearch/_bulk to vmsingle instead of vlogs-victorialogs.	2026-06-19 15:28:14 +02:00
Daniel Sy	4591ee7b14	feat(observability): 🗂️ organize dashboards into Grafana folders Assigns folder field to all GrafanaDashboard CRs: - EDP / Overview: platform-overview - EDP / Applications: forgejo, argocd-operational, garm, argocd - EDP / Operations: cronjob-monitoring, ingress-nginx, victoria-logs	2026-06-19 14:46:41 +02:00
Daniel Sy	7f5c680e19	fix(observability): 🐛 enable GARM unauthenticated metrics + ArgoCD metrics on all instances - GARM dev.t09.de: set garm.metrics.disableAuth=true to unblock Prometheus scraping (was 401) - ArgoCD dev.t09.de: add controller/server/repoServer/applicationSet metrics blocks - ArgoCD edp.buildth.ing: add controller/server/repoServer/applicationSet metrics blocks - ArgoCD benchmark.t09.de: add controller/server/repoServer/applicationSet metrics blocks - observability.buildth.ing already had metrics enabled (no change needed)	2026-06-19 13:36:26 +02:00
Daniel Sy	b6fbd3f6eb	feat(observability): ✨ add VictoriaLogs log panels to platform, forgejo, argocd dashboards	2026-06-19 13:34:12 +02:00
Daniel Sy	bcf583a055	fix(observability): 🐛 fix Vector log shipping URL on all clusters Restores missing '.buildth.ing' domain segment in Vector elasticsearch endpoint for benchmark, dev, and edp instances. Template source uses {{{ .Env.DOMAIN_O12Y }}} (correct) — instances were mis-hydrated, omitting the TLD suffix.	2026-06-19 13:32:23 +02:00
Daniel Sy	238ef71630	fix(observability): 🐛 fix remote write URL and add manifests for benchmark + edp clients - Fix broken remote write URL (o12y.observability./ → o12y.observability.buildth.ing/) - Create manifests/ dirs with .gitkeep for benchmark.t09.de and edp.buildth.ing - Copy forgejo-scrape.yaml VMServiceScrape manifest to both instances	2026-06-19 13:23:50 +02:00
Daniel Sy	076b2a16c9	fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM - Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use type-only datasource so grafana-operator resolves default prometheus DS - Replace grafanaCom id:14279 cronjob dashboard with inline custom version supporting cluster_environment variable (dev/edp/observability) - Add new GARM runners dashboard (edp-garm) ready for when GARM metrics are scraped; uses or vector(0) guards so panels show 0 not empty Note: cluster_environment values confirmed as dev/edp/observability (no benchmark). GARM metrics not yet present in VictoriaMetrics (0 series found).	2026-06-19 13:11:42 +02:00
Daniel Sy	6ea1e798d2	fix(observability): 🐛 add missing manifests to instance stacks - backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack - forgejo-scrape.yaml → dev.t09.de vm-client-stack	2026-06-19 13:06:24 +02:00
Daniel Sy	91db8038e6	feat(observability): ✨ custom ArgoCD dashboard with cluster_environment filter	2026-06-19 13:02:48 +02:00
Daniel Sy	949529eb5c	feat(observability): ✨ add cluster_environment dropdown to Forgejo and platform-overview dashboards - Replace grafanaCom import (17802) with custom inline Forgejo dashboard containing cluster_environment query variable (refresh=2, label=Environment) - Add label, refresh=2, sort=1 to platform-overview cluster_environment variable - ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)	2026-06-19 12:50:32 +02:00
Daniel Sy	c2528f6f69	feat(observability): ✨ add platform grafana dashboard CRs - Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802) - Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993) - Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279) - Add platform-overview.yaml: custom EDP Platform Overview inline dashboard (platform health, forgejo stats, resource usage, backup status rows) - Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698	2026-06-19 12:47:44 +02:00
Daniel Sy	0316eefa43	fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'. This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives because OTC CCE managed k8s does not expose control plane for scraping. Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local vmalert had no cluster_environment label on kube_job_status_failed metrics. Added cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.	2026-06-19 12:42:21 +02:00
Daniel Sy	32e998df5b	fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200 Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when rclone sync took >22m (vs 13-16s prior days). Likely triggered by significant new data in OBS bucket. 2h window accommodates large incremental syncs while BackupJobTooSlow alert still fires at 5m.	2026-06-19 12:35:41 +02:00
Daniel Sy	59eed97263	fix(observability-client): 🐛 fix remote write URL and add missing manifests dir - Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing - Create manifests/ directory with .gitkeep for ArgoCD source path	2026-06-19 11:41:26 +02:00
Daniel Sy	369961a940	fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev - Enable VMAgent (was disabled → no metrics scraped) - Remove disable_login from Grafana config; add security block so operator can auth via API - Disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev)	2026-06-19 10:44:34 +02:00
Daniel Sy	d83945413d	fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not VLSingle. The VLSingle kind was introduced in a newer operator version and is not registered in this chart release. Changing to VLogs which has identical spec fields (retentionPeriod, removePvcAfterDelete, storage, storageMetadata, resources all supported).	2026-06-19 10:20:19 +02:00
Daniel Sy	ef4a1d7ce2	fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD sync. Dev cluster nodes are at 99% CPU / pod limit — hook pod cannot be scheduled, blocking the entire sync indefinitely. Disabling cleanup.enabled prevents the hook Job from being created. CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.	2026-06-19 09:58:55 +02:00
Daniel Sy	29c0a59734	fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions VLSingle CRD missing at sync time — ArgoCD pre-validates all resources before applying any, causing 'synchronization tasks not valid' on CRs whose CRDs are created by the operator in the same sync wave. SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs, unblocking the CRD bootstrap deadlock.	2026-06-19 09:56:24 +02:00
Daniel Sy	a52a6691a8	fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app. Adds prune: true and RespectIgnoreDifferences=true to prevent sync failures when CRs are applied before CRDs are established.	2026-06-19 09:52:01 +02:00