Compare commits

...

29 commits

Author SHA1 Message Date
7a6f96a8b4
feat(observability): add cluster heartbeat dead-man switch alerts
ClusterMetricsSilent: fires if no kubelet metrics for >10m (catches vmagent outages).
ClusterAPIServerDown: fires if apiserver scrape fails for >5m.
Replaces silenced KubeControllerManagerDown/KubeSchedulerDown which never fire on managed K8s.
2026-06-22 11:05:48 +02:00
eda2812d47
fix(observability): 🔇 silence managed-K8s false alerts + bump backup deadline to 4h
- Disable kubernetesSystemControllerManager, kubeScheduler, kubernetesSystemScheduler
  alert rules on dev, benchmark, edp clusters (unreachable on managed K8s)
- Bump forgejo s3 backup activeDeadlineSeconds 7200→14400 (2h→4h) across
  all instances; deadline hit Jun 20-21 on heavy sync
2026-06-22 10:46:01 +02:00
3ed3487e97
fix(observability): 🐛 harden vmagent liveness probe failureThreshold 10→3
Silent outage for 72h went undetected due to lenient probe.
Add startupProbe (failureThreshold=30) to allow slow starts.
2026-06-22 10:40:49 +02:00
01c41c9379
fix(observability): 🐛 use cluster_environment as global clusterLabel for default dashboards
Default Victoria Metrics k8s dashboards were filtering on 'cluster' label
which only contained 'observability'. Our metrics use 'cluster_environment'
label which contains the actual cluster values: dev, edp, observability.
2026-06-22 10:35:08 +02:00
3141b7bd6c
feat(observability): comprehensive platform alert rules
Replace ad-hoc forgejo/disk alerts with structured VMRule covering:
- platform-health: ForgejoDown, IngressHighErrorRate, NodeNotReady, PodCrashLooping
- storage: PVCUsageHigh (>80%), PVCUsageCritical (>90%)
- resources: NodeCPUHigh (>85%), NodeMemoryHigh (>90%)
2026-06-19 16:43:28 +02:00
70939149ea
feat(observability): add read routes to vmauth for dev.t09.de instance 2026-06-19 16:37:37 +02:00
23edd5d6b4
feat(observability): add read routes to vmauth for metrics and logs queries 2026-06-19 16:33:07 +02:00
0a249820de
fix(observability): 🐛 fix ArgoCD scrape port name http-metrics not metrics 2026-06-19 16:11:15 +02:00
f3931dc550
fix(observability): 🐛 add ArgoCD + GARM VMServiceScrapes to dev client stack 2026-06-19 16:07:27 +02:00
8488de0c6f
fix(observability): 🐛 use plaintext password in hub VMUser to unblock operator reconciliation
The hub VMUser was using passwordRef pointing to simple-user-secret, but that
Secret was not present in the cluster (only exists in git now via the previous
commit). VM operator skips VMUser reconciliation when passwordRef cannot resolve,
leaving vmauth with only the unauthorizedUser catch-all (vmsingle).

Switching to inline password ensures immediate operator reconciliation without
waiting for Secret deployment. The simple-user-secret.yaml manifest is kept for
Vector's credential reference.
2026-06-19 15:45:55 +02:00
b1a00d0395
fix(observability): 🐛 add missing simple-user-secret to hub observability stack
The hub's VMUser (vmauth.yaml) references simple-user-secret via passwordRef,
but the Secret was never added to the hub's manifests. Without this Secret,
the VM operator cannot reconcile the VMUser into the vmauth config, causing
ALL requests to fall through to the unauthorizedUser catch-all (vmsingle).

Result: Vector log shipping to VictoriaLogs was broken — vmauth routed
/insert/elasticsearch/_bulk to vmsingle instead of vlogs-victorialogs.
2026-06-19 15:28:14 +02:00
4591ee7b14
feat(observability): 🗂️ organize dashboards into Grafana folders
Assigns folder field to all GrafanaDashboard CRs:
- EDP / Overview: platform-overview
- EDP / Applications: forgejo, argocd-operational, garm, argocd
- EDP / Operations: cronjob-monitoring, ingress-nginx, victoria-logs
2026-06-19 14:46:41 +02:00
7f5c680e19
fix(observability): 🐛 enable GARM unauthenticated metrics + ArgoCD metrics on all instances
- GARM dev.t09.de: set garm.metrics.disableAuth=true to unblock Prometheus scraping (was 401)
- ArgoCD dev.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD edp.buildth.ing: add controller/server/repoServer/applicationSet metrics blocks
- ArgoCD benchmark.t09.de: add controller/server/repoServer/applicationSet metrics blocks
- observability.buildth.ing already had metrics enabled (no change needed)
2026-06-19 13:36:26 +02:00
b6fbd3f6eb
feat(observability): add VictoriaLogs log panels to platform, forgejo, argocd dashboards 2026-06-19 13:34:12 +02:00
bcf583a055
fix(observability): 🐛 fix Vector log shipping URL on all clusters
Restores missing '.buildth.ing' domain segment in Vector elasticsearch
endpoint for benchmark, dev, and edp instances.

Template source uses {{{ .Env.DOMAIN_O12Y }}} (correct); the instances
were mis-hydrated, omitting the '.buildth.ing' domain suffix.
2026-06-19 13:32:23 +02:00
238ef71630
fix(observability): 🐛 fix remote write URL and add manifests for benchmark + edp clients
- Fix broken remote write URL (o12y.observability./ → o12y.observability.buildth.ing/)
- Create manifests/ dirs with .gitkeep for benchmark.t09.de and edp.buildth.ing
- Copy forgejo-scrape.yaml VMServiceScrape manifest to both instances
2026-06-19 13:23:50 +02:00
076b2a16c9
fix(observability): 🐛 fix datasource UIDs, replace cronjob dashboard, add GARM
- Remove all ${DS_VICTORIAMETRICS} uid refs from platform-overview; use
  type-only datasource so grafana-operator resolves default prometheus DS
- Replace grafanaCom id:14279 cronjob dashboard with inline custom version
  supporting cluster_environment variable (dev/edp/observability)
- Add new GARM runners dashboard (edp-garm) ready for when GARM metrics
  are scraped; uses or vector(0) guards so panels show 0 not empty

Note: cluster_environment values confirmed as dev/edp/observability (no benchmark).
GARM metrics not yet present in VictoriaMetrics (0 series found).
2026-06-19 13:11:42 +02:00
6ea1e798d2
fix(observability): 🐛 add missing manifests to instance stacks
- backup-alerts.yaml → observability.buildth.ing victoria-k8s-stack
- forgejo-scrape.yaml → dev.t09.de vm-client-stack
2026-06-19 13:06:24 +02:00
91db8038e6
feat(observability): custom ArgoCD dashboard with cluster_environment filter 2026-06-19 13:02:48 +02:00
949529eb5c
feat(observability): add cluster_environment dropdown to Forgejo and platform-overview dashboards
- Replace grafanaCom import (17802) with custom inline Forgejo dashboard
  containing cluster_environment query variable (refresh=2, label=Environment)
- Add label, refresh=2, sort=1 to platform-overview cluster_environment variable
- ArgoCD (19993) and CronJob (14279) remain grafanaCom imports (acceptable)
2026-06-19 12:50:32 +02:00
c2528f6f69
feat(observability): add platform grafana dashboard CRs
- Add forgejo.yaml: Forgejo app dashboard (grafana.com ID 17802)
- Add argocd-operational.yaml: ArgoCD operational dashboard (grafana.com ID 19993)
- Add cronjob-monitoring.yaml: CronJob/backup monitoring dashboard (grafana.com ID 14279)
- Add platform-overview.yaml: custom EDP Platform Overview inline dashboard
  (platform health, forgejo stats, resource usage, backup status rows)
- Fix victoria-logs.yaml: replace broken URL with grafanaCom ID 22698
2026-06-19 12:47:44 +02:00
0316eefa43
fix(observability): 🐛 disable false-positive control-plane alerts and fix empty cluster_environment label
Hub defaultRules groups kubernetesSystemControllerManager, kubeScheduler, and
kubernetesSystemScheduler used wrong key 'enabled: false' — chart expects 'create: false'.
This caused KubeControllerManagerDown/KubeSchedulerDown to fire as false positives
because OTC CCE managed k8s does not expose control plane for scraping.

Dev local vmagent had empty externalLabels, so backup-alert rules evaluated by local
vmalert had no cluster_environment label on kube_job_status_failed metrics. Added
cluster_environment=dev to match what the vm-client-stack vmagent adds for hub shipping.
2026-06-19 12:42:21 +02:00
32e998df5b
fix(forgejo): ⏱️ increase s3-backup activeDeadlineSeconds 1350→7200
Previous 22.5m deadline caused DeadlineExceeded on 2026-06-19 when
rclone sync took >22m (vs 13-16s prior days). Likely triggered by
significant new data in OBS bucket. 2h window accommodates large
incremental syncs while BackupJobTooSlow alert still fires at 5m.
2026-06-19 12:35:41 +02:00
59eed97263
fix(observability-client): 🐛 fix remote write URL and add missing manifests dir
- Fix broken remote write URL: o12y.observability. → o12y.observability.buildth.ing
- Create manifests/ directory with .gitkeep for ArgoCD source path
2026-06-19 11:41:26 +02:00
369961a940
fix(observability): 🐛 enable vmagent, fix grafana auth, disable vmauth on dev
- Enable VMAgent (was disabled → no metrics scraped)
- Remove disable_login from Grafana config; add security block so operator can auth via API
- Disable VMAuth (invalid trailing-dot hostname o12y.observability.; not needed on dev)
2026-06-19 10:44:34 +02:00
d83945413d
fix(observability): 🐛 change VLSingle → VLogs in victorialogs manifest
Chart 0.48.1 / operator v0.58.0 uses VLogs CRD for VictoriaLogs, not
VLSingle. The VLSingle kind was introduced in a newer operator version
and is not registered in this chart release. Switch to VLogs, which
supports the same spec fields (retentionPeriod, removePvcAfterDelete,
storage, storageMetadata, resources).
2026-06-19 10:20:19 +02:00
ef4a1d7ce2
fix(observability): 🐛 disable crds.cleanup hook in victoria-metrics-operator
Pre-upgrade cleanup hook uses bitnami/kubectl and spawns on every ArgoCD
sync. Dev cluster nodes are at 99% CPU / pod limit — hook pod cannot be
scheduled, blocking the entire sync indefinitely.

Disabling cleanup.enabled prevents the hook Job from being created.
CRD cleanup is safe to skip on a fresh bootstrap where no old CRDs exist.
2026-06-19 09:58:55 +02:00
29c0a59734
fix(observability): 🐛 add SkipDryRunOnMissingResource to o12y syncOptions
VLSingle CRD missing at sync time — ArgoCD pre-validates all resources
before applying any, causing 'synchronization tasks not valid' on CRs
whose CRDs are created by the operator in the same sync wave.
SkipDryRunOnMissingResource=true bypasses dry-run for missing CRDs,
unblocking the CRD bootstrap deadlock.
2026-06-19 09:56:24 +02:00
a52a6691a8
fix(observability): 🐛 add prune + RespectIgnoreDifferences to o12y syncPolicy
Fix CRD bootstrap deadlock on victoria-metrics-k8s-stack ArgoCD app.
Adds prune: true and RespectIgnoreDifferences=true to prevent sync
failures when CRs are applied before CRDs are established.
2026-06-19 09:52:01 +02:00
42 changed files with 1243 additions and 71 deletions
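
The alert rules named in commits 7a6f96a8b4 and 3141b7bd6c are not part of the hunks shown here. As a rough sketch only, rules of that kind might be expressed in a VMRule like the one below. The resource name, the `job` label values (`kubelet`, `apiserver`), the `cluster_environment="dev"` matcher, the severities, and the `for` durations on the storage and resource rules are assumptions for illustration, not the committed manifest.

```yaml
# Hypothetical sketch; not the manifest from the commits above.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: platform-alerts-sketch        # hypothetical name
  namespace: observability
spec:
  groups:
    - name: cluster-heartbeat
      rules:
        # Dead-man switch: absent() returns 1 only when no matching series exists,
        # so this fires when kubelet scrape data stops arriving entirely.
        - alert: ClusterMetricsSilent
          expr: 'absent(up{job="kubelet", cluster_environment="dev"})'   # label values assumed
          for: 10m
          labels:
            severity: critical        # assumed
        # Fires when no apiserver target reports up == 1.
        - alert: ClusterAPIServerDown
          expr: 'absent(up{job="apiserver", cluster_environment="dev"} == 1)'   # label values assumed
          for: 5m
          labels:
            severity: critical        # assumed
    - name: storage
      rules:
        - alert: PVCUsageHigh
          expr: 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80'
          for: 15m                    # assumed
          labels:
            severity: warning         # assumed
        - alert: PVCUsageCritical
          expr: 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.90'
          for: 5m                     # assumed
          labels:
            severity: critical        # assumed
    - name: resources
      rules:
        - alert: NodeCPUHigh
          expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85'
          for: 15m                    # assumed
        - alert: NodeMemoryHigh
          expr: '1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.90'
          for: 15m                    # assumed
```

Because `absent()` with equality matchers only covers the label set it names, a hub evaluating several environments would likely need one heartbeat rule per `cluster_environment` value.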

View file

@ -35,6 +35,30 @@ configs:
   tls:
     certificates:
+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
 notifications:
   enabled: false

View file

@ -11,8 +11,8 @@ spec:
   startingDeadlineSeconds: 600 # 10 minutes
   jobTemplate:
     spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 2h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
       backoffLimit: 2
       ttlSecondsAfterFinished: 259200 #
       template:
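
For orientation, the timing fields in this hunk can be read as in the sketch below. This is a hypothetical CronJob: the name, schedule, image, and arguments are placeholders, and only the four timing values are taken from the diff and the commit messages.

```yaml
# Hypothetical sketch; only the timing values come from the change set.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-backup-sketch              # hypothetical name
spec:
  schedule: "0 2 * * *"               # placeholder schedule
  startingDeadlineSeconds: 600        # 10 min: latest allowed start after the scheduled time
  jobTemplate:
    spec:
      # 14400 s = 4 h. History per the commits: 1350 s (22.5 min) -> 7200 s (2 h) -> 14400 s (4 h).
      # The deadline covers the whole Job, including retries, and takes precedence over backoffLimit.
      activeDeadlineSeconds: 14400
      backoffLimit: 2                 # at most 2 retries within that window
      ttlSecondsAfterFinished: 259200 # 259200 s = 72 h = 3 days before the finished Job is cleaned up
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: example.invalid/backup:placeholder   # placeholder image
              args: ["sync"]                              # placeholder arguments
```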

View file

@ -48,7 +48,7 @@ customConfig:
       type: elasticsearch
       inputs: [parser]
       endpoints:
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
       auth:
         strategy: basic
         user: ${VECTOR_USER}

View file

@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics

View file

@ -201,13 +201,13 @@ defaultRules:
       create: true
       rules: {}
     kubernetesSystemControllerManager:
-      create: true
+      create: false
       rules: {}
     kubeScheduler:
-      create: true
+      create: false
       rules: {}
     kubernetesSystemScheduler:
-      create: true
+      create: false
       rules: {}
     kubeStateMetrics:
       create: true
@ -778,7 +778,7 @@ vmagent:
   # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
   additionalRemoteWrites:
     # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
       basicAuth:
         username:
           name: simple-user-secret

View file

@ -35,6 +35,30 @@ configs:
   tls:
     certificates:
+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
 notifications:
   enabled: false

View file

@ -11,8 +11,8 @@ spec:
   startingDeadlineSeconds: 600 # 10 minutes
   jobTemplate:
     spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 2h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
       backoffLimit: 2
       ttlSecondsAfterFinished: 259200 #
       template:

View file

@ -41,5 +41,8 @@ providerConfig:
sidecarImage: edp.buildth.ing/devfw-cicd/ci-sizer-collector:0.0.4
garm:
metrics:
enable: true
disableAuth: true
logging:
logLevel: info

View file

@ -48,7 +48,7 @@ customConfig:
       type: elasticsearch
       inputs: [parser]
       endpoints:
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
       auth:
         strategy: basic
         user: ${VECTOR_USER}

View file

@ -0,0 +1,14 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: argocd
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - argocd
+  selector:
+    matchLabels:
+      app.kubernetes.io/part-of: argocd
+  endpoints:
+    - port: http-metrics
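
A VMServiceScrape endpoint selects its target by the name of the Service port, which is why the scrape returned nothing while it referenced `metrics` instead of `http-metrics`. The Service below is a hypothetical example of what this scrape would match; the Service name, pod selector, and port number are assumptions, and only the namespace, the `part-of` label, and the port name come from the manifest above.

```yaml
# Hypothetical Service that the argocd VMServiceScrape would select.
apiVersion: v1
kind: Service
metadata:
  name: argocd-metrics-sketch                 # hypothetical name
  namespace: argocd                           # must be listed in namespaceSelector.matchNames
  labels:
    app.kubernetes.io/part-of: argocd         # must match selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: argocd-application-controller   # assumed pod label
  ports:
    - name: http-metrics                      # must equal endpoints[].port in the scrape
      port: 8082                              # assumed port number
      targetPort: 8082                        # assumed
```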

View file

@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics

View file

@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: garm
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - garm
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: garm
+  endpoints:
+    - port: http
+      path: /metrics

View file

@ -201,13 +201,13 @@ defaultRules:
       create: true
       rules: {}
     kubernetesSystemControllerManager:
-      create: true
+      create: false
       rules: {}
     kubeScheduler:
-      create: true
+      create: false
       rules: {}
     kubernetesSystemScheduler:
-      create: true
+      create: false
       rules: {}
     kubeStateMetrics:
       create: true
@ -778,7 +778,7 @@ vmagent:
   # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
   additionalRemoteWrites:
     # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
       basicAuth:
         username:
           name: simple-user-secret
@ -801,6 +801,20 @@ vmagent:
       # Do not store original labels in vmagent's memory by default. This reduces the amount of memory used by vmagent
       # but makes vmagent debugging UI less informative. See: https://docs.victoriametrics.com/vmagent/#relabel-debug
       promscrape.dropOriginalLabels: "true"
+    # Harden liveness probe: default failureThreshold=10 masked a 72h silent outage
+    livenessProbe:
+      httpGet:
+        path: /health
+        port: http
+      failureThreshold: 3
+      periodSeconds: 5
+      timeoutSeconds: 5
+    startupProbe:
+      httpGet:
+        path: /health
+        port: http
+      failureThreshold: 30
+      periodSeconds: 5
   # -- (object) VMAgent ingress configuration
   ingress:
     enabled: false

View file

@ -35,8 +35,10 @@ spec:
     server:
       root_url: "https://grafana.dev.t09.de"
     auth:
-      disable_login: "true"
       disable_login_form: "true"
+    security:
+      admin_user: admin
+      admin_password: admin
     auth.generic_oauth:
       enabled: "true"
       name: Forgejo

View file

@ -9,10 +9,13 @@ spec:
   project: default
   syncPolicy:
     automated:
+      prune: true
       selfHeal: true
     syncOptions:
       - CreateNamespace=true
       - ServerSideApply=true
+      - RespectIgnoreDifferences=true
+      - SkipDryRunOnMissingResource=true
   destination:
     name: in-cluster
     namespace: observability
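
Argo CD also accepts `SkipDryRunOnMissingResource=true` per resource through the `argocd.argoproj.io/sync-options` annotation, which limits the dry-run bypass to the custom resources whose CRD may not exist yet. The sketch below applies it to the VictoriaLogs CR from this change set as an alternative to the application-wide option; the `retentionPeriod` value is a placeholder and the change set itself uses the application-level setting shown above.

```yaml
# Sketch of the per-resource variant; not what the commits apply.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VLogs
metadata:
  name: victorialogs
  namespace: observability
  annotations:
    # Skip the dry run only for this resource when its CRD is not yet registered.
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  retentionPeriod: "7d"   # placeholder value
```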

View file

@ -11,4 +11,4 @@ spec:
     matchLabels:
       app.kubernetes.io/name: garm
   endpoints:
-    - port: metrics
+    - port: http

View file

@ -1,5 +1,5 @@
 apiVersion: operator.victoriametrics.com/v1beta1
-kind: VLSingle
+kind: VLogs
 metadata:
   name: victorialogs
   namespace: observability

View file

@ -12,6 +12,12 @@ spec:
- static:
url: http://vmsingle-o12y:8429
paths: ["/api/v1/write"]
- static:
url: http://vmsingle-o12y:8429
paths: ["/api/v1/.*"]
- static:
url: http://vlogs-victorialogs:9428
paths: ["/insert/elasticsearch/.*"]
- static:
url: http://vlogs-victorialogs:9428
paths: ["/select/.*"]
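
Commits b1a00d0395 and 8488de0c6f describe two ways to give the VMUser behind these routes its password. The sketch below shows both side by side; the resource name, `username` value, and Secret key are assumptions, and the inline password is a placeholder.

```yaml
# Hypothetical sketch of the two credential variants discussed in the commit messages.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMUser
metadata:
  name: simple-user-sketch            # hypothetical name
  namespace: observability
spec:
  username: simple-user               # assumed value
  # Variant A: read the password from a Secret. Per commit 8488de0c6f, the operator
  # skipped reconciling the VMUser while this Secret was missing from the cluster.
  passwordRef:
    name: simple-user-secret
    key: password                     # assumed key name
  # Variant B: inline password, which needs no Secret but stores the credential in git.
  # password: "<placeholder>"
  targetRefs:
    - static:
        url: http://vmsingle-o12y:8429
      paths: ["/api/v1/write"]
    - static:
        url: http://vlogs-victorialogs:9428
      paths: ["/insert/elasticsearch/.*"]
```

Only one of `passwordRef` and `password` should be set at a time; Variant B is left commented out for that reason.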

View file

@ -28,10 +28,7 @@ victoria-metrics-operator:
   crds:
     plain: true
     cleanup:
-      enabled: true
-      image:
-        repository: bitnami/kubectl
-        pullPolicy: IfNotPresent
+      enabled: false # disabled: cleanup hook can't schedule on resource-constrained nodes (Insufficient cpu / Too many pods)
   serviceMonitor:
     enabled: true
   operator:
@ -676,7 +673,7 @@ vmalert:
 vmauth:
   # -- Enable VMAuth CR
-  enabled: true
+  enabled: false
   # -- VMAuth annotations
   annotations: {}
   # -- (object) Full spec for VMAuth CRD. Allowed values described [here](https://docs.victoriametrics.com/operator/api#vmauthspec)
@ -699,7 +696,7 @@ vmauth:
 vmagent:
   # -- Create VMAgent CR
-  enabled: false
+  enabled: true
   # -- VMAgent annotations
   annotations: {}
   # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
@ -711,7 +708,8 @@ vmagent:
     port: "8429"
     selectAllByDefault: true
     scrapeInterval: 20s
-    externalLabels: {}
+    externalLabels:
+      cluster_environment: "dev"
       # For multi-cluster setups it is useful to use "cluster" label to identify the metrics source.
       # For example:
       # cluster: cluster-name

View file

@ -35,6 +35,30 @@ configs:
   tls:
     certificates:
+controller:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+server:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+repoServer:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
+applicationSet:
+  metrics:
+    enabled: true
+    serviceMonitor:
+      enabled: false
 notifications:
   enabled: false

View file

@ -11,8 +11,8 @@ spec:
   startingDeadlineSeconds: 600 # 10 minutes
   jobTemplate:
     spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 2h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
       backoffLimit: 2
       ttlSecondsAfterFinished: 259200 #
       template:

View file

@ -48,7 +48,7 @@ customConfig:
       type: elasticsearch
       inputs: [parser]
       endpoints:
-        - https://o12y.observability./insert/elasticsearch/
+        - https://o12y.observability.buildth.ing/insert/elasticsearch/
       auth:
         strategy: basic
         user: ${VECTOR_USER}

View file

@ -0,0 +1,15 @@
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMServiceScrape
+metadata:
+  name: forgejo
+  namespace: observability
+spec:
+  namespaceSelector:
+    matchNames:
+      - gitea
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: forgejo
+  endpoints:
+    - port: http
+      path: /metrics

View file

@ -201,13 +201,13 @@ defaultRules:
       create: true
       rules: {}
     kubernetesSystemControllerManager:
-      create: true
+      create: false
       rules: {}
     kubeScheduler:
-      create: true
+      create: false
       rules: {}
     kubernetesSystemScheduler:
-      create: true
+      create: false
       rules: {}
     kubeStateMetrics:
       create: true
@ -778,7 +778,7 @@ vmagent:
   # -- Remote write configuration of VMAgent, allowed parameters defined in a [spec](https://docs.victoriametrics.com/operator/api#vmagentremotewritespec)
   additionalRemoteWrites:
     # []
-    - url: https://o12y.observability./api/v1/write
+    - url: https://o12y.observability.buildth.ing/api/v1/write
       basicAuth:
         username:
           name: simple-user-secret

View file

@ -11,8 +11,8 @@ spec:
   startingDeadlineSeconds: 600 # 10 minutes
   jobTemplate:
     spec:
-      # 60 min until backup - 10 min start - (backoffLimit * activeDeadlineSeconds) - some time sync buffer
-      activeDeadlineSeconds: 1350
+      # 2h window: handles large incremental syncs after repo growth or OBS slowness; BackupJobTooSlow alert fires at 5m
+      activeDeadlineSeconds: 14400
       backoffLimit: 2
       ttlSecondsAfterFinished: 259200 #
       template:

View file

@ -0,0 +1,153 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: argocd-operational
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Applications"
json: |
{
"annotations": {"list": []},
"editable": true,
"graphTooltip": 1,
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"title": "Application Status",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
"title": "Total Apps",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
"title": "Healthy",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Healthy\"})", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
"title": "Degraded",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Degraded\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
"title": "Synced",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", sync_status=\"Synced\"})", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "yellow", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
"title": "OutOfSync",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", sync_status=\"OutOfSync\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "orange", "value": null}]}}},
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
"title": "Progressing",
"type": "stat",
"targets": [{"expr": "count(argocd_app_info{cluster_environment=~\"$cluster_environment\", health_status=\"Progressing\"}) or vector(0)", "legendFormat": ""}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
"title": "Application Details",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {"custom": {"filterable": true}},
"overrides": [
{"matcher": {"id": "byName", "options": "Health"}, "properties": [{"id": "custom.cellOptions", "value": {"type": "color-text"}}, {"id": "mappings", "value": [{"options": {"Healthy": {"color": "green", "text": "Healthy"}, "Degraded": {"color": "red", "text": "Degraded"}, "Progressing": {"color": "yellow", "text": "Progressing"}, "Missing": {"color": "purple", "text": "Missing"}}, "type": "value"}]}]},
{"matcher": {"id": "byName", "options": "Sync"}, "properties": [{"id": "custom.cellOptions", "value": {"type": "color-text"}}, {"id": "mappings", "value": [{"options": {"Synced": {"color": "green", "text": "Synced"}, "OutOfSync": {"color": "orange", "text": "OutOfSync"}}, "type": "value"}]}]}
]
},
"gridPos": {"h": 12, "w": 24, "x": 0, "y": 6},
"title": "All Applications",
"type": "table",
"targets": [{"expr": "argocd_app_info{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true, "legendFormat": ""}],
"transformations": [
{"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "name", "dest_namespace", "health_status", "sync_status", "repo"]}}},
{"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "name": "Application", "dest_namespace": "Namespace", "health_status": "Health", "sync_status": "Sync", "repo": "Repository"}}}
]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 18},
"title": "Sync Activity",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "ops"}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 19},
"title": "Sync Operations (rate)",
"type": "timeseries",
"targets": [{"expr": "sum(rate(argocd_app_sync_total{cluster_environment=~\"$cluster_environment\"}[5m])) by (name, phase)", "legendFormat": "{{name}} ({{phase}})"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "ops"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 19},
"title": "Reconciliation Rate",
"type": "timeseries",
"targets": [{"expr": "sum(rate(argocd_app_reconcile_count{cluster_environment=~\"$cluster_environment\"}[5m])) by (namespace)", "legendFormat": "{{namespace}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 27},
"title": "ArgoCD Logs",
"type": "row"
},
{
"datasource": {"type": "victoriametrics-logs-datasource"},
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 28},
"title": "ArgoCD Logs",
"type": "logs",
"targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"argocd\"}", "refId": "A"}],
"options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
}
],
"schemaVersion": 39,
"tags": ["edp", "argocd", "gitops"],
"templating": {
"list": [
{
"current": {"selected": true, "text": "All", "value": "$__all"},
"datasource": {"type": "prometheus"},
"definition": "label_values(argocd_app_info, cluster_environment)",
"includeAll": true,
"multi": true,
"name": "cluster_environment",
"label": "Environment",
"query": "label_values(argocd_app_info, cluster_environment)",
"refresh": 2,
"sort": 1,
"type": "query"
}
]
},
"time": {"from": "now-6h", "to": "now"},
"title": "ArgoCD Operations",
"uid": "edp-argocd-ops"
}

View file

@ -6,4 +6,5 @@ spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Applications"
url: "https://raw.githubusercontent.com/argoproj/argo-cd/refs/heads/master/examples/dashboard.json"

View file

@ -0,0 +1,103 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: cronjob-monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Operations"
json: |
{
"annotations": {"list": []},
"editable": true,
"graphTooltip": 1,
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"title": "Backup Job Status",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "s", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}}},
"gridPos": {"h": 5, "w": 12, "x": 0, "y": 1},
"title": "Time Since Last Schedule",
"type": "stat",
"targets": [{"expr": "time() - kube_cronjob_status_last_schedule_time{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cronjob}} ({{cluster_environment}})"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}}},
"gridPos": {"h": 5, "w": 12, "x": 12, "y": 1},
"title": "Failed Jobs (Active)",
"type": "stat",
"targets": [{"expr": "sum by(cluster_environment, job_name) (kube_job_status_failed{cluster_environment=~\"$cluster_environment\"}) > 0", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 6},
"title": "CronJob Overview",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"custom": {"filterable": true}}, "overrides": [{"matcher": {"id": "byName", "options": "Suspended"}, "properties": [{"id": "mappings", "value": [{"options": {"0": {"text": "No", "color": "green"}, "1": {"text": "YES", "color": "red"}}, "type": "value"}]}]}]},
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 7},
"title": "All CronJobs",
"type": "table",
"targets": [
{"expr": "kube_cronjob_info{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true, "refId": "A"}
],
"transformations": [
{"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "cronjob", "namespace", "schedule"]}}},
{"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "cronjob": "CronJob", "namespace": "Namespace", "schedule": "Schedule"}}}
]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
"title": "Job History",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"title": "Job Completions (24h)",
"type": "timeseries",
"targets": [{"expr": "sum(kube_job_status_succeeded{cluster_environment=~\"$cluster_environment\"}) by (job_name, cluster_environment)", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "color": {"mode": "palette-classic"}}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"title": "Job Failures (24h)",
"type": "timeseries",
"targets": [{"expr": "sum(kube_job_status_failed{cluster_environment=~\"$cluster_environment\"}) by (job_name, cluster_environment)", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
}
],
"schemaVersion": 39,
"tags": ["edp", "backup", "cronjob"],
"templating": {
"list": [
{
"current": {"selected": true, "text": "All", "value": "$__all"},
"datasource": {"type": "prometheus"},
"definition": "label_values(kube_cronjob_info, cluster_environment)",
"includeAll": true,
"multi": true,
"name": "cluster_environment",
"label": "Environment",
"query": "label_values(kube_cronjob_info, cluster_environment)",
"refresh": 2,
"sort": 1,
"type": "query"
}
]
},
"time": {"from": "now-24h", "to": "now"},
"title": "CronJob & Backup Monitoring",
"uid": "edp-cronjobs"
}

View file

@ -0,0 +1,207 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: forgejo
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Applications"
json: |
{
"annotations": {"list": []},
"editable": true,
"graphTooltip": 1,
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"title": "Forgejo Health",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}], "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}}},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
"title": "Status",
"type": "stat",
"targets": [{"expr": "up{job=\"forgejo-server-http\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
"title": "Version",
"type": "stat",
"targets": [{"expr": "gitea_build_info{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{version}}"}],
"options": {"reduceOptions": {"calcs": ["lastNotNull"]}, "textMode": "name"}
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
"title": "Repositories",
"type": "stat",
"targets": [{"expr": "gitea_repositories{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
"title": "Users",
"type": "stat",
"targets": [{"expr": "gitea_users{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
"title": "Organizations",
"type": "stat",
"targets": [{"expr": "gitea_organizations{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
"title": "Teams",
"type": "stat",
"targets": [{"expr": "gitea_teams{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
"title": "Activity",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 6},
"title": "Open Issues",
"type": "stat",
"targets": [{"expr": "gitea_issues_open{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 6},
"title": "Closed Issues",
"type": "stat",
"targets": [{"expr": "gitea_issues_closed{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 6},
"title": "Webhooks",
"type": "stat",
"targets": [{"expr": "gitea_webhooks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 6},
"title": "Hook Tasks",
"type": "stat",
"targets": [{"expr": "gitea_hooktasks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 10},
"title": "Content & Auth",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 11},
"title": "Stars",
"type": "stat",
"targets": [{"expr": "gitea_stars{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 11},
"title": "Watches",
"type": "stat",
"targets": [{"expr": "gitea_watches{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 11},
"title": "Releases",
"type": "stat",
"targets": [{"expr": "gitea_releases{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 11},
"title": "Mirrors",
"type": "stat",
"targets": [{"expr": "gitea_mirrors{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 11},
"title": "Public Keys",
"type": "stat",
"targets": [{"expr": "gitea_publickeys{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 11},
"title": "OAuth Apps",
"type": "stat",
"targets": [{"expr": "gitea_oauths{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
"title": "Forgejo Logs",
"type": "row"
},
{
"datasource": {"type": "victoriametrics-logs-datasource"},
"gridPos": {"h": 10, "w": 12, "x": 0, "y": 16},
"title": "Forgejo Server Logs",
"type": "logs",
"targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"gitea\"}", "refId": "A"}],
"options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
},
{
"datasource": {"type": "victoriametrics-logs-datasource"},
"gridPos": {"h": 10, "w": 12, "x": 12, "y": 16},
"title": "Forgejo Errors",
"type": "logs",
"targets": [{"expr": "{cluster_environment=~\"$cluster_environment\", kubernetes.namespace=\"gitea\"} error OR Error OR ERROR OR panic", "refId": "A"}],
"options": {"showTime": true, "showLabels": true, "wrapLogMessage": true, "enableLogDetails": true, "sortOrder": "Descending"}
}
],
"schemaVersion": 39,
"tags": ["edp", "forgejo", "gitea"],
"templating": {
"list": [
{
"current": {"selected": true, "text": "All", "value": "$__all"},
"datasource": {"type": "prometheus"},
"definition": "label_values(gitea_repositories, cluster_environment)",
"includeAll": true,
"multi": true,
"name": "cluster_environment",
"label": "Environment",
"query": "label_values(gitea_repositories, cluster_environment)",
"refresh": 2,
"type": "query"
}
]
},
"time": {"from": "now-6h", "to": "now"},
"title": "Forgejo",
"uid": "edp-forgejo"
}

@ -0,0 +1,117 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: garm
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Applications"
json: |
{
"annotations": {"list": []},
"editable": true,
"graphTooltip": 1,
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"title": "GARM Runner Status",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
"gridPos": {"h": 5, "w": 6, "x": 0, "y": 1},
"title": "Total Runners",
"type": "stat",
"targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}}},
"gridPos": {"h": 5, "w": 6, "x": 6, "y": 1},
"title": "Idle Runners",
"type": "stat",
"targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\", status=\"idle\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "yellow", "value": null}]}}},
"gridPos": {"h": 5, "w": 6, "x": 12, "y": 1},
"title": "Creating",
"type": "stat",
"targets": [{"expr": "count(garm_runner_status{cluster_environment=~\"$cluster_environment\", status=\"creating\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}]}}},
"gridPos": {"h": 5, "w": 6, "x": 18, "y": 1},
"title": "Errors",
"type": "stat",
"targets": [{"expr": "sum(rate(garm_runner_errors_total{cluster_environment=~\"$cluster_environment\"}[5m])) or vector(0)", "legendFormat": ""}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 6},
"title": "GitHub API Rate Limits",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "min": 0}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 7},
"title": "Rate Limit Remaining",
"type": "timeseries",
"targets": [{"expr": "garm_github_rate_limit_remaining{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "ops"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 7},
"title": "Runner Operations Rate",
"type": "timeseries",
"targets": [{"expr": "sum(rate(garm_runner_operations_total{cluster_environment=~\"$cluster_environment\"}[5m])) by (cluster_environment)", "legendFormat": "{{cluster_environment}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 15},
"title": "Runner Details",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"custom": {"filterable": true}}},
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"title": "Runner Pool Status",
"type": "table",
"targets": [{"expr": "garm_runner_status{cluster_environment=~\"$cluster_environment\"}", "format": "table", "instant": true}],
"transformations": [
{"id": "filterFieldsByName", "options": {"include": {"names": ["cluster_environment", "name", "status", "pool_owner", "pool_type", "provider"]}}},
{"id": "organize", "options": {"renameByName": {"cluster_environment": "Environment", "name": "Runner", "status": "Status", "pool_owner": "Pool Owner", "pool_type": "Type", "provider": "Provider"}}}
]
}
],
"schemaVersion": 39,
"tags": ["edp", "garm", "ci-cd", "runners"],
"templating": {
"list": [
{
"current": {"selected": true, "text": "All", "value": "$__all"},
"datasource": {"type": "prometheus"},
"definition": "label_values(garm_runner_status, cluster_environment)",
"includeAll": true,
"multi": true,
"name": "cluster_environment",
"label": "Environment",
"query": "label_values(garm_runner_status, cluster_environment)",
"refresh": 2,
"sort": 1,
"type": "query"
}
]
},
"time": {"from": "now-6h", "to": "now"},
"title": "GARM Runners",
"uid": "edp-garm"
}

@ -6,4 +6,5 @@ spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Operations"
url: "https://raw.githubusercontent.com/adinhodovic/ingress-nginx-mixin/refs/heads/main/dashboards_out/ingress-nginx-overview.json"

@ -0,0 +1,245 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: platform-overview
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
folder: "EDP / Overview"
json: |
{
"annotations": {"list": []},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"links": [],
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"title": "Platform Health",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}],
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
"title": "Forgejo",
"type": "stat",
"targets": [{"expr": "sum(up{job=\"forgejo-server-http\", cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 3}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
"title": "Ingress 5xx (5m)",
"type": "stat",
"targets": [{"expr": "sum(rate(nginx_ingress_controller_requests{status=~\"5..\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"unit": "short",
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
"title": "Failed Jobs (24h)",
"type": "stat",
"targets": [{"expr": "sum(kube_job_status_failed{namespace=\"gitea\", cluster_environment=~\"$cluster_environment\"}) or vector(0)", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.7}, {"color": "red", "value": 0.85}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
"title": "Cluster CPU Usage",
"type": "stat",
"targets": [{"expr": "1 - avg(rate(node_cpu_seconds_total{mode=\"idle\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.7}, {"color": "red", "value": 0.85}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
"title": "Cluster Memory Usage",
"type": "stat",
"targets": [{"expr": "1 - sum(node_memory_MemAvailable_bytes{cluster_environment=~\"$cluster_environment\"}) / sum(node_memory_MemTotal_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.6}, {"color": "red", "value": 0.8}]}
}
},
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 1},
"title": "Max PVC Usage",
"type": "stat",
"targets": [{"expr": "max(1 - kubelet_volume_stats_available_bytes{cluster_environment=~\"$cluster_environment\"} / kubelet_volume_stats_capacity_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": ""}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
"title": "Forgejo",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 6},
"title": "Repositories",
"type": "stat",
"targets": [{"expr": "gitea_repositories{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 6},
"title": "Users",
"type": "stat",
"targets": [{"expr": "gitea_users{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 6},
"title": "Organizations",
"type": "stat",
"targets": [{"expr": "gitea_organizations{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 6},
"title": "Open Issues",
"type": "stat",
"targets": [{"expr": "gitea_issues_open{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 6},
"title": "Webhooks",
"type": "stat",
"targets": [{"expr": "gitea_webhooks{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short"}},
"gridPos": {"h": 4, "w": 4, "x": 20, "y": 6},
"title": "Mirrors",
"type": "stat",
"targets": [{"expr": "gitea_mirrors{cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cluster_environment}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 10},
"title": "Resources",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "percentunit", "min": 0, "max": 1}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 11},
"title": "Node CPU Usage",
"type": "timeseries",
"targets": [{"expr": "1 - avg by (instance, cluster_environment) (rate(node_cpu_seconds_total{mode=\"idle\", cluster_environment=~\"$cluster_environment\"}[5m]))", "legendFormat": "{{instance}}"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "percentunit", "min": 0, "max": 1}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 11},
"title": "PVC Usage by Claim",
"type": "timeseries",
"targets": [{"expr": "1 - (kubelet_volume_stats_available_bytes{cluster_environment=~\"$cluster_environment\"} / kubelet_volume_stats_capacity_bytes{cluster_environment=~\"$cluster_environment\"})", "legendFormat": "{{namespace}}/{{persistentvolumeclaim}}"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 19},
"title": "Backups",
"type": "row"
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "s", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 86400}, {"color": "red", "value": 172800}]}}},
"gridPos": {"h": 4, "w": 8, "x": 0, "y": 20},
"title": "Time Since Last Backup Schedule",
"type": "stat",
"targets": [{"expr": "time() - kube_cronjob_status_last_schedule_time{cronjob=~\"forgejo-s3-backup|secrets-backup\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{cronjob}} ({{cluster_environment}})"}]
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "s"}},
"gridPos": {"h": 4, "w": 8, "x": 8, "y": 20},
"title": "Backup Job Duration (Last 7d)",
"type": "timeseries",
"targets": [{"expr": "kube_job_status_completion_time{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"} - kube_job_status_start_time{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"}", "legendFormat": "{{job_name}}"}],
"options": {"legend": {"displayMode": "table"}}
},
{
"datasource": {"type": "prometheus"},
"fieldConfig": {"defaults": {"unit": "short", "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "red", "value": 1}]}}},
"gridPos": {"h": 4, "w": 8, "x": 16, "y": 20},
"title": "Failed Backup Jobs (Active)",
"type": "stat",
"targets": [{"expr": "sum by(cluster_environment, job_name) (kube_job_status_failed{job_name=~\"forgejo-s3-backup.*|secrets-backup.*\", cluster_environment=~\"$cluster_environment\"})", "legendFormat": "{{job_name}} ({{cluster_environment}})"}]
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 24},
"title": "Logs",
"type": "row"
},
{
"datasource": {"type": "victoriametrics-logs-datasource"},
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 25},
"title": "Recent Errors (all namespaces)",
"type": "logs",
"targets": [{"expr": "{cluster_environment=~\"$cluster_environment\"} error OR Error OR ERROR OR panic OR PANIC", "refId": "A"}],
"options": {"showTime": true, "showLabels": true, "showCommonLabels": false, "wrapLogMessage": true, "prettifyLogMessage": false, "enableLogDetails": true, "sortOrder": "Descending", "dedupStrategy": "none"}
}
],
"schemaVersion": 39,
"tags": ["edp", "platform", "overview"],
"templating": {
"list": [
{
"current": {"selected": true, "text": "All", "value": "$__all"},
"datasource": {"type": "prometheus"},
"definition": "label_values(up, cluster_environment)",
"includeAll": true,
"multi": true,
"name": "cluster_environment",
"label": "Environment",
"query": "label_values(up, cluster_environment)",
"refresh": 2,
"sort": 1,
"type": "query"
}
]
},
"time": {"from": "now-6h", "to": "now"},
"title": "EDP Platform Overview",
"uid": "edp-platform-overview"
}

@ -6,4 +6,7 @@ spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
url: "https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/refs/heads/master/dashboards/vm/victorialogs.json"
folder: "EDP / Operations"
grafanaCom:
id: 22698
revision: 1

@ -1,40 +1,119 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: forgejo-alerts
name: edp-platform-alerts
namespace: observability
spec:
groups:
- name: forgejo
- name: platform-health
rules:
- alert: forgejo down
expr: sum by(cluster_environment) (up{pod=~"forgejo-server-.*"}) < 1
for: 30s
- alert: ForgejoDown
expr: sum by(cluster_environment) (up{job="forgejo-server-http"}) < 1
for: 1m
labels:
severity: critical
job: "{{ $labels.job }}"
annotations:
value: "{{ $value }}"
description: 'forgejo is down in cluster environment {{ $labels.cluster_environment }}'
- name: forgejo-backup
rules:
- alert: forgejo s3 backup job failed
expr: max by(cluster_environment) (kube_job_status_failed{job_name=~"forgejo-s3-backup-.*"}) != 0
for: 30s
labels:
severity: critical
job: "{{ $labels.job }}"
annotations:
value: "{{ $value }}"
description: 'forgejo s3 backup job failed in cluster environment {{ $labels.cluster_environment }}'
- name: disk-consumption-high
rules:
- alert: disk consumption high
expr: 1-(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.6
for: 30s
summary: "Forgejo is down on {{ $labels.cluster_environment }}"
description: "Forgejo server has been unreachable for more than 1 minute in cluster {{ $labels.cluster_environment }}."
- alert: IngressHighErrorRate
expr: |
sum by(cluster_environment) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
/ sum by(cluster_environment) (rate(nginx_ingress_controller_requests[5m])) > 0.05
for: 5m
labels:
severity: major
job: "{{ $labels.job }}"
annotations:
value: "{{ $value }}"
description: 'disk consumption of pvc {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is high in cluster environment {{ $labels.cluster_environment }}'
summary: "High ingress 5xx rate on {{ $labels.cluster_environment }}"
description: "More than 5% of ingress requests are returning 5xx errors for over 5 minutes."
value: "{{ $value | humanizePercentage }}"
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready", status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} not ready on {{ $labels.cluster_environment }}"
description: "Node {{ $labels.node }} has been in NotReady state for more than 5 minutes."
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
for: 5m
labels:
severity: major
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash-looping on {{ $labels.cluster_environment }}"
description: "Pod has restarted more than 3 times in the last 15 minutes."
- name: storage
rules:
- alert: PVCUsageHigh
expr: |
1 - (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
for: 5m
labels:
severity: major
annotations:
summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} usage >80%"
description: "PVC usage is at {{ $value | humanizePercentage }} on {{ $labels.cluster_environment }}."
value: "{{ $value | humanizePercentage }}"
- alert: PVCUsageCritical
expr: |
1 - (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 0.90
for: 5m
labels:
severity: critical
annotations:
summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} usage >90%"
description: "PVC is almost full at {{ $value | humanizePercentage }} on {{ $labels.cluster_environment }}. Immediate action required."
value: "{{ $value | humanizePercentage }}"
- name: resources
rules:
- alert: NodeCPUHigh
expr: |
1 - avg by(instance, cluster_environment) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
for: 15m
labels:
severity: major
annotations:
summary: "Node {{ $labels.instance }} CPU >85% on {{ $labels.cluster_environment }}"
description: "Node CPU utilization has been above 85% for 15 minutes."
value: "{{ $value | humanizePercentage }}"
- alert: NodeMemoryHigh
expr: |
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
for: 10m
labels:
severity: major
annotations:
summary: "Node {{ $labels.instance }} memory >90% on {{ $labels.cluster_environment }}"
description: "Node memory utilization above 90% for 10 minutes."
value: "{{ $value | humanizePercentage }}"
- name: cluster-health
rules:
- alert: ClusterMetricsSilent
expr: |
count(up{job="kubelet"}) by (cluster_environment) < 1
or
absent(up{job="kubelet", cluster_environment="dev"})
for: 10m
labels:
severity: critical
annotations:
summary: "Cluster {{ $labels.cluster_environment }} stopped sending metrics"
description: "No kubelet metrics received from cluster {{ $labels.cluster_environment }} for over 10 minutes. Either vmagent is dead or the cluster is unreachable."
- alert: ClusterAPIServerDown
expr: |
up{job="apiserver", cluster_environment=~".+"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "API server down on {{ $labels.cluster_environment }}"
description: "Kubernetes API server scrape is failing on cluster {{ $labels.cluster_environment }}."
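
A note on the `ClusterMetricsSilent` expression above: in PromQL, `count(...) by (...)` over an empty selection returns no series rather than 0, so the `< 1` branch probably never matches and only the `absent()` branch for `dev` can fire. The fragment below is a minimal sketch of a per-environment dead-man switch, not the author's rule. It assumes the environments are `dev`, `edp`, and `observability` (the values named in the `clusterLabel` commit message); the alert and group names are reused from the diff, and the rest is illustrative.

```yaml
# Sketch only: one absent() per expected environment, so a cluster that
# stops reporting entirely still produces an alerting series.
# Assumption: the environment list (dev, edp, observability) is taken from
# the commit message and may not match the real fleet.
groups:
  - name: cluster-health
    rules:
      - alert: ClusterMetricsSilent
        expr: |
          absent(up{job="kubelet", cluster_environment="dev"})
          or
          absent(up{job="kubelet", cluster_environment="edp"})
          or
          absent(up{job="kubelet", cluster_environment="observability"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Cluster {{ $labels.cluster_environment }} stopped sending metrics"
```

In Prometheus, `absent()` copies equality matchers onto its result, so `{{ $labels.cluster_environment }}` should still resolve in the annotation. The cost is that the environment list has to be maintained by hand.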

@ -0,0 +1,78 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: backup-alerts
namespace: observability
spec:
groups:
- name: backup-schedule-staleness
rules:
- alert: BackupCronJobNotScheduled
expr: |
time() - kube_cronjob_status_last_schedule_time{cronjob=~"forgejo-s3-backup|secrets-backup", namespace="gitea"}
> 26 * 3600
for: 5m
labels:
severity: critical
cronjob: "{{ $labels.cronjob }}"
annotations:
value: "{{ $value | humanizeDuration }}"
description: >-
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has not been
scheduled for over 26 hours in cluster {{ $labels.cluster_environment }}.
Last schedule was {{ $value | humanizeDuration }} ago.
summary: "Backup CronJob {{ $labels.cronjob }} is stale"
- alert: BackupCronJobNeverScheduled
expr: |
kube_cronjob_status_last_schedule_time{cronjob=~"forgejo-s3-backup|secrets-backup", namespace="gitea"}
== 0
for: 30m
labels:
severity: critical
cronjob: "{{ $labels.cronjob }}"
annotations:
description: >-
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has never been
scheduled in cluster {{ $labels.cluster_environment }}.
summary: "Backup CronJob {{ $labels.cronjob }} never ran"
- name: backup-job-failures
rules:
- alert: BackupJobFailed
expr: |
max by(cluster_environment, namespace, job_name) (
kube_job_status_failed{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"}
) > 0
for: 30s
labels:
severity: critical
job_name: "{{ $labels.job_name }}"
annotations:
value: "{{ $value }}"
description: >-
Backup job {{ $labels.namespace }}/{{ $labels.job_name }} has
{{ $value }} failed pod(s) in cluster {{ $labels.cluster_environment }}.
summary: "Backup job {{ $labels.job_name }} failed"
- name: backup-job-duration
rules:
- alert: BackupJobTooSlow
expr: |
(
time() - kube_job_status_start_time{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"}
) > 300
and
kube_job_status_active{job_name=~"forgejo-s3-backup-.*|secrets-backup-.*", namespace="gitea"} > 0
for: 1m
labels:
severity: major
job_name: "{{ $labels.job_name }}"
annotations:
value: "{{ $value | humanizeDuration }}"
description: >-
Backup job {{ $labels.namespace }}/{{ $labels.job_name }} has been
running for {{ $value | humanizeDuration }} (threshold: 5m)
in cluster {{ $labels.cluster_environment }}. This may indicate a
hung process or connectivity issue.
summary: "Backup job {{ $labels.job_name }} running too long"
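
One caveat about `BackupCronJobNeverScheduled` above: the `== 0` comparison only matches if the exporter reports a zero timestamp for a CronJob that has never run. If the series is simply not emitted in that case (which appears to be how kube-state-metrics behaves, though that should be verified), the rule would stay silent. The fragment below is a hedged sketch that keys on `kube_cronjob_info` instead and fires when no matching last-schedule series exists. It assumes `kube_cronjob_info` carries the same `namespace`, `cronjob`, and `cluster_environment` labels as the rule above.

```yaml
# Sketch only. Assumes kube_cronjob_info is scraped with the same
# namespace, cronjob and cluster_environment labels as the rule above.
groups:
  - name: backup-schedule-staleness
    rules:
      - alert: BackupCronJobNeverScheduled
        expr: |
          kube_cronjob_info{cronjob=~"forgejo-s3-backup|secrets-backup", namespace="gitea"}
          unless on(cluster_environment, namespace, cronjob)
          kube_cronjob_status_last_schedule_time
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Backup CronJob {{ $labels.cronjob }} never ran"
```

The `unless` operator keeps an info series only when no last-schedule series shares its `cluster_environment`, `namespace`, and `cronjob` labels, which is the "exists but never scheduled" case.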

@ -10,4 +10,4 @@ spec:
matchLabels:
app.kubernetes.io/name: garm
endpoints:
- port: metrics
- port: http

@ -0,0 +1,9 @@
apiVersion: v1
kind: Secret
metadata:
name: simple-user-secret
namespace: observability
type: Opaque
data:
username: c2ltcGxlLXVzZXI=
password: c3g1Z0M3b29XYVdPT0R3RA==

@ -5,13 +5,17 @@ metadata:
namespace: observability
spec:
username: simple-user
passwordRef:
key: password
name: simple-user-secret
password: sx5gC7ooWaWOODwD
targetRefs:
- static:
url: http://vmsingle-o12y:8429
paths: ["/api/v1/write"]
- static:
url: http://vmsingle-o12y:8429
paths: ["/api/v1/.*"]
- static:
url: http://vlogs-victorialogs:9428
paths: ["/insert/elasticsearch/.*"]
- static:
url: http://vlogs-victorialogs:9428
paths: ["/select/.*"]

@ -1,6 +1,6 @@
global:
# -- Cluster label to use for dashboards and rules
clusterLabel: cluster
clusterLabel: cluster_environment
# -- Global license configuration
license:
key: ""
@ -201,13 +201,13 @@ defaultRules:
enabled: true
rules: {}
kubernetesSystemControllerManager:
enabled: false
create: false
rules: {}
kubeScheduler:
enabled: false
create: false
rules: {}
kubernetesSystemScheduler:
enabled: false
create: false
rules: {}
kubeStateMetrics:
enabled: true