
v-m/alerts: Rework free disk space alert

Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines.  This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precompute the
used-to-total space ratio.  By using `sum(...) without (type)` instead of
`sum(...) by (instance, df)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
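
As a rough illustration of why `without (type)` keeps labels that `by (instance, df)` discards, here is a minimal Python sketch of the two grouping modes.  The `aggregate_sum` helper, the host name, and the sample values are hypothetical and only approximate PromQL's behavior; they are not taken from the real metrics.

```python
from collections import defaultdict


def aggregate_sum(samples, by=None, without=None):
    """Sum samples, grouping roughly as PromQL's `by` / `without` modifiers do."""
    totals = defaultdict(float)
    for labels, value in samples:
        if without is not None:
            # `without`: drop the listed labels, keep everything else.
            kept = {k: v for k, v in labels.items() if k not in without}
        else:
            # `by`: keep only the listed labels (none if no modifier is given).
            kept = {k: v for k, v in labels.items() if k in (by or ())}
        totals[frozenset(kept.items())] += value
    return dict(totals)


# Hypothetical samples for a single volume; the numbers are made up.
base = {"instance": "host1", "df": "boot", "kubernetes_io_arch": "arm64"}
samples = [
    ({**base, "type": "used"}, 80.0),
    ({**base, "type": "reserved"}, 5.0),
    ({**base, "type": "free"}, 15.0),
]

by_result = aggregate_sum(samples, by=("instance", "df"))
without_result = aggregate_sum(samples, without=("type",))

# Label sets of the single resulting series in each mode.
by_labels = dict(next(iter(by_result)))
without_labels = dict(next(iter(without_result)))

# Usage ratio in the spirit of the recording rule: non-free bytes / all bytes.
not_free = [s for s in samples if s[0]["type"] != "free"]
used = aggregate_sum(not_free, without=("type",))
usage = {key: used[key] / without_result[key] for key in used}
```

In this sketch `by_labels` contains only `instance` and `df`, while `without_labels` still carries `kubernetes_io_arch`, which is the label the alert expression later filters on.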

Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use Alertmanager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
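
The inhibition idea can be sketched in Python as follows.  This is a simplified model that handles equality matchers only, not Alertmanager's actual implementation; the `matches` and `is_inhibited` helpers and the host names are hypothetical.

```python
def matches(alert, matchers):
    """True if every matcher label has exactly the given value on the alert."""
    return all(alert.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some other firing alert mutes `target` under `rule`."""
    if not matches(target, rule["target_matchers"]):
        return False
    for source in firing:
        if source is target:
            continue
        if not matches(source, rule["source_matchers"]):
            continue
        # Source and target must agree on every label listed under `equal`.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_matchers": {"alertname": "Free disk space is very low"},
    "target_matchers": {"alertname": "Free disk space is low"},
    "equal": ["instance", "df"],
}

# Hypothetical firing alerts: host1 crossed both thresholds, host2 only one.
low = {"alertname": "Free disk space is low", "instance": "host1", "df": "root"}
very_low = {"alertname": "Free disk space is very low", "instance": "host1", "df": "root"}
other_low = {"alertname": "Free disk space is low", "instance": "host2", "df": "root"}
firing = [low, very_low, other_low]

muted = [a for a in firing if is_inhibited(a, firing, rule)]
```

Under these assumptions only the "low" alert for `host1` is muted; the "low" alert for `host2` still fires because no "very low" alert shares its `instance` and `df` labels.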
pull/32/head
Dustin 2024-11-02 09:38:02 -05:00
parent 4cef41688f
commit 8ecee4133f
4 changed files with 45 additions and 4 deletions


@@ -31,3 +31,12 @@ route:
- alertgroup=Frigate
group_by:
- alertname
inhibit_rules:
- source_matchers:
- alertname=Free disk space is very low
target_matchers:
- alertname=Free disk space is low
equal:
- instance
- df


@@ -1,12 +1,35 @@
groups:
- name: default alert
rules:
- alert: DiskUsage
- alert: Free disk space is low
expr: >-
sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df!="var-log", df!="var-lib-frigate"}) by (instance, df) > .75
or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-log"}) by (instance, df) > .95
or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-lib-frigate"}) by (instance, df) > .95
(
filesystem:usage:percent{
kubernetes_io_arch!="arm64",
df!="mmcblk0p3",
df!="var-lib-frigate",
df!="var-log",
}
or
filesystem:usage:percent{
kubernetes_io_arch="arm64",
df!="boot",
}
or
filesystem:usage:percent{
df="mmcblk0p3",
instance!="nut0.pyrocufflink.blue",
}
) > .75
for: 2h
annotations:
severity: minor
- alert: Free disk space is very low
expr: >-
filesystem:usage:percent > 0.9
for: 2h
annotations:
severity: minor
- alert: TheWebsiteIsDown
expr: >-
probe_success{job="websites"} == 0


@@ -38,6 +38,7 @@ configMapGenerator:
- name: vmalert-rules
files:
- alerts.yml
- recording.yml
options:
disableNameSuffixHash: true
labels:


@@ -0,0 +1,8 @@
groups:
- name: collectd
rules:
- record: filesystem:usage:percent
expr: >-
sum without (type) (collectd_df_df_complex{type!="free"})
/ sum without (type) (collectd_df_df_complex)