v-m/alerts: Rework free disk space alert
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal circumstances on aarch64 machines. This is not a problem, because it cleans up old files on its own, so we do not need to alert on it. Unfortunately, the _DiskUsage_ alert is already quite complex, and adding exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precompute the ratio of used to total space. By using `sum(...) without (type)` instead of `sum(...) by (instance, df)`, we keep the other labels, which we can then use to identify the metrics coming from machines we don't care to monitor.

Instead of having different thresholds for different volumes encoded in the same expression, we can use multiple alerts to alert on "low" vs "very low" thresholds. Since this will of course cause duplicate alerts for most volumes, we can use Alertmanager inhibition rules to suppress the "low" alert once the metric crosses the "very low" threshold.

pull/32/head
parent
4cef41688f
commit
8ecee4133f
@@ -31,3 +31,12 @@ route:
     - alertgroup=Frigate
     group_by:
     - alertname
+
+inhibit_rules:
+- source_matchers:
+  - alertname=Free disk space is very low
+  target_matchers:
+  - alertname=Free disk space is low
+  equal:
+  - instance
+  - df
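To illustrate what the inhibition rule above is meant to do, here is a minimal plain-Python model of its matching logic. It is only a sketch, not Alertmanager's implementation: the helper names (`matches`, `is_inhibited`) and the `host1.example`/`host2.example` instances are hypothetical, and matchers are simplified to exact label equality.

```python
# Simplified model of a single inhibit rule; all names here are illustrative.
RULE = {
    "source": {"alertname": "Free disk space is very low"},
    "target": {"alertname": "Free disk space is low"},
    "equal": ["instance", "df"],
}


def matches(alert, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule=RULE):
    """True if some firing source alert mutes `target` under `rule`."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(label) == target.get(label) for label in rule["equal"])
        for source in firing
    )


# Hypothetical firing alerts: host1 crossed both thresholds, host2 only the lower one.
firing = [
    {"alertname": "Free disk space is very low", "instance": "host1.example", "df": "root"},
    {"alertname": "Free disk space is low", "instance": "host1.example", "df": "root"},
    {"alertname": "Free disk space is low", "instance": "host2.example", "df": "root"},
]

notified = [alert for alert in firing if not is_inhibited(alert, firing)]
for alert in notified:
    print(alert["alertname"], alert["instance"], alert["df"])
```

In this model the "low" alert for `host1.example` is dropped because a "very low" alert with the same `instance` and `df` is firing, while the "low" alert for `host2.example` still goes out.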
@@ -1,12 +1,35 @@
 groups:
 - name: default alert
   rules:
-  - alert: DiskUsage
+  - alert: Free disk space is low
     expr: >-
-      sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df!="var-log", df!="var-lib-frigate"}) by (instance, df) > .75
-      or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-log"}) by (instance, df) > .95
-      or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-lib-frigate"}) by (instance, df) > .95
+      (
+        filesystem:usage:percent{
+          kubernetes_io_arch!="arm64",
+          df!="mmcblk0p3",
+          df!="var-lib-frigate",
+          df!="var-log",
+        }
+        or
+        filesystem:usage:percent{
+          kubernetes_io_arch="arm64",
+          df!="boot",
+        }
+        or
+        filesystem:usage:percent{
+          df="mmcblk0p3",
+          instance!="nut0.pyrocufflink.blue",
+        }
+      ) > .75
+    for: 2h
+    annotations:
+      severity: minor
+  - alert: Free disk space is very low
+    expr: >-
+      filesystem:usage:percent > 0.9
+    for: 2h
+    annotations:
+      severity: minor
   - alert: TheWebsiteIsDown
     expr: >-
       probe_success{job="websites"} == 0
@@ -38,6 +38,7 @@ configMapGenerator:
 - name: vmalert-rules
   files:
   - alerts.yml
+  - recording.yml
   options:
     disableNameSuffixHash: true
     labels:
@@ -0,0 +1,8 @@
+groups:
+- name: collectd
+  rules:
+  - record: filesystem:usage:percent
+    expr: >-
+      sum without (type) (collectd_df_df_complex{type!="free"})
+      / sum without (type) (collectd_df_df_complex)
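The effect of `sum without (type)` in the recording rule can be sketched outside PromQL. The following is a small plain-Python model, not how vmalert evaluates the rule: the sample values, host names, and the helpers `sum_without` and `usage_ratio` are hypothetical. It shows that dropping only the `type` label leaves labels such as `kubernetes_io_arch` on the result, which is what lets the alert expression filter on them.

```python
from collections import defaultdict

# Hypothetical samples shaped like collectd_df_df_complex: (labels, value).
SAMPLES = [
    ({"instance": "host1", "df": "boot", "kubernetes_io_arch": "arm64", "type": "used"}, 80.0),
    ({"instance": "host1", "df": "boot", "kubernetes_io_arch": "arm64", "type": "reserved"}, 5.0),
    ({"instance": "host1", "df": "boot", "kubernetes_io_arch": "arm64", "type": "free"}, 15.0),
    ({"instance": "host2", "df": "root", "kubernetes_io_arch": "amd64", "type": "used"}, 30.0),
    ({"instance": "host2", "df": "root", "kubernetes_io_arch": "amd64", "type": "free"}, 70.0),
]


def sum_without(samples, dropped):
    """Sum series that become identical once the `dropped` labels are removed."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = frozenset((k, v) for k, v in labels.items() if k not in dropped)
        totals[key] += value
    return dict(totals)


def usage_ratio(samples):
    """Model of: sum without (type) (x{type!="free"}) / sum without (type) (x)."""
    non_free = sum_without([s for s in samples if s[0]["type"] != "free"], {"type"})
    total = sum_without(samples, {"type"})
    # Division only pairs series whose remaining label sets match exactly.
    return {key: non_free[key] / total[key] for key in non_free if key in total}


ratios = usage_ratio(SAMPLES)
for key, ratio in ratios.items():
    print(dict(key), ratio)

# The arch label survived aggregation, so arm64 /boot can be excluded afterwards.
low_space = {
    key: ratio
    for key, ratio in ratios.items()
    if ratio > 0.75
    and not (dict(key)["kubernetes_io_arch"] == "arm64" and dict(key)["df"] == "boot")
}
```

With `by (instance, df)` in place of `without (type)`, the keys would contain only `instance` and `df`, and the final filter on `kubernetes_io_arch` would have nothing to match against.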