Under normal circumstances, Fedora CoreOS fills `/boot` beyond the 75% alert threshold on aarch64 machines. This is not a problem, because it cleans up old files on its own, so we do not need to alert on it. Unfortunately, the _DiskUsage_ alert is already quite complex, and adding exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precompute the used fraction of each filesystem (everything that is not free, divided by the total). By aggregating with `sum without (type) (...)` instead of `sum by (df, instance) (...)`, we keep the other labels, which we can then use to identify the metrics coming from machines we don't care to monitor.

Instead of encoding different thresholds for different volumes in the same expression, we can use multiple alerts to distinguish "low" from "very low" free space. Since this will cause duplicate alerts for most volumes, we can use Alertmanager inhibition rules to suppress the "low" alert once the metric crosses the "very low" threshold.
```yaml
groups:
  - name: collectd
    rules:
      - record: filesystem:usage:percent
        expr: >-
          sum without (type) (collectd_df_df_complex{type!="free"})
          / sum without (type) (collectd_df_df_complex)
```
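With the ratio recorded, the two alerts described above could look roughly like the following. This is only a sketch: the alert names, the 90% "very low" threshold, the `for` durations, the `severity` labels, and the `instance=~"fcos-arm.*"` pattern used to pick out the aarch64 Fedora CoreOS machines are all assumptions made for illustration, not values from the actual configuration. Note that the recorded value is a ratio between 0 and 1, so 75% is written as `0.75`.

```yaml
groups:
  - name: disk-usage
    rules:
      # "Low" free space. The `unless` clause drops /boot on the machines
      # we do not want to monitor; the instance pattern is hypothetical.
      - alert: DiskUsageLow
        expr: >-
          filesystem:usage:percent > 0.75
          unless filesystem:usage:percent{df="boot", instance=~"fcos-arm.*"}
        for: 15m
        labels:
          severity: warning
      # "Very low" free space. The 0.90 threshold is an assumed value.
      - alert: DiskUsageVeryLow
        expr: >-
          filesystem:usage:percent > 0.90
        for: 5m
        labels:
          severity: critical
```

The inhibition could then be expressed in the Alertmanager configuration along these lines, again assuming the hypothetical alert names above. The `equal` list restricts the inhibition to alerts for the same volume on the same machine, so a "very low" alert on one filesystem does not hide a "low" alert on another.

```yaml
inhibit_rules:
  # While DiskUsageVeryLow is firing, mute DiskUsageLow for the same
  # instance and filesystem.
  - source_matchers:
      - alertname = DiskUsageVeryLow
    target_matchers:
      - alertname = DiskUsageLow
    equal:
      - instance
      - df
```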