v-m/alerts: Fix PostgreSQL WAL archive failed alert
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL archival has failed, it will increase and never return to `0`. To ensure the alert is resolved once the WAL archival process recovers, we need to use the `increase` function to turn it into a gauge. Finally, we aggregate that gauge with `max_over_time` to keep the alert from flapping if the WAL archive occurs less frequently than the scrape interval.
pull/50/head
parent f637feba16
commit dc835ddc9d
@@ -185,7 +185,9 @@ groups:
         for: 10m
     - alert: WAL archive process failed
       expr: >-
-        pg_stat_archiver_failed_count > 0
+        max_over_time(
+          increase(pg_stat_archiver_failed_count)[20m]
+        )> 0
       annotations:
         summary: The archiver process failed for one or more WAL segments
         description: >-
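
The new expression relies on shorthand (no range selector inside `increase`, and `[20m]` applied to a function result) that MetricsQL appears to accept but stock PromQL does not. For an evaluator that needs explicit ranges, a rough equivalent might look like the sketch below. The group name and the `10m` inner window are assumptions for illustration, not values taken from this commit. Only the `20m` outer window and the metric name come from the diff.

```yaml
groups:
  - name: postgresql-example  # hypothetical group name, not from the commit
    rules:
      - alert: WAL archive process failed
        # increase(...[10m]) yields the number of new failures per window,
        # so it drops back to 0 once archiving recovers (10m is assumed).
        # The [20m:] subquery lets max_over_time hold the alert across
        # evaluations where no new failure was observed.
        expr: >-
          max_over_time(
            increase(pg_stat_archiver_failed_count[10m])[20m:]
          ) > 0
        annotations:
          summary: The archiver process failed for one or more WAL segments
```

With a form like this, the alert would be expected to clear roughly 30 minutes after the last failed archival (the inner window plus the outer window), rather than staying active for the lifetime of the counter.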