Files
configpolicy/group_vars/metricspi/alerts.yml
Dustin C. Hatch 4e608e379f metricspi/alerts: Correct BURP archive alert query
When the RAID array is being resynchronized after the archived disk has
been reconnected, md changes the disk status from "missing" to "spare."
Once the synchronization is complete, it changes from "spare" to
"active."  We only want to trigger the "disk needs archived" alert once
the synchronization process is complete; otherwise, both the "disks need
swapped" and "disk needs archived" alerts would be active at the same
time, which makes no sense.  By adjusting the query for the "disk needs
archived" alert to consider disks in both "missing" and "spare" status,
we can delay firing that alert until the proper time.
2023-06-20 11:58:35 -05:00

124 lines
4.7 KiB
YAML

vmalert_rules:
groups:
- name: default alert
rules:
- alert: DiskUsage
expr: >-
sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df!="var-log"}) by (instance, df) > .75
or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-log"}) by (instance, df) > .95
for: 2h
- alert: TheWebsiteIsDown
expr: >-
probe_success{job="websites"} == 0
for: 10m
- alert: Missing Metrics
expr: >-
up{instance!~"vmhost.*"} == 0
for: 10m
- alert: NUT is offline
expr: >-
absent(collectd_nut_percent)
- name: Bitwarden
rules:
- alert: vaultwarden is not running
expr: >-
collectd_processes_ps_count_processes{processes="vaultwarden"} < 1
for: 5m
- name: Active Directory
rules:
- alert: samba is not running
expr: >-
collectd_processes_ps_count_processes{processes=~"samba|smbd|winbindd|krb5kdc"} < 1
for: 5m
- name: Graylog
rules:
- alert: unprocessed messages
expr: >-
org_graylog2_journal_entries_uncommitted > 100
for: 1h
- name: mdraid
rules:
- alert: mdraid missing disk
expr: collectd_md_md_disks{type="missing", instance!~"burp.*"} != 0
- alert: mdraid failed disk
expr: collectd_md_md_disks{type="failed"} != 0
- name: BURP
rules:
- alert: no recent backups
expr: absent(burp_client_last_backup_timestamp)
for: 8h
annotations:
summary: No clients have been backed up recently
description: >-
This alert indicates that NO clients have been backed up within the
last day. There is likely a problem with the BURP server.
- alert: missed client backup
expr:
time() - (burp_client_last_backup_timestamp > now() - 86400 * 90) > 86400 * 2
for: 3h
annotations:
summary: A client has not backed up today
description: >-
A client has not been backed up for more than a day. This may be
because the client is offline, or because the backup process has
failed. Clients that have not been backed up for more than 90 days
will not trigger this alert.
- alert: disks need swapped
expr:
time() - tlast_change_over_time(
(
collectd_md_md_disks{instance="burp1.pyrocufflink.blue", type="active"}
or last_over_time(collectd_md_md_disks{instance="burp1.pyrocufflink.blue", type="active"})[1d]
)[90d]
) > 86400 * 30
annotations:
summary: The disks in the BURP array need swapped
description: >-
The disks in the BURP RAID-1 (mirror) array should be swapped
periodically. One disk should be online and mounted while the other
is stored in the fireproof safe. Switching them ensures that even if
something happens to the active disk, such as hardware failure, power
surge, fire, or accidental `rm -rf`, the offline disk is only out of
date by a few weeks.
- alert: disk needs archived
expr:
sum(
collectd_md_md_disks{instance="burp1.pyrocufflink.blue", type=~"missing|spare"}
) < 1
annotations:
summary: One of the disks in the BURP array should be archived
description: >-
The disks in the BURP RAID-1 (mirror) array should be swapped
periodically. One disk should be online and mounted while the other
is stored in the fireproof safe. All of the disks are currently
online; one needs to be disconnected and moved to the safe as soon as
possible.
- name: certificates
rules:
- alert: certificate will expire soon
expr:
probe_ssl_last_chain_expiry_timestamp_seconds - time() < 29 * 86400
annotations:
summary: A certificate will expire in less than 29 days
description: >-
Generally, certificates are renewed automatically, approximately 30
days before their expiration (NotAfter) date. There may be a problem
with the certificate renewal process that prevented this certificate
from being renewed.
- alert: certificate will expire very soon
expr:
probe_ssl_last_chain_expiry_timestamp_seconds - time() < 14 * 86400
annotations:
summary: A certificate will expire in less than 14 days
description: >-
Generally, certificates are renewed automatically, approximately 30
days before their expiration (NotAfter) date. There is most likely a
problem with the certificate renewal process that prevented this
certificate from being renewed.