The Purple Pi is no more. We want to keep it's backups around, though, but we don't need alerts about them.
313 lines
10 KiB
YAML
313 lines
10 KiB
YAML
groups:
|
|
- name: default alert
|
|
rules:
|
|
- alert: Free disk space is low
|
|
expr: >-
|
|
(
|
|
filesystem:usage:percent{
|
|
kubernetes_io_arch!="arm64",
|
|
df!="mmcblk0p3",
|
|
df!="var-lib-frigate",
|
|
df!="var-log",
|
|
}
|
|
or
|
|
filesystem:usage:percent{
|
|
kubernetes_io_arch="arm64",
|
|
df!="boot",
|
|
}
|
|
or
|
|
filesystem:usage:percent{
|
|
df="mmcblk0p3",
|
|
instance!="nut0.pyrocufflink.blue",
|
|
}
|
|
) > .75
|
|
for: 2h
|
|
annotations:
|
|
severity: minor
|
|
- alert: Free disk space is very low
|
|
expr: >-
|
|
filesystem:usage:percent > 0.9
|
|
for: 2h
|
|
annotations:
|
|
severity: minor
|
|
- alert: TheWebsiteIsDown
|
|
expr: >-
|
|
probe_success{job="websites"} == 0
|
|
for: 10m
|
|
- alert: Missing Metrics
|
|
expr: >-
|
|
up{instance!~"vmhost.*"} == 0
|
|
for: 10m
|
|
- alert: NUT is offline
|
|
expr: >-
|
|
absent(collectd_nut_percent)
|
|
for: 10m
|
|
- alert: Internet is down
|
|
expr: >-
|
|
probe_success{job="blackbox"} == 0
|
|
for: 5m
|
|
annotations:
|
|
severity: critical
|
|
summary: The connection to the Internet is down.
|
|
description: >-
|
|
The Internet connection is down. Try rebooting the ONT, or call
|
|
Everfast Fiber.
|
|
|
|
- name: Bitwarden
|
|
rules:
|
|
- alert: vaultwarden is not running
|
|
expr: >-
|
|
collectd_processes_ps_count_processes{processes="vaultwarden"} < 1
|
|
for: 5m
|
|
|
|
- name: Active Directory
|
|
rules:
|
|
- alert: samba is not running
|
|
expr: >-
|
|
collectd_processes_ps_count_processes{processes=~"samba|smbd|winbindd|krb5kdc"} < 1
|
|
for: 5m
|
|
|
|
- name: mdraid
|
|
rules:
|
|
- alert: mdraid missing disk
|
|
expr: collectd_md_md_disks{type="missing", instance!="chromie.pyrocufflink.blue"} != 0
|
|
- alert: mdraid failed disk
|
|
expr: collectd_md_md_disks{type="failed"} != 0
|
|
|
|
- name: Backups
|
|
rules:
|
|
- alert: disks need swapped
|
|
expr:
|
|
time() - tlast_change_over_time(
|
|
(
|
|
collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type="active"}
|
|
or last_over_time(collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type="active"})[1d]
|
|
)[90d]
|
|
) > 86400 * 30
|
|
annotations:
|
|
summary: The disks in the backup array need swapped
|
|
description: >-
|
|
The disks in the backup RAID-1 (mirror) array should be swapped
|
|
periodically. One disk should be online and mounted while the other
|
|
is stored in the fireproof safe. Switching them ensures that even if
|
|
something happens to the active disk, such as hardware failure, power
|
|
surge, fire, or accidental `rm -rf`, the offline disk is only out of
|
|
date by a few weeks.
|
|
- alert: disk needs archived
|
|
expr:
|
|
sum(
|
|
collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type=~"missing|spare"}
|
|
) < 1
|
|
annotations:
|
|
summary: One of the disks in the backup array should be archived
|
|
description: >-
|
|
The disks in the backup RAID-1 (mirror) array should be swapped
|
|
periodically. One disk should be online and mounted while the other
|
|
is stored in the fireproof safe. All of the disks are currently
|
|
online; one needs to be disconnected and moved to the safe as soon as
|
|
possible.
|
|
|
|
- name: certificates
|
|
rules:
|
|
- alert: certificate will expire soon
|
|
expr:
|
|
probe_ssl_last_chain_expiry_timestamp_seconds - time() < 29 * 86400
|
|
annotations:
|
|
summary: A certificate will expire in less than 29 days
|
|
description: >-
|
|
Generally, certificates are renewed automatically, approximately 30
|
|
days before their expiration (NotAfter) date. There may be a problem
|
|
with the certificate renewal process that prevented this certificate
|
|
from being renewed.
|
|
- alert: certificate will expire very soon
|
|
expr:
|
|
probe_ssl_last_chain_expiry_timestamp_seconds - time() < 14 * 86400
|
|
annotations:
|
|
summary: A certificate will expire in less than 14 days
|
|
description: >-
|
|
Generally, certificates are renewed automatically, approximately 30
|
|
days before their expiration (NotAfter) date. There is most likely a
|
|
problem with the certificate renewal process that prevented this
|
|
certificate from being renewed.
|
|
|
|
- name: Frigate
|
|
rules:
|
|
- alert: Frigate is Unavailable
|
|
expr:
|
|
absent(frigate_service_info)
|
|
or irate(frigate_service_last_updated_timestamp) < 1
|
|
or irate(frigate_service_uptime_seconds) < 1
|
|
for: 10m
|
|
- alert: Camera unavailable
|
|
expr:
|
|
homeassistant_entity_available{domain="camera"} != 1
|
|
for: 10m
|
|
- alert: No camera video frames
|
|
expr:
|
|
homeassistant_sensor_unit_fps{entity=~"sensor.*_camera_fps"} <= 0
|
|
annotations:
|
|
summary: No video received from camera
|
|
description: >-
|
|
Frigate is not receiving video from the camera. The camera may be
|
|
offline.
|
|
|
|
- name: Home Assistant
|
|
rules:
|
|
- alert: Battery Low
|
|
expr:
|
|
homeassistant_sensor_battery_percent{entity!~"sensor\\.(pixel_|sm_p610).*"} < 10
|
|
annotations:
|
|
summary: >-
|
|
Low battery: {{ $labels.friendly_name }}
|
|
severity: minor
|
|
- alert: Z-Wave Network is Offline
|
|
expr:
|
|
sum(
|
|
homeassistant_entity_available{entity="sensor.usb_controller_status"}
|
|
) without (
|
|
friendly_name
|
|
) < 1
|
|
annotations:
|
|
summary: The Z-Wave network controller is offline
|
|
description: >-
|
|
Home Assistant is not able to communicate with ZWaveJS, or ZWaveJS is
|
|
not able to connect to the Z-Wave USB controller. Z-Wave devices like
|
|
light switches, door sensors, and smart plugs will not work until the
|
|
Z-Wave network is operational again.
|
|
- alert: Zigbee Network is Offline
|
|
expr:
|
|
homeassistant_binary_sensor_state{entity="binary_sensor.zigbee2mqtt_bridge_connection_state"} == 0
|
|
annotations:
|
|
summary: The Zigbee network bridge is offline
|
|
description: >-
|
|
Home Assistant is not able to communicate with Zigbee2MQTT, or
|
|
Zigbee2MQTT is not able to connect to the Z-Wave USB controller.
|
|
Zigbee devices like smart bulbs and buttons will not work until the
|
|
Zigbee network is operational again.
|
|
|
|
- name: PostgreSQL
|
|
rules:
|
|
- alert: Replica lag too high
|
|
expr:
|
|
(patroni_xlog_location != 0)
|
|
- ignoring (instance) group_right (scope) (patroni_xlog_replayed_location != 0)
|
|
> 10240
|
|
for: 10m
|
|
- alert: WAL archive process failed
|
|
expr: >-
|
|
max_over_time(
|
|
increase(pg_stat_archiver_failed_count)[20m]
|
|
)> 0
|
|
annotations:
|
|
summary: The archiver process failed for one or more WAL segments
|
|
description: >-
|
|
Check the WAL segment archiver configuration and confirm that WAL
|
|
segments are being backed up correctly.
|
|
- alert: No recent WAL archives
|
|
expr: >-
|
|
pg_stat_archiver_last_archive_age > 3600
|
|
annotations:
|
|
summary: The last successful WAL segment backup was over 1h ago
|
|
description: >-
|
|
The WAL archiver process has not run successfully for over an hour.
|
|
Ensure the WAL backup process is configured correctly and the backup
|
|
target is online and healthy.
|
|
|
|
|
|
- name: Temperature
|
|
rules:
|
|
- alert: High Temperature
|
|
expr: >-
|
|
{__name__=~"collectd_.*_temperature", sensors!~"i350bb.*"} > 80
|
|
for: 10m
|
|
|
|
- name: Longhorn
|
|
rules:
|
|
- alert: Degraded Volumes
|
|
expr: >-
|
|
count(longhorn_volume_robustness==2) > 0
|
|
for: 1h
|
|
- alert: Faulted Volumes
|
|
expr: >-
|
|
count(longhorn_volume_robustness==3) > 0
|
|
for: 5m
|
|
|
|
- name: Restic
|
|
rules:
|
|
- alert: Repository Check Failed
|
|
expr: >-
|
|
min(restic_check_success) by (job) < 1
|
|
annotations:
|
|
summary: Errors found in restic repository data
|
|
description: >-
|
|
The Restic repository has one or more problems that may result in data
|
|
loss. Check the restic-exporter log for more information and correct
|
|
the issue as soon as possible.
|
|
- alert: Last Backup Age
|
|
expr: >-
|
|
time() - restic_backup_timestamp{
|
|
client_hostname!="bw0.pyrocufflink.blue",
|
|
client_hostname!="luma.pyrocufflink.blue",
|
|
client_hostname!="purplepi.hatch",
|
|
client_hostname!="toad.pyrocufflink.blue",
|
|
}> 604800
|
|
annotations:
|
|
summary: A Restic client has not backed up recently
|
|
description: >-
|
|
Clients are scheduled to back up every day, but at least one has not
|
|
been backed up in at least 7 days. Check the Restic configuration on
|
|
that system to ensure backups are running properly.
|
|
|
|
- name: Paperless-ngx
|
|
rules:
|
|
- alert: Paperless-ngx is down
|
|
expr: >-
|
|
up{job="paperless-ngx"} == 0 or absent(up{job="paperless-ngx"})
|
|
annotations:
|
|
summary: Paperless-ngx is down
|
|
description: >-
|
|
Paperless-ngx is offline.
|
|
- alert: Celery tasks failed
|
|
expr: >-
|
|
max_over_time(
|
|
increase(
|
|
flower_events_total{
|
|
job="paperless-ngx",
|
|
type="task-failed",
|
|
task!="documents.tasks.consume_file",
|
|
}
|
|
)[24h]
|
|
) > 0
|
|
annotations:
|
|
summary: Paperless-ngx Celery task failed
|
|
description: >-
|
|
Failing Celery tasks may indicate a problem with the Paperless-ngx
|
|
deployment and can result in data loss. Check the Paperless-ngx logs
|
|
for details about the task failures.
|
|
- alert: Paperless email task not running
|
|
expr: >-
|
|
absent_over_time(
|
|
flower_events_total{
|
|
type="task-started",
|
|
task="paperless_mail.tasks.process_mail_accounts"
|
|
}[12h]
|
|
)
|
|
annotations:
|
|
summary: Paperless task to process mail accounts has not run recently
|
|
description: >-
|
|
Paperless-ngx uses a scheduled Celery task to periodically poll email
|
|
mailboxes for new messages. If this task does not start, new email
|
|
messages will not be downloaded and imported into the document library.
|
|
|
|
- name: Firefly III
|
|
rules:
|
|
- alert: Firefly III is down
|
|
expr: >-
|
|
probe_success{job="firefly-iii"} != 1
|
|
|
|
- name: phpipam
|
|
rules:
|
|
- alert: phpipam is down
|
|
expr: >-
|
|
probe_success{job="phpipam"} != 1
|