vmalert_rules:
  groups:
  - name: default alert
    rules:
    - alert: DiskUsage
      expr: >-
        sum(collectd_df_df_complex{type!="free"}) by (instance, df)
        / sum(collectd_df_df_complex{df!="var-log"}) by (instance, df)
        > .75
        or
        sum(collectd_df_df_complex{type!="free"}) by (instance, df)
        / sum(collectd_df_df_complex{df="var-log"}) by (instance, df)
        > .95
      for: 2h
    - alert: TheWebsiteIsDown
      expr: >-
        probe_success{job="websites"} == 0
      for: 10m
    - alert: Missing Metrics
      expr: >-
        up{instance!~"vmhost.*"} == 0
      for: 10m
    - alert: NUT is offline
      expr: >-
        absent(collectd_nut_percent)

  - name: Bitwarden
    rules:
    - alert: vaultwarden is not running
      expr: >-
        collectd_processes_ps_count_processes{processes="vaultwarden"} < 1
      for: 5m

  - name: Active Directory
    rules:
    - alert: samba is not running
      expr: >-
        collectd_processes_ps_count_processes{processes=~"samba|smbd|winbindd|krb5kdc"} < 1
      for: 5m

  - name: Graylog
    rules:
    - alert: unprocessed messages
      expr: >-
        org_graylog2_journal_entries_uncommitted > 100
      for: 1h

  - name: mdraid
    rules:
    - alert: mdraid missing disk
      expr: collectd_md_md_disks{type="missing", instance!~"burp.*"} != 0
    - alert: mdraid failed disk
      expr: collectd_md_md_disks{type="failed"} != 0

  - name: BURP RAID
    rules:
    - alert: disks need swapped
      expr:
        time() - tlast_change_over_time(
          (
            collectd_md_md_disks{instance="burp1.pyrocufflink.blue", type="active"}
            or
            last_over_time(collectd_md_md_disks{instance="burp1.pyrocufflink.blue", type="active"})[1d]
          )[1d]
        ) > 86400 * 30
      annotations:
        summary: The disks in the BURP array need swapped
        description: >-
          The disks in the BURP RAID-1 (mirror) array should be
          swapped periodically. One disk should be online and
          mounted while the other is stored in the fireproof safe.
          Switching them ensures that even if something happens to
          the active disk, such as hardware failure, power surge,
          fire, or accidental `rm -rf`, the offline disk is only
          out of date by a few weeks.