v-m: Deploy (clustered) Victoria Metrics

Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.

I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot.  I would rather
not have that many gaps in the data.  Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.

I chose not to use the Victoria Metrics Operator, but rather to handle
the resource definitions myself.  Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.
commit 8f088fb6ae (parent 8c605d0f9f), 2024-01-01 15:23:14 -06:00
17 changed files with 1474 additions and 0 deletions

# Victoria Metrics
[Victoria Metrics] is a powerful, scalable time-series database compatible
with Prometheus and its ecosystem of metrics exporters.
## Clustered Deployment
*Victoria Metrics* can run in a high-availability cluster, with the various
functions of the TSDB split into independently-scalable processes:
* `vmstorage`: Stores time series data.
* `vminsert`: Ingests metrics in various formats (e.g. Prometheus) and sends
them to one or more `vmstorage` nodes.
* `vmselect`: Performs metrics queries, retrieving results from one or more
`vmstorage` nodes.

The `vmstorage` processes are managed by a StatefulSet with a volume claim
template for persistent storage. The number of replicas in the StatefulSet
must be at least $2n-1$, where $n$ is the value of the `replicationFactor`
setting for
`vminsert`.

`vminsert` and `vmselect` processes are stateless and thus managed by a
Deployment. There should be at least 2 replicas of each of these, so that
restarts, etc. can be performed without any downtime.
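
As a minimal sketch of the `vmstorage` StatefulSet described above, where the
resource names, image tag, port numbers, and storage size are assumptions
rather than the actual manifests:

```yaml
# Sketch only: names, image tag, and sizes are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  serviceName: vmstorage
  replicas: 3          # at least 2n-1 for replicationFactor n=2
  selector:
    matchLabels:
      app: vmstorage
  template:
    metadata:
      labels:
        app: vmstorage
    spec:
      containers:
      - name: vmstorage
        image: victoriametrics/vmstorage:latest  # pin a real tag in practice
        args:
        - -storageDataPath=/storage
        volumeMounts:
        - name: data
          mountPath: /storage
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```
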
## vmagent
In a typical Victoria Metrics ecosystem, collecting metrics is handled
separately from the TSDB. The [vmagent] process handles scraping and receiving
metrics and passing them to `vminsert`. `vmagent` can cache received metrics
locally, in case no `vminsert` process is available, so it requires persistent
storage and is therefore managed by a StatefulSet. Because there are multiple
`vmagent` processes scraping the same targets, the `vmselect` and `vmstorage`
processes MUST have the `-dedup.minScrapeInterval` flag set to match the
`vmagent` scrape interval. Jobs with scrape intervals longer than the
default will unfortunately have duplicate data points.
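
The deduplication window is set via a command-line flag on the affected
processes; a sketch, where the 30-second value is an assumption that must
match the actual `vmagent` scrape interval:

```yaml
# Container args sketch; 30s is an assumed scrape interval.
args:
- -dedup.minScrapeInterval=30s
```
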
## Blackbox Exporter
Many applications and web sites are monitored via the [Blackbox Exporter],
which makes arbitrary HTTP, TCP, ICMP, etc. requests and reports Prometheus
metrics about them. This is a stateless process, managed by a Deployment.
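
A typical Prometheus-style scrape configuration for probing through the
Blackbox Exporter rewrites each target into the exporter's `target` query
parameter; a sketch, where the job name, module, Service name, and target URL
are assumptions:

```yaml
# Sketch of a blackbox probe job; names and targets are assumptions.
scrape_configs:
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
  - targets:
    - https://example.com/
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115
```
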
## vmalert
Victoria Metrics has a separate process for alerting, [vmalert]. This process
periodically executes the queries defined in its alerting rules and creates
alerts for matching results. Alerts are stored in the Victoria Metrics TSDB.
Rules are defined in a YAML document, managed by a ConfigMap. Notifications
are sent to Alertmanager.
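
As an illustration of such a ConfigMap (the group, alert name, threshold, and
file name are assumptions, not the actual rules):

```yaml
# Sketch of a vmalert rules ConfigMap; contents are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmalert-rules
data:
  rules.yaml: |
    groups:
    - name: node
      rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: '{{ $labels.instance }} is not responding to scrapes'
```
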
## Alertmanager
[Alertmanager] receives notifications from `vmalert` and sends e.g. email
messages. Multiple instances can be run in a cluster; each node needs to know
the host and port of every node in the cluster.
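
With a StatefulSet and headless Service both named `alertmanager` (an
assumption), the peer list can be given as stable pod DNS names:

```yaml
# Container args sketch for a two-replica Alertmanager cluster;
# the StatefulSet/Service name "alertmanager" is an assumption.
args:
- --config.file=/etc/alertmanager/alertmanager.yml
- --cluster.listen-address=0.0.0.0:9094
- --cluster.peer=alertmanager-0.alertmanager:9094
- --cluster.peer=alertmanager-1.alertmanager:9094
```
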

[Victoria Metrics]: https://new.docs.victoriametrics.com/
[vmagent]: https://new.docs.victoriametrics.com/vmagent/
[Blackbox Exporter]: https://github.com/prometheus/blackbox_exporter
[vmalert]: https://new.docs.victoriametrics.com/vmalert/
[Alertmanager]: https://prometheus.io/docs/alerting/latest/alertmanager/