From [1]: increase should only be used with counters. But etcdExcessiveDatabaseGrowth has used increase() since it landed [2], despite (etcd_mvcc_)db_total_size_in_bytes being a gauge [3]. This leads to false positives when things like compaction reduce the consumed DB size and increase() interprets that as "counter reset, so transparently unroll it". We probably want a query based on predict_linear [4], like we do for NodeFilesystemSpaceFillingUp [5].

[1]: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
[2]: https://github.com/openshift/cluster-etcd-operator/blame/28a4ae406ff736b00af68c4f4d249319d62e48dd/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L117
[3]: https://github.com/etcd-io/etcd/blob/71bba3c761b0078c81c2b39781ec74853c458303/server/storage/mvcc/metrics.go#L156
[4]: https://prometheus.io/docs/prometheus/latest/querying/functions/#predict_linear
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
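To make the false positive concrete, here is a minimal Python sketch of counter-style reset handling applied to a gauge. It is a simplification, not Prometheus's actual implementation: it ignores the extrapolation to window boundaries that increase() performs, and the function name and sample values are made up for illustration.

```python
def counter_style_increase(samples):
    """Sum the increases between consecutive samples, treating any drop
    as a counter reset to zero (simplified; no boundary extrapolation)."""
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # A drop is assumed to be a reset, so the whole new value
            # is counted as growth since the assumed restart at zero.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# Hypothetical DB sizes in MB: slow growth, one large drop, slow growth again.
db_size_mb = [100, 104, 108, 60, 64, 68]

net_change = db_size_mb[-1] - db_size_mb[0]
reported = counter_style_increase(db_size_mb)

print(f"net change of the gauge: {net_change} MB")
print(f"counter-style increase:  {reported} MB")
```

With these made-up samples the gauge ends lower than it started, yet the counter-style calculation reports substantial growth, because the post-drop value is added in full. That is the shape of the false positive described above.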
@wking This alert is coming from upstream:
https://github.com/etcd-io/etcd/blob/1ec3722ce5e911ac53495efd8d28099ec3478cad/contrib/mixin/mixin.libsonnet#L228-L238

So the change should be made there. There was a previous attempt, but it did not land:
https://github.com/etcd-io/etcd/issues/12550
https://github.com/etcd-io/etcd/pull/13223

May I ask how you want the new alert to look, and what reason I should give upstream?

cc @spasq
(In reply to melbeher from comment #1)
> So the change should be made there. There was a previous attempt but it did
> not land there https://github.com/etcd-io/etcd/issues/12550
> https://github.com/etcd-io/etcd/pull/13223
>
> May I ask how you want the new alert, and what is the reason I should give
> upstream

Looks like that previous attempt was driven by the same "we should not use increase() for this non-counter argument" reasoning, and the previous attempt just rotted out due to lack of interest. I'd just point out (again) to the upstream maintainers that what they are recommending now is using increase() in a situation that it is clearly documented as a bad fit for. If they want to use deriv() like [1], that's fine with me. If they want to use predict_linear like [2], that's fine with me too. Plenty of examples of how they could do this, and plenty of flexibility to pick whatever they like best. But leaving the invalid increase() use unfixed does not seem like a great long-term approach.

[1]: https://github.com/etcd-io/etcd/pull/13223
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
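For anyone comparing the two options, both deriv() and predict_linear() are documented as using simple linear regression over the samples in the range. A rough Python sketch of that idea follows. The helper names, the sample points, and the horizon are hypothetical, and this is not the Prometheus code or the upstream alert expression.

```python
def linear_fit(points):
    """Least-squares slope and intercept for a list of (time, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(points, now, horizon):
    """Value the fitted line reaches `horizon` seconds after `now`."""
    slope, intercept = linear_fit(points)
    return slope * (now + horizon) + intercept


# Hypothetical gauge samples: (seconds, DB size in MB), growing 6 MB per minute.
growing = [(0, 100), (60, 106), (120, 112), (180, 118)]
# Same start, but with a large drop in the middle of the window.
with_drop = [(0, 100), (60, 106), (120, 60), (180, 66)]

slope_growing, _ = linear_fit(growing)
slope_with_drop, _ = linear_fit(with_drop)

print(f"slope, steady growth: {slope_growing:.3f} MB/s")
print(f"slope, with a drop:   {slope_with_drop:.3f} MB/s")
print(f"predicted size in 1h: {predict(growing, now=180, horizon=3600):.1f} MB")
```

The point of the sketch is that a regression over the raw gauge values sees a drop as a drop (the fitted slope goes negative for the second series), instead of unrolling it into extra growth the way counter reset handling does.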
Hello @wking, I raised a fix here:
https://github.com/etcd-io/etcd/pull/14196

Let's hope it gets merged soon. Thanks a lot.
The upstream fix has been merged:
https://github.com/etcd-io/etcd/pull/14196

Shall I cherry-pick it, or wait for the next rebase?

cc @wking
I dunno when the next etcd rebase is likely to come around, but as long as it's just me on this bug, probably fine to wait on that rebase. If we get some external customers who are bothered by the increase() false-positives, we can revisit and consider a pick and backports.
@wking I have cherry-picked it, let's see what the reviewers say :)
I have vendored the upstream alerts and opened a PR to bring the fix to cluster-etcd-operator.
OCP version: 4.12.0-0.nightly-2022-09-22-014209

Steps followed:

Initial etcd DB size

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  101 MB |      true |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   886    0   886    0     0  15543      0 --:--:-- --:--:-- --:--:-- 15543
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "100708352"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "99790848"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "101232640"
        ]
      }
    ]
  }
}

Increased etcd DB size

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  123 MB |     false |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  123 MB |      true |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  122 MB |     false |      false |         7 |     351550 |             351550 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   887    0   887    0     0  15561      0 --:--:-- --:--:-- --:--:-- 15561
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122802176"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122961920"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122126336"
        ]
      }
    ]
  }
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399