Bug 2097073 - etcdExcessiveDatabaseGrowth should not use increase() around gauge metrics
Summary: etcdExcessiveDatabaseGrowth should not use increase() around gauge metrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: melbeher
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-14 20:46 UTC by W. Trevor King
Modified: 2023-01-17 19:50 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:58 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 890 0 None open Bug 2097073: vendor upstream alerts 2022-07-15 10:53:29 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:17 UTC

Description W. Trevor King 2022-06-14 20:46:54 UTC
From [1]:

  increase should only be used with counters.

but etcdExcessiveDatabaseGrowth has used increase() since it landed [2], despite (etcd_mvcc_)db_total_size_in_bytes being a gauge [3].  This leads to false positives: when something like compaction reduces the consumed DB size, increase() interprets the drop as "counter reset, so transparently unroll it".  We probably want a query based on predict_linear() [4], like the one used for NodeFilesystemSpaceFillingUp [5].

[1]: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
[2]: https://github.com/openshift/cluster-etcd-operator/blame/28a4ae406ff736b00af68c4f4d249319d62e48dd/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L117
[3]: https://github.com/etcd-io/etcd/blob/71bba3c761b0078c81c2b39781ec74853c458303/server/storage/mvcc/metrics.go#L156
[4]: https://prometheus.io/docs/prometheus/latest/querying/functions/#predict_linear
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
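For illustration, a predict_linear()-based variant in the spirit of the NodeFilesystemSpaceFillingUp rule might look like the sketch below.  This is hypothetical, not the actual shipped rule: the 4h window, the comparison against etcd_server_quota_backend_bytes, and the job matcher are all assumptions for the example.

```yaml
# Hypothetical sketch only -- not the shipped rule.  The window (4h),
# the quota comparison, and the job matcher are illustrative assumptions.
- alert: etcdExcessiveDatabaseGrowth
  expr: |
    predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h], 4*60*60)
      > etcd_server_quota_backend_bytes{job=~".*etcd.*"}
  for: 10m
  labels:
    severity: warning
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": database size is projected to exceed the backend quota within the next four hours.'
```

Because predict_linear() fits a trend over the window, a compaction-induced size drop simply lowers the projection instead of being misread as a counter reset.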

Comment 1 melbeher 2022-07-05 14:19:02 UTC
@wking This alert comes from upstream: https://github.com/etcd-io/etcd/blob/1ec3722ce5e911ac53495efd8d28099ec3478cad/contrib/mixin/mixin.libsonnet#L228-L238

So the change should be made there.  There was a previous attempt, but it did not land: https://github.com/etcd-io/etcd/issues/12550 https://github.com/etcd-io/etcd/pull/13223

May I ask how you want the new alert written, and what reason I should give upstream?

cc @spasq

Comment 2 W. Trevor King 2022-07-05 22:24:15 UTC
(In reply to melbeher from comment #1)
> So the change should be made there. There was a previous attempt but it did
> not land there https://github.com/etcd-io/etcd/issues/12550
> https://github.com/etcd-io/etcd/pull/13223
> 
> May I ask how you want the new alert, and what is the reason I should give
> upstream

Looks like that previous attempt was driven by the same "we should not use increase() on a non-counter" reasoning, and it just rotted out due to lack of interest.  I'd point out (again) to the upstream maintainers that what they recommend today uses increase() in a situation its own documentation clearly flags as a bad fit.  If they want to use deriv() like [1], that's fine with me.  If they want to use predict_linear() like [2], that's fine with me too.  There are plenty of examples of how they could do this, and plenty of flexibility to pick whatever they like best.  But leaving the invalid increase() use unfixed does not seem like a great long-term approach.

[1]: https://github.com/etcd-io/etcd/pull/13223
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
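For comparison, a deriv()-based variant in the spirit of [1] could look roughly like this.  Illustrative only: the window, the extrapolation factor, and the 50% threshold are made-up values for the example, not the text of that PR.

```yaml
# Illustrative only: extrapolate the fitted 4h growth rate over another
# four hours and alert if the projected growth is large relative to the
# current size.  All numbers here are examples, not the upstream proposal.
- alert: etcdExcessiveDatabaseGrowth
  expr: |
    deriv(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h]) * 4 * 60 * 60
      > 0.5 * etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}
  for: 10m
```

Unlike increase(), deriv() fits a least-squares slope over the window, so a compaction-induced drop lowers the slope instead of being unrolled as a counter reset.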

Comment 3 melbeher 2022-07-06 13:05:37 UTC
Hello @wking 

I raised a fix here https://github.com/etcd-io/etcd/pull/14196 

Let's hope it gets merged soon 

Thanks a lot

Comment 4 melbeher 2022-07-14 16:32:31 UTC
Upstream fix has been merged https://github.com/etcd-io/etcd/pull/14196.

Shall I cherry-pick it, or wait for the next rebase?

cc @wking

Comment 5 W. Trevor King 2022-07-14 21:57:04 UTC
I don't know when the next etcd rebase is likely to come around, but as long as it's just me on this bug, it's probably fine to wait for that rebase.  If we get external customers who are bothered by the increase() false positives, we can revisit and consider a cherry-pick and backports.

Comment 6 melbeher 2022-07-15 09:32:38 UTC
@wking I have cherry-picked it; let's see what the reviewers say :)

Comment 9 melbeher 2022-07-15 10:54:36 UTC
I have vendored the upstream alerts and opened a PR to bring the fix to cluster-etcd-operator.

Comment 11 Sandeep 2022-09-22 19:51:30 UTC
OCP version: 4.12.0-0.nightly-2022-09-22-014209

Steps followed:

Initial etcd DB size:

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  101 MB |      true |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+




oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "100708352"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "99790848"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "101232640"
        ]
      }
    ]
  }
}


Increased etcd DB size:

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  123 MB |     false |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  123 MB |      true |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  122 MB |     false |      false |         7 |     351550 |             351550 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+




oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122802176"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122961920"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122126336"
        ]
      }
    ]
  }
}

Comment 16 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

