Bug 2097073

Summary: etcdExcessiveDatabaseGrowth should not use increase() around gauge metrics
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Etcd Assignee: melbeher
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.9 CC: melbeher, skundu, spasquie, sreber
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:49:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2022-06-14 20:46:54 UTC
From [1]:

  increase should only be used with counters.

but etcdExcessiveDatabaseGrowth has used increase() since it landed [2], despite (etcd_mvcc_)db_total_size_in_bytes being a gauge [3].  This leads to false positives when things like compaction reduce the consumed DB size and increase() interprets the drop as "counter reset, so transparently unroll it".  We probably want a query based on predict_linear [4], like we do for NodeFilesystemSpaceFillingUp [5].

[1]: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
[2]: https://github.com/openshift/cluster-etcd-operator/blame/28a4ae406ff736b00af68c4f4d249319d62e48dd/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L117
[3]: https://github.com/etcd-io/etcd/blob/71bba3c761b0078c81c2b39781ec74853c458303/server/storage/mvcc/metrics.go#L156
[4]: https://prometheus.io/docs/prometheus/latest/querying/functions/#predict_linear
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
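
To illustrate the false positive described above, here is a minimal Python sketch of counter-reset handling applied to a gauge. It is a simplified model, not Prometheus's actual implementation: it skips the extrapolation that increase() performs, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def approx_increase(samples: Sequence[float]) -> float:
    """Simplified increase(): no extrapolation, only counter-reset handling.

    Every drop between consecutive samples is treated as a counter reset,
    so the pre-drop value is added back as a correction.
    """
    correction = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:  # a gauge going down looks like a counter reset
            correction += prev
    return samples[-1] - samples[0] + correction


def net_change(samples: Sequence[float]) -> float:
    """What actually happened to a gauge: last sample minus first sample."""
    return samples[-1] - samples[0]


# Hypothetical DB sizes in MB; the drop from 120 to 60 stands in for a
# size reduction such as the one described above.
db_size_mb = [100.0, 110.0, 120.0, 60.0, 70.0]

print(approx_increase(db_size_mb))  # counter-style "growth"
print(net_change(db_size_mb))       # real change in the gauge
```

With these made-up samples the counter-style calculation reports 90 MB of growth even though the database shrank by 30 MB, which is the kind of result that can trip a growth alert.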

Comment 1 melbeher 2022-07-05 14:19:02 UTC
@wking This alert is coming from upstream https://github.com/etcd-io/etcd/blob/1ec3722ce5e911ac53495efd8d28099ec3478cad/contrib/mixin/mixin.libsonnet#L228-L238 

So the change should be made there. There was a previous attempt, but it did not land: https://github.com/etcd-io/etcd/issues/12550 https://github.com/etcd-io/etcd/pull/13223

May I ask how you want the new alert to look, and what reason I should give upstream?

cc @spasq

Comment 2 W. Trevor King 2022-07-05 22:24:15 UTC
(In reply to melbeher from comment #1)
> So the change should be made there. There was a previous attempt but it did
> not land there https://github.com/etcd-io/etcd/issues/12550
> https://github.com/etcd-io/etcd/pull/13223
> 
> May I ask how you want the new alert, and what is the reason I should give
> upstream

Looks like that previous attempt was driven by the same "we should not use increase() for this non-counter argument" reasoning, and the previous attempt just rotted out due to lack of interest.  I'd just point out (again) to the upstream maintainers that what they are recommending now is using increase() in a situation that it is clearly documented as a bad fit for.  If they want to use deriv() like [1], that's fine with me.  If they want to use predict_linear like [2], that's fine with me too.  Plenty of examples of how they could do this, and plenty of flexibility to pick whatever they like best.  But leaving the invalid increase() use unfixed does not seem like a great long-term approach.

[1]: https://github.com/etcd-io/etcd/pull/13223
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/d4acf6add896cd0d3eafc396e52cb10deb7762eb/assets/node-exporter/prometheus-rule.yaml#L17-L34
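
Both alternatives mentioned above, deriv() and predict_linear(), are documented as being based on simple linear regression, which tolerates values that go down. The following Python sketch shows the idea with an ordinary least-squares fit. The helper names and samples are hypothetical, and the prediction is taken relative to the last sample rather than the query evaluation time, which is a simplification.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope per second, intercept)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def predict(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Value of the fitted line seconds_ahead after the last sample."""
    slope, intercept = fit_line(samples)
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical gauge samples (seconds, MB) with a drop at t=180.
samples = [(0.0, 100.0), (60.0, 110.0), (120.0, 120.0), (180.0, 60.0), (240.0, 70.0)]

slope, _ = fit_line(samples)
print(slope)                    # deriv()-style slope in MB per second
print(predict(samples, 3600))   # predict_linear()-style value one hour out
```

For these samples the fitted slope is negative, so a regression-based rule would not report growth for a window in which the gauge dropped.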

Comment 3 melbeher 2022-07-06 13:05:37 UTC
Hello @wking 

I raised a fix here https://github.com/etcd-io/etcd/pull/14196 

Let's hope it gets merged soon 

Thanks a lot

Comment 4 melbeher 2022-07-14 16:32:31 UTC
Upstream fix has been merged https://github.com/etcd-io/etcd/pull/14196.

Shall I cherry-pick it, or wait for the next rebase?

cc @wking

Comment 5 W. Trevor King 2022-07-14 21:57:04 UTC
I dunno when the next etcd rebase is likely to come around, but as long as it's just me on this bug, probably fine to wait on that rebase.  If we get some external customers who are bothered by the increase() false-positives, we can revisit and consider a pick and backports.

Comment 6 melbeher 2022-07-15 09:32:38 UTC
@wking I have cherry-picked it; let's see what the reviewers say :)

Comment 9 melbeher 2022-07-15 10:54:36 UTC
I have vendored the upstream alerts and opened a PR to bring the fix to cluster-etcd-operator.

Comment 11 Sandeep 2022-09-22 19:51:30 UTC
OCP version: 4.12.0-0.nightly-2022-09-22-014209

Steps followed:

Initial etcd DB size:

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  101 MB |      true |      false |         7 |     329705 |             329705 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  102 MB |     false |      false |         7 |     329705 |             329705 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+




oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   886    0   886    0     0  15543      0 --:--:-- --:--:-- --:--:-- 15543
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "100708352"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "99790848"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663874038.431,
          "101232640"
        ]
      }
    ]
  }
}


Increased etcd db size

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | df3331c2c5b4de7e |   3.5.3 |  123 MB |     false |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.4:2379 |  4e51fdb1568937b |   3.5.3 |  123 MB |      true |      false |         7 |     351550 |             351550 |        |
| https://10.0.0.5:2379 | a891485ea9688125 |   3.5.3 |  122 MB |     false |      false |         7 |     351550 |             351550 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+




oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~iSBvyqtIJHsdj0sMeL5XCm7fq1iuMmxZjnW2jV6hg60" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_mvcc_db_total_size_in_bytes' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   887    0   887    0     0  15561      0 --:--:-- --:--:-- --:--:-- 15561
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.3:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-1.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122802176"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.4:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-0.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122961920"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_mvcc_db_total_size_in_bytes",
          "endpoint": "etcd-metrics",
          "instance": "10.0.0.5:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-skundu-gcp-ver-74z85-master-2.c.openshift-qe.internal",
          "service": "etcd"
        },
        "value": [
          1663876172.054,
          "122126336"
        ]
      }
    ]
  }
}
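
As a rough cross-check of the two query results above, the following Python sketch computes the per-member growth between the samples. The byte values and timestamps are copied from the output above; the variable names are illustrative only.

```python
# etcd_mvcc_db_total_size_in_bytes per instance, from the two queries above.
before = {
    "10.0.0.3:9979": 100708352,
    "10.0.0.4:9979": 99790848,
    "10.0.0.5:9979": 101232640,
}
after = {
    "10.0.0.3:9979": 122802176,
    "10.0.0.4:9979": 122961920,
    "10.0.0.5:9979": 122126336,
}

# Sample timestamps (Unix seconds) reported with each query result.
t_before = 1663874038.431
t_after = 1663876172.054
elapsed = t_after - t_before

for instance in before:
    delta = after[instance] - before[instance]
    print(f"{instance}: +{delta} bytes over {elapsed:.0f}s "
          f"({delta / elapsed:.0f} bytes/s)")
```

Each member grew by roughly 21 to 23 MB in about 36 minutes between the two samples.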

Comment 16 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399