Bug 2013646 - fsync controller will show false positive if gaps in metrics are observed.
Summary: fsync controller will show false positive if gaps in metrics are observed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nobody
QA Contact: Sandeep
URL:
Whiteboard:
Depends On:
Blocks: 2008175
Reported: 2021-10-13 13:02 UTC by Lili Cosic
Modified: 2022-03-10 16:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2008175
Environment:
Last Closed: 2022-03-10 16:19:33 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:19:54 UTC

Description Lili Cosic 2021-10-13 13:02:17 UTC
+++ This bug was initially created as a clone of Bug #2008175 +++

Description of problem: If the fsync controller observes gaps in metrics, those gaps can produce false positives for leader election growth over the time series. This can send an admin down an invalid path of etcd performance triage.

> Detected leader change increase of 1.25 over 5 minutes on "BareMetal"; disk metrics are: etcd-nfvpe-12.oot.lab.eng.bos.redhat.com=0.0025688888888888875,etcd-nfvpe-13.oot.lab.eng.bos.redhat.com=0.0018824444444444331,etcd-nfvpe-02.oot.lab.eng.bos.redhat.com=0.001912424242424239
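When triaging, it can help to pull the reported increase and the per-member disk metrics out of the event text. The following is a minimal sketch, assuming every event follows the format of the single message quoted above; the pattern and the function name `parse_leader_change_event` are illustrative and not part of cluster-etcd-operator.

```python
import re

# Pattern inferred from the one event message quoted in this report.
# Other operator versions may word the event differently (assumption).
EVENT_RE = re.compile(
    r'Detected leader change increase of (?P<increase>[0-9.]+) '
    r'over (?P<window>.+?) on "(?P<platform>[^"]+)"; '
    r'disk metrics are: (?P<metrics>.+)'
)


def parse_leader_change_event(message):
    """Return the fields of a leader-change event, or None if it does not match."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    disk_metrics = {}
    for pair in match.group("metrics").split(","):
        # Each pair looks like "<etcd member name>=<float>".
        name, _, value = pair.strip().rpartition("=")
        disk_metrics[name] = float(value)
    return {
        "increase": float(match.group("increase")),
        "window": match.group("window"),
        "platform": match.group("platform"),
        "disk_metrics": disk_metrics,
    }
```

A reported increase alone does not confirm a real election; the parsed value still has to be checked against the raw counter.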


The image below shows leader elections (green), with spikes in leader elections (blue) as a result of the query.

https://user-images.githubusercontent.com/1249749/134918692-43ca6cf0-4fb0-4ee3-8735-f5d035d56170.png

This metric query clearly shows gaps in the collection of metrics, possibly caused by networking issues.

https://user-images.githubusercontent.com/1249749/134919312-f62591ee-57df-41de-a313-545e207d89ad.png

Version-Release number of selected component (if applicable):


How reproducible: 100%, given the condition described above


Steps to Reproduce:
1. Generate gaps in metrics collection by disrupting networking.
2. Review events from the cluster-etcd-operator namespace for "Detected leader change increase" events.
3. Verify with `etcd_server_leader_changes_seen_total` that etcd is not actually experiencing leader elections.

Actual results: Leader elections are falsely reported by the fsync controller.


Expected results: The fsync controller should emit an event only when the issue is actually observed.
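One way to picture the expected behavior is a guard that refuses to report an increase when the sample window contains a collection gap. The sketch below is only an illustration under stated assumptions: the function names, the `tolerance` factor, and the `(timestamp, value)` sample layout are hypothetical, and this is not the fix that shipped in the operator, which this report does not describe.

```python
def has_gap(timestamps, scrape_interval, tolerance=1.5):
    """True if two consecutive samples are more than tolerance * scrape_interval apart."""
    return any(
        later - earlier > tolerance * scrape_interval
        for earlier, later in zip(timestamps, timestamps[1:])
    )


def counter_increase(values):
    """Total increase of a counter; a drop is treated as a counter reset."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        total += (current - previous) if current >= previous else current
    return total


def should_emit_leader_change_event(samples, scrape_interval):
    """Decide whether to report a leader change.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns False when the data is too sparse to trust.
    """
    if len(samples) < 2:
        return False
    timestamps = [t for t, _ in samples]
    values = [v for _, v in samples]
    if has_gap(timestamps, scrape_interval):
        # Missing scrapes: the increase cannot be attributed reliably.
        return False
    return counter_increase(values) > 0
```

For example, with a 30-second scrape interval, samples at 0, 30, 180, and 210 seconds would be rejected because of the 150-second hole, even if the counter value differs on either side of it.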


Additional info:

Comment 4 Sandeep 2022-02-17 12:06:31 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-16-171622   True        False         5h23m   Cluster version is 4.10.0-0.nightly-2022-02-16-171622


sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | 6951e8940830cac5 |   3.5.0 |   78 MB |      true |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.4:2379 | 5ce2c3932bb984e2 |   3.5.0 |   78 MB |     false |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.5:2379 | 41849292a8dfd31b |   3.5.0 |   78 MB |     false |      false |        10 |     172449 |             172449 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Changed the leader a few times (more than two).
sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | 6951e8940830cac5 |   3.5.0 |   78 MB |     false |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.4:2379 | 5ce2c3932bb984e2 |   3.5.0 |   78 MB |      true |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.5:2379 | 41849292a8dfd31b |   3.5.0 |   78 MB |     false |      false |        10 |     172449 |             172449 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+


The etcd_server_leader_changes_seen_total values are getting updated.


$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~lIpa1V2jbdLyrNBeRvihrXphl-b5mglXOmAsyEKYM9o" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_server_leader_changes_seen_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851    0   851    0     0  47277      0 --:--:-- --:--:-- --:--:-- 47277
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.141.21:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-141-21.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "3"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.170.2:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-170-2.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.220.249:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-220-249.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "2"
        ]
      }
    ]
  }
}



$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~lIpa1V2jbdLyrNBeRvihrXphl-b5mglXOmAsyEKYM9o" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_server_leader_changes_seen_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851    0   851    0     0  56733      0 --:--:-- --:--:-- --:--:-- 56733
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.141.21:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-141-21.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.170.2:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-170-2.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "2"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.220.249:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-220-249.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "3"
        ]
      }
    ]
  }
}
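The two query responses above can be compared per pod instead of by eye. This is a minimal sketch, assuming both responses have already been loaded into dictionaries (for example with `json.loads`); the helper names are illustrative. A value that drops between snapshots is treated as a counter reset, following the usual Prometheus counter convention. Whether a member actually restarted here is not stated in this comment.

```python
def leader_changes_by_pod(response):
    """Map pod name -> counter value from a Prometheus instant-query response."""
    return {
        item["metric"]["pod"]: float(item["value"][1])
        for item in response["data"]["result"]
    }


def leader_change_deltas(before, after):
    """Per-pod change in etcd_server_leader_changes_seen_total between two snapshots."""
    old = leader_changes_by_pod(before)
    new = leader_changes_by_pod(after)
    deltas = {}
    for pod, value in new.items():
        previous = old.get(pod)
        if previous is None:
            continue  # pod absent from the earlier snapshot
        # A drop is read as a counter reset, so the new value is the increase.
        deltas[pod] = (value - previous) if value >= previous else value
    return deltas
```

Applied to the two responses above, this yields an increase of 1 for each of the three pods, with the pod that went from 3 to 1 counted under the reset assumption.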

Comment 6 errata-xmlrpc 2022-03-10 16:19:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

