+++ This bug was initially created as a clone of Bug #2008175 +++

Description of problem:
If the fsync controller observes gaps in metrics, those gaps can produce false positives for leader election growth over the time series. This can send an admin down an invalid path of etcd performance triage.

> Detected leader change increase of 1.25 over 5 minutes on "BareMetal"; disk metrics are: etcd-nfvpe-12.oot.lab.eng.bos.redhat.com=0.0025688888888888875,etcd-nfvpe-13.oot.lab.eng.bos.redhat.com=0.0018824444444444331,etcd-nfvpe-02.oot.lab.eng.bos.redhat.com=0.001912424242424239

The image below shows leader elections (green) with spikes in leader elections (blue) as a result of the query.
https://user-images.githubusercontent.com/1249749/134918692-43ca6cf0-4fb0-4ee3-8735-f5d035d56170.png

This metric query clearly shows the gaps in the collection of metrics, possibly caused by networking issues.
https://user-images.githubusercontent.com/1249749/134919312-f62591ee-57df-41de-a313-545e207d89ad.png

Version-Release number of selected component (if applicable):

How reproducible:
100%, given the condition

Steps to Reproduce:
1. Generate gaps in metrics collection by disrupting networking.
2. Review events from the cluster-etcd-operator namespace for "Detected leader change increase" events.
3. Verify that etcd is not actually experiencing leader elections by checking `etcd_server_leader_changes_seen_total`.

Actual results:
Leader elections are falsely reported by the fsync controller.

Expected results:
The fsync controller should emit an event only when it has actually observed the issue.

Additional info:
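To illustrate the expected behaviour, here is a minimal sketch of a gap-aware increase check. It is not the operator's actual fix. The function names, the 30-second scrape interval, and the 1.5x tolerance are all assumptions made for illustration. The idea is to report an increase only when the samples in the window are contiguous, and to report nothing when a scrape was likely missed.

```python
from typing import List, Optional, Tuple

# (unix timestamp in seconds, counter value); hypothetical representation
Sample = Tuple[float, float]


def has_gap(samples: List[Sample], scrape_interval: float,
            tolerance: float = 1.5) -> bool:
    """Return True if any two consecutive samples are further apart than
    tolerance * scrape_interval, i.e. at least one scrape was likely missed."""
    return any(
        later[0] - earlier[0] > tolerance * scrape_interval
        for earlier, later in zip(samples, samples[1:])
    )


def leader_change_increase(samples: List[Sample],
                           scrape_interval: float) -> Optional[float]:
    """Raw counter increase over the window, or None when the window has
    too few samples or contains a collection gap."""
    if len(samples) < 2 or has_gap(samples, scrape_interval):
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in a counter is treated as a reset: count the new value from zero.
        increase += cur - prev if cur >= prev else cur
    return increase


# Contiguous samples every 30s (assumed interval): one real leader change.
complete = [(0.0, 2.0), (30.0, 2.0), (60.0, 2.0), (90.0, 3.0)]
# Same counter values, but two scrapes are missing between t=30 and t=150.
gappy = [(0.0, 2.0), (30.0, 2.0), (150.0, 3.0)]

print(leader_change_increase(complete, 30.0))
print(leader_change_increase(gappy, 30.0))
```

Returning `None` rather than `0` keeps "no reliable data" distinct from "no leader changes", so a caller can skip emitting an event instead of guessing.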
oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-16-171622   True        False         5h23m   Cluster version is 4.10.0-0.nightly-2022-02-16-171622

sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | 6951e8940830cac5 |   3.5.0 |   78 MB |      true |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.4:2379 | 5ce2c3932bb984e2 |   3.5.0 |   78 MB |     false |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.5:2379 | 41849292a8dfd31b |   3.5.0 |   78 MB |     false |      false |        10 |     172449 |             172449 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Changed the leader a few times (more than twice).
sh-4.4# etcdctl endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.3:2379 | 6951e8940830cac5 |   3.5.0 |   78 MB |     false |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.4:2379 | 5ce2c3932bb984e2 |   3.5.0 |   78 MB |      true |      false |        10 |     172448 |             172448 |        |
| https://10.0.0.5:2379 | 41849292a8dfd31b |   3.5.0 |   78 MB |     false |      false |        10 |     172449 |             172449 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The etcd_server_leader_changes_seen_total metric is being updated, as the two queries below show.

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~lIpa1V2jbdLyrNBeRvihrXphl-b5mglXOmAsyEKYM9o" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_server_leader_changes_seen_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851    0   851    0     0  47277      0 --:--:-- --:--:-- --:--:-- 47277
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.141.21:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-141-21.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "3"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.170.2:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-170-2.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.220.249:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-220-249.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311116.534,
          "2"
        ]
      }
    ]
  }
}

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer sha256~lIpa1V2jbdLyrNBeRvihrXphl-b5mglXOmAsyEKYM9o" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=etcd_server_leader_changes_seen_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851    0   851    0     0  56733      0 --:--:-- --:--:-- --:--:-- 56733
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.141.21:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-141-21.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.170.2:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-170-2.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "2"
        ]
      },
      {
        "metric": {
          "__name__": "etcd_server_leader_changes_seen_total",
          "endpoint": "etcd-metrics",
          "instance": "10.0.220.249:9979",
          "job": "etcd",
          "namespace": "openshift-etcd",
          "pod": "etcd-ip-10-0-220-249.us-east-2.compute.internal",
          "service": "etcd"
        },
        "value": [
          1637311282.715,
          "3"
        ]
      }
    ]
  }
}
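Comparing the two query responses by eye is error-prone, so here is a small sketch that computes the per-instance change between two saved responses. The helper names are hypothetical. It assumes only the instant-query response shape shown above, and it treats a decreasing counter as a reset, which is how Prometheus itself handles counters.

```python
import json
from typing import Dict


def values_by_instance(response_text: str) -> Dict[str, float]:
    """Map each 'instance' label to its sample value from an instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query did not succeed: %r" % body.get("status"))
    return {
        result["metric"]["instance"]: float(result["value"][1])
        for result in body["data"]["result"]
    }


def counter_deltas(before: Dict[str, float],
                   after: Dict[str, float]) -> Dict[str, float]:
    """Per-instance increase between two snapshots of a counter.

    Instances missing from the first snapshot are skipped. A lower value in
    the second snapshot is treated as a counter reset, so the new value
    itself is taken as the increase.
    """
    deltas = {}
    for instance, new in after.items():
        old = before.get(instance)
        if old is None:
            continue
        deltas[instance] = new - old if new >= old else new
    return deltas


def changed_instances(first_response: str, second_response: str) -> Dict[str, float]:
    """Instances whose leader-change counter moved between the two responses."""
    deltas = counter_deltas(values_by_instance(first_response),
                            values_by_instance(second_response))
    return {instance: d for instance, d in deltas.items() if d > 0}
```

With the two responses above, the counter for 10.0.141.21:9979 goes from 3 to 1, which this sketch interprets as a reset (for example, a restarted etcd process) rather than a negative change.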
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056