Created attachment 1754946 [details]
passing prometheus graph

Description of problem:

Recently this test case ([sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]) started running in our serial jobs, and it has failed twice in the first 5 jobs. The test case checks Prometheus metrics to see whether any etcd leader change has happened since the test suite started:

https://github.com/openshift/origin/blob/d2976875ee17fb8561201cbd0bfef3165f567e4b/test/extended/etcd/leader_changes.go#L27

From what I can tell the test seems to be reporting a failure when it should not, but I'm not the expert here.

Two examples:

1) Failing job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568

Looking at the two etcd operator logs, you can see that the leadership changed twice before the install finished at 01:56:58:

  curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
  time="2021-02-03T01:56:58Z" level=debug msg="Cluster is initialized"

The 3 leadership elections (the initial election and then two changes) are easy to see from the timestamps in events.json:

  "lastTimestamp": "2021-02-03T01:36:57Z",
  "message": "etcd-operator-9b49d6d9f-f4rfv_f8952b5c-084d-4980-ac26-ec94cdaa534f became leader",

  "lastTimestamp": "2021-02-03T01:38:18Z",
  "message": "etcd-operator-9b49d6d9f-f4rfv_12f89afd-51f3-4122-8234-fd776523d01c became leader",

  "lastTimestamp": "2021-02-03T01:45:02Z",
  "message": "etcd-operator-9b49d6d9f-f4rfv_6be59167-e1ff-4888-a0f2-f1cfd397ee0e became leader",

All of these leadership events happened before the cluster was finally initialized, and before the test case window, which is the 56m11s leading up to the check at 02:53:22.

Test case failure output:

  fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:31]: Unexpected error:
      <*errors.errorString | 0xc002265630>: {
          s: "Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
      }
      Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics
  occurred

2) Passing job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384

Again, the two etcd operator logs show the leadership changes. Cluster install finished at 17:43:30:

  curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
  time="2021-02-03T17:43:30Z" level=debug msg="Cluster is initialized"

  "lastTimestamp": "2021-02-03T17:22:04Z",
  "message": "etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader",

  "lastTimestamp": "2021-02-03T17:23:11Z",
  "message": "etcd-operator-676577769f-955pf_0a2a9c31-392f-41c3-b1ff-146a06c2357a became leader",

  "lastTimestamp": "2021-02-03T17:29:04Z",
  "message": "etcd-operator-676577769f-955pf_e5b60caa-3746-49d8-83ea-70ff99db6a8f became leader",

  "lastTimestamp": "2021-02-03T17:32:56Z",
  "message": "etcd-operator-676577769f-955pf_1caa5195-b37b-4d01-ad58-b288a3ce63d7 became leader"

I'm not following why there are 4 leader events in events.json, but you can see that all leader changes finished before the cluster was initialized, which is before the window the test case checks.
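As an aside on the odd-looking "3.0066839285714284": PromQL's increase() extrapolates, scaling the raw counter delta by the ratio of the requested window to the time span the samples actually cover, so a clean delta of 3 can come out slightly above 3. A minimal sketch of that arithmetic (the 3363.5s sample span below is a hypothetical value chosen for illustration, not taken from the job; Prometheus's real extrapolation also has extra boundary handling):

```shell
# increase() approximately computes: raw_delta * window / covered_span
raw_delta=3        # the counter actually went up by exactly 3
window=3371        # the 56m11s query window, in seconds
span=3363.5        # hypothetical: seconds between first and last sample in the window
awk -v d="$raw_delta" -v w="$window" -v s="$span" \
  'BEGIN { printf "%.4f\n", d * w / s }'
```

So a fractional result by itself doesn't mean a partial leader change was observed.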
I tried pulling the prometheus.tar files (links below) and locally running queries similar to the ones I see in the test code. I don't know if I got it right. I used this query for both:

  increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])

with a 1h range and the time set to 2021-02-03 02:53:22 for the failing case and 2021-02-03 18:37:39 for the passing case. I'm not quite sure what it means, but the passing case graph did bump up to 3 and eventually fall back down to 0, whereas the failing case stayed at 3. I may have the ranges and times wrong, but I tried to match what the code would have executed. (Screenshots attached.)

failing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/metrics/prometheus.tar

passing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/metrics/prometheus.tar
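For anyone else reproducing this outside the graph UI: the same instant query can be issued against the Prometheus HTTP API. A sketch, assuming a Prometheus restored from one of the prometheus.tar files is listening on localhost:9090 (an assumption; the block only prints the curl invocation rather than running it):

```shell
# The query from above; curl -G --data-urlencode takes care of the URL escaping.
QUERY='increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])'

# Evaluation time for the failing case; use 2021-02-03T18:37:39Z for the passing one.
CMD="curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode \"query=${QUERY}\" --data-urlencode 'time=2021-02-03T02:53:22Z'"
echo "$CMD"
```

This avoids having to eyeball the range/time selectors in the graph UI.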
Created attachment 1754947 [details]
failing prometheus graph
First of all, great work writing a very detailed bug on a complicated topic.

re: https://bugzilla.redhat.com/show_bug.cgi?id=1924953#c1

The fact that the leader changes are observable while Prometheus is up is a real concern.

> etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader

The issue you're finding here is that these are different leader elections. The metric tracks leader elections of the operand (etcd itself), while those log events refer to leader elections of the operator. The leader election test failing on Azure serial jobs might be somewhat expected, but we need to understand this better. The team will take a look and report back.
The machine set test can wipe the Prometheus emptyDirs (the default config), which causes the query to return the pre-start count. Instead, change the query to only check over the time the currently running Prometheus instance has been up (ignoring changes recorded before it started). I will be turning on PVs for most Prometheus IPI clusters, but it's still valid for some IPI platforms to have no dynamic volume provisioning, so the test must change.
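One possible shape for that suggestion (a sketch only, not the actual patch; the timestamps below are hypothetical) is to derive the query window from the current Prometheus instance's own uptime, e.g. from its process_start_time_seconds metric, and substitute it into the range, so increments recorded before the instance started are never counted:

```shell
# Hypothetical values: in the real test these would come from querying
# process_start_time_seconds on the running Prometheus instance.
prom_start=1612315200            # when the current Prometheus process started
now=1612318571                   # evaluation time of the check
window=$(( now - prom_start ))   # only look back over this instance's lifetime

QUERY="increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[${window}s:15s])"
echo "$QUERY"
```

With the window clamped like this, a wiped emptyDir can no longer make pre-start leader changes show up inside the test's observation period.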
I watched the most recent e2e job and this error did not occur, so I'm closing this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438