Bug 1924953 - newly added 'excessive etcd leader changes' test case failing in serial job
Summary: newly added 'excessive etcd leader changes' test case failing in serial job
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Clayton Coleman
QA Contact: ge liu
Depends On:
Blocks: 1926556 1926577
Reported: 2021-02-04 01:14 UTC by jamo luhrsen
Modified: 2021-07-27 22:41 UTC
CC: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1926556
Last Closed: 2021-07-27 22:40:58 UTC
Target Upstream Version:

Attachments:
  passing prometheus graph (77.10 KB, image/png), 2021-02-04 01:14 UTC, jamo luhrsen
  failing prometheus graph (76.18 KB, image/png), 2021-02-04 01:14 UTC, jamo luhrsen

Links:
  Github openshift origin pull 25870 (closed): Bug 1924953: Allow prometheus state clear to pass etcd leader change test (last updated 2021-02-20 09:13:21 UTC)
  Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:41:22 UTC)

Description jamo luhrsen 2021-02-04 01:14:21 UTC
Created attachment 1754946 [details]
passing prometheus graph

Description of problem:

This test case ([sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]) recently
started running in our serial jobs, and it has failed twice in the first 5 jobs.

The test case checks Prometheus metrics to see whether any etcd leader change has happened since
the test suite started.
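
The underlying metric, etcd_server_leader_changes_seen_total, is a per-member counter, so the check essentially asks how much that counter increased over the suite's window. In its simplest form that is a query along these lines (a simplified sketch; the exact query from the test code is quoted later in this report):

  # leader changes observed by the busiest etcd member over the test window
  max(increase(etcd_server_leader_changes_seen_total[56m11s]))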


From what I can tell, the test seems to be reporting a failure when it should not, but I'm not the expert here.

two examples:

Failing job:

looking at the two etcd operator logs, you can see that leadership changed twice before the install
finished at 01:56:58:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T01:56:58Z" level=debug msg="Cluster is initialized"

the 3 leadership elections (the initial one, then two changes) have easy-to-spot timestamps in events.json:
            "lastTimestamp": "2021-02-03T01:36:57Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_f8952b5c-084d-4980-ac26-ec94cdaa534f became leader",
            "lastTimestamp": "2021-02-03T01:38:18Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_12f89afd-51f3-4122-8234-fd776523d01c became leader",
            "lastTimestamp": "2021-02-03T01:45:02Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_6be59167-e1ff-4888-a0f2-f1cfd397ee0e became leader",

all of these leadership events happened before the cluster was finally initialized and before the test
case window, which is the 56m11s leading up to the check at 02:53:22. (56m11s before 02:53:22 is 01:57:11,
so the window opens just after the install completed at 01:56:58, and none of the three elections fall inside it.)

test case failure output:

fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:31]: Unexpected error:
    <*errors.errorString | 0xc002265630>: {
        s: "Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
    Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics

Passing job:

again, we can see two leadership changes in the two etcd operator logs. Cluster install finished at 17:43:30:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T17:43:30Z" level=debug msg="Cluster is initialized"

            "lastTimestamp": "2021-02-03T17:22:04Z",
            "message": "etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader",
            "lastTimestamp": "2021-02-03T17:23:11Z",
            "message": "etcd-operator-676577769f-955pf_0a2a9c31-392f-41c3-b1ff-146a06c2357a became leader",
            "lastTimestamp": "2021-02-03T17:29:04Z",
            "message": "etcd-operator-676577769f-955pf_e5b60caa-3746-49d8-83ea-70ff99db6a8f became leader",
            "lastTimestamp": "2021-02-03T17:32:56Z",
            "message": "etcd-operator-676577769f-955pf_1caa5195-b37b-4d01-ad58-b288a3ce63d7 became leader"

I'm not following why there are 4 leader events in events.json, but you can see that all leader changes
finished before the cluster was initialized, which is before the window the test case will
be checking.

I tried pulling the prometheus.tar files (links below) and locally running queries similar to the ones I see being
done in the test code. I don't know if I got it right. I used this query for both:
  increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])
with a 1h range, and the time set to 2021-02-03 02:53:22 for the failing case and 2021-02-03 18:37:39 for the passing case.
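
If it helps reproduce this, the same instant query can be run against a local Prometheus serving the extracted snapshot through the standard HTTP query API (a sketch; the localhost:9090 address is an assumption about the local setup, and the query string is copied from above):

  # evaluate the test's query at the failing run's check time; swap the time
  # for 2021-02-03T18:37:39Z to replay the passing case
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])' \
    --data-urlencode 'time=2021-02-03T02:53:22Z'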

I'm not quite sure what it means, but the passing case's graph bumped up to 3 and eventually fell back
down to 0, whereas the failing case stayed at 3. I may have the ranges and times
wrong, but I tried to match what the code would have executed (screenshots attached).

failing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/metrics/prometheus.tar
passing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/metrics/prometheus.tar

Comment 1 jamo luhrsen 2021-02-04 01:14:45 UTC
Created attachment 1754947 [details]
failing prometheus graph

Comment 2 Sam Batschelet 2021-02-04 01:27:39 UTC
First of all great work on writing a very detailed bug on a complicated topic.

re: https://bugzilla.redhat.com/show_bug.cgi?id=1924953#c1 the fact that the leader changes are observable while Prometheus is up is a real concern.

> etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader

The issue you're finding here is that these are different leader elections. The metrics track the operand (etcd) leader elections, while those log messages refer to operator leader elections.
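
To make the distinction concrete: the quoted "became leader" messages come from client-go's leader-election machinery in the operator, which records Kubernetes events, while etcd's own leader changes only surface through the metric the test queries. A quick way to see the operator-side events (the namespace is an assumption about this deployment):

  # operator (control-loop) leader elections are recorded as events with
  # reason "LeaderElection" by client-go
  oc get events -n openshift-etcd-operator --field-selector reason=LeaderElection
  # operand (etcd member) leader changes are only visible via the
  # etcd_server_leader_changes_seen_total metric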

Azure Serial failing the leader election test might be somewhat expected, but we need to understand this better. The team will take a look and report back.

Comment 3 Clayton Coleman 2021-02-09 02:58:40 UTC
The machine set test can wipe prometheus emptyDirs (the default config), which causes the query to return the pre-start count. Instead, change the query to only check over the time duration of the running prometheus instance (ignoring changes prior to start). I will be turning on PVs for most prometheus IPI clusters, but it is still valid on some IPI platforms to have no dynamic volume provisioning, so the test must change.
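
One way to express "only check over the running instance's lifetime" (a sketch of the idea only; the actual change is in the linked PR, and the endpoint and job label here are assumptions):

  # cap the lookback at the Prometheus instance's own age so counter state
  # from before a restart (e.g. a wiped emptyDir) is never counted
  START=$(curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=process_start_time_seconds{job="prometheus-k8s"}' \
    | jq -r '.data.result[0].value[1]')
  AGE=$(( $(date +%s) - ${START%.*} ))   # seconds this instance has been up
  WINDOW=3371                            # the 56m11s test window, in seconds
  [ "$AGE" -lt "$WINDOW" ] && WINDOW=$AGE
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode "query=increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[${WINDOW}s:15s])"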

Comment 5 ge liu 2021-02-10 12:15:20 UTC
Watched the latest e2e job; this error did not occur, so closing it.

Comment 9 errata-xmlrpc 2021-07-27 22:40:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.
