Bug 1926577 - newly added 'excessive etcd leader changes' test case failing in serial job
Summary: newly added 'excessive etcd leader changes' test case failing in serial job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Clayton Coleman
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1924953
Blocks:
 
Reported: 2021-02-09 05:47 UTC by OpenShift BugZilla Robot
Modified: 2021-05-24 17:15 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-24 17:14:37 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/origin pull 25871 (closed): Bug 1926556: Allow prometheus state clear to pass etcd leader change test (last updated 2021-02-15 21:16:10 UTC)
Red Hat Product Errata RHSA-2021:1561 (last updated 2021-05-24 17:15:10 UTC)

Description OpenShift BugZilla Robot 2021-02-09 05:47:13 UTC
+++ This bug was initially created as a clone of Bug #1924953 +++

Created attachment 1754946 [details]
passing prometheus graph

Description of problem:

Recently, this test case ([sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]) started running in our serial jobs, and it has failed twice in the first 5 jobs.

The test case checks Prometheus metrics to see whether any etcd leader change has happened since the test suite started.

https://github.com/openshift/origin/blob/d2976875ee17fb8561201cbd0bfef3165f567e4b/test/extended/etcd/leader_changes.go#L27
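
As a rough illustration, here is a minimal Go sketch of the kind of check the test performs, assuming an in-cluster Prometheus reachable over HTTP: ask how much the etcd_server_leader_changes_seen_total counter increased over the suite window and flag any non-zero result. The promURL value and the plumbing are illustrative assumptions, not the actual origin code (which is in the linked leader_changes.go); the query string mirrors the one quoted later in this report.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strconv"
)

// promResult mirrors just the part of Prometheus's query API response we need.
type promResult struct {
    Data struct {
        Result []struct {
            Value [2]interface{} `json:"value"` // [unix timestamp, value as string]
        } `json:"result"`
    } `json:"data"`
}

// leaderChanges asks Prometheus how much the leader-change counter increased
// over the given window (e.g. "56m11s").
func leaderChanges(promURL, window string) (float64, error) {
    q := fmt.Sprintf(`increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[%s:15s])`, window)
    resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(q))
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var r promResult
    if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
        return 0, err
    }
    if len(r.Data.Result) == 0 {
        return 0, fmt.Errorf("no samples returned for %q", q)
    }
    s, ok := r.Data.Result[0].Value[1].(string)
    if !ok {
        return 0, fmt.Errorf("unexpected value type in response")
    }
    return strconv.ParseFloat(s, 64)
}

func main() {
    // promURL is an assumption; in CI the test talks to the in-cluster Prometheus.
    changes, err := leaderChanges("http://localhost:9090", "56m11s")
    if err != nil {
        panic(err)
    }
    if changes > 0 {
        fmt.Printf("Observed %v leader changes in the window: review etcd performance metrics\n", changes)
    }
}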

From what I can tell, the test seems to be reporting a failure when it should not, but I'm not the expert here.


two examples:

1)
Failing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568

Looking at the two etcd operator logs, you can see that the leadership changed twice before the install finished at 01:56:58:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T01:56:58Z" level=debug msg="Cluster is initialized"

The three leadership elections (the initial election plus two changes) are easy to pick out by timestamp in events.json:
            "lastTimestamp": "2021-02-03T01:36:57Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_f8952b5c-084d-4980-ac26-ec94cdaa534f became leader",
            "lastTimestamp": "2021-02-03T01:38:18Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_12f89afd-51f3-4122-8234-fd776523d01c became leader",
            "lastTimestamp": "2021-02-03T01:45:02Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_6be59167-e1ff-4888-a0f2-f1cfd397ee0e became leader",

All of these leadership events happened before the cluster finished initializing and before the test case's window, which is the 56m11s leading up to the check at 02:53:22.

test case failure output:

fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:31]: Unexpected error:
    <*errors.errorString | 0xc002265630>: {
        s: "Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
    }
    Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics
occurred
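
(Aside: the non-integer 3.0066... is expected behavior from PromQL's increase(), which extrapolates the delta between the first and last samples it actually sees out to the edges of the requested window. A toy Go calculation, with a made-up sample span, shows the shape of the arithmetic:)

package main

import "fmt"

func main() {
    const (
        rawDelta      = 3.0    // the counter actually rose by exactly 3
        windowSeconds = 3371.0 // the requested window, 56m11s
        coveredSpan   = 3363.5 // hypothetical seconds between first and last sample
    )
    // Simplified form of the extrapolation: scale the observed delta up to
    // cover the whole window (Prometheus is slightly more careful at the edges).
    fmt.Println(rawDelta * windowSeconds / coveredSpan) // ~3.0067
}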



2)
Passing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384

Again, the two etcd operator logs show two leadership changes. The cluster install finished at 17:43:30:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T17:43:30Z" level=debug msg="Cluster is initialized"


            "lastTimestamp": "2021-02-03T17:22:04Z",
            "message": "etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader",
            "lastTimestamp": "2021-02-03T17:23:11Z",
            "message": "etcd-operator-676577769f-955pf_0a2a9c31-392f-41c3-b1ff-146a06c2357a became leader",
            "lastTimestamp": "2021-02-03T17:29:04Z",
            "message": "etcd-operator-676577769f-955pf_e5b60caa-3746-49d8-83ea-70ff99db6a8f became leader",
            "lastTimestamp": "2021-02-03T17:32:56Z",
            "message": "etcd-operator-676577769f-955pf_1caa5195-b37b-4d01-ad58-b288a3ce63d7 became leader"

I'm not following why there are 4 leader events in events.json, but you can see that all leader changes finished before the cluster was initialized, which is before the window the test case will be checking.


I tried pulling the prometheus.tar files (links below) and running queries locally similar to the ones I see in the test code. I don't know if I got it right. I used this query for both:
  increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])
  with a 1h range and the time set to 2021-02-03 02:53:22 for the failing case and 2021-02-03 18:37:39 for the passing case.

I'm not quite sure what it means, but the passing case's graph did bump up to 3 and eventually fall back down to 0, whereas the failing case's stayed at 3. I may have the range and time wrong, but I tried to match what the code would have executed. (Screenshots attached.)


failing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/metrics/prometheus.tar
passing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/metrics/prometheus.tar

--- Additional comment from jluhrsen on 2021-02-04 01:14:45 UTC ---

Created attachment 1754947 [details]
failing prometheus graph

--- Additional comment from sbatsche on 2021-02-04 01:27:39 UTC ---

First of all, great work on writing a very detailed bug on a complicated topic.

re: https://bugzilla.redhat.com/show_bug.cgi?id=1924953#c1 the fact that the leader changes are observable while Prometheus is up is a real concern.

> etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader

The issue you're finding here is that these are different leader elections. The metric tracks the operand's (etcd's) leader elections, whereas those log messages refer to the operator's own leader elections.

The Azure serial job failing the leader election test might be somewhat expected, but we need to understand this better. The team will take a look and report back.

--- Additional comment from ccoleman on 2021-02-09 02:58:40 UTC ---

The machine set test can wipe Prometheus's emptyDirs (the default config), which causes the query to return the pre-start count. Instead, change the query to check only over the duration the running Prometheus instance has been up (ignoring changes prior to its start). I will be turning on PVs for most Prometheus IPI clusters, but it is still valid for some IPI platforms to have no dynamic volume provisioning, so the test must change.
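
(For illustration only, a Go sketch of the approach described above: compute the query window from the running Prometheus instance's uptime before building the increase query, so counter state from before an emptyDir wipe is ignored. This is not the actual change in PR 25871; the uptime lookup and the prometheus-k8s job label are assumptions.)

package main

import (
    "fmt"
    "time"
)

// promUptime stands in for a real lookup, e.g. evaluating
// time() - process_start_time_seconds{job="prometheus-k8s"}
// against the query API (the metric and job label are assumptions).
func promUptime() (time.Duration, error) {
    return 20 * time.Minute, nil // stubbed for the example
}

func main() {
    suiteDuration := 56*time.Minute + 11*time.Second

    uptime, err := promUptime()
    if err != nil {
        panic(err)
    }

    // Never look back further than this Prometheus instance has been up, so
    // counter state recorded before an emptyDir wipe/restart is ignored.
    window := suiteDuration
    if uptime < window {
        window = uptime
    }

    // Go's duration formatting (e.g. "20m0s") is also a valid PromQL duration.
    query := fmt.Sprintf(`increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[%s:15s])`, window)
    fmt.Println(query)
}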

Comment 1 Fabian von Feilitzsch 2021-02-10 20:13:50 UTC
Still seeing this test fail even after that PR was included in a CI payload

Comment 3 Sam Batschelet 2021-05-14 16:10:38 UTC
> This is still happening https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-azure-fips-serial/1374816059252543488


This has merged, so let's get this verified. The fact that we still see leader elections in jobs such as the FIPS ones, which require additional resources, is probably not a surprise. The referenced test run reported a single leader election during the e2e.

Comment 6 errata-xmlrpc 2021-05-24 17:14:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1561

