Bug 1924953

Summary:

newly added 'excessive etcd leader changes' test case failing in serial job

Product:

OpenShift Container Platform

Reporter:

jamo luhrsen <jluhrsen>

Component:

Etcd

Assignee:

Clayton Coleman <ccoleman>

Status:

CLOSED ERRATA

QA Contact:

ge liu <geliu>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.7

CC:

ccoleman, wking

Target Milestone:

---

Target Release:

4.8.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

1926556 (view as bug list)

Environment:

Last Closed:

2021-07-27 22:40:58 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1926556, 1926577

Attachments:

Description	Flags
passing prometheus graph	none
failing prometheus graph	none

Description jamo luhrsen 2021-02-04 01:14:21 UTC

Created attachment 1754946 [details]
passing prometheus graph

Created attachment 1754946 [details]
passing prometheus graph

Description of problem:

recently this test case ([sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]) has
started to run in our serial jobs and we see that it has failed twice in the first 5 jobs.

The test case is checking prometheus metrics to see if any etcd leader change has happened since
the test suite has started.

https://github.com/openshift/origin/blob/d2976875ee17fb8561201cbd0bfef3165f567e4b/test/extended/etcd/leader_changes.go#L27

From what I can tell the test seems to be reporting a failure when it should not, but I'm not the expert
here.


two examples:

1)
Failing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568

looking at the two etcd operator logs, you can see that leadership change happened twice before the install
finished at 01:56:58:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T01:56:58Z" level=debug msg="Cluster is initialized"

the 3 leadership elections (1st and then two changes) are easy to see timestamps in events.json:
            "lastTimestamp": "2021-02-03T01:36:57Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_f8952b5c-084d-4980-ac26-ec94cdaa534f became leader",
            "lastTimestamp": "2021-02-03T01:38:18Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_12f89afd-51f3-4122-8234-fd776523d01c became leader",
            "lastTimestamp": "2021-02-03T01:45:02Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_6be59167-e1ff-4888-a0f2-f1cfd397ee0e became leader",

all of these leadership events happened before the cluster was finally initialized and before the test
case window which is 56m11s from when it's checking which is at 02:53:22.

test case failure output:

fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:31]: Unexpected error:
    <*errors.errorString | 0xc002265630>: {
        s: "Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
    }
    Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics
occurred



2)
Passing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384

again, we can see in the two etcd operator logs to see two leadership changes. Cluster install finished at 17:43:40
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T17:43:30Z" level=debug msg="Cluster is initialized"


            "lastTimestamp": "2021-02-03T17:22:04Z",
            "message": "etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader",
            "lastTimestamp": "2021-02-03T17:23:11Z",
            "message": "etcd-operator-676577769f-955pf_0a2a9c31-392f-41c3-b1ff-146a06c2357a became leader",
            "lastTimestamp": "2021-02-03T17:29:04Z",
            "message": "etcd-operator-676577769f-955pf_e5b60caa-3746-49d8-83ea-70ff99db6a8f became leader",
            "lastTimestamp": "2021-02-03T17:32:56Z",
            "message": "etcd-operator-676577769f-955pf_1caa5195-b37b-4d01-ad58-b288a3ce63d7 became leader"

I'm not following why there are 4 leader events in events.json, but you can see that all leader changes
are finished before the cluster is initialized which is before the window for when the test case will
be checking.


I tried pulling the promethues.tar files (links below) and locally trying similar queries that I see being
done in the test code. I don't know if I got it right. I used this query for both:
  increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])
  1h range and time set to 2021-02-03 02:53:22 for the failing case and 2021-02-03 18:37:39 for the failing case.

I'm not quite sure what it means, but the passing case graph did bump up to 3 and fall back
down to 0 eventually, whereas the failing case stayed at 3. I may have the ranges and time
incorrect but tried to match what the code would have executed. (screen shots attached)


failing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/metrics/prometheus.tar
passing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/metrics/prometheus.tar

Comment 1 jamo luhrsen 2021-02-04 01:14:45 UTC

Created attachment 1754947 [details]
failing prometheus graph

Comment 2 Sam Batschelet 2021-02-04 01:27:39 UTC

First of all great work on writing a very detailed bug on a complicated topic.

re: https://bugzilla.redhat.com/show_bug.cgi?id=1924953#c1 the fact that the leader changes are observable while Prometheus is up is a real concern.

> etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader

The issue your finding here is that these are different leader elections. The metrics are tracking the operand (etcd) leader elections and those logs are in reference to operator leader elections.

Azure Serial failing leader election test might be somewhat expected but we need to better understand this. The team will take a look and report.

Comment 3 Clayton Coleman 2021-02-09 02:58:40 UTC

The machine set test can wipe prometheus emptyDirs (the default config), which causes the query to return the pre-start count. Instead, change the query to only check over the time duration of the running prometheus instance (ignore changes prior to start). I will be turning on PVs for most prometheus IPI clusters, but it's still valid on some IPI platforms to have no dynamic volume provisioning and so the test must change.

Comment 5 ge liu 2021-02-10 12:15:20 UTC

Watched the last e2e job, there is not this err, so close it.

Comment 9 errata-xmlrpc 2021-07-27 22:40:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438