1926556 – 'excessive etcd leader changes' test case failing in serial job because prometheus data is wiped by machine set test

Bug 1926556 - 'excessive etcd leader changes' test case failing in serial job because prometheus data is wiped by machine set test

Summary: 'excessive etcd leader changes' test case failing in serial job because prome...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Clayton Coleman
QA Contact:	RamaKasturi
Docs Contact:
URL:
Whiteboard:
Depends On:	1924953
Blocks:
TreeView+	depends on / blocked

Reported:	2021-02-09 02:59 UTC by Clayton Coleman
Modified:	2021-02-24 15:58 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1924953
Environment:
Last Closed:	2021-02-24 15:58:19 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25871	0	None	closed	Bug 1926556: Allow prometheus state clear to pass etcd leader change test	2021-02-20 09:19:08 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:58:38 UTC

Description Clayton Coleman 2021-02-09 02:59:58 UTC

+++ This bug was initially created as a clone of Bug #1924953 +++

Created attachment 1754946 [details]
passing prometheus graph

Description of problem:

recently this test case ([sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]) has
started to run in our serial jobs and we see that it has failed twice in the first 5 jobs.

The test case is checking prometheus metrics to see if any etcd leader change has happened since
the test suite has started.

https://github.com/openshift/origin/blob/d2976875ee17fb8561201cbd0bfef3165f567e4b/test/extended/etcd/leader_changes.go#L27

From what I can tell the test seems to be reporting a failure when it should not, but I'm not the expert
here.


two examples:

1)
Failing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568

looking at the two etcd operator logs, you can see that leadership change happened twice before the install
finished at 01:56:58:
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T01:56:58Z" level=debug msg="Cluster is initialized"

the 3 leadership elections (1st and then two changes) are easy to see timestamps in events.json:
            "lastTimestamp": "2021-02-03T01:36:57Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_f8952b5c-084d-4980-ac26-ec94cdaa534f became leader",
            "lastTimestamp": "2021-02-03T01:38:18Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_12f89afd-51f3-4122-8234-fd776523d01c became leader",
            "lastTimestamp": "2021-02-03T01:45:02Z",
            "message": "etcd-operator-9b49d6d9f-f4rfv_6be59167-e1ff-4888-a0f2-f1cfd397ee0e became leader",

all of these leadership events happened before the cluster was finally initialized and before the test
case window which is 56m11s from when it's checking which is at 02:53:22.

test case failure output:

fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:31]: Unexpected error:
    <*errors.errorString | 0xc002265630>: {
        s: "Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
    }
    Observed 3.0066839285714284 leader changes in 56m11s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics
occurred



2)
Passing job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384

again, we can see in the two etcd operator logs to see two leadership changes. Cluster install finished at 17:43:40
curl -s --no-progress-meter https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/installer/.openshift_install.log | rg 'Cluster is initialized'
time="2021-02-03T17:43:30Z" level=debug msg="Cluster is initialized"


            "lastTimestamp": "2021-02-03T17:22:04Z",
            "message": "etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader",
            "lastTimestamp": "2021-02-03T17:23:11Z",
            "message": "etcd-operator-676577769f-955pf_0a2a9c31-392f-41c3-b1ff-146a06c2357a became leader",
            "lastTimestamp": "2021-02-03T17:29:04Z",
            "message": "etcd-operator-676577769f-955pf_e5b60caa-3746-49d8-83ea-70ff99db6a8f became leader",
            "lastTimestamp": "2021-02-03T17:32:56Z",
            "message": "etcd-operator-676577769f-955pf_1caa5195-b37b-4d01-ad58-b288a3ce63d7 became leader"

I'm not following why there are 4 leader events in events.json, but you can see that all leader changes
are finished before the cluster is initialized which is before the window for when the test case will
be checking.


I tried pulling the promethues.tar files (links below) and locally trying similar queries that I see being
done in the test code. I don't know if I got it right. I used this query for both:
  increase((max(max by (job) (etcd_server_leader_changes_seen_total)) or 0*absent(etcd_server_leader_changes_seen_total))[56m11s:15s])
  1h range and time set to 2021-02-03 02:53:22 for the failing case and 2021-02-03 18:37:39 for the failing case.

I'm not quite sure what it means, but the passing case graph did bump up to 3 and fall back
down to 0 eventually, whereas the failing case stayed at 3. I may have the ranges and time
incorrect but tried to match what the code would have executed. (screen shots attached)


failing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1356772683622125568/artifacts/e2e-azure-serial/metrics/prometheus.tar
passing) https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.7/1357010120671760384/artifacts/e2e-azure-serial/metrics/prometheus.tar

--- Additional comment from jamo luhrsen on 2021-02-03 20:14:45 EST ---



--- Additional comment from Sam Batschelet on 2021-02-03 20:27:39 EST ---

First of all great work on writing a very detailed bug on a complicated topic.

re: https://bugzilla.redhat.com/show_bug.cgi?id=1924953#c1 the fact that the leader changes are observable while Prometheus is up is a real concern.

> etcd-operator-676577769f-955pf_8d8770c3-d517-48e1-98cc-3ae9774f4f30 became leader

The issue your finding here is that these are different leader elections. The metrics are tracking the operand (etcd) leader elections and those logs are in reference to operator leader elections.

Azure Serial failing leader election test might be somewhat expected but we need to better understand this. The team will take a look and report.

--- Additional comment from Clayton Coleman on 2021-02-08 21:58:40 EST ---

The machine set test can wipe prometheus emptyDirs (the default config), which causes the query to return the pre-start count. Instead, change the query to only check over the time duration of the running prometheus instance (ignore changes prior to start). I will be turning on PVs for most prometheus IPI clusters, but it's still valid on some IPI platforms to have no dynamic volume provisioning and so the test must change.

Comment 1 Clayton Coleman 2021-02-09 03:00:52 UTC

The test fails 1/3 times because that's when we get unlucky and nuke the prometheus (which appears to cause a rollout in the monitoring operator, probably some overzealous "self healing" for stateful sets).

Comment 3 RamaKasturi 2021-02-10 13:50:52 UTC

Moving the bug to verified state as the issue being reported here has not occurred  in last 45 hours from the link below

[1] https://search.ci.openshift.org/?search=etcd+leader+changes+are+not+excessive&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 6 errata-xmlrpc 2021-02-24 15:58:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Note You need to log in before you can comment on or make changes to this bug.