Bug 2093819 - An etcd member for a new machine was never added to the cluster
Summary: An etcd member for a new machine was never added to the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard: non-multi-arch
Duplicates: 2090628 2093827 2101466
Depends On:
Blocks: 2101460 2101466
Reported: 2022-06-06 07:24 UTC by Lukasz Szaszkiewicz
Modified: 2022-08-10 11:16 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:16:18 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 862 | 0 | None | open | Bug 2093819: add timeout to health checks | 2022-06-23 16:26:35 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:16:33 UTC
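
The linked pull request is titled "add timeout to health checks". Its actual change is not reproduced here. As a generic illustration only, the sketch below shows one way to bound a health check so that a call that never returns is reported as unhealthy instead of blocking the caller. The function name check_with_timeout and its parameters are hypothetical, and whether a hanging check was the cause in this bug is an assumption, not something stated in this report.

```python
import threading
from typing import Callable


def check_with_timeout(check: Callable[[], bool], timeout: float) -> bool:
    """Run `check` in a daemon thread and report unhealthy if it fails,
    raises, or does not return within `timeout` seconds."""
    result = {"healthy": False}

    def run() -> None:
        try:
            result["healthy"] = bool(check())
        except Exception:
            # A check that raises is treated as unhealthy.
            result["healthy"] = False

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The check is still running: give up instead of waiting forever.
        return False
    return result["healthy"]
```

Note that the abandoned thread keeps running in the background in this sketch. A real client would more likely cancel the underlying request through a deadline on the call itself.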

Description Lukasz Szaszkiewicz 2022-06-06 07:24:21 UTC
In this run [1] an etcd member was never added for a new machine that was created by the scaling test.

Timeline:

Created a new master machine/node "ci-op-sms4bscd-09318-kkj5p-master-0-clone"
"ci-op-sms4bscd-09318-kkj5p-master-0-clone" machine is in "Provisioned" state
"ci-op-sms4bscd-09318-kkj5p-master-0-clone" machine is in "Running" state
Waiting up to 10m0s for the cluster to reach the expected member count of 4
...
unexpected number of voting etcd members, expected exactly 4, got: 3, current members are:



The expectation is that a new member will eventually be added by the ClusterMemberController, first as a learner, and then be promoted to a voting member.


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.11/1532975478556594176
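
The check described above (learners do not count, and the test waits up to 10m0s for exactly 4 voting members) can be sketched roughly as follows. This is not the code of the scaling test or of the ClusterMemberController. Member, count_voting, wait_for_voting_members and the 5 second polling interval are hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Member:
    name: str
    is_learner: bool  # learners replicate data but do not vote


def count_voting(members: List[Member]) -> int:
    """Count members that are not learners."""
    return sum(1 for m in members if not m.is_learner)


def wait_for_voting_members(
    list_members: Callable[[], List[Member]],
    expected: int,
    timeout: float = 600.0,  # the test log mentions 10m0s
    interval: float = 5.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until exactly `expected` voting members exist or the timeout expires.

    Returns the last observed voting member count.
    """
    deadline = clock() + timeout
    while True:
        got = count_voting(list_members())
        if got == expected or clock() >= deadline:
            return got
        sleep(interval)
```

In the failing run, such a loop would have kept observing 3 voting members until the deadline, because the new member was never added, not even as a learner.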

Comment 3 Lukasz Szaszkiewicz 2022-06-07 07:46:41 UTC
(In reply to Lukasz Szaszkiewicz from comment #2)
> and another one
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64/1533937747629182976

Sorry, wrong link. The correct one is https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-ovn-serial/1533646499815100416

Comment 6 Thomas Jungblut 2022-06-27 15:14:06 UTC
*** Bug 2101460 has been marked as a duplicate of this bug. ***

Comment 7 Thomas Jungblut 2022-06-27 15:21:22 UTC
*** Bug 2101466 has been marked as a duplicate of this bug. ***

Comment 9 Sandeep 2022-06-30 12:33:39 UTC
$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-06-30-005722   True        False         71m     Cluster version is 4.11.0-0.nightly-arm64-2022-06-30-005722


The cluster initially had 3 etcd members.


sh-4.4# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  https://10.0.144.66:2379 | bb508d87e68b6625 |   3.5.3 |   90 MB |     false |      false |         9 |      60469 |              60469 |        |
|  https://10.0.187.63:2379 | 71ae3bc9b2cc9ed8 |   3.5.3 |   90 MB |     false |      false |         9 |      60523 |              60523 |        |
| https://10.0.214.219:2379 | 5ab798fcf577c78e |   3.5.3 |   91 MB |      true |      false |         9 |      60523 |              60523 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+



Created a new machine.


$ oc create -f M0-new.yaml
machine.machine.openshift.io/skundu-veri-11-ktp4k-master-new-0 created

$ oc get machines
NAME                                           PHASE          TYPE          REGION      ZONE         AGE
skundu-veri-11-ktp4k-master-0                  Running        c7g.2xlarge   us-west-2   us-west-2a   73m
skundu-veri-11-ktp4k-master-1                  Running        c7g.2xlarge   us-west-2   us-west-2b   73m
skundu-veri-11-ktp4k-master-2                  Running        c7g.2xlarge   us-west-2   us-west-2c   73m
skundu-veri-11-ktp4k-master-new-0              Provisioning   c7g.2xlarge   us-west-2   us-west-2a   18s
skundu-veri-11-ktp4k-worker-us-west-2a-hvdnx   Running        c7g.xlarge    us-west-2   us-west-2a   67m
skundu-veri-11-ktp4k-worker-us-west-2b-xggqq   Running        c7g.xlarge    us-west-2   us-west-2b   67m
skundu-veri-11-ktp4k-worker-us-west-2c-vb42k   Running        c7g.xlarge    us-west-2   us-west-2c   67m


The new machine reaches the Running phase.

$ oc get machines
NAME                                           PHASE     TYPE          REGION      ZONE         AGE
skundu-veri-11-ktp4k-master-0                  Running   c7g.2xlarge   us-west-2   us-west-2a   81m
skundu-veri-11-ktp4k-master-1                  Running   c7g.2xlarge   us-west-2   us-west-2b   81m
skundu-veri-11-ktp4k-master-2                  Running   c7g.2xlarge   us-west-2   us-west-2c   81m
skundu-veri-11-ktp4k-master-new-0              Running   c7g.2xlarge   us-west-2   us-west-2a   9m3s
skundu-veri-11-ktp4k-worker-us-west-2a-hvdnx   Running   c7g.xlarge    us-west-2   us-west-2a   76m
skundu-veri-11-ktp4k-worker-us-west-2b-xggqq   Running   c7g.xlarge    us-west-2   us-west-2b   76m
skundu-veri-11-ktp4k-worker-us-west-2c-vb42k   Running   c7g.xlarge    us-west-2   us-west-2c   76m


$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-138-15.us-west-2.compute.internal    Ready    worker   71m     v1.24.0+9ddc8b1
ip-10-0-144-66.us-west-2.compute.internal    Ready    master   81m     v1.24.0+9ddc8b1
ip-10-0-153-143.us-west-2.compute.internal   Ready    master   2m56s   v1.24.0+9ddc8b1
ip-10-0-183-56.us-west-2.compute.internal    Ready    worker   72m     v1.24.0+9ddc8b1
ip-10-0-187-63.us-west-2.compute.internal    Ready    master   81m     v1.24.0+9ddc8b1
ip-10-0-206-24.us-west-2.compute.internal    Ready    worker   67m     v1.24.0+9ddc8b1
ip-10-0-214-219.us-west-2.compute.internal   Ready    master   81m     v1.24.0+9ddc8b1


All four member endpoints are available.

sh-4.4# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  https://10.0.144.66:2379 | bb508d87e68b6625 |   3.5.3 |  107 MB |     false |      false |        12 |      64988 |              64988 |        |
| https://10.0.153.143:2379 | 5d4549cd383fea08 |   3.5.3 |  104 MB |     false |      false |        12 |      64988 |              64988 |        |
|  https://10.0.187.63:2379 | 71ae3bc9b2cc9ed8 |   3.5.3 |  106 MB |      true |      false |        12 |      64988 |              64988 |        |
| https://10.0.214.219:2379 | 5ab798fcf577c78e |   3.5.3 |  107 MB |     false |      false |        12 |      64988 |              64988 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
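
For repeated verification, the table above can be checked mechanically. The following is a minimal sketch that parses the "-w table" output shown in this comment and counts members, voting members and leaders. The function names parse_endpoint_status and summarize are hypothetical. Parsing the JSON output ("-w json") would likely be more robust than parsing the table, but the table is what was captured here.

```python
from typing import Dict, List


def parse_endpoint_status(table: str) -> List[Dict[str, str]]:
    """Turn `etcdctl endpoint status -w table` output into one dict per row."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ border lines, prompts and blank lines
        cells = [c.strip() for c in line[1:-1].split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def summarize(rows: List[Dict[str, str]]) -> Dict[str, object]:
    """Count members and voting members, and list leader endpoints."""
    voting = [r for r in rows if r["IS LEARNER"] == "false"]
    leaders = [r["ENDPOINT"] for r in rows if r["IS LEADER"] == "true"]
    return {"members": len(rows), "voting": len(voting), "leaders": leaders}
```

Applied to the second table in this comment, summarize would report 4 members, 4 of them voting, and https://10.0.187.63:2379 as the only leader.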

Comment 12 Thomas Jungblut 2022-07-20 11:22:07 UTC
*** Bug 2093827 has been marked as a duplicate of this bug. ***

Comment 13 Thomas Jungblut 2022-07-20 11:22:31 UTC
*** Bug 2090628 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2022-08-10 11:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

