Bug 1817028 - cluster-etcd-operator: [DR] scaling fails if majority of nodes are new
Summary: cluster-etcd-operator: [DR] scaling fails if majority of nodes are new
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1817071
 
Reported: 2020-03-25 12:40 UTC by Sam Batschelet
Modified: 2020-07-13 17:24 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
This was a 4.4 blocker and was fixed in 4.4; no new doc update is needed for 4.5.
Clone Of:
Clones: 1817071
Environment:
Last Closed: 2020-07-13 17:23:43 UTC
Target Upstream Version:
Embargoed:


Attachments
etcd-pod.yaml showing mismatch (15.70 KB, text/plain)
2020-03-25 12:43 UTC, Sam Batschelet


Links
Github openshift cluster-etcd-operator pull 284 (closed): Bug 1817028: *: add an init container to stop the pod with bad revision (last updated 2020-06-29 11:20:43 UTC)
Github openshift library-go pull 760 (closed): Bug 1817028: add WithCustomInstaller method to Builder interface (last updated 2020-06-29 11:20:43 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:23:59 UTC)

Description Sam Batschelet 2020-03-25 12:40:39 UTC
Description of problem: One scenario we cover for disaster recovery is losing 2 of the 3 master nodes. For example, on AWS, if two of the master nodes are terminated, we should be able to restore a single-master control plane with DR, then create 2 new master nodes and scale etcd onto them.

The problem is that the clustermembercontroller scales etcd, but the static pod that is laid down on disk does not have the correct environment variables. The result is quorum loss: etcd has been scaled, but the second etcd member cannot start.

In this example we see the static pod referencing the env var

`NODE_ip_10_0_146_254_us_west_1_compute_internal_IP`

```
          exec etcd \
            --initial-advertise-peer-urls=https://${NODE_ip_10_0_146_254_us_west_1_compute_internal_IP}:2380 \
```

But the envvar controller does not provide the matching record:

        - name: "NODE_ip_10_0_134_141_us_west_1_compute_internal_ETCD_NAME"
          value: "ip-10-0-134-141.us-west-1.compute.internal"
        - name: "NODE_ip_10_0_134_141_us_west_1_compute_internal_ETCD_URL_HOST"
          value: "10.0.134.141"
        - name: "NODE_ip_10_0_134_141_us_west_1_compute_internal_IP"
          value: "10.0.134.141"
        - name: "NODE_ip_10_0_142_68_us_west_1_compute_internal_ETCD_NAME"
          value: "ip-10-0-142-68.us-west-1.compute.internal"
        - name: "NODE_ip_10_0_142_68_us_west_1_compute_internal_ETCD_URL_HOST"
          value: "10.0.142.68"
        - name: "NODE_ip_10_0_142_68_us_west_1_compute_internal_IP"
          value: "10.0.142.68"
        - name: "NODE_ip_10_0_144_228_us_west_1_compute_internal_ETCD_NAME"
          value: "ip-10-0-144-228.us-west-1.compute.internal"
        - name: "NODE_ip_10_0_144_228_us_west_1_compute_internal_ETCD_URL_HOST"
          value: "10.0.144.228"
        - name: "NODE_ip_10_0_144_228_us_west_1_compute_internal_IP"
          value: "10.0.144.228"



Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
1. create a new AWS cluster
2. terminate 2 of the master nodes
3. restore a single-master control plane with DR
4. create 2 new master machines
5. when the nodes are ready, the operator scales etcd onto the new nodes
6. quorum loss

Actual results: quorum loss


Expected results: the static pod on disk reflects the current cluster state.
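
One way to confirm the mismatch on the surviving master is to compare the env var the rendered manifest references against the env vars it defines (a sketch; /etc/kubernetes/manifests/etcd-pod.yaml is the usual location of the etcd static pod manifest on an OpenShift 4.x master, reachable e.g. via `oc debug node/<node>` and `chroot /host`):

```
grep -n 'NODE_.*_IP' /etc/kubernetes/manifests/etcd-pod.yaml
# The exec line references NODE_ip_10_0_146_254_..._IP while the env: section
# only defines keys for the old node names, so the variable is unset at startup.
```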


Additional info:

Comment 1 Sam Batschelet 2020-03-25 12:43:42 UTC
Created attachment 1673404 [details]
etcd-pod.yaml showing mismatch

Comment 4 Sam Batschelet 2020-03-25 15:24:02 UTC
Last etcd status before quorum loss; this could explain things. I think what happens is that the new nodes trigger revision 4 and then 5, but we roll out revision 3 beforehand, which is invalid.
 
     {
        "lastTransitionTime": "2020-03-25T12:50:07Z",
        "message": "NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3; 0 nodes have achieved new revision 5",
        "reason": "NodeInstaller",
        "status": "True",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2020-03-25T12:00:34Z",
        "message": "StaticPodsAvailable: 1 nodes are active; 2 nodes are at revision 0; 1 nodes are at revision 3; 0 nodes have achieved new revision 5\nEtcdMembersAvailable: ip-10-0-131-16.us-west-1.compute.internal members are available,  have not started,  are unhealthy,  are unknown",
        "reason": "AsExpected",
        "status": "True",
        "type": "Available"
      },
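
For anyone reproducing this, the conditions and per-node revisions above can be pulled from the operator CR, for example (a sketch using standard oc/jsonpath, not tied to this fix):

```
# Progressing / Available conditions on the etcd operator resource
oc get etcd cluster -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}{"\n"}'
oc get etcd cluster -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'
# Per-node current/target revisions (shows the "2 nodes at revision 0; 1 at revision 3" state)
oc get etcd cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{"\t"}{.targetRevision}{"\n"}{end}'
```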

Comment 9 ge liu 2020-04-08 14:58:20 UTC
Verified with 4.5.0-0.nightly-2020-04-07-211130

The generic steps that I follow are:
1. take the backup on master-1
2. go to AWS and terminate master-2 and master-3
3. restore on master-1
4. when the kube API is available, delete machine-2 and machine-3 in the openshift-machine-api namespace and recreate them
5. the operator will automatically scale etcd when the recreated machine instances join the cluster (a rough command sketch follows below)
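
Roughly, the same flow in commands (a sketch based on the 4.4/4.5 DR docs; backup paths and machine names are illustrative):

```
# 1. backup on the surviving master (run on the node itself)
sudo /usr/local/bin/cluster-backup.sh /home/core/backup
# 2. terminate the other two masters from the cloud console/CLI
# 3. restore on the surviving master
sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
# 4. once the kube API answers, delete and recreate the lost machines
oc -n openshift-machine-api get machines
oc -n openshift-machine-api delete machine <master-2-machine> <master-3-machine>
# (re-apply the saved machine manifests; the operator scales etcd automatically
#  once the new instances join as nodes)
```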

Comment 11 errata-xmlrpc 2020-07-13 17:23:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

