Bug 1817028
Summary: | cluster-etcd-operator: [DR] scaling fails if majority of nodes are new
---|---
Product: | OpenShift Container Platform
Reporter: | Sam Batschelet <sbatsche>
Component: | Etcd
Assignee: | Sam Batschelet <sbatsche>
Status: | CLOSED ERRATA
QA Contact: | ge liu <geliu>
Severity: | high
Priority: | high
Version: | 4.4
CC: | alpatel, mfojtik, skolicha, wsun
Target Milestone: | ---
Keywords: | TestBlocker
Target Release: | 4.5.0
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | No Doc Update
Doc Text: | This was a 4.4 blocker, and fixed in 4.4. No new doc update for 4.5.
Story Points: | ---
Clones: | 1817071 (view as bug list)
Last Closed: | 2020-07-13 17:23:43 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 1817071
Description
Sam Batschelet
2020-03-25 12:40:39 UTC
Created attachment 1673404 [details]
etcd-pod.yaml showing the mismatch
Last etcd status before quorum loss; this could explain things. I think what happens is that the new nodes trigger revision 4 and then 5, but we roll out revision 3 beforehand, which is invalid:

    {
        "lastTransitionTime": "2020-03-25T12:50:07Z",
        "message": "NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3; 0 nodes have achieved new revision 5",
        "reason": "NodeInstaller",
        "status": "True",
        "type": "Progressing"
    },
    {
        "lastTransitionTime": "2020-03-25T12:00:34Z",
        "message": "StaticPodsAvailable: 1 nodes are active; 2 nodes are at revision 0; 1 nodes are at revision 3; 0 nodes have achieved new revision 5\nEtcdMembersAvailable: ip-10-0-131-16.us-west-1.compute.internal members are available, have not started, are unhealthy, are unknown",
        "reason": "AsExpected",
        "status": "True",
        "type": "Available"
    },

Verified with 4.5.0-0.nightly-2020-04-07-211130. The generic steps that I follow are:

1. Take the backup on master-1.
2. Go to AWS and terminate master-2 and master-3.
3. Restore on master-1.
4. When the kube API is available, delete machine-2 and machine-3 in the openshift-machine-api namespace and recreate them (see the sketch at the end of this report).
5. The operator will automatically scale up etcd when the recreated machine instances join the cluster.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
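The Progressing and Available conditions quoted in the description appear to be the aggregated conditions on the etcd ClusterOperator, while the per-node revision counters live on the etcd operator resource. A minimal sketch of how to pull them, assuming a working kubeconfig and that jq is installed (generic oc/jq usage, not commands taken from this bug):

    # Aggregate conditions (Progressing, Available, Degraded) reported for etcd
    oc get clusteroperator etcd -o json | jq '.status.conditions'

    # Per-node static-pod revisions (currentRevision/targetRevision) on the operator CR named "cluster"
    oc get etcd cluster -o json | jq '.status.nodeStatuses'

Comparing currentRevision against targetRevision for each node is the quickest way to spot the "2 nodes are at revision 0; 1 nodes are at revision 3" mismatch described above.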
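For steps 1, 3, and 4 of the verification flow above, a rough sketch of the commands involved. The file paths, machine names, and <clusterid> are placeholders, and the backup/restore helpers are the scripts documented for disaster recovery around this release; treat all of it as an assumption rather than the exact commands used during verification:

    # Step 1, on master-1: take an etcd backup with the documented helper script
    sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup

    # Step 3, on master-1: restore from that backup once the other masters are terminated
    sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup

    # Step 4: after the kube API responds, save, delete, and recreate the Machine
    # objects for the lost masters so fresh instances are provisioned
    oc -n openshift-machine-api get machine <clusterid>-master-2 -o yaml > master-2.yaml
    # edit master-2.yaml to drop status and provider-assigned fields (e.g. providerID)
    oc -n openshift-machine-api delete machine <clusterid>-master-2
    oc -n openshift-machine-api create -f master-2.yaml
    # repeat for <clusterid>-master-3

Once the recreated machines join the cluster, the cluster-etcd-operator scales the etcd members back up on its own, which is exactly the path this bug exercises.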