Bug 1974074 - "Replacing the unhealthy etcd member" on a master running metal3 and cluster-baremetal-operator failed to delete machine
Summary: "Replacing the unhealthy etcd member" on a master running metal3 and cluster-...
Keywords:
Status: CLOSED DUPLICATE of bug 1973018
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Angus Salkeld
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1973018
Blocks:
 
Reported: 2021-06-20 11:33 UTC by Lubov
Modified: 2021-06-28 13:54 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-28 13:54:07 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
must-gather errors (52.59 KB, text/plain)
2021-06-20 11:51 UTC, Lubov

Description Lubov 2021-06-20 11:33:37 UTC
Version:
./openshift-baremetal-install 4.8.0-0.nightly-2021-06-19-005119
built from commit a5ddd2dd6c72d8a5ea0a5f17acd8b964b6a3d1be
release image registry.ci.openshift.org/ocp/release@sha256:e61b7db574443b47d33f9ca429c7aea726958eebe0de8a2e876004948e0e4d93

Platform:
IPI
Reproduced the same problem on both real bare metal and a virtual simulation.

What happened?
Ran the procedure https://docs.openshift.com/container-platform/4.7/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member on the master that hosts the metal3 and cluster-baremetal-operator pods.
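
For reference, the relevant steps of that procedure are removing the failed member from the etcd cluster and then deleting the backing machine object; on a bare-metal IPI cluster the machine deletion is expected to drain the node and deprovision the BareMetalHost. A minimal sketch of those steps, using the names from the outputs below (the pod, member and machine names are illustrative, and the procedure also removes the old member's secrets in openshift-etcd):

$ oc rsh -n openshift-etcd etcd-openshift-master-0        # any healthy etcd pod
  etcdctl member list -w table
  etcdctl member remove <ID of the openshift-master-2 member>
  exit
$ oc delete machine -n openshift-machine-api ocp-edge3-qjkpv-master-2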


1. Deletion of the machine object is stuck (see the inspection sketch after this list):
$ oc get machine -A; oc get nodes
NAMESPACE               NAME                             PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge3-qjkpv-master-0         Running                           134m
openshift-machine-api   ocp-edge3-qjkpv-master-1         Running                           134m
openshift-machine-api   ocp-edge3-qjkpv-master-2         Deleting                          134m
openshift-machine-api   ocp-edge3-qjkpv-worker-0-wk5m2   Running                           107m
openshift-machine-api   ocp-edge3-qjkpv-worker-0-x2hnx   Running                           107m
NAME                 STATUS                        ROLES    AGE    VERSION
openshift-master-0   Ready                         master   111m   v1.21.0-rc.0+120883f
openshift-master-1   Ready                         master   111m   v1.21.0-rc.0+120883f
openshift-master-2   NotReady,SchedulingDisabled   master   111m   v1.21.0-rc.0+120883f
openshift-worker-0   Ready                         worker   75m    v1.21.0-rc.0+120883f
openshift-worker-1   Ready                         worker   73m    v1.21.0-rc.0+120883f

2. An error is reported on the bmh object:
  Normal  ProvisionedRegistrationError  21m    metal3-baremetal-controller  Host adoption failed: Error while attempting to adopt node 44880afd-fcec-454c-a13a-e5778233f0ca: Cannot validate image information for node 44880afd-fcec-454c-a13a-e5778233f0ca because one or more parameters are missing from its instance_info and insufficent information is present to boot from a remote volume. Missing are: ['image_source', 'kernel', 'ramdisk'].

3. The cbo and metal3 pods on the master being deleted are stuck in Terminating state, and replacement pods have been created on another master:
NAME                                          READY   STATUS        RESTARTS   AGE    IP            NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7589bfb57-p4tll   2/2     Running       0          130m   10.128.0.6    openshift-master-1   <none>           <none>
cluster-baremetal-operator-58976889b8-97p2c   2/2     Terminating   2          130m   10.130.0.6    openshift-master-2   <none>           <none>
cluster-baremetal-operator-58976889b8-hwcjp   2/2     Running       0          21m    10.128.0.60   openshift-master-1   <none>           <none>
machine-api-controllers-d7b79b59-8m4ll        7/7     Running       0          104m   10.128.0.20   openshift-master-1   <none>           <none>
machine-api-operator-55c89fbbfb-n6j4k         2/2     Running       0          130m   10.129.0.6    openshift-master-0   <none>           <none>
metal3-7c98d5b44c-dhgrb                       10/10   Running       0          21m    10.46.59.21   openshift-master-1   <none>           <none>
metal3-7c98d5b44c-k6nbx                       10/10   Terminating   0          100m   10.46.59.22   openshift-master-2   <none>           <none>
metal3-image-cache-clpbv                      1/1     Running       0          100m   10.46.59.20   openshift-master-0   <none>           <none>
metal3-image-cache-cqwj8                      1/1     Running       0          100m   10.46.59.21   openshift-master-1   <none>           <none>
metal3-image-cache-hq9rv                      1/1     Running       0          100m   10.46.59.22   openshift-master-2   <none>           <none>
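
To see why the deletion hangs, it can help to check whether the machine still carries its finalizer (it is only removed once the drain and deprovisioning finish) and what the BareMetalHost reports. A rough sketch, assuming the object names shown above (the bmh name is illustrative):

$ oc get machine -n openshift-machine-api ocp-edge3-qjkpv-master-2 -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc describe machine -n openshift-machine-api ocp-edge3-qjkpv-master-2
$ oc get bmh -n openshift-machine-api -o wide
$ oc describe bmh -n openshift-machine-api openshift-master-2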

What did you expect to happen?
The node is drained successfully and the machine is deleted.

How to reproduce it (as minimally and precisely as possible)?
1. Check which master runs the cbo and metal3 pods
2. Set the online status to false on the bmh backing that master to simulate the master becoming NotReady (sketched below)
3. Run all steps from the procedure
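
For step 2, a minimal sketch of how this can be done (the bmh name is illustrative; it is the host backing the master found in step 1, and powering it off makes the node go NotReady):

$ oc get pods -n openshift-machine-api -o wide | grep -E 'cluster-baremetal-operator|metal3-'
$ oc patch bmh -n openshift-machine-api openshift-master-2 --type merge -p '{"spec":{"online":false}}'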

Anything else we need to know?
Adding the must-gather output as an attachment.

Comment 2 Lubov 2021-06-20 11:51:47 UTC
Created attachment 1792512 [details]
must-gather errors

Comment 4 Zane Bitter 2021-06-23 21:02:33 UTC
This looks like bug 1972572. We should retest with that fix once it's backported.

Comment 5 Zane Bitter 2021-06-24 19:13:35 UTC
Actually, I missed that it has been backported already as bug 1973018, and the 4.8 fix has now been verified, so this could be tested again now.

Comment 6 Lubov 2021-06-28 06:25:30 UTC
Verified on 4.8.0-0.nightly-2021-06-25-182927 - passed

Comment 7 Zane Bitter 2021-06-28 13:54:07 UTC

*** This bug has been marked as a duplicate of bug 1973018 ***

