This problem can also affect 4.6.

+++ This bug was initially created as a clone of Bug #1911664 +++

Version:
Client Version: 4.7.0-0.nightly-2020-12-21-131655
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-21-131655   True        False         130m    Cluster version is 4.7.0-0.nightly-2020-12-21-131655

-------------------------------------------
Platform: libvirt
-------------------------------------------
What happened?

Setup: Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected environment

After deleting the metal3 pod, scaling of the worker did not succeed and the new host stayed stuck in the provisioning state.

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/d81fff7c-7603-4bd6-85b7-86779da02b96                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/5ab0f8da-ec95-4624-8a9f-79404f0fd1e3                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/b2a26b0b-5848-4f84-8e18-ead1df064bec                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-b82kh   redfish://192.168.123.1:8000/redfish/v1/Systems/0160c42c-9809-4ad3-bdd6-e5f55129b57a   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   redfish://192.168.123.1:8000/redfish/v1/Systems/30f07f16-3986-447a-8fbc-74dc0ff6b444   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   redfish://192.168.123.1:8000/redfish/v1/Systems/98f4d225-69c0-48f9-9123-24b0d12d5a41   unknown            true
openshift-machine-api   openshift-worker-0-3   OK       provisioning             ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0   unknown            true

[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-0         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-1         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-2         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   Running                               72m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-b82kh   Running                               158m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   Provisioning                          47m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   Running                               158m

-------------------------------------------
How to reproduce it (as minimally and precisely as possible)?

$ oc create -f new-node3.yaml -n openshift-machine-api

new-node3.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-3-bmc-secret
type: Opaque
data:
  username: YWRtaW4K
  password: cGFzc3dvcmQK
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-3
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0
    credentialsName: openshift-worker-0-3-bmc-secret
    disableCertificateVerification: True
    username: admin
    password: password
  bootMACAddress: 52:54:00:b3:e8:96
  hardwareProfile: unknown

$ oc delete pod [metal3 pod name] -n openshift-machine-api
(To find the metal3 pod name: oc get pods -A | grep metal3)
(Watch the "PROVISIONING STATUS" of the newly added bmh switch to "inspecting" and, once finished, to "ready")

$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=N+1
(where N is the current number of machines in the machineset)

$ oc delete pod [metal3 pod name] -n openshift-machine-api
(Watch the "PROVISIONING STATUS" of the newly added bmh switch to "provisioning" and, once finished, to "provisioned")

-----------------------------------------------
Actual Result:

After deleting the metal3 pod, the bmh is stuck in the provisioning state.

-----------------------------------------------
What did you expect to happen?

Successful scaling of the worker
(Watch the "PROVISIONING STATUS" of the newly added bmh switch to "provisioning" and, once finished, to "provisioned")

-----------------------------------------------
must gather - https://drive.google.com/drive/folders/1dShieHsQ1o0TQMYkn_NWAPvGZaUdUJ8t?usp=sharing

--- Additional comment from Derek Higgins on 2021-01-07 15:14:18 GMT ---

This turned out to be a flavour of a bug we hit a few weeks ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1901040

At the time we found a bug that causes the provisioning device to lose its link-local address. We have opened a bug for this to be investigated in coreos (possibly rhel/networkmanager):
https://bugzilla.redhat.com/show_bug.cgi?id=1908302

As a workaround we added a line to set addr_gen_mode, which prevents the link-local address from being lost:

echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

This worked in most cases, except where the "metal3-static-ip-manager" container moved from one master to another and back again (as it did above when the metal3 pod was deleted twice). In that case the workaround fails because addr_gen_mode was already "0", and setting it to "0" again triggers nothing.

I'm updating the workaround to toggle addr_gen_mode to 1 and then back to 0, which deals with this case.

--- Additional comment from OpenShift Automated Release Tooling on 2021-01-14 18:06:38 GMT ---

Elliott changed bug status from MODIFIED to ON_QA.
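
For reference, the toggle described in the 2021-01-07 comment above could look roughly like the sketch below. This is not the actual metal3-static-ip-manager script: the function name toggle_addr_gen_mode and the SYSCTL_ROOT override are hypothetical, added only so the logic can be tried against a scratch directory. Only the addr_gen_mode path and the PROVISIONING_INTERFACE variable come from the comment itself.

```shell
#!/bin/sh
# Sketch only: hypothetical helper, not the shipped workaround.
# SYSCTL_ROOT is an assumed override for dry runs; it defaults to /proc/sys.
toggle_addr_gen_mode() {
    iface="$1"
    path="${SYSCTL_ROOT:-/proc/sys}/net/ipv6/conf/${iface}/addr_gen_mode"
    # Writing 0 when the value is already 0 triggers nothing,
    # so write 1 first to force a real change, then go back to 0.
    echo 1 > "$path"
    echo 0 > "$path"
}

# Run against the real interface only when PROVISIONING_INTERFACE is set
# (writing under /proc/sys needs root).
if [ -n "${PROVISIONING_INTERFACE:-}" ]; then
    toggle_addr_gen_mode "$PROVISIONING_INTERFACE"
fi
```

The value always ends at 0, whether it started at 0 or 1, which matches the intent stated in the comment: the write of 1 exists only to make the following write of 0 an actual change.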

--- Additional comment from Polina Rabinovich on 2021-01-18 09:50:13 GMT ---

[kni@provisionhost-0-0 ~]$ oc create -f new-node2.yaml -n openshift-machine-api
secret/openshift-worker-0-2-bmc-secret created
baremetalhost.metal3.io/openshift-worker-0-2 created

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-6h2lc -n openshift-machine-api
pod "metal3-6d4b84c44c-6h2lc" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       ready                                                              redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7                      true

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-zwv4z-worker-0 --replicas=3
machineset.machine.openshift.io/ocp-edge-cluster-0-zwv4z-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-wl9nr -n openshift-machine-api
pod "metal3-6d4b84c44c-wl9nr" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-xxtdk   redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7   unknown            true

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0308