Bug 1911664
Summary: | [Negative Test] After deleting metal3 pod, scaling worker stuck on provisioning state | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Polina Rabinovich <prabinov> | |
Component: | Bare Metal Hardware Provisioning | Assignee: | Derek Higgins <derekh> | |
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Polina Rabinovich <prabinov> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | medium | |||
Priority: | medium | CC: | derekh, rbartal, smiron, zbitter | |
Version: | 4.7 | Keywords: | Triaged, UpcomingSprint | |
Target Milestone: | --- | |||
Target Release: | 4.7.0 | |||
Hardware: | Unspecified | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1918779 | Environment: | |
Last Closed: | 2021-02-24 15:49:31 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1918779 |
Description
Polina Rabinovich
2020-12-30 15:35:32 UTC
This turned out to be a flavour of a bug we hit a few weeks ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1901040

At the time we found a bug that causes the provisioning device to lose its link-local address. We have opened a bug for this to be investigated in coreos (possibly rhel/networkmanager):
https://bugzilla.redhat.com/show_bug.cgi?id=1908302

As a workaround we added a line to set addr_gen_mode, which prevents the link-local address from being lost:

    echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

This worked in most cases, except when the "metal3-static-ip-manager" container moved from one master to another and back again (as it did above when the metal3 pod was deleted twice). In that case the workaround fails because addr_gen_mode was already "0", and setting it to "0" again triggers nothing. I'm updating the workaround to toggle addr_gen_mode to 1 and then back to 0, which deals with this case.

[kni@provisionhost-0-0 ~]$ oc create -f new-node2.yaml -n openshift-machine-api
secret/openshift-worker-0-2-bmc-secret created
baremetalhost.metal3.io/openshift-worker-0-2 created

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-6h2lc -n openshift-machine-api
pod "metal3-6d4b84c44c-6h2lc" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       ready                                                              redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7                      true

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-zwv4z-worker-0 --replicas=3
machineset.machine.openshift.io/ocp-edge-cluster-0-zwv4z-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-wl9nr -n openshift-machine-api
pod "metal3-6d4b84c44c-wl9nr" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-xxtdk   redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7   unknown            true

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633
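For readers who want to see what the updated workaround from the comment above (toggling addr_gen_mode to 1 and then back to 0) might look like, here is a minimal sketch. It is not the code that shipped in the fix, which this report does not show. The function name toggle_addr_gen_mode is hypothetical; the sysctl path and the PROVISIONING_INTERFACE variable are taken from the original one-line workaround.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the actual fix).
# Writes 1 and then 0 to the given addr_gen_mode file so that the value
# changes even when it was already 0 before the call.
toggle_addr_gen_mode() {
    sysctl_file="$1"
    echo 1 > "$sysctl_file" || return 1
    echo 0 > "$sysctl_file" || return 1
}

# Only touch /proc when an interface name has been provided.
if [ -n "${PROVISIONING_INTERFACE:-}" ]; then
    toggle_addr_gen_mode "/proc/sys/net/ipv6/conf/${PROVISIONING_INTERFACE}/addr_gen_mode"
fi
```

Taking the file path as an argument keeps the helper easy to try against a scratch file before running it as root on a real host. Afterwards, something like `ip -6 addr show dev "$PROVISIONING_INTERFACE" scope link` could be used to check whether the interface has a link-local address.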