Bug 1911664

Summary:	[Negative Test] After deleting metal3 pod, scaling worker stuck on provisioning state
Product:	OpenShift Container Platform	Reporter:	Polina Rabinovich <prabinov>
Component:	Bare Metal Hardware Provisioning	Assignee:	Derek Higgins <derekh>
Bare Metal Hardware Provisioning sub component:	ironic	QA Contact:	Polina Rabinovich <prabinov>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	derekh, rbartal, smiron, zbitter
Version:	4.7	Keywords:	Triaged, UpcomingSprint
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1918779		Environment:
Last Closed:	2021-02-24 15:49:31 UTC	Type:	Bug
Bug Blocks:	1918779

Description Polina Rabinovich 2020-12-30 15:35:32 UTC

Version:
Client Version: 4.7.0-0.nightly-2020-12-21-131655
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-21-131655   True        False         130m    Cluster version is 4.7.0-0.nightly-2020-12-21-131655

-------------------------------------------

Platform:
libvirt
-------------------------------------------

What happened?
Setup: Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected environment

After deleting the metal3 pod, scaling up the worker did not succeed; the new host stayed stuck in the provisioning state.

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/d81fff7c-7603-4bd6-85b7-86779da02b96                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/5ab0f8da-ec95-4624-8a9f-79404f0fd1e3                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/b2a26b0b-5848-4f84-8e18-ead1df064bec                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-b82kh   redfish://192.168.123.1:8000/redfish/v1/Systems/0160c42c-9809-4ad3-bdd6-e5f55129b57a   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   redfish://192.168.123.1:8000/redfish/v1/Systems/30f07f16-3986-447a-8fbc-74dc0ff6b444   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   redfish://192.168.123.1:8000/redfish/v1/Systems/98f4d225-69c0-48f9-9123-24b0d12d5a41   unknown            true     
openshift-machine-api   openshift-worker-0-3   OK       provisioning             ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0   unknown            true   


[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-0         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-1         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-2         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   Running                               72m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-b82kh   Running                               158m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   Provisioning                          47m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   Running                               158m


-------------------------------------------
How to reproduce it (as minimally and precisely as possible)?

$ oc create -f new-node3.yaml -n openshift-machine-api

new-node3.yaml: 

apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-3-bmc-secret
type: Opaque
data:
  username: YWRtaW4K
  password: cGFzc3dvcmQK
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-3
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0
    credentialsName: openshift-worker-0-3-bmc-secret
    disableCertificateVerification: True
    username: admin
    password: password
  bootMACAddress: 52:54:00:b3:e8:96
  hardwareProfile: unknown


$ oc delete pod [metal3 pod name] -n openshift-machine-api
(To find the metal3 pod name: oc get pods -A | grep metal3)
 
(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "inspecting" and once finished to "ready")

$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=N+1
(where N is the current number of machines in the machineset)

$ oc delete pod [metal3 pod name] -n openshift-machine-api

(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "provisioning" and once finished to "provisioned")
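
The pod lookup used in the steps above can be scripted instead of done by eye. This is a minimal sketch, assuming the pod follows the Deployment naming pattern seen in this cluster (metal3-<hash>-<suffix>). The helper name metal3_pod_name is hypothetical; it only filters text on stdin, so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: pick the metal3 pod name out of `oc get pods --no-headers` output.
# metal3_pod_name is a hypothetical helper; it reads pod rows on stdin and
# prints any name shaped like metal3-<hash>-<suffix>.
metal3_pod_name() {
    awk '$1 ~ /^metal3-[a-z0-9]+-[a-z0-9]+$/ { print $1 }'
}

# Example use against a live cluster (left commented out on purpose):
# pod=$(oc get pods -n openshift-machine-api --no-headers | metal3_pod_name)
# oc delete pod "$pod" -n openshift-machine-api
```

Using -n rather than -A keeps the pod name in the first column, which is what the filter matches on.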

-----------------------------------------------
Actual Result:
After deleting the metal3 pod, the bmh got stuck in the provisioning state
-----------------------------------------------
What did you expect to happen?
Successful scaling of the worker 
(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "provisioning" and once finished to "provisioned")
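
Watching the state by hand can be replaced with a small polling loop. This is a sketch under assumptions: wait_for_state and POLL_INTERVAL are hypothetical names, and the command that reports the state is passed in as arguments so the loop can be exercised without a cluster.

```shell
#!/bin/sh
# Sketch: poll a command until it prints the wanted state or attempts run out.
# Usage: wait_for_state <wanted-state> <max-attempts> <command...>
# wait_for_state and POLL_INTERVAL are hypothetical names for illustration.
wait_for_state() {
    want="$1"
    tries="$2"
    shift 2
    i=0
    state=""
    while [ "$i" -lt "$tries" ]; do
        # Run the supplied command and capture whatever state it prints.
        state=$("$@")
        if [ "$state" = "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-10}"
    done
    echo "timed out waiting for state: $want (last seen: $state)" >&2
    return 1
}
```

Against a cluster this might be called as wait_for_state provisioned 60 oc get bmh openshift-worker-0-3 -n openshift-machine-api -o jsonpath='{.status.provisioning.state}', assuming the BareMetalHost status exposes the state at that path. A host that never leaves "provisioning", as in this report, would make the call return non-zero after the last attempt.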
-----------------------------------------------
must gather - https://drive.google.com/drive/folders/1dShieHsQ1o0TQMYkn_NWAPvGZaUdUJ8t?usp=sharing

Comment 1 Derek Higgins 2021-01-07 15:14:18 UTC

This turned out to be a flavour of a bug we hit a few weeks ago
https://bugzilla.redhat.com/show_bug.cgi?id=1901040

At the time we found a bug that causes the provisioning device to
lose its link-local address; we have opened a bug for this to be
investigated in coreos (possibly rhel/networkmanager)
https://bugzilla.redhat.com/show_bug.cgi?id=1908302

As a workaround we added a line to set addr_gen_mode, which prevents
the link-local address from being lost
echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

This worked in most cases, except where the "metal3-static-ip-manager"
container moved from one master to another and back again (as it did above when
the metal3 pod was deleted twice). In that case the workaround fails because addr_gen_mode
was already "0", and setting it to "0" again triggers nothing.

I'm updating the workaround to toggle addr_gen_mode to 1 and then back to 0,
which deals with this case.
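
The toggle described above can be sketched as a small shell helper. This is an illustration only, not the change that actually landed: the function name toggle_addr_gen_mode and the SYSCTL_ROOT override (which allows a dry run against a scratch directory) are assumptions. Per the kernel's ip-sysctl documentation, addr_gen_mode 0 means EUI64-based link-local generation and 1 means no link-local address is generated, so writing 1 first makes the final write of 0 a real change even when the value was already 0.

```shell
#!/bin/sh
# Sketch: toggle addr_gen_mode so the final write of 0 is an actual change.
# toggle_addr_gen_mode and SYSCTL_ROOT are hypothetical names, not taken
# from the fix itself.
toggle_addr_gen_mode() {
    iface="$1"
    # SYSCTL_ROOT defaults to the real sysctl tree; override it for dry runs.
    path="${SYSCTL_ROOT:-/proc/sys}/net/ipv6/conf/$iface/addr_gen_mode"
    if [ ! -e "$path" ]; then
        echo "no addr_gen_mode entry for interface: $iface" >&2
        return 1
    fi
    # 1 = do not generate a link-local address, 0 = EUI64-based generation.
    echo 1 > "$path" || return 1
    echo 0 > "$path"
}
```

On a master node this would be run as root, for example toggle_addr_gen_mode "$PROVISIONING_INTERFACE", and ip -6 addr show dev "$PROVISIONING_INTERFACE" scope link can then be used to check that the link-local address is present.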

Comment 3 Polina Rabinovich 2021-01-18 09:50:13 UTC

[kni@provisionhost-0-0 ~]$ oc create -f new-node2.yaml -n openshift-machine-api
secret/openshift-worker-0-2-bmc-secret created
baremetalhost.metal3.io/openshift-worker-0-2 created

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-6h2lc -n openshift-machine-api
pod "metal3-6d4b84c44c-6h2lc" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       ready                                                         redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7                      true 

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-zwv4z-worker-0 --replicas=3
machineset.machine.openshift.io/ocp-edge-cluster-0-zwv4z-worker-0 scaled


[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-wl9nr -n openshift-machine-api
pod "metal3-6d4b84c44c-wl9nr" deleted


[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-xxtdk   redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7   unknown            true

Comment 6 errata-xmlrpc 2021-02-24 15:49:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633