1918779 – [Negative Test] After deleting metal3 pod, scaling worker stuck on provisioning state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Derek Higgins
QA Contact:	Ori Michaeli
Docs Contact:
URL:
Whiteboard:
Depends On:	1911664
Blocks:

Reported:	2021-01-21 14:35 UTC by Derek Higgins
Modified:	2021-02-08 13:51 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, in some circumstances, the provisioning interface could lose its IPv6 link-local address, preventing more workers from being provisioned. This is now fixed.
Clone Of:	1911664
Environment:
Last Closed:	2021-02-08 13:51:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ironic-static-ip-manager pull 17	0	None	closed	Bug 1918779: Toggle addr_gen_mode to 1 and back	2021-02-13 04:47:07 UTC
Red Hat Product Errata	RHSA-2021:0308	0	None	None	None	2021-02-08 13:51:57 UTC

Description Derek Higgins 2021-01-21 14:35:15 UTC

This problem can also affect 4.6.

+++ This bug was initially created as a clone of Bug #1911664 +++

Version:
Client Version: 4.7.0-0.nightly-2020-12-21-131655
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-21-131655   True        False         130m    Cluster version is 4.7.0-0.nightly-2020-12-21-131655

-------------------------------------------

Platform:
libvirt
-------------------------------------------

What happened?
Setup: Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected environment

After deleting the metal3 pod, scaling the worker did not succeed and the new host stayed stuck in the provisioning state.

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/d81fff7c-7603-4bd6-85b7-86779da02b96                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/5ab0f8da-ec95-4624-8a9f-79404f0fd1e3                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/b2a26b0b-5848-4f84-8e18-ead1df064bec                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-b82kh   redfish://192.168.123.1:8000/redfish/v1/Systems/0160c42c-9809-4ad3-bdd6-e5f55129b57a   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   redfish://192.168.123.1:8000/redfish/v1/Systems/30f07f16-3986-447a-8fbc-74dc0ff6b444   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   redfish://192.168.123.1:8000/redfish/v1/Systems/98f4d225-69c0-48f9-9123-24b0d12d5a41   unknown            true     
openshift-machine-api   openshift-worker-0-3   OK       provisioning             ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0   unknown            true   


[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-0         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-1         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-2         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   Running                               72m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-b82kh   Running                               158m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   Provisioning                          47m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   Running                               158m


-------------------------------------------
How to reproduce it (as minimally and precisely as possible)?

$ oc create -f new-node3.yaml -n openshift-machine-api

new-node3.yaml: 

apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-3-bmc-secret
type: Opaque
data:
  username: YWRtaW4K
  password: cGFzc3dvcmQK
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-3
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0
    credentialsName: openshift-worker-0-3-bmc-secret
    disableCertificateVerification: True
    username: admin
    password: password
  bootMACAddress: 52:54:00:b3:e8:96
  hardwareProfile: unknown


$ oc delete pod [metal3 pod name] -n openshift-machine-api
(To find the metal3 pod name: oc get pods -A | grep metal3)
 
(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "inspecting" and once finished to "ready")

$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=N+1
(where N is the current number of machines in the machineset)

$ oc delete pod [metal3 pod name] -n openshift-machine-api

(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "provisioning" and once finished to "provisioned")
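
The metal3 pod lookup used in the steps above could be scripted roughly as follows. This is only a sketch and not part of the original report: the function name metal3_pod_names is hypothetical, and it assumes the pod name starts with "metal3-", as the pod names seen in this bug do.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin
# and prints the names of pods whose name starts with "metal3-".
metal3_pod_names() {
    awk '$1 ~ /^metal3-/ { print $1 }'
}

# Example usage (requires a logged-in oc client; xargs -r is GNU xargs):
#   oc get pods -n openshift-machine-api --no-headers | metal3_pod_names \
#     | xargs -r oc delete pod -n openshift-machine-api
```

Matching on the first column rather than grepping the whole line avoids picking up pods that only mention metal3 elsewhere in the row.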

-----------------------------------------------
Actual Result:
After deleting the metal3 pod, the bmh is stuck in the provisioning state.
-----------------------------------------------
What did you expect to happen?
Successful scaling of the worker 
(Watch the "PROVISIONING STATUS" of the newly added bmh switching to "provisioning" and once finished to "provisioned")
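
Instead of watching the status by hand, the expected transition could be checked with a small polling loop. This is a sketch under assumptions: the function name wait_for_bmh_state and its defaults are hypothetical, and it reads the state from the BareMetalHost field .status.provisioning.state.

```shell
# Hypothetical helper: poll a BareMetalHost until its provisioning state
# matches the wanted value, or give up after a number of attempts.
#   $1 host name, $2 wanted state,
#   $3 attempts (default 60), $4 delay between attempts in seconds (default 30)
wait_for_bmh_state() {
    local host="$1"
    local want="$2"
    local attempts="${3:-60}"
    local delay="${4:-30}"
    local state=""
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        state=$(oc get bmh "$host" -n openshift-machine-api \
            -o jsonpath='{.status.provisioning.state}')
        if [ "$state" = "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out: $host is in state '$state', wanted '$want'" >&2
    return 1
}

# Example usage:
#   wait_for_bmh_state openshift-worker-0-3 provisioned
```

With the bug present, a call like the one above would be expected to time out with the host still in "provisioning".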
-----------------------------------------------
must gather - https://drive.google.com/drive/folders/1dShieHsQ1o0TQMYkn_NWAPvGZaUdUJ8t?usp=sharing

--- Additional comment from Derek Higgins on 2021-01-07 15:14:18 GMT ---

This turned out to be a flavour of a bug we hit a few weeks ago
https://bugzilla.redhat.com/show_bug.cgi?id=1901040

At the time we found a bug that causes the provisioning device to
lose its link-local address. We have opened a bug for this to be
investigated in coreos (possibly rhel/networkmanager)
https://bugzilla.redhat.com/show_bug.cgi?id=1908302
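
To confirm whether the provisioning device still has a link-local address, something like the following could be run on the master hosting the metal3 pod. It is a sketch, not the check used during the investigation: the function name has_link_local is hypothetical, and it assumes the one-line output format of `ip -6 -o addr show`.

```shell
# Hypothetical helper: reads `ip -6 -o addr show dev <interface>` output on
# stdin and succeeds if it contains a link-local (fe80::/10) address.
has_link_local() {
    grep -Eiq 'inet6 fe[89ab][0-9a-f]:'
}

# Example usage:
#   ip -6 -o addr show dev "$PROVISIONING_INTERFACE" | has_link_local \
#     || echo "no link-local address on $PROVISIONING_INTERFACE"
```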

As a workaround we added a line to set addr_gen_mode, which prevents
the link-local address from being lost:
echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

This worked in most cases, except where the "metal3-static-ip-manager"
container moved from one master to another and back again (as it did above when
the metal3 pod was deleted twice). In that case the workaround fails because addr_gen_mode
was already "0", and setting it to "0" again triggers nothing.

I'm updating the workaround to toggle addr_gen_mode to 1 and then back to 0,
which deals with this case.
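
The toggle described here might look roughly like the sketch below. It is an illustration only, not the code from the linked pull request: the function name toggle_addr_gen_mode is hypothetical, and the path is passed as an argument so the function can be tried against an ordinary file.

```shell
# Hypothetical sketch of the toggle: write 1, then 0, so that the final
# write of 0 is an actual change even when addr_gen_mode was already 0.
#   $1 path to the interface's addr_gen_mode file
toggle_addr_gen_mode() {
    echo 1 > "$1" && echo 0 > "$1"
}

# Example usage (needs root on the node):
#   toggle_addr_gen_mode "/proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode"
```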

--- Additional comment from OpenShift Automated Release Tooling on 2021-01-14 18:06:38 GMT ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from Polina Rabinovich on 2021-01-18 09:50:13 GMT ---

[kni@provisionhost-0-0 ~]$ oc create -f new-node2.yaml -n openshift-machine-api
secret/openshift-worker-0-2-bmc-secret created
baremetalhost.metal3.io/openshift-worker-0-2 created

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-6h2lc -n openshift-machine-api
pod "metal3-6d4b84c44c-6h2lc" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       ready                                                         redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7                      true 

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-zwv4z-worker-0 --replicas=3
machineset.machine.openshift.io/ocp-edge-cluster-0-zwv4z-worker-0 scaled


[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-wl9nr -n openshift-machine-api
pod "metal3-6d4b84c44c-wl9nr" deleted


[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-xxtdk   redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7   unknown            true
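
For repeated verification runs, the check that no host is left in "provisioning" could be automated along these lines. This is a sketch under assumptions: the function name hosts_in_state is hypothetical, and it expects tab-separated name and state pairs such as those produced by the jsonpath query in the usage comment.

```shell
# Hypothetical helper: reads "name<TAB>state" lines on stdin and prints the
# names of hosts whose provisioning state equals the given state ($1).
hosts_in_state() {
    awk -F '\t' -v want="$1" '$2 == want { print $1 }'
}

# Example usage:
#   oc get bmh -n openshift-machine-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.provisioning.state}{"\n"}{end}' \
#     | hosts_in_state provisioning
```

Using tab-separated output instead of the default table avoids splitting multi-word states such as "externally provisioned".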

Comment 5 errata-xmlrpc 2021-02-08 13:51:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0308
