Bug 1911664 - [Negative Test] After deleting metal3 pod, scaling worker stuck on provisioning state
Summary: [Negative Test] After deleting metal3 pod, scaling worker stuck on provisioning state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Derek Higgins
QA Contact: Polina Rabinovich
URL:
Whiteboard:
Depends On:
Blocks: 1918779
 
Reported: 2020-12-30 15:35 UTC by Polina Rabinovich
Modified: 2021-02-24 15:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned As: 1918779
Environment:
Last Closed: 2021-02-24 15:49:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github metal3-io static-ip-manager-image pull 8 0 None open Toggle addr_gen_mode to 1 and back 2021-01-31 08:52:17 UTC
Github openshift ironic-static-ip-manager pull 14 0 None closed Bug 1911664: Toggle addr_gen_mode to 1 and back 2021-01-31 08:52:17 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:49:54 UTC

Description Polina Rabinovich 2020-12-30 15:35:32 UTC
Version:
Client Version: 4.7.0-0.nightly-2020-12-21-131655
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-21-131655   True        False         130m    Cluster version is 4.7.0-0.nightly-2020-12-21-131655

-------------------------------------------

Platform:
libvirt
-------------------------------------------

What happened?
Setup: Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected environment

After deleting the metal3 pod, scaling of the worker didn't succeed and got stuck in the provisioning state.

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/d81fff7c-7603-4bd6-85b7-86779da02b96                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/5ab0f8da-ec95-4624-8a9f-79404f0fd1e3                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-hsbtr-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/b2a26b0b-5848-4f84-8e18-ead1df064bec                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-b82kh   redfish://192.168.123.1:8000/redfish/v1/Systems/0160c42c-9809-4ad3-bdd6-e5f55129b57a   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   redfish://192.168.123.1:8000/redfish/v1/Systems/30f07f16-3986-447a-8fbc-74dc0ff6b444   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   redfish://192.168.123.1:8000/redfish/v1/Systems/98f4d225-69c0-48f9-9123-24b0d12d5a41   unknown            true     
openshift-machine-api   openshift-worker-0-3   OK       provisioning             ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0   unknown            true   


[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-0         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-1         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-master-2         Running                               3h9m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-9rzmv   Running                               72m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-b82kh   Running                               158m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-fjl9f   Provisioning                          47m
openshift-machine-api   ocp-edge-cluster-0-hsbtr-worker-0-sq8w5   Running                               158m


-------------------------------------------
How to reproduce it (as minimally and precisely as possible)?

$ oc create -f new-node3.yaml -n openshift-machine-api

new-node3.yaml: 

apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-3-bmc-secret
type: Opaque
data:
  username: YWRtaW4K
  password: cGFzc3dvcmQK
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-3
spec:
  online: true
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/2a495a2a-4c1e-41fb-b9ff-c409365bbcc0
    credentialsName: openshift-worker-0-3-bmc-secret
    disableCertificateVerification: True
    username: admin
    password: password
  bootMACAddress: 52:54:00:b3:e8:96
  hardwareProfile: unknown
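
For reference, the base64 values in the secret decode to "admin" and "password" (each with a trailing newline); they can be reproduced with:

$ echo admin | base64       # YWRtaW4K
$ echo password | base64    # cGFzc3dvcmQK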


$ oc delete pod [metal3 pod name] -n openshift-machine-api
(To find the metal3 pod name: oc get pods -A | grep metal3; see also the one-liner sketch below.)
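
For convenience, the lookup and deletion can be combined into a single command (a sketch; it assumes the pod name starts with "metal3-", as in the default deployment):

$ oc -n openshift-machine-api delete "$(oc -n openshift-machine-api get pods -o name | grep '^pod/metal3-' | head -n1)"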
 
(Watch the "PROVISIONING STATUS" of the newly added BMH switch to "inspecting" and, once finished, to "ready".)
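
To follow these status transitions as they happen, the watch flag can be used, e.g.:

$ oc get bmh -n openshift-machine-api -w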

$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=N+1
(where N is the current number of machines in the machineset)
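
The current replica count N can be read from the machineset before scaling, e.g. (assuming a single machineset in the namespace):

$ oc -n openshift-machine-api get machineset -o jsonpath='{.items[0].spec.replicas}'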

$ oc delete pod [metal3 pod name] -n openshift-machine-api

(Watch the "PROVISIONING STATUS" of the newly added BMH switch to "provisioning" and, once finished, to "provisioned".)

-----------------------------------------------
Actual Result:
After deleting the metal3 pod, the BMH got stuck in the provisioning state.
-----------------------------------------------
What did you expect to happen?
Successful scaling of the worker
(the "PROVISIONING STATUS" of the newly added BMH switches to "provisioning" and, once finished, to "provisioned")
-----------------------------------------------
must gather - https://drive.google.com/drive/folders/1dShieHsQ1o0TQMYkn_NWAPvGZaUdUJ8t?usp=sharing

Comment 1 Derek Higgins 2021-01-07 15:14:18 UTC
This turned out to be a flavour of a bug we hit a few weeks ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1901040

At the time we found a bug that causes the provisioning device to
lose its link-local address; we have opened a bug for this to be
investigated in CoreOS (possibly RHEL/NetworkManager):
https://bugzilla.redhat.com/show_bug.cgi?id=1908302
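
For reference, whether the provisioning interface still has its IPv6 link-local (fe80::/10) address can be checked with (substitute the actual interface name):

$ ip -6 addr show dev "$PROVISIONING_INTERFACE" scope link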

As a workaround we added a line setting addr_gen_mode, which prevents
the link-local address from being lost:
echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

This worked in most cases, except when the "metal3-static-ip-manager"
container moved from one master to another and back again (as it did above
when the metal3 pod was deleted twice). In that case the workaround fails,
as addr_gen_mode was already "0" and setting it to "0" again triggers nothing.

I'm updating the workaround to toggle addr_gen_mode to 1 and then back to 0,
which deals with this case.
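
A minimal sketch of the updated workaround, per the linked PR title (writing 1 and then 0 forces the kernel to drop and regenerate the link-local address even if the mode was already 0):

echo 1 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode
echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode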

Comment 3 Polina Rabinovich 2021-01-18 09:50:13 UTC
[kni@provisionhost-0-0 ~]$ oc create -f new-node2.yaml -n openshift-machine-api
secret/openshift-worker-0-2-bmc-secret created
baremetalhost.metal3.io/openshift-worker-0-2 created

[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-6h2lc -n openshift-machine-api
pod "metal3-6d4b84c44c-6h2lc" deleted

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       ready                                                         redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7                      true 

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-zwv4z-worker-0 --replicas=3
machineset.machine.openshift.io/ocp-edge-cluster-0-zwv4z-worker-0 scaled


[kni@provisionhost-0-0 ~]$ oc delete pod metal3-6d4b84c44c-wl9nr -n openshift-machine-api
pod "metal3-6d4b84c44c-wl9nr" deleted


[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/df14cea9-1c75-4cc3-bef5-0fea481e3498                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/a047acb8-ef94-4cbe-a582-64a3cdfbaae9                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-zwv4z-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/161f89e2-9124-43eb-9b3b-907d127e6408                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-zhs2c   redfish://192.168.123.1:8000/redfish/v1/Systems/e1f01c6b-012e-4c93-917e-8c7e849ea884   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-llcgk   redfish://192.168.123.1:8000/redfish/v1/Systems/774f0d6e-6b63-4380-bd00-fe916afaad3f   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-zwv4z-worker-0-xxtdk   redfish://192.168.123.1:8000/redfish/v1/Systems/41f854ca-9597-44f4-97b9-a08032448ae7   unknown            true

Comment 6 errata-xmlrpc 2021-02-24 15:49:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

