Bug 2099584 - [Bare Metal IPI] BMH pre-provisioning network configurations are not preserved when machineset replicas are scaled up/down
Summary: [Bare Metal IPI] BMH pre-provisioning network configurations are not preserved...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Derek Higgins
QA Contact: Yoav Porag
URL:
Whiteboard:
Depends On: 2105345
Blocks:
 
Reported: 2022-06-21 09:43 UTC by Francesco Cristini
Modified: 2023-09-15 01:56 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2105345 (view as bug list)
Environment:
Last Closed: 2022-08-31 12:34:13 UTC
Target Upstream Version:
Embargoed:
rpittau: needinfo-




Links
GitHub openshift cluster-api-provider-baremetal pull 174 (Merged): Bug 2099584: Update BMO vendor to v0.0.0-20211102102625-469cc9bf7fee (last updated 2022-08-23 10:34:39 UTC)
Red Hat Knowledge Base (Solution) 6972940 (last updated 2022-08-23 10:31:32 UTC)
Red Hat Product Errata RHSA-2022:6133 (last updated 2022-08-31 12:34:43 UTC)

Description Francesco Cristini 2022-06-21 09:43:28 UTC
Description of problem:
Network configurations added to a BareMetalHost on day 1 via spec.preprovisioningNetworkDataName do not survive scaling the MachineSet replicas down and back up.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
The easiest way is a 3-node cluster with schedulable masters: add one worker, then scale the worker MachineSet replicas down and back up.

Steps to Reproduce:
1. Add a new worker BMH using spec.preprovisioningNetworkDataName
2. Scale up the worker machineset replicas so the new BMH gets provisioned
3. Scale down the worker machineset replicas so the new BMH gets deprovisioned
4. Scale up the worker machineset replicas so the new BMH gets provisioned again (see the command sketch below)
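
A minimal oc sketch of steps 2-4 (the MachineSet name is illustrative; take yours from the first command):

# Find the worker MachineSet name (it varies per cluster)
oc -n openshift-machine-api get machinesets

# Step 2: scale up so the new BMH gets provisioned
oc -n openshift-machine-api scale machineset <cluster-id>-worker-0 --replicas=2

# Step 3: scale down so the new BMH gets deprovisioned
oc -n openshift-machine-api scale machineset <cluster-id>-worker-0 --replicas=1

# Step 4: scale up again; on affected 4.10 builds the BMH comes back without
# its preprovisioning network data
oc -n openshift-machine-api scale machineset <cluster-id>-worker-0 --replicas=2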

Actual results:
The custom network configuration is lost and the BMH can't be reached anymore.

Expected results:
The network configuration should be preserved.

Additional info:

Comment 1 Darren Carpenter 2022-06-28 13:38:02 UTC
Hi All,

Just bumping this one a little to hopefully get eyes on it. Let us know if there is any information that can be procured from the customer or other to get it moving.

- dc

Comment 2 Derek Higgins 2022-07-06 15:06:44 UTC
Confirmed this happens on 4.10; it doesn't appear to be the case for 4.11. I'm trying to track down the difference.

Comment 3 Derek Higgins 2022-07-07 13:15:31 UTC
(In reply to Derek Higgins from comment #2)
> Confirmed this happens on 4.10; it doesn't appear to be the case for 4.11.
> I'm trying to track down the difference.

BareMetalHostSpec in cluster-api-provider-baremetal doesn't contain 
>       PreprovisioningNetworkDataName string `json:"preprovisioningNetworkDataName,omitempty"`

it contains an outdated version of the vendored baremetal-operator; as a result,
preprovisioningNetworkDataName gets lost when the BMH is used. Adding this field fixes the
problem, but before we update the vendored BMO we first need to make sure we fix
bz#2097695, as fixing this bug triggers the other.
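
A quick way to check whether a given cluster-api-provider-baremetal tree carries the field is to grep the vendored baremetal-operator types; the vendor path here is an assumption based on the standard layout:

# No output means the vendored BMO predates the field and this bug applies
grep -r PreprovisioningNetworkDataName vendor/github.com/metal3-io/baremetal-operator/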

Comment 4 Raúl Fernández 2022-07-26 08:52:38 UTC
Hi,

Any update? It seems that the issue has been identified and the blocker BZ is also solved.

Also, is there any fix or workaround that can be applied without needing to upgrade to the z-stream release where this fix will land?

Many thanks.

Comment 5 Francesco Cristini 2022-08-10 14:44:11 UTC
Hi!
Any update on this?

Comment 7 Derek Higgins 2022-08-22 13:32:54 UTC
@rauferna the PR has now merged; you should be able to try it on the next nightly.

Comment 8 Zane Bitter 2022-08-22 19:03:42 UTC
Workaround for this issue (a command sketch follows the list):

* Scale down MachineSet to only existing nodes
* Create a new BMH with preprovisioningNetworkDataName (or at least add that field back in, wait a while, then reboot the host)
* *Wait* for inspection to complete
* Scale up MachineSet
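
A rough oc sketch of the workaround above (object names are illustrative; the BMH manifest would carry spec.preprovisioningNetworkDataName as in the example in comment 11):

# Scale down to the existing nodes only
oc -n openshift-machine-api scale machineset <cluster-id>-worker-0 --replicas=1

# Recreate the BMH; its manifest includes preprovisioningNetworkDataName
oc -n openshift-machine-api apply -f new-bmh.yaml

# Wait until inspection completes and the host reports 'available'
oc -n openshift-machine-api get bmh -w

# Only then scale back up
oc -n openshift-machine-api scale machineset <cluster-id>-worker-0 --replicas=2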

Comment 11 Yoav Porag 2022-08-28 12:58:53 UTC
[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-08-26-140250   True        False         154m    Cluster version is 4.10.0-0.nightly-2022-08-26-140250

Deployed a normal cluster with 3 masters / 2 workers provisioned.

Scaled up a worker using the following configuration file:

[kni@provisionhost-0-0 ~]$ cat worker-0-2.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-2-network
  namespace: openshift-machine-api  # must match the namespace of the BMH that references it
type: Opaque
stringData:
  nmstate: |
    routes:
      config:
      - destination: 0.0.0.0/0
        next-hop-address: 192.168.123.1
        next-hop-interface: enp0s4
    dns-resolver:
      config:
        server:
        - 192.168.123.1
    interfaces:
    - name: enp0s4
      state: up
      ipv4:
        address:
        - ip: 192.168.123.11
          prefix-length: 24
        enabled: true
        dhcp: false
---
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-2-bmc-secret
  namespace: openshift-machine-api
type: Opaque
data:
  username: YWRtaW4=
  password: cGFzc3dvcmQK
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-2
  namespace: openshift-machine-api
spec:
  online: True
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/e96821dd-7245-41e8-be43-cc60ec117fc9
    credentialsName: openshift-worker-2-bmc-secret
    disableCertificateVerification: True
  bootMACAddress: 52:54:00:ba:ff:b3
  rootDeviceHints:
    deviceName: /dev/sda
  preprovisioningNetworkDataName: openshift-worker-2-network

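The exact apply/scale commands were not captured in the log; roughly, assuming the names above:

[kni@provisionhost-0-0 ~]$ oc -n openshift-machine-api apply -f worker-0-2.yaml
[kni@provisionhost-0-0 ~]$ oc -n openshift-machine-api get bmh -w    # wait for openshift-worker-0-2 to reach 'available'
[kni@provisionhost-0-0 ~]$ oc -n openshift-machine-api scale machineset <worker-machineset> --replicas=3
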
After multiple scale-ups and scale-downs, worker-0-2 was available and reachable over SSH at the assigned IP address.

Tested in a normal DHCP environment and in a fully DHCP-free environment where all the initial nodes were deployed with static IPs.
In both cases there were no issues.

Comment 13 errata-xmlrpc 2022-08-31 12:34:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.10.30 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6133

Comment 14 Red Hat Bugzilla 2023-09-15 01:56:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

