Bug 2000081 - [IPI baremetal] The metal3 pod failed to restart when switching from Disabled to Managed provisioning without specifying provisioningInterface parameter
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Steven Hardy
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Depends On:
Blocks: 2012684
 
Reported: 2021-09-01 10:49 UTC by Aleksandra Malykhin
Modified: 2022-03-10 16:06 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When modifying the provisioning.metal3.io provisioning-configuration resource to move provisioningNetwork from Disabled to Managed, the pod restart did not work as expected unless provisioningInterface was also specified.
Consequence: Users who wished to specify an interface by MAC address could not move provisioningNetwork from Disabled to Managed, since the provisioningInterface name is not always consistent between control-plane hosts.
Fix: Added a provisioningMacAddresses field to the provisioning.metal3.io CRD, so that provisioning interfaces for control-plane hosts may be specified by MAC address, not only by name.
Result: It is now possible to specify provisioning interface devices by MAC address, and in that case provisioningNetwork can be moved from Disabled to Managed without specifying the provisioningInterface field.
Clone Of:
Cloned To: 2012684
Environment:
Last Closed: 2022-03-10 16:06:37 UTC
Target Upstream Version:
Embargoed:




Links:
Github: openshift cluster-baremetal-operator pull 195 (last updated 2021-09-09 00:12:08 UTC)
Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-10 16:06:54 UTC)

Description Aleksandra Malykhin 2021-09-01 10:49:37 UTC
Description of problem:


Version-Release number of selected component (if applicable):

OCP 4.9 quay.io/openshift-release-dev/ocp-release:4.9.0-fc.0-x86_64

How reproducible:
1/1

Steps to Reproduce:
1. First deploy OCP 4.9 with provisioning Disabled.
2. Add the provisioning parameters in the provisioning CR without specifying the provisioningInterface (a one-step patch alternative is sketched after these steps).
[kni@provisionhost-0-0 ~]$ oc get provisioning -o yaml >> to_managed.yaml
[kni@provisionhost-0-0 ~]$ vi to_managed.yaml

Change from:
    provisioningNetwork: Disabled
to, for IPv4:
    provisioningDHCPRange: 172.22.0.10,172.22.0.254
    provisioningIP: 172.22.0.3
    provisioningNetwork: Managed
    provisioningNetworkCIDR: 172.22.0.0/24

or, for IPv6:
    provisioningDHCPRange: fd00:1101:0:1::a,fd00:1101:0:1:ffff:ffff:ffff:fffe
    provisioningIP: fd00:1101:0:1::3
    provisioningNetwork: Managed
    provisioningNetworkCIDR: fd00:1101:0:1::/64

[kni@provisionhost-0-0 ~]$ oc apply -f to_managed.yaml
3. Verify the provisioning YAML:
[kni@provisionhost-0-0 ~]$ oc get provisioning -o yaml 

4. This should restart the metal3 pod:
[kni@provisionhost-0-0 ~]$ oc get pods -w -n openshift-machine-api
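
As an alternative to exporting and editing the YAML in step 2, the same change can be applied in one step with a merge patch (a sketch, assuming the default resource name provisioning-configuration and the IPv4 values above):

oc patch provisioning provisioning-configuration --type merge -p \
    '{"spec": {"provisioningNetwork": "Managed", "provisioningIP": "172.22.0.3", "provisioningNetworkCIDR": "172.22.0.0/24", "provisioningDHCPRange": "172.22.0.10,172.22.0.254"}}'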


Actual results:
The metal3 pod was restarted, but didn't return to the Running state
metal3-5778b485d8-vvqzb                        0/10    Init:CrashLoopBackOff   524 (3m4s ago)   44h

Expected results:
The metal3 pod is restarted and returns to the Running state.


Additional info:
See must-gather in the next comment.

Comment 2 Angus Salkeld 2021-09-02 00:17:50 UTC
The broken pod looks like this:

initContainers:
  - command:
    - /set-static-ip
    env:
    - name: PROVISIONING_IP
      value: fd00:1101:0:1::3/64
    - name: PROVISIONING_INTERFACE
    - name: PROVISIONING_MACS
      value: 52:54:00:a2:33:ff,52:54:00:3e:fa:8e,52:54:00:d2:f6:16
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04c1c932eaa96137b5ca575e232a8181ae46aaa5a146d59cca7e21cc114fe398

Since the pod uses host networking, I logged into a master and ran the script directly on the host:

./test.sh 
+ export PROVISIONING_IP=fd00:1101:0:1::3/64
+ PROVISIONING_IP=fd00:1101:0:1::3/64
+ export PROVISIONING_INTERFACE=
+ PROVISIONING_INTERFACE=
+ export PROVISIONING_MACS=52:54:00:a2:33:ff,52:54:00:3e:fa:8e,52:54:00:d2:f6:16
+ PROVISIONING_MACS=52:54:00:a2:33:ff,52:54:00:3e:fa:8e,52:54:00:d2:f6:16
+ '[' -z fd00:1101:0:1::3/64 ']'
+ '[' -z '' ']'
+ '[' -n 52:54:00:a2:33:ff,52:54:00:3e:fa:8e,52:54:00:d2:f6:16 ']'
+ for mac in ${PROVISIONING_MACS//,/ }
+ ip -br link show up
+ grep -q 52:54:00:a2:33:ff
+ for mac in ${PROVISIONING_MACS//,/ }
+ ip -br link show up
+ grep -q 52:54:00:3e:fa:8e
+ for mac in ${PROVISIONING_MACS//,/ }
+ ip -br link show up
+ grep -q 52:54:00:d2:f6:16
+ '[' -n '' ']'
+ echo 'ERROR: Could not find suitable interface for "fd00:1101:0:1::3/64"'
ERROR: Could not find suitable interface for "fd00:1101:0:1::3/64"
+ exit 1
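
Reading the trace back, the interface-selection logic is roughly the following (a reconstruction from the set -x output above, not the actual /set-static-ip source):

# If no interface name was given, try to derive one from the candidate MACs.
if [ -z "${PROVISIONING_INTERFACE}" ] && [ -n "${PROVISIONING_MACS}" ]; then
    for mac in ${PROVISIONING_MACS//,/ }; do
        # Take the first interface that is up and carries a candidate MAC.
        if ip -br link show up | grep -q "${mac}"; then
            PROVISIONING_INTERFACE=$(ip -br link show up | grep "${mac}" | cut -f 1 -d ' ')
            break
        fi
    done
fi
# Fail the init container if nothing matched.
if [ -z "${PROVISIONING_INTERFACE}" ]; then
    echo "ERROR: Could not find suitable interface for \"${PROVISIONING_IP}\""
    exit 1
fi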

Basically, none of the real MAC addresses match:

ip -br link show up
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP> 
eth0             UP             52:54:00:56:07:da <BROADCAST,MULTICAST,UP,LOWER_UP> 
eth1             UP             52:54:00:ca:42:fc <BROADCAST,MULTICAST,UP,LOWER_UP> 
eth2             UP             52:54:00:06:00:58 <BROADCAST,MULTICAST,UP,LOWER_UP> 
virbr0           DOWN           52:54:00:4b:36:2d <NO-CARRIER,BROADCAST,MULTICAST,UP> 
baremetal-0      UP             52:54:00:ca:42:fc <BROADCAST,MULTICAST,UP,LOWER_UP>

What CBO thinks they should be:
+ export PROVISIONING_MACS=52:54:00:a2:33:ff,52:54:00:3e:fa:8e,52:54:00:d2:f6:16

Note: CBO only reads these once, here: https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L217-L222

Does BMO see this change and reboot the machines with different MACs?

Comment 3 Angus Salkeld 2021-09-02 00:36:54 UTC
Ignore the comment above; I was on the wrong host :facepalm:

The issue seems to be that there are two interfaces that match the same MAC:

ip -br link show up | grep 52:54:00:3e:fa:8e |cut -f 1 -d ' '
enp0s4
br-ex

so the grep matches both and the script finds the wrong one.

ip  link show up
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:06:0e:08 brd ff:ff:ff:ff:ff:ff
3: enp0s4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3e:fa:8e brd ff:ff:ff:ff:ff:ff
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3e:fa:8e brd ff:ff:ff:ff:ff:ff
etc..


Changing:
    ip -br link show up | grep 52:54:00:3e:fa:8e | cut -f 1 -d ' '
to:
    ip -br link show up | grep 52:54:00:3e:fa:8e | grep -v UNKNOWN | cut -f 1 -d ' '
could work.
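
With that extra filter, the bridge drops out because ip -br link reports its operational state as UNKNOWN (see the br-ex entry above), leaving only the physical NIC (derived from the outputs above, not re-run):

ip -br link show up | grep 52:54:00:3e:fa:8e | grep -v UNKNOWN | cut -f 1 -d ' '
enp0s4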

Comment 6 Angus Salkeld 2021-09-13 21:29:13 UTC
@amalykhi please note that the merged PR adds a new field to the Provisioning CRD:

+	// ProvisioningMacAddresses is a list of MAC addresses of network interfaces
+	// on a baremetal server connected to the provisioning network.
+	// Use this instead of ProvisioningInterface to allow interfaces with
+	// different names. If not provided, it will be populated by the
+	// BMH.Spec.BootMACAddress of each master.
+	ProvisioningMacAddresses []string `json:"provisioningMacAddresses,omitempty"`

So while you are in the Disabled state, go onto each of the master machines and get the MAC of the available NIC;
then, when you update the CR to Managed, also add these MACs (a sketch follows below).

We decided that it was less error-prone for the user to do this than to have a script attempt to choose an unused NIC.
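
A sketch of that workflow (the MAC values are the ones quoted from this cluster's pod spec above; substitute your own, and note that provisioning-configuration is the default resource name):

# While provisioningNetwork is still Disabled, collect the MAC of the
# provisioning NIC on each master:
ip -br link show

# Then add the collected MACs in the same update that moves to Managed:
oc patch provisioning provisioning-configuration --type merge -p \
    '{"spec": {"provisioningMacAddresses": ["52:54:00:a2:33:ff", "52:54:00:3e:fa:8e", "52:54:00:d2:f6:16"]}}'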

Comment 15 errata-xmlrpc 2022-03-10 16:06:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

