1823359 – Openshift 4.4 Baremetal IPI install fails using external DHCP server on provisioning network

Bug 1823359 - Openshift 4.4 Baremetal IPI install fails using external DHCP server on provisioning network

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Stephen Benjamin
QA Contact:	Chad Crum
Docs Contact:	Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:	1826922

Reported:	2020-04-13 13:08 UTC by Chad Crum
Modified:	2020-07-13 17:27 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1826922
Environment:
Last Closed:	2020-07-13 17:27:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 3496	0	None	closed	Bug 1823359: baremetal: update provisioning CR to quote strings	2020-09-08 10:35:55 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:27:45 UTC

Internal Links: 1826983

Comment 1 Stephen Benjamin 2020-04-13 18:45:31 UTC

Can you attach the dnsmasq configuration? If the control plane succeeds but workers fail, are you sure the next-server is set up correctly for workers? Workers use a different next-server IP than the control plane.

You can also use virtualmedia-based installs, which reduces the complexity of needing PXE configuration in the external DHCP server.

Comment 2 Chad Crum 2020-04-13 19:15:10 UTC

Below is the dnsmasq config I'm using. I'm trying to give masters and workers separate next-server values, using tags matched on each node's MAC address.


interface=provisioning-0

except-interface=lo
bind-dynamic
#enable-tftp
#tftp-root=/shared/tftpboot

# Disable listening for DNS
port=0
log-dhcp
dhcp-range=172.22.0.10,172.22.0.100

dhcp-option-force=tag:master,66,172.22.0.2
dhcp-option-force=tag:worker,66,172.22.0.3

# Disable default router(s) and DNS over provisioning network
dhcp-option=3
dhcp-option=6

dhcp-host=52:54:00:35:17:a2,set:master
dhcp-host=52:54:00:ad:06:8e,set:master
dhcp-host=52:54:00:87:b2:66,set:master
dhcp-host=52:54:00:06:4a:8e,set:worker
dhcp-host=52:54:00:4d:c0:0d,set:worker

# IPv4 Configuration:
dhcp-match=ipxe,175

# Client is already running iPXE; move to next stage of chainloading
dhcp-boot=tag:master,tag:ipxe,http://172.22.0.2:80/dualboot.ipxe
dhcp-boot=tag:worker,tag:ipxe,http://172.22.0.3:80/dualboot.ipxe

# Note: Need to test EFI booting
dhcp-match=set:efi,option:client-arch,7
dhcp-match=set:efi,option:client-arch,9
dhcp-match=set:efi,option:client-arch,11

# Client is PXE booting over EFI without iPXE ROM; send EFI version of iPXE chainloader
dhcp-boot=tag:master,tag:efi,tag:!ipxe,ipxe.efi,172.22.0.2
dhcp-boot=tag:worker,tag:efi,tag:!ipxe,ipxe.efi,172.22.0.3

# Client is running PXE over BIOS; send BIOS version of iPXE chainloader
dhcp-boot=tag:master,/undionly.kpxe,172.22.0.2
dhcp-boot=tag:worker,/undionly.kpxe,172.22.0.3

Comment 22 Stephen Benjamin 2020-04-22 18:37:57 UTC

Ok so we figured out what was wrong. On the bootstrap server we see this error:


```
Apr 21 19:05:38 localhost bootkube.sh[14955]: "99_baremetal-provisioning-config.yaml": failed to create provisionings.v1alpha1.metal3.io/provisioning-configuration -n : Provisioning.metal3.io "provisioning-configuration" is invalid: spec.provisioningDHCPRange: Invalid value: "null": spec.provisioningDHCPRange in body must be of type string: "null"
```

Here's the Provisioning CR the installer is creating:

```
$ cat provisioning.yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningInterface: enp4s0
  provisioningIP: 172.22.0.3
  provisioningNetworkCIDR: 172.22.0.0/24
  provisioningDHCPExternal: true
  provisioningDHCPRange: 
  provisioningOSDownloadURL: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202003110027-0/x86_64/rhcos-44.81.202003110027-0-openstack.x86_64.qcow2.gz?sha256=237b9e0af475bf318abbe8d83d5508c2c3d4cca96fdcdb16edace2cc062216d1
```

The installer needs to set provisioningDHCPRange to "" (an empty string) rather than leaving the value blank, which YAML parses as null.
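
As a rough sketch of what a corrected CR could look like: the values are the same as in the CR above, and only the provisioningDHCPRange line is changed to a quoted empty string. This is an illustration, not necessarily the exact output of the fixed installer:

```yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningInterface: enp4s0
  provisioningIP: 172.22.0.3
  provisioningNetworkCIDR: 172.22.0.0/24
  provisioningDHCPExternal: true
  # Quoted empty string: YAML parses this as a string, not null,
  # which is what the "must be of type string" validation expects.
  provisioningDHCPRange: ""
  provisioningOSDownloadURL: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202003110027-0/x86_64/rhcos-44.81.202003110027-0-openstack.x86_64.qcow2.gz?sha256=237b9e0af475bf318abbe8d83d5508c2c3d4cca96fdcdb16edace2cc062216d1
```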

A temporary workaround would be to fix the Provisioning CR, apply it to the cluster, and make the metal3 pod restart:

   oc scale deployment --replicas=0 metal3 -n openshift-machine-api 


We'll need to get this fixed in the installer.

Comment 29 Stephen Benjamin 2020-05-17 20:46:58 UTC

The two 4.4 BZs are:
  - https://bugzilla.redhat.com/show_bug.cgi?id=1826922
  - https://bugzilla.redhat.com/show_bug.cgi?id=1829938

The 4.5 BZs (this one and https://bugzilla.redhat.com/show_bug.cgi?id=1826983) need to be verified before they can be cherry-picked. Did you intend to set yourself as the QA contact on this bug?

Comment 32 errata-xmlrpc 2020-07-13 17:27:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
