Description of problem:
In recent builds of 4.6, baremetal IPI installs have been failing due to issues with the DHCP-assigned IP changing. It looks like this is because the dhclient binary is missing, which causes the mechanism used to ensure consistent IPs between dracut and NetworkManager to break.

Oddly, the dhcp-client package appears to be installed:

# rpm -ql dhcp-client | grep sbin
/usr/sbin/dhclient
/usr/sbin/dhclient-script

However, those files don't actually exist:

# /usr/sbin/dhclient
bash: /usr/sbin/dhclient: No such file or directory

How reproducible:
Always

Steps to Reproduce:
1. Attempt to deploy baremetal IPI.

Actual results:
Nodes get a different IP address after their initial reboot, which breaks OpenShift service configuration.

Expected results:
Nodes get same IP after reboot.

Additional info:
This problem is not present in a current 4.5 image.
We switched to dhcp=internal with RHEL 8.2. See also https://bugzilla.redhat.com/show_bug.cgi?id=1204226
This also came up with https://bugzilla.redhat.com/show_bug.cgi?id=1800901

What are you doing with dhclient?
Oh, you're saying we get a different IP address in the initramfs versus the real root? This reminds me of https://github.com/coreos/fedora-coreos-config/pull/82
I believe that was the motivation for us to use dhclient, yeah. It didn't break us to have different addresses, but it was a bit confusing and wasted addresses in the deployer environment. I think the problem now is that we still force dhclient, so before pivot we use that client, then after the pivot we end up with internal and get a different address. That does break us, at least on IPv6.
Ahhh right, we're pivoting from 4.5 bootimages still. OK so this one should be fixed when we update the pinned RHCOS in the installer.
I tested with https://github.com/openshift/installer/pull/3763 applied and /usr/sbin/dhclient still seems to be missing. Do we need to fix that in the RHCOS image before bumping the version in the installer?

$ cat os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="4.5"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 4.5 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='46.82.202006161801-0'

Install-config version (mirrored locally via dev-scripts):

bootstrapOSImage: http://[fd00:1101::1]/images/rhcos-46.82.202006162207-0-qemu.x86_64.qcow2.gz?sha256=20f030d87afad007130e1ab6ce844748d0fb95ba904ec8f10e03cbea04da7fcf
clusterOSImage: http://[fd00:1101::1]/images/rhcos-46.82.202006162207-0-openstack.x86_64.qcow2.gz?sha256=6c3644591b4b5a46debcdd18f2eb4acacd934f00a3f89dd0565a7de4d7426f91

[core@master-0 conf.d]$ sudo ls /usr/sbin/dhclient
ls: cannot access '/usr/sbin/dhclient': No such file or directory
[core@master-0 conf.d]$ cat /etc/NetworkManager/conf.d/99-kni.conf
[main]
dhcp=dhclient
rc-manager=unmanaged
[connection]
ipv6.dhcp-duid=ll

We can also still see that the IP is wrong (not the one reserved via client-id in the dnsmasq config):

[shardy@virthost ~]$ sudo virsh net-dhcp-leases ostestbm | grep master-0
2020-06-17 14:58:37 00:6b:8e:69:99:2e ipv6 fd2e:6f44:5dd8:c956::14/120 master-0 00:03:00:01:00:6b:8e:69:99:2e
2020-06-17 15:03:15 00:6b:8e:69:99:2e ipv6 fd2e:6f44:5dd8:c956::37/120 master-0 00:03:00:01:00:6b:8e:69:99:2e

[shardy@virthost ~]$ ping -c1 fd2e:6f44:5dd8:c956::14 # correct IP reserved in dnsmasq
PING fd2e:6f44:5dd8:c956::14(fd2e:6f44:5dd8:c956::14) 56 data bytes
From fd2e:6f44:5dd8:c956::1: icmp_seq=1 Destination unreachable: Address unreachable

--- fd2e:6f44:5dd8:c956::14 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[shardy@virthost ~]$ ping -c1 fd2e:6f44:5dd8:c956::37 # Incorrect additional IP
PING fd2e:6f44:5dd8:c956::37(fd2e:6f44:5dd8:c956::37) 56 data bytes
64 bytes from fd2e:6f44:5dd8:c956::37: icmp_seq=1 ttl=64 time=0.256 ms

--- fd2e:6f44:5dd8:c956::37 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.256/0.256/0.256/0.000 ms
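For anyone wanting to check a node for this mismatch quickly, here is a rough Python sketch (not part of any shipped tooling). It parses a NetworkManager conf.d snippet like the 99-kni.conf above and reports whether it selects dhcp=dhclient while the binary is absent. The function name is made up for illustration, and treating an unset `dhcp` key as "internal" is an assumption based on the RHEL 8.2 default mentioned earlier in this bug. What NetworkManager itself does when the binary is missing is not modelled here.

```python
import configparser
import os


def forces_missing_dhclient(conf_text, dhclient_path="/usr/sbin/dhclient"):
    """Return True if conf_text selects dhcp=dhclient but dhclient_path is absent.

    Hypothetical helper: an unset [main] dhcp key is treated as "internal",
    which is an assumption, not something read from NetworkManager.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    backend = parser.get("main", "dhcp", fallback="internal")
    return backend == "dhclient" and not os.path.exists(dhclient_path)


# Contents of /etc/NetworkManager/conf.d/99-kni.conf as quoted in this bug.
KNI_CONF = """\
[main]
dhcp=dhclient
rc-manager=unmanaged

[connection]
ipv6.dhcp-duid=ll
"""

if __name__ == "__main__":
    print(forces_missing_dhclient(KNI_CONF))
```

On an affected node this should print True; on a 4.5 image where the binary is present it should print False.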
Wait, you are explicitly setting dhcp=dhclient? Why?
Sorry let me restate: You want a consistent DHCP client ID between the initramfs and the real root, which makes total sense. I *think* (but still need to verify) that's true in current RHCOS 4.6. Are you doing anything else with dhclient like installing a hook?
Reinstating dhclient is going to partially invalidate the work in https://bugzilla.redhat.com/show_bug.cgi?id=1800901 ...

I think what we need here is:
- installer updates to 4.6 bootimage (that's https://github.com/openshift/installer/pull/3763)
- KNI stops forcing on dhcp=dhclient
I'll re-test with the dhcp=dhclient removed from the MCO template - IIRC the reason for that is/was to ensure a consistent IAID with dracut (as well as the client-id which is deterministic due to the ipv6.dhcp-duid=ll). When making reservations in dnsmasq, you only specify the client-id, but it seems that if the IAID ever changes while there's an existing lease, it then gives an IP from the pool rather than the reserved IP.
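To illustrate why the client-id is deterministic with ipv6.dhcp-duid=ll: a DUID-LL (RFC 8415) is the 2-byte DUID type 3, the 2-byte hardware type (1 for Ethernet), then the link-layer address. A minimal Python sketch follows; the function name is made up for illustration, and it only handles 6-octet Ethernet MACs.

```python
def duid_ll_from_mac(mac):
    """Build the DUID-LL string for an Ethernet MAC such as '00:16:80:11:ef:51'.

    Layout: DUID type 3 (00:03), hardware type 1 for Ethernet (00:01),
    followed by the link-layer address itself.
    """
    octets = mac.lower().split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    for octet in octets:
        int(octet, 16)  # raises ValueError on non-hex input
    return ":".join(["00", "03", "00", "01"] + octets)


if __name__ == "__main__":
    print(duid_ll_from_mac("00:16:80:11:ef:51"))
```

This matches the 00:03:00:01:&lt;mac&gt; client-ids seen in the dnsmasq leases for master-0. The IAID is a separate value chosen by the client, so it is not covered by this derivation.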
Ok I re-tested with https://github.com/openshift/installer/pull/3763 and https://github.com/openshift/machine-config-operator/pull/1839 applied, but I still see the wrong IP and two different IAIDs.

What DHCP client is used during the dracut stage of the boot? Is that still dhclient?

I also wonder what we'll do about upgrades if the IP is liable to change when we swap out dhclient for the native client, but I guess we can focus on just getting 4.6 working again as a first step :)

This is the lease with the expected/reserved IP for master-0 (this is from /var/lib/libvirt/dnsmasq):

{
  "iaid": "2148659025",
  "ip-address": "fd2e:6f44:5dd8:c956::14",
  "mac-address": "00:16:80:11:ef:51",
  "hostname": "master-0",
  "client-id": "00:03:00:01:00:16:80:11:ef:51",
  "server-duid": "00:01:00:01:26:7c:f5:41:98:03:9b:87:08:4e",
  "expiry-time": 1592413081
},

This is the wrong IP, with the same client-id but a different IAID:

{
  "iaid": "1575119893",
  "ip-address": "fd2e:6f44:5dd8:c956::39",
  "mac-address": "00:16:80:11:ef:51",
  "hostname": "master-0",
  "client-id": "00:03:00:01:00:16:80:11:ef:51",
  "server-duid": "00:01:00:01:26:7c:f5:41:98:03:9b:87:08:4e",
  "expiry-time": 1592413362
},

We can see from the expiry-time that the "bad" lease happened slightly after the "good" one that used the reservation, but I've not yet captured via tcpdump to exactly correlate with the dracut part of the boot.
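A quick way to spot this pattern in a lease dump is to group the entries by client-id and flag any client-id that shows up with more than one IAID. The Python sketch below does that for the two master-0 leases quoted in this comment. The helper names are made up for illustration, and it assumes the leases have been loaded as a JSON array of objects with the fields shown here.

```python
import json
from collections import defaultdict


def iaids_by_client_id(leases):
    """Map each client-id to the sorted list of IAIDs it was seen with."""
    grouped = defaultdict(set)
    for lease in leases:
        grouped[lease["client-id"]].add(lease["iaid"])
    return {cid: sorted(iaids) for cid, iaids in grouped.items()}


def clients_with_changed_iaid(leases):
    """Return only the client-ids that appear with more than one IAID."""
    return {
        cid: iaids
        for cid, iaids in iaids_by_client_id(leases).items()
        if len(iaids) > 1
    }


# The two master-0 leases from this comment, trimmed to the relevant fields.
LEASES = json.loads("""
[
  {"iaid": "2148659025", "ip-address": "fd2e:6f44:5dd8:c956::14",
   "mac-address": "00:16:80:11:ef:51",
   "client-id": "00:03:00:01:00:16:80:11:ef:51"},
  {"iaid": "1575119893", "ip-address": "fd2e:6f44:5dd8:c956::39",
   "mac-address": "00:16:80:11:ef:51",
   "client-id": "00:03:00:01:00:16:80:11:ef:51"}
]
""")

if __name__ == "__main__":
    for cid, iaids in clients_with_changed_iaid(LEASES).items():
        print(cid, iaids)
```

This only reports the symptom (same client-id, different IAID). It does not tell you which boot stage produced which lease.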
I also tested, and in my environment it looks like master-0 never got the address it was supposed to. It got ::26 for dracut and ::28 for NM. Apparently changing the client broke our static assignments. Weirdly, both of those addresses have the same iaid (but different client-ids):

{
  "iaid": "1575119893",
  "ip-address": "fd2e:6f44:5dd8:c956::26",
  "mac-address": "00:43:53:e4:46:95",
  "client-id": "00:04:55:2a:da:16:0c:4d:cb:82:46:29:d9:ed:6b:b1:5b:1e",
  "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
  "expiry-time": 1592413742
},
{
  "iaid": "1575119893",
  "ip-address": "fd2e:6f44:5dd8:c956::28",
  "mac-address": "00:43:53:e4:46:95",
  "hostname": "master-0",
  "client-id": "00:03:00:01:00:43:53:e4:46:95",
  "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
  "expiry-time": 1592413942
},

Here are the journal entries where it got those leases:

journalctl | grep ip6_address
Jun 17 16:09:02 localhost NetworkManager[672]: <info> [1592410142.0289] dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::26'
Jun 17 16:09:41 localhost NetworkManager[1622]: <info> [1592410181.9665] dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::28'
Jun 17 16:12:23 localhost NetworkManager[1318]: <info> [1592410343.2279] dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::28'

It looks like the first two are pre-reboot, the last is post.

The lease it should have gotten looks like this:

{
  "iaid": "1407469205",
  "ip-address": "fd2e:6f44:5dd8:c956::14",
  "mac-address": "00:43:53:e4:46:95",
  "hostname": "master-0",
  "client-id": "00:03:00:01:00:43:53:e4:46:95",
  "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
  "expiry-time": 1592413367
},

I guess there must be a difference in how the internal client comes up with the iaid?
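If it helps anyone correlate leases with boot stages, here is a small Python sketch that pulls the NetworkManager PID, interface and DHCPv6 address out of journal lines like the ones in this comment. The regex is written against this exact log format only, and the function name is made up for illustration.

```python
import re

# Matches lines such as:
# ... NetworkManager[672]: <info> [...] dhcp6 (enp2s0): option ip6_address => 'fd2e:...::26'
LEASE_RE = re.compile(
    r"NetworkManager\[(?P<pid>\d+)\]: .*dhcp6 \((?P<iface>[^)]+)\): "
    r"option ip6_address\s+=> '(?P<addr>[^']+)'"
)


def dhcp6_addresses(journal_lines):
    """Return (pid, interface, address) for each DHCPv6 address line, in order."""
    results = []
    for line in journal_lines:
        match = LEASE_RE.search(line)
        if match:
            results.append(
                (int(match.group("pid")), match.group("iface"), match.group("addr"))
            )
    return results


# The three journal lines quoted in this comment.
JOURNAL = [
    "Jun 17 16:09:02 localhost NetworkManager[672]: <info> [1592410142.0289] "
    "dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::26'",
    "Jun 17 16:09:41 localhost NetworkManager[1622]: <info> [1592410181.9665] "
    "dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::28'",
    "Jun 17 16:12:23 localhost NetworkManager[1318]: <info> [1592410343.2279] "
    "dhcp6 (enp2s0): option ip6_address => 'fd2e:6f44:5dd8:c956::28'",
]

if __name__ == "__main__":
    for pid, iface, addr in dhcp6_addresses(JOURNAL):
        print(pid, iface, addr)
```

Each distinct PID is a separate NetworkManager instance, so a change in address between consecutive PIDs points at the stage where the lease identity changed.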
> Oddly, the dhcp-client package appears to be installed:
> # rpm -ql dhcp-client | grep sbin
> /usr/sbin/dhclient
> /usr/sbin/dhclient-script

Whenever posting things like this, please *also* post `rpm -q` and also the output of `rpm-ostree status -b`.

I am not seeing this:

```
$ rpm -q dhclient
package dhclient is not installed
$ rpm-ostree status -b
State: idle
BootedDeployment:
* ostree://f6b9bc2a6ee0e6b4e07901480864af68577d8d6dd57425411c630e41cb88caa4
                   Version: 46.82.202006171555-0 (2020-06-17T15:59:15Z)
$
```
The package isn't called dhclient, it's dhcp-client. Here's the output from my latest test run with the installer patch to use the new image:

[root@master-0 core]# rpm -q dhcp-client
dhcp-client-4.3.6-40.el8.x86_64
[root@master-0 core]# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
● pivot://registry.svc.ci.openshift.org/ocp/4.6-2020-06-17-154742@sha256:0f4899327850d1f5a38b09a8e5d3d978e99439e83934ce86d95cb3bf33d0d504
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202006161801-0 (2020-06-16T18:05:39Z)

Worth noting that this deployment included an MCO patch that fixed DHCP, but now we've run into a different issue with OVN that is blocking progress. I'm working on a PR for the fix, but we had some other config related to dhclient that I need to figure out how to migrate.
I tested with https://github.com/openshift/installer/pull/3763 and https://github.com/openshift/machine-config-operator/pull/1840 applied. I then re-tested without the installer PR (so with the older RHCOS bootimage), and the DHCPv6 IP for the masters looks correct in both cases.

The cluster still doesn't come fully up, but that doesn't seem to be related to the DHCP issues (there are OVN issues, ref https://bugzilla.redhat.com/show_bug.cgi?id=1848048).
https://github.com/openshift/installer/pull/3763 and https://github.com/openshift/machine-config-operator/pull/1851 landed - how are we on this issue?
1851 unblocked baremetal, but according to https://github.com/openshift/machine-config-operator/pull/1865 it's now blocking openstack as well. Ultimately I think we want to get https://github.com/openshift/machine-config-operator/pull/1840 in to call this solved as that fixes it for all platforms and cleans up the dhclient cruft.
OpenStack was unblocked with https://github.com/openshift/machine-config-operator/pull/1865 and https://github.com/openshift/installer/pull/3789. Still waiting on https://github.com/openshift/machine-config-operator/pull/1840 for the real MCO fix.
Deployment finished successfully, reboot of master node performed, IP received as expected.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-13-203610   True        False         5h35m   Cluster version is 4.6.0-0.nightly-2020-07-13-203610
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196