Bug 1847705 - [openstack, metal] Handle dhclient removal in RHCOS 4.6
Summary: [openstack, metal] Handle dhclient removal in RHCOS 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antonio Murdaca
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks: 1849150
Reported: 2020-06-16 20:08 UTC by Ben Nemec
Modified: 2020-10-27 16:07 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:07:36 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 3763 | 0 | None | closed | Bug 1847705: rhcos: Bump to 46.82.202006162207-0 | 2021-02-19 20:06:46 UTC
Github | openshift machine-config-operator pull 1840 | 0 | None | closed | Bug 1847705: Stop forcing dhclient in baremetal and friends | 2021-02-19 20:06:45 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:07:56 UTC

Description Ben Nemec 2020-06-16 20:08:17 UTC
Description of problem: In recent builds of 4.6, baremetal IPI installs have been failing due to issues with the DHCP-assigned IP changing. It looks like this is because the dhclient binary is missing, which causes the mechanism used to ensure consistent IPs between dracut and NetworkManager to break.

Oddly, the dhcp-client package appears to be installed:
# rpm -ql dhcp-client | grep sbin
/usr/sbin/dhclient
/usr/sbin/dhclient-script

However, those files don't actually exist:
# /usr/sbin/dhclient
bash: /usr/sbin/dhclient: No such file or directory


How reproducible: Always


Steps to Reproduce:
1. Attempt to deploy baremetal IPI.

Actual results: Nodes get a different IP address after their initial reboot, which breaks OpenShift service configuration.


Expected results: Nodes get same IP after reboot.


Additional info: This problem is not present in a current 4.5 image.

Comment 1 Colin Walters 2020-06-16 20:10:04 UTC
We switched to dhcp=internal with RHEL 8.2.

See also https://bugzilla.redhat.com/show_bug.cgi?id=1204226

This also came up with https://bugzilla.redhat.com/show_bug.cgi?id=1800901

What are you doing with dhclient?

Comment 2 Colin Walters 2020-06-16 20:24:20 UTC
Oh, you're saying we get a different IP address in the initramfs versus the real root?

This reminds me of https://github.com/coreos/fedora-coreos-config/pull/82

Comment 3 Ben Nemec 2020-06-16 20:46:54 UTC
I believe that was the motivation for us to use dhclient, yeah. It didn't break us to have different addresses, but it was a bit confusing and wasted addresses in the deployer environment. I think the problem now is that we still force dhclient, so before pivot we use that client, then after the pivot we end up with internal and get a different address. That does break us, at least on IPv6.

Comment 4 Colin Walters 2020-06-16 20:50:49 UTC
Ahhh right, we're pivoting from 4.5 bootimages still.

OK so this one should be fixed when we update the pinned RHCOS in the installer.

Comment 5 Steven Hardy 2020-06-17 13:19:24 UTC
I tested with https://github.com/openshift/installer/pull/3763 applied and /usr/sbin/dhclient still seems to be missing - do we need to fix that in the RHCOS image before bumping the version in the installer?

  $ cat os-release 
  NAME="Red Hat Enterprise Linux CoreOS"
  VERSION="4.5"
  VERSION_ID="4.5"
  OPENSHIFT_VERSION="4.5"
  RHEL_VERSION="8.2"
  PRETTY_NAME="Red Hat Enterprise Linux CoreOS 4.5 (Ootpa)"
  ID="rhcos"
  ID_LIKE="rhel fedora"
  ANSI_COLOR="0;31"
  HOME_URL="https://www.redhat.com/"
  BUG_REPORT_URL="https://bugzilla.redhat.com/"
  REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
  REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
  REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
  REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
  OSTREE_VERSION='46.82.202006161801-0'


Install-config version (mirrored locally via dev-scripts):

  bootstrapOSImage: http://[fd00:1101::1]/images/rhcos-46.82.202006162207-0-qemu.x86_64.qcow2.gz?sha256=20f030d87afad007130e1ab6ce844748d0fb95ba904ec8f10e03cbea04da7fcf
  clusterOSImage: http://[fd00:1101::1]/images/rhcos-46.82.202006162207-0-openstack.x86_64.qcow2.gz?sha256=6c3644591b4b5a46debcdd18f2eb4acacd934f00a3f89dd0565a7de4d7426f91


  [core@master-0 conf.d]$ sudo ls /usr/sbin/dhclient
  ls: cannot access '/usr/sbin/dhclient': No such file or directory
 
  [core@master-0 conf.d]$ cat /etc/NetworkManager/conf.d/99-kni.conf 
  [main]
  dhcp=dhclient
  rc-manager=unmanaged
  [connection]
  ipv6.dhcp-duid=ll



We can also still see that the IP is wrong (not the one reserved via client-id in the dnsmasq config):

  [shardy@virthost ~]$ sudo virsh net-dhcp-leases ostestbm | grep master-0
   2020-06-17 14:58:37   00:6b:8e:69:99:2e   ipv6       fd2e:6f44:5dd8:c956::14/120   master-0   00:03:00:01:00:6b:8e:69:99:2e
   2020-06-17 15:03:15   00:6b:8e:69:99:2e   ipv6       fd2e:6f44:5dd8:c956::37/120   master-0   00:03:00:01:00:6b:8e:69:99:2e

  [shardy@virthost ~]$ ping -c1 fd2e:6f44:5dd8:c956::14 # correct IP reserved in dnsmasq
  PING fd2e:6f44:5dd8:c956::14(fd2e:6f44:5dd8:c956::14) 56 data bytes

  From fd2e:6f44:5dd8:c956::1: icmp_seq=1 Destination unreachable: Address unreachable

  --- fd2e:6f44:5dd8:c956::14 ping statistics ---
  1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

  [shardy@virthost ~]$ ping -c1 fd2e:6f44:5dd8:c956::37 # Incorrect additional IP
  PING fd2e:6f44:5dd8:c956::37(fd2e:6f44:5dd8:c956::37) 56 data bytes
  64 bytes from fd2e:6f44:5dd8:c956::37: icmp_seq=1 ttl=64 time=0.256 ms

  --- fd2e:6f44:5dd8:c956::37 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 0.256/0.256/0.256/0.000 ms

Comment 6 Colin Walters 2020-06-17 13:24:30 UTC
Wait, you are explicitly setting

dhcp=dhclient

why?

Comment 7 Colin Walters 2020-06-17 13:32:11 UTC
Sorry let me restate:

You want a consistent DHCP client ID between the initramfs and the real root, which makes total sense.  I *think* (but still need to verify) that's true in current RHCOS 4.6.

Are you doing anything else with dhclient like installing a hook?

Comment 8 Colin Walters 2020-06-17 14:52:32 UTC
Reinstating dhclient is going to partially invalidate the work in https://bugzilla.redhat.com/show_bug.cgi?id=1800901 ...

I think what we need here is:

- installer updates to 4.6 bootimage (that's https://github.com/openshift/installer/pull/3763)
- KNI stops forcing dhcp=dhclient
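
As an illustration of the second item, here is a minimal Python sketch that drops the forced client from a config like the 99-kni.conf shown in comment 5. The helper name `drop_forced_dhcp_client` is hypothetical and the input text is copied from that comment; the actual change to the MCO template may look different.

```python
import configparser
import io

# Config text as shown in comment 5 (/etc/NetworkManager/conf.d/99-kni.conf).
KNI_CONF = """\
[main]
dhcp=dhclient
rc-manager=unmanaged
[connection]
ipv6.dhcp-duid=ll
"""


def drop_forced_dhcp_client(conf_text: str) -> str:
    """Return conf_text without the dhcp= key in [main] (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if parser.has_section("main"):
        # remove_option returns False when the key is absent, so repeating is safe.
        parser.remove_option("main", "dhcp")
    out = io.StringIO()
    # Write key=value without spaces around the delimiter, as in the original file.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    print(drop_forced_dhcp_client(KNI_CONF))
```

With the `dhcp=` key gone, NetworkManager falls back to its default client, which per comment 1 is the internal one as of RHEL 8.2. The `ipv6.dhcp-duid=ll` setting is left untouched.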

Comment 9 Steven Hardy 2020-06-17 15:11:15 UTC
I'll re-test with the dhcp=dhclient removed from the MCO template - IIRC the reason for that is/was to ensure a consistent IAID with dracut (as well as the client-id which is deterministic due to the ipv6.dhcp-duid=ll).

When making reservations in dnsmasq, you only specify the client-id, but it seems that if the IAID ever changes while there's an existing lease, it then gives an IP from the pool rather than the reserved IP.
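
For reference, the client-id values posted in this bug have the shape of a DHCPv6 DUID-LL: a two-byte type of 3, a two-byte hardware type (1 for Ethernet), then the link-layer address. A minimal Python sketch of that derivation follows; the function name `duid_ll` is hypothetical, and the MAC used is the one from the lease output in this bug.

```python
def duid_ll(mac: str, hw_type: int = 1) -> str:
    """Build a DUID-LL string (type 3) from a colon-separated MAC address.

    hw_type defaults to 1 (Ethernet). Hypothetical helper for illustration.
    """
    mac_octets = [int(part, 16) for part in mac.split(":")]
    # DUID type 3 (DUID-LL) as two bytes, then the hardware type as two bytes.
    prefix = [0x00, 0x03, (hw_type >> 8) & 0xFF, hw_type & 0xFF]
    return ":".join(f"{octet:02x}" for octet in prefix + mac_octets)


if __name__ == "__main__":
    # MAC taken from the virsh net-dhcp-leases output in comment 5.
    print(duid_ll("00:6b:8e:69:99:2e"))
```

Because the value depends only on the MAC, it stays the same across reboots and across DHCP clients, which is why the reservation can key on it. The IAID is a separate field and is not covered by this derivation.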

Comment 10 Steven Hardy 2020-06-17 16:16:01 UTC
OK, I re-tested with https://github.com/openshift/installer/pull/3763 and https://github.com/openshift/machine-config-operator/pull/1839 applied, but I still see the wrong IP and two different IAIDs.

What DHCP client is used during the dracut stage of the boot?  Is that still dhclient?

I also wonder what we'll do about upgrades if the IP is liable to change when we swap out dhclient for the native client, but I guess we can focus on just getting 4.6 working again as a first step :)

This is the lease with the expected/reserved IP for master-0 (this is from /var/lib/libvirt/dnsmasq)

 {
    "iaid": "2148659025",
    "ip-address": "fd2e:6f44:5dd8:c956::14",
    "mac-address": "00:16:80:11:ef:51",
    "hostname": "master-0",
    "client-id": "00:03:00:01:00:16:80:11:ef:51",
    "server-duid": "00:01:00:01:26:7c:f5:41:98:03:9b:87:08:4e",
    "expiry-time": 1592413081
  },


This is the wrong IP, with the same client-id but a different IAID:

  {
    "iaid": "1575119893",
    "ip-address": "fd2e:6f44:5dd8:c956::39",
    "mac-address": "00:16:80:11:ef:51",
    "hostname": "master-0",
    "client-id": "00:03:00:01:00:16:80:11:ef:51",
    "server-duid": "00:01:00:01:26:7c:f5:41:98:03:9b:87:08:4e",
    "expiry-time": 1592413362
  },


We can see from the expiry-time that the "bad" lease was issued slightly after the "good" one that used the reservation, but I have not yet captured the traffic with tcpdump to correlate it exactly with the dracut part of the boot.
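
The comparison above can be automated. Below is a minimal Python sketch that groups lease entries by client-id and reports any client that appears with more than one IAID. The two entries are trimmed copies of the leases in this comment, wrapped in a JSON array; the function name `clients_with_changed_iaid` is hypothetical, and the layout of the real dnsmasq status file on a given host is an assumption.

```python
import json
from collections import defaultdict

# Trimmed copies of the two master-0 leases shown in this comment.
LEASES_JSON = """
[
  {"iaid": "2148659025", "ip-address": "fd2e:6f44:5dd8:c956::14",
   "client-id": "00:03:00:01:00:16:80:11:ef:51", "expiry-time": 1592413081},
  {"iaid": "1575119893", "ip-address": "fd2e:6f44:5dd8:c956::39",
   "client-id": "00:03:00:01:00:16:80:11:ef:51", "expiry-time": 1592413362}
]
"""


def clients_with_changed_iaid(leases):
    """Map each client-id seen with more than one IAID to its sorted IAIDs."""
    iaids = defaultdict(set)
    for lease in leases:
        iaids[lease["client-id"]].add(lease["iaid"])
    return {cid: sorted(vals) for cid, vals in iaids.items() if len(vals) > 1}


if __name__ == "__main__":
    for cid, vals in clients_with_changed_iaid(json.loads(LEASES_JSON)).items():
        print(f"{cid}: IAIDs {', '.join(vals)}")
```

Any client-id this reports is a candidate for the behaviour described in comment 9, where a changed IAID leads to an address from the pool instead of the reserved one.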

Comment 11 Ben Nemec 2020-06-17 16:30:57 UTC
I also tested, and in my environment it looks like master-0 never got the address it was supposed to. It got ::26 for dracut and ::28 for NM. Apparently changing the client broke our static assignments. Weirdly, both of those addresses have the same iaid (but different client-ids):

{
    "iaid": "1575119893",
    "ip-address": "fd2e:6f44:5dd8:c956::26",
    "mac-address": "00:43:53:e4:46:95",
    "client-id": "00:04:55:2a:da:16:0c:4d:cb:82:46:29:d9:ed:6b:b1:5b:1e",
    "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
    "expiry-time": 1592413742
  },

{
    "iaid": "1575119893",
    "ip-address": "fd2e:6f44:5dd8:c956::28",
    "mac-address": "00:43:53:e4:46:95",
    "hostname": "master-0",
    "client-id": "00:03:00:01:00:43:53:e4:46:95",
    "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
    "expiry-time": 1592413942
  },

Here are the journal entries where it got those leases:

journalctl | grep ip6_address
Jun 17 16:09:02 localhost NetworkManager[672]: <info>  [1592410142.0289] dhcp6 (enp2s0): option ip6_address          => 'fd2e:6f44:5dd8:c956::26'
Jun 17 16:09:41 localhost NetworkManager[1622]: <info>  [1592410181.9665] dhcp6 (enp2s0): option ip6_address          => 'fd2e:6f44:5dd8:c956::28'
Jun 17 16:12:23 localhost NetworkManager[1318]: <info>  [1592410343.2279] dhcp6 (enp2s0): option ip6_address          => 'fd2e:6f44:5dd8:c956::28'

It looks like the first two are pre-reboot, the last is post.

The lease it should have gotten looks like this:
{
    "iaid": "1407469205",
    "ip-address": "fd2e:6f44:5dd8:c956::14",
    "mac-address": "00:43:53:e4:46:95",
    "hostname": "master-0",
    "client-id": "00:03:00:01:00:43:53:e4:46:95",
    "server-duid": "00:01:00:01:26:7c:f3:71:00:21:9b:93:36:5f",
    "expiry-time": 1592413367
  },

I guess there must be a difference in how the internal client comes up with the iaid?
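
One observation from the numbers posted so far: the IAID on each "expected" lease equals the last four bytes of the interface MAC read as a big-endian integer, while the other IAID (1575119893) is identical for two different MACs. The Python sketch below checks this. It is only a consistency check against the data in this bug, not a statement of how either DHCP client computes its IAID; the function name `iaid_from_mac_tail` is hypothetical.

```python
def iaid_from_mac_tail(mac: str) -> int:
    """Interpret the last four bytes of a MAC as a big-endian 32-bit integer."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    return int.from_bytes(octets[-4:], "big")


# (MAC, IAID on the lease with the reserved address), from comments 10 and 11.
EXPECTED_LEASES = [
    ("00:16:80:11:ef:51", 2148659025),
    ("00:43:53:e4:46:95", 1407469205),
]

if __name__ == "__main__":
    for mac, iaid in EXPECTED_LEASES:
        print(mac, iaid, iaid_from_mac_tail(mac) == iaid)
```

If that pattern holds, the second IAID must come from something other than the MAC, which would fit a difference in how the two clients derive it.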

Comment 12 Colin Walters 2020-06-17 18:52:11 UTC
> Oddly, the dhcp-client package appears to be installed:
> # rpm -ql dhcp-client | grep sbin
> /usr/sbin/dhclient
> /usr/sbin/dhclient-script

Whenever posting things like this, please *also* post the output of `rpm -q` and `rpm-ostree status -b`.

I am not seeing this:

```
$ rpm -q dhclient
package dhclient is not installed
$ rpm-ostree status -b
State: idle
BootedDeployment:
* ostree://f6b9bc2a6ee0e6b4e07901480864af68577d8d6dd57425411c630e41cb88caa4
                   Version: 46.82.202006171555-0 (2020-06-17T15:59:15Z)
$
```

Comment 13 Ben Nemec 2020-06-17 19:49:13 UTC
The package isn't called dhclient, it's dhcp-client.

Here's the output from my latest test run with the installer patch to use the new image:

[root@master-0 core]# rpm -q dhcp-client
dhcp-client-4.3.6-40.el8.x86_64
[root@master-0 core]# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
● pivot://registry.svc.ci.openshift.org/ocp/4.6-2020-06-17-154742@sha256:0f4899327850d1f5a38b09a8e5d3d978e99439e83934ce86d95cb3bf33d0d504
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202006161801-0 (2020-06-16T18:05:39Z)

Worth noting that this deployment included an MCO patch that fixed DHCP, but now we've run into a different issue with OVN that is blocking progress. I'm working on a PR for the fix, but we had some other config related to dhclient that I need to figure out how to migrate.

Comment 14 Steven Hardy 2020-06-18 08:24:57 UTC
I tested with https://github.com/openshift/installer/pull/3763 and  https://github.com/openshift/machine-config-operator/pull/1840 applied

I then re-tested without the installer PR (so with the older RHCOS bootimage) and the DHCPv6 IP for the masters looks correct in both cases.

The cluster still doesn't come fully up, but that doesn't seem to be related to the DHCP issues (there are OVN issues ref https://bugzilla.redhat.com/show_bug.cgi?id=1848048)

Comment 15 Colin Walters 2020-06-22 14:28:18 UTC
https://github.com/openshift/installer/pull/3763 and https://github.com/openshift/machine-config-operator/pull/1851
landed - how are we on this issue?

Comment 16 Ben Nemec 2020-06-23 19:02:17 UTC
1851 unblocked baremetal, but according to https://github.com/openshift/machine-config-operator/pull/1865 it's now blocking openstack as well. Ultimately I think we want to get https://github.com/openshift/machine-config-operator/pull/1840 in to call this solved as that fixes it for all platforms and cleans up the dhclient cruft.

Comment 20 Victor Voronkov 2020-07-14 14:51:49 UTC
Deployment finished successfully. A master node was rebooted and received the expected IP.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-13-203610   True        False         5h35m   Cluster version is 4.6.0-0.nightly-2020-07-13-203610

Comment 22 errata-xmlrpc 2020-10-27 16:07:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

