Description of problem: In recent OCP4 versions, nodeip-configuration service is failing: [root@kni1-worker-0 ~]# systemctl status nodeip-configuration.service ● nodeip-configuration.service - Writes IP address configuration so that kubelet and crio services select a valid node IP Loaded: loaded (/etc/systemd/system/nodeip-configuration.service; enabled; vendor preset: enabled) Active: failed (Result: exit-code) since Wed 2020-03-25 13:47:46 UTC; 1h 10min ago Process: 2092 ExecStart=/usr/local/bin/nodeip-finder 10.19.138.14 (code=exited, status=1/FAILURE) Main PID: 2092 (code=exited, status=1/FAILURE) CPU: 106ms Mar 25 13:47:46 localhost.localdomain systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP... Mar 25 13:47:46 localhost.localdomain nodeip-finder[2092]: Filtering out Address(127.0.0.1/8, dev=lo) due to it having host scope Mar 25 13:47:46 localhost.localdomain nodeip-finder[2092]: Filtering out Address(::1/128, dev=lo) due to it having host scope Mar 25 13:47:46 localhost.localdomain nodeip-finder[2092]: Is 10.19.138.14 between 172.22.0.0 and 172.22.0.255 Mar 25 13:47:46 localhost.localdomain nodeip-finder[2092]: Is 10.19.138.14 between fe80:: and fe80::ffff:ffff:ffff:ffff Mar 25 13:47:46 localhost.localdomain nodeip-finder[2092]: Failed to find suitable node ip Mar 25 13:47:46 localhost.localdomain systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=1/FAILURE Mar 25 13:47:46 localhost.localdomain systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'. Mar 25 13:47:46 localhost.localdomain systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP. Mar 25 13:47:46 localhost.localdomain systemd[1]: nodeip-configuration.service: Consumed 106ms CPU time It seems the problem is the service was executed before an address in the vip subnet was set. Version-Release number of selected component (if applicable): 4.4.0-202003230106 at least (4.4 RC0) How reproducible: It seems a race condition because sometimes it works. Steps to Reproduce: 1. Deploy an IPI baremetal cluster 2. 3. Actual results: Installation fails due to timeout. Expected results: Installation finish successfully. Additional info: The service has After=network-online.target dependency but according to the docs: Units that strictly require a configured network connection should pull in network-online.target (via a Wants= type dependency) and order themselves after it. This target unit is intended to pull in a service that delays further execution until the network is sufficiently set up. What precisely this requires is left to the implementation of the network managing service. So I'm not sure if NetworkManager starts the network-online.target before having that IP and if that's the case, maybe an ExecStartPre directive to ping the gateway or something in that direction will fix it.
The issue is: * NetworkManager-wait-online will only wait for whichever networking comes up first (which is not necessarily the control plane network we need) or 30 seconds (whichever comes first). * The nodeip configuration service runs as a oneshot systemd service and if it fails it does not restart. We'll either leverage newer systemd support for retry on failure oneshot services or add the mechanism to the executables being called by the service.
The commit to systemd that adds the restart capability is: 10e72727ee - (6 months ago) Allow restart for oneshot units — Claudio Zumbo This got included to systemd releases: $ git tag --contains 10e72727ee v244 v244-rc1 v245 v245-rc1 v245-rc2 So most likely we need to implement the retry mechanism in our executables.
As this is becoming important, I raised the Customer Escalation Flag.
Updating to 4.4.0 target per https://github.com/openshift/machine-config-operator/pull/1601#issuecomment-608425178
Verified on 4.4.0-0.ci-2020-04-09-133825 we modified the systemd configuration to let the nodeip service start before network-online.target sudo vi /etc/systemd/system/nodeip-configuration.service [Unit] Description=Writes IP address configuration so that kubelet and crio services select a valid node IP # This only applies to VIP managing environments where the kubelet and crio IP # address picking logic is flawed and may end up selecting an address from a # different subnet or a deprecated address #Wants=network-online.target After=ignition-firstboot-complete.service Before=kubelet.service crio.service [Service] # Need oneshot to delay kubelet Type=oneshot ExecStart=/usr/local/bin/nodeip-finder --retry-on-failure fd2e:6f44:5dd8::5 [Install] WantedBy=multi-user.target ##### then rebooted the node and watch the log: journalctl -u nodeip-configuration.service -- Reboot -- Apr 10 09:03:54 localhost systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP... Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(127.0.0.1/8, dev=lo) due to it having host scope Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(::1/128, dev=lo) due to it having host scope Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Failed to find suitable node ip. Retrying... Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(127.0.0.1/8, dev=lo) due to it having host scope Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(::1/128, dev=lo) due to it having host scope Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between fe80:: and fe80::ffff:ffff:ffff:ffff Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between 192.168.123.0 and 192.168.123.255 Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between fe80:: and fe80::ffff:ffff:ffff:ffff Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: VIP Subnet 192.168.123.0/24 Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Processing CustomAction for target Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: parser = 140016845542176 Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: values = '192.168.123.5' Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: option_string = None Apr 10 09:03:55 master-0-0 systemd[1]: Started Writes IP address configuration so that kubelet and crio services select a valid node IP. Apr 10 09:03:55 master-0-0 systemd[1]: nodeip-configuration.service: Consumed 162ms CPU time [core@master-0-0 ~]$ sudo systemctl status crio.service Warning: The unit file, source configuration file or drop-ins of crio.service changed on disk. Run 'systemctl daemon-reload' to reload units. ● crio.service - Open Container Initiative Daemon Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/crio.service.d └─10-default-env.conf, 20-nodenet.conf, 20-stream-address.conf Active: active (running) since Fri 2020-04-10 09:04:26 UTC; 5min ago Docs: https://github.com/cri-o/cri-o Main PID: 1553 (crio) Tasks: 50 Memory: 186.1M CPU: 53.911s CGroup: /system.slice/crio.service ├─ 1553 /usr/bin/crio --stream-address=192.168.123.132 --enable-metrics=true --metrics-port=9537 └─29701 /usr/libexec/crio/conmon -c bb1020474ebec33f1322f277591a265dd80cc0e8be2173e65c9d0944b0f635e6 -n k8s_etcd_etcd-master-0-0_openshift-etc cluster state is good, all nodes ready
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581