Bug 1817594
Summary: | nodeip-configuration 'Failed to find suitable node ip' | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Eduardo Minguez <eminguez> | |
Component: | Machine Config Operator | Assignee: | Antoni Segura Puimedon <asegurap> | |
Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko> | |
Severity: | high | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.4 | CC: | asegurap, dsafford, dtrainor, jparrill, jsaucier, kboumedh, mcornea, openshift-bugs-escalate, smilner, tschaibl, vlaad, vvoronko, ykashtan | |
Target Milestone: | --- | |||
Target Release: | 4.4.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1819484 | Environment: | |
Last Closed: | 2020-05-04 11:47:28 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1819484 | |||
Bug Blocks: | 1771572 |
Description
Eduardo Minguez
2020-03-26 15:58:50 UTC
The issue is:

* NetworkManager-wait-online only waits until whichever network comes up first (which is not necessarily the control plane network we need) or until 30 seconds have passed, whichever comes first.
* The nodeip configuration service runs as a oneshot systemd service, and if it fails it does not restart.

We'll either leverage newer systemd support for retry-on-failure in oneshot services or add the mechanism to the executables called by the service.

The commit to systemd that adds the restart capability is:

    10e72727ee - (6 months ago) Allow restart for oneshot units - Claudio Zumbo

It was included in the following systemd releases:

    $ git tag --contains 10e72727ee
    v244
    v244-rc1
    v245
    v245-rc1
    v245-rc2

So most likely we need to implement the retry mechanism in our executables.

As this is becoming important, I raised the Customer Escalation Flag.

Updating to 4.4.0 target per https://github.com/openshift/machine-config-operator/pull/1601#issuecomment-608425178

Verified on 4.4.0-0.ci-2020-04-09-133825.

We modified the systemd configuration to let the nodeip service start before network-online.target:

    sudo vi /etc/systemd/system/nodeip-configuration.service

    [Unit]
    Description=Writes IP address configuration so that kubelet and crio services select a valid node IP
    # This only applies to VIP managing environments where the kubelet and crio IP
    # address picking logic is flawed and may end up selecting an address from a
    # different subnet or a deprecated address
    #Wants=network-online.target
    After=ignition-firstboot-complete.service
    Before=kubelet.service crio.service

    [Service]
    # Need oneshot to delay kubelet
    Type=oneshot
    ExecStart=/usr/local/bin/nodeip-finder --retry-on-failure fd2e:6f44:5dd8::5

    [Install]
    WantedBy=multi-user.target

#####

Then we rebooted the node and watched the log:

    journalctl -u nodeip-configuration.service

    -- Reboot --
    Apr 10 09:03:54 localhost systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(127.0.0.1/8, dev=lo) due to it having host scope
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(::1/128, dev=lo) due to it having host scope
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Failed to find suitable node ip. Retrying...
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(127.0.0.1/8, dev=lo) due to it having host scope
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Filtering out Address(::1/128, dev=lo) due to it having host scope
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between fe80:: and fe80::ffff:ffff:ffff:ffff
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between 192.168.123.0 and 192.168.123.255
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Is 192.168.123.5 between fe80:: and fe80::ffff:ffff:ffff:ffff
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: VIP Subnet 192.168.123.0/24
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: Processing CustomAction for target
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: parser = 140016845542176
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: values = '192.168.123.5'
    Apr 10 09:03:55 master-0-0 nodeip-finder[1362]: option_string = None
    Apr 10 09:03:55 master-0-0 systemd[1]: Started Writes IP address configuration so that kubelet and crio services select a valid node IP.
    Apr 10 09:03:55 master-0-0 systemd[1]: nodeip-configuration.service: Consumed 162ms CPU time

    [core@master-0-0 ~]$ sudo systemctl status crio.service
    Warning: The unit file, source configuration file or drop-ins of crio.service changed on disk. Run 'systemctl daemon-reload' to reload units.
    ● crio.service - Open Container Initiative Daemon
       Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
      Drop-In: /etc/systemd/system/crio.service.d
               └─10-default-env.conf, 20-nodenet.conf, 20-stream-address.conf
       Active: active (running) since Fri 2020-04-10 09:04:26 UTC; 5min ago
         Docs: https://github.com/cri-o/cri-o
     Main PID: 1553 (crio)
        Tasks: 50
       Memory: 186.1M
          CPU: 53.911s
       CGroup: /system.slice/crio.service
               ├─ 1553 /usr/bin/crio --stream-address=192.168.123.132 --enable-metrics=true --metrics-port=9537
               └─29701 /usr/libexec/crio/conmon -c bb1020474ebec33f1322f277591a265dd80cc0e8be2173e65c9d0944b0f635e6 -n k8s_etcd_etcd-master-0-0_openshift-etc

Cluster state is good, all nodes are ready.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
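
For readers following the thread, the "retry inside the executable" approach discussed above (instead of relying on oneshot restart support in systemd v244 and later) could look roughly like the sketch below. This is not the actual nodeip-finder code, which is not shown in this report. The function names `find_node_ip` and `find_node_ip_with_retry`, the `list_addresses` callback, and the `attempts`/`delay` parameters are all hypothetical; the sketch only assumes that a suitable node IP is a non-loopback, non-link-local address whose subnet contains the VIP, as the log above suggests.

```python
import ipaddress
import time
from typing import Callable, Iterable, Optional


def find_node_ip(vip: str, addresses: Iterable[str]) -> Optional[str]:
    """Return the first usable address whose subnet contains the VIP, else None.

    `addresses` are CIDR strings such as "192.168.123.132/24" (hypothetical input
    format; a real tool would read them from the kernel).
    """
    vip_addr = ipaddress.ip_address(vip)
    for cidr in addresses:
        iface = ipaddress.ip_interface(cidr)
        # Skip loopback (host scope) and link-local addresses.
        if iface.ip.is_loopback or iface.ip.is_link_local:
            continue
        if iface.version == vip_addr.version and vip_addr in iface.network:
            return str(iface.ip)
    return None


def find_node_ip_with_retry(
    vip: str,
    list_addresses: Callable[[], Iterable[str]],
    attempts: Optional[int] = None,  # None means retry forever (assumption)
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Re-read the address list until a suitable node IP appears or attempts run out."""
    tried = 0
    while attempts is None or tried < attempts:
        node_ip = find_node_ip(vip, list_addresses())
        if node_ip is not None:
            return node_ip
        tried += 1
        print("Failed to find suitable node ip. Retrying...")
        sleep(delay)
    return None
```

Because the retry loop lives in the process itself, the unit can stay `Type=oneshot` and still delay kubelet and crio until the control plane network has an address, regardless of the systemd version on the node.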