Bug 1922812
Summary: | Missing daemon reload between nodeip-configuration.service and kubelet.service | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andreas Karis <akaris>
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | danw
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-15 14:35:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Andreas Karis
2021-01-31 17:25:03 UTC
I can reproduce this easily. I set the wrong IP manually in the file, then run `systemctl daemon-reload` and restart kubelet. The IPv6 address on the interface is fc00::fc1c:1e22:b052:ef48, and the setting should be corrected by the nodeip-configuration service right before kubelet starts:

~~~
[root@openshift-master-0 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.200" "KUBELET_NODE_IPS=192.168.123.200,fc00::fc1c:1e22:b052:ef49"
[root@openshift-master-0 ~]# systemctl daemon-reload
[root@openshift-master-0 ~]# systemctl restart kubelet
[root@openshift-master-0 ~]# ps aux | grep kubelet | grep node-ip
root 79896 20.0 0.8 2014484 142008 ? Ssl 17:08 0:02 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::fc1c:1e22:b052:ef49 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
~~~

Before kubelet started, nodeip-configuration.service ran and set the correct address in the config file:

~~~
[root@openshift-master-0 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.200" "KUBELET_NODE_IPS=192.168.123.200,fc00::fc1c:1e22:b052:ef48"
~~~

But as you can see above, kubelet did not pick it up.
Indeed, I can restart kubelet as often as I want: the service configuration file changed on disk, but `systemctl daemon-reload` was never run, so kubelet will never come up with the correct IP:

~~~
[root@openshift-master-0 ~]# systemctl restart kubelet
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
[root@openshift-master-0 ~]# ps aux | grep kubelet | grep node-ip
root 82501 14.2 0.9 2156380 148156 ? Ssl 17:10 0:05 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::fc1c:1e22:b052:ef49 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
~~~

And here is how to fix this manually:

~~~
[root@openshift-master-0 ~]# systemctl daemon-reload
[root@openshift-master-0 ~]# systemctl restart kubelet
[root@openshift-master-0 ~]# ps aux | grep kubelet | grep node-ip
root 91904 18.6 0.8 1940752 140464 ? Ssl 17:19 0:03 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::fc1c:1e22:b052:ef48 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
~~~

For reference, the involved unit files:

~~~
[root@openshift-master-0 ~]# cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service network-online.target crio.service
After=network-online.target crio.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --runtime-cgroups=/system.slice/crio.service \
      --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \
      --node-ip=${KUBELET_NODE_IPS} \
      --minimum-container-ttl-duration=6m0s \
      --cloud-provider= \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      \
      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b \
      --v=${KUBELET_LOG_LEVEL}

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
~~~

~~~
[root@openshift-master-0 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.200" "KUBELET_NODE_IPS=192.168.123.200,fc00::fc1c:1e22:b052:ef48"
[root@openshift-master-0 ~]# cat /etc/systemd/system/nodeip-configuration.service
[Unit]
Description=Writes IP address configuration so that kubelet and crio services select a valid node IP
Wants=network-online.target
After=network-online.target ignition-firstboot-complete.service
Before=kubelet.service crio.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
# Would prefer to do Restart=on-failure instead of this bash retry loop, but
# the version of systemd we have right now doesn't support it. It should be
# available in systemd v244 and higher.
ExecStart=/bin/bash -c " \
  until \
    /usr/bin/podman run --rm \
    --authfile /var/lib/kubelet/config.json \
    --net=host \
    --volume /etc/systemd/system:/etc/systemd/system:z \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b1e1542aa0934233fd1515872d2e1be4f1f1e5ce0c8d35860eb2847badd3c609 \
    node-ip \
    set --retry-on-failure; \
  do \
    sleep 5; \
  done"

[Install]
RequiredBy=kubelet.service
~~~

Here's the fix:

~~~
[root@openshift-master-0 ~]# ps aux | grep kubelet
root 1856 1.7 0.6 2014036 103448 ? Ssl 23:31 0:03 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::abff:51ff:cf1f:6d6c --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
root 2306 0.0 0.0 12792 1088 pts/0 S+ 23:34 0:00 grep --color=auto kubelet
[root@openshift-master-0 ~]# ip -6 a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fc00::cb09:6043:4da9:239f/64 scope global dynamic noprefixroute
       valid_lft 86349sec preferred_lft 14349sec
    inet6 fe80::de43:21c0:c08b:fbc7/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@openshift-master-0 ~]# systemctl restart kubelet
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
~~~
~~~
[root@openshift-master-0 ~]# vi /etc/systemd/system/nodeip-configuration.service
[root@openshift-master-0 ~]# cat !$
cat /etc/systemd/system/nodeip-configuration.service
[Unit]
Description=Writes IP address configuration so that kubelet and crio services select a valid node IP
Wants=network-online.target
After=network-online.target ignition-firstboot-complete.service
Before=kubelet.service crio.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
# Would prefer to do Restart=on-failure instead of this bash retry loop, but
# the version of systemd we have right now doesn't support it. It should be
# available in systemd v244 and higher.
ExecStart=/bin/bash -c " \
  until \
    /usr/bin/podman run --rm \
    --authfile /var/lib/kubelet/config.json \
    --net=host \
    --volume /etc/systemd/system:/etc/systemd/system:z \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b1e1542aa0934233fd1515872d2e1be4f1f1e5ce0c8d35860eb2847badd3c609 \
    node-ip \
    set --retry-on-failure; \
  do \
    sleep 5; \
  done; \
  systemctl daemon-reload"

[Install]
RequiredBy=kubelet.service
[root@openshift-master-0 ~]# # if I daemon-reload now, I also reload kubelet, so let's reset kubelet to the earlier state
[root@openshift-master-0 ~]# vi /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[root@openshift-master-0 ~]# cat !$
cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.200" "KUBELET_NODE_IPS=192.168.123.200,fc00::abff:51ff:cf1f:6d6c"
[root@openshift-master-0 ~]# systemctl daemon-reload
[root@openshift-master-0 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.200" "KUBELET_NODE_IPS=192.168.123.200,fc00::abff:51ff:cf1f:6d6c"
[root@openshift-master-0 ~]# ps aux | grep kubelet
root 2460 1.5 0.6 1939792 106736 ? Ssl 23:34 0:03 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::abff:51ff:cf1f:6d6c --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
root 2768 0.0 0.0 12792 1080 pts/0 S+ 23:38 0:00 grep --color=auto kubelet
[root@openshift-master-0 ~]# systemctl restart kubelet
[root@openshift-master-0 ~]# ps aux | grep kubelet
root 2969 2.8 0.6 1939792 100432 ? Ssl 23:38 0:00 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.123.200,fc00::cb09:6043:4da9:239f --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9519ae9a0a3e262e311c7f12a08adb2568e29e1576d2c6c229fd5d355c551d4b --v=2
root 3038 0.0 0.0 12792 1084 pts/0 S+ 23:38 0:00 grep --color=auto kubelet
[root@openshift-master-0 ~]# ip -6 a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fc00::cb09:6043:4da9:239f/64 scope global dynamic noprefixroute
       valid_lft 86354sec preferred_lft 14354sec
    inet6 fe80::de43:21c0:c08b:fbc7/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
~~~

Workaround in my lab:

~~~
# workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1922812
if $IPV6 ; then
    mkdir -p /root/fake-root-master/etc/systemd/system/nodeip-configuration.service.d/
    mkdir -p /root/fake-root-worker/etc/systemd/system/nodeip-configuration.service.d/
    cat << 'EOF' > /root/fake-root-master/etc/systemd/system/nodeip-configuration.service.d/10-execstart.conf
[Service]
ExecStart=
ExecStart=/bin/bash -c " \
  until \
    /usr/bin/podman run --rm \
    --authfile /var/lib/kubelet/config.json \
    --net=host \
    --volume /etc/systemd/system:/etc/systemd/system:z \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b1e1542aa0934233fd1515872d2e1be4f1f1e5ce0c8d35860eb2847badd3c609 \
    node-ip \
    set --retry-on-failure; \
  do \
    sleep 5; \
  done; \
  systemctl daemon-reload"
EOF
    cat << 'EOF' > /root/fake-root-worker/etc/systemd/system/nodeip-configuration.service.d/10-execstart.conf
[Service]
ExecStart=
ExecStart=/bin/bash -c " \
  until \
    /usr/bin/podman run --rm \
    --authfile /var/lib/kubelet/config.json \
    --net=host \
    --volume /etc/systemd/system:/etc/systemd/system:z \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b1e1542aa0934233fd1515872d2e1be4f1f1e5ce0c8d35860eb2847badd3c609 \
    node-ip \
    set --retry-on-failure; \
  do \
    sleep 5; \
  done; \
  systemctl daemon-reload"
EOF
fi
for type in bootstrap master worker ; do
    filetranspiler/filetranspile -f /root/fake-root-${type}/ -i openshift-install/${type}.ign > /root/openshift-install/${type}.transpiled.ign
    # cat /root/openshift-install/${type}.transpiled.ign | jq '.ignition.config.append[0].source = "https://192.168.123.10:22623/config/'${type}'"' | tee /root/openshift-install/${type}.transpiled.jq.ign
done
echo "Copying ignition config files"
for type in bootstrap master worker; do
    \cp /root/openshift-install/${type}.transpiled.ign /httpboot/openshift-${type}/${type}.ign
    chmod +r /httpboot/openshift-${type}/${type}.ign
done
~~~

Just to illustrate this further: let's say I update /etc/systemd/system/kubelet.service.d/20-nodenet.conf manually (but for whatever reason, see my earlier comments, this has not been set correctly by the initial configuration service, e.g. because of IPv6 stable-privacy). In this environment, 192.168.123.221 is the correct IP.
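The ExecStart= override above can also be written as a smaller drop-in that leaves the shipped retry loop untouched. The following is a hypothetical sketch, not the shipped fix: the drop-in file name and the use of ExecStartPost= are my additions. Because nodeip-configuration.service is Type=oneshot and ordered Before=kubelet.service, the reload would run after the node-ip container rewrites 20-nodenet.conf and before kubelet reads it:

```ini
# Hypothetical drop-in (illustrative, not the shipped fix):
# /etc/systemd/system/nodeip-configuration.service.d/20-daemon-reload.conf
#
# Re-read unit files after the node-ip container has rewritten
# /etc/systemd/system/kubelet.service.d/20-nodenet.conf, so that kubelet
# (ordered after this oneshot unit) starts with the fresh KUBELET_NODE_IPS.
[Service]
ExecStartPost=/usr/bin/systemctl daemon-reload
```

Compared with replacing the whole ExecStart= line, a drop-in survives updates to the base unit; either way, the daemon-reload is issued while the unit itself is still activating, which the workaround above already relies on.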
I set the wrong one on purpose with:

~~~
[root@openshift-worker-1 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.222" "KUBELET_NODE_IPS=192.168.123.222"
[root@openshift-worker-1 ~]# systemctl daemon-reload
[root@openshift-worker-1 ~]# systemctl restart kubelet
[root@openshift-worker-1 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.222" "KUBELET_NODE_IPS=192.168.123.222"
[root@openshift-worker-1 ~]# reboot
Connection to openshift-worker-1.example.com closed by remote host.
Connection to openshift-worker-1.example.com closed.
~~~

After reboot:

~~~
[root@openshift-jumpserver-0 ~]# ssh core.com
Red Hat Enterprise Linux CoreOS 47.83.202103051045-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make
configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
Last login: Wed Mar 17 20:29:40 2021 from 192.168.123.1
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[core@openshift-worker-1 ~]$ sudo -i
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[root@openshift-worker-1 ~]# ps aux | grep kubel
root 4526 2.2 0.0 2752388 93952 ? Ssl 21:01 0:00 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --node-ip=192.168.123.222 --address=192.168.123.222 --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7b8e2e2857d8ac3499c9eb4e449cc3296409f1da21aa21d0140134d611e65b84 --v=2
root 4760 0.0 0.0 12792 1072 pts/0 S+ 21:02 0:00 grep --color=auto kubel
[root@openshift-worker-1 ~]# cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.168.123.221" "KUBELET_NODE_IPS=192.168.123.221"
~~~

Due to the missing daemon-reload, --node-ip is still not set correctly. So we can either run `systemctl daemon-reload ; systemctl restart kubelet` ... or simply reboot the node, now.

Belatedly noticing this bug; this is fixed as of 4.7.8.

*** This bug has been marked as a duplicate of bug 1944394 ***
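For reference, on a running cluster a drop-in like the lab workaround above could be delivered through the Machine Config Operator instead of patching Ignition files at install time. This is a hedged sketch: the object name and the drop-in contents are illustrative assumptions, not the actual fix that landed in 4.7.8:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Hypothetical name; one such object would be needed per role.
  name: 99-worker-nodeip-daemon-reload
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: nodeip-configuration.service
          dropins:
            # Assumed drop-in: re-read unit files after node-ip has
            # rewritten kubelet's 20-nodenet.conf environment drop-in.
            - name: 20-daemon-reload.conf
              contents: |
                [Service]
                ExecStartPost=/usr/bin/systemctl daemon-reload
```

Applying a MachineConfig rolls a reboot through the affected pool, which would also clear the stale unit state described above.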