Description of problem:

During server reboot, especially when the node is being updated by the CVO, the server is registered as localhost.localdomain by the kubelet.

1. NetworkManager started here:

Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.0594] NetworkManager (version 1.14.0-14.el8) is starting... (for the first time)

2. And set localhost.localdomain as the pre-configuration hostname:

Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.2152] hostname: hostname: using hostnamed
Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.2154] hostname: hostname changed from (none) to "localhost.localdomain"

3. Here pivot started the deployment and the kubelet, BUT NetworkManager had not yet completed:

Oct 19 14:49:03 localhost.localdomain rpm-ostree[1668]: BootedDeployment:
Oct 19 14:49:03 localhost.localdomain rpm-ostree[1668]: ● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09d692ec1fdf35f8f1f780f4040ba60cbc61dcb7b1560dec22de3c53f3839f70

4. NetworkManager completed here, but the kubelet had already sent localhost.localdomain as its hostname, so the node was unrecognized by the control plane:

Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.7144] manager: NetworkManager state is now CONNECTED_GLOBAL
Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.7150] manager: startup complete
Oct 19 14:49:03 localhost.localdomain NetworkManager[1672]: <info> [1571496543.7158] policy: set-hostname: set hostname to 'worker3.ocp.tmrnd.com.my' (from address lookup)

Version-Release number of selected component (if applicable):

How reproducible:
Unable to reproduce manually; it happens at random.

Steps to Reproduce:
1.
2.
3.

Actual results:
The kubelet started with localhost.localdomain because NetworkManager had not yet finished configuring the network environment (DHCP).

Expected results:
Pivot and the kubelet should wait until NetworkManager has completed before attempting to start.

Additional info:
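One way to confirm the race on an affected node is to interleave the NetworkManager and kubelet journal entries for the failing boot; a minimal sketch, assuming shell access to the node:

```
# Show NetworkManager and kubelet messages from the current boot, interleaved,
# to see whether kubelet started before "startup complete" / set-hostname.
journalctl -b -u NetworkManager -u kubelet --no-pager

# Compare against the static/transient hostname hostnamed currently reports.
hostnamectl status
```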
> The kubelet systemd unit waits for rpc-statd.service to start, which depends on network-online, nss-lookup, and rpcbind.socket units being ready. The network-online and nss-lookup units will only go active after the network and hostname are ready. Perhaps something is prematurely making those units ready?

IIUC, the definition for the kubelet unit comes from:
https://github.com/openshift/machine-config-operator/blob/4ca19844e2dea0fa8c2accef542942c3225ac4c5/templates/worker/01-worker-kubelet/_base/units/kubelet.yaml#L6

There, I see:

> Wants=rpc-statd.service

This only tells systemd to try to activate that service unit as part of the transaction if the kubelet is activated. It makes no guarantees about ordering. If we want to wait until NetworkManager is done, we can add:

```
After=network-online.target
```

There are some interesting tidbits about this in https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/. In general, depending on this unit is discouraged in favour of having software that can come up earlier and deal with changing networking conditions. I guess that applies a bit less in the OCP case than it does to, e.g., a service running on a developer's laptop.

Anyway, I think doing this would fix the kubelet transiently reporting as localhost.localdomain, though I'm not entirely sure that's the actual issue here. It seems like the concrete problem is:

> E1021 11:49:28.674945 3271 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56ac1ec564bcb821ce49ffc46da9d0936751e9fa9aa3c49725fc2a33ad83a9b2
> E1021 11:49:28.674977 3271 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-4f6018a7efbce5973afff8312f9b9a70

From the description and https://bugzilla.redhat.com/show_bug.cgi?id=1763700#c2, this is not about an initial pivot on first boot; it's just trying to update an existing worker node, right? If that's the case, it seems like an issue with the MCD/pivot to me. Do you have more logs from the MCD? Try with `oc logs -p`; it should include the logs from pivot and rpm-ostreed that the MCD proxies.
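For anyone who wants to try the ordering fix on a node before a template change lands, here's a minimal sketch of an equivalent systemd drop-in (the drop-in path and file name are illustrative; on OCP nodes a change like this would normally be delivered via a MachineConfig rather than edited by hand):

```
# Illustrative drop-in; on OCP this should come from a MachineConfig.
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/99-network-online.conf <<'EOF'
[Unit]
# Wants= only pulls network-online.target into the transaction;
# After= is what actually delays kubelet until the target is reached.
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload
```

Note that network-online.target only goes active once NetworkManager-wait-online.service (or the equivalent for whatever network backend is in use) reports the network as up.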
@Jonathan +1 to adding the `After=network-online.target`. It needs to be added to the master kubelet config as well [1].

[1] https://github.com/openshift/machine-config-operator/blob/4ca19844e2dea0fa8c2accef542942c3225ac4c5/templates/master/01-master-kubelet/_base/units/kubelet.yaml
> @Jonathan +1 to adding the `After=network-online.target`. It needs to be added to the master kubelet config as well [1].

To be clear, I think the ordering issue is one part of this, though like I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1763700#c12, I'm not convinced it would actually fix the root cause of the update failure in this case. It seems like we need more information to diagnose.
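Concretely, pulling those MCD logs would look something like the following (the pod name is illustrative; pick the machine-config-daemon pod scheduled on the affected node):

```
# Find the machine-config-daemon pod running on the affected node.
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon

# -p fetches logs from the previous container instance, which should include
# the pivot and rpm-ostreed output the MCD proxies.
oc -n openshift-machine-config-operator logs -p machine-config-daemon-xxxxx
```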
Sorry, I cleared the flags you set by accident in my last comment, and I see now Prow relies on this. Resetting those.
Cloned this for:
4.2z - https://bugzilla.redhat.com/show_bug.cgi?id=1764720
4.1z - https://bugzilla.redhat.com/show_bug.cgi?id=1764719
Verified in 4.3.0-0.nightly-2019-10-29-140935

The original BZ was filed against 4.1, but there is no report of the original problem happening in 4.3. Due to the rules enforcing how fixes are delivered to various versions, we must verify the fix in the current 'master' version (4.3). In this case, we are simply verifying that the fix to the `kubelet.service` made in https://github.com/openshift/machine-config-operator/pull/1206 is present in recent 4.3 versions.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-10-29-140935   True        False         5m46s   Cluster version is 4.3.0-0.nightly-2019-10-29-140935

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-239.us-west-1.compute.internal   Ready    master   20m   v1.16.2
ip-10-0-130-63.us-west-1.compute.internal    Ready    worker   13m   v1.16.2
ip-10-0-136-165.us-west-1.compute.internal   Ready    master   20m   v1.16.2
ip-10-0-140-83.us-west-1.compute.internal    Ready    worker   13m   v1.16.2
ip-10-0-152-194.us-west-1.compute.internal   Ready    master   20m   v1.16.2
ip-10-0-158-57.us-west-1.compute.internal    Ready    worker   13m   v1.16.2

### Check kubelet on master
$ oc debug node/ip-10-0-129-239.us-west-1.compute.internal
Starting pod/ip-10-0-129-239us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.129.239
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl cat kubelet.service
# /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service network-online.target crio.service
After=network-online.target crio.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \
      --minimum-container-ttl-duration=6m0s \
      --cloud-provider=aws \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      \
      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
      --v=3

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

### Check kubelet on worker
$ oc debug node/ip-10-0-158-57.us-west-1.compute.internal
Starting pod/ip-10-0-158-57us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.158.57
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl cat kubelet.service
# /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service network-online.target crio.service
After=network-online.target crio.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=${ID} \
      --minimum-container-ttl-duration=6m0s \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      --cloud-provider=aws \
      \
      --v=3

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/kubelet.service.d/10-default-env.conf
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```
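Beyond reading the unit file, the effective ordering can also be double-checked on a node with standard systemd tooling (nothing OCP-specific here):

```
# List the units kubelet.service is ordered after; network-online.target
# should appear in the output.
systemctl list-dependencies --after kubelet.service | grep network-online

# Show the time-critical chain for kubelet on the current boot, confirming
# it waited for network-online.target.
systemd-analyze critical-chain kubelet.service
```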
I am seeing something like this happening in VMware UPI 4.1. Cluster hosts are randomly getting reassigned localhost.localdomain and rejoining the cluster under that name instead of their original hostname. Support ticket for one: https://access.redhat.com/support/cases/#/case/02501662
(In reply to Nick Curry from comment #20)
> I am seeing something like this happening in VMware UPI 4.1. Cluster hosts
> are randomly getting reassigned localhost.localdomain and rejoining the
> cluster under that name instead of their original hostname.
>
> Support ticket for one:
> https://access.redhat.com/support/cases/#/case/02501662

There is an existing BZ for 4.1.z here - https://bugzilla.redhat.com/show_bug.cgi?id=1764719
Setting the hostname in /etc/hostname, either by modifying the Ignition files or directly on the hosts after boot, seems to serve as a workaround for this issue.
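For example, directly on a host (the hostname below is illustrative); persisting it through the Ignition config amounts to writing the same /etc/hostname file at install time:

```
# NetworkManager will not override a statically configured hostname with
# one obtained from DHCP/reverse DNS, so this sticks across reboots.
hostnamectl set-hostname worker3.example.com
cat /etc/hostname
```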
Is this fixed in 4.3?
(In reply to Mark Dunnett from comment #23)
> Is this fixed in 4.3?

From comment #19, this was never reported as a problem in 4.3. However, the changes that needed to be back-ported to 4.1 will be included in the 4.3 release.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
*** Bug 1803962 has been marked as a duplicate of this bug. ***