Bug 1736254
| Summary: | Unable to add a new node to a recent UPI on vSphere installation | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | rvanderp |
| Component: | Machine Config Operator | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.1.0 | CC: | agarcial, aos-bugs, jokerman, kgarriso |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | UPI Installation on vSphere |
| Last Closed: | 2020-01-23 11:05:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
rvanderp
2019-08-01 15:24:54 UTC
Do we have logs for the kube controller manager?

The must-gather has logs from the kube controller pods, if that is what you are looking for. I was just on a remote session with the customer to analyze the node. On the node, the kubelet service was disabled. Once the service was started and enabled, the node was able to join the cluster. The customer has added numerous nodes to the cluster and has rebuilt this failing node numerous times. It is not clear to me why the kubelet service would have been disabled, but I thought it was interesting.

Hi, the kubelet service might be disabled for various reasons:

- disabled manually
- disabled by Ignition
- disabled by default in the baked image
- other reasons

The closest component responsible for node configuration is the Machine Config Operator. From https://github.com/openshift/machine-config-operator/#machine-config-operator:

"OpenShift 4 is an operator-focused platform, and the Machine Config operator extends that to the operating system itself, managing updates and configuration changes to essentially everything between the kernel and kubelet. To repeat for emphasis, this operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new MachineConfig CRD that can write configuration files onto the host."

It seems like the issue here is that the node pivoted but did not restart. The kubelet is blocked from starting by the MCD: if there is an initial pivot, the kubelet does not start until after the reboot, to prevent interruptions during bootstrapping (i.e. it delays bootstrapping until machine-os-content is at the level expected by the release). See https://github.com/openshift/machine-config-operator/blob/e40893ad54cc33ae5bd19e50e5e3289dc6c4f9a1/templates/common/_base/units/machine-config-daemon-host.service#L13-L17

I would expect to see this in the node logs, but I don't:

    machine-config-daemon[1299]: I0814 18:32:04.397554 1299 pivot.go:247] Rebooting due to /run/pivot/reboot-needed

There's been some indication that kubelet on vSphere was having problems. Passing this over to double-check, since I'm unsure of the current state (and the BZ state of 4.1). Ryan, PTAL?

The error I noted above was due to kubelet problems when the cloud provider is set; see: https://github.com/openshift/machine-config-operator/pull/998#discussion_r305934051

Looking at the MachineConfigPools, the 3 masters and 2 workers were all available from the MCO's point of view:

    masters:
      degradedMachineCount: 0
      machineCount: 3
      observedGeneration: 2
      readyMachineCount: 3
      unavailableMachineCount: 0
      updatedMachineCount: 3
    workers:
      degradedMachineCount: 0
      machineCount: 2
      observedGeneration: 2
      readyMachineCount: 2
      unavailableMachineCount: 0
      updatedMachineCount: 2

The OCP QE team helped verify this (thanks Jia Liu and Weibin Liang!) with 4.3.0-0.nightly-2020-01-08-181129.
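The on-node checks discussed above (whether a post-pivot reboot is still pending, and whether kubelet was left disabled) can be sketched roughly as follows. `check_pivot_reboot` is a hypothetical helper name, and the marker path `/run/pivot/reboot-needed` is taken from the MCD log line quoted earlier; this is a diagnostic sketch for an RHCOS host, not part of the MCO itself.

```shell
# Is a post-pivot reboot still pending? If the marker file exists, the MCD is
# expected to block kubelet from starting until the node reboots.
check_pivot_reboot() {
    if [ -e "${1:-/run/pivot/reboot-needed}" ]; then
        echo "pivot reboot pending; kubelet start is blocked until reboot"
    else
        echo "no pivot reboot pending"
    fi
}
check_pivot_reboot

# Was kubelet left disabled? On the failing node this reported "disabled".
# Guarded so the snippet is harmless to run off-cluster.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-enabled kubelet.service || true
    # systemctl enable --now kubelet.service   # the fix applied on the node
fi
```

`systemctl enable --now kubelet.service` was the fix applied on the customer's node; it is left commented out here so the sketch is safe to run anywhere.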
A single RHCOS worker node was added successfully to the cluster:
```
# oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   13h   v1.16.2
compute-1         Ready    worker   97m   v1.16.2
control-plane-0   Ready    master   13h   v1.16.2
# systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-default-env.conf
Active: active (running) since Fri 2020-01-10 03:36:03 UTC; 1h 35min ago
# oc get mcp worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT
worker rendered-worker-1622848b3b42752740e4d4d1f44b8db2 True False False 2 2 2 0
```
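The MachineConfigPool counts shown above can also be checked mechanically. A minimal sketch, assuming the tabular output of `oc get mcp worker` (a captured sample is embedded so the snippet is self-contained; on a live cluster, pipe `oc get mcp worker` in instead):

```shell
# Sample `oc get mcp worker` output, as captured during verification.
mcp_output='NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
worker   rendered-worker-1622848b3b42752740e4d4d1f44b8db2   True      False      False      2              2                   2                     0'

# A pool is healthy when MACHINECOUNT ($6) == READYMACHINECOUNT ($7)
# == UPDATEDMACHINECOUNT ($8) and DEGRADEDMACHINECOUNT ($9) is zero.
echo "$mcp_output" | awk 'NR > 1 { if ($6 == $7 && $6 == $8 && $9 == 0) print $1 " pool healthy: " $7 "/" $6 " ready"; else print $1 " pool NOT healthy" }'
# → worker pool healthy: 2/2 ready
```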
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.