Description of problem:
Unable to add a new node to a recent UPI-on-vSphere installation.

Version-Release number of selected component (if applicable):
4.1.7

How reproducible:
Consistently

Steps to Reproduce:
1. Install the cluster
2. Follow the instructions for provisioning a worker node - https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere
3. The cluster does not recognize the node

Actual results:
The node is not joined to the cluster

Expected results:
The node is joined to the cluster

Additional info:
must-gather shows events related to the node's arrival in, and removal from, the cluster:

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: 2019-07-31T16:10:52Z
  involvedObject:
    kind: Node
    name: ocp4-node2.x.x.x.x
    uid: d79affbe-b3ab-11e9-9e2b-005056a09b13
  kind: Event
  lastTimestamp: 2019-07-31T16:10:52Z
  message: 'Node ocp4-node2.x.x.x.x status is now: NodeNotReady'
  metadata:
    creationTimestamp: 2019-07-31T16:10:52Z
    name: ocp4-node2.x.x.x.x.15b689d12f68c708
    namespace: default
    resourceVersion: "677993"
    selfLink: /api/v1/namespaces/default/events/ocp4-node2.x.x.x.x.15b689d12f68c708
    uid: c49eda33-b3ad-11e9-8609-005056a0a975
  reason: NodeNotReady
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: node-controller
  type: Normal
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: 2019-07-31T16:10:52Z
  involvedObject:
    kind: Node
    name: ocp4-node2.x.x.x.x
    uid: d79affbe-b3ab-11e9-9e2b-005056a09b13
  kind: Event
  lastTimestamp: 2019-07-31T16:10:52Z
  message: 'Node ocp4-node2.x.x.x.x event: Deleting Node ocp4-node2.x.x.x.x because it''s not present according to cloud provider'
  metadata:
    creationTimestamp: 2019-07-31T16:10:52Z
    name: ocp4-node2.x.x.x.x.15b689d142b4478b
    namespace: default
    resourceVersion: "678030"
    selfLink: /api/v1/namespaces/default/events/ocp4-node2.x.x.x.x.15b689d142b4478b
    uid: c4d01784-b3ad-11e9-8609-005056a0a975
  reason: DeletingNode
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: node-controller
  type: Normal

I have reviewed the journal from node2 and there does not appear to be any evidence of the kubelet starting. The customer claims to have restarted this node a few times since igniting it.
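For anyone debugging the same symptom, here is a minimal sketch of how to pull these node-controller events and check for kubelet activity; the node name and namespace come from this report, and SSH access to the node as the `core` user is assumed:

```
# From a workstation with cluster access: list events for the affected node
$ oc get events -n default \
    --field-selector involvedObject.kind=Node,involvedObject.name=ocp4-node2.x.x.x.x

# On the node itself (e.g. over SSH): look for any kubelet activity in the journal
$ journalctl -u kubelet.service --no-pager | tail -n 50
```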
Do we have logs for the kube-controller-manager?
The must-gather has logs from the kube-controller-manager pods, if that is what you are looking for.
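For completeness, a hedged sketch of pulling those logs directly from a live cluster rather than from the must-gather; the namespace and container name follow standard OCP 4 layout, and the pod name shown is illustrative:

```
# List the kube-controller-manager pods (one per master)
$ oc get pods -n openshift-kube-controller-manager

# Tail the controller logs from one of them
$ oc logs -n openshift-kube-controller-manager \
    kube-controller-manager-control-plane-0 \
    -c kube-controller-manager --tail=200
```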
I was just on a remote session with the customer to analyze the node. On the node, the kubelet service was disabled. Once the service was started and enabled, the node was able to join the cluster. The customer has added numerous nodes to the cluster and rebuilt this failing node numerous times. It is not clear to me why the kubelet service would have been disabled, but I thought it was worth noting.
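For reference, a minimal sketch of the check and the fix that were applied on the node; the output shown is illustrative:

```
# On the node: confirm the unit state
$ systemctl is-enabled kubelet.service
disabled

# Enable and start it in one step, then verify
$ sudo systemctl enable --now kubelet.service
$ systemctl is-active kubelet.service
active
```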
Hi, the kubelet service might be disabled for various reasons:
- disabled manually
- disabled by Ignition
- disabled by default in the baked image
- other reasons

The closest component responsible for node configuration is the Machine Config Operator. From https://github.com/openshift/machine-config-operator/#machine-config-operator:

"OpenShift 4 is an operator-focused platform, and the Machine Config operator extends that to the operating system itself, managing updates and configuration changes to essentially everything between the kernel and kubelet. To repeat for emphasis, this operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new MachineConfig CRD that can write configuration files onto the host."
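One way to rule Ignition/MCO in or out as the source of the disablement is to inspect the rendered config assigned to the pool. A hedged sketch: the `rendered-worker-<hash>` name is a placeholder, and `jq` is assumed to be available:

```
# Find the rendered config currently assigned to the worker pool
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}'
rendered-worker-<hash>

# Check whether that config carries a kubelet.service unit and whether it is enabled
$ oc get machineconfig rendered-worker-<hash> -o json \
    | jq '.spec.config.systemd.units[] | select(.name == "kubelet.service") | {name, enabled}'
```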
It seems like the issue here is that the node pivoted but did not reboot. The kubelet is blocked from starting by the MCD: if there is an initial pivot, the kubelet does not start until after the reboot, to prevent interruptions during bootstrapping (i.e. it delays bootstrapping until machine-os-content is at the level expected by the release):

https://github.com/openshift/machine-config-operator/blob/e40893ad54cc33ae5bd19e50e5e3289dc6c4f9a1/templates/common/_base/units/machine-config-daemon-host.service#L13-L17

I would expect to see this in the node logs, but I don't:

machine-config-daemon[1299]: I0814 18:32:04.397554    1299 pivot.go:247] Rebooting due to /run/pivot/reboot-needed
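A hedged way to confirm this state on the node itself; the flag path comes from the log line above, and the unit name comes from the linked template:

```
# On the node: an existing flag file means the MCD still expects a reboot
$ ls -l /run/pivot/reboot-needed

# Check what the MCD host unit logged around the pivot
$ journalctl -u machine-config-daemon-host.service --no-pager | grep -i pivot
```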
There's been some indication that the kubelet on vSphere was having problems. Passing this over for a double check, since I'm unsure of the current state (and the BZ state of 4.1). Ryan, PTAL?
The error I noted above was due to kubelet problems when the cloud provider is set; see: https://github.com/openshift/machine-config-operator/pull/998#discussion_r305934051
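For anyone checking whether a node is affected, a hedged sketch of confirming the kubelet's cloud-provider setting on the host; the file paths are the usual RHCOS locations, not verified against this exact build:

```
# On the node: look for a --cloud-provider flag in the kubelet unit and its drop-ins
$ grep -r -- '--cloud-provider' /etc/systemd/system/kubelet.service \
    /etc/systemd/system/kubelet.service.d/ 2>/dev/null
```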
Looking at the MachineConfigPools, the 3 masters and 2 workers were all available from the MCO's point of view:

masters:
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 2
  readyMachineCount: 3
  unavailableMachineCount: 0
  updatedMachineCount: 3

workers:
  degradedMachineCount: 0
  machineCount: 2
  observedGeneration: 2
  readyMachineCount: 2
  unavailableMachineCount: 0
  updatedMachineCount: 2
The OCP QE team helped verify this (thanks Jia Liu and Weibin Liang!) with 4.3.0-0.nightly-2020-01-08-181129. A single RHCOS worker node was added successfully to the cluster:

```
# oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   13h   v1.16.2
compute-1         Ready    worker   97m   v1.16.2
control-plane-0   Ready    master   13h   v1.16.2

# systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-default-env.conf
   Active: active (running) since Fri 2020-01-10 03:36:03 UTC; 1h 35min ago

# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
worker   rendered-worker-1622848b3b42752740e4d4d1f44b8db2   True      False      False      2              2                   2                     0
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062