Description of problem:
Unable to add a new node to a recent UPI-on-vSphere installation.

Version-Release number of selected component (if applicable):
4.1.7

How reproducible:
Consistently

Steps to Reproduce:
1. Install the cluster
2. Follow the instructions for provisioning a worker node - https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere
3. The cluster does not recognize the node

Actual results:
The node is not joined to the cluster

Expected results:
The node is joined to the cluster

Additional info:
must-gather shows events related to the node's arrival in, and removal from, the cluster:

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: 2019-07-31T16:10:52Z
  involvedObject:
    kind: Node
    name: ocp4-node2.x.x.x.x
    uid: d79affbe-b3ab-11e9-9e2b-005056a09b13
  kind: Event
  lastTimestamp: 2019-07-31T16:10:52Z
  message: 'Node ocp4-node2.x.x.x.x status is now: NodeNotReady'
  metadata:
    creationTimestamp: 2019-07-31T16:10:52Z
    name: ocp4-node2.x.x.x.x.15b689d12f68c708
    namespace: default
    resourceVersion: "677993"
    selfLink: /api/v1/namespaces/default/events/ocp4-node2.x.x.x.x.15b689d12f68c708
    uid: c49eda33-b3ad-11e9-8609-005056a0a975
  reason: NodeNotReady
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: node-controller
  type: Normal
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: 2019-07-31T16:10:52Z
  involvedObject:
    kind: Node
    name: ocp4-node2.x.x.x.x
    uid: d79affbe-b3ab-11e9-9e2b-005056a09b13
  kind: Event
  lastTimestamp: 2019-07-31T16:10:52Z
  message: 'Node ocp4-node2.x.x.x.x event: Deleting Node ocp4-node2.x.x.x.x because it''s not present according to cloud provider'
  metadata:
    creationTimestamp: 2019-07-31T16:10:52Z
    name: ocp4-node2.x.x.x.x.15b689d142b4478b
    namespace: default
    resourceVersion: "678030"
    selfLink: /api/v1/namespaces/default/events/ocp4-node2.x.x.x.x.15b689d142b4478b
    uid: c4d01784-b3ad-11e9-8609-005056a0a975
  reason: DeletingNode
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: node-controller
  type: Normal

I have reviewed the journal from node2 and there does not appear to be any evidence of the kubelet starting. The customer claims to have restarted this node a few times since igniting it.
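For anyone debugging the same symptom, here is a minimal sketch of how to pull these node-controller events and check for kubelet activity; the node name and namespace come from this report, and SSH access to the node as the `core` user is assumed:

```
# From a workstation with cluster access: list events for the affected node
$ oc get events -n default \
    --field-selector involvedObject.kind=Node,involvedObject.name=ocp4-node2.x.x.x.x

# On the node itself (e.g. over SSH): look for any kubelet activity in the journal
$ journalctl -u kubelet.service --no-pager | tail -n 50
```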
Do we have logs for the kube-controller-manager?
The must-gather has logs from the kube-controller-manager pods, if that is what you are looking for.
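For completeness, a hedged sketch of pulling those logs directly from a live cluster rather than from the must-gather; the namespace and container name follow standard OCP 4 layout, and the pod name shown is illustrative:

```
# List the kube-controller-manager pods (one per master)
$ oc get pods -n openshift-kube-controller-manager

# Tail the controller logs from one of them
$ oc logs -n openshift-kube-controller-manager \
    kube-controller-manager-control-plane-0 \
    -c kube-controller-manager --tail=200
```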
I was just on a remote session with the customer to analyze the node. On the node, the kubelet service was disabled. Once the service was started and enabled, the node was able to join the cluster. The customer has added numerous nodes to the cluster and rebuilt this failing node numerous times. It is not clear to me why the kubelet service would have been disabled, but I thought it was worth noting.
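For reference, a minimal sketch of the check and the fix that were applied on the node; the output shown is illustrative:

```
# On the node: confirm the unit state
$ systemctl is-enabled kubelet.service
disabled

# Enable and start it in one step, then verify
$ sudo systemctl enable --now kubelet.service
$ systemctl is-active kubelet.service
active
```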
Hi, the kubelet service might be disabled for various reasons:
- disabled manually
- disabled by Ignition
- disabled by default in the baked image
- other reasons

The closest component responsible for node configuration is the Machine Config Operator. From https://github.com/openshift/machine-config-operator/#machine-config-operator:

"OpenShift 4 is an operator-focused platform, and the Machine Config operator extends that to the operating system itself, managing updates and configuration changes to essentially everything between the kernel and kubelet. To repeat for emphasis, this operator manages updates to systemd, cri-o/kubelet, kernel, NetworkManager, etc. It also offers a new MachineConfig CRD that can write configuration files onto the host."
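One way to rule Ignition/MCO in or out as the source of the disablement is to inspect the rendered config assigned to the pool. A hedged sketch: the `rendered-worker-<hash>` name is a placeholder, and `jq` is assumed to be available:

```
# Find the rendered config currently assigned to the worker pool
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}'
rendered-worker-<hash>

# Check whether that config carries a kubelet.service unit and whether it is enabled
$ oc get machineconfig rendered-worker-<hash> -o json \
    | jq '.spec.config.systemd.units[] | select(.name == "kubelet.service") | {name, enabled}'
```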
It seems like the issue here is that the node pivoted but did not reboot. The kubelet is blocked from starting by the MCD: if there is an initial pivot, the kubelet does not start until after the reboot, to prevent interruptions during bootstrapping (i.e. it delays bootstrapping until machine-os-content is at the level expected by the release):

https://github.com/openshift/machine-config-operator/blob/e40893ad54cc33ae5bd19e50e5e3289dc6c4f9a1/templates/common/_base/units/machine-config-daemon-host.service#L13-L17

I would expect to see this in the node logs, but I don't:

machine-config-daemon[1299]: I0814 18:32:04.397554    1299 pivot.go:247] Rebooting due to /run/pivot/reboot-needed
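A hedged way to confirm this state on the node itself; the flag path comes from the log line above, and the unit name comes from the linked template:

```
# On the node: an existing flag file means the MCD still expects a reboot
$ ls -l /run/pivot/reboot-needed

# Check what the MCD host unit logged around the pivot
$ journalctl -u machine-config-daemon-host.service --no-pager | grep -i pivot
```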
There's been some indication that the kubelet on vSphere was having problems. Passing this over for a double check, since I'm unsure of the current state (and the BZ state of 4.1). Ryan, PTAL?
The error I noted above was due to kubelet problems when the cloud provider is set; see: https://github.com/openshift/machine-config-operator/pull/998#discussion_r305934051
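For anyone checking whether a node is affected, a hedged sketch of confirming the kubelet's cloud-provider setting on the host; the file paths are the usual RHCOS locations, not verified against this exact build:

```
# On the node: look for a --cloud-provider flag in the kubelet unit and its drop-ins
$ grep -r -- '--cloud-provider' /etc/systemd/system/kubelet.service \
    /etc/systemd/system/kubelet.service.d/ 2>/dev/null
```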
Looking at the MachineConfigPools, the 3 masters and 2 workers were all available from the MCO's point of view:

masters:
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 2
  readyMachineCount: 3
  unavailableMachineCount: 0
  updatedMachineCount: 3

workers:
  degradedMachineCount: 0
  machineCount: 2
  observedGeneration: 2
  readyMachineCount: 2
  unavailableMachineCount: 0
  updatedMachineCount: 2
The OCP QE team helped verify this (thanks Jia Liu and Weibin Liang!) with 4.3.0-0.nightly-2020-01-08-181129. A single RHCOS worker node was added successfully to the cluster:

```
# oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   13h   v1.16.2
compute-1         Ready    worker   97m   v1.16.2
control-plane-0   Ready    master   13h   v1.16.2

# systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-default-env.conf
   Active: active (running) since Fri 2020-01-10 03:36:03 UTC; 1h 35min ago

# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
worker   rendered-worker-1622848b3b42752740e4d4d1f44b8db2   True      False      False      2              2                   2                     0
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062