1847926 – worker nodes not created during installation

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Michał Dulko
QA Contact:	David Sanz
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1847181

Reported:	2020-06-17 11:09 UTC by rlobillo
Modified:	2020-10-06 17:13 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-06-19 14:55:08 UTC
Target Upstream Version:
Embargoed:

Attachments
must-gather (17.70 MB, application/gzip) 2020-06-17 11:09 UTC, rlobillo

Description rlobillo 2020-06-17 11:09:47 UTC

Created attachment 1697800
must-gather

Description of problem:

OCP installation on OSP fails because the worker nodes are not created. The issue is observed when installing OCP with either Kuryr or OpenShift SDN.

$ oc get nodes
NAME                    STATUS   ROLES    AGE   VERSION
ostest-ntkvp-master-0   Ready    master   15h   v1.18.3+91d0edd
ostest-ntkvp-master-1   Ready    master   15h   v1.18.3+91d0edd
ostest-ntkvp-master-2   Ready    master   15h   v1.18.3+91d0edd

The worker Machine resources remain stuck in the Provisioned phase:

[stack@undercloud-0 ~]$ oc get machines -A
NAMESPACE               NAME                        PHASE         TYPE        REGION   ZONE   AGE
openshift-machine-api   ostest-ntkvp-master-0       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-master-1       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-master-2       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-worker-dkq6p   Provisioned   m4.xlarge            nova   14h
openshift-machine-api   ostest-ntkvp-worker-ft57g   Provisioned   m4.xlarge            nova   14h
openshift-machine-api   ostest-ntkvp-worker-zjzz9   Provisioned   m4.xlarge            nova   14h
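
To spot the stuck workers quickly, the listing above can be filtered by its PHASE column. The following is a minimal sketch and not part of the original report: the function name `non_running_machines` is hypothetical, and it assumes the column layout shown above, with NAMESPACE, NAME and PHASE as the first three columns and a non-empty PHASE on every row.

```shell
# Hypothetical helper: print "NAME PHASE" for every Machine that is not Running.
# Reads the tabular output of `oc get machines -A` on stdin.
# Assumption: NAMESPACE, NAME and PHASE are the first three
# whitespace-separated columns and PHASE is never blank.
non_running_machines() {
    # NR > 1 skips the header row; $3 is the PHASE column.
    awk 'NR > 1 && $3 != "Running" { print $2, $3 }'
}

# Usage against a live cluster:
#   oc get machines -A | non_running_machines
```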

The worker VMs are deployed successfully from the OpenStack perspective. After connecting to one of the VMs, nodeip-configuration.service is found to have failed:

(overcloud) [stack@undercloud-0 ~]$ ssh -J core@10.46.22.101 core@10.196.1.11
Warning: Permanently added '10.46.22.101' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.11' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 45.81.202006160103-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 2
  nodeip-configuration.service
  user
[core@ostest-ntkvp-worker-dkq6p ~]$ journalctl ^C
[core@ostest-ntkvp-worker-dkq6p ~]$ systemctl status nodeip-configuration.service
● nodeip-configuration.service - Writes IP address configuration so that kubelet and crio services select a valid node IP
   Loaded: loaded (/etc/systemd/system/nodeip-configuration.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-06-16 15:43:55 UTC; 15h ago
  Process: 1334 ExecStart=/usr/bin/podman run --rm --authfile /var/lib/kubelet/config.json --volume /etc/systemd/system:/etc/systemd/system:z --net=host quay.io/openshift-release-dev/ocp-v4>
 Main PID: 1334 (code=exited, status=125)
      CPU: 655ms

Jun 16 15:43:52 ostest-ntkvp-worker-dkq6p systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Jun 16 15:43:53 ostest-ntkvp-worker-dkq6p podman[1334]: 2020-06-16 15:43:53.69426512 +0000 UTC m=+0.677361642 system refresh
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a3d3f68e5e97d6d34f508300bb6831f1558823cbb095f6dfac1738efa4e337b6>
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Failed
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Error: unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a3d3f68e5e97d6d34f508300bb6831f1558823cbb095f6dfac1738efa>
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Consumed 655ms CPU time
[core@ostest-ntkvp-worker-dkq6p ~]$ 


Version-Release number of the following components:
OCP4.5 (4.5.0-0.nightly-2020-06-17-001505)

How reproducible:

Steps to Reproduce:
1. Install either OSP13 (2020-06-09.2) with OVS or OSP16 (RHOS_TRUNK-16.0-RHEL-8-20200513.n.1) with OVN on a hybrid setup.
2. Deploy the latest OCP 4.5 nightly with Kuryr on the OSP setup.
3. Wait until installation fails.

Actual results:
must-gather included.

Expected results:
Successful installation.

Additional info:

Comment 1 Michał Dulko 2020-06-17 11:15:12 UTC

It's also worth noting that the root cause of the node not appearing in `oc get nodes` is that NetworkManager-resolv-prepender fails too, for the same reason (podman run is used there), so the node doesn't have cluster DNS configured. Interestingly, restarting NetworkManager and nodeip-configuration gets the node up. It's probably a matter of the /var/lib/kubelet/config.json file, which for some reason is not present or not accessible early on the node.
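
The hypothesis and workaround in this comment could be checked on an affected worker roughly as follows. This is only a sketch: the function name `authfile_state` is hypothetical, and the path is the one passed to `--authfile` in the unit shown in the description.

```shell
# Hypothetical helper: report whether a pull-secret file is usable.
# Prints one of: missing, unreadable, empty, ok.
authfile_state() {
    f=$1
    if [ ! -e "$f" ]; then
        echo missing
    elif [ ! -r "$f" ]; then
        echo unreadable
    elif [ ! -s "$f" ]; then
        echo empty
    else
        echo ok
    fi
}

# On the worker (as the user that runs the unit, i.e. root):
#   authfile_state /var/lib/kubelet/config.json
# Workaround described in this comment: restart both services.
#   sudo systemctl restart NetworkManager nodeip-configuration.service
```

A result of "ok" long after boot does not rule out the hypothesis, since the comment is about the file being unavailable early in boot.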

Comment 2 Michał Dulko 2020-06-19 14:55:08 UTC

This was caused by a podman downgrade in 4.5's coreos version 45.81.202006110030-0 and was fixed in 45.81.202006182229-0. Pretty unexpected, but well…
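
To tell whether a given node image falls between those two builds, the timestamp embedded in the version string can be compared. The sketch below is illustrative only: the function name `rhcos_build_affected` is hypothetical, and it assumes that every 45.81 build from 45.81.202006110030-0 up to, but not including, 45.81.202006182229-0 carries the downgraded podman, which this comment implies but does not state for each build.

```shell
# Hypothetical helper: exit 0 if an RHCOS 45.81 build lies in the range above.
# Assumption: the third dot-separated field is a YYYYMMDDHHMM build stamp.
rhcos_build_affected() {
    case $1 in
        45.81.*) ;;
        *) return 1 ;;
    esac
    stamp=${1#45.81.}    # drop the "45.81." prefix
    stamp=${stamp%%-*}   # drop the "-0" suffix, leaving the build stamp
    [ "$stamp" -ge 202006110030 ] && [ "$stamp" -lt 202006182229 ]
}

# Usage: the version appears in the SSH login banner.
#   rhcos_build_affected 45.81.202006160103-0 && echo "affected build"
```

The worker in the description runs 45.81.202006160103-0, which falls inside this range.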
