Created attachment 1697800
must-gather

Description of problem:

OCP installation on OSP is failing because worker nodes are not created. The issue is observed during OCP installation with both Kuryr and openshift-SDN.

$ oc get nodes
NAME                    STATUS   ROLES    AGE   VERSION
ostest-ntkvp-master-0   Ready    master   15h   v1.18.3+91d0edd
ostest-ntkvp-master-1   Ready    master   15h   v1.18.3+91d0edd
ostest-ntkvp-master-2   Ready    master   15h   v1.18.3+91d0edd

The worker machine resources remain stuck in the Provisioned phase:

[stack@undercloud-0 ~]$ oc get machines -A
NAMESPACE               NAME                        PHASE         TYPE        REGION   ZONE   AGE
openshift-machine-api   ostest-ntkvp-master-0       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-master-1       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-master-2       Running       m4.xlarge            nova   15h
openshift-machine-api   ostest-ntkvp-worker-dkq6p   Provisioned   m4.xlarge            nova   14h
openshift-machine-api   ostest-ntkvp-worker-ft57g   Provisioned   m4.xlarge            nova   14h
openshift-machine-api   ostest-ntkvp-worker-zjzz9   Provisioned   m4.xlarge            nova   14h

The worker VMs are deployed successfully from the OpenStack perspective. After connecting to one of the VMs, nodeip-configuration.service is observed to be failing:

(overcloud) [stack@undercloud-0 ~]$ ssh -J core.22.101 core.1.11
Warning: Permanently added '10.46.22.101' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.196.1.11' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 45.81.202006160103-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 2
  nodeip-configuration.service
  user
[core@ostest-ntkvp-worker-dkq6p ~]$ journalctl
^C
[core@ostest-ntkvp-worker-dkq6p ~]$ systemctl status nodeip-configuration.service
● nodeip-configuration.service - Writes IP address configuration so that kubelet and crio services select a valid node IP
   Loaded: loaded (/etc/systemd/system/nodeip-configuration.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-06-16 15:43:55 UTC; 15h ago
  Process: 1334 ExecStart=/usr/bin/podman run --rm --authfile /var/lib/kubelet/config.json --volume /etc/systemd/system:/etc/systemd/system:z --net=host quay.io/openshift-release-dev/ocp-v4>
 Main PID: 1334 (code=exited, status=125)
      CPU: 655ms

Jun 16 15:43:52 ostest-ntkvp-worker-dkq6p systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Jun 16 15:43:53 ostest-ntkvp-worker-dkq6p podman[1334]: 2020-06-16 15:43:53.69426512 +0000 UTC m=+0.677361642 system refresh
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a3d3f68e5e97d6d34f508300bb6831f1558823cbb095f6dfac1738efa4e337b6>
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Failed
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p podman[1334]: Error: unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a3d3f68e5e97d6d34f508300bb6831f1558823cbb095f6dfac1738efa>
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Main process exited, code=exited, status=125/n/a
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'.
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
Jun 16 15:43:55 ostest-ntkvp-worker-dkq6p systemd[1]: nodeip-configuration.service: Consumed 655ms CPU time
[core@ostest-ntkvp-worker-dkq6p ~]$

Version-Release number of the following components:
OCP4.5 (4.5.0-0.nightly-2020-06-17-001505)

How reproducible:

Steps to Reproduce:
1. Install either OSP13 (2020-06-09.2) with OVS or OSP16 (RHOS_TRUNK-16.0-RHEL-8-20200513.n.1) with OVN on a hybrid setup.
2. Deploy the latest OCP4.5 nightly version with Kuryr on the OSP setup.
3. Wait until the installation fails.

Actual results:
must-gather included.

Expected results:
Successful installation.

Additional info:
It's also worth saying that the root cause of the node not appearing in `oc get nodes` is that NetworkManager-resolv-prepender fails too, for the same reason (podman run is used there), so the pod doesn't have cluster DNS configured. Interestingly, restarting NetworkManager and nodeip-configuration gets the node up. It's probably a matter of the /var/lib/kubelet/config.json file, which for some reason is not present or not accessible early on in the node's startup.
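To test the hypothesis about the auth file, something like the sketch below could be run on the node. It only checks whether the file passed to podman via `--authfile` is present, readable, and non-empty at the moment it runs; the function name `authfile_status` and the status strings are made up for illustration, and a check run after boot cannot prove what the state was when the service first started.

```python
import os

# Path taken from the --authfile argument in the failing unit's ExecStart line.
PULL_SECRET_PATH = "/var/lib/kubelet/config.json"


def authfile_status(path: str = PULL_SECRET_PATH) -> str:
    """Return a coarse status for the podman auth file (hypothetical helper).

    One of: "missing", "unreadable", "empty", "ok".
    """
    if not os.path.exists(path):
        return "missing"
    # Note: os.access reflects the calling user's permissions, so run this
    # as the same user the service uses (root for a system unit).
    if not os.access(path, os.R_OK):
        return "unreadable"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


if __name__ == "__main__":
    print(f"{PULL_SECRET_PATH}: {authfile_status()}")
```

If this reports "ok" on a node where the service has already failed, that would point to either a timing issue or a cause unrelated to the file itself.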
This was caused by a podman downgrade that happened in 4.5's CoreOS version 45.81.202006110030-0 and was fixed in 45.81.202006182229-0. Pretty unexpected, but well…
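For anyone checking whether a given node image falls in that window, here is a minimal sketch that compares RHCOS version strings of the form seen in this bug. It assumes every build from 45.81.202006110030-0 up to, but not including, 45.81.202006182229-0 is affected, which the comment above does not state explicitly; `parse_rhcos_version` and `is_affected` are hypothetical helper names.

```python
import re
from typing import Tuple

# Versions quoted in the comment above. Treating them as a half-open
# range [FIRST_AFFECTED, FIRST_FIXED) is an assumption.
FIRST_AFFECTED = "45.81.202006110030-0"
FIRST_FIXED = "45.81.202006182229-0"

# Matches strings such as "45.81.202006160103-0":
# major.minor.<12-digit timestamp>-<revision>
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d{12})-(\d+)$")


def parse_rhcos_version(version: str) -> Tuple[int, ...]:
    """Split a version string into integers that compare in build order."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized RHCOS version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version: str) -> bool:
    """True if the version lies in the assumed affected range."""
    parsed = parse_rhcos_version(version)
    return (
        parse_rhcos_version(FIRST_AFFECTED)
        <= parsed
        < parse_rhcos_version(FIRST_FIXED)
    )


if __name__ == "__main__":
    # Version shown in the login banner of the failing worker.
    print(is_affected("45.81.202006160103-0"))
```

The worker in the original report runs 45.81.202006160103-0, which sits inside this range.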