Description of problem:
UPI installation with OCP 4.4 is failing with the following error:

time="2020-04-07T14:56:22Z" level=debug msg="Generating Metadata..."
time="2020-04-07T15:10:33Z" level=debug msg="OpenShift Installer 4.4.0-0.nightly-2020-04-04-025830"
time="2020-04-07T15:10:33Z" level=debug msg="Built from commit 39af1f7c497e7919cbab0b9243ab7a089d7cfeb7"
time="2020-04-07T15:10:33Z" level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.jnk-pr1789-b1444.qe.rh-ocs.com:6443..."
time="2020-04-07T15:10:33Z" level=info msg="API v1.17.1 up"
time="2020-04-07T15:10:33Z" level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
time="2020-04-07T15:50:34Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2020-04-07T15:50:34Z" level=info msg="openshift-install gather bootstrap --help"
time="2020-04-07T15:50:34Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

Installation log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/openshift_install_create_cluster_1586274710.log

Gather bootstrap logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/gather_bootstrap/log-bundle-20200407155035.tar.gz

Another attempt:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1442/jnk-pr1789-b1442_20200407T090544/logs/gather_bootstrap/log-bundle-20200407101657.tar.gz

Jobs where we failed:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6465/console
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6448/

Version-Release number of the following components:
OpenShift Installer 4.4.0-0.nightly-2020-04-04-025830

How reproducible:

Steps to Reproduce:
1. We used the same approach as in 4.3.
2. Whole logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/ocs-ci-logs-1586271130/by_outcome/failed/tests/ecosystem/deployment/test_deployment.py/test_deployment/logs
3. We are using the script from the openshift misc repo: openshift-misc-1586271130/v3-launch-templates/functionality-testing/aos-4_4/hosts/upi_on_aws-install.sh

Actual results:
Failing with a timeout.

Expected results:

Additional info:
Links to the logs are added above.
This is currently blocking OCS QE with UPI deployment over AWS.
Currently our CI uses 'rhcos_ami:ami-06c85f9d106577272', which I suspect could be the problem. I will try with a new AMI and update the bug.
Tried AWS UPI 4.3 with ami-0d8f77b753c0d96dd (the default chosen by upi-on-aws_install.sh), but no luck. It failed again with:
==========================================
./openshift-install wait-for bootstrap-complete --dir /mnt/shmohan/Downloads/git/ocs-ci/external/openshift-misc-1586304109/v3-launch-templates/functionality-testing/aos-4_3/hosts/install-dir
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.shmohanupi.qe.rh-ocs.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get https://api.shmohanupi.qe.rh-ocs.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 18.223.90.69:6443: connect: connection refused"
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"
=============================================
Same issue with OCP 4.4 AWS UPI as well.
============================
./openshift-install wait-for bootstrap-complete --dir /mnt/shmohan/Downloads/git/ocs-ci/external/openshift-misc-1586308020/v3-launch-templates/functionality-testing/aos-4_4/hosts/install-dir
level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.shmohanupi.qe.rh-ocs.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get https://api.shmohanupi.qe.rh-ocs.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.22.137.239:6443: connect: connection refused"
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"
+ exit 3
=============================
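When triaging installer output like the runs quoted in these comments, it can help to pull out only the error and fatal messages. The following is a minimal sketch, assuming the output keeps the key=value layout shown above (level=... msg="..."); the function names parse_line and failures are hypothetical helpers, not part of openshift-install.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (with backslash escapes) or a bare token without whitespace.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Split one installer log line into a dict of its key=value fields."""
    fields = {}
    for key, value in FIELD.findall(line):
        if value.startswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def failures(lines):
    """Return the msg text of every line logged at level error or fatal."""
    messages = []
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") in ("error", "fatal") and "msg" in fields:
            messages.append(fields["msg"])
    return messages
```

Lines that carry no key=value fields, such as the shell trace "+ exit 3", parse to an empty dict and are skipped.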
Please stop filing these bugs as urgent unless you're going to learn how to debug bootstrap failures and route them to the appropriate team.
This was blocking OCS QE from testing on top of OCP 4.4 as a dependent product. Sorry about that, but we need to get results on OCP 4.4 ASAP for OCS 4.4, so I needed to get high attention here and therefore set urgent priority/severity.

As for how to debug bootstrap failures, we would highly appreciate a session for our OCS QE Ecosystem team on how to debug the gathered logs. Can someone give us such a session? It would definitely help both us and you when filing such bugs.

The issue here was indeed caused by the old RHCOS AMI being used, so you can close this one. I would still appreciate a reply to the question above about a session on debugging OCP deployment issues for our team, and I guess more teams would appreciate it as well if it were recorded. Or does such a recording already exist?

Thanks
Step #1 is always to review the bootkube log in bootstrap/journals/bootkube.log, which shows what's failing or what it's waiting on. This may very well just be a situation where you need to wait longer.

Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: [#91] failed to create some manifests:
Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"

*** This bug has been marked as a duplicate of bug 1816178 ***
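For teams that want to automate the "step #1" check on a gathered log bundle, here is a minimal sketch. It assumes the bundle is a gzip-compressed tarball (as the log-bundle-*.tar.gz links in this bug suggest) and that it contains a member whose path ends in bootstrap/journals/bootkube.log; the name of the top-level directory inside the bundle is not assumed. The helper names read_bootkube_lines and last_failure_block are hypothetical, and the marker string is taken from the excerpt above.

```python
import sys
import tarfile

# Path suffix quoted in the comment above; the leading directory is not assumed.
BOOTKUBE_SUFFIX = "bootstrap/journals/bootkube.log"
# Text that bootkube.sh logs at the start of each failed manifest attempt.
MARKER = "failed to create some manifests"


def read_bootkube_lines(bundle_path):
    """Return the lines of bootkube.log from a gathered log-bundle tarball."""
    with tarfile.open(bundle_path, "r:gz") as bundle:
        for member in bundle.getmembers():
            if member.isfile() and member.name.endswith(BOOTKUBE_SUFFIX):
                with bundle.extractfile(member) as handle:
                    text = handle.read().decode("utf-8", errors="replace")
                return text.splitlines()
    raise FileNotFoundError(f"no {BOOTKUBE_SUFFIX} found in {bundle_path}")


def last_failure_block(lines, max_lines=20):
    """Return the last 'failed to create some manifests' line and what follows it."""
    starts = [i for i, line in enumerate(lines) if MARKER in line]
    if not starts:
        return []
    begin = starts[-1]
    return lines[begin:begin + max_lines]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py log-bundle-<timestamp>.tar.gz
    for entry in last_failure_block(read_bootkube_lines(sys.argv[1])):
        print(entry)
```

Only the last failure block is printed because bootkube retries in a loop (the [#91] counter above), so earlier attempts are usually repeats of the same errors.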