Bug 1821912 - OCP UPI installation Failing with Failed to wait for bootstrapping to complete: timed out waiting for the condition
Keywords:
Status: CLOSED DUPLICATE of bug 1816178
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-07 19:53 UTC by Petr Balogh
Modified: 2020-04-08 13:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-08 13:39:34 UTC
Target Upstream Version:
Embargoed:


Description Petr Balogh 2020-04-07 19:53:42 UTC
Description of problem:
UPI installation with OCP 4.4 is failing with the following error:

time="2020-04-07T14:56:22Z" level=debug msg="Generating Metadata..."
time="2020-04-07T15:10:33Z" level=debug msg="OpenShift Installer 4.4.0-0.nightly-2020-04-04-025830"
time="2020-04-07T15:10:33Z" level=debug msg="Built from commit 39af1f7c497e7919cbab0b9243ab7a089d7cfeb7"
time="2020-04-07T15:10:33Z" level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.jnk-pr1789-b1444.qe.rh-ocs.com:6443..."
time="2020-04-07T15:10:33Z" level=info msg="API v1.17.1 up"
time="2020-04-07T15:10:33Z" level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
time="2020-04-07T15:50:34Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2020-04-07T15:50:34Z" level=info msg="openshift-install gather bootstrap --help"
time="2020-04-07T15:50:34Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"
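
For reference, a minimal sketch (not part of the installer) of how an excerpt like the one above can be checked mechanically: it parses the time/level/msg fields and measures how long the bootstrap wait ran before the fatal error. The names `parse_line` and `bootstrap_wait_seconds` are hypothetical, and the regular expression assumes only the line format visible in this excerpt.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt: time="..." level=... msg="..."
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%Y-%m-%dT%H:%M:%SZ")
    return ts, m.group("level"), m.group("msg")


def bootstrap_wait_seconds(lines):
    """Seconds between the 'Waiting up to ... bootstrapping' line and the first fatal line."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, level, msg = parsed
        if level == "info" and "Waiting up to" in msg and "bootstrapping to complete" in msg:
            start = ts
        elif level == "fatal" and start is not None:
            return (ts - start).total_seconds()
    return None


SAMPLE = [
    'time="2020-04-07T15:10:33Z" level=info msg="Waiting up to 40m0s for bootstrapping to complete..."',
    'time="2020-04-07T15:50:34Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"',
]

print(bootstrap_wait_seconds(SAMPLE))  # 2401.0, i.e. the full 40m0s wait elapsed
```

A result close to the advertised wait (40m0s here) indicates that the installer simply ran out its timer, so the cause has to be found in the gathered bootstrap logs rather than in the installer output.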

Installation log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/openshift_install_create_cluster_1586274710.log

Gather bootstrap logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/gather_bootstrap/log-bundle-20200407155035.tar.gz

Another attempt:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1442/jnk-pr1789-b1442_20200407T090544/logs/gather_bootstrap/log-bundle-20200407101657.tar.gz

Job where we failed:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6465/console

and

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6448/

Version-Release number of the following components:
OpenShift Installer 4.4.0-0.nightly-2020-04-04-025830

How reproducible:

Steps to Reproduce:
1. We used the same approach as in 4.3
2. Whole logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr1789-b1444/jnk-pr1789-b1444_20200407T143757/logs/ocs-ci-logs-1586271130/by_outcome/failed/tests/ecosystem/deployment/test_deployment.py/test_deployment/logs
3. We are using the script from the openshift-misc repo: openshift-misc-1586271130/v3-launch-templates/functionality-testing/aos-4_4/hosts/upi_on_aws-install.sh

Actual results:
Failing with timeout

Expected results:

Additional info:
Added links to logs above

This is currently blocking OCS QE with UPI deployment on AWS.

Comment 1 shylesh 2020-04-07 21:34:23 UTC
Currently our CI uses 'rhcos_ami:ami-06c85f9d106577272', which I suspect could be the problem. I will try with a new AMI and update the bug.

Comment 2 shylesh 2020-04-08 00:56:59 UTC
Tried AWS UPI 4.3 with ami-0d8f77b753c0d96dd (the default chosen by upi-on-aws_install.sh), but no luck.

It failed again with:

==========================================
./openshift-install wait-for bootstrap-complete --dir /mnt/shmohan/Downloads/git/ocs-ci/external/openshift-misc-1586304109/v3-launch-templates/functionality-testing/aos-4_3/hosts/install-dir
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.shmohanupi.qe.rh-ocs.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get https://api.shmohanupi.qe.rh-ocs.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 18.223.90.69:6443: connect: connection refused"
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

=============================================

Comment 3 shylesh 2020-04-08 01:51:54 UTC
Same issue with OCP 4.4 AWS UPI as well.

============================
./openshift-install wait-for bootstrap-complete --dir /mnt/shmohan/Downloads/git/ocs-ci/external/openshift-misc-1586308020/v3-launch-templates/functionality-testing/aos-4_4/hosts/install-dir
level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.shmohanupi.qe.rh-ocs.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get https://api.shmohanupi.qe.rh-ocs.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.22.137.239:6443: connect: connection refused"
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"
+ exit 3
=============================

Comment 4 Scott Dodson 2020-04-08 13:21:06 UTC
Please stop filing these bugs as urgent unless you're going to learn how to debug bootstrap failures and route them to the appropriate team.

Comment 5 Petr Balogh 2020-04-08 13:31:36 UTC
This was blocking OCS QE from testing on top of OCP 4.4 as a dependent product.

Sorry about that, but we need to get results on 4.4 ASAP for OCS 4.4, so I needed to get attention here and therefore set urgent priority/severity.

As for how to debug bootstrap failures, we would highly appreciate a session for our OCS QE Ecosystem team on how to debug those gathered logs.

Can someone give us such a session? It would definitely help both us and you when filing such bugs.


The issue here really was caused by the old RHCOS AMI being used, so you can close this one. I would still appreciate a reply to the question above about a session on how to debug OCP deployment issues for our team, and I guess more teams would appreciate it as well if it were recorded. Or is there already such a recording?

Thanks

Comment 6 Scott Dodson 2020-04-08 13:39:34 UTC
Step #1 is always to review the bootkube log in bootstrap/journals/bootkube.log, which shows what's failing or what it's waiting on. This may very well just be a situation where you need to wait longer.

Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: [#91] failed to create some manifests:
Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
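
As a rough illustration of that first step, here is a minimal sketch that scans bootkube.log lines for "no matches for kind" errors and groups the blocked manifests by the missing kind. The function name `missing_kinds` is hypothetical, and the regular expression assumes only the message shape shown in the excerpt above; other bootkube failures would need their own patterns.

```python
import re
from collections import defaultdict

# Assumed message shape, copied from the bootkube.log excerpt above.
MISSING_KIND_RE = re.compile(
    r'"(?P<manifest>[^"]+)": unable to get REST mapping for "[^"]+": '
    r'no matches for kind "(?P<kind>[^"]+)" in version "(?P<version>[^"]+)"'
)


def missing_kinds(lines):
    """Map (kind, version) to the sorted list of manifests waiting on it."""
    blocked = defaultdict(set)
    for line in lines:
        m = MISSING_KIND_RE.search(line)
        if m:
            key = (m.group("kind"), m.group("version"))
            blocked[key].add(m.group("manifest"))
    return {key: sorted(names) for key, names in blocked.items()}


SAMPLE = [
    'Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: [#91] failed to create some manifests:',
    'Apr 07 15:50:36 ip-10-0-10-45 bootkube.sh[14122]: "99_openshift-machineconfig_99-master-ssh.yaml": '
    'unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": '
    'no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"',
]

for (kind, version), manifests in missing_kinds(SAMPLE).items():
    print(f"{kind} ({version}) is not registered yet; blocked manifests: {manifests}")
```

To run it against a gathered bundle, pass the lines of bootstrap/journals/bootkube.log from the extracted log-bundle archive; the name of the top-level directory inside the archive is not shown in this report, so the full path is left to the reader.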

*** This bug has been marked as a duplicate of bug 1816178 ***

