Bug 2074497 - bootstrap stage timing out on proxy installation with OVNKubernetes and Kuryr
Summary: bootstrap stage timing out on proxy installation with OVNKubernetes and Kuryr
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ---
: ---
Assignee: ShiftStack Bugwatcher
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-04-12 10:57 UTC by rlobillo
Modified: 2023-03-09 01:17 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:17:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description rlobillo 2022-04-12 10:57:44 UTC
Version:

$ openshift-install version
openshift-install 4.11.0-0.nightly-2022-04-12-000004
built from commit 496c1de207b3fd8ef838e832388df20781bd2efe
release image registry.ci.openshift.org/ocp/release@sha256:4c4ae72860c578c0080e83dd2369ec85f0ffc91ab4a6cb6f1ba0e6d44111ec5d
release architecture amd64

Platform: openstack

Please specify:
* IPI with Proxy+BYON with external CCM TP enabled.

What happened?

The installation fails due to bootstrap timing out:

time="2022-04-12T05:47:38-04:00" level=info msg="API v1.23.3+d36f9d8 up"
time="2022-04-12T05:47:38-04:00" level=info msg="Waiting up to 30m0s (until 6:17AM) for bootstrapping to complete..."
time="2022-04-12T06:17:38-04:00" level=debug msg="Fetching Bootstrap SSH Key Pair..."
[...]
time="2022-04-12T06:17:39-04:00" level=error msg="Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready"

The installation is progressing despite the bootstrap stage timed out and the cluster is finally correctly set. This is observed persistently only with Kuryr and OVNKubernetes network type. With OpenshiftSDN we observed one successful installation.


What did you expect to happen? 
Installation is successful. On OSP D/S CI it is successful 100% in all the releases < 4.11.


How to reproduce it (as minimally and precisely as possible)?
IPI installation from an installer-vm with two interfaces, one connected to the external network and one connected to a restricted network which is used as BYON in the install-config.yaml (attached). The installer-vm runs a proxy that enables external connectivity to the cluster while deploying.


Anything else we need to know?
must-gather and post-installation bundle linked on private comment.

Comment 2 rlobillo 2022-06-15 10:41:27 UTC
Latest CI runs show better results:

type	number	timestamp	status	Failing stage	Auditor	Category	Rerun	networkType	osp_puddle	ocp_puddle
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-kuryr-ipi_proxy	16	2022/06/04 1:50:20	SUCCESS	N/A	N/A	N/A	N/A	kuryr	RHOS-16.2-RHEL-8-20220311.n.1	4.11.0-0.nightly-2022-06-01-200905
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-ovnkubernetes-ipi_proxy	14	2022/06/03 16:41:44	SUCCESS	N/A	N/A	N/A	N/A	ovnkubernetes	RHOS-16.2-RHEL-8-20220311.n.1	4.11.0-0.nightly-2022-06-01-200905
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-kuryr-ipi_proxy	14	2022/05/27 22:09:45	SUCCESS	N/A	N/A	N/A	N/A	kuryr	RHOS-16.1-RHEL-8-20220329.n.1	4.11.0-0.nightly-2022-05-25-193227
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-ovnkubernetes-ipi_proxy	11	2022/05/20 6:04:05	SUCCESS	N/A	N/A	N/A	N/A	ovnkubernetes	RHOS-16.2-RHEL-8-20220311.n.1	4.11.0-0.nightly-2022-05-18-171831
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-kuryr-ipi_proxy	12	2022/05/15 16:27:04	SUCCESS	N/A	N/A	N/A	N/A	kuryr	RHOS-16.1-RHEL-8-20220329.n.1	4.11.0-0.nightly-2022-05-11-054135
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-ovnkubernetes-ipi_proxy	10	2022/05/13 11:06:15	SUCCESS	N/A	N/A	N/A	N/A	ovnkubernetes	RHOS-16.1-RHEL-8-20220329.n.1	4.11.0-0.nightly-2022-05-11-054135
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-kuryr-ipi_proxy	10	2022/05/07 1:23:38	SUCCESS	N/A	N/A	N/A	N/A	kuryr	RHOS-16.2-RHEL-8-20220311.n.1	4.11.0-0.nightly-2022-05-06-180112
DFG-osasinfra-shiftstack_ci-ocp_install-4.11-ovnkubernetes-ipi_proxy	9	2022/05/06 13:32:11	SUCCESS	N/A	N/A	N/A	N/A	ovnkubernetes	RHOS-16.2-RHEL-8-20220311.n.1	4.11.0-0.nightly-2022-05-06-060226

Comment 3 rlobillo 2022-07-19 16:03:40 UTC
Reopening as it is observed more frequently on latest nightlies of OCP4.11. In this case it is not setting the TP flag so it is using internal CCM, which is default in OCP4.11.

Version:

$ openshift-install version
openshift-install 4.11.0-0.nightly-2022-07-19-043523
built from commit 37684309bcb598757c99d3ea9fbc0758343d64a5
release image registry.ci.openshift.org/ocp/release@sha256:17817866b4f106771ff3098769b866e541082de32d6669a6d20e65c9be5902b6
release architecture amd64

Platform: openstack
NetworkType: OVNKubernetes and Kuryr

Please specify:
* IPI with Proxy+BYON with external CCM TP enabled.

What happened?

The installation is timing out on bootstrap stage, but the installation is progressing behind scenes and at the end the cluster is healthy, with the bootstrap server still there.

[cloud-user@installer-host ~]$ tail -f ostest/.openshift_install.log
INFO API v1.24.0+9546431 up                                                                                                                                                                   
DEBUG Loading Install Config...                                                                                                                                                               
DEBUG   Loading SSH Key...                                                                                                                                                                    
DEBUG   Loading Base Domain...                                                                                                                                                                
DEBUG     Loading Platform...                                                                                                                                                                 
DEBUG   Loading Cluster Name...                                                                                                                                                               
DEBUG     Loading Base Domain...                                                                                                                                                              
DEBUG     Loading Platform...                                                                                                                                                                 
DEBUG   Loading Networking...                                                                                                                                                                 
DEBUG     Loading Platform...                                                                                                                                                                 
DEBUG   Loading Pull Secret...                                                                                                                                                                
DEBUG   Loading Platform...                                                                                                                                                                   
DEBUG Using Install Config loaded from state file                                                                                                                                             
INFO Waiting up to 30m0s (until 10:42AM) for bootstrapping to complete... 
[... stuck here for 30 minutes]
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the co
ntrol plane.


On the bootstrap logs, is is seen below:

 [cloud-user@installer-host journals]$ tail -60 bootkube.log                                                                                                             
Jul 19 14:43:06 ostest-p2mk7-bootstrap bootkube.sh[12411]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Jul 19 14:43:07 ostest-p2mk7-bootstrap cluster-bootstrap[12426]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Jul 19 14:43:07 ostest-p2mk7-bootstrap bootkube.sh[12411]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Jul 19 14:43:07 ostest-p2mk7-bootstrap cluster-bootstrap[12426]: Tearing down temporary bootstrap control plane...                                                                   
Jul 19 14:43:07 ostest-p2mk7-bootstrap cluster-bootstrap[12426]: Skipped "user-ca-bundle-config.yaml" configmaps.v1./user-ca-bundle -n openshift-config as it already exists        
Jul 19 14:43:07 ostest-p2mk7-bootstrap bootkube.sh[12411]: Tearing down temporary bootstrap control plane...                                                                         
Jul 19 14:43:07 ostest-p2mk7-bootstrap bootkube.sh[12411]: Sending bootstrap-finished event.Skipped "user-ca-bundle-config.yaml" configmaps.v1./user-ca-bundle -n openshift-config as it already exists
Jul 19 14:43:07 ostest-p2mk7-bootstrap cluster-bootstrap[12426]: Sending bootstrap-finished event.                                                                                   
Jul 19 14:43:07 ostest-p2mk7-bootstrap bootkube.sh[11105]: Restoring CVO overrides                                                                                                  
Jul 19 14:43:28 ostest-p2mk7-bootstrap bootkube.sh[17211]: clusterversion.config.openshift.io/version patched                                                                        
Jul 19 14:43:28 ostest-p2mk7-bootstrap bootkube.sh[11105]: Waiting for CEO to finish...                                                                                             
Jul 19 14:43:28 ostest-p2mk7-bootstrap podman[17407]:                                                                                                                                
Jul 19 14:43:28 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:43:28.938226       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:28 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:43:28.938226       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True. 
Jul 19 14:43:36 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:43:36.053782       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:36 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:43:36.053782       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:38 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:43:38.374880       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:38 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:43:38.374880       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:40 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:43:40.955174       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:40 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:43:40.955174       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:43:47 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:43:47.844623       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:43:47 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:43:47.844623       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:45:42 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:42.809061       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:45:42 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:42.809061       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:45:44 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:44.195166       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:45:44 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:44.195166       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:53.508446       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:45:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:53.508446       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:53.798827       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:45:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:53.798827       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:53.988137       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:53.988137       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:57 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:45:57.016734       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:45:57 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:45:57.016734       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Jul 19 14:47:12 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:47:12.185879       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:47:12 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:47:12.185879       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:48:00 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:48:00.395236       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:48:00 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:48:00.395236       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:48:01 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:48:01.645413       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:48:01 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:48:01.645413       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:48:07 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:48:07.653805       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:48:07 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:48:07.653805       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:51:02 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:51:02.257249       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:51:02 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:51:02.257249       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:51:09 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:51:09.073935       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:51:09 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:51:09.073935       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:52:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:53.515760       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:52:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:53.515760       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:52:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:53.548649       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:52:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:53.548649       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:52:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:53.662238       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:52:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:53.662238       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:52:53 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:53.818046       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                       
Jul 19 14:52:53 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:53.818046       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.                                                                                                                                        
Jul 19 14:52:56 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:56.030248       1 waitforceo.go:64] Cluster etcd operator bootstrapped successfully                                                                                                                                                                 
Jul 19 14:52:56 ostest-p2mk7-bootstrap wait-for-ceo[17444]: I0719 14:52:56.030450       1 waitforceo.go:58] cluster-etcd-operator bootstrap etcd                                                                                                                                                                            
Jul 19 14:52:56 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:56.030248       1 waitforceo.go:64] Cluster etcd operator bootstrapped successfully                                                                                                                                                                  
Jul 19 14:52:56 ostest-p2mk7-bootstrap bootkube.sh[17407]: I0719 14:52:56.030450       1 waitforceo.go:58] cluster-etcd-operator bootstrap etcd                                                                                                                                                                             
Jul 19 14:52:56 ostest-p2mk7-bootstrap bootkube.sh[11105]: bootkube.service complete                                                                                                                                                                                                                                        
Jul 19 14:52:56 ostest-p2mk7-bootstrap systemd[1]: bootkube.service: Succeeded.                                                                                                         

For some reason, the condition "EtcdRunningInCluster in etcd CR /cluster to be True" is delaying the bootstrap leading to the failure on the installation script.

How to reproduce it (as minimally and precisely as possible)?
IPI installation from an installer-vm with two interfaces, one connected to the external network and one connected to a restricted network which is used as BYON in the install-config.yaml. The installer-vm runs a proxy that enables external connectivity to the cluster while deploying.

The issue is observed with NetworkType Kuryr and OVNKubernetes, and it is not persistent: The ipi-proxy installation was successful 8/20 times in OVNKubernetes, 9/21 in Kuryr and 15/23 in OpenShiftSDN.
On the other hand, in versions <4.11 this installation is working fine for all the networkTypes.

Anything else we need to know?
must-gather and post-installation bundle linked on private comment.

Comment 9 rdobosz 2022-07-20 14:27:06 UTC
For sure we need to investigate this bug, but as there is a workaround (like using openshift-install wait-for install-complete after failure) to make installation successful despite of the issue, we considered it as a non-blocker.

Comment 12 Itzik Brown 2022-10-26 07:32:13 UTC
(In reply to rdobosz from comment #9)
> For sure we need to investigate this bug, but as there is a workaround (like
> using openshift-install wait-for install-complete after failure) to make
> installation successful despite of the issue, we considered it as a
> non-blocker.

Is it documented that when using Proxy there is a need to use wait-for install-complete?

Comment 14 Shiftzilla 2023-03-09 01:17:18 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9221


Note You need to log in before you can comment on or make changes to this bug.