Bug 2046226 - [IPI on Alibabacloud] installation in a disconnected network got 'Bootstrap failed to complete'
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Thomas Wiest
QA Contact: Jianli Wei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-26 12:47 UTC by Jianli Wei
Modified: 2022-10-14 16:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-14 16:14:10 UTC
Target Upstream Version:
Embargoed:


Attachments
[4.11] bootstrap logs (1.41 MB, application/gzip)
2022-07-12 06:43 UTC, Jianli Wei

Description Jianli Wei 2022-01-26 12:47:00 UTC
Created attachment 1855504 [details]
attach1.1

Background: Generally, there are 4 fundamental scenarios that QE testing needs to cover:
(1) IPI a cluster
(2) IPI a private cluster
(3) IPI a cluster in a disconnected network
(4) IPI a cluster in a disconnected network behind http proxy

> With that said, we need to figure out as soon as possible what is wrong, whether in the QE test environment or in the installer. Even if this turns out not to be a bug, close attention is appreciated. Thanks in advance!


Version:
./openshift-install 4.10.0-0.nightly-2022-01-25-023600
built from commit 6bd4f3ecafb39f0ea2f62b7b27b548ca74bab020
release image registry.ci.openshift.org/ocp/release@sha256:19fd4b9a313f2dfcdc982f088cffcc5859484af3cb8966cde5ec0be90b262dbc
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI

What happened?
Installation in a disconnected network failed with 'Bootstrap failed to complete'. The bootstrap machine seems OK, but all 3 masters are NotReady, with very few images pulled.

FYI there are 2 scenarios: 
(1) disconnected with local mirror registry which is within the VPC
(2) disconnected with http proxy for Internet access

We are using alicloud VPC network ACL to construct a disconnected network. 

What did you expect to happen?
The installation should succeed. 

How to reproduce it (as minimally and precisely as possible)?
Always.

Anything else we need to know?

FYI The QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/70906/
The imageContentSources section in "install-config.yaml":
imageContentSources:
- mirrors:
  - jiwei-304.mirror-registry.alicloud-qe.devcluster.openshift.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - jiwei-304.mirror-registry.alicloud-qe.devcluster.openshift.com:5000/ocp/release
  source: registry.ci.openshift.org/ocp/release

Please see attach1.1 for nodes status and related alicloud resources, and attach1.2 for the gathered bootstrap logs. 
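
For reference, the intent of the imageContentSources section above can be sketched as follows: a pullspec from either source repository should resolve to the same path in the local mirror registry. This is only an illustration of the intended mapping, not the container runtime's actual resolution logic; the helper name `mirror_for` and the dictionary layout are hypothetical.

```python
# Hypothetical helper illustrating the source -> mirror pairs listed in
# the imageContentSources section above. Not the runtime's real logic.
MIRROR = "jiwei-304.mirror-registry.alicloud-qe.devcluster.openshift.com:5000/ocp/release"

IMAGE_CONTENT_SOURCES = {
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev": MIRROR,
    "registry.ci.openshift.org/ocp/release": MIRROR,
}


def mirror_for(pullspec: str) -> str:
    """Return the mirrored pullspec, or the input unchanged if no source matches."""
    for source, mirror in IMAGE_CONTENT_SOURCES.items():
        # Match the bare repository, or the repository followed by a digest or tag.
        if pullspec == source or pullspec.startswith((source + "@", source + ":")):
            return mirror + pullspec[len(source):]
    return pullspec


release = (
    "registry.ci.openshift.org/ocp/release"
    "@sha256:19fd4b9a313f2dfcdc982f088cffcc5859484af3cb8966cde5ec0be90b262dbc"
)
print(mirror_for(release))
```

If the masters pull very few images, one thing worth checking is whether the pullspecs they request actually fall under one of the two listed sources.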


FYI The QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/70932/
The proxy section in "install-config.yaml":
proxy:
  httpProxy: http://username:password@10.0.143.21:3128
  httpsProxy: http://username:password@10.0.143.21:3128
  noProxy: test.no-proxy.com

BTW, the installer would use the bastion's public IP address for http proxy. 

Please see attach2.1 for nodes status and related alicloud resources, and attach2.2 for the gathered bootstrap logs.

Comment 1 Jianli Wei 2022-01-26 12:47:31 UTC
Created attachment 1855505 [details]
attach1.2

Comment 2 Jianli Wei 2022-01-26 12:48:16 UTC
Created attachment 1855506 [details]
attach2.1

Comment 3 Jianli Wei 2022-01-26 12:51:21 UTC
Created attachment 1855507 [details]
attach2.2

Comment 4 Jianli Wei 2022-01-26 12:53:49 UTC
Created attachment 1855508 [details]
attach2.2

Comment 5 Marco Braga 2022-01-26 13:53:46 UTC
Jianli Wei,

> We are using alicloud VPC network ACL to construct a disconnected network. 

Could you please describe all the steps to create the disconnected network, including the commands? I found only the destroy scripts on build artifacts.

Comment 6 Jianli Wei 2022-01-27 01:49:58 UTC
(In reply to Marco Braga from comment #5)
> Jianli Wei,
> 
> > We are using alicloud VPC network ACL to construct a disconnected network. 
> 
> Could you please describe all the steps to create the disconnected network,
> including the commands? I found only the destroy scripts on build artifacts.

@Marco please refer to https://gitlab.cee.redhat.com/jiwei/flexy-templates/-/blob/ipi-on-ali/functionality-testing/aos-4_10/hosts/libs/alicloud/utils_v2.sh#L371-457, thanks!

Comment 8 Jianli Wei 2022-01-28 01:34:54 UTC
Sorry, please ignore the 2nd attach2.2, which is a duplicate upload caused by a network issue at the time. Thanks!

Comment 14 Jianli Wei 2022-07-12 06:43:09 UTC
Created attachment 1896245 [details]
[4.11] bootstrap logs

I retried the scenario "IPI a cluster in a disconnected network behind http proxy" with 4.11.0-0.nightly-2022-07-11-080250; the attachment contains the gathered bootstrap logs.

Because this is a disconnected network, the VPC has no NAT gateway configured, so all control-plane and compute nodes are expected to use the HTTP proxy when accessing the Internet.


FYI the content of install-config.yaml:

apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    alibabacloud:
      instanceType: ecs.g6.xlarge
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    alibabacloud:
      instanceType: ecs.g6.large
  replicas: 2
metadata:
  name: jiwei-0712-05
platform:
  alibabacloud:
    region: ap-northeast-1
    resourceGroupID: rg-aek2c4huej7f3ni
    vpcID: vpc-6we8dsk71y9ldriddscdq
    vswitchIDs:
    - vsw-6weguf7vesewzhxzwq4f0
    - vsw-6werzddz3hqwl4nrkrooj
pullSecret: <pull secret>
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: External
proxy:
  httpProxy: http://<username>:<password>@10.0.175.232:3128
  httpsProxy: http://<username>:<password>@10.0.175.232:3128
  noProxy: test.no-proxy.com
credentialsMode: Manual
baseDomain: alicloud-qe.devcluster.openshift.com
sshKey: <ssh keys>
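
As a quick sanity check on the install-config above: without a NAT gateway, the nodes can presumably only reach a proxy that lives inside the VPC, so the proxy address should fall within the machineNetwork CIDR. A minimal sketch of that check, assuming the proxy host is an IP literal; the function name `proxy_in_machine_network` is hypothetical and the credentials are placeholders.

```python
import ipaddress
from urllib.parse import urlsplit


def proxy_in_machine_network(proxy_url: str, machine_cidr: str) -> bool:
    """Return True if the proxy host (an IP literal) lies inside machine_cidr.

    Raises ValueError if the host is missing or is not an IP address.
    """
    host = urlsplit(proxy_url).hostname
    return ipaddress.ip_address(host) in ipaddress.ip_network(machine_cidr)


# Values taken from the install-config above; "user:pass" stands in for
# the redacted credentials.
print(proxy_in_machine_network("http://user:pass@10.0.175.232:3128", "10.0.0.0/16"))
```

This only confirms that the address is inside the machine network; it says nothing about whether the network ACL actually permits traffic to port 3128.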

Comment 18 Beth White 2022-10-14 16:14:10 UTC
Cloned to Jira project https://issues.redhat.com/browse/OCPBUGS-2388

