Bug 2033536

Summary: [IPI on Alibabacloud] bootstrap complains invalid value for alibabaCloud.resourceGroupID when updating "cluster-infrastructure-02-config.yml" status, which leads to bootstrap failed and all master nodes NotReady
Product: OpenShift Container Platform Reporter: Jianli Wei <jiwei>
Component: Installer Assignee: Kenny Woodson <kwoodson>
Installer sub component: openshift-installer QA Contact: Jianli Wei <jiwei>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: ehashman, gpei, jialiu, kwoodson, mstaeble, ropatil, wking
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-03-10 16:34:34 UTC Type: Bug

Description Jianli Wei 2021-12-17 07:13:18 UTC
Version:

$ openshift-install version
openshift-install 4.10.0-0.ci-2021-12-17-014800
built from commit 2a320fbf90d2a232c19517251ea4f7f5e171682c
release image registry.ci.openshift.org/ocp/release@sha256:c04823dfd2d2fd8cc4da4bf63dc4e5d7f34ec9a414746863a350a822423b8c7c
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
The bootstrap VM keeps reporting an invalid value for alibabaCloud.resourceGroupID, which leaves all master nodes NotReady and causes bootstrap to fail. FYI, if we manually edit the YAML file and provide a valid resource group ID, the bootstrap stage completes successfully.

[core@jiwei-ali01-v954d-bootstrap ~]$ journalctl -b -f -u bootkube.service
...
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Failed to update status for the "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: [#398] failed to create some manifests:
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: "cluster-infrastructure-02-config.yml": failed to update status for infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Skipped "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n  as it already exists
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Failed to update status for the "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: [#399] failed to create some manifests:
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: "cluster-infrastructure-02-config.yml": failed to update status for infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Skipped "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n  as it already exists
^C
[core@jiwei-ali01-v954d-bootstrap ~]$  

What did you expect to happen?
Updating the status for the infrastructure YAML should succeed without error, and the bootstrap stage should complete successfully.

How to reproduce it (as minimally and precisely as possible)?
Always.

Anything else we need to know?
[core@jiwei-ali01-v954d-bootstrap ~]$ sudo su
[root@jiwei-ali01-v954d-bootstrap core]# rpm-ostree status
State: idle
Deployments:
* ostree://658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6
                   Version: 410.84.202112040202-0 (2021-12-04T02:05:40Z)
[root@jiwei-ali01-v954d-bootstrap core]# crictl img
IMAGE                                                  TAG                 IMAGE ID            SIZE
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              46113188a3622       368MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              3c5561395829d       390MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              3356e7e30d3de       402MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              c01ffe83a0a90       410MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              5ef2a766e4fb1       496MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              1865547e94225       372MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              3608778273763       410MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              984eb4af9ac25       296MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              74de4736f9fb3       753MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              06c95040849da       463MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              9fda3c19cad82       405MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              f230587f6e18c       379MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              2f52b396019fc       370MB
registry.ci.openshift.org/ocp/release                  <none>              2f43b7b2afa5d       357MB
[root@jiwei-ali01-v954d-bootstrap core]# crictl ps 
CONTAINER           IMAGE                                                                                                                          CREATED             STATE               NAME                             ATTEMPT             POD ID
ac14d7c44eb42       c01ffe83a0a9005d72ea1ec4a4c4f54a6ecc7f1fa27360b09d8c145b8a3dc38f                                                               6 minutes ago       Running             kube-apiserver-insecure-readyz   1                   726b0cc7eaca5
e6535bc27965d       74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70                                                               6 minutes ago       Running             kube-apiserver                   1                   726b0cc7eaca5
14d348261540f       registry.ci.openshift.org/ocp/release@sha256:c04823dfd2d2fd8cc4da4bf63dc4e5d7f34ec9a414746863a350a822423b8c7c                  7 minutes ago       Running             cluster-version-operator         1                   4bfb71533fd6f
ea6af56ed9691       46113188a3622be9c65460c5a81bf40aea280cffc5f15da2b2a4eb882c790b93                                                               7 minutes ago       Running             cluster-policy-controller        1                   0bffcf2ac2a75
a5076e01f1aba       74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70                                                               7 minutes ago       Running             kube-scheduler                   1                   6a03e5cf545ba
ed25d150f2caa       06c95040849da24b834cb879d6861069d00d9598aa4a4e73ccbf4c099abd090a                                                               7 minutes ago       Running             cloud-credential-operator        1                   1a578d845655d
ce72d7cb163a6       74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70                                                               7 minutes ago       Running             kube-controller-manager          1                   0bffcf2ac2a75
f22754f543efa       5ef2a766e4fb15ec065dd1eff04ddf3a220a3af089d69ffb0b0abcf12862a070                                                               28 minutes ago      Running             machine-config-server            0                   2118ca2f6afb4
9dca9f8e6eb6a       2f52b396019fc5e63d6485d6cb64f1914eb018b9f689ad8ae328271b575fdbe3                                                               28 minutes ago      Running             etcd                             0                   bfbdb06411960
c1ccd8d9bbef7       registry.ci.openshift.org/ocp/4.10-2021-12-17-014800@sha256:ee1cc27b6f28f5accf1f5260e81e20ac6abc224698e3ff96f1110ee3aed940e6   28 minutes ago      Running             etcdctl                          0                   bfbdb06411960
[root@jiwei-ali01-v954d-bootstrap core]# 

[root@jiwei-ali01-v954d-bootstrap core]# ssh -i openshift-qe.pem core@10.0.5.36
The authenticity of host '10.0.5.36 (10.0.5.36)' can't be established.
ECDSA key fingerprint is SHA256:MeMU3K3F38zMmKJu7aLvz9mMceL+4+Hw2AP2cvWfH64.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.0.5.36' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 410.84.202112162002-0
  Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
[core@jiwei-ali01-v954d-master-0 ~]$ sudo su
[root@jiwei-ali01-v954d-master-0 core]# crictl ps 
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
[root@jiwei-ali01-v954d-master-0 core]# rpm-ostree status
State: idle
Deployments:
* pivot://registry.ci.openshift.org/ocp/4.10-2021-12-17-014800@sha256:738c1994d3720161cd94b200b422f543e0b416bdcedad3ab0c03d024426ca552
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202112162002-0 (2021-12-16T20:05:59Z)

  ostree://658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6
                   Version: 410.84.202112040202-0 (2021-12-04T02:05:40Z)
[root@jiwei-ali01-v954d-master-0 core]# crictl img
IMAGE                                                  TAG                 IMAGE ID            SIZE
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>              5ef2a766e4fb1       496MB
[root@jiwei-ali01-v954d-master-0 core]#

Comment 1 Kenny Woodson 2021-12-19 15:33:37 UTC
This bug was introduced late in the process. It was not caught by testing because the install-config.yaml used in testing included a resourceGroupID, so this code path was never exercised.

The initial change started in the installer by removing the resourceGroupID as it is no longer required:
https://github.com/openshift/installer/pull/5431

The follow-up was a change to openshift/api that made this field no longer required:
https://github.com/openshift/api/pull/1080

The validation was then restored on this field to preserve schema compatibility:
https://github.com/openshift/api/pull/1081

Unfortunately, this change to the API was not propagated to the cluster-config-operator and the machine-config-operator. Without these changes, the installer/CVO creates the field from the latest CRD, where it is optional but still carries kubebuilder validation. When the status (status.platformStatus.alibabaCloud.resourceGroupID) is updated, Go defaults the unset string field to "". That value is then checked against the kubebuilder string validation pattern (^rg-[0-9A-Za-z]+$), and the check fails.

As a result, the infrastructure CRD failed validation and the installation failed, because the control plane nodes could not enter a Ready state.
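
To illustrate the failure described above, here is a minimal sketch in Python (not code from the installer or from any of the linked PRs). The pattern is copied from the bootkube error message; the function name and the sample ID "rg-abc123" are hypothetical and used only for illustration.

```python
import re

# Pattern copied verbatim from the bootkube error message.
RESOURCE_GROUP_ID_PATTERN = r"^rg-[0-9A-Za-z]+$"


def is_valid_resource_group_id(value: str) -> bool:
    """Return True if value satisfies the pattern from the error message."""
    # fullmatch requires the whole string to match, so a trailing
    # newline cannot slip past the "$" anchor.
    return re.fullmatch(RESOURCE_GROUP_ID_PATTERN, value) is not None


if __name__ == "__main__":
    # An unset string field arrives as "", which the pattern rejects.
    print(is_valid_resource_group_id(""))           # False
    # "rg-abc123" is a made-up ID with the expected shape.
    print(is_valid_resource_group_id("rg-abc123"))  # True
```

The pattern requires the "rg-" prefix followed by at least one alphanumeric character, so an empty value can never pass it, even though the field itself is optional.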

Since openshift/api is used in multiple places, I attempted to fix this in the following PRs:
1. Update the kubebuilder validation for this field to allow either "" or rg-<chars>.
https://github.com/openshift/api/pull/1088
2. Update the cluster-config-operator by vendoring openshift/api with the new validation and the CRD that marks the field as optional.
https://github.com/openshift/cluster-config-operator/pull/229
3. Update the machine-config-operator by vendoring openshift/api and updating the controllerconfig CRD, so that resourceGroupID is no longer a required field and the regular expression also accepts an empty string.
https://github.com/openshift/machine-config-operator/pull/2884
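
One way to express "either empty or rg-<chars>" as a single pattern is sketched below in Python. The relaxed expression here is an assumption made for illustration; the exact pattern merged in the openshift/api PR may differ. The helper name and sample values are also hypothetical.

```python
import re

# Original pattern, as quoted in the bootkube error message.
STRICT_PATTERN = r"^rg-[0-9A-Za-z]+$"
# Assumed relaxed form: the whole "rg-..." group is made optional,
# so the empty string is accepted as well.
RELAXED_PATTERN = r"^(rg-[0-9A-Za-z]+)?$"


def matches(pattern: str, value: str) -> bool:
    """Return True if the entire value matches the given pattern."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    # Sample values are illustrative only.
    for value in ["", "rg-abc123", "rg-", "abc123"]:
        print(repr(value), matches(STRICT_PATTERN, value), matches(RELAXED_PATTERN, value))
```

With this form, an unset field passes validation, while malformed values such as "rg-" or an ID without the prefix are still rejected.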

Once these changes are applied I am able to successfully install a cluster:

INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/kwoodson/tmp/alibaba/cluster/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.test.alicloud-dev.devcluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password: "xxxxxxxxxx" 
DEBUG Time elapsed per stage:                      
DEBUG            cluster: 2m35s                    
DEBUG          bootstrap: 1m6s                     
DEBUG Bootstrap Complete: 16m6s                    
DEBUG                API: 6m21s                    
DEBUG  Bootstrap Destroy: 43s                      
DEBUG  Cluster Operators: 26m8s                    
INFO Time elapsed: 46m43s 

NAME                                 STATUS   ROLES    AGE   VERSION
test-sp98f-master-0                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-1                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-2                  Ready    master   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1a-2pm6w   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-qmr96   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-tblbl   Ready    worker   12h   v1.22.1+6859754
NAME                                 PHASE     TYPE            REGION      ZONE         AGE
test-sp98f-master-0                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   12h
test-sp98f-master-1                  Running   ecs.g6.xlarge   us-east-1   us-east-1a   12h
test-sp98f-master-2                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1a-2pm6w   Running   ecs.g6.large    us-east-1   us-east-1a   12h
test-sp98f-worker-us-east-1b-qmr96   Running   ecs.g6.large    us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1b-tblbl   Running   ecs.g6.large    us-east-1   us-east-1b   12h


@jianli wei Please review the pull requests. Thanks!

Comment 3 W. Trevor King 2021-12-22 02:24:04 UTC
I dropped the installer pull request, based on [1].

[1]: https://github.com/openshift/installer/pull/5498#issuecomment-998882750

Comment 6 Jianli Wei 2021-12-24 01:59:09 UTC
Verified in 4.10.0-0.nightly-2021-12-23-193744, with 'credentials_mode: "Manual"'.

FYI the flexy-install & flexy-destroy jobs used in QE testing: 
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/61808/
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/51773/

Comment 9 errata-xmlrpc 2022-03-10 16:34:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056