Bug 2033536
Summary: | [IPI on Alibabacloud] bootstrap complains invalid value for alibabaCloud.resourceGroupID when updating "cluster-infrastructure-02-config.yml" status, which leads to bootstrap failed and all master nodes NotReady | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jianli Wei <jiwei> |
Component: | Installer | Assignee: | Kenny Woodson <kwoodson> |
Installer sub component: | openshift-installer | QA Contact: | Jianli Wei <jiwei> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | ehashman, gpei, jialiu, kwoodson, mstaeble, ropatil, wking |
Version: | 4.10 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-03-10 16:34:34 UTC | Type: | Bug |
Description
Jianli Wei
2021-12-17 07:13:18 UTC
This bug was introduced late in the process. It was not caught by testing because the install-config.yaml that was used included a resourceGroupID and therefore never exercised this code path.

The initial change started in the installer by removing the resourceGroupID, as it is no longer required:
https://github.com/openshift/installer/pull/5431

The follow-up was a change to openshift/api which made this field no longer required:
https://github.com/openshift/api/pull/1080

The validation was placed back on this field to keep schema compatibility:
https://github.com/openshift/api/pull/1081

Unfortunately, this change to the API was not propagated to the cluster-config-operator and the machine-config-operator. Without these changes, the installer/CVO creates a field from the latest CRD that is optional but still has kubebuilder validation. When the status (status.platformStatus.alibabaCloud.resourceGroupID) is updated for this field, golang defaults this field to "". This value is then validated by the kubebuilder string validation (^rg-[0-9A-Za-z]+$), which fails. As a result, the infrastructure CRD did not pass validation and the installation failed, because the control plane nodes were not able to enter a ready state.

Since openshift/api is used in multiple places, I attempted to fix this in the following PRs.

1. Update the kubebuilder validation for this field to allow for "" or rg-<chars>.
https://github.com/openshift/api/pull/1088
2. Update the cluster-config-operator by vendoring openshift/api with the new validation and the CRD marking the field as optional.
https://github.com/openshift/cluster-config-operator/pull/229
3. Update the machine-config-operator by vendoring openshift/api and updating the controllerconfig CRD, removing resourceGroupID as a required field and updating the CRD with the new regular expression to allow an empty string.
https://github.com/openshift/machine-config-operator/pull/2884

Once these changes are applied I am able to successfully install a cluster:

INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/kwoodson/tmp/alibaba/cluster/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.test.alicloud-dev.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "xxxxxxxxxx"
DEBUG Time elapsed per stage:
DEBUG cluster: 2m35s
DEBUG bootstrap: 1m6s
DEBUG Bootstrap Complete: 16m6s
DEBUG API: 6m21s
DEBUG Bootstrap Destroy: 43s
DEBUG Cluster Operators: 26m8s
INFO Time elapsed: 46m43s

NAME                                 STATUS   ROLES    AGE   VERSION
test-sp98f-master-0                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-1                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-2                  Ready    master   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1a-2pm6w   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-qmr96   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-tblbl   Ready    worker   12h   v1.22.1+6859754

NAME                                 PHASE     TYPE            REGION      ZONE         AGE
test-sp98f-master-0                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   12h
test-sp98f-master-1                  Running   ecs.g6.xlarge   us-east-1   us-east-1a   12h
test-sp98f-master-2                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1a-2pm6w   Running   ecs.g6.large    us-east-1   us-east-1a   12h
test-sp98f-worker-us-east-1b-qmr96   Running   ecs.g6.large    us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1b-tblbl   Running   ecs.g6.large    us-east-1   us-east-1b   12h

@jianli wei Please review the pull requests. Thanks!

I dropped the installer pull request, based on [1].

[1]: https://github.com/openshift/installer/pull/5498#issuecomment-998882750

Verified in 4.10.0-0.nightly-2021-12-23-193744, with 'credentials_mode: "Manual"'.
FYI the flexy-install & flexy-destroy jobs used in QE testing:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/61808/
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/51773/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056