Version:

$ openshift-install version
openshift-install 4.10.0-0.ci-2021-12-17-014800
built from commit 2a320fbf90d2a232c19517251ea4f7f5e171682c
release image registry.ci.openshift.org/ocp/release@sha256:c04823dfd2d2fd8cc4da4bf63dc4e5d7f34ec9a414746863a350a822423b8c7c
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?

The bootstrap VM keeps reporting "Invalid value" for alibabaCloud.resourceGroupID, which leaves all master nodes NotReady and the bootstrap stage failing. FYI, if we manually edit the YAML file and provide a valid resource group ID, the bootstrap stage completes successfully.

[core@jiwei-ali01-v954d-bootstrap ~]$ journalctl -b -f -u bootkube.service
...
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Failed to update status for the "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: [#398] failed to create some manifests:
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: "cluster-infrastructure-02-config.yml": failed to update status for infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:07 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Skipped "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n as it already exists
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Failed to update status for the "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: [#399] failed to create some manifests:
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: "cluster-infrastructure-02-config.yml": failed to update status for infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: status.platformStatus.alibabaCloud.resourceGroupID: Invalid value: "": status.platformStatus.alibabaCloud.resourceGroupID in body should match '^rg-[0-9A-Za-z]+$'
Dec 17 05:59:08 jiwei-ali01-v954d-bootstrap bootkube.sh[9243]: Skipped "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n as it already exists
^C
[core@jiwei-ali01-v954d-bootstrap ~]$

What did you expect to happen?

Updating the status of the infrastructure YAML should succeed without error, and the bootstrap stage should complete successfully.

How to reproduce it (as minimally and precisely as possible)?

Always.

Anything else we need to know?
[core@jiwei-ali01-v954d-bootstrap ~]$ sudo su
[root@jiwei-ali01-v954d-bootstrap core]# rpm-ostree status
State: idle
Deployments:
* ostree://658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6
                   Version: 410.84.202112040202-0 (2021-12-04T02:05:40Z)
[root@jiwei-ali01-v954d-bootstrap core]# crictl img
IMAGE                                                  TAG      IMAGE ID        SIZE
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   46113188a3622   368MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   3c5561395829d   390MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   3356e7e30d3de   402MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   c01ffe83a0a90   410MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   5ef2a766e4fb1   496MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   1865547e94225   372MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   3608778273763   410MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   984eb4af9ac25   296MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   74de4736f9fb3   753MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   06c95040849da   463MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   9fda3c19cad82   405MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   f230587f6e18c   379MB
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   2f52b396019fc   370MB
registry.ci.openshift.org/ocp/release                  <none>   2f43b7b2afa5d   357MB
[root@jiwei-ali01-v954d-bootstrap core]# crictl ps
CONTAINER       IMAGE                                                              CREATED          STATE     NAME                             ATTEMPT   POD ID
ac14d7c44eb42   c01ffe83a0a9005d72ea1ec4a4c4f54a6ecc7f1fa27360b09d8c145b8a3dc38f   6 minutes ago    Running   kube-apiserver-insecure-readyz   1         726b0cc7eaca5
e6535bc27965d   74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70   6 minutes ago    Running   kube-apiserver                   1         726b0cc7eaca5
14d348261540f   registry.ci.openshift.org/ocp/release@sha256:c04823dfd2d2fd8cc4da4bf63dc4e5d7f34ec9a414746863a350a822423b8c7c   7 minutes ago   Running   cluster-version-operator   1   4bfb71533fd6f
ea6af56ed9691   46113188a3622be9c65460c5a81bf40aea280cffc5f15da2b2a4eb882c790b93   7 minutes ago    Running   cluster-policy-controller        1         0bffcf2ac2a75
a5076e01f1aba   74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70   7 minutes ago    Running   kube-scheduler                   1         6a03e5cf545ba
ed25d150f2caa   06c95040849da24b834cb879d6861069d00d9598aa4a4e73ccbf4c099abd090a   7 minutes ago    Running   cloud-credential-operator        1         1a578d845655d
ce72d7cb163a6   74de4736f9fb33a13e597d4b04e34df2a13284fbd82b8e16912eb0f2b7489a70   7 minutes ago    Running   kube-controller-manager          1         0bffcf2ac2a75
f22754f543efa   5ef2a766e4fb15ec065dd1eff04ddf3a220a3af089d69ffb0b0abcf12862a070   28 minutes ago   Running   machine-config-server            0         2118ca2f6afb4
9dca9f8e6eb6a   2f52b396019fc5e63d6485d6cb64f1914eb018b9f689ad8ae328271b575fdbe3   28 minutes ago   Running   etcd                             0         bfbdb06411960
c1ccd8d9bbef7   registry.ci.openshift.org/ocp/4.10-2021-12-17-014800@sha256:ee1cc27b6f28f5accf1f5260e81e20ac6abc224698e3ff96f1110ee3aed940e6   28 minutes ago   Running   etcdctl   0   bfbdb06411960
[root@jiwei-ali01-v954d-bootstrap core]#
[root@jiwei-ali01-v954d-bootstrap core]# ssh -i openshift-qe.pem core@10.0.5.36
The authenticity of host '10.0.5.36 (10.0.5.36)' can't be established.
ECDSA key fingerprint is SHA256:MeMU3K3F38zMmKJu7aLvz9mMceL+4+Hw2AP2cvWfH64.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.0.5.36' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 410.84.202112162002-0
  Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make
configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
[core@jiwei-ali01-v954d-master-0 ~]$ sudo su
[root@jiwei-ali01-v954d-master-0 core]# crictl ps
CONTAINER   IMAGE   CREATED   STATE   NAME   ATTEMPT   POD ID
[root@jiwei-ali01-v954d-master-0 core]# rpm-ostree status
State: idle
Deployments:
* pivot://registry.ci.openshift.org/ocp/4.10-2021-12-17-014800@sha256:738c1994d3720161cd94b200b422f543e0b416bdcedad3ab0c03d024426ca552
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202112162002-0 (2021-12-16T20:05:59Z)
  ostree://658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6
                   Version: 410.84.202112040202-0 (2021-12-04T02:05:40Z)
[root@jiwei-ali01-v954d-master-0 core]# crictl img
IMAGE                                                  TAG      IMAGE ID        SIZE
registry.ci.openshift.org/ocp/4.10-2021-12-17-014800   <none>   5ef2a766e4fb1   496MB
[root@jiwei-ali01-v954d-master-0 core]#
This bug was introduced late in the process. It was not caught by testing because the install-config.yaml that was used included a resourceGroupID and therefore never exercised this code path.

The initial change, in the installer, removed the resourceGroupID since it is no longer required: https://github.com/openshift/installer/pull/5431
The follow-up was a change to openshift/api which made this field no longer required: https://github.com/openshift/api/pull/1080
The validation was then placed back on this field to keep schema compatibility: https://github.com/openshift/api/pull/1081

Unfortunately this API change was not propagated to the cluster-config-operator or the machine-config-operator. Without those changes, the installer/CVO creates the field from the latest CRD, which marks it as optional but still carries kubebuilder validation. When the status (status.platformStatus.alibabaCloud.resourceGroupID) is updated, golang defaults the field to "". That value is then checked against the kubebuilder string validation (^rg-[0-9A-Za-z]+$) and fails. This caused the infrastructure CR to fail validation and the installation to fail, as the control plane nodes were never able to enter a ready state.

Since openshift/api is used in multiple places, I attempted to fix this in the following PRs:

1. Update the kubebuilder validation for this field to allow either "" or rg-<chars>: https://github.com/openshift/api/pull/1088
2. Update the cluster-config-operator by vendoring openshift/api with the new validation and the CRD marking the field as optional: https://github.com/openshift/cluster-config-operator/pull/229
3. Update the machine-config-operator by vendoring openshift/api and updating the controllerconfig CRD, removing resourceGroupID as a required field and adding the new regular expression that allows an empty string: https://github.com/openshift/machine-config-operator/pull/2884

Once these changes are applied I am able to successfully install a cluster:

INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/kwoodson/tmp/alibaba/cluster/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.test.alicloud-dev.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "xxxxxxxxxx"
DEBUG Time elapsed per stage:
DEBUG             cluster: 2m35s
DEBUG           bootstrap: 1m6s
DEBUG  Bootstrap Complete: 16m6s
DEBUG                 API: 6m21s
DEBUG   Bootstrap Destroy: 43s
DEBUG   Cluster Operators: 26m8s
INFO Time elapsed: 46m43s

NAME                                 STATUS   ROLES    AGE   VERSION
test-sp98f-master-0                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-1                  Ready    master   12h   v1.22.1+6859754
test-sp98f-master-2                  Ready    master   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1a-2pm6w   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-qmr96   Ready    worker   12h   v1.22.1+6859754
test-sp98f-worker-us-east-1b-tblbl   Ready    worker   12h   v1.22.1+6859754

NAME                                 PHASE     TYPE           REGION      ZONE         AGE
test-sp98f-master-0                  Running   ecs.g6.xlarge  us-east-1   us-east-1b   12h
test-sp98f-master-1                  Running   ecs.g6.xlarge  us-east-1   us-east-1a   12h
test-sp98f-master-2                  Running   ecs.g6.xlarge  us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1a-2pm6w   Running   ecs.g6.large   us-east-1   us-east-1a   12h
test-sp98f-worker-us-east-1b-qmr96   Running   ecs.g6.large   us-east-1   us-east-1b   12h
test-sp98f-worker-us-east-1b-tblbl   Running   ecs.g6.large   us-east-1   us-east-1b   12h

@jianli wei Please review the pull requests. Thanks!
I dropped the installer pull request, based on [1]. [1]: https://github.com/openshift/installer/pull/5498#issuecomment-998882750
Verified in 4.10.0-0.nightly-2021-12-23-193744, with 'credentials_mode: "Manual"'.

FYI, the flexy-install & flexy-destroy jobs used in QE testing:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/61808/
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/51773/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056