Poking around in rc.1 to look at the tech-preview, cluster-API bits: $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64 Extracted release payload from digest sha256:eadfb8892c466438995e979f9a112c454286b4c0888eeb90da19354e8410fb24 created at 2022-02-04T20:00:57Z $ jq . manifests/image-references | jq -r '.spec.tags[] | select(.name | contains("cluster-api")) | .name + " " + .from.name + " " + .annotations["io.openshift.build.source-location"]' cluster-api quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c https://github.com/openshift/cluster-api $ grep -r a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c manifests manifests/image-references: "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a11772593573d97529cd243bca90d963acb094a0640417608f7f863ac0f8714c" nothing consuming that image? Anyhow, there are some tech-preview manifests: $ grep -r feature-gate manifests | grep namespace manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml: release.openshift.io/feature-gate: "TechPreviewNoUpgrade" which seems to be from [1]. $ grep name: manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml name: openshift-cluster-api Looking in a recent tech-preview CI job [2], I don't see that namespace: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster openshift-cluster-csi-drivers ... Because the CVO doen't think it's a tech-preview cluster? $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep tech I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. Checking the gate: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar tzv | grep featuregates.yaml -rw------- 1015690000/root 594 2022-02-04 02:59 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-90979cdd2bba6c5a63b62d3fa2cc3aca15e3e7c684198b40e915be82f4fdd175/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}' { "creationTimestamp": "2022-02-04T10:05:23Z", "featureSet": "TechPreviewNoUpgrade" } Aha, it's because we are edge triggered, but we're missing the initial state sync: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech' W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. Not clear to me why our informer logic doesn't eventually resync and let us know it was confused. Perhaps we need some sort of post-cache-sync check before we enter the event-driven loop? Or, if we can't figure that graceful approach out, we could revert [3], and just crash-loop until the initial FeatureGate fetch succeeds. [1]: https://github.com/openshift/cluster-capi-operator/blob/ef673ccce35453a8ae8a9ef9016e322c93d8c03a/manifests/0000_30_cluster-api_capi-operator_00_namespace.yaml [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264 [3]: https://github.com/openshift/cluster-version-operator/commit/90b14542aed4134eb4054780f8550191dde01381
Fix landed in [1], which ran TechPreview CI like [2]. Checking its CVO logs: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator_previous.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview' I0211 02:11:23.577531 1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82 W0211 02:11:23.579038 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused I0211 02:30:49.795387 1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. I0211 02:30:49.795936 1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown. I0211 02:30:49.796010 1 techpreviewchangestopper.go:109] Shutting down stop-on-techpreview-change controller I0211 02:30:49.795882 1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m8.377710572s I0211 02:30:49.796588 1 start.go:282] Collected stop-on-techpreview-change controller goroutine. I0211 02:30:49.807000 1 cvo.go:440] Shutting down ClusterVersionOperator I0211 02:30:49.814421 1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82. So that's the initial control-plane CVO, and hit the early-install Kube-API hiccup, assumed the gate was off, Kube-API came up, CVO got a lease, launched controllers, noticed the gate was actually TechPreviewNoUpgrade, and gracefully shutdown. Lovely. When the replacement container comes up: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7785c85b54-l9t4s_cluster-version-operator.log | grep 'ClusterVersionOperator\|Error getting featuregate value\|techpreview' I0211 02:30:50.795928 1 start.go:23] ClusterVersionOperator 4.11.0-202202102052.p0.gb133d82.assembly.stream-b133d82 I0211 02:30:51.413209 1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true. I0211 02:30:51.413803 1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m15.086131489s So it noticed when it was starting up that it was TechPreviewNoUpgrade, no reboot required. Also lovely. Checking the namespace: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep cluster openshift-cluster-api openshift-cluster-csi-drivers ... Hooray :). And the ClusterOperator for good measure: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488/artifacts/e2e-aws-techpreview/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cluster-api").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' 2022-02-11T02:09:50Z Available=True AsExpected: Cluster CAPI Operator is available at 4.11.0-0.nightly-2022-02-11-014337 2022-02-11T02:09:50Z Progressing=False AsExpected: 2022-02-11T02:09:50Z Degraded=False AsExpected: 2022-02-11T02:09:50Z Upgradeable=True AsExpected: Hooray :). Looks verified to me, but I'll wait for a second set of eyes from QE :) [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-02-11-014337 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1491953215070015488
From the description, the TechPreviewNoUpgrade featureset is created during cluster's initialization stage. But the cluster is not accessible for users before the installation finish(this should be the only supported way). So I tried to check if it could be reproduced by following way based on a normal cluster(getting the featureset enabled before cvo sync). 1. scale down cvo to stop the sync. 2. create TechPreviewNoUpgrade featureset # ./oc get featuregate cluster -oyaml|grep 'creationTimestamp\|featureSet' creationTimestamp: "2022-03-07T02:21:20Z" featureSet: TechPreviewNoUpgrade 3. check that no tp operator installed even with TechPreviewNoUpgrade featureset # ./oc get ns openshift-cluster-api Error from server (NotFound): namespaces "openshift-cluster-api" not found # ./oc get co cluster-api Error from server (NotFound): clusteroperators.config.openshift.io "cluster-api" not found 4. scale up cvo to start the sync. 5. check that tp operator installed successfully. # ./oc get co cluster-api NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE cluster-api 4.10.2 True False False 10s # ./oc logs cluster-version-operator-567756b458-88hcg|grep TechPreviewNoUpgrade I0307 03:18:22.338574 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true. It works well, so can not reproduce the issue. Verified the bug through the ci job periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview as comment4. Version:4.11.0-0.nightly-2022-03-04-063157 TechPreviewNoUpgrade featureset was created 06:46:22Z. # curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e80ffaaae914124522da3865034d497aa44b64a52b0c70ff5e8d24925147bbbd/cluster-scoped-resources/config.openshift.io/featuregates.yaml | yaml2json | jq -r '.items[] | {creationTimestamp: .metadata.creationTimestamp, featureSet: .spec.featureSet}' { "creationTimestamp": "2022-03-04T06:46:22Z", "featureSet": "TechPreviewNoUpgrade" } During initial stage, no tp operator installed due to kube-api is not reday even the TechPreviewNoUpgrade featureset is enabled. # curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator_previous.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate' I0304 06:51:25.859053 1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e W0304 06:51:25.860400 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused ... I0304 07:04:24.490835 1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. I0304 07:04:24.490869 1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 2m16.675706003s I0304 07:04:24.490903 1 techpreviewchangestopper.go:78] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown. I0304 07:04:24.496164 1 cvo.go:440] Shutting down ClusterVersionOperator I0304 07:04:24.502419 1 start.go:31] Graceful shutdown complete for ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e. After initial finish, tp operator is installed successfully. # curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-techpreview/1499634483534499840/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-856d46654d-8pgsw_cluster-version-operator.log|grep 'ClusterVersionOperator\|TechPreviewNoUpgrade\|featuregate'|head -n5 I0304 07:04:25.470420 1 start.go:23] ClusterVersionOperator 4.11.0-202203021449.p0.g0e9bc4e.assembly.stream-0e9bc4e I0304 07:04:25.995604 1 techpreviewchangestopper.go:101] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade true. I0304 07:04:25.997350 1 cvo.go:338] Starting ClusterVersionOperator with minimum reconcile period 3m16.4555211s I0304 07:04:34.092211 1 sync_worker.go:824] Running sync for customresourcedefinition "featuregates.config.openshift.io" (42 of 814) I0304 07:04:34.459782 1 sync_worker.go:824] Running sync for featuregate "cluster" (24 of 814) From above, verified the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069