Somewhat like bug 1961925, where the new admission plugin was overly sensitive to a lack of nodes, the plugin also seems overly sensitive to 4.7 workloads. The 4.7->4.8->4.7 rollback CI consistently locks up during the return-to-4.7 leg [1], and digging into one such run [2] shows the cluster-version operator blocking on [3]:

  deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-7b677856dc-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology

That etcd operator Deployment is very early in the (rollback) update graph [4], so we're trying to take it back to its 4.7 state well before the bulk of cluster components have been asked to move back to theirs. Skimming the CVO logs [5], I don't think the CVO is actually attempting to change anything important about the Deployment; we just haven't ported bug 1881484, about Deployment hotlooping, back to the 4.7 CVO.

Checking the Infrastructure object, it is indeed missing those topology properties:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml
  ...
  status:
    apiServerInternalURI: https://api-int.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
    apiServerURL: https://api.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
    etcdDiscoveryDomain: ""
    infrastructureName: ci-op-6rhcth7k-9f994-rkt8q
    platform: AWS
    platformStatus:
      aws:
        region: us-east-1
      type: AWS

Comparing with a non-rollback 4.7->4.8 job [6], where we do have those properties:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928/artifacts/launch/must-gather.tar | tar xOz ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bb88201fbf54c8c9709f224d2daaada8e7420f339e5c54d0628914c4401ebb/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml
  ...
  status:
    apiServerInternalURI: https://api-int.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
    apiServerURL: https://api.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
    controlPlaneTopology: HighlyAvailable
    etcdDiscoveryDomain: ""
    infrastructureName: ci-ln-b8dtkzb-d5d6b-zlwcn
    infrastructureTopology: HighlyAvailable
    platform: AWS
    platformStatus:
      aws:
        region: us-west-2
      type: AWS

Back to our rollback job, to try and figure out where the properties went: they seem to have been dropped from the CRD itself:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/infrastructures.config.openshift.io.yaml | grep -ci topology
  0

That's because the manifest for that CRD is even earlier in the update graph than the one for the etcd operator Deployment, so the rollback replaces the CRD (removing the topology schema) before it touches the Deployment:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log | grep infrastructures.config.openshift.io
  I0714 00:45:32.566876       1 sync_worker.go:762] Running sync for customresourcedefinition "infrastructures.config.openshift.io" (42 of 669)
  I0714 00:45:32.678901       1 apiext.go:66] Updating CRD infrastructures.config.openshift.io due to diff: &v1.CustomResourceDefinition{
  ...

So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default. There are no 4.9-direct-to-4.7 rollbacks, so we don't need to touch master/4.9.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
[4]: https://github.com/openshift/cluster-version-operator/blob/3a68652568e9075c23f491bc8c037942bd67ec82/docs/user/reconciliation.md#manifest-graph
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928
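For clarity, the rejection mechanism can be sketched as the following guard. This is a hedged, minimal reconstruction of what the ManagementCPUsOverride plugin appears to check based on the error message above; the type and function names are illustrative, not the actual openshift/kubernetes source:

```go
// Minimal sketch of the admission guard implied by the "forbidden" error.
// InfrastructureStatus and validateTopology are hypothetical names.
package main

import (
	"errors"
	"fmt"
)

type InfrastructureStatus struct {
	ControlPlaneTopology   string
	InfrastructureTopology string
}

// validateTopology rejects admission when either topology field is empty,
// which is exactly the state a rolled-back 4.7 Infrastructure CRD leaves
// the cluster in (the 4.7 schema lacks both fields).
func validateTopology(status InfrastructureStatus) error {
	if status.ControlPlaneTopology == "" || status.InfrastructureTopology == "" {
		return errors.New("infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology")
	}
	return nil
}

func main() {
	// Rollback case: both fields dropped, so pod creation is forbidden.
	fmt.Println(validateTopology(InfrastructureStatus{}))
}
```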
> So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default.

I can probably figure out how to actually do this ;)
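A minimal sketch of what that hard-coded fallback could look like, assuming the plugin reads the topology fields through a helper; `topologyOrDefault` is a hypothetical name, not the actual patch:

```go
// Hypothetical sketch of the proposed fix: fall back to the CRD's
// HighlyAvailable default when a rolled-back 4.7 CRD has dropped the
// topology fields from the Infrastructure status.
package main

import "fmt"

const highlyAvailable = "HighlyAvailable"

// topologyOrDefault returns the stored topology value, or the hard-coded
// HighlyAvailable default when the field is empty (as on 4.7 rollback).
func topologyOrDefault(v string) string {
	if v == "" {
		return highlyAvailable
	}
	return v
}

func main() {
	fmt.Println(topologyOrDefault(""))              // empty on rollback: defaulted
	fmt.Println(topologyOrDefault("SingleReplica")) // populated values pass through
}
```

With this in place, admission no longer depends on the Infrastructure CRD schema still carrying the topology fields, so the etcd operator pods can be created during the return-to-4.7 leg.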
Bug 1977351 is still ON_QA, and since the PR attached to that one added the lines I'm adjusting, maybe we should close this one as a dup, and hang my 4.8 PR on bug 1977351 instead?
*** This bug has been marked as a duplicate of bug 1977351 ***
4.7 -> 4.8 -> 4.7 jobs are still failing [1]. Checking a recent run, it is still timing out [2]:

  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-09-04T02:50:49Z"}

From the ClusterVersion [3], the job got stuck in the return leg from 4.8.0-0.ci-2021-09-03-100015 to 4.7.29, with the not particularly informative:

  Working towards 4.7.29: 68 of 669 done (10% complete)

And the CVO logs show it hotlooping on the same etcd operator rejection:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-b7444fb9-7qzqx_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n6
  I0904 02:39:46.601247       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
  I0904 02:42:43.862556       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 11
  I0904 02:48:25.775750       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
  I0904 02:51:45.280411       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 12
  I0904 02:57:27.193759       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
  I0904 03:00:41.110856       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 13

So we're still suffering from this issue.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768#1:build-log.txt%3A172
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
Hi folks, the relevant PR was merged for 4.9, so we should check the upgrade and rollback flow for 4.9->4.8->4.9. Once we verify that, the cherry-pick https://github.com/openshift/kubernetes/pull/895 can be merged, and we can verify the 4.8->4.7->4.8 flow.
I can see that https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback passed (it has some test failures, but the deployment succeeded). I think we can move this to verified.
Thanks Artyom, moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759