Bug 1982868

Summary: 4.8 ManagementCPUsOverride admission plugin blocks 4.7 deployments on empty topology
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Node Assignee: Artyom <alukiano>
Node sub component: Autoscaler (HPA, VPA) QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: alukiano, aos-bugs, nagrawal, rphillips
Version: 4.8 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1982873 Environment:
Last Closed: 2021-10-18 17:39:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1982873    
Bug Blocks: 1995714    

Description W. Trevor King 2021-07-15 21:12:33 UTC
Somewhat like bug 1961925 about the new admission plugin being overly sensitive to a lack of nodes, the new admission plugin also seems overly sensitive to 4.7 workloads.  4.7->4.8->4.7 rollback CI consistently locks during the return-to-4.7 leg [1], and digging into one such run [2] shows the cluster-version operator blocking on [3]:

  deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-7b677856dc-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology

That etcd operator deployment is very early in the (rollback) update graph [4], so we're trying to take that etcd operator Deployment back to its 4.7 state well before the bulk of cluster components have been asked to move back to their 4.7 state.  Skimming the CVO logs [5], I don't think the CVO is actually attempting to change anything important about the Deployment; we just haven't ported bug 1881484 about Deployment hotlooping back to the 4.7 CVO.

Checking the Infrastructure object, it is indeed missing those topology properties:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml 
...
status:
  apiServerInternalURI: https://api-int.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
  apiServerURL: https://api.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
  etcdDiscoveryDomain: ""
  infrastructureName: ci-op-6rhcth7k-9f994-rkt8q
  platform: AWS
  platformStatus:
    aws:
      region: us-east-1
    type: AWS

Comparing with a non-rollback 4.7->4.8 job [6], where we do have those properties:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928/artifacts/launch/must-gather.tar | tar xOz ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bb88201fbf54c8c9709f224d2daaada8e7420f339e5c54d0628914c4401ebb/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml
...
status:
  apiServerInternalURI: https://api-int.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
  apiServerURL: https://api.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: ci-ln-b8dtkzb-d5d6b-zlwcn
  infrastructureTopology: HighlyAvailable
  platform: AWS
  platformStatus:
    aws:
      region: us-west-2
    type: AWS
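
The difference between the two status blocks can also be checked mechanically. A minimal sketch, assuming the status has already been loaded into a plain mapping; the helper name missing_topology_fields is made up for illustration, and the dicts are abbreviated copies of the two excerpts above:

```python
# Hypothetical helper (not part of any OpenShift tooling): report which
# topology fields are absent or empty in an Infrastructure status mapping.
TOPOLOGY_FIELDS = ("controlPlaneTopology", "infrastructureTopology")


def missing_topology_fields(status):
    """Return the topology field names that are unset or empty in status."""
    return [name for name in TOPOLOGY_FIELDS if not status.get(name)]


# Abbreviated copies of the two status blocks quoted above.
rollback_status = {
    "infrastructureName": "ci-op-6rhcth7k-9f994-rkt8q",
    "platform": "AWS",
}
upgrade_status = {
    "controlPlaneTopology": "HighlyAvailable",
    "infrastructureName": "ci-ln-b8dtkzb-d5d6b-zlwcn",
    "infrastructureTopology": "HighlyAvailable",
    "platform": "AWS",
}

print(missing_topology_fields(rollback_status))
print(missing_topology_fields(upgrade_status))
```

An empty string counts as missing here, which matches the "empty status.controlPlaneTopology or status.infrastructureTopology" wording of the admission error.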

Back in our rollback job, trying to figure out where the properties went, they seem to have been dropped from the CRD:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/infrastructures.config.openshift.io.yaml | grep -ci topology
  0

That is because the manifest for that CRD is even earlier in the update graph than the one for the etcd operator Deployment, so the CVO has already updated the CRD by the time it reaches the Deployment:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log | grep infrastructures.config.openshift.io   
  I0714 00:45:32.566876       1 sync_worker.go:762] Running sync for customresourcedefinition "infrastructures.config.openshift.io" (42 of 669)
  I0714 00:45:32.678901       1 apiext.go:66] Updating CRD infrastructures.config.openshift.io due to diff:   &v1.CustomResourceDefinition{
  ...
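
To compare where manifests sit in the update graph, the "(N of M)" position can be pulled out of the CVO's "Running sync for" lines. A rough sketch, assuming the log format matches the single line quoted above; the function name manifest_position is hypothetical:

```python
import re

# Pattern inferred from the one "Running sync for" line quoted above;
# other CVO versions may format this line differently.
SYNC_RE = re.compile(
    r'Running sync for (?P<kind>\S+) "(?P<name>[^"]+)" '
    r"\((?P<index>\d+) of (?P<total>\d+)\)"
)


def manifest_position(line):
    """Return (kind, name, index, total) for a sync line, else None."""
    match = SYNC_RE.search(line)
    if match is None:
        return None
    return (
        match.group("kind"),
        match.group("name"),
        int(match.group("index")),
        int(match.group("total")),
    )


line = (
    "I0714 00:45:32.566876       1 sync_worker.go:762] Running sync for "
    'customresourcedefinition "infrastructures.config.openshift.io" (42 of 669)'
)
print(manifest_position(line))
```

Running the same function over the lines for the etcd operator Deployment would give its index for comparison.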

So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default.  There are no 4.9-direct-to-4.7 rollbacks, so we don't need to touch master/4.9.
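
The defaulting idea can be sketched as follows. This is only an illustration of the logic in Python, not the actual plugin code and not necessarily how the eventual fix works; effective_topology and DEFAULT_TOPOLOGY are hypothetical names:

```python
# 'HighlyAvailable' is the CRD default named in this report. Hard-coding it
# locally is the approach suggested above, shown here as an assumption.
DEFAULT_TOPOLOGY = "HighlyAvailable"


def effective_topology(status):
    """Return (controlPlaneTopology, infrastructureTopology) for a status.

    Missing or empty fields fall back to DEFAULT_TOPOLOGY instead of being
    treated as an error, so pods are still admitted after the 4.7 CRD has
    dropped the topology properties.
    """
    control_plane = status.get("controlPlaneTopology") or DEFAULT_TOPOLOGY
    infrastructure = status.get("infrastructureTopology") or DEFAULT_TOPOLOGY
    return control_plane, infrastructure


# Status as seen in the rollback job: no topology fields at all.
print(effective_topology({"platform": "AWS"}))
```

With a fallback like this, explicitly set values still win, and only the empty case changes from "forbidden" to "assume the default".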

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
[4]: https://github.com/openshift/cluster-version-operator/blob/3a68652568e9075c23f491bc8c037942bd67ec82/docs/user/reconciliation.md#manifest-graph
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928

Comment 1 W. Trevor King 2021-07-15 21:23:50 UTC
> So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default.

I can probably figure out how to actually do this ;)

Comment 2 W. Trevor King 2021-07-15 23:49:39 UTC
Bug 1977351 is still ON_QA, and since the PR attached to that one added the lines I'm adjusting, maybe we should close this one as a dup, and hang my 4.8 PR on bug 1977351 instead?

Comment 3 Neelesh Agrawal 2021-07-16 18:32:35 UTC

*** This bug has been marked as a duplicate of bug 1977351 ***

Comment 7 W. Trevor King 2021-09-06 01:52:14 UTC
4.7 -> 4.8 -> 4.7 jobs are still failing [1].  Checking a recent run, still timing out [2]:

  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-09-04T02:50:49Z"}

From the ClusterVersion [3], the job got stuck in the return leg from 4.8.0-0.ci-2021-09-03-100015 to 4.7.29, with the not particularly informative message:

  Working towards 4.7.29: 68 of 669 done (10% complete)

It's sticking earlier now, on some etcd thing:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-b7444fb9-7qzqx_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n6
I0904 02:39:46.601247       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 02:42:43.862556       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 11
I0904 02:48:25.775750       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 02:51:45.280411       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 12
I0904 02:57:27.193759       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 03:00:41.110856       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 13

So still suffering from this issue.
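
The same grep can be turned into a quick summary of how often each failure repeats and how many sync attempts have been made. A minimal sketch, assuming the log format shown in the excerpt above; summarize is a hypothetical helper, and the sample lines are copied from that excerpt:

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted above.
RESULT_RE = re.compile(r"Result of work: \[(?P<error>.*)\]\s*$")
ATTEMPT_RE = re.compile(r"in state (?P<state>\w+) at attempt (?P<attempt>\d+)")


def summarize(lines):
    """Count distinct 'Result of work' errors and track the last attempt."""
    errors = Counter()
    last_attempt = None
    for line in lines:
        result = RESULT_RE.search(line)
        if result:
            errors[result.group("error")] += 1
            continue
        attempt = ATTEMPT_RE.search(line)
        if attempt:
            last_attempt = int(attempt.group("attempt"))
    return errors, last_attempt


ERROR = (
    "deployment openshift-etcd-operator/etcd-operator has a replica failure "
    'FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: '
    "autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource "
    "has empty status.controlPlaneTopology or status.infrastructureTopology"
)
SAMPLE = [
    "I0904 02:48:25.775750       1 task_graph.go:555] Result of work: [" + ERROR + "]",
    "I0904 02:57:27.193759       1 task_graph.go:555] Result of work: [" + ERROR + "]",
    "I0904 03:00:41.110856       1 sync_worker.go:549] Running sync "
    "registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:"
    "b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 "
    "(force=true) on generation 3 in state Updating at attempt 13",
]

errors, last_attempt = summarize(SAMPLE)
for message, count in errors.items():
    print(count, message)
print("last attempt:", last_attempt)
```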

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768#1:build-log.txt%3A172
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json

Comment 9 Artyom 2021-09-09 05:43:52 UTC
Hi folks, the relevant PR was merged for 4.9, so we should check the upgrade and rollback flow for 4.9->4.8->4.9. Once we verify it, the cherry-pick https://github.com/openshift/kubernetes/pull/895 can be merged and we can verify the 4.8->4.7->4.8 flow.

Comment 10 Artyom 2021-09-09 05:47:10 UTC
I can see that https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback passed (it has some test failures, but the deployment passed).
I think we can move it to verified.

Comment 11 Sunil Choudhary 2021-09-14 08:31:48 UTC
Thanks Artyom, moving to verified.

Comment 14 errata-xmlrpc 2021-10-18 17:39:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759