Bug 1982868 - 4.8 ManagementCPUsOverride admission plugin blocks 4.7 deployments on empty topology
Summary: 4.8 ManagementCPUsOverride admission plugin blocks 4.7 deployments on empty topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Artyom
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1982873
Blocks: 1995714
Reported: 2021-07-15 21:12 UTC by W. Trevor King
Modified: 2021-10-18 17:40 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1982873 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:39:54 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Github openshift/api pull 986 (last updated 2021-08-18 19:34:30 UTC)
Github openshift/kubernetes pull 877 (last updated 2021-08-18 19:34:27 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:40:08 UTC)

Description W. Trevor King 2021-07-15 21:12:33 UTC
Somewhat like bug 1961925, where the new admission plugin was overly sensitive to a lack of nodes, the plugin also seems overly sensitive to 4.7 workloads.  The 4.7->4.8->4.7 rollback CI job consistently gets stuck during the return-to-4.7 leg [1], and digging into one such run [2] shows the cluster-version operator blocking on [3]:

  deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-7b677856dc-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology

That etcd operator Deployment is very early in the (rollback) update graph [4], so we're trying to take it back to its 4.7 state well before the bulk of cluster components have been asked to move back to theirs.  Skimming the CVO logs [5], I don't think the CVO is actually attempting to change anything important about the Deployment; we just haven't ported bug 1881484 (about Deployment hotlooping) back to the 4.7 CVO.

Checking the Infrastructure object, it is indeed missing those topology properties:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml 
...
status:
  apiServerInternalURI: https://api-int.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
  apiServerURL: https://api.ci-op-6rhcth7k-9f994.aws-2.ci.openshift.org:6443
  etcdDiscoveryDomain: ""
  infrastructureName: ci-op-6rhcth7k-9f994-rkt8q
  platform: AWS
  platformStatus:
    aws:
      region: us-east-1
    type: AWS

Comparing with a non-rollback 4.7->4.8 job [6], where we do have those properties:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928/artifacts/launch/must-gather.tar | tar xOz ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bb88201fbf54c8c9709f224d2daaada8e7420f339e5c54d0628914c4401ebb/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml
...
status:
  apiServerInternalURI: https://api-int.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
  apiServerURL: https://api.ci-ln-b8dtkzb-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: ci-ln-b8dtkzb-d5d6b-zlwcn
  infrastructureTopology: HighlyAvailable
  platform: AWS
  platformStatus:
    aws:
      region: us-west-2
    type: AWS

Back to our rollback job: trying to figure out where those properties went, they seem to have been dropped from the CRD, which no longer mentions topology at all:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-8-2021-07-09-111851-sha256-9a3ea481a3ffd9b341dc60067c39ec9fca6fd8936b71e73442c5ccff3838719e/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/infrastructures.config.openshift.io.yaml | grep -ci topology
  0

That's because the manifest for that CRD is even earlier in the update graph than the one for the etcd operator Deployment, so the rollback reverts the CRD to its 4.7 schema (which has no topology fields) before it gets around to the Deployment:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log | grep infrastructures.config.openshift.io   
  I0714 00:45:32.566876       1 sync_worker.go:762] Running sync for customresourcedefinition "infrastructures.config.openshift.io" (42 of 669)
  I0714 00:45:32.678901       1 apiext.go:66] Updating CRD infrastructures.config.openshift.io due to diff:   &v1.CustomResourceDefinition{
  ...

So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default.  There are no 4.9-direct-to-4.7 rollbacks, so we don't need to touch master/4.9.
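A minimal sketch of what that locally hard-coded fallback could look like (hypothetical; this is not the code from the attached PRs, and the package and helper names are illustrative), assuming the plugin reads the Infrastructure status via the openshift/api config/v1 types:

  // Hypothetical sketch of the suggested fallback, not the actual patch.
  // If a 4.7-era control plane never populated the topology fields, assume
  // the same HighlyAvailable value the 4.8 CRD would have defaulted them to.
  package managementcpusoverride

  import (
      configv1 "github.com/openshift/api/config/v1"
  )

  func topologiesWithFallback(infra *configv1.Infrastructure) (controlPlane, workers configv1.TopologyMode) {
      controlPlane = infra.Status.ControlPlaneTopology
      if controlPlane == "" {
          controlPlane = configv1.HighlyAvailableTopologyMode
      }
      workers = infra.Status.InfrastructureTopology
      if workers == "" {
          workers = configv1.HighlyAvailableTopologyMode
      }
      return controlPlane, workers
  }

With a fallback along those lines, the plugin would stop rejecting pods with the FailedCreate error above once the rollback has already reverted the Infrastructure CRD to its 4.7 schema.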

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
[4]: https://github.com/openshift/cluster-version-operator/blob/3a68652568e9075c23f491bc8c037942bd67ec82/docs/user/reconciliation.md#manifest-graph
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1415083407846674432/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-75b6fdf9b7-l4smh_cluster-version-operator.log
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1413177400551804928

Comment 1 W. Trevor King 2021-07-15 21:23:50 UTC
> So to support rollbacks to 4.7, the 4.8 admission plugin probably needs to locally hard-code the CRD's 'HighlyAvailable' default.

I can probably figure out how to actually do this ;)

Comment 2 W. Trevor King 2021-07-15 23:49:39 UTC
Bug 1977351 is still ON_QA, and since the PR attached to that one added the lines I'm adjusting, maybe we should close this one as a dup, and hang my 4.8 PR on bug 1977351 instead?

Comment 3 Neelesh Agrawal 2021-07-16 18:32:35 UTC

*** This bug has been marked as a duplicate of bug 1977351 ***

Comment 7 W. Trevor King 2021-09-06 01:52:14 UTC
4.7 -> 4.8 -> 4.7 jobs are still failing [1].  Checking a recent run, still timing out [2]:

  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-09-04T02:50:49Z"}

From the ClusterVersion [3], the job got stuck in the return leg from 4.8.0-0.ci-2021-09-03-100015 to 4.7.29, with the not particularly informative message:

  Working towards 4.7.29: 68 of 669 done (10% complete)

It's sticking earlier now, still on the etcd operator Deployment:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-b7444fb9-7qzqx_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n6
I0904 02:39:46.601247       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 02:42:43.862556       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 11
I0904 02:48:25.775750       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 02:51:45.280411       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 12
I0904 02:57:27.193759       1 task_graph.go:555] Result of work: [deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-69bb77696-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology]
I0904 03:00:41.110856       1 sync_worker.go:549] Running sync registry.build02.ci.openshift.org/ci-op-w45p3w7d/release@sha256:b10034bedb4bf08a393462caf4c3fac8f9e4646d3b49d05915850dce0145cf15 (force=true) on generation 3 in state Updating at attempt 13

So still suffering from this issue.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768#1:build-log.txt%3A172
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1433931475606048768/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json

Comment 9 Artyom 2021-09-09 05:43:52 UTC
Hi folks, the relevant PR was merged for 4.9, so we should check the upgrade and rollback flow for 4.9->4.8->4.9.  Once we verify that, the cherry-pick https://github.com/openshift/kubernetes/pull/895 can be merged and we can verify the 4.8->4.7->4.8 flow.

Comment 10 Artyom 2021-09-09 05:47:10 UTC
I can see https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback passed (it has some test failures, but the deployment passed).
I think we can move it to verified.

Comment 11 Sunil Choudhary 2021-09-14 08:31:48 UTC
Thanks Artyom, moving to verified.

Comment 14 errata-xmlrpc 2021-10-18 17:39:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

