Bug 1977351

Summary: CVO pod skipped by workload partitioning with incorrect error stating cluster is not SNO
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Node
Node sub component: Autoscaler (HPA, VPA)
Assignee: Artyom <alukiano>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: alukiano, aos-bugs, browsell, dhellmann, kewang, keyoung, mfojtik, sttts, wking, xxia
Version: 4.8
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-07-27 23:13:47 UTC
Bug Depends On: 1976379

Description OpenShift BugZilla Robot 2021-06-29 13:57:58 UTC
+++ This bug was initially created as a clone of Bug #1976379 +++

Created attachment 1794553
must-gather from cluster where this occurred

Description of problem:
Pod "cluster-version-operator-89bf5cdb5-4qhhh" in openshift-cluster-version namespace was not handled by the workload partitioning pod mutation logic. A warning was added to the pod:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    workload.openshift.io/warning: only single-node clusters support workload partitioning
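
A quick way to spot pods hit by this (a hedged sketch, not taken from the report; grep keeps it independent of jsonpath quoting for the slash in the annotation key):

# List any workload partitioning warnings on pods in the CVO namespace.
oc get pods -n openshift-cluster-version -o yaml | grep 'workload.openshift.io/warning'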


Version-Release number of selected component (if applicable): 4.8.0-0.nightly-2021-06-24-222938

How reproducible: unknown


Steps to Reproduce:
1. Install a cluster.
2. Run "oc describe node"; it shows 20m CPU requests for this pod.

Actual results:
  openshift-cluster-version                         cluster-version-operator-89bf5cdb5-4qhhh                        20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h   


Expected results:
  openshift-cluster-version                         cluster-version-operator-89bf5cdb5-4qhhh                        0 (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h   
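
For contrast, a hedged sketch of what the mutation is expected to produce on a partitioned SNO, per the workload partitioning enhancement; the annotation values are illustrative, not taken from this cluster:

# On a correctly mutated pod, CPU requests are stripped (hence the 0 (0%)
# above) and replaced with workload annotations along these lines:
#   target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
#   resources.workload.openshift.io/cluster-version-operator: {"cpushares": 20}
oc get pods -n openshift-cluster-version -o yaml | grep 'workload.openshift.io'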


Additional info:

--- Additional comment from alukiano on 2021-06-27 11:31:08 UTC ---

Can you please provide the installer debug log?
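
(A hedged sketch of how that log is usually captured; the directory path is an assumption:)

# Re-run the installer with debug logging enabled.
openshift-install create cluster --dir <install-dir> --log-level debug
# The full log is also written to <install-dir>/.openshift_install.log.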

Comment 4 Xingxing Xia 2021-07-05 10:40:28 UTC
Following the steps from the 4.9 clone, bug 1976379#c3, I tested the latest 4.8 non-SNO env (4.8.0-0.nightly-2021-07-04-112043), and the issue still exists.
I checked its latest openshift/kubernetes (o/k) commit:
oc adm release info --commits registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-04-112043 | grep hyperkube
  hyperkube    https://github.com/openshift/kubernetes    f36aa364667...

https://github.com/openshift/kubernetes/blob/f36aa364667/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/admission.go#L183-L186 already contains the PR code, so I am moving this back to ASSIGNED.

Comment 5 Artyom 2021-07-05 11:10:50 UTC
The problem is that this annotation was added on an SNO cluster with workload partitioning enabled, when it should not have been.
It is expected for a pod to carry this annotation on a non-SNO cluster.

Can you please verify the bug on an SNO cluster with workload partitioning enabled?
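
A hedged sketch of that verification (the expected topology value comes from the infrastructure API; the grep is illustrative):

# An SNO cluster should report single-replica control plane topology.
oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology}{"\n"}'
# expected output: SingleReplica
# The CVO pod should then carry workload annotations, not the warning.
oc get pods -n openshift-cluster-version -o yaml | grep 'workload.openshift.io'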

Comment 6 Xingxing Xia 2021-07-05 12:22:18 UTC
(In reply to Artyom from comment #5)
Thanks for the clarification. Then it is better to have a QE colleague from the workload partitioning feature team verify this. Let me update the bug accordingly.

Comment 8 Artyom 2021-07-06 09:47:51 UTC
Did you enable workload partitioning during the setup? You need to provide an additional machine config manifest at install time to enable it (a sketch follows below).
Please see - https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md#example-manifests
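
For convenience, a hedged sketch of such a manifest, modeled on the enhancement's example; the two base64 payloads (the CRI-O workload config and the kubelet pinning file) are elided, so this is not a drop-in file:

# Place the manifest in the installer manifests directory before cluster creation.
cat > manifests/02-master-workload-partitioning.yaml <<'MANIFEST'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 02-master-workload-partitioning
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,<encoded CRI-O workload config>
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-workload-partitioning
      - contents:
          source: data:text/plain;charset=utf-8;base64,<encoded kubelet pinning file>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/openshift-workload-pinning
MANIFEST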

Comment 12 Neelesh Agrawal 2021-07-16 18:32:32 UTC
*** Bug 1982868 has been marked as a duplicate of this bug. ***

Comment 14 W. Trevor King 2021-07-22 05:06:27 UTC
Neelesh closed bug 1982868 as a dup of this one [1], but while this bug is now VERIFIED, 4.7 -> 4.8 -> 4.7 rollback jobs are still failing [2].  And a recent failure, from 4.7.20-x86_64 to 4.8.0-0.ci-2021-07-19-070057 and back [3], still blocks with [4]:

  deployment openshift-etcd-operator/etcd-operator has a replica failure FailedCreate: pods "etcd-operator-7b677856dc-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride infrastructure resource has empty status.controlPlaneTopology or status.infrastructureTopology

Did we want to move this back to ASSIGNED until we get that sorted out?  Or should I reopen bug 1982868 so we can handle it separately?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1982868#c3
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1417258388370231296
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1417258388370231296/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
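
A hedged triage sketch for that failure mode (the admission plugin rejects pod creation while those status fields are unset):

# Check the fields named in the FailedCreate error on the rolled-back cluster.
oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology} / {.status.infrastructureTopology}{"\n"}'
# Inspect the replica-failure condition on the blocked deployment.
oc -n openshift-etcd-operator get deployment etcd-operator -o yaml | grep -A 3 ReplicaFailure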

Comment 17 errata-xmlrpc 2021-07-27 23:13:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438