Bug 2028217 - Cluster-version operator does not default Deployment replicas to one
Summary: Cluster-version operator does not default Deployment replicas to one
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Duplicates: 2028599
Depends On:
Blocks: 2028602
Reported: 2021-12-01 18:13 UTC by W. Trevor King
Modified: 2022-03-10 16:31 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:31:35 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 698 | 0 | None | open | Bug 2028217: lib/resourcemerge/apps: Default Deployment replicas to one | 2021-12-01 18:38:51 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:31:55 UTC

Description W. Trevor King 2021-12-01 18:13:00 UTC
Tomas and Vadim noticed that, when a Deployment manifest leaves 'replicas' unset, the CVO ignores the property.  This means that cluster admins can scale those Deployments up or, worse, down to 0, and the CVO will happily continue on without stomping them.  Auditing 4.9.10:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.9.10-x86_64
  Extracted release payload from digest sha256:e1853d68d8ff093ec353ca7078b6b6df1533729688bb016b8208263ee7423f66 created at 2021-12-01T09:19:24Z
  $ for F in $(grep -rl 'kind: Deployment' manifests); do yaml2json < "${F}" | jq -r '.[] | select(.kind == "Deployment" and .spec.replicas == null).metadata | .namespace + " " + .name'; done | sort | uniq
  openshift-cluster-machine-approver machine-approver
  openshift-insights insights-operator
  openshift-network-operator network-operator
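
For anyone who would rather not depend on yaml2json and jq, the same filter can be sketched in Go. This is only an illustration: deploymentsWithoutReplicas and the sample input are made up for this comment, and it assumes each manifest file has already been converted to a JSON array of objects (as yaml2json does above).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// manifest holds only the fields the audit looks at. Spec stays raw so
// that non-Deployment kinds with unrelated spec shapes do not fail decoding.
type manifest struct {
	Kind     string `json:"kind"`
	Metadata struct {
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"metadata"`
	Spec json.RawMessage `json:"spec"`
}

// deploymentsWithoutReplicas (hypothetical helper) returns the sorted,
// de-duplicated "namespace name" pairs of Deployments in a JSON array of
// manifests whose spec.replicas is missing or null.
func deploymentsWithoutReplicas(data []byte) ([]string, error) {
	var docs []manifest
	if err := json.Unmarshal(data, &docs); err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	for _, d := range docs {
		if d.Kind != "Deployment" {
			continue
		}
		var spec struct {
			Replicas *int32 `json:"replicas"`
		}
		if len(d.Spec) > 0 {
			if err := json.Unmarshal(d.Spec, &spec); err != nil {
				return nil, err
			}
		}
		if spec.Replicas == nil {
			seen[d.Metadata.Namespace+" "+d.Metadata.Name] = true
		}
	}
	found := make([]string, 0, len(seen))
	for key := range seen {
		found = append(found, key)
	}
	sort.Strings(found)
	return found, nil
}

func main() {
	// Made-up sample input, not taken from a real release payload.
	data := []byte(`[
	  {"kind": "Deployment", "metadata": {"namespace": "example-ns", "name": "unset"}, "spec": {}},
	  {"kind": "Deployment", "metadata": {"namespace": "example-ns", "name": "pinned"}, "spec": {"replicas": 2}},
	  {"kind": "ConfigMap", "metadata": {"namespace": "example-ns", "name": "config"}}
	]`)
	found, err := deploymentsWithoutReplicas(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range found {
		fmt.Println(f) // prints "example-ns unset"
	}
}
```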

Those are all important operators, and I'm fairly confident that none of their maintainers expect "cluster admin scales them down to 0" to be a supported UX.  We should have the CVO default Deployment replicas to 1 (the type's default [1]), so admins who decide they don't want a network operator pod, etc., have to use some more explicit, alarming API to remove those pods (e.g. setting spec.overrides in the ClusterVersion object to assume control of the resource themselves).

[1]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec
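
A minimal sketch of the proposed behavior, to make the idea concrete. This is not the code from the linked pull request: DeploymentSpec here is a pared-down stand-in for the Kubernetes type, and defaultReplicas and ensureReplicas are hypothetical names. The point is only that an unset 'replicas' in the manifest is treated as 1 before the manifest is compared with the in-cluster object.

```go
package main

import "fmt"

// DeploymentSpec is a pared-down, hypothetical stand-in for the
// Kubernetes DeploymentSpec; only the replicas field matters here.
type DeploymentSpec struct {
	Replicas *int32
}

// defaultReplicas returns the manifest's spec with Replicas set to one
// when the manifest left it unset, matching the type's documented default.
func defaultReplicas(required DeploymentSpec) DeploymentSpec {
	if required.Replicas == nil {
		one := int32(1)
		required.Replicas = &one
	}
	return required
}

// ensureReplicas reconciles the in-cluster spec against the (defaulted)
// manifest spec and reports whether the in-cluster spec had to change.
func ensureReplicas(existing *DeploymentSpec, required DeploymentSpec) bool {
	required = defaultReplicas(required)
	if existing.Replicas != nil && *existing.Replicas == *required.Replicas {
		return false
	}
	want := *required.Replicas
	existing.Replicas = &want
	return true
}

func main() {
	// An admin has scaled the Deployment to 0; the manifest leaves replicas unset.
	zero := int32(0)
	inCluster := DeploymentSpec{Replicas: &zero}
	changed := ensureReplicas(&inCluster, DeploymentSpec{})
	fmt.Println(changed, *inCluster.Replicas) // prints "true 1"
}
```

Without the defaulting step, a nil 'replicas' in the manifest gives the reconciler nothing to compare against, which is how the scale-to-0 case slips through.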

Comment 1 W. Trevor King 2021-12-02 18:07:42 UTC
*** Bug 2028599 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2021-12-02 18:49:39 UTC
Test plan is something like:

1. Install a nightly with the fix (of which none exist yet?  Not clear why ART swept this into ON_QA so quickly)
2. Scale down an operator with a vulnerable manifest:
     $ oc -n openshift-network-operator scale --replicas 0 deployment/network-operator
3. Wait a few minutes while the CVO walks the manifest graph to notice the divergence.
4. Confirm that:
     $ oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator
   has been returned to 1.

Comment 5 Yang Yang 2021-12-06 07:21:51 UTC
Reproducing with 4.9.10:

1. Install a v4.9.10 cluster
2. Check the pod status of network-operator

# oc project openshift-network-operator
# oc get all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/network-operator-797978c5db-7tkkj   1/1     Running   0          3h54m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           3h54m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-797978c5db   1         1         1       3h54m

3. Scale down to 0 replicas
# oc scale --replicas 0 deployment/network-operator
deployment.apps/network-operator scaled

# oc get all
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   0/0     0            0           3h59m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-797978c5db   0         0         0       3h59m

The CVO is happy to leave it scaled down to 0.

=================================================================================

Verifying with 4.10.0-0.nightly-2021-12-03-213835

1. Install a cluster with build 4.10.0-0.nightly-2021-12-03-213835
2. Check the pod status of network-operator

# oc project openshift-network-operator
# oc get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/network-operator-f5b59798-2f4xn   1/1     Running   0          4h7m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           4h7m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   1         1         1       4h7m

3. Scale down to 0 replicas
# oc scale --replicas 0 deployment/network-operator
deployment.apps/network-operator scaled
# oc get all
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   0/0     0            0           4h8m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   0         0         0       4h8m

# oc get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/network-operator-f5b59798-m48c6   1/1     Running   0          19s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           4h9m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   1         1         1       4h9m

So, for a manifest that leaves 'replicas' unset, the CVO now defaults it to 1 replica. Moving this bug to the verified state.

Comment 9 errata-xmlrpc 2022-03-10 16:31:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

