Tomas and Vadim noticed that, when a Deployment manifest leaves 'replicas' unset, the CVO ignores the property. This means that cluster admins can scale those Deployments up or, worse, down to 0, and the CVO will happily continue on without stomping them.

Auditing 4.9.10:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.9.10-x86_64
  Extracted release payload from digest sha256:e1853d68d8ff093ec353ca7078b6b6df1533729688bb016b8208263ee7423f66 created at 2021-12-01T09:19:24Z
  $ for F in $(grep -rl 'kind: Deployment' manifests); do yaml2json < "${F}" | jq -r '.[] | select(.kind == "Deployment" and .spec.replicas == null).metadata | .namespace + " " + .name'; done | sort | uniq
  openshift-cluster-machine-approver machine-approver
  openshift-insights insights-operator
  openshift-network-operator network-operator

Those are all important operators, and I'm fairly confident that none of their maintainers expect "cluster admin scales them down to 0" to be a supported UX. We should have the CVO default Deployment replicas to 1 (the type's default [1]), so admins who decide they don't want a network operator pod, etc., have to use some more explicit, alarming API to remove those pods (e.g. setting spec.overrides in the ClusterVersion object to assume control of the resource themselves).

[1]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec
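The explicit, alarming alternative mentioned above would look something like the following ClusterVersion override (a minimal sketch, using the network-operator Deployment as the example; 'unmanaged: true' tells the CVO to stop reconciling that resource, so the admin takes full responsibility for it):

  spec:
    overrides:
    - kind: Deployment
      group: apps
      namespace: openshift-network-operator
      name: network-operator
      unmanaged: true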
*** Bug 2028599 has been marked as a duplicate of this bug. ***
Test plan is something like:

1. Install a nightly with the fix (of which none exist yet? Not clear why ART swept this into ON_QA so quickly).
2. Scale down an operator with a vulnerable manifest:

     $ oc -n openshift-network-operator scale --replicas 0 deployment/network-operator

3. Wait a few minutes while the CVO walks the manifest graph to notice the divergence.
4. Confirm that:

     $ oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator

   has been returned to 1 (see the loop sketch below).
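For step 4, rather than polling by hand, a simple loop works (a sketch; the 30-second interval is arbitrary):

  $ while sleep 30; do oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator; done

Once the CVO's next sync pass reaches the Deployment manifest, the printed value should flip from 0 back to 1.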
Reproducing with 4.9.10:

1. Install a v4.9.10 cluster.
2. Check the pod status of network-operator:

     # oc project openshift-network-operator
     # oc get all
     NAME                                    READY   STATUS    RESTARTS   AGE
     pod/network-operator-797978c5db-7tkkj   1/1     Running   0          3h54m

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           3h54m

     NAME                                          DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-797978c5db   1         1         1       3h54m

3. Scale down to 0 replicas:

     # oc scale --replicas 0 deployment/network-operator
     deployment.apps/network-operator scaled
     # oc get all
     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   0/0     0            0           3h59m

     NAME                                          DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-797978c5db   0         0         0       3h59m

   The CVO is happy to leave it scaled down to 0.

=================================================================================

Verifying with 4.10.0-0.nightly-2021-12-03-213835:

1. Install a cluster with build 4.10.0-0.nightly-2021-12-03-213835.
2. Check the pod status of network-operator:

     # oc project openshift-network-operator
     # oc get all
     NAME                                  READY   STATUS    RESTARTS   AGE
     pod/network-operator-f5b59798-2f4xn   1/1     Running   0          4h7m

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           4h7m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   1         1         1       4h7m

3. Scale down to 0 replicas:

     # oc scale --replicas 0 deployment/network-operator
     deployment.apps/network-operator scaled
     # oc get all
     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   0/0     0            0           4h8m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   0         0         0       4h8m

     # oc get all
     NAME                                  READY   STATUS    RESTARTS   AGE
     pod/network-operator-f5b59798-m48c6   1/1     Running   0          19s

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           4h9m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   1         1         1       4h9m

So, for manifests that leave 'replicas' unset, the CVO now defaults them to 1 replica. Moving this to the verified state.
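For completeness, the same read-back could be run against all three Deployments flagged in the audit above (a sketch; it only prints each Deployment's current .spec.replicas):

  $ for PAIR in openshift-cluster-machine-approver/machine-approver \
                openshift-insights/insights-operator \
                openshift-network-operator/network-operator; do
        echo -n "${PAIR}: "
        oc -n "${PAIR%/*}" get deployment "${PAIR#*/}" -o jsonpath='{.spec.replicas}{"\n"}'
    done

After the fix, all three should report 1, even after an admin scales them down.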
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056