Description of problem:

The CVO does not respect the `strategy` field within a Deployment. Therefore, if I as an OpenShift developer wish to change anything about the strategy after the resource is initially created, the CVO will not apply the changes.

The broken function is https://github.com/openshift/cluster-version-operator/blob/513a2fc5e2e8f4e8fad5c30913d8129c110b0684/lib/resourcemerge/apps.go#L10-L28

It needs an extra block. Note that `Spec.Strategy` is a struct, not a pointer, so a nil check would not compile; the comparison needs to be a deep-equality test, something like:

```
if !equality.Semantic.DeepEqual(existing.Spec.Strategy, required.Spec.Strategy) {
	*modified = true
	existing.Spec.Strategy = required.Spec.Strategy
}
```

Version-Release number of the following components:

4.9/master

How reproducible:

100%

Steps to Reproduce:

Add a strategy section to a deployment in the payload to move from rolling update to recreate:

```
strategy:
  type: Recreate
  rollingUpdate: {}
```

Actual results:

No change to the deployment.

Expected results:

The deployment strategy should be updated.
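For illustration, here is a minimal, self-contained sketch of a strategy-reconcile check of this shape. The type and function names below are illustrative, not the CVO's actual code: simplified stand-in structs replace the `k8s.io/api/apps/v1` types, and `reflect.DeepEqual` stands in for apimachinery's `equality.Semantic.DeepEqual`.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the appsv1 types; the real ones live in
// k8s.io/api/apps/v1. Note that Strategy is a struct field, not a pointer.
type DeploymentStrategy struct {
	Type string
}

type DeploymentSpec struct {
	Strategy DeploymentStrategy
}

// ensureStrategy mirrors the shape of the proposed extra block: if the
// manifest's required strategy differs from the live object's, overwrite it
// and flag the object as modified so the caller issues an update.
func ensureStrategy(modified *bool, existing *DeploymentSpec, required DeploymentSpec) {
	if !reflect.DeepEqual(existing.Strategy, required.Strategy) {
		*modified = true
		existing.Strategy = required.Strategy
	}
}

func main() {
	existing := DeploymentSpec{Strategy: DeploymentStrategy{Type: "RollingUpdate"}}
	required := DeploymentSpec{Strategy: DeploymentStrategy{Type: "Recreate"}}

	modified := false
	ensureStrategy(&modified, &existing, required)
	fmt.Println(modified, existing.Strategy.Type) // true Recreate
}
```

Without a block like this, the merge helper simply never looks at `Spec.Strategy`, so manifest changes to it are silently dropped.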
In recent 4.10 CI [1]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-758df88975-j29jb_cluster-version-operator.log | grep -o 'Updating .*due to diff' | sort | uniq -c
      7 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
      8 Updating Deployment openshift-authentication-operator/authentication-operator due to diff
      8 Updating Deployment openshift-cluster-node-tuning-operator/cluster-node-tuning-operator due to diff
      8 Updating Deployment openshift-cluster-samples-operator/cluster-samples-operator due to diff
      8 Updating Deployment openshift-cluster-storage-operator/cluster-storage-operator due to diff
      8 Updating Deployment openshift-cluster-storage-operator/csi-snapshot-controller-operator due to diff
      7 Updating Deployment openshift-console-operator/console-operator due to diff
      8 Updating Deployment openshift-controller-manager-operator/openshift-controller-manager-operator due to diff
      8 Updating Deployment openshift-dns-operator/dns-operator due to diff
      8 Updating Deployment openshift-image-registry/cluster-image-registry-operator due to diff
      8 Updating Deployment openshift-kube-scheduler-operator/openshift-kube-scheduler-operator due to diff
      8 Updating Deployment openshift-machine-api/cluster-autoscaler-operator due to diff
      8 Updating Deployment openshift-machine-api/cluster-baremetal-operator due to diff
      7 Updating Deployment openshift-machine-api/machine-api-operator due to diff
      7 Updating Deployment openshift-machine-config-operator/machine-config-operator due to diff
      8 Updating Deployment openshift-marketplace/marketplace-operator due to diff
      8 Updating Deployment openshift-monitoring/cluster-monitoring-operator due to diff
      8 Updating Deployment openshift-network-operator/network-operator due to diff
      7 Updating Deployment openshift-operator-lifecycle-manager/catalog-operator due to diff
      7 Updating Deployment openshift-operator-lifecycle-manager/olm-operator due to diff
     14 Updating Deployment openshift-operator-lifecycle-manager/package-server-manager due to diff
      8 Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff
      7 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
```

We do not see the Deployment hotlooping in a recent 4.9 run [2]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-6f8b969579-q8dx4_cluster-version-operator.log | grep -o 'Updating .*due to diff' | sort | uniq -c
      9 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
      9 Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff
```

The high counts for openshift-operator-lifecycle-manager/package-server-manager are because of busted cluster-profile annotations, and I've filed bug 2037168 to track that. Poking at one of the others:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-758df88975-j29jb_cluster-version-operator.log | grep -A139 'Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff' | tail -n140
I1223 16:38:48.886837       1 apps.go:38] Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff:   &v1.Deployment{
  	TypeMeta: v1.TypeMeta{
- 		Kind:       "",
+ 		Kind:       "Deployment",
- 		APIVersion: "",
+ 		APIVersion: "apps/v1",
  	},
  	ObjectMeta: v1.ObjectMeta{
  		... // 2 identical fields
  		Namespace: "openshift-service-ca-operator",
  		SelfLink:  "",
- 		UID:             "3e20a2f6-1a8a-4e93-b21c-9327924ca689",
+ 		UID:             "",
- 		ResourceVersion: "9102",
+ 		ResourceVersion: "",
- 		Generation:      1,
+ 		Generation:      0,
- 		CreationTimestamp: v1.Time{Time: s"2021-12-23 15:47:19 +0000 UTC"},
+ 		CreationTimestamp: v1.Time{},
  		DeletionTimestamp:          nil,
  		DeletionGracePeriodSeconds: nil,
  		Labels:                     {"app": "service-ca-operator"},
  		Annotations: map[string]string{
- 			"deployment.kubernetes.io/revision": "1",
  			"include.release.openshift.io/self-managed-high-availability": "true",
  			"include.release.openshift.io/single-node-developer":          "true",
  		},
  		OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "3178956d-3b1f-45a3-bbc7-d995a437002b", ...}},
  		Finalizers:  nil,
  		ClusterName: "",
- 		ManagedFields: []v1.ManagedFieldsEntry{
- 			...blah blah blah...
- 		},
+ 		ManagedFields: nil,
  	},
  	Spec: v1.DeploymentSpec{
  		Replicas: &1,
  		Selector: &{MatchLabels: {"app": "service-ca-operator"}},
  		Template: v1.PodTemplateSpec{
  			ObjectMeta: {Name: "service-ca-operator", Labels: {"app": "service-ca-operator"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}},
  			Spec: v1.PodSpec{
  				Volumes:        {{Name: "serving-cert", VolumeSource: {Secret: &{SecretName: "serving-cert", DefaultMode: &420, Optional: &true}}}, {Name: "config", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "service-ca-operator-config"}, DefaultMode: &420}}}},
  				InitContainers: nil,
  				Containers: []v1.Container{
  					{
  						... // 13 identical fields
  						StartupProbe: nil,
  						Lifecycle:    nil,
- 						TerminationMessagePath:   "/dev/termination-log",
+ 						TerminationMessagePath:   "",
- 						TerminationMessagePolicy: "File",
+ 						TerminationMessagePolicy: "",
  						ImagePullPolicy: "IfNotPresent",
  						SecurityContext: nil,
  						... // 3 identical fields
  					},
  				},
  				EphemeralContainers: nil,
- 				RestartPolicy:       "Always",
+ 				RestartPolicy:       "",
- 				TerminationGracePeriodSeconds: &30,
+ 				TerminationGracePeriodSeconds: nil,
  				ActiveDeadlineSeconds: nil,
- 				DNSPolicy:             "ClusterFirst",
+ 				DNSPolicy:             "",
  				NodeSelector:       {"node-role.kubernetes.io/master": ""},
  				ServiceAccountName: "service-ca-operator",
- 				DeprecatedServiceAccount: "service-ca-operator",
+ 				DeprecatedServiceAccount: "",
  				AutomountServiceAccountToken: nil,
  				NodeName:                     "",
  				... // 7 identical fields
  				Subdomain: "",
  				Affinity:  nil,
- 				SchedulerName: "default-scheduler",
+ 				SchedulerName: "",
  				Tolerations: {{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"}, {Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &120}, {Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &120}},
  				HostAliases: nil,
  				... // 10 identical fields
  			},
  		},
  		Strategy:        {},
  		MinReadySeconds: 0,
- 		RevisionHistoryLimit: &10,
+ 		RevisionHistoryLimit: nil,
  		Paused: false,
- 		ProgressDeadlineSeconds: &600,
+ 		ProgressDeadlineSeconds: nil,
  	},
  	Status: v1.DeploymentStatus{
  		...blah blah blah...
  	},
  }
I1223 16:38:48.887115       1 sync_worker.go:771] Done syncing for clusterrolebinding "prometheus-k8s-scheduler-resources" (729 of 766)
```

I dunno why it's not showing up in the diff, but I think the CVO is complaining about the implicit strategy:

```
$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-23-153012
$ grep -c strategy: manifests/0000_50_service-ca-operator_05_deploy.yaml
0
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq '.items[] | select(.metadata.name == "service-ca-operator").spec.strategy'
{
  "rollingUpdate": {
    "maxSurge": "25%",
    "maxUnavailable": "25%"
  },
  "type": "RollingUpdate"
}
```

vs. the CVO's requirement that it be semantically equal [3], which is new in 4.10 [4] via this bug. I'm re-opening to get that Deployment hot-looping calmed back down.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408
[3]: https://github.com/openshift/cluster-version-operator/blob/85f767cd663e738b6b9aacec74d010cfbc32993a/lib/resourcemerge/apps.go#L29
[4]: https://github.com/openshift/cluster-version-operator/pull/650
Reproducing the hot-looping issue with 4.10.0-0.nightly-2022-01-06-183041:

```
# grep -o 'Updating .*due to diff' cvo1.log | sort | uniq -c
     78 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     78 Updating Deployment openshift-authentication-operator/authentication-operator due to diff
     78 Updating Deployment openshift-cluster-node-tuning-operator/cluster-node-tuning-operator due to diff
     78 Updating Deployment openshift-cluster-samples-operator/cluster-samples-operator due to diff
     78 Updating Deployment openshift-cluster-storage-operator/cluster-storage-operator due to diff
     78 Updating Deployment openshift-cluster-storage-operator/csi-snapshot-controller-operator due to diff
     78 Updating Deployment openshift-console-operator/console-operator due to diff
     78 Updating Deployment openshift-controller-manager-operator/openshift-controller-manager-operator due to diff
     78 Updating Deployment openshift-dns-operator/dns-operator due to diff
     78 Updating Deployment openshift-image-registry/cluster-image-registry-operator due to diff
     78 Updating Deployment openshift-kube-scheduler-operator/openshift-kube-scheduler-operator due to diff
     78 Updating Deployment openshift-machine-api/cluster-autoscaler-operator due to diff
     78 Updating Deployment openshift-machine-api/cluster-baremetal-operator due to diff
     78 Updating Deployment openshift-machine-api/machine-api-operator due to diff
     78 Updating Deployment openshift-machine-config-operator/machine-config-operator due to diff
     78 Updating Deployment openshift-marketplace/marketplace-operator due to diff
     78 Updating Deployment openshift-monitoring/cluster-monitoring-operator due to diff
     78 Updating Deployment openshift-network-operator/network-operator due to diff
     78 Updating Deployment openshift-operator-lifecycle-manager/catalog-operator due to diff
     78 Updating Deployment openshift-operator-lifecycle-manager/olm-operator due to diff
    156 Updating Deployment openshift-operator-lifecycle-manager/package-server-manager due to diff
     78 Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff
     59 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
```
Verifying with 4.10.0-0.nightly-2022-01-09-195852:

```
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-09-195852   True        False         49m     Cluster version is 4.10.0-0.nightly-2022-01-09-195852

# oc get po -n openshift-cluster-version
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-f94b75f9b-vxlbt   1/1     Running   0          67m

# oc logs cluster-version-operator-f94b75f9b-vxlbt -n openshift-cluster-version > cvo2.log
# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
     18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
```

We don't see Deployment hot-looping. Moving it to verified state.
Hi Trevor, we can see hot-looping on Service openshift-monitoring/cluster-monitoring-operator and CronJob openshift-operator-lifecycle-manager/collect-profiles. Looking at the Service hotloop:

```
# grep -A139 'Updating Service openshift-monitoring/cluster-monitoring-operator due to diff' cvo2.log | tail -n140
I0110 05:42:20.035301       1 core.go:78] Updating Service openshift-monitoring/cluster-monitoring-operator due to diff:   &v1.Service{
  	TypeMeta: v1.TypeMeta{
- 		Kind:       "",
+ 		Kind:       "Service",
- 		APIVersion: "",
+ 		APIVersion: "v1",
  	},
  	ObjectMeta: v1.ObjectMeta{
  		... // 2 identical fields
  		Namespace: "openshift-monitoring",
  		SelfLink:  "",
- 		UID:             "bf847434-1d7b-403e-8799-3301762a9e4b",
+ 		UID:             "",
- 		ResourceVersion: "43235",
+ 		ResourceVersion: "",
  		Generation: 0,
- 		CreationTimestamp: v1.Time{Time: s"2022-01-10 04:35:09 +0000 UTC"},
+ 		CreationTimestamp: v1.Time{},
  		DeletionTimestamp:          nil,
  		DeletionGracePeriodSeconds: nil,
  		Labels: map[string]string{
  			"app": "cluster-monitoring-operator",
- 			"app.kubernetes.io/name": "cluster-monitoring-operator",
  		},
  		Annotations: map[string]string{
  			"include.release.openshift.io/ibm-cloud-managed":              "true",
  			"include.release.openshift.io/self-managed-high-availability": "true",
  			"include.release.openshift.io/single-node-developer":          "true",
- 			"service.alpha.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
  			"service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
- 			"service.beta.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
  		},
  		OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "334d6c04-126d-4271-96ec-d303e93b7d1c", ...}},
  		Finalizers:  nil,
  		ClusterName: "",
- 		ManagedFields: []v1.ManagedFieldsEntry{
- 			{
- 				Manager:    "cluster-version-operator",
- 				Operation:  "Update",
- 				APIVersion: "v1",
- 				Time:       s"2022-01-10 05:39:32 +0000 UTC",
- 				FieldsType: "FieldsV1",
- 				FieldsV1:   s`{"f:metadata":{"f:annotations":{".":{},"f:include.release.opensh`...,
- 			},
- 			{
- 				Manager:    "Go-http-client",
- 				Operation:  "Update",
- 				APIVersion: "v1",
- 				Time:       s"2022-01-10 05:39:35 +0000 UTC",
- 				FieldsType: "FieldsV1",
- 				FieldsV1:   s`{"f:metadata":{"f:annotations":{"f:service.alpha.openshift.io/se`...,
- 			},
- 		},
+ 		ManagedFields: nil,
  	},
  	Spec: v1.ServiceSpec{
  		Ports: []v1.ServicePort{
  			{
  				Name: "https",
- 				Protocol: "TCP",
+ 				Protocol: "",
  				AppProtocol: nil,
  				Port:        8443,
  				... // 2 identical fields
  			},
  		},
  		Selector:  {"app": "cluster-monitoring-operator"},
  		ClusterIP: "None",
- 		ClusterIPs: []string{"None"},
+ 		ClusterIPs: nil,
- 		Type: "ClusterIP",
+ 		Type: "",
  		ExternalIPs: nil,
- 		SessionAffinity: "None",
+ 		SessionAffinity: "",
  		LoadBalancerIP:           "",
  		LoadBalancerSourceRanges: nil,
  		... // 3 identical fields
  		PublishNotReadyAddresses: false,
  		SessionAffinityConfig:    nil,
- 		IPFamilies: []v1.IPFamily{"IPv4"},
+ 		IPFamilies: nil,
- 		IPFamilyPolicy: &"SingleStack",
+ 		IPFamilyPolicy: nil,
  		AllocateLoadBalancerNodePorts: nil,
  		LoadBalancerClass:             nil,
- 		InternalTrafficPolicy: &"Cluster",
+ 		InternalTrafficPolicy: nil,
  	},
  	Status: {},
  }
```

The manifest:

```
# cat 0000_50_cluster-monitoring-operator_03-service.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  labels:
    app: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    targetPort: https
  selector:
    app: cluster-monitoring-operator
```

The in-cluster object:

```
# oc get service/cluster-monitoring-operator -oyaml -n openshift-monitoring
apiVersion: v1
kind: Service
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
  creationTimestamp: "2022-01-10T04:35:09Z"
  labels:
    app: cluster-monitoring-operator
    app.kubernetes.io/name: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "66862"
  uid: bf847434-1d7b-403e-8799-3301762a9e4b
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app: cluster-monitoring-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Is CVO complaining about "service.alpha.openshift.io/serving-cert-signed-by"?
> Is CVO complaining about "service.alpha.openshift.io/serving-cert-signed-by"? CVO shouldn't mind folks adding additional annotations and labels beyond what the manifest carries. But yeah, we should grow separate bugs for the Service and the CronJob, because we want to fix those, but the root cause will be some aspect of reconciliation for those types, while this bug is just about reconciling Deployments.
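The tolerance described above (enforce the manifest's keys, leave extras alone) can be sketched as a tiny merge helper. `ensureAnnotations` below is illustrative only, not the CVO's actual resourcemerge code:

```go
package main

import "fmt"

// ensureAnnotations enforces every key/value pair from the manifest
// (required) on the live object (existing), but tolerates extra keys that
// other controllers have added -- like the serving-cert-signed-by
// annotations the service-CA controller stamps onto Services. Only a
// manifest-key mismatch marks the object as modified.
func ensureAnnotations(modified *bool, existing map[string]string, required map[string]string) {
	for k, v := range required {
		if existing[k] != v {
			existing[k] = v
			*modified = true
		}
	}
}

func main() {
	existing := map[string]string{
		"service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
		// Added by the service-CA controller, not in the manifest:
		"service.alpha.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
	}
	required := map[string]string{
		"service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
	}

	modified := false
	ensureAnnotations(&modified, existing, required)
	fmt.Println(modified) // false: the extra signed-by annotation is tolerated
}
```

Under a convention like this, the signed-by annotations alone should not trigger updates, which is why the Service hot-loop points at some other field's reconciliation.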
From comment 0:

> Therefore, if I as an OpenShift developer wish to change anything about the strategy after the resource is initially created, CVO will not apply the changes.

However, we didn't end up getting any manifest strategy changes from 4.9 to 4.10:

```
$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.19-x86_64
Extracted release payload from digest sha256:79dc491573156e7413c9b39d06b5c10d01f7eee26779cb34aed3d1b93158508c created at 2022-02-01T15:31:25Z
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.0-x86_64
Extracted release payload from digest sha256:be7ff17230199b5c0ee9cd48a932ceb06145ced8fec1c7d31bfe2e1a36746830 created at 2022-02-02T11:21:41Z
$ diff -ru 4.9 4.10 | grep -A1 strategy:
   strategy:
@@ -35,19 +33,19 @@
--
   strategy:
     type: RollingUpdate
--
   strategy:
     type: RollingUpdate
--
   strategy:
     type: RollingUpdate
```

And even expanding to include Deployments not managed by the CVO, there's no difference between 4.9.0 [1] and 4.10.0-rc.0 [2] CI runs:

```
$ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1448662625234718720/artifacts/e2e-aws-upgrade/deployments.json
$ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1488836164038168576/artifacts/e2e-aws-upgrade/deployments.json
$ JQ='[.items[] | {namespace: .metadata.namespace, name: .metadata.name, strategy: .spec.strategy}] | sort_by(.namespace + .name)[]'
$ diff -u1 <(curl -s "${REF_A}" | jq -c "${JQ}") <(curl -s "${REF_B}" | jq -c "${JQ}")
--- /dev/fd/63	2022-02-04 09:00:14.861964053 -0800
+++ /dev/fd/62	2022-02-04 09:00:14.863964053 -0800
@@ -1,2 +1,2 @@
-{"namespace":"e2e-k8s-sig-apps-deployment-upgrade-8373","name":"dp","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
+{"namespace":"e2e-k8s-sig-apps-deployment-upgrade-7357","name":"dp","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
 {"namespace":"openshift-apiserver-operator","name":"openshift-apiserver-operator","strategy":{"type":"Recreate"}}
@@ -8,2 +8,3 @@
 {"namespace":"openshift-cloud-credential-operator","name":"pod-identity-webhook","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
+{"namespace":"openshift-cloud-network-config-controller","name":"cloud-network-config-controller","strategy":{"type":"Recreate"}}
 {"namespace":"openshift-cluster-csi-drivers","name":"aws-ebs-csi-driver-controller","strategy":{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1},"type":"RollingUpdate"}}
```

So I expect this was more along the lines of "we might want this", not "we actually need this". Per the verification [3], this does help protect the cluster from admins patching CVO-managed deployments with custom update strategies (which perhaps the patched deployment is not prepared to support), but that seems narrow enough that I don't think it's worth release docs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1448662625234718720
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1488836164038168576
[3]: https://github.com/openshift/cluster-version-operator/pull/650#issuecomment-924155597
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056