Bug 2000938
| Summary: | CVO does not respect changes to a Deployment strategy | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joel Speed <jspeed> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | aos-bugs, lmohanty, pamoedom, wking, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:07:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Joel Speed
2021-09-03 10:35:37 UTC
In recent 4.10 CI [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-758df88975-j29jb_cluster-version-operator.log | grep -o 'Updating .*due to diff' | sort | uniq -c
7 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
8 Updating Deployment openshift-authentication-operator/authentication-operator due to diff
8 Updating Deployment openshift-cluster-node-tuning-operator/cluster-node-tuning-operator due to diff
8 Updating Deployment openshift-cluster-samples-operator/cluster-samples-operator due to diff
8 Updating Deployment openshift-cluster-storage-operator/cluster-storage-operator due to diff
8 Updating Deployment openshift-cluster-storage-operator/csi-snapshot-controller-operator due to diff
7 Updating Deployment openshift-console-operator/console-operator due to diff
8 Updating Deployment openshift-controller-manager-operator/openshift-controller-manager-operator due to diff
8 Updating Deployment openshift-dns-operator/dns-operator due to diff
8 Updating Deployment openshift-image-registry/cluster-image-registry-operator due to diff
8 Updating Deployment openshift-kube-scheduler-operator/openshift-kube-scheduler-operator due to diff
8 Updating Deployment openshift-machine-api/cluster-autoscaler-operator due to diff
8 Updating Deployment openshift-machine-api/cluster-baremetal-operator due to diff
7 Updating Deployment openshift-machine-api/machine-api-operator due to diff
7 Updating Deployment openshift-machine-config-operator/machine-config-operator due to diff
8 Updating Deployment openshift-marketplace/marketplace-operator due to diff
8 Updating Deployment openshift-monitoring/cluster-monitoring-operator due to diff
8 Updating Deployment openshift-network-operator/network-operator due to diff
7 Updating Deployment openshift-operator-lifecycle-manager/catalog-operator due to diff
7 Updating Deployment openshift-operator-lifecycle-manager/olm-operator due to diff
14 Updating Deployment openshift-operator-lifecycle-manager/package-server-manager due to diff
8 Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff
7 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

We do not see the Deployment hotlooping in a recent 4.9 run [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-6f8b969579-q8dx4_cluster-version-operator.log | grep -o 'Updating .*due to diff' | sort | uniq -c
9 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
9 Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff

The high counts for openshift-operator-lifecycle-manager/package-server-manager are because of busted cluster-profile annotations, and I've filed bug 2037168 to track that.
Poking at one of the others:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-758df88975-j29jb_cluster-version-operator.log | grep -A139 'Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff' | tail -n140
I1223 16:38:48.886837 1 apps.go:38] Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff: &v1.Deployment{
TypeMeta: v1.TypeMeta{
- Kind: "",
+ Kind: "Deployment",
- APIVersion: "",
+ APIVersion: "apps/v1",
},
ObjectMeta: v1.ObjectMeta{
... // 2 identical fields
Namespace: "openshift-service-ca-operator",
SelfLink: "",
- UID: "3e20a2f6-1a8a-4e93-b21c-9327924ca689",
+ UID: "",
- ResourceVersion: "9102",
+ ResourceVersion: "",
- Generation: 1,
+ Generation: 0,
- CreationTimestamp: v1.Time{Time: s"2021-12-23 15:47:19 +0000 UTC"},
+ CreationTimestamp: v1.Time{},
DeletionTimestamp: nil,
DeletionGracePeriodSeconds: nil,
Labels: {"app": "service-ca-operator"},
Annotations: map[string]string{
- "deployment.kubernetes.io/revision": "1",
"include.release.openshift.io/self-managed-high-availability": "true",
"include.release.openshift.io/single-node-developer": "true",
},
OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "3178956d-3b1f-45a3-bbc7-d995a437002b", ...}},
Finalizers: nil,
ClusterName: "",
- ManagedFields: []v1.ManagedFieldsEntry{
- ...blah blah blah...
- },
+ ManagedFields: nil,
},
Spec: v1.DeploymentSpec{
Replicas: &1,
Selector: &{MatchLabels: {"app": "service-ca-operator"}},
Template: v1.PodTemplateSpec{
ObjectMeta: {Name: "service-ca-operator", Labels: {"app": "service-ca-operator"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}},
Spec: v1.PodSpec{
Volumes: {{Name: "serving-cert", VolumeSource: {Secret: &{SecretName: "serving-cert", DefaultMode: &420, Optional: &true}}}, {Name: "config", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "service-ca-operator-config"}, DefaultMode: &420}}}},
InitContainers: nil,
Containers: []v1.Container{
{
... // 13 identical fields
StartupProbe: nil,
Lifecycle: nil,
- TerminationMessagePath: "/dev/termination-log",
+ TerminationMessagePath: "",
- TerminationMessagePolicy: "File",
+ TerminationMessagePolicy: "",
ImagePullPolicy: "IfNotPresent",
SecurityContext: nil,
... // 3 identical fields
},
},
EphemeralContainers: nil,
- RestartPolicy: "Always",
+ RestartPolicy: "",
- TerminationGracePeriodSeconds: &30,
+ TerminationGracePeriodSeconds: nil,
ActiveDeadlineSeconds: nil,
- DNSPolicy: "ClusterFirst",
+ DNSPolicy: "",
NodeSelector: {"node-role.kubernetes.io/master": ""},
ServiceAccountName: "service-ca-operator",
- DeprecatedServiceAccount: "service-ca-operator",
+ DeprecatedServiceAccount: "",
AutomountServiceAccountToken: nil,
NodeName: "",
... // 7 identical fields
Subdomain: "",
Affinity: nil,
- SchedulerName: "default-scheduler",
+ SchedulerName: "",
Tolerations: {{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"}, {Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &120}, {Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &120}},
HostAliases: nil,
... // 10 identical fields
},
},
Strategy: {},
MinReadySeconds: 0,
- RevisionHistoryLimit: &10,
+ RevisionHistoryLimit: nil,
Paused: false,
- ProgressDeadlineSeconds: &600,
+ ProgressDeadlineSeconds: nil,
},
Status: v1.DeploymentStatus{
...blah blah blah...
},
}
I1223 16:38:48.887115 1 sync_worker.go:771] Done syncing for clusterrolebinding "prometheus-k8s-scheduler-resources" (729 of 766)

I dunno why it's not showing up in the diff, but I think the CVO is complaining about the implicit strategy:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-23-153012
$ grep -c strategy: manifests/0000_50_service-ca-operator_05_deploy.yaml
0
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq '.items[] | select(.metadata.name == "service-ca-operator").spec.strategy'
{
  "rollingUpdate": {
    "maxSurge": "25%",
    "maxUnavailable": "25%"
  },
  "type": "RollingUpdate"
}

vs. the CVO's requirement that it be semantically equal [3], which is new in 4.10 [4] via this bug. I'm re-opening to get that Deployment hot-looping calmed back down.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1474040369288056832
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408
[3]: https://github.com/openshift/cluster-version-operator/blob/85f767cd663e738b6b9aacec74d010cfbc32993a/lib/resourcemerge/apps.go#L29
[4]: https://github.com/openshift/cluster-version-operator/pull/650
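If that implicit-strategy read is right, the loop is easy to model: the manifest leaves .spec.strategy unset, so the in-process "required" value is the Go zero value, while the API server defaults the live object to RollingUpdate with 25%/25%. A strict semantic comparison then reports a diff on every sync, the CVO stomps the Deployment, the API server re-applies the same defaults, and the cycle repeats. A minimal sketch with simplified stand-ins, not the CVO's actual resourcemerge code:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// The release manifest omits .spec.strategy, so the "required" strategy is the zero value.
	var required appsv1.DeploymentStrategy

	// The API server defaults an omitted strategy to RollingUpdate with 25%/25%,
	// which is what the in-cluster ("existing") Deployment carries.
	maxSurge := intstr.FromString("25%")
	maxUnavailable := intstr.FromString("25%")
	existing := appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}

	// A strict semantic comparison sees these as different on every sync pass.
	fmt.Println(equality.Semantic.DeepEqual(existing, required)) // false
}

Because the defaulting happens server-side after every update, re-applying the manifest can never converge on equality.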
Reproducing the hot-looping issue with 4.10.0-0.nightly-2022-01-06-183041
# grep -o 'Updating .*due to diff' cvo1.log | sort | uniq -c
78 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
78 Updating Deployment openshift-authentication-operator/authentication-operator due to diff
78 Updating Deployment openshift-cluster-node-tuning-operator/cluster-node-tuning-operator due to diff
78 Updating Deployment openshift-cluster-samples-operator/cluster-samples-operator due to diff
78 Updating Deployment openshift-cluster-storage-operator/cluster-storage-operator due to diff
78 Updating Deployment openshift-cluster-storage-operator/csi-snapshot-controller-operator due to diff
78 Updating Deployment openshift-console-operator/console-operator due to diff
78 Updating Deployment openshift-controller-manager-operator/openshift-controller-manager-operator due to diff
78 Updating Deployment openshift-dns-operator/dns-operator due to diff
78 Updating Deployment openshift-image-registry/cluster-image-registry-operator due to diff
78 Updating Deployment openshift-kube-scheduler-operator/openshift-kube-scheduler-operator due to diff
78 Updating Deployment openshift-machine-api/cluster-autoscaler-operator due to diff
78 Updating Deployment openshift-machine-api/cluster-baremetal-operator due to diff
78 Updating Deployment openshift-machine-api/machine-api-operator due to diff
78 Updating Deployment openshift-machine-config-operator/machine-config-operator due to diff
78 Updating Deployment openshift-marketplace/marketplace-operator due to diff
78 Updating Deployment openshift-monitoring/cluster-monitoring-operator due to diff
78 Updating Deployment openshift-network-operator/network-operator due to diff
78 Updating Deployment openshift-operator-lifecycle-manager/catalog-operator due to diff
78 Updating Deployment openshift-operator-lifecycle-manager/olm-operator due to diff
156 Updating Deployment openshift-operator-lifecycle-manager/package-server-manager due to diff
78 Updating Deployment openshift-service-ca-operator/service-ca-operator due to diff
59 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
Verifying with 4.10.0-0.nightly-2022-01-09-195852
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-01-09-195852 True False 49m Cluster version is 4.10.0-0.nightly-2022-01-09-195852
# oc get po -n openshift-cluster-version
NAME READY STATUS RESTARTS AGE
cluster-version-operator-f94b75f9b-vxlbt 1/1 Running 0 67m
# oc logs cluster-version-operator-f94b75f9b-vxlbt -n openshift-cluster-version > cvo2.log
# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
We don't see Deployment hot-looping. Moving it to verified state.
Hi Trevor, we can see hot-looping on Service openshift-monitoring/cluster-monitoring-operator and CronJob openshift-operator-lifecycle-manager/collect-profiles.
Looking at the Service hotloop.
# grep -A139 'Updating Service openshift-monitoring/cluster-monitoring-operator due to diff' cvo2.log | tail -n140
I0110 05:42:20.035301 1 core.go:78] Updating Service openshift-monitoring/cluster-monitoring-operator due to diff: &v1.Service{
TypeMeta: v1.TypeMeta{
- Kind: "",
+ Kind: "Service",
- APIVersion: "",
+ APIVersion: "v1",
},
ObjectMeta: v1.ObjectMeta{
... // 2 identical fields
Namespace: "openshift-monitoring",
SelfLink: "",
- UID: "bf847434-1d7b-403e-8799-3301762a9e4b",
+ UID: "",
- ResourceVersion: "43235",
+ ResourceVersion: "",
Generation: 0,
- CreationTimestamp: v1.Time{Time: s"2022-01-10 04:35:09 +0000 UTC"},
+ CreationTimestamp: v1.Time{},
DeletionTimestamp: nil,
DeletionGracePeriodSeconds: nil,
Labels: map[string]string{
"app": "cluster-monitoring-operator",
- "app.kubernetes.io/name": "cluster-monitoring-operator",
},
Annotations: map[string]string{
"include.release.openshift.io/ibm-cloud-managed": "true",
"include.release.openshift.io/self-managed-high-availability": "true",
"include.release.openshift.io/single-node-developer": "true",
- "service.alpha.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
"service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
- "service.beta.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
},
OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "334d6c04-126d-4271-96ec-d303e93b7d1c", ...}},
Finalizers: nil,
ClusterName: "",
- ManagedFields: []v1.ManagedFieldsEntry{
- {
- Manager: "cluster-version-operator",
- Operation: "Update",
- APIVersion: "v1",
- Time: s"2022-01-10 05:39:32 +0000 UTC",
- FieldsType: "FieldsV1",
- FieldsV1: s`{"f:metadata":{"f:annotations":{".":{},"f:include.release.opensh`...,
- },
- {
- Manager: "Go-http-client",
- Operation: "Update",
- APIVersion: "v1",
- Time: s"2022-01-10 05:39:35 +0000 UTC",
- FieldsType: "FieldsV1",
- FieldsV1: s`{"f:metadata":{"f:annotations":{"f:service.alpha.openshift.io/se`...,
- },
- },
+ ManagedFields: nil,
},
Spec: v1.ServiceSpec{
Ports: []v1.ServicePort{
{
Name: "https",
- Protocol: "TCP",
+ Protocol: "",
AppProtocol: nil,
Port: 8443,
... // 2 identical fields
},
},
Selector: {"app": "cluster-monitoring-operator"},
ClusterIP: "None",
- ClusterIPs: []string{"None"},
+ ClusterIPs: nil,
- Type: "ClusterIP",
+ Type: "",
ExternalIPs: nil,
- SessionAffinity: "None",
+ SessionAffinity: "",
LoadBalancerIP: "",
LoadBalancerSourceRanges: nil,
... // 3 identical fields
PublishNotReadyAddresses: false,
SessionAffinityConfig: nil,
- IPFamilies: []v1.IPFamily{"IPv4"},
+ IPFamilies: nil,
- IPFamilyPolicy: &"SingleStack",
+ IPFamilyPolicy: nil,
AllocateLoadBalancerNodePorts: nil,
LoadBalancerClass: nil,
- InternalTrafficPolicy: &"Cluster",
+ InternalTrafficPolicy: nil,
},
Status: {},
}
# cat 0000_50_cluster-monitoring-operator_03-service.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  labels:
    app: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    targetPort: https
  selector:
    app: cluster-monitoring-operator
# oc get service/cluster-monitoring-operator -oyaml -n openshift-monitoring
apiVersion: v1
kind: Service
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
  creationTimestamp: "2022-01-10T04:35:09Z"
  labels:
    app: cluster-monitoring-operator
    app.kubernetes.io/name: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "66862"
  uid: bf847434-1d7b-403e-8799-3301762a9e4b
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app: cluster-monitoring-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
Is CVO complaining about "service.alpha.openshift.io/serving-cert-signed-by"?
> Is CVO complaining about "service.alpha.openshift.io/serving-cert-signed-by"?
CVO shouldn't mind folks adding additional annotations and labels beyond what the manifest carries. But yeah, we should grow separate bugs for the Service and the CronJob, because we want to fix those, but the root cause will be some aspect of reconciliation for those types, while this bug is just about reconciling Deployments.
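For reference, the usual "ensure" pattern for labels and annotations tolerates extra keys: keys from the manifest are enforced, but anything another controller adds (like the serving-cert-signed-by annotations from the service-ca controller) is left in place and should not, on its own, count as a diff. A minimal sketch of that pattern, with a hypothetical helper name rather than the CVO's actual resourcemerge code:

package sketch

// ensureStringMap (hypothetical name) enforces the manifest's keys and values but
// ignores extra keys that other controllers have added to the in-cluster object.
func ensureStringMap(modified *bool, existing *map[string]string, required map[string]string) {
	if *existing == nil && len(required) > 0 {
		*existing = map[string]string{}
	}
	for key, requiredValue := range required {
		if existingValue, ok := (*existing)[key]; !ok || existingValue != requiredValue {
			*modified = true
			(*existing)[key] = requiredValue
		}
	}
}

Under that pattern, the signed-by annotations alone would not explain the Service hotloop, which is why it deserves its own bug.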
From comment 0:

> Therefore, if I as an OpenShift developer wish to change anything about the strategy after the resource is initially created, CVO will not apply the changes.

However, we didn't end up getting any manifest strategy changes from 4.9 to 4.10:

$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.19-x86_64
Extracted release payload from digest sha256:79dc491573156e7413c9b39d06b5c10d01f7eee26779cb34aed3d1b93158508c created at 2022-02-01T15:31:25Z
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.0-x86_64
Extracted release payload from digest sha256:be7ff17230199b5c0ee9cd48a932ceb06145ced8fec1c7d31bfe2e1a36746830 created at 2022-02-02T11:21:41Z
$ diff -ru 4.9 4.10 | grep -A1 strategy:
strategy:
@@ -35,19 +33,19 @@
--
strategy:
type: RollingUpdate
--
strategy:
type: RollingUpdate
--
strategy:
type: RollingUpdate

And even expanding to include Deployments not managed by the CVO, there's no difference between 4.9.0 [1] and 4.10.0-rc.0 [2] CI runs:

$ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1448662625234718720/artifacts/e2e-aws-upgrade/deployments.json
$ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1488836164038168576/artifacts/e2e-aws-upgrade/deployments.json
$ JQ='[.items[] | {namespace: .metadata.namespace, name: .metadata.name, strategy: .spec.strategy}] | sort_by(.namespace + .name)[]'
$ diff -u1 <(curl -s "${REF_A}" | jq -c "${JQ}") <(curl -s "${REF_B}" | jq -c "${JQ}")
--- /dev/fd/63 2022-02-04 09:00:14.861964053 -0800
+++ /dev/fd/62 2022-02-04 09:00:14.863964053 -0800
@@ -1,2 +1,2 @@
-{"namespace":"e2e-k8s-sig-apps-deployment-upgrade-8373","name":"dp","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
+{"namespace":"e2e-k8s-sig-apps-deployment-upgrade-7357","name":"dp","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
 {"namespace":"openshift-apiserver-operator","name":"openshift-apiserver-operator","strategy":{"type":"Recreate"}}
@@ -8,2 +8,3 @@
 {"namespace":"openshift-cloud-credential-operator","name":"pod-identity-webhook","strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}}
+{"namespace":"openshift-cloud-network-config-controller","name":"cloud-network-config-controller","strategy":{"type":"Recreate"}}
 {"namespace":"openshift-cluster-csi-drivers","name":"aws-ebs-csi-driver-controller","strategy":{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1},"type":"RollingUpdate"}}

So I expect this was more along the lines of "we might want this", not "we actually need this". Per the verification [3], this does help protect the cluster from admins patching CVO-managed deployments with custom update strategies (which perhaps the patched deployment is not prepared to support), but that seems narrow enough that I don't think it's worth release docs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1448662625234718720
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1488836164038168576
[3]: https://github.com/openshift/cluster-version-operator/pull/650#issuecomment-924155597
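One way to calm the Deployment hotloop while keeping that admin-patch protection is to enforce .spec.strategy only when the manifest actually sets one, so an omitted strategy never fights the API server's defaults. A sketch of that idea with hypothetical naming, not necessarily how the merged fix is implemented:

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// ensureDeploymentStrategy (hypothetical name) only stomps the in-cluster strategy
// when the manifest specifies one; an omitted strategy leaves the API-server-defaulted
// RollingUpdate settings alone instead of diffing against the zero value on every sync.
func ensureDeploymentStrategy(modified *bool, existing *appsv1.DeploymentSpec, required appsv1.DeploymentSpec) {
	if required.Strategy.Type == "" {
		return
	}
	if !equality.Semantic.DeepEqual(existing.Strategy, required.Strategy) {
		*modified = true
		existing.Strategy = required.Strategy
	}
}

With that guard, a manifest that does declare a strategy is still reconciled over any admin patch, which is the narrow protection described above.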
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056