Attempting to merge the support operator has triggered a bug in the replica set controller: the first time the operator deployment is updated, it goes into an infinite loop of hash collisions, creating and deleting the pod endlessly. This is 100% reproducible on update from the first to the second deployment (to reproduce, deploy the support operator and then tweak its image location to point at an identical mirror).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_support-operator/9/pull-ci-openshift-support-operator-master-e2e-aws-upgrade/7

...
I0523 04:02:56.675984 1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (6->7) to resolve it
I0523 04:02:56.676020 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-84cbf58c9c" already exists

... later ...

I0523 04:08:52.718873 1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (81->82) to resolve it
I0523 04:08:52.718912 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-599dc4958" already exists
I0523 04:08:52.737010 1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-6dd85c97cc, need 1, creating 1
I0523 04:08:52.737826 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32661", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-6dd85c97cc to 1
I0523 04:08:52.746483 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6dd85c97cc", UID:"79c39631-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32662", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-6dd85c97cc-pzcwm
I0523 04:08:52.758531 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:08:52.775403 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on replicasets.apps "support-operator-6dd85c97cc": the object has been modified; please apply your changes to the latest version and try again
I0523 04:08:52.791419 1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-6dd85c97cc, need 0, deleting 1
I0523 04:08:52.791472 1 controller_utils.go:598] Controller support-operator-6dd85c97cc deleting pod openshift-support/support-operator-6dd85c97cc-pzcwm
I0523 04:08:52.792028 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32663", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set support-operator-6dd85c97cc to 0
I0523 04:08:52.804290 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6dd85c97cc", UID:"79c39631-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32671", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: support-operator-6dd85c97cc-pzcwm
I0523 04:09:04.205503 1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (82->83) to resolve it
I0523 04:09:04.205548 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-6dd85c97cc" already exists
I0523 04:09:04.227864 1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-575c956f88, need 1, creating 1
I0523 04:09:04.228601 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32751", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-575c956f88 to 1
I0523 04:09:04.241293 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-575c956f88", UID:"809c4333-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32752", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-575c956f88-9cdvq
I0523 04:09:04.242974 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:09:04.259644 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on replicasets.apps "support-operator-575c956f88": the object has been modified; please apply your changes to the latest version and try again
I0523 04:09:04.276331 1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-575c956f88, need 0, deleting 1
...

One of the earlier chunks:
I0523 04:02:56.174997 1 deployment_controller.go:484] Error syncing deployment openshift-image-registry/image-registry: Operation cannot be fulfilled on replicasets.apps "image-registry-5f788b4c79": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.176247 1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (5->6) to resolve it
I0523 04:02:56.176270 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-578db4fdf8" already exists
I0523 04:02:56.176681 1 replica_set.go:516] Too many replicas for ReplicaSet openshift-marketplace/certified-operators-6f96675f4, need 0, deleting 1
I0523 04:02:56.176729 1 controller_utils.go:598] Controller certified-operators-6f96675f4 deleting pod openshift-marketplace/certified-operators-6f96675f4-jxvdg
I0523 04:02:56.182351 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-marketplace", Name:"certified-operators", UID:"4c0b1ed2-7d0d-11e9-8cad-0e4e0ac820e6", APIVersion:"apps/v1", ResourceVersion:"21199", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set certified-operators-6f96675f4 to 0
I0523 04:02:56.205528 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21204", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-84cbf58c9c to 1
I0523 04:02:56.210491 1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-84cbf58c9c, need 1, creating 1
I0523 04:02:56.215001 1 replica_set.go:477] Too few replicas for ReplicaSet openshift-image-registry/image-registry-5f788b4c79, need 1, creating 1
I0523 04:02:56.215658 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-image-registry", Name:"image-registry", UID:"4adaec14-7d0d-11e9-ae07-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21183", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set image-registry-5f788b4c79 to 1
I0523 04:02:56.229165 1 deployment_controller.go:484] Error syncing deployment openshift-marketplace/certified-operators: Operation cannot be fulfilled on deployments.apps "certified-operators": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.236074 1 deployment_controller.go:484] Error syncing deployment openshift-image-registry/image-registry: Operation cannot be fulfilled on deployments.apps "image-registry": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.236160 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.242832 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-84cbf58c9c", UID:"a53fb46f-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21213", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.266646 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-marketplace", Name:"certified-operators-6f96675f4", UID:"a50342ab-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21205", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: certified-operators-6f96675f4-jxvdg
I0523 04:02:56.266688 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21218", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set support-operator-84cbf58c9c to 0
I0523 04:02:56.316681 1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-84cbf58c9c, need 0, deleting 1
I0523 04:02:56.316846 1 controller_utils.go:598] Controller support-operator-84cbf58c9c deleting pod openshift-support/support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.332971 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.408169 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-image-registry", Name:"image-registry-5f788b4c79", UID:"a05fffb6-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21216", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: image-registry-5f788b4c79-xq6w8
I0523 04:02:56.574779 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-84cbf58c9c", UID:"a53fb46f-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21225", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.675984 1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (6->7) to resolve it
I0523 04:02:56.676020 1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-84cbf58c9c" already exists
I0523 04:02:56.738921 1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-6954696768, need 1, creating 1
I0523 04:02:56.745524 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21312", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-6954696768 to 1
I0523 04:02:56.759711 1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6954696768", UID:"a591891a-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21318", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-6954696768-ctvrw
I0523 04:02:56.794452 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-marketplace", Name:"community-operators", UID:"4c45e496-7d0d-11e9-8cad-0e4e0ac820e6", APIVersion:"apps/v1", ResourceVersion:"21320", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set community-operators-5b69c4fbff to 0

Setting to urgent because this blocks rolling out the support operator; a workaround would let us drop this to high. Does not appear to be a 4.1 issue, just 4.2 post-rebase.
I tracked this down.

ReplicaSet spec:

{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{},"name":"snapshots"}]}

Deployment spec:

{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{"sizeLimit":"1Gi"},"name":"snapshots"}]}

The only difference is emptyDir: {sizeLimit: "1Gi"}.

Also, we create 487 replica sets (which are still around at the end of the run), which means the revision history limit (10) is not being enforced for some reason. That is a separate bug.
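The mismatch can be found mechanically. A minimal sketch (the two dicts below are trimmed-down stand-ins for just the "volumes" stanzas of the specs above, not the full objects) that recursively walks both structures and reports the differing leaf:

```python
# Minimal sketch: recursively diff two (trimmed) pod-spec fragments and
# report every leaf path where they disagree.

def diff(a, b, path=""):
    """Yield (path, left, right) for every leaf where a and b differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            yield from diff(a.get(key), b.get(key), f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        for i, (x, y) in enumerate(zip(a, b)):
            yield from diff(x, y, f"{path}[{i}]")
    elif a != b:
        yield (path, a, b)

# What the ReplicaSet actually stored (sizeLimit is gone):
replicaset_volumes = {"volumes": [{"name": "snapshots", "emptyDir": {}}]}
# What the Deployment asks for:
deployment_volumes = {"volumes": [{"name": "snapshots",
                                   "emptyDir": {"sizeLimit": "1Gi"}}]}

for p, left, right in diff(replicaset_volumes, deployment_volumes):
    print(f"{p}: {left!r} != {right!r}")
# -> .volumes[0].emptyDir.sizeLimit: None != '1Gi'
```

Because the stored template never equals the desired template, the deployment controller keeps treating the existing ReplicaSet as a hash collision.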
Here's the pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: support-operator
  namespace: openshift-support
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: support-operator
  template:
    metadata:
      labels:
        app: support-operator
    spec:
      serviceAccountName: operator
      priorityClassName: system-cluster-critical
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 900
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 900
      volumes:
      - name: snapshots
        emptyDir:
          sizeLimit: 1Gi
      containers:
      - name: operator
        image: quay.io/openshift/origin-support-operator:latest
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - name: snapshots
          mountPath: /var/lib/support-operator
        ports:
        - containerPort: 8443
          name: https
        resources:
          requests:
            cpu: 10m
            memory: 30Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RELEASE_VERSION
          value: "0.0.1-snapshot"
        args:
        - start
        - -v=4
        - --config=/etc/support-operator/server.yaml

This is going to stay urgent because I'm not positive this isn't a 1.13 bug too.
Ok, so the deployment had sizeLimit set (which means LocalStorageCapacityIsolation was on long enough for it to get set). Then LocalStorageCapacityIsolation got turned off, and the replica set controller went into a hot loop: https://github.com/kubernetes/kubernetes/issues/57167
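The mechanics of the loop can be sketched as follows (helper names here are hypothetical stand-ins; in real Kubernetes the field dropping happens in the apps API strategy and the hash in the deployment controller's pod-template hashing): with the gate off, the API server silently strips sizeLimit from anything the controller creates, so the stored ReplicaSet never hashes to what the Deployment expects, and bumping collisionCount just produces another doomed name:

```python
import hashlib
import json

def drop_disabled_fields(template, gate_enabled):
    """Simulates admission with LocalStorageCapacityIsolation off:
    emptyDir.sizeLimit is silently removed (hypothetical stand-in for
    the real field-dropping logic in the apps strategy)."""
    t = json.loads(json.dumps(template))  # deep copy
    if not gate_enabled:
        for vol in t.get("volumes", []):
            vol.get("emptyDir", {}).pop("sizeLimit", None)
    return t

def template_hash(template, collision_count=0):
    """Stand-in for the pod-template hash that names the ReplicaSet."""
    payload = json.dumps(template, sort_keys=True) + f"/{collision_count}"
    return hashlib.sha256(payload.encode()).hexdigest()[:10]

deployment_template = {"volumes": [{"name": "snapshots",
                                    "emptyDir": {"sizeLimit": "1Gi"}}]}

collisions = 0
for _ in range(3):
    want = template_hash(deployment_template, collisions)
    # The controller creates a ReplicaSet named after `want`, but the
    # stored template comes back mutated, so its hash never matches:
    stored = drop_disabled_fields(deployment_template, gate_enabled=False)
    have = template_hash(stored, collisions)
    assert have != want   # looks like a "hash collision" every time
    collisions += 1       # bumping collisionCount can never converge
print(f"still colliding after {collisions} attempts")
```

Each iteration maps to one "Found a hash collision ... bumping collisionCount" line in the logs above.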
So the bugs that need to be tracked down:

1. Why does the hot loop not get revision cleanup?
2. Why is LocalStorageCapacityIsolation off in 4.1 (tested just now on a 4.1 cluster)?
3. Why, during an upgrade from 4.2 to 4.2 when the operator is installed, is sizeLimit allowed to be created but then immediately dropped?
4. How can we fix the broken deployment / replica set hot loop? As it stands, a user toggling certain feature gates (those that control DisabledFields) could cause a bunch of deployments to go haywire.
> 1. Why does the hot loop not get revision cleanup?

It shares the same sync loop, which is failing. Moreover, it cleans up only after the latest deployment is complete: https://github.com/openshift/origin/blob/efc7e25e7d1475b7c0c6caa74093cdad64d467e9/vendor/k8s.io/kubernetes/pkg/controller/deployment/recreate.go#L66-L70

2. and 3. are tracked in their own bug: https://bugzilla.redhat.com/show_bug.cgi?id=1713479

> 4. How can we fix the broken deployment / replica set hot loop? As it stands, a user toggling certain feature gates (those that control DisabledFields) could cause a bunch of deployments to go haywire.

We have long-standing issues with workload controllers and mutating admission. This is again a case where we compare the podSpec, which obviously breaks and cannot recover, because something other than what the controller asked for gets created. To limit the fallout we could add backoff, and/or we can invest time in solving this issue upstream. We actually reopened that issue a few SIG-Apps meetings back, so there is interest in fixing it, but it is missing resources. We could either make the workload controllers purely revision/generation based and figure out the other issues caused by that, or use dry-run, but that would need some smart heuristics to avoid performance hiccups, and there are also issues if the admission is not idempotent.
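The backoff idea mentioned above could look roughly like this (a generic sketch, not the actual controller code; real Kubernetes controllers get this from workqueue rate limiters): instead of retrying the failing sync immediately, requeue with exponentially growing delays so a permanently failing deployment stops hammering the API server:

```python
# Generic exponential-backoff sketch for a failing controller sync loop
# (illustrative only; parameter names and values are made up here).

def backoff_delays(base=0.005, cap=1000.0, max_retries=10):
    """Delay in seconds before each retry: base * 2^n, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(max_retries)]

delays = backoff_delays(base=1.0, cap=300.0, max_retries=6)
print(delays)  # -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

With a cap, even a deployment that can never converge (like the hot loop above) settles into one retry every few minutes instead of hundreds of collisionCount bumps.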
I have started the discussion upstream, and David came up with another idea based on server-side apply, but it won't land sooner than 4.3/4.4.
This won't be fixed in the 4.3 time frame, moving to 4.4.
Closing in favor of https://issues.redhat.com/browse/WRKLDS-162