Bug 1713207 - ReplicaSet goes into an infinite loop of recreating pods due to hash collision
Summary: ReplicaSet goes into an infinite loop of recreating pods due to hash collision
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Tomáš Nožička
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1713479
 
Reported: 2019-05-23 07:15 UTC by Clayton Coleman
Modified: 2020-05-05 13:13 UTC
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1713479 1720308 (view as bug list)
Environment:
Last Closed: 2020-05-05 13:13:01 UTC
Target Upstream Version:
Embargoed:



Description Clayton Coleman 2019-05-23 07:15:51 UTC
Attempting to merge the support operator has triggered some form of bug in the replica set controller - the first time the operator deployment is updated, it goes into an infinite loop of collisions, creating and deleting the pod endlessly.  This is 100% reproducible on update from the first to the second deployment (so to reproduce, deploy the support operator and then tweak its image location to point at an identical mirror).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_support-operator/9/pull-ci-openshift-support-operator-master-e2e-aws-upgrade/7

...
I0523 04:02:56.675984       1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (6->7) to resolve it
I0523 04:02:56.676020       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-84cbf58c9c" already exists
...

Later:

...
I0523 04:08:52.718873       1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (81->82) to resolve it
I0523 04:08:52.718912       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-599dc4958" already exists
I0523 04:08:52.737010       1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-6dd85c97cc, need 1, creating 1
I0523 04:08:52.737826       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32661", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-6dd85c97cc to 1
I0523 04:08:52.746483       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6dd85c97cc", UID:"79c39631-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32662", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-6dd85c97cc-pzcwm
I0523 04:08:52.758531       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:08:52.775403       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on replicasets.apps "support-operator-6dd85c97cc": the object has been modified; please apply your changes to the latest version and try again
I0523 04:08:52.791419       1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-6dd85c97cc, need 0, deleting 1
I0523 04:08:52.791472       1 controller_utils.go:598] Controller support-operator-6dd85c97cc deleting pod openshift-support/support-operator-6dd85c97cc-pzcwm
I0523 04:08:52.792028       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32663", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set support-operator-6dd85c97cc to 0
I0523 04:08:52.804290       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6dd85c97cc", UID:"79c39631-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32671", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: support-operator-6dd85c97cc-pzcwm
I0523 04:09:04.205503       1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (82->83) to resolve it
I0523 04:09:04.205548       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-6dd85c97cc" already exists
I0523 04:09:04.227864       1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-575c956f88, need 1, creating 1
I0523 04:09:04.228601       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"32751", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-575c956f88 to 1
I0523 04:09:04.241293       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-575c956f88", UID:"809c4333-7d10-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"32752", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-575c956f88-9cdvq
I0523 04:09:04.242974       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:09:04.259644       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on replicasets.apps "support-operator-575c956f88": the object has been modified; please apply your changes to the latest version and try again
I0523 04:09:04.276331       1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-575c956f88, need 0, deleting 1
...

Here is one of the earlier log chunks:

I0523 04:02:56.174997       1 deployment_controller.go:484] Error syncing deployment openshift-image-registry/image-registry: Operation cannot be fulfilled on replicasets.apps "image-registry-5f788b4c79": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.176247       1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (5->6) to resolve it
I0523 04:02:56.176270       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-578db4fdf8" already exists
I0523 04:02:56.176681       1 replica_set.go:516] Too many replicas for ReplicaSet openshift-marketplace/certified-operators-6f96675f4, need 0, deleting 1
I0523 04:02:56.176729       1 controller_utils.go:598] Controller certified-operators-6f96675f4 deleting pod openshift-marketplace/certified-operators-6f96675f4-jxvdg
I0523 04:02:56.182351       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-marketplace", Name:"certified-operators", UID:"4c0b1ed2-7d0d-11e9-8cad-0e4e0ac820e6", APIVersion:"apps/v1", ResourceVersion:"21199", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set certified-operators-6f96675f4 to 0
I0523 04:02:56.205528       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21204", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-84cbf58c9c to 1
I0523 04:02:56.210491       1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-84cbf58c9c, need 1, creating 1
I0523 04:02:56.215001       1 replica_set.go:477] Too few replicas for ReplicaSet openshift-image-registry/image-registry-5f788b4c79, need 1, creating 1
I0523 04:02:56.215658       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-image-registry", Name:"image-registry", UID:"4adaec14-7d0d-11e9-ae07-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21183", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set image-registry-5f788b4c79 to 1
I0523 04:02:56.229165       1 deployment_controller.go:484] Error syncing deployment openshift-marketplace/certified-operators: Operation cannot be fulfilled on deployments.apps "certified-operators": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.236074       1 deployment_controller.go:484] Error syncing deployment openshift-image-registry/image-registry: Operation cannot be fulfilled on deployments.apps "image-registry": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.236160       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.242832       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-84cbf58c9c", UID:"a53fb46f-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21213", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.266646       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-marketplace", Name:"certified-operators-6f96675f4", UID:"a50342ab-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21205", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: certified-operators-6f96675f4-jxvdg
I0523 04:02:56.266688       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21218", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set support-operator-84cbf58c9c to 0
I0523 04:02:56.316681       1 replica_set.go:516] Too many replicas for ReplicaSet openshift-support/support-operator-84cbf58c9c, need 0, deleting 1
I0523 04:02:56.316846       1 controller_utils.go:598] Controller support-operator-84cbf58c9c deleting pod openshift-support/support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.332971       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: Operation cannot be fulfilled on deployments.apps "support-operator": the object has been modified; please apply your changes to the latest version and try again
I0523 04:02:56.408169       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-image-registry", Name:"image-registry-5f788b4c79", UID:"a05fffb6-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21216", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: image-registry-5f788b4c79-xq6w8
I0523 04:02:56.574779       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-84cbf58c9c", UID:"a53fb46f-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21225", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: support-operator-84cbf58c9c-fdpqd
I0523 04:02:56.675984       1 sync.go:251] Found a hash collision for deployment "support-operator" - bumping collisionCount (6->7) to resolve it
I0523 04:02:56.676020       1 deployment_controller.go:484] Error syncing deployment openshift-support/support-operator: replicasets.apps "support-operator-84cbf58c9c" already exists
I0523 04:02:56.738921       1 replica_set.go:477] Too few replicas for ReplicaSet openshift-support/support-operator-6954696768, need 1, creating 1
I0523 04:02:56.745524       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-support", Name:"support-operator", UID:"7fca421a-7d0c-11e9-abe1-129a16ed0c20", APIVersion:"apps/v1", ResourceVersion:"21312", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up replica set support-operator-6954696768 to 1
I0523 04:02:56.759711       1 event.go:209] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openshift-support", Name:"support-operator-6954696768", UID:"a591891a-7d0f-11e9-b30f-0af47f16c66e", APIVersion:"apps/v1", ResourceVersion:"21318", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: support-operator-6954696768-ctvrw
I0523 04:02:56.794452       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-marketplace", Name:"community-operators", UID:"4c45e496-7d0d-11e9-8cad-0e4e0ac820e6", APIVersion:"apps/v1", ResourceVersion:"21320", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set community-operators-5b69c4fbff to 0

Setting to urgent because this blocks rolling out support operator - a workaround would let us drop this to high.  Does not appear to be a 4.1 issue, just 4.2 post-rebase.

Comment 1 Clayton Coleman 2019-05-23 18:55:49 UTC
I tracked this down:

ReplicaSet spec

{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{},"name":"snapshots"}]}

Deployment spec

{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{"sizeLimit":"1Gi"},"name":"snapshots"}]}

The only difference is emptyDir: {sizeLimit: "1Gi"} in the Deployment spec, which is missing from the ReplicaSet spec.
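The mechanics of the resulting collision loop can be sketched in a few lines (a simplified illustration, not the actual controller code - the real controller uses an FNV hash of the serialized PodTemplateSpec mixed with collisionCount, and the helper names here are hypothetical):

```python
import hashlib
import json

def pod_template_hash(template: dict, collision_count: int) -> str:
    """Simplified stand-in for the deployment controller's template hash."""
    payload = json.dumps(template, sort_keys=True) + str(collision_count)
    return hashlib.sha256(payload.encode()).hexdigest()[:10]

# What the Deployment asks for...
deployment_template = {"volumes": [{"name": "snapshots",
                                    "emptyDir": {"sizeLimit": "1Gi"}}]}
# ...and what got persisted after the API server dropped sizeLimit
# because the LocalStorageCapacityIsolation feature gate was off.
stored_template = {"volumes": [{"name": "snapshots", "emptyDir": {}}]}

# The hashes never match, so on every sync the controller concludes the
# stored ReplicaSet is not the one it wanted and tries again.
assert pod_template_hash(deployment_template, 0) != pod_template_hash(stored_template, 0)
```

Because the persisted template can never hash to the same value as the Deployment's template, no collisionCount bump can ever resolve the "collision".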

Also, we create 487 replica sets (which are still around at the end of the run), which means that the revision history limit (10) is not being enforced for some reason.  That is a separate bug.

Comment 2 Clayton Coleman 2019-05-23 18:57:34 UTC
Here's the deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: support-operator
  namespace: openshift-support
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: support-operator
  template:
    metadata:
      labels:
        app: support-operator
    spec:
      serviceAccountName: operator
      priorityClassName: system-cluster-critical
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 900
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 900
      volumes:
      - name: snapshots
        emptyDir:
          sizeLimit: 1Gi
      containers:
      - name: operator
        image: quay.io/openshift/origin-support-operator:latest
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - name: snapshots
          mountPath: /var/lib/support-operator
        ports:
        - containerPort: 8443
          name: https
        resources:
          requests:
            cpu: 10m
            memory: 30Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RELEASE_VERSION
          value: "0.0.1-snapshot"
        args:
        - start
        - -v=4
        - --config=/etc/support-operator/server.yaml

This is going to stay urgent because I'm not positive this isn't a 1.13 bug too.

Comment 3 Clayton Coleman 2019-05-23 19:44:17 UTC
Ok, so the deployment had sizeLimit set (which means LocalStorageCapacityIsolation was on long enough for it to get set).  Then LocalStorageCapacityIsolation got turned off, and the replica set went into a hot loop:

https://github.com/kubernetes/kubernetes/issues/57167
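The hot loop can be modeled end to end (a hypothetical simulation, not the actual controller code): admission strips the disabled field on create, so the stored ReplicaSet never deep-equals the Deployment's template, and bumping collisionCount only produces a fresh name for the next equally doomed ReplicaSet:

```python
import copy
import hashlib
import json

def template_hash(template, collision_count):
    payload = json.dumps(template, sort_keys=True) + str(collision_count)
    return hashlib.sha256(payload.encode()).hexdigest()[:10]

def strip_disabled_fields(template):
    """Models admission with the LocalStorageCapacityIsolation gate off."""
    t = copy.deepcopy(template)
    for v in t.get("volumes", []):
        v.get("emptyDir", {}).pop("sizeLimit", None)
    return t

def sync_once(deploy_template, replica_sets, collision_count):
    """One deployment sync: reuse a matching ReplicaSet, else create one."""
    if any(rs == deploy_template for rs in replica_sets.values()):
        return collision_count, True             # steady state reached
    name = "support-operator-" + template_hash(deploy_template, collision_count)
    if name in replica_sets:
        return collision_count + 1, False        # "already exists" -> bump
    replica_sets[name] = strip_disabled_fields(deploy_template)
    return collision_count, False

template = {"volumes": [{"name": "snapshots", "emptyDir": {"sizeLimit": "1Gi"}}]}
replica_sets, cc = {}, 0
for _ in range(20):
    cc, stable = sync_once(template, replica_sets, cc)

assert not stable               # the rollout never converges
assert cc == 10                 # collisionCount climbs forever (6->7, 81->82, ...)
assert len(replica_sets) == 10  # ReplicaSets pile up, as in the 487-RS run
```

This mirrors the log above: each "already exists" bump is followed by a ScalingReplicaSet event for a new hash, then the cycle repeats.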

Comment 4 Clayton Coleman 2019-05-23 19:49:58 UTC
So the bugs that need to be tracked down:

1. Why does the hot loop not get revision cleanup?
2. Why is LocalStorageCapacityIsolation off in 4.1 (tested just now on a 4.1 cluster)?
3. Why, during an upgrade from 4.2 to 4.2 when the operator is installed, is sizeLimit allowed to be created, but then immediately turned off?
4. How can we fix the broken deployment / replica set hot loop?  As it stands, a user toggling certain feature gates (those that control DisabledFields) could cause a bunch of deployments to go haywire.

Comment 5 Tomáš Nožička 2019-05-24 14:55:40 UTC
> 1. Why does the hot loop not get revision cleanup?

It shares the same sync loop, which is failing. Moreover, it cleans up only after the latest deployment is complete:
https://github.com/openshift/origin/blob/efc7e25e7d1475b7c0c6caa74093cdad64d467e9/vendor/k8s.io/kubernetes/pkg/controller/deployment/recreate.go#L66-L70
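In effect (a hypothetical simplification of the linked recreate.go logic): old ReplicaSets are only pruned down to the revision history limit once the newest rollout completes, so a sync loop that never completes accumulates them without bound:

```python
def cleanup_old_replica_sets(replica_sets, newest_complete, history_limit=10):
    """Prune old ReplicaSets only when the latest rollout has finished."""
    if not newest_complete:
        return replica_sets                  # failing sync -> nothing pruned
    return replica_sets[-history_limit:]     # keep only the newest N

old = [f"support-operator-{i:03d}" for i in range(487)]
assert len(cleanup_old_replica_sets(old, newest_complete=False)) == 487
assert len(cleanup_old_replica_sets(old, newest_complete=True)) == 10
```

That is why the hot-looping deployment left 487 ReplicaSets behind despite the limit of 10.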

> 2. and 3.

These are tracked in their own bug: https://bugzilla.redhat.com/show_bug.cgi?id=1713479

> How can we fix the broken deployment / replica set hot loop - this means if a user toggles certain feature gates (those that control DisabledFields) you could cause a bunch of deployments to go haywire?

We have long-standing issues with workload controllers and mutating admission. This is again a case where we compare the podSpec, which obviously breaks and can't recover, because something other than what the controller asked to create gets created.

To limit the fallout we could do backoff.

And/or we can invest time into solving this issue upstream. We actually reopened that issue a few SIG-Apps meetings back, so there is interest in fixing it, but it is missing resources. We could either make the workload controllers purely revision/generation based and figure out the other issues caused by that, or there was an option using dry-run, but that would need some smart heuristics to avoid performance hiccups, and there are also issues if admission is not idempotent.
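The revision/generation-based direction mentioned above could look roughly like this (a sketch with hypothetical function names): match ReplicaSets by the revision annotation the controller stamped, rather than by deep equality of pod templates, so admission mutations no longer trigger a rollout:

```python
REVISION = "deployment.kubernetes.io/revision"

def find_or_create_replica_set(deployment, replica_sets):
    """Match by recorded revision instead of deep-comparing pod templates."""
    for rs in replica_sets:
        if rs["annotations"].get(REVISION) == deployment["revision"]:
            return rs   # reused even though admission mutated its template
    rs = {"annotations": {REVISION: deployment["revision"]},
          "template": deployment["template"]}   # may be mutated on create
    replica_sets.append(rs)
    return rs

deployment = {"revision": "2",
              "template": {"emptyDir": {"sizeLimit": "1Gi"}}}
replica_sets = [{"annotations": {REVISION: "2"},
                 "template": {"emptyDir": {}}}]  # sizeLimit stripped by admission

rs = find_or_create_replica_set(deployment, replica_sets)
assert rs is replica_sets[0]   # existing ReplicaSet adopted
assert len(replica_sets) == 1  # no hot loop, no new ReplicaSet
```

The trade-off, as noted above, is figuring out the other issues this causes - e.g. detecting template changes that genuinely should roll out a new revision.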

Comment 6 Tomáš Nožička 2019-07-25 11:19:12 UTC
I have started the discussion upstream, and David came up with another idea based on server-side apply, but it won't land sooner than 4.3/4.4.

Comment 7 Maciej Szulik 2019-11-05 14:49:46 UTC
This won't be fixed in the 4.3 time frame, moving to 4.4.

Comment 10 Tomáš Nožička 2020-05-05 13:13:01 UTC
closing in favor of https://issues.redhat.com/browse/WRKLDS-162

