Bug 1713207
| Summary: | ReplicaSet goes into an infinite loop of recreating pods due to hash collision | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | kube-controller-manager | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED DEFERRED | QA Contact: | zhou ying <yinzhou> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | aos-bugs, gblomqui, jokerman, maszulik, mfojtik, mmccomas, tnozicka |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1713479 1720308 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-05 13:13:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1713479 | | |
Description
Clayton Coleman
2019-05-23 07:15:51 UTC
I tracked this down.

ReplicaSet spec:

```json
{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{},"name":"snapshots"}]}
```

Deployment spec:

```json
{"containers":[{"args":["start","-v=4","--config=/etc/support-operator/server.yaml"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"RELEASE_VERSION","value":"0.0.1-2019-05-23-032421"}],"image":"registry.svc.ci.openshift.org/ci-op-pwcc6sq3/stable@sha256:44fe273f63edcec5f1e3bf999c4f08d34a5db02b426a03105657b1db3a5aeffb","imagePullPolicy":"IfNotPresent","name":"operator","ports":[{"containerPort":8443,"name":"https","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"30Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","volumeMounts":[{"mountPath":"/var/lib/support-operator","name":"snapshots"}]}],"dnsPolicy":"ClusterFirst","nodeSelector":{"beta.kubernetes.io/os":"linux","node-role.kubernetes.io/master":""},"priorityClassName":"system-cluster-critical","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"operator","serviceAccountName":"operator","terminationGracePeriodSeconds":30,"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":900},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":900}],"volumes":[{"emptyDir":{"sizeLimit":"1Gi"},"name":"snapshots"}]}
```

The only difference is `emptyDir: {sizeLimit: "1Gi"}`.

Also, we create 487 replica sets (which are still around at the end of the run), which means that the revision history limit (10) is not working for some reason. That is a separate bug.
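For readers unfamiliar with the mechanism: the Deployment controller reuses an existing ReplicaSet only if its pod template deep-equals the Deployment's template (ignoring the pod-template-hash label); otherwise it creates a new ReplicaSet. A minimal Go sketch of that comparison pattern, using simplified stand-in structs rather than the real v1 API types, shows why a single admission-mutated field produces a new ReplicaSet on every sync:

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the real API types; the actual controller
// compares full v1.PodTemplateSpec values.
type emptyDir struct {
	sizeLimit string // "" when admission dropped the field
}

type podTemplate struct {
	image string
	dir   emptyDir
}

// equalIgnoreHash mimics the controller-side check: if no existing
// ReplicaSet template matches the Deployment template, the controller
// creates a new ReplicaSet.
func equalIgnoreHash(a, b podTemplate) bool {
	return reflect.DeepEqual(a, b)
}

func main() {
	// What the Deployment asks for (sizeLimit was set while the
	// LocalStorageCapacityIsolation gate was still on).
	want := podTemplate{image: "operator", dir: emptyDir{sizeLimit: "1Gi"}}

	// What actually gets persisted once the gate is off: admission
	// drops sizeLimit from the created ReplicaSet.
	got := podTemplate{image: "operator", dir: emptyDir{}}

	// Never true, so every sync spawns yet another ReplicaSet --
	// hence the 487 ReplicaSets observed above.
	fmt.Println("templates match:", equalIgnoreHash(want, got))
}
```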
Here's the pod spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: support-operator
  namespace: openshift-support
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: support-operator
  template:
    metadata:
      labels:
        app: support-operator
    spec:
      serviceAccountName: operator
      priorityClassName: system-cluster-critical
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 900
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 900
      volumes:
      - name: snapshots
        emptyDir:
          sizeLimit: 1Gi
      containers:
      - name: operator
        image: quay.io/openshift/origin-support-operator:latest
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - name: snapshots
          mountPath: /var/lib/support-operator
        ports:
        - containerPort: 8443
          name: https
        resources:
          requests:
            cpu: 10m
            memory: 30Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RELEASE_VERSION
          value: "0.0.1-snapshot"
        args:
        - start
        - -v=4
        - --config=/etc/support-operator/server.yaml
```

This is going to stay urgent because I'm not positive this isn't a 1.13 bug too.

OK, so the deployment had `sizeLimit` set (which means LocalStorageCapacityIsolation was on long enough for it to get set). Then LocalStorageCapacityIsolation got turned off, and the replica set went into a hot loop: https://github.com/kubernetes/kubernetes/issues/57167

So the bugs that need to be tracked down:

1. Why does the hot loop not get revision cleanup?
2. Why is LocalStorageCapacityIsolation off in 4.1 (tested just now on a 4.1 cluster)?
3. Why, during an upgrade from 4.2 to 4.2 when the operator is installed, is `sizeLimit` allowed to be created but then immediately turned off?
4. How can we fix the broken deployment / replica set hot loop? If a user toggles certain feature gates (those that control DisabledFields), they could cause a bunch of deployments to go haywire.

> 1. Why does the hot loop not get revision cleanup?

It shares the same sync loop, which is failing. Moreover, it cleans up only after the latest deployment is complete: https://github.com/openshift/origin/blob/efc7e25e7d1475b7c0c6caa74093cdad64d467e9/vendor/k8s.io/kubernetes/pkg/controller/deployment/recreate.go#L66-L70

> 2. / 3.

These are tracked in their own bug: https://bugzilla.redhat.com/show_bug.cgi?id=1713479

> 4. How can we fix the broken deployment / replica set hot loop?

We have long-standing issues with workload controllers and mutating admission. This is again the case where we compare the pod spec, which obviously breaks and can't recover, because something other than what the controller asked for gets created. To limit the fallout we could add a backoff, and/or we can invest time into solving this issue upstream. We actually reopened that issue a few SIG-Apps meetings back, so there is interest in fixing it, but it is short on resources. We could either make the workload controllers purely revision/generation based and figure out the other issues that causes, or go with the dry-run option, but that would need some smart heuristics to avoid performance hiccups, and there are also problems if the admission is not idempotent.
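To make the "DisabledFields" point concrete: when a feature gate is off, the API server clears the gated fields from incoming objects before persisting them, so the stored ReplicaSet never carries the field the Deployment asked for. A rough sketch of that pattern, with a hypothetical gate table and pared-down types (the real logic lives in the API strategy code in kubernetes/kubernetes):

```go
package main

import "fmt"

// Hypothetical stand-in for the feature-gate machinery.
var featureGates = map[string]bool{
	"LocalStorageCapacityIsolation": false, // gate toggled off
}

type emptyDirSource struct {
	SizeLimit string
}

type podSpec struct {
	EmptyDirs []emptyDirSource
}

// dropDisabledFields mirrors the persist-time pattern: fields guarded by
// a disabled gate are cleared before the object is stored, so the saved
// ReplicaSet no longer matches the template the Deployment submitted.
func dropDisabledFields(spec *podSpec) {
	if !featureGates["LocalStorageCapacityIsolation"] {
		for i := range spec.EmptyDirs {
			spec.EmptyDirs[i].SizeLimit = ""
		}
	}
}

func main() {
	spec := podSpec{EmptyDirs: []emptyDirSource{{SizeLimit: "1Gi"}}}
	dropDisabledFields(&spec)
	fmt.Printf("persisted spec: %+v\n", spec) // SizeLimit silently gone
}
```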
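The backoff suggested above would at least turn the hot loop into a slow one. A minimal, simplified sketch of the idea (illustrative only; real controllers use client-go workqueue rate limiters):

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns an exponentially growing retry delay, capped at five
// minutes, so a sync that can never succeed stops hammering the API.
func backoff(attempt int) time.Duration {
	d := time.Second << attempt // 1s, 2s, 4s, ...
	if limit := 5 * time.Minute; d > limit {
		d = limit
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		// Each failed sync (e.g. a template that admission will always
		// mutate) waits longer before retrying instead of hot-looping.
		fmt.Printf("attempt %d: retry in %s\n", attempt, backoff(attempt))
	}
}
```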
I have started the discussion upstream, and David came up with another idea based on server-side apply, but it won't land sooner than 4.3/4.4.

This won't be fixed in the 4.3 time frame; moving to 4.4.

Closing in favor of https://issues.redhat.com/browse/WRKLDS-162