Bug 1753120 - [osp-autoscaler] Race condition when new node comes, scheduler bind pending pods before static pods with requests resources sync to kube-apiserver
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Joel Smith
QA Contact: Cameron Meadors
Depends On:
Reported: 2019-09-18 07:10 UTC by sunzhaohua
Modified: 2020-10-27 15:54 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-10-27 15:54:19 UTC
Target Upstream Version:

Attachments
autoscaler log (104.68 KB, text/plain)
2019-09-19 11:00 UTC, sunzhaohua

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:54:52 UTC

Description sunzhaohua 2019-09-18 07:10:43 UTC
Description of problem:
This issue can be reproduced on IPI on OSP or IPI on BM, since both have static pods (coredns, keepalived and mdns-publisher) that set resources.requests.
Related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1753067

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create the ClusterAutoscaler and MachineAutoscaler CRs
2. oc adm new-project openshift-kni-infra
3. Create a deployment to scale up the cluster:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 15
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
4. Check the pods

Actual results:
After a while, some pods go into OutOfmemory status:
$ oc get pod
NAME                                           READY   STATUS        RESTARTS   AGE
cluster-autoscaler-default-5dd4b8d85-dtrzz     1/1     Running       0          23h
cluster-autoscaler-operator-59b86c4d95-4r5wb   1/1     Running       0          24h
machine-api-controllers-776587cf7d-9ddqx       3/3     Running       0          24h
machine-api-operator-5bc8f8df49-pnf4c          1/1     Running       0          24h
scale-up-5f76786964-24tlg                      1/1     Running       0          9m35s
scale-up-5f76786964-252lw                      0/1     OutOfmemory   0          56s
scale-up-5f76786964-2cvmm                      0/1     OutOfmemory   0          38s
scale-up-5f76786964-2k7cc                      0/1     OutOfmemory   0          4m39s
scale-up-5f76786964-2kngx                      0/1     OutOfmemory   0          64s
scale-up-5f76786964-2wmk2                      0/1     OutOfmemory   0          60s
scale-up-5f76786964-2z5jc                      1/1     Running       0          9m35s
scale-up-5f76786964-4fbv4                      0/1     OutOfmemory   0          4m33s
scale-up-5f76786964-4fdlx                      0/1     OutOfmemory   0          49s
scale-up-5f76786964-4n5c4                      1/1     Running       0          9m35s
scale-up-5f76786964-4n8tr                      0/1     OutOfmemory   0          73s

Expected results:
The autoscaler works correctly and no pods end up in OutOfmemory status.

Additional info:

Comment 1 weiwei jiang 2019-09-18 08:30:44 UTC
This issue happens after applying the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1753067#c3

Comment 3 Alberto 2019-09-19 07:24:43 UTC
Can you share the autoscaler logs?

Comment 4 sunzhaohua 2019-09-19 11:00:30 UTC
Created attachment 1616672
autoscaler log

Comment 6 Jan Chaloupka 2019-09-19 14:12:10 UTC
Does every node require all three (coredns, keepalived and mdns-publisher) static pods?

Comment 8 sunzhaohua 2019-09-20 01:48:37 UTC
@Joel Smith, I think you are right: the static pods' mirror pods are created after the scheduler may already have scheduled other workloads onto the node, which leaves those workload pods in OutOfmemory status. The autoscaler is working as expected. It only handles pods in the Pending state, but the added workload pods are always OutOfmemory.
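
A minimal sketch of the race described in this comment, in Python. It is not the real scheduler or kubelet logic. The node size (8 GiB allocatable), the static pod requests, and all function names are hypothetical. The only values taken from this report are the 2Gi workload request and the Pending and OutOfmemory statuses. The scheduler decides from the pods the API server knows about, while the kubelet checks against everything it is already running locally, including static pods whose mirror pods have not synced yet.

```python
from dataclasses import dataclass, field

GI = 1024 ** 3  # bytes in one GiB


@dataclass
class Node:
    allocatable: int                               # memory the kubelet can hand out
    admitted: list = field(default_factory=list)   # requests of locally admitted pods

    def kubelet_admit(self, request: int) -> str:
        """Kubelet-side check against everything already admitted on the node."""
        if sum(self.admitted) + request > self.allocatable:
            return "OutOfmemory"
        self.admitted.append(request)
        return "Running"


def scheduler_fits(allocatable: int, api_view: list, request: int) -> bool:
    """Scheduler-side check: only sees pods known to the API server."""
    return sum(api_view) + request <= allocatable


def simulate(mirror_pods_synced: bool, workload_pods: int = 4) -> list:
    node = Node(allocatable=8 * GI)                # hypothetical node size
    # Hypothetical requests for coredns, keepalived and mdns-publisher.
    static_requests = [GI // 2, GI // 4, GI // 4]
    for request in static_requests:
        node.kubelet_admit(request)                # static pods start from local manifests
    # Until the mirror pods exist, the API server has no record of these requests.
    api_view = list(static_requests) if mirror_pods_synced else []

    statuses = []
    for _ in range(workload_pods):
        if not scheduler_fits(node.allocatable, api_view, 2 * GI):
            statuses.append("Pending")             # stays visible to the autoscaler
            continue
        status = node.kubelet_admit(2 * GI)        # pod was bound, kubelet decides
        if status == "Running":
            api_view.append(2 * GI)
        statuses.append(status)
    return statuses


print(simulate(mirror_pods_synced=False))
print(simulate(mirror_pods_synced=True))
```

Under these assumptions, the unsynced case binds a fourth pod that the kubelet then rejects as OutOfmemory. Once the mirror pods are visible, the same pod stays Pending, which is the state the cluster autoscaler reacts to.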

Comment 11 Ryan Phillips 2019-11-12 01:50:20 UTC
I would be curious if the following patch helps this issue [1]. This BZ was created at around the same time as [1] merged.

1. https://github.com/openshift/origin/pull/23812

Comment 13 Joel Smith 2020-06-01 04:29:50 UTC
I didn't manage to test the backport yet, but I'll try to do it next sprint.

Comment 15 Joel Smith 2020-07-06 13:04:55 UTC
This appears to be fixed in 4.5, based upon my testing.

Whether because of https://github.com/openshift/origin/pull/23812 or something else, the current behavior is that a static pod will preempt a pod that has been scheduled to a node if the node doesn't have enough resources for the static pod.
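
The behavior described above can be pictured with a small Python sketch. This is an illustration only, not the kubelet's actual preemption algorithm: the function name, the eviction order (most recently listed pod first), the pod names, and the GiB figures are all assumptions.

```python
def admit_static_pod(allocatable_gib, running, static_request_gib):
    """Make room for a static pod by evicting already-scheduled pods.

    `running` maps pod name -> memory request in GiB (insertion-ordered).
    Returns (evicted_names, remaining_running).
    """
    remaining = dict(running)
    evicted = []
    used = sum(remaining.values())
    # Assumption: evict the most recently listed pods first until the static pod fits.
    for name in reversed(list(running)):
        if used + static_request_gib <= allocatable_gib:
            break
        used -= remaining.pop(name)
        evicted.append(name)
    if used + static_request_gib > allocatable_gib:
        raise ValueError("static pod does not fit even on an empty node")
    return evicted, remaining


# Hypothetical node: 8 GiB allocatable, fully used by four 2 GiB workload pods.
running = {"scale-up-a": 2, "scale-up-b": 2, "scale-up-c": 2, "scale-up-d": 2}
evicted, remaining = admit_static_pod(8, running, static_request_gib=1)
print(evicted, sorted(remaining))
```

In this sketch one workload pod is evicted so that the static pod can start. An evicted pod managed by a Deployment would be recreated and go back through the scheduler.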

Comment 19 Cameron Meadors 2020-07-21 18:31:15 UTC
I have tried a few autoscaling scenarios. All include static pods on all worker nodes. I have not seen a pod in status OutOfmemory. Pods are correctly in Pending when they trigger a scaling event and eventually deploy to the new node. Tested on the latest release (4.5.2).

I would say this is verified, but I am seeing unexpected behavior with the autoscaler: multiple nodes get spun up when only one should be needed to satisfy the memory requests, and removing the reproducer deployment doesn't scale back down completely. I am double-checking my math and what the expected behavior of the autoscaler is in the latest code.
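
The capacity math mentioned above can be sketched roughly as follows, in Python. This is a back-of-the-envelope estimate on memory requests only. The node size and static pod overhead are hypothetical, and the real cluster autoscaler also accounts for CPU, pod count limits, and other scheduling constraints.

```python
import math


def nodes_needed(pending_pods, pod_request_gib, node_allocatable_gib, static_pods_gib):
    """Estimate how many new nodes the pending pods need, by memory requests only."""
    usable = node_allocatable_gib - static_pods_gib   # memory left after static pods
    per_node = int(usable // pod_request_gib)         # whole workload pods per node
    if per_node < 1:
        raise ValueError("a single pod does not fit on an empty node")
    return math.ceil(pending_pods / per_node)


# Hypothetical worker: 8 GiB allocatable, 1 GiB requested by static pods,
# workload pods requesting 2 GiB each as in the reproducer.
print(nodes_needed(3, 2, 8, 1))    # three pending pods
print(nodes_needed(15, 2, 8, 1))   # the full 15-replica deployment
```

With these assumed numbers, three pending pods fit on one new node, so a scale-up of more than one node for that case would be worth a closer look.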

Comment 20 Cameron Meadors 2020-07-22 20:07:15 UTC
I am going to stand by my statement that this has been verified as fixed.  All other issues are unrelated and I can track them down separately.  I am a little concerned that it is not clear what actually fixed it, but it is fixed.

Comment 22 errata-xmlrpc 2020-10-27 15:54:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

