Bug 1485464 - Kubernetes favors certain nodes until they die
Summary: Kubernetes favors certain nodes until they die
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Avesh Agarwal
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-08-25 20:03 UTC by Sten Turpin
Modified: 2017-12-18 21:02 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-19 19:59:18 UTC
Target Upstream Version:


Attachments

Description Sten Turpin 2017-08-25 20:03:25 UTC
Description of problem: Kubernetes is not spreading pods evenly across nodes; instead it concentrates pods on the same few nodes until those nodes become overloaded and unresponsive.


Version-Release number of selected component (if applicable): atomic-openshift-3.6.173.0.5-1.git.0.f30b99e.el7.x86_64 


How reproducible: Intermittent; observed on a cluster with 265 nodes


Steps to Reproduce:
1. Create a large cluster
2. Create a large, diverse load on the cluster

Actual results:

Pods are fed to the same nodes until those nodes become overloaded and die

Expected results:
Pods should be spread out across nodes unless affinity or other similar factors say otherwise 
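Not part of the original report: one way to make the expected spreading explicit is a preferred podAntiAffinity rule (a beta field in the Kubernetes 1.6 that OCP 3.6 ships), which penalizes co-locating replicas of the same app. The pod name, label, and image below are a sketch based on the hello-openshift example; adjust to taste.

```shell
# Hypothetical sketch: preferred (soft) anti-affinity so replicas with the
# app=hello-openshift label avoid landing on the same node when possible.
cat <<'EOF' > /tmp/anti-affinity-sketch.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-openshift
  labels:
    app: hello-openshift
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: hello-openshift
          topologyKey: kubernetes.io/hostname
  containers:
  - name: hello-openshift
    image: openshift/hello-openshift
EOF
# oc create -f /tmp/anti-affinity-sketch.yaml   # against a test project
```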

Additional info:

Pods created from different clients at roughly the same time were all scheduled to the same node. That node became unresponsive under the resulting load.
    [sturpin@starter-us-east-1-master-25064 ~]$ sudo oc get pods -n ops-health-monitoring -o wide
    [sudo] password for sturpin:
    NAME                         READY     STATUS              RESTARTS   AGE       IP        NODE
    pull-08251850z-34-1-deploy   0/1       ContainerCreating   0          13m       <none>    ip-172-31-55-238.ec2.internal
    pull-08251850z-tl-1-deploy   0/1       ContainerCreating   0          29m       <none>    ip-172-31-55-238.ec2.internal
    pull-08251850z-w3-1-deploy   0/1       ContainerCreating   0          13m       <none>    ip-172-31-55-238.ec2.internal
     
    [sturpin@starter-us-east-1-master-25064 ~]$ sudo oc get pods -n ops-health-monitoring  -o wide
    NAME                          READY     STATUS        RESTARTS   AGE       IP              NODE
    build-08251900z-1x-1-deploy   0/1       Terminating   0          18m       <none>          ip-172-31-59-26.ec2.internal
    build-08251901z-0d-1-deploy   0/1       Terminating   0          18m       <none>          ip-172-31-59-26.ec2.internal
    build-08251901z-xi-1-build    0/1       Terminating   0          20m       <none>          ip-172-31-59-26.ec2.internal
    pull-08251910z-oj-1-deploy    0/1       Terminating   0          11m       10.129.23.174   ip-172-31-59-26.ec2.internal
    pull-08251920z-26-1-deploy    1/1       Running       0          1m        <none>          ip-172-31-54-226.ec2.intern
     
Odd distribution across the top 50 nodes: 
     
    [sturpin@starter-us-east-1-master-25064 ~]$ sudo oc get pods -o wide --all-namespaces | awk '{print $8}' | sort | uniq -c  | sort -rn | head -50
        198 ip-172-31-50-178.ec2.internal
         87 ip-172-31-49-48.ec2.internal
         86 ip-172-31-57-154.ec2.internal
         75 ip-172-31-53-214.ec2.internal
         75 ip-172-31-51-213.ec2.internal
         73 ip-172-31-60-39.ec2.internal
         67 ip-172-31-56-61.ec2.internal
         65 ip-172-31-61-142.ec2.internal
         56 ip-172-31-59-1.ec2.internal
         55 ip-172-31-59-57.ec2.internal
         55 ip-172-31-58-211.ec2.internal
         55 ip-172-31-57-98.ec2.internal
         55 ip-172-31-57-139.ec2.internal
         53 ip-172-31-59-91.ec2.internal
         51 ip-172-31-60-66.ec2.internal
         49 ip-172-31-60-135.ec2.internal
         47 ip-172-31-61-150.ec2.internal
         47 ip-172-31-60-189.ec2.internal
         47 ip-172-31-52-92.ec2.internal
         46 ip-172-31-60-13.ec2.internal
         46 ip-172-31-57-24.ec2.internal
         46 ip-172-31-56-114.ec2.internal
         46 ip-172-31-50-89.ec2.internal
         46 ip-172-31-49-133.ec2.internal
         45 ip-172-31-58-246.ec2.internal
         45 ip-172-31-57-102.ec2.internal
         45 ip-172-31-55-29.ec2.internal
         45 ip-172-31-54-211.ec2.internal
         45 ip-172-31-52-181.ec2.internal
         44 ip-172-31-62-163.ec2.internal
         44 ip-172-31-56-206.ec2.internal
         44 ip-172-31-55-228.ec2.internal
         43 ip-172-31-53-87.ec2.internal
         43 ip-172-31-48-176.ec2.internal
         42 ip-172-31-59-69.ec2.internal
         42 ip-172-31-51-3.ec2.internal
         42 ip-172-31-49-165.ec2.internal
         41 ip-172-31-56-207.ec2.internal
         41 ip-172-31-55-219.ec2.internal
         41 ip-172-31-49-42.ec2.internal
         41 ip-172-31-49-220.ec2.internal
         40 ip-172-31-61-46.ec2.internal
         40 ip-172-31-57-220.ec2.internal
         40 ip-172-31-54-26.ec2.internal
         40 ip-172-31-53-187.ec2.internal
         39 ip-172-31-54-98.ec2.internal
         38 ip-172-31-61-209.ec2.internal
         38 ip-172-31-61-162.ec2.internal
         38 ip-172-31-59-113.ec2.internal
         38 ip-172-31-57-141.ec2.internal
[sturpin@starter-us-east-1-master-25064 ~]$ sudo oc get pods -o wide --all-namespaces | awk '{print $8}' | sort | uniq -c  | sort -rn | tail -10
     31 ip-172-31-60-103.ec2.internal
     31 ip-172-31-53-133.ec2.internal
     30 ip-172-31-54-226.ec2.internal
     14 ip-172-31-59-26.ec2.internal
      5 ip-172-31-60-14.ec2.internal
      4 ip-172-31-56-38.ec2.internal
      4 ip-172-31-50-116.ec2.internal
      4 ip-172-31-48-214.ec2.internal
      1 NODE
      1 ip-172-31-51-95.ec2.internal
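(The "1 NODE" entry above is the header row leaking into the tally. A small wrapper around the same pipeline skips it; the demo input below is synthetic, and the real invocation against the cluster is shown in the trailing comment.)

```shell
# Tally pods per node from "oc get pods -o wide --all-namespaces" output,
# skipping the header row (NR > 1). Column 8 is NODE in that format.
tally_nodes() { awk 'NR > 1 {print $8}' | sort | uniq -c | sort -rn; }

# Synthetic demo input (two pods on nodeA, one on nodeB):
printf '%s\n' \
  'NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE' \
  'ns1 p1 1/1 Running 0 1m 10.0.0.1 nodeA' \
  'ns1 p2 1/1 Running 0 1m 10.0.0.2 nodeA' \
  'ns2 p3 1/1 Running 0 1m 10.0.0.3 nodeB' | tally_nodes

# Real use: sudo oc get pods -o wide --all-namespaces | tally_nodes | head -50
```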

Comment 1 Sten Turpin 2017-08-25 21:06:34 UTC
Another interesting point: we run end-to-end tests that create apps from images and do STI builds at the top and middle of each hour. Here's our top-of-the-hour run; note how many of the deploy pods land on the same node:

[sturpin@starter-us-east-1-master-25064 ~]$ sudo oc get pods -n ops-health-monitoring -o wide --watch
[sudo] password for sturpin:
NAME                        READY     STATUS    RESTARTS   AGE       IP             NODE
pull-08252050z-u7-1-1tv4x   1/1       Running   0          2m        10.129.22.48   ip-172-31-59-26.ec2.internal
pull-08252050z-u7-1-1tv4x   1/1       Terminating   0         2m        10.129.22.48   ip-172-31-59-26.ec2.internal
pull-08252050z-u7-1-1tv4x   0/1       Terminating   0         3m        <none>    ip-172-31-59-26.ec2.internal
pull-08252050z-u7-1-1tv4x   0/1       Terminating   0         3m        <none>    ip-172-31-59-26.ec2.internal
pull-08252050z-u7-1-1tv4x   0/1       Terminating   0         3m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-deploy   0/1       Pending   0         0s        <none>
pull-08252100z-o7-1-deploy   0/1       Pending   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-deploy   0/1       ContainerCreating   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-deploy   0/1       Pending   0         0s        <none>
pull-08252100z-01-1-deploy   0/1       Pending   0         1s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-deploy   0/1       ContainerCreating   0         1s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-deploy   0/1       Pending   0         0s        <none>
pull-08252100z-sp-1-deploy   0/1       Pending   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-deploy   0/1       ContainerCreating   0         0s        <none>    ip-172-31-59-26.ec2.internal
build-08252100z-gr-1-build   0/1       Pending   0         0s        <none>
build-08252100z-gr-1-build   0/1       Pending   0         1s        <none>    ip-172-31-55-238.ec2.internal
build-08252100z-gr-1-build   0/1       ContainerCreating   0         1s        <none>    ip-172-31-55-238.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       Pending   0         0s        <none>
pull-08252100z-o7-1-k1jj1   0/1       Pending   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       ContainerCreating   0         0s        <none>    ip-172-31-59-26.ec2.internal
build-08252100z-gr-1-build   1/1       Running   0         24s       10.131.89.27   ip-172-31-55-238.ec2.internal
build-08252101z-7b-1-build   0/1       Pending   0         0s        <none>
build-08252101z-7b-1-build   0/1       Pending   0         0s        <none>    ip-172-31-59-69.ec2.internal
build-08252101z-7b-1-build   0/1       ContainerCreating   0         1s        <none>    ip-172-31-59-69.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       Pending   0         0s        <none>
pull-08252100z-01-1-0m7gx   0/1       Pending   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       ContainerCreating   0         0s        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-lxktm   0/1       Pending   0         0s        <none>
pull-08252100z-sp-1-lxktm   0/1       Pending   0         0s        <none>    ip-172-31-57-103.ec2.internal
pull-08252100z-sp-1-lxktm   0/1       ContainerCreating   0         0s        <none>    ip-172-31-57-103.ec2.internal
pull-08252100z-sp-1-deploy   1/1       Running   0         1m        10.129.22.67   ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-deploy   1/1       Running   0         1m        10.129.22.66   ip-172-31-59-26.ec2.internal
build-08252101z-6w-1-build   0/1       Pending   0         0s        <none>
build-08252101z-6w-1-build   0/1       Pending   0         0s        <none>    ip-172-31-51-213.ec2.internal
build-08252101z-6w-1-build   0/1       ContainerCreating   0         1s        <none>    ip-172-31-51-213.ec2.internal
pull-08252100z-sp-1-lxktm   1/1       Running   0         12s       10.128.67.200   ip-172-31-57-103.ec2.internal
build-08252100z-gr-1-deploy   0/1       Pending   0         0s        <none>
build-08252100z-gr-1-deploy   0/1       Pending   0         0s        <none>    ip-172-31-51-151.ec2.internal
build-08252100z-gr-1-deploy   0/1       ContainerCreating   0         0s        <none>    ip-172-31-51-151.ec2.internal
build-08252100z-gr-1-build   0/1       Completed   0         1m        10.131.89.27   ip-172-31-55-238.ec2.internal
pull-08252100z-sp-1-deploy   0/1       Completed   0         1m        10.129.22.67   ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-deploy   0/1       Terminating   0         1m        10.129.22.67   ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-deploy   0/1       Terminating   0         1m        10.129.22.67   ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-deploy   1/1       Running   0         2m        10.129.22.65   ip-172-31-59-26.ec2.internal
build-08252101z-6w-1-build   1/1       Running   0         49s       10.128.151.199   ip-172-31-51-213.ec2.internal
build-08252101z-7b-1-deploy   0/1       Pending   0         0s        <none>
build-08252101z-7b-1-deploy   0/1       Pending   0         0s        <none>    ip-172-31-55-238.ec2.internal
build-08252101z-7b-1-deploy   0/1       ContainerCreating   0         0s        <none>    ip-172-31-55-238.ec2.internal
build-08252101z-7b-1-deploy   1/1       Running   0         9s        10.131.89.30   ip-172-31-55-238.ec2.internal
build-08252101z-7b-1-l3l5c   0/1       Pending   0         0s        <none>
build-08252101z-7b-1-l3l5c   0/1       Pending   0         1s        <none>    ip-172-31-59-26.ec2.internal
build-08252101z-7b-1-l3l5c   0/1       ContainerCreating   0         1s        <none>    ip-172-31-59-26.ec2.internal
build-08252101z-6w-1-deploy   0/1       Pending   0         0s        <none>
build-08252101z-6w-1-deploy   0/1       Pending   0         0s        <none>    ip-172-31-57-24.ec2.internal
build-08252101z-6w-1-deploy   0/1       ContainerCreating   0         0s        <none>    ip-172-31-57-24.ec2.internal
build-08252101z-6w-1-build   0/1       Completed   0         1m        10.128.151.199   ip-172-31-51-213.ec2.internal
build-08252100z-gr-1-b1lhg   0/1       Pending   0         0s        <none>
build-08252100z-gr-1-deploy   1/1       Running   0         1m        10.130.104.192   ip-172-31-51-151.ec2.internal
build-08252101z-6w-1-deploy   1/1       Running   0         17s       10.131.133.144   ip-172-31-57-24.ec2.internal
build-08252100z-gr-1-b1lhg   0/1       Pending   0         0s        <none>    ip-172-31-57-139.ec2.internal
build-08252100z-gr-1-b1lhg   0/1       ContainerCreating   0         0s        <none>    ip-172-31-57-139.ec2.internal
build-08252101z-6w-1-7lxpb   0/1       Pending   0         0s        <none>
build-08252101z-6w-1-7lxpb   0/1       Pending   0         0s        <none>    ip-172-31-57-102.ec2.internal
build-08252101z-6w-1-7lxpb   0/1       ContainerCreating   0         0s        <none>    ip-172-31-57-102.ec2.internal
build-08252101z-7b-1-build   0/1       Completed   0         2m        10.128.60.6   ip-172-31-59-69.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-deploy   1/1       Terminating   0         3m        10.129.22.65   ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-o7-1-k1jj1   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
build-08252100z-gr-1-b1lhg   1/1       Running   0         16s       10.128.44.137   ip-172-31-57-139.ec2.internal
build-08252101z-6w-1-7lxpb   1/1       Running   0         15s       10.128.85.118   ip-172-31-57-102.ec2.internal
build-08252100z-gr-1-deploy   0/1       Completed   0         1m        10.130.104.192   ip-172-31-51-151.ec2.internal
build-08252100z-gr-1-deploy   0/1       Terminating   0         1m        10.130.104.192   ip-172-31-51-151.ec2.internal
build-08252100z-gr-1-deploy   0/1       Terminating   0         1m        10.130.104.192   ip-172-31-51-151.ec2.internal
build-08252101z-6w-1-deploy   0/1       Completed   0         37s       10.131.133.144   ip-172-31-57-24.ec2.internal
build-08252101z-6w-1-deploy   0/1       Terminating   0         37s       10.131.133.144   ip-172-31-57-24.ec2.internal
build-08252101z-6w-1-deploy   0/1       Terminating   0         37s       10.131.133.144   ip-172-31-57-24.ec2.internal
pull-08252100z-sp-1-lxktm   1/1       Terminating   0         2m        10.128.67.200   ip-172-31-57-103.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-deploy   1/1       Terminating   0         3m        10.129.22.66   ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-01-1-0m7gx   0/1       Terminating   0         2m        <none>    ip-172-31-59-26.ec2.internal
pull-08252100z-sp-1-lxktm   0/1       Terminating   0         2m        <none>    ip-172-31-57-103.ec2.internal
pull-08252100z-sp-1-lxktm   0/1       Terminating   0         2m        <none>    ip-172-31-57-103.ec2.internal
pull-08252100z-sp-1-lxktm   0/1       Terminating   0         2m        <none>    ip-172-31-57-103.ec2.internal

Comment 2 Derek Carr 2017-08-26 04:16:39 UTC
Do the pods you are scheduling specify resource requirements?

Can you provide a prototypical pod you are using in the test scenario?

Comment 3 Derek Carr 2017-08-26 04:18:04 UTC
Reassigned to the Pod component, which handles scheduling issues.

Comment 4 Avesh Agarwal 2017-08-28 12:50:23 UTC
I am looking into it.

Comment 5 Avesh Agarwal 2017-08-28 13:30:10 UTC
Hi Sten,

First, I would like to understand why there are 198 pods on the node ip-172-31-50-178.ec2.internal. For that, could you provide the following information:

1. oc describe output for all 198 pods on ip-172-31-50-178.ec2.internal.
2. oc describe output for all nodes.

I will mine the data myself once you provide the above, to avoid going back and forth.
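(A sketch of how the requested data could be collected: list namespace/pod pairs on the node in question, then feed each pair to "oc describe". The helper is demoed on synthetic input; node names are taken from the report.)

```shell
# Print "namespace pod" pairs for every pod on a given node, reading
# "oc get pods -o wide --all-namespaces" output on stdin (NODE is column 8).
pods_on_node() { awk -v n="$1" 'NR > 1 && $8 == n {print $1, $2}'; }

# Synthetic demo input:
printf '%s\n' \
  'NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE' \
  'ns1 p1 1/1 Running 0 1m 10.0.0.1 ip-172-31-50-178.ec2.internal' \
  'ns2 p2 1/1 Running 0 1m 10.0.0.2 ip-172-31-49-48.ec2.internal' \
  | pods_on_node ip-172-31-50-178.ec2.internal

# Real use:
#   sudo oc get pods -o wide --all-namespaces \
#     | pods_on_node ip-172-31-50-178.ec2.internal \
#     | while read ns pod; do sudo oc describe pod -n "$ns" "$pod"; done
#   sudo oc describe nodes > nodes.txt
```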

Regarding your next comment, https://bugzilla.redhat.com/show_bug.cgi?id=1485464#c1 , I'd say I am not surprised by this behavior. Tallying the node column from the data in that comment, I get:

     33 ip-172-31-59-26.ec2.internal
     15 
      7 ip-172-31-57-103.ec2.internal
      7 ip-172-31-55-238.ec2.internal
      6 ip-172-31-57-24.ec2.internal
      6 ip-172-31-51-151.ec2.internal
      4 ip-172-31-51-213.ec2.internal
      3 ip-172-31-59-69.ec2.internal
      3 ip-172-31-57-139.ec2.internal
      3 ip-172-31-57-102.ec2.internal

Your original comment shows that ip-172-31-59-26 has only 14 pods, whereas other nodes are above 30, so having 33 pods on ip-172-31-59-26 is not a surprise.

Anyway, once you provide the info I asked for, I will see what is going on. It could be any of: 1) an issue with the scheduler, 2) not enough build nodes in the cluster (if placement is controlled by node selectors), 3) incorrect labels on the nodes, or 4) something related to the pods' resource requirements.

But as I said above, I will start by investigating why there are 198 pods on the node ip-172-31-50-178.ec2.internal. Let me know if you have questions.

Comment 8 Sten Turpin 2017-09-14 15:55:27 UTC
(In reply to Derek Carr from comment #2)
> Do the pods you are scheduling specify resource requirements?
> 
> Can you provide a prototypical pod you are using in the test scenario?

We do not specify resource requirements during our e2e testing. The pod is a simple deploy of https://github.com/openshift/origin/tree/master/examples/hello-openshift
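(Without resource requests, every pod looks equally "free" to the scheduler's least-requested spreading priority, which plausibly degrades spreading. A sketch of giving the hello-openshift test pod explicit requests; the name, image, and values are illustrative, not from the original report.)

```shell
# Hypothetical sketch: hello-openshift pod with explicit CPU/memory requests,
# giving the scheduler a per-node load signal to spread against.
cat <<'EOF' > /tmp/hello-requests.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-openshift
spec:
  containers:
  - name: hello-openshift
    image: openshift/hello-openshift
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
EOF
# oc create -f /tmp/hello-requests.yaml   # against a test project
```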

Comment 17 Seth Jennings 2017-09-29 21:59:07 UTC
Sten, are you still experiencing the node death due to overload originally reported?

If so, and you still suspect that the scheduler is not placing pods properly, we will probably need to get verbose scheduler logs to observe the assignment logic during a seemingly improper placement.

If you are not experiencing this anymore, please let me know so I can close.
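(For reference, a sketch of raising scheduler log verbosity on an OCP 3.6 master. The scheduler runs inside the master controllers process; the sysconfig path and unit name below assume an HA install and are stated as assumptions, not confirmed for this cluster. The edit is rehearsed on a copy first.)

```shell
# Rehearse the loglevel bump on a copy of the sysconfig file:
printf 'OPTIONS=--loglevel=2\n' > /tmp/controllers-sysconfig
sed -i 's/--loglevel=[0-9]*/--loglevel=4/' /tmp/controllers-sysconfig
cat /tmp/controllers-sysconfig

# Then, on the master itself (assumed HA unit/path):
#   sudo sed -i 's/--loglevel=[0-9]*/--loglevel=4/' \
#     /etc/sysconfig/atomic-openshift-master-controllers
#   sudo systemctl restart atomic-openshift-master-controllers
#   sudo journalctl -u atomic-openshift-master-controllers -f
```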

Comment 18 Sten Turpin 2017-12-18 21:02:40 UTC
We're still seeing this on 3.6, but apparently not on 3.7.

