Description of problem:
With a cluster of 2 nodes and 1 master, estimating cluster capacity with pod.yaml gives a correct result. But after removing a node from the cluster, estimating cluster capacity with the same pod.yaml gives an incorrect result.

Version-Release number of selected component (if applicable):
openshift v3.6.74
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. With a cluster of 2 nodes and 1 master, estimate cluster capacity with pod.yaml:
# cat examples/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-pod
  labels:
    app: guestbook
    tier: frontend
spec:
  containers:
  - name: php-redis
    image: gcr.io/google-samples/gb-frontend:v4
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 150m
        memory: 100Mi
      requests:
        cpu: 150m
        memory: 100Mi
  restartPolicy: "OnFailure"
  dnsPolicy: "Default"

# ./cluster-capacity --kubeconfig ~/.kube/config --podspec examples/pod.yaml
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 24 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the following predicates:: Insufficient cpu (2).

Pod distribution among nodes:
small-pod
	- host-8-174-53.host.centralci.eng.rdu2.redhat.com: 12 instance(s)
	- host-8-172-102.host.centralci.eng.rdu2.redhat.com: 12 instance(s)

2. Remove node2 from the cluster:
# oc delete node host-8-172-102.host.centralci.eng.rdu2.redhat.com
node "host-8-172-102.host.centralci.eng.rdu2.redhat.com" deleted

3. Estimate cluster capacity with pod.yaml again:
# ./cluster-capacity --kubeconfig ~/.kube/config --podspec examples/pod.yaml

4. Try to create a pod manually:
# oc create -f examples/pod.yaml

Actual results:
3.
# ./cluster-capacity --kubeconfig ~/.kube/config --podspec examples/pod.yaml --verbose
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 0 instance(s) of the pod small-pod.

Termination reason: Unschedulable: no nodes available to schedule pods

4.
# oc create -f examples/pod.yaml
pod "small-pod" created
# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-7sckt    1/1       Running   0          1h
docker-registry-1-fgl6p    1/1       Running   0          56m
registry-console-1-xgtfp   1/1       Running   0          56m
router-1-h5fkm             1/1       Running   0          1h
router-1-qc01g             0/1       Pending   0          15m
small-pod                  1/1       Running   0          59s

Expected results:
3. The correct result should be as follows:
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 11 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the following predicates:: Insufficient cpu (1).

Pod distribution among nodes:
small-pod
	- host-8-174-53.host.centralci.eng.rdu2.redhat.com: 11 instance(s)

4. The pod is created successfully and runs:
small-pod 1/1 Running 0 59s

Additional info:
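The per-node counts in the reports above follow directly from dividing each node's allocatable resources by the pod's requests, with the scarcest resource binding (the termination reason names CPU). A minimal sketch of that arithmetic, using hypothetical allocatable values chosen so that 12 instances of a 150m pod fit per node (real values would come from `oc describe node`):

```python
# Sketch of how per-node capacity for a pod spec is bounded: each resource
# limits the count independently, and the minimum wins. The allocatable
# values below are hypothetical, picked to match the 12-instances-per-node
# result in this report; this is not the actual cluster-capacity code.

def instances_per_node(allocatable, requests):
    """Max pod instances a node can hold, limited by the scarcest resource."""
    return min(allocatable[r] // requests[r] for r in requests)

requests = {"cpu_m": 150, "memory_mi": 100}   # from examples/pod.yaml
node = {"cpu_m": 1800, "memory_mi": 3500}     # hypothetical allocatable

print(instances_per_node(node, requests))     # -> 12 (CPU is the binding resource)
```

With two identical nodes this gives the 24 instances of the first run; with one node, the expected 11-12 of the second.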
I tested this today on a 1 master and 2 nodes setup I created from the latest origin master branch. Here are the steps I followed:

1. Started master:
#openshift start master --config=./openshift.local.config/master/master-config.yaml

2. Started node 1:
#openshift start node --config=./openshift.local.config/node-192.168.124.120/node-config.yaml

3. Started node 2:
#openshift start node --config=./openshift.local.config/node-192.168.124.214/node-config.yaml

4. Check node statuses:
#oc get nodes --config=./openshift.local.config/master/admin.kubeconfig
NAME              STATUS    AGE       VERSION
192.168.124.120   Ready     2m        v1.6.1+5115d708d7
192.168.124.214   Ready     46s       v1.6.1+5115d708d7

5. Run cluster capacity analysis:
# ./_output/local/bin/linux/amd64/cluster-capacity --kubeconfig ~/upstream-code/gocode/src/github.com/openshift/origin/openshift.local.config/master/admin.kubeconfig --podspec examples/pod.yaml --verbose
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 52 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the following predicates:: Insufficient cpu (2).

Pod distribution among nodes:
small-pod
	- 192.168.124.214: 26 instance(s)
	- 192.168.124.120: 26 instance(s)

6. Now delete the 2nd node:
#oc delete nodes 192.168.124.214 --config=./openshift.local.config/master/admin.kubeconfig

7. Check node status:
# oc get nodes --config=./openshift.local.config/master/admin.kubeconfig
NAME              STATUS    AGE       VERSION
192.168.124.120   Ready     4m        v1.6.1+5115d708d7

8. Run cluster capacity analysis again:
# ./_output/local/bin/linux/amd64/cluster-capacity --kubeconfig ~/upstream-code/gocode/src/github.com/openshift/origin/openshift.local.config/master/admin.kubeconfig --podspec examples/pod.yaml --verbose
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 26 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the following predicates:: Insufficient cpu (1).

Pod distribution among nodes:
small-pod
	- 192.168.124.120: 26 instance(s)

You can see it clearly shows only one node was taken into the analysis.

9. Just for a further test, I create a pod:
# oc --config=./openshift.local.config/master/admin.kubeconfig create -f ~/upstream-code/gocode/src/github.com/kubernetes-incubator/cluster-capacity/examples/pod.yaml
pod "small-pod" created

10. Run cluster capacity analysis again:
# ./_output/local/bin/linux/amd64/cluster-capacity --kubeconfig ~/upstream-code/gocode/src/github.com/openshift/origin/openshift.local.config/master/admin.kubeconfig --podspec examples/pod.yaml --verbose
small-pod pod requirements:
	- CPU: 150m
	- Memory: 100Mi

The cluster can schedule 25 instance(s) of the pod small-pod.

Termination reason: Unschedulable: No nodes are available that match all of the following predicates:: Insufficient cpu (1).

Pod distribution among nodes:
small-pod
	- 192.168.124.120: 25 instance(s)

It clearly shows one less pod (now 25, one less than the previous 26), since one pod was created in the cluster.

Anyway, please let me know if anything looks incorrect; otherwise I cannot reproduce the issue as described, and the cluster capacity analysis is working as expected.
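The 26-to-25 drop between steps 8 and 10 is simply the created pod's own request being subtracted from the node's free capacity. A quick check of that arithmetic, using a hypothetical free-CPU figure chosen so that 26 instances of a 150m pod fit:

```python
# Hypothetical numbers modeling steps 8 and 10 above: free CPU is chosen
# as 26 * 150m; creating one small-pod consumes one pod's worth of CPU,
# so the estimate drops by exactly one.

pod_cpu_m = 150
free_cpu_m = 3900                              # hypothetical: 26 * 150m

print(free_cpu_m // pod_cpu_m)                 # -> 26, before creating small-pod
print((free_cpu_m - pod_cpu_m) // pod_cpu_m)   # -> 25, after creating it
```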
(In reply to Avesh Agarwal from comment #1)
> [steps 1-10 and output quoted from comment #1]

Hi Avesh, there is nothing incorrect in your steps. Yesterday I tried with a build cloned from the origin master branch, and the issue still reproduces. In fact, after I installed the cluster env (1 master, 2 nodes), the first time I deleted the node the issue did not reproduce; but when I deleted the same node a second time, the issue was reproduced.
Or maybe you can try deleting the node more than once.
Could you try the following steps and let me know if you are still seeing issues:

1. Set up your cluster (skip if the cluster is already set up).
2. Remove a node.
3. Run oc get nodes or oc describe nodes.
4. Run the cluster capacity analysis.

Let me know if you follow the above and can still reproduce the issue. If yes, please send me the oc get/describe nodes output from just before running the cluster-capacity command.
Tagging the upcoming release, as cluster-capacity will not be vendored into origin until the next sprint.
This one has gone cold. I attempted to reproduce by building everything from origin master and could not; everything worked as expected. If reproduction is possible on a completely clean env built from upstream, please reopen.
I debugged it further and I think I have found a more reliable reproducer. I found that as soon as I delete the router-1 rc, cluster-capacity works without any issues. So right now it seems to me that router-1 is causing the problem. The exact steps, and what is really happening:

1. A node is removed.
2. The router pod running on that removed node goes into Unknown.
3. Another router pod is created and remains Pending.

As soon as the above happens, cluster-capacity starts showing 0 nodes available.

Another way to reproduce this is to just stop the node:

1. Stop a node.
2. After some time, the node goes into NotReady.
3. Pods running on the stopped node stay Running (waiting for the default eviction timeout). Note that up to this point, cluster-capacity keeps showing correct results.
4. After the eviction timeout, the router pod on the stopped node goes Unknown and one more Pending router pod appears. As soon as this happens, cluster-capacity starts showing 0 available nodes.
Hi Cheng, so I know the root cause now. As long as there is at least one pending pod, cluster-capacity shows 0 nodes available.
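A simplified model of the failure mode described in this comment (this is an illustrative sketch, not the actual cluster-capacity code): the simulation copies the live cluster state, and a pre-existing pod that is stuck Pending can never be placed, so the scheduling loop bails out before a single simulated pod is counted, yielding "0 instance(s)". The fixed behavior would count only the simulated pods.

```python
# Toy model of the bug: a pre-existing unschedulable pending pod terminates
# the simulation immediately (buggy=True), instead of being ignored so that
# only the simulated pod's placements are counted (buggy=False). All numbers
# are hypothetical; this does not reflect the real cluster-capacity internals.

def estimate(node_free_cpu_m, existing_pending, pod_cpu_m, buggy):
    if buggy and existing_pending:
        # Buggy behavior: any unschedulable pre-existing pod in the copied
        # cluster state ends the simulation before anything is scheduled.
        return 0
    scheduled = 0
    while node_free_cpu_m >= pod_cpu_m:
        node_free_cpu_m -= pod_cpu_m
        scheduled += 1
    return scheduled

print(estimate(1800, existing_pending=True, pod_cpu_m=150, buggy=True))   # -> 0
print(estimate(1800, existing_pending=True, pod_cpu_m=150, buggy=False))  # -> 12
```

This matches the observations above: with the pending router pod present the tool reports 0 nodes available, and after the router-1 rc (and its pending pod) is deleted it reports the correct count again.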
I think I know what to fix; I am testing and will send a PR soon.
Cheng, I have created PR https://github.com/openshift/origin/pull/14923 to address this issue. I would appreciate it if you could verify it. Thanks, Avesh
Okay, I will verify it once the PR is merged.
Verified on openshift v3.6.126.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716