Description of problem:
FailedScheduling after a few deployments on Azure.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-22-160516

How reproducible:
100%

Steps to Reproduce:
1. Using the openshift-install script, create a cluster on Azure with these parameters:

apiVersion: v1
baseDomain: qe.azure.devcluster.openshift.com
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_DS2_v2
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_DS3_v2
  replicas: 3

2. Get the template:

wget https://raw.githubusercontent.com/skordas/svt/starge_git_test_update/storage/git/files/oc/template_git.yaml

3. Deploy:

oc new-project test-1
oc process -f template_git.yaml -p PVC_SIZE=1Gi -p STORAGE_CLASS_NAME=managed-premium | oc create --namespace test-1 -f -

4. Repeat step 3 for projects test-2 ... test-9 (a scripted version is sketched at the end of this comment).

Actual results:
In my case the first 8 deployments succeed and the rest fail.

oc get pods -n test-9
NAME           READY   STATUS   RESTARTS   AGE
git-1-deploy   0/1     Error    0          20h

oc get events
LAST SEEN  TYPE     REASON                        OBJECT                        MESSAGE
115m       Warning  FailedScheduling              pod/git-2-deploy              Binding rejected: Operation cannot be fulfilled on pods/binding "git-2-deploy": pod git-2-deploy is already assigned to node "skordas0723-5lbrd-worker-centralus3-s4qn2"
115m       Warning  FailedScheduling              pod/git-2-deploy              Binding rejected: Operation cannot be fulfilled on pods/binding "git-2-deploy": pod git-2-deploy is already assigned to node "skordas0723-5lbrd-worker-centralus3-s4qn2"
115m       Normal   Scheduled                     pod/git-2-deploy              Successfully assigned test-9/git-2-deploy to skordas0723-5lbrd-worker-centralus3-s4qn2
115m       Normal   Pulled                        pod/git-2-deploy              Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fcb3aa914fad34d05b60a957ff2a99c9d10b7c441916d0fe36d41b0f756440a3" already present on machine
115m       Normal   Created                       pod/git-2-deploy              Created container deployment
115m       Normal   Started                       pod/git-2-deploy              Started container deployment
105m       Warning  FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m       Warning  FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m       Warning  FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m       Warning  FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
105m       Warning  FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
105m       Warning  FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
115m       Normal   SuccessfulCreate              replicationcontroller/git-2   Created pod: git-2-g8cg7
105m       Normal   SuccessfulDelete              replicationcontroller/git-2   Deleted pod: git-2-g8cg7
115m       Normal   DeploymentCreated             deploymentconfig/git          Created new replication controller "git-2" for version 2
105m       Normal   ReplicationControllerScaled   deploymentconfig/git          Scaled replication controller "git-2" from 1 to 0

Expected results:
All deployments should be successful.
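For reference, steps 3-4 can be scripted with a loop like the following (illustrative sketch only; it simply repeats the oc commands from step 3 with the same template and parameters):

for i in $(seq 1 9); do
  oc new-project test-$i
  oc process -f template_git.yaml -p PVC_SIZE=1Gi -p STORAGE_CLASS_NAME=managed-premium | oc create --namespace test-$i -f -
done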
It looks like this is storage related: when there are multiple availability zones, the PV/PVC should be provisioned in the same zone as the pod.

oc get storageclass managed-premium -o yaml | grep volumeBindingMode

Actual:
volumeBindingMode: Immediate

Expected:
volumeBindingMode: WaitForFirstConsumer

With volumeBindingMode: Immediate the PVC can be bound to a volume in a different zone than the pod. WaitForFirstConsumer delays provisioning until the pod is scheduled, so the PV, the PVC and the pod all end up in the same zone (see the sketch below).
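For illustration only, a StorageClass mirroring managed-premium but with delayed binding could look like the sketch below. The class name and the Managed/Premium_LRS parameters are assumptions based on the default Azure managed-premium class, not the exact definition shipped with the installer:

cat <<EOF | oc apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-wffc             # hypothetical name
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed                          # assumed to match managed-premium
  storageaccounttype: Premium_LRS        # assumed to match managed-premium
volumeBindingMode: WaitForFirstConsumer  # provision the disk in the pod's zone
EOF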
> 105m Warning FailedScheduling pod/git-2-g8cg7 0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.

It seems that one node is at its limit of attached Azure volumes.

> volumeBindingMode: WaitForFirstConsumer

This is already covered in bug #1731059 and should help in your case. But please check the number of pods with volumes on the other nodes - are they reaching the volume attachment limit too (a quick way to check is sketched at the end of this comment)? Kubernetes should distribute volumes across zones roughly evenly. Maybe it's time to scale the cluster up.

*** This bug has been marked as a duplicate of bug 1731059 ***
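As a rough way to check that, you could count the volumes reported in each node's status. This is only a sketch, assuming the in-tree Azure disk plugin, whose attached volume names contain "azure-disk":

# Count Azure disks currently attached to each node.
for n in $(oc get nodes -o name); do
  echo -n "$n: "
  oc get "$n" -o jsonpath='{.status.volumesAttached[*].name}' | grep -o azure-disk | wc -l
done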