Description of problem:
Pod with persistent volumes failed scheduling on Azure due to a volume node affinity conflict.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-18-003042

How reproducible:
Often

Steps to Reproduce:
1. Log in and create a project.
2. Create a PVC.
3. Create a pod that uses the above PVC.

Actual results:
Pod failed scheduling:

  Warning  FailedScheduling  37s (x2 over 37s)  default-scheduler  0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.

Expected results:
Pod is up and running.

Additional info:

$ oc get pvc
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
mypvc   Bound    pvc-a0e2b137-a92d-11e9-b160-000d3a92c15a   1Gi        RWO            managed-premium   28s

$ oc get pods
NAME    READY   STATUS    RESTARTS   AGE
mypod   0/1     Pending   0          31s

$ oc describe pod
Name:               mypod
Namespace:          xg17v
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        openshift.io/scc: restricted
Status:             Pending
IP:
Containers:
  mycontainer:
    Image:        aosqe/hello-openshift
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-986qk (ro)
    Devices:
      /dev/myblock from myvolume
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  myvolume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mypvc
    ReadOnly:   false
  default-token-986qk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-986qk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  38s (x2 over 38s)  default-scheduler  0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  37s                default-scheduler  0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  37s (x2 over 37s)  default-scheduler  0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
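For reference, the PVC and pod used in the reproduction can be reconstructed from the describe output above. The exact manifests were not attached to this report, so this is only a sketch: volumeMode: Block on the PVC is an assumption inferred from the raw block device mount at /dev/myblock, and storageClassName is omitted so that the default class (managed-premium, see the next comment) is used.

# Sketch of the reproduction manifests; names, image and device path are taken from the
# describe output above. volumeMode: Block is assumed because the pod attaches the volume
# as a raw block device.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block          # assumption, inferred from the /dev/myblock device mount
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: aosqe/hello-openshift
    volumeDevices:
    - name: myvolume
      devicePath: /dev/myblock    # raw block device exposed inside the container
  volumes:
  - name: myvolume
    persistentVolumeClaim:
      claimName: mypvc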
$ oc get sc -o yaml
apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2019-07-18T01:06:12Z"
    name: managed-premium
    ownerReferences:
    - apiVersion: v1
      kind: clusteroperator
      name: storage
      uid: 1b2163c5-a8f7-11e9-98a3-000d3a93cf81
    resourceVersion: "8866"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/managed-premium
    uid: 3c267dfa-a8f8-11e9-8236-000d3a93c3f4
  parameters:
    kind: Managed
    storageaccounttype: Premium_LRS
  provisioner: kubernetes.io/azure-disk
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc get nodes --show-labels
NAME                                                STATUS   ROLES    AGE     VERSION             LABELS
qe-lxia-0718-003042-bnwll-master-0                  Ready    master   7h      v1.14.0+c47409b6f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=qe-lxia-0718-003042-bnwll-master-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
qe-lxia-0718-003042-bnwll-master-1                  Ready    master   7h      v1.14.0+c47409b6f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=qe-lxia-0718-003042-bnwll-master-1,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
qe-lxia-0718-003042-bnwll-master-2                  Ready    master   7h1m    v1.14.0+c47409b6f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=qe-lxia-0718-003042-bnwll-master-2,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
qe-lxia-0718-003042-bnwll-worker-centralus1-qpjkr   Ready    worker   6h53m   v1.14.0+c47409b6f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=qe-lxia-0718-003042-bnwll-worker-centralus1-qpjkr,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
qe-lxia-0718-003042-bnwll-worker-centralus2-652vr   Ready    worker   6h53m   v1.14.0+c47409b6f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D2s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=qe-lxia-0718-003042-bnwll-worker-centralus2-652vr,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
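The provisioned PV itself was not captured in this report, but on this cluster version a zoned Azure disk PV carries a node affinity that pins consumers to the disk's zone, roughly as sketched below. With volumeBindingMode: Immediate the zone is chosen when the PVC is created, before the pod is scheduled; since both workers reported a volume node affinity conflict, the disk presumably landed in centralus-3, which has a master but no worker, leaving no schedulable node that satisfies the affinity. The YAML is an illustration based on the zone labels in the node list above, not output from this cluster.

# Illustrative PV node affinity (other PV fields such as the azureDisk source are omitted).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-a0e2b137-a92d-11e9-b160-000d3a92c15a
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - centralus
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - centralus-3    # assumed zone: the only zone with no worker, consistent with the scheduler message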
How did you create the cluster? By default, our installer creates one master and one worker in each zone, so a dynamically provisioned volume always has a worker in its zone and any PVC can be scheduled. I can see that you have only 5 nodes (3 masters, 2 workers) instead of 6. Can you please check what happened to the 6th node?
*** Bug 1732901 has been marked as a duplicate of this bug. ***
Verified that the issue has been fixed. Tested on a cluster with 5 nodes (3 masters, 2 workers), created dynamic PVCs and pods several times, and the pods came up and ran with their volumes attached.

$ oc get co storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.2.0-0.nightly-2019-07-28-222114   True        False         False      35m

$ oc get sc managed-premium -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2019-07-29T02:00:13Z"
  name: managed-premium
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: 5f960bf0-b1a3-11e9-bb54-000d3a92e279
  resourceVersion: "9674"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/managed-premium
  uid: 9a64e4a3-b1a4-11e9-9ac3-000d3a92e440
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
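With WaitForFirstConsumer, disk provisioning is delayed until a pod using the PVC is scheduled, so the disk is created in the zone of the node the scheduler picked and the affinity conflict cannot occur. On clusters whose operator-managed default class still uses volumeBindingMode: Immediate, a user-created class with the same parameters and delayed binding would look roughly like the sketch below; the class name is hypothetical.

# Sketch only; the class name is hypothetical and the parameters mirror managed-premium above.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-wffc    # hypothetical name
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer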
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
We are facing the same problem on OpenShift 4.3.28 (5-node cluster). What is the root cause of the problem, and what is the solution?