Description of problem:

3 masters and 6 workers (m5.4xlarge). The 6 workers are spread across zones: 2 in us-east-2a, 2 in us-east-2b, 2 in us-east-2c.

75 StatefulSets behave differently when they are all in ONE project versus spread across 75 projects. How does a gp2 PV choose its availability zone when it gets created? What is the difference between one project and 75 projects?

Version-Release number of selected component (if applicable):
# oc get clusterversion version -o json | jq -r .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ef5a60a10812f2fa1e4c93a5042c1520ca55675f9b4085b08579510d71031047",
  "version": "4.0.0-0.nightly-2019-01-25-205123"
}

How reproducible:
3/3

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
75 projects, each has one statefulset with 2 replicas, and each pod has a PVC generated by `volumeClaimTemplates`.

# oc get pv --no-headers | wc -l
150
# oc get pod --all-namespaces | grep clusterb | grep Running | wc -l
150
# oc describe pv | grep zone | grep 2a | wc -l
50
# oc describe pv | grep zone | grep 2b | wc -l
50
# oc describe pv | grep zone | grep 2c | wc -l
50

However, if 1 project has all 75 statefulsets:
# oc describe pv | grep zone | grep 2a | wc -l
50
# oc describe pv | grep zone | grep 2b | wc -l
0
# oc describe pv | grep zone | grep 2c | wc -l
75

# oc describe pod -n clusteraproject37 web0-0
Name:               web0-0
Namespace:          clusteraproject37
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=server0
                    controller-revision-hash=web0-7c994b8d69
                    statefulset.kubernetes.io/pod-name=web0-0
Annotations:        openshift.io/scc=restricted
Status:             Pending
IP:
Controlled By:      StatefulSet/web0
Containers:
  server:
    Image:      openshift/hello-openshift
    Port:       8080/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     1
      memory:  256Mi
    Requests:
      cpu:     500m
      memory:  128Mi
    Environment:  <none>
    Mounts:
      /mydata from www (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z6hvb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  www:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  www-web0-0
    ReadOnly:   false
  default-token-z6hvb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z6hvb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  14m (x25 over 14m)  default-scheduler  pod has unbound PersistentVolumeClaims (repeated 6 times)
  Warning  FailedScheduling  4m (x356 over 12m)  default-scheduler  0/9 nodes are available: 2 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 4 node(s) exceed max volume count.
To clarify: 75 projects, each has one statefulset with 2 REPLICAS, and each pod has a PVC generated by `volumeClaimTemplates`.
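For reference, a manifest roughly matching the pods described above would look like the following. It is reconstructed from the `oc describe pod` output in the description; the `serviceName`, access mode, requested storage size, and `storageClassName: gp2` are assumptions, not values taken from the cluster:

# Reconstructed sketch of one per-project StatefulSet, based on the
# `oc describe pod` output above. serviceName, accessModes, storage size,
# and storageClassName are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web0
spec:
  replicas: 2
  serviceName: web0            # assumed headless service name
  selector:
    matchLabels:
      app: server0
  template:
    metadata:
      labels:
        app: server0
    spec:
      containers:
      - name: server
        image: openshift/hello-openshift
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi
        volumeMounts:
        - name: www
          mountPath: /mydata
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]   # assumed
      storageClassName: gp2              # assumed: the default gp2 class
      resources:
        requests:
          storage: 1Gi                   # assumed size

Each project then gets PVCs named www-web0-0 and www-web0-1, i.e. the same PVC names in every namespace.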
Tried with k8s 1.12. The problem is still there.

# oc get clusterversion version -o json | jq .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:d03ce0ef85540a1fff8bfc1c408253404aaecb2b958d7c3f24896f3597c3715b",
  "version": "4.0.0-0.nightly-2019-01-30-145955"
}

# oc version
oc v4.0.0-0.150.0
kubernetes v1.12.4+f39ab668d3
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://hongkliu28-api.qe.devcluster.openshift.com:6443
kubernetes v1.12.4+f39ab668d3
Placement of dynamically provisioned volumes is based *only* on the PVC name. The name is hashed, the hash is divided by the number of zones, and the remainder is used as the index of the zone. This works well if each PVC has a different name: their hashes are different and PVs are provisioned roughly equally among zones. If PVCs in different namespaces have the same names, they have the same hash and their PVs are provisioned in the same zones.

There is bug #1663012 that tries to fix this, but changing the hashing algorithm on a Kubernetes update looks like a significant behavior change.

Can you use different StatefulSet names in each namespace? That should help you with this issue.
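To make the mechanism concrete, here is a minimal Go sketch of the idea described above. It is not the actual upstream zone-selection code (the choice of FNV as the hash function here is an assumption); it only shows why identical PVC names always map to the same zone while distinct names spread out:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// chooseZone deterministically maps a PVC name to a zone: hash the name,
// divide by the number of zones, and use the remainder as the zone index.
func chooseZone(pvcName string, zones []string) string {
	sort.Strings(zones) // stable zone order so a given name always maps to the same zone
	h := fnv.New32a()
	h.Write([]byte(pvcName))
	return zones[int(h.Sum32())%len(zones)]
}

func main() {
	zones := []string{"us-east-2a", "us-east-2b", "us-east-2c"}

	// The same PVC name (e.g. from identically named StatefulSets in
	// different namespaces) always lands in the same zone.
	fmt.Println(chooseZone("www-web0-0", zones))
	fmt.Println(chooseZone("www-web0-0", zones))

	// Different PVC names spread roughly evenly across the zones.
	fmt.Println(chooseZone("www-web1-0", zones))
	fmt.Println(chooseZone("www-web2-0", zones))
}

With 75 identically named StatefulSets in 75 namespaces, every namespace produces the same two PVC names, so all their volumes are pinned to the same zones; with distinct names the hashes differ and the volumes spread out.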
Oh, and since this is 4.0, setting "volumeBindingMode: WaitForFirstConsumer" in the storage class should fix it too, even with the same PVC names in all namespaces.
It works when `volumeBindingMode: WaitForFirstConsumer` is set:

# cat ~/gp2b.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  creationTimestamp: 2019-02-18T14:16:17Z
  labels:
    cluster.storage.openshift.io/owner-name: cluster-config-v1
    cluster.storage.openshift.io/owner-namespace: kube-system
  name: gp2b
  resourceVersion: "9640"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: c1904b3f-3387-11e9-9c73-0ac06c3388a2
parameters:
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

# oc get clusterversion version -o json | jq -r .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:9f37d93acf2e7442e5bf74f06ca253e37ba299e89bbb66fb30b2cafda6c3d217",
  "version": "4.0.0-0.ci-2019-02-18-105238"
}