Description of problem:

When using OCP 3.11 in AWS with the cloud provider enabled and a default storage class (gp2) set up for dynamic provisioning, if a PVC with the same name is created in N different projects/namespaces, all of the resulting PVs are created in the same AZ.

The StorageClass object definition looks like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: null
  name: gp2
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate

As can be seen from the definition above, no "zone" parameter is specified, so it should comply with the following statement from [0]:

"AWS zone. If no zone is specified, volumes are generally round-robined across all active zones where the OpenShift Container Platform cluster has a node. Zone and zones parameters must not be used at the same time."

[0]: https://docs.openshift.com/container-platform/3.11/install_config/persistent_storage/dynamically_provisioning_pvs.html#aws-elasticblockstore-ebs

Example:

$ oc get node --show-labels
NAME                                      STATUS    ROLES          AGE       VERSION           LABELS
ip-10-0-5-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3a,kubernetes.io/hostname=ip-10-0-5-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-5-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3a,kubernetes.io/hostname=ip-10-0-5-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true
ip-10-0-6-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b,kubernetes.io/hostname=ip-10-0-6-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-6-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b,kubernetes.io/hostname=ip-10-0-6-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true
ip-10-0-7-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c,kubernetes.io/hostname=ip-10-0-7-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-7-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c,kubernetes.io/hostname=ip-10-0-7-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true

$ oc new-project test1
$ oc -n test1 create -f test-pvc.yml
persistentvolumeclaim "test-claim" created
$ oc -n test1 get pvc test-claim
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-970805f0-09e3-11e9-a22e-0af499353f00   1Gi        RWO            gp2            <invalid>
$ oc -n test1 get pv pvc-970805f0-09e3-11e9-a22e-0af499353f00 --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-970805f0-09e3-11e9-a22e-0af499353f00   1Gi        RWO            Delete           Bound     test1/test-claim   gp2                      15s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c

$ oc new-project test2
$ oc -n test2 create -f test-pvc.yml
persistentvolumeclaim "test-claim" created
$ oc -n test2 get pvc test-claim
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8   1Gi        RWO            gp2            9s
$ oc -n test2 get pv pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8 --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8   1Gi        RWO            Delete           Bound     test2/test-claim   gp2                      22s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c

$ oc new-project test3
$ oc -n test3 create -f test-pvc.yml
persistentvolumeclaim "test-claim" created
$ oc -n test3 get pvc test-claim
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00   1Gi        RWO            gp2            9s
$ oc -n test3 get pv pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00 --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00   1Gi        RWO            Delete           Bound     test3/test-claim   gp2                      27s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c

However, if the PVC name is changed, the volume is created in a different AZ:

$ oc new-project test5
$ oc -n test5 create -f test-pvc-t2.yml
persistentvolumeclaim "test-claim-tx" created
$ oc -n test5 get pvc test-claim-tx
NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim-tx   Bound     pvc-2752cf53-09ea-11e9-b02e-068a8ed74732   1Gi        RWO            gp2            8s
$ oc -n test5 get pv pvc-2752cf53-09ea-11e9-b02e-068a8ed74732 --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                 STORAGECLASS   REASON    AGE       LABELS
pvc-2752cf53-09ea-11e9-b02e-068a8ed74732   1Gi        RWO            Delete           Bound     test5/test-claim-tx   gp2                      41s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b

Even after deleting these PVCs and recreating them with the same names, the volumes end up in the same AZ.

Version-Release number of selected component (if applicable):

How reproducible:
Reproducible on the customer side; not yet verified in a Red Hat AWS account.

Steps to Reproduce:
1. Install OCP 3.11 in AWS with the cloud provider enabled and the gp2 storage class set as the default.
2. Create a PVC with the same name in multiple projects (a minimal example manifest is sketched below).
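The exact test-pvc.yml used above is not attached to this report. Based on the output (claim name test-claim, 1Gi, RWO, gp2 as the default class), a minimal claim along these lines should reproduce the behavior; whether storageClassName: gp2 was set explicitly or the default class was relied on is an assumption:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi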
Actual results:
All PVs are created in the same AZ, even though the StorageClass definition itself places no restriction on AZs.

Expected results:
PVs should be balanced across different AZs.

Additional info:
This is an unfortunate result of the volume distribution logic. There is no cache of PVCs and their zones; Kubernetes simply hashes the PVC name and determines the zone based on that hash, roughly:

    hash(pvc.name) mod number_of_zones

So all PVCs with the same name, in any namespace, get the same hash and thus the same zone. See https://github.com/kubernetes/kubernetes/blob/716b25396305b97034b019c13a937fcdfd364f9c/pkg/volume/util/util.go#L674 for the hashing function.

In theory it could be extended to also include the PVC namespace in the hash, so that PVCs with the same name in different namespaces get different hashes. However, that would break StatefulSet scaling: we try to create the PV for each PVC in a StatefulSet in a different zone, so Kubernetes provisions PVs in 3 different zones for a StatefulSet with, say, 3 replicas. If the StatefulSet is later scaled to 4 replicas, the new PV is provisioned in a fourth zone (if such a zone exists). If we change the algorithm that calculates the zones (i.e. the hash) in the middle of a StatefulSet's lifetime, this 4th replica may be created in a zone that already contains an existing replica, while there are still unused zones in the cluster. We need to carefully judge whether breaking StatefulSet scaling is worth fixing this issue.
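For illustration, here is a simplified sketch of the selection logic in Go. It is not the exact upstream code: it omits the special handling of StatefulSet-style PVC names ending in an ordinal (where the ordinal is added as an offset, which is what spreads StatefulSet replicas across zones) and other details, and the zone names are just the ones from this cluster. It only demonstrates why the namespace plays no role in the choice:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// chooseZone sketches how the in-tree AWS provisioner picks a zone for a
// dynamically provisioned volume: the PVC name (the namespace is NOT part
// of the input) is hashed and the hash is taken modulo the number of zones.
func chooseZone(pvcName string, zones []string) string {
	sort.Strings(zones) // zones are walked in a stable, sorted order
	h := fnv.New32()
	h.Write([]byte(pvcName))
	return zones[h.Sum32()%uint32(len(zones))]
}

func main() {
	zones := []string{"eu-west-3a", "eu-west-3b", "eu-west-3c"}
	// Same claim name in different namespaces -> same hash -> same zone.
	fmt.Println(chooseZone("test-claim", zones))
	// A different claim name may hash to a different zone.
	fmt.Println(chooseZone("test-claim-tx", zones))
}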