Bug 1663012 - AWS EBS PV volumes created from pvc with same name are always created under same AZ
Summary: AWS EBS PV volumes created from pvc with same name are always created under same AZ
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.11.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Jan Safranek
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-02 17:29 UTC by Joel Rosental R.
Modified: 2019-07-19 14:52 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-19 14:52:12 UTC
Target Upstream Version:
Embargoed:



Description Joel Rosental R. 2019-01-02 17:29:27 UTC
Description of problem:
When using OCP 3.11 on AWS with the cloud provider enabled and a default storage class (gp2) used for dynamic provisioning, if a PVC with the same name is created in N different projects/namespaces, the resulting PVs are always created in the same AZ.

The storageClass object definition looks like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  creationTimestamp: null
  name: gp2
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
parameters:
  encrypted: "false"
  kmsKeyId: ""
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate

As can be seen from the above definition, there is no "zone" parameter specified, so it should comply with the following statement from [0]:

"AWS zone. If no zone is specified, volumes are generally round-robined across all active zones where the OpenShift Container Platform cluster has a node. Zone and zones parameters must not be used at the same time." 

[0]: https://docs.openshift.com/container-platform/3.11/install_config/persistent_storage/dynamically_provisioning_pvs.html#aws-elasticblockstore-ebs

Example:

$ oc get node --show-labels 
NAME                                      STATUS    ROLES          AGE       VERSION           LABELS
ip-10-0-5-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3a,kubernetes.io/hostname=ip-10-0-5-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-5-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3a,kubernetes.io/hostname=ip-10-0-5-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true
ip-10-0-6-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b,kubernetes.io/hostname=ip-10-0-6-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-6-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b,kubernetes.io/hostname=ip-10-0-6-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true
ip-10-0-7-10.eu-west-3.compute.internal   Ready     infra,master   14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c,kubernetes.io/hostname=ip-10-0-7-10.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/infra=true,node-role.kubernetes.io/master=true
ip-10-0-7-11.eu-west-3.compute.internal   Ready     compute        14d       v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c,kubernetes.io/hostname=ip-10-0-7-11.eu-west-3.compute.internal,logging-infra-fluentd=true,node-role.kubernetes.io/compute=true

$ oc new-project test1
$ oc -n test1 create -f test-pvc.yml 
persistentvolumeclaim "test-claim" created
$ oc -n test1 get pvc test-claim 
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-970805f0-09e3-11e9-a22e-0af499353f00   1Gi        RWO            gp2            <invalid>

$ oc -n test1 get pv pvc-970805f0-09e3-11e9-a22e-0af499353f00 --show-labels 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-970805f0-09e3-11e9-a22e-0af499353f00   1Gi        RWO            Delete           Bound     test1/test-claim   gp2                      15s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c

$ oc new-project test2
$ oc -n test2 create -f test-pvc.yml 
persistentvolumeclaim "test-claim" created
$ oc -n test2 get pvc test-claim 
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8   1Gi        RWO            gp2            9s

$ oc -n test2 get pv pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8 --show-labels 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-01661c70-09e4-11e9-a4aa-0e1ec77e95a8   1Gi        RWO            Delete           Bound     test2/test-claim   gp2                      22s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c


$ oc new-project test3
$ oc -n test3 create -f test-pvc.yml 
persistentvolumeclaim "test-claim" created
$ oc -n test3 get pvc test-claim 
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim   Bound     pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00   1Gi        RWO            gp2            9s
$ oc -n test3 get pv pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00 --show-labels 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE       LABELS
pvc-2d4b8cc5-09e4-11e9-a22e-0af499353f00   1Gi        RWO            Delete           Bound     test3/test-claim   gp2                      27s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3c

However, if the PVC name is changed, the PV gets created in a different AZ:

$ oc new-project test5
$ oc -n test5 create -f test-pvc-t2.yml 
persistentvolumeclaim "test-claim-tx" created
$ oc -n test5 get pvc test-claim-tx
NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-claim-tx   Bound     pvc-2752cf53-09ea-11e9-b02e-068a8ed74732   1Gi        RWO            gp2            8s

$ oc -n test5 get pv pvc-2752cf53-09ea-11e9-b02e-068a8ed74732 --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                 STORAGECLASS   REASON    AGE       LABELS
pvc-2752cf53-09ea-11e9-b02e-068a8ed74732   1Gi        RWO            Delete           Bound     test5/test-claim-tx   gp2                      41s       failure-domain.beta.kubernetes.io/region=eu-west-3,failure-domain.beta.kubernetes.io/zone=eu-west-3b

Even after deleting these PVCs and creating them again with the same name, the new PVs are created in the same AZ.

Version-Release number of selected component (if applicable):


How reproducible:
Reproducible on the customer side; I haven't tested it in an RH AWS account.

Steps to Reproduce:
1. Install OCP 3.11 on AWS with the cloud provider enabled and the gp2 storage class set as the default.
2. Create a PVC with the same name in multiple projects (a minimal example manifest is sketched below).
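
The exact test-pvc.yml is not shown here; a minimal claim along these lines reproduces the behaviour (a hypothetical reconstruction matching the 1Gi/RWO/gp2 output above, not the customer's actual file):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi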


Actual results:
All PVs are created in the same AZ even though the StorageClass definition itself places no restriction on AZs.

Expected results:
PVs should be balanced across different AZs.

Additional info:

Comment 1 Jan Safranek 2019-01-03 14:26:10 UTC
This is an unfortunate result of the volume distribution logic. There is no cache of PVCs and their zones; Kubernetes simply hashes the PVC name and determines the zone from the hash like this:

    hash(pvc.name) mod number_of_zones

So all PVCs with the same name in all namespaces get the same hash and thus the same zone. See https://github.com/kubernetes/kubernetes/blob/716b25396305b97034b019c13a937fcdfd364f9c/pkg/volume/util/util.go#L674 for the hashing function.
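
A minimal sketch of the idea in Go follows; this is a reduced model for illustration, not the exact upstream code (the real logic is ChooseZoneForVolume in the file linked above):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZone picks a zone deterministically from the PVC name alone. The
// namespace is never part of the input, so test1/test-claim, test2/test-claim
// and test3/test-claim all map to the same zone.
func chooseZone(zones []string, pvcName string) string {
	sort.Strings(zones) // stable ordering, so the same hash always picks the same zone

	base, ordinal := pvcName, uint32(0)
	// StatefulSet-style claims ("data-db-0", "data-db-1", ...) are split into a
	// base name and an ordinal so consecutive replicas land in consecutive zones.
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			base, ordinal = pvcName[:i], uint32(n)
		}
	}

	h := fnv.New32()
	h.Write([]byte(base))
	return zones[(h.Sum32()+ordinal)%uint32(len(zones))]
}

func main() {
	zones := []string{"eu-west-3a", "eu-west-3b", "eu-west-3c"}
	fmt.Println(chooseZone(zones, "test-claim"))    // same result on every call, in every namespace
	fmt.Println(chooseZone(zones, "test-claim-tx")) // different name => different hash, possibly a different zone
}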

In theory, the hash could be extended to also include the PVC namespace, so that PVCs with the same name in different namespaces get different hashes.

However, it would break StatefulSet scaling: we try to create the PV for each PVC in a StatefulSet in a different zone. Kubernetes provisions PVs in 3 different zones for a StatefulSet with, say, 3 replicas. If the StatefulSet is scaled to 4 replicas, the new PV will be provisioned in a fourth zone (if such a zone exists). If we change the algorithm that calculates the zones (i.e. the hash) in the middle of a StatefulSet's lifetime, this 4th replica may be created in a zone that already contains an existing replica, while there are still unused zones in the cluster.
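
To illustrate the hazard with plain numbers (made-up hash values and a hypothetical 4-zone region, not the real hashing code):

package main

import "fmt"

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c", "zone-d"}
	oldHash := uint32(0) // hash of the claim base name under the current scheme
	newHash := uint32(2) // hash after hypothetically mixing the namespace in

	// Replicas 0-2 were provisioned with the old hash: zone-a, zone-b, zone-c.
	for ord := uint32(0); ord < 3; ord++ {
		fmt.Printf("replica %d -> %s\n", ord, zones[(oldHash+ord)%4])
	}

	// Replica 3, created after the hash input changed, uses the new hash:
	// (2+3) mod 4 = 1 -> zone-b, which already holds replica 1, while zone-d
	// remains unused.
	fmt.Printf("replica 3 -> %s\n", zones[(newHash+3)%4])
}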

We need to carefully judge if breaking StatefulSet scaling is worth fixing this issue.

