Bug 1670241

Summary: How do gp2 PVs choose a zone?
Product: OpenShift Container Platform Reporter: Hongkai Liu <hongkliu>
Component: Storage Assignee: Hemant Kumar <hekumar>
Status: CLOSED WONTFIX QA Contact: Liang Xia <lxia>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: aos-bugs, aos-storage-staff, hongkliu, jsafrane, mifiedle
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-18 20:35:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:

Description Hongkai Liu 2019-01-29 01:57:28 UTC
Description of problem:
3 masters and 6 workers (m5.4xlarge)
6 workers: 2 in us-east-2a, 2 in us-east-2b, 2 in us-east-2c

75 StatefulSets behave differently when they are all in ONE project versus spread across 75 projects.

How does a gp2 PV choose its availability zone when it gets created?
What is the difference between one project and 75 projects?

Version-Release number of selected component (if applicable):
# oc get clusterversion version -o json | jq -r .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ef5a60a10812f2fa1e4c93a5042c1520ca55675f9b4085b08579510d71031047",
  "version": "4.0.0-0.nightly-2019-01-25-205123"
}

How reproducible: 3/3

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:



75 projects, each with one StatefulSet with 2 replicas, and each pod has a PVC generated by `volumeClaimTemplates`:
# oc get pv --no-headers | wc -l
150
# oc get pod --all-namespaces | grep clusterb | grep Running | wc -l
150
# oc describe pv | grep zone  | grep 2a | wc -l
50
# oc describe pv | grep zone  | grep 2b | wc -l
50
# oc describe pv | grep zone  | grep 2c | wc -l
50


However, if one project has all 75 StatefulSets:
# oc describe pv | grep zone  | grep 2a | wc -l
50
# oc describe pv | grep zone  | grep 2b | wc -l
0
# oc describe pv | grep zone  | grep 2c | wc -l
75

# oc describe pod -n clusteraproject37                            web0-0
Name:               web0-0
Namespace:          clusteraproject37
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=server0
                    controller-revision-hash=web0-7c994b8d69
                    statefulset.kubernetes.io/pod-name=web0-0
Annotations:        openshift.io/scc=restricted
Status:             Pending
IP:                 
Controlled By:      StatefulSet/web0
Containers:
  server:
    Image:      openshift/hello-openshift
    Port:       8080/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     1
      memory:  256Mi
    Requests:
      cpu:        500m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /mydata from www (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z6hvb (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  www:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  www-web0-0
    ReadOnly:   false
  default-token-z6hvb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z6hvb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  14m (x25 over 14m)  default-scheduler  pod has unbound PersistentVolumeClaims (repeated 6 times)
  Warning  FailedScheduling  4m (x356 over 12m)  default-scheduler  0/9 nodes are available: 2 node(s) had no available volume zone, 3 node(s) had taints that the pod didn't tolerate, 4 node(s) exceed max volume count.

Comment 1 Hongkai Liu 2019-01-29 01:59:33 UTC
75 projects, each has one StatefulSet with 2 REPLICAS, and each pod has a PVC generated by `volumeClaimTemplates`

Comment 2 Hongkai Liu 2019-02-01 21:07:21 UTC
Tried with k8s 1.12. The problem is still there.

# oc get clusterversion version -o json | jq .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:d03ce0ef85540a1fff8bfc1c408253404aaecb2b958d7c3f24896f3597c3715b",
  "version": "4.0.0-0.nightly-2019-01-30-145955"
}
# oc version
oc v4.0.0-0.150.0
kubernetes v1.12.4+f39ab668d3
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://hongkliu28-api.qe.devcluster.openshift.com:6443
kubernetes v1.12.4+f39ab668d3

Comment 3 Jan Safranek 2019-02-18 15:44:51 UTC
Placement of dynamically provisioned volumes is based *only* on the PVC name. The name is hashed, the hash is divided by the number of zones, and the remainder is used as the index of the zone. This works well if each PVC has a different name - their hashes differ and PVs are provisioned roughly equally among the zones.

If PVCs have the same names (in different namespaces), they have the same hash and their PVs are provisioned in the same zones. There is bug #1663012 that tries to fix that, but changing the hashing algorithm in a Kubernetes update looks like a significant behavior change.

Can you use different StatefulSet names in each namespace? It should help you with this issue.
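For illustration, here is a minimal Go sketch of that scheme. It is a simplification of what the in-tree ChooseZoneForVolume helper does (the real code also takes the StatefulSet ordinal into account, among other things); the zone list and claim names below are made up, but the key property is the same: identical claim names always hash to the same zone.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// chooseZone sketches the idea: hash the claim name and use the remainder
// after dividing by the number of zones as the zone index.
func chooseZone(zones []string, pvcName string) string {
    sort.Strings(zones) // stable ordering, so a given name always maps to the same zone
    h := fnv.New32()
    h.Write([]byte(pvcName))
    return zones[h.Sum32()%uint32(len(zones))]
}

func main() {
    zones := []string{"us-east-2a", "us-east-2b", "us-east-2c"}
    // The same claim name in every namespace -> same hash -> same zone every time.
    fmt.Println(chooseZone(zones, "www-web0-0"))
    // A different StatefulSet (and therefore claim) name gives a different hash
    // and usually a different zone.
    fmt.Println(chooseZone(zones, "www-web37-0"))
}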

Comment 4 Jan Safranek 2019-02-18 16:34:05 UTC
Oh, and since this is 4.0, setting "volumeBindingMode: WaitForFirstConsumer" in the storage class should fix it too, even with the same PVC names in all namespaces.

Comment 5 Hongkai Liu 2019-02-18 20:35:05 UTC
It works when `volumeBindingMode: WaitForFirstConsumer` is set.


# cat ~/gp2b.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  creationTimestamp: 2019-02-18T14:16:17Z
  labels:
    cluster.storage.openshift.io/owner-name: cluster-config-v1
    cluster.storage.openshift.io/owner-namespace: kube-system
  name: gp2b
  resourceVersion: "9640"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: c1904b3f-3387-11e9-9c73-0ac06c3388a2
parameters:
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

# oc get clusterversion version -o json | jq -r .status.desired
{
  "image": "registry.svc.ci.openshift.org/ocp/release@sha256:9f37d93acf2e7442e5bf74f06ca253e37ba299e89bbb66fb30b2cafda6c3d217",
  "version": "4.0.0-0.ci-2019-02-18-105238"
}