Bug 1698083

Summary: Storage classes in multi-AZ clusters don't work reliably
Product: OpenShift Container Platform Reporter: Eric Rich <erich>
Component: Storage Assignee: Bradley Childs <bchilds>
Status: CLOSED NOTABUG QA Contact: Liang Xia <lxia>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 1.0.0 CC: aos-bugs, aos-storage-staff
Target Milestone: --- Keywords: NeedsTestCase
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-09 15:56:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1664187    

Description Eric Rich 2019-04-09 15:10:56 UTC
This bug was initially created as a copy of Bug #1694760

I am copying this bug because: 

Description of problem:

On a multi-AZ v3.11 cluster with a gp2 storage class that isn't restricted to a single AZ, two PVCs belonging to the same pod can be provisioned in two different AZs. When that happens, the pod can't be scheduled anywhere: an EBS volume can only be attached to a node in its own AZ, so no node in the cluster can mount both PVs.

One workaround is to create a storage class for each AZ and have the user specify which storage class to use each time a PVC is created. This works, but it's inconvenient, and it requires every user to know about the extra step. If users don't specify which SC to use, all new PVs are created in the AZ of the default SC, which could lead to a disproportionate share of the cluster's workload running on nodes in a single AZ.
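
For illustration, a per-AZ storage class and a PVC that selects it might look like the sketch below. The names `gp2-us-east-1a` and `app-data`, the zone `us-east-1a`, and the 1Gi size are all hypothetical; the sketch assumes the in-tree `kubernetes.io/aws-ebs` provisioner and its `zone` parameter.

```yaml
# Hypothetical per-AZ storage class (name and zone are illustrative only).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-us-east-1a
parameters:
  type: gp2
  zone: us-east-1a        # every volume from this class is created in this AZ
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
---
# Hypothetical PVC; the user has to name the per-AZ class explicitly.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  storageClassName: gp2-us-east-1a
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Every PVC that belongs to the same pod would need to reference the same per-AZ class, which is the extra step users have to remember.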

Version-Release number of selected component (if applicable):

OpenShift v3.11

How reproducible:

Steps to Reproduce:
1. From a template, create a deployment that provisions two PVCs that are mounted to the same pod.
2.
3.

Actual results:
The PVCs are (sometimes) provisioned in two different AZs, which makes the pod unschedulable.

Expected results:
The PVCs will be created in the same AZ so that the pod can mount both of them. 

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 1 Eric Rich 2019-04-09 15:56:49 UTC
This seems to be mitigated by https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/ 

The default storage class has `volumeBindingMode` set:

$ cat must-gather/cluster-scoped-resources/storage.k8s.io/storageclasses/gp2.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2019-04-09T15:30:53Z
  name: gp2
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: 55c725bb-5adb-11e9-8aa8-02d4f8d6a68e
  resourceVersion: "7951"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
  uid: 7608d8ae-5adc-11e9-b839-02b822a6f0f6
parameters:
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
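
As a rough sketch of why this setting helps, consider two PVCs mounted by one pod, as in the reproduction steps. All names (`data-a`, `data-b`, `two-volumes`), the image, and the sizes are hypothetical. With `WaitForFirstConsumer` on the default class, both claims should stay Pending until the pod is scheduled, and the volumes are then provisioned in the AZ of the node the scheduler picked.

```yaml
# Two hypothetical PVCs that rely on the default (gp2) storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-a
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-b
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# Hypothetical pod that mounts both claims; provisioning is deferred
# until this pod has been assigned to a node.
apiVersion: v1
kind: Pod
metadata:
  name: two-volumes
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    volumeMounts:
    - name: vol-a
      mountPath: /data/a
    - name: vol-b
      mountPath: /data/b
  volumes:
  - name: vol-a
    persistentVolumeClaim:
      claimName: data-a
  - name: vol-b
    persistentVolumeClaim:
      claimName: data-b
```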