Bug 1683750

Summary: OCP4 from Installer 0.13 creates StorageClass that creates PVs in availability zones that have no machines
Product: OpenShift Container Platform
Reporter: Wolfgang Kulhanek <wkulhane>
Component: Storage
Assignee: Matthew Wong <mawong>
Status: CLOSED DUPLICATE
QA Contact: Liang Xia <lxia>
Severity: urgent
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, aos-storage-staff, mawong, sponnaga
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-03-14 17:25:48 UTC
Type: Bug

Description Wolfgang Kulhanek 2019-02-27 17:38:45 UTC
Description of problem:
So I stood up a 0.13 cluster this morning. It appears that PVC provisioning is not working quite right.

I configured the cluster with 2 workers, so my machinesets in us-east-1a and us-east-1b are each scaled to 1 instance.

I then deploy a template with a database (e.g. dancer-mysql-persistent) and a PVC gets created in the project. After a while the PV is created and bound.

BUT what I am seeing is that the PV is created in us-east-1c, which has no active nodes, so the app stays in Pending with "2 node(s) had volume node affinity conflict".
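A quick way to confirm the mismatch (a rough sketch, assuming a standard OCP 4 install on AWS and a logged-in oc client) is to compare the zones that actually have nodes against the zone recorded on the PV:

# zones that have worker nodes
oc get nodes -L failure-domain.beta.kubernetes.io/zone

# machinesets and their replica counts per zone
oc get machinesets -n openshift-machine-api

# zone labels on the provisioned PVs
oc get pv --show-labels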

Version-Release number of selected component (if applicable):
oc adm release info
Name:      4.0.0-0.5
Digest:    sha256:2de0de1c56c2b8de6b57c733db9397d206ff3b3328bd50d1bf1613cd5ba709c6
Created:   2019-02-27 00:08:21 +0000 UTC
OS/Arch:   linux/amd64
Manifests: 244

Release Metadata:
  Version:  4.0.0-0.5
  Upgrades: 4.0.0-0.4

Component Versions:
  Kubernetes 1.12.4

How reproducible:

Steps to Reproduce:
1. See above

Actual results:
oc describe pv
Name:              pvc-aaef2978-3a99-11e9-8307-0e3610fc2722
Labels:            failure-domain.beta.kubernetes.io/region=us-east-1
                   failure-domain.beta.kubernetes.io/zone=us-east-1c
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             wk-test/mysql
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          1Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east-1c]
                   failure-domain.beta.kubernetes.io/region in [us-east-1]
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1c/vol-028e209aad0cf2577
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

Expected results:
The PV should be created in us-east-1a or us-east-1b, the zones that actually have nodes.

StorageClass Dump (if StorageClass used by PV/PVC):
apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: 2019-02-27T17:34:07Z
    labels:
      cluster.storage.openshift.io/owner-name: cluster-config-v1
      cluster.storage.openshift.io/owner-namespace: kube-system
    name: gp2
    resourceVersion: "156690"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
    uid: e280d120-3ab5-11e9-b930-0e3610fc2722
  parameters:
    type: gp2
  provisioner: kubernetes.io/aws-ebs
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Additional info:

Comment 1 Matthew Wong 2019-02-27 18:41:10 UTC
https://github.com/openshift/cluster-storage-operator/pull/12, which changes the StorageClass volumeBindingMode to WaitForFirstConsumer, merged only yesterday, so I guess we missed the boat on 0.13. We can wait for the next version. In the meantime, create your own StorageClass with volumeBindingMode WaitForFirstConsumer and mark it as the default instead, by removing the default annotation from the current class and adding it to yours.
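For reference, a rough sketch of that workaround (the class name gp2-wffc is just a placeholder, not something shipped by the installer):

# remove the default annotation from the operator-created class
oc annotate storageclass gp2 storageclass.kubernetes.io/is-default-class-

# create a replacement default class that delays provisioning until a pod is scheduled
cat <<EOF | oc create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-wffc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF

With WaitForFirstConsumer, the EBS volume is only created once a pod using the claim has been scheduled, so the provisioner picks the zone of the node the pod landed on instead of an arbitrary zone in the region.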

Comment 2 Wolfgang Kulhanek 2019-02-27 19:02:29 UTC
@matthew Thanks, I did that. It seems to work fine.

Comment 4 Matthew Wong 2019-03-14 17:25:48 UTC
Fixed by https://github.com/openshift/cluster-storage-operator/commit/b850242280b7ef2cf7631952229c0a438ec39e64 and installer 0.14. Marking as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1664145 and tracking it there.
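On a cluster installed with 0.14 or later, the change can be sanity-checked with something like:

oc get storageclass gp2 -o jsonpath='{.volumeBindingMode}{"\n"}'
# expected: WaitForFirstConsumer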

*** This bug has been marked as a duplicate of bug 1664145 ***