Bug 1509028

Summary: GCE dynamic provisioning going to wrong zone
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Status: CLOSED ERRATA
QA Contact: Liang Xia <lxia>
Severity: high
Docs Contact:
Priority: high
Version: 3.6.0
CC: aos-bugs, aos-storage-staff, bchilds, erich, gpei, jkaur, pschiffe, tsmetana, wmeng
Target Milestone: ---
Keywords: NeedsTestCase
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using an OpenShift cluster in a GCE multiple-zone setup, dynamically provisioned persistent volumes might be created in zones with no running nodes.
Consequence: Pods using a volume provisioned in a zone with no nodes could not run.
Fix: The GCE cloud provider has been fixed to provision persistent volumes only in zones with running nodes.
Result: Pods in multi-zone clusters should no longer fail to start because they do not fit on any node.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 14:09:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1531444
Bug Blocks:

Description Eric Jones 2017-11-02 17:53:48 UTC
Description of problem:
When using GCE for hosting the OCP cluster across multiple zones and using dynamically provisioned storage, the storage periodically gets provisioned in a zone with no running nodes, so pods using the volume cannot be scheduled.

Version-Release number of selected component (if applicable):
3.6

Additional info:
Found a known Kubernetes issue [0] describing this problem; it still seems to be open.

[0] https://github.com/kubernetes/kubernetes/issues/50115

Comment 2 Pavel Pospisil 2017-11-03 19:53:18 UTC
*** Bug 1490477 has been marked as a duplicate of this bug. ***

Comment 3 Tomas Smetana 2017-11-06 08:48:06 UTC
There is a workaround described in bug #1490477.

Comment 4 Eric Jones 2017-11-07 17:50:21 UTC
Hi,

Just to be clear, the workaround described in the other bug is:


As a workaround, you should always have at least one node in every zone that has a master.


And to clarify: does this mean that if you have 3 masters, each in a different zone, the workaround is simply to have at least one node in each of those zones, so that the pod can be scheduled to whichever zone has access to the GCE storage?

Or am I missing something about how Kubernetes interacts with GCE storage?

Comment 5 Eric Jones 2017-11-07 18:06:56 UTC
Having spoken with Bradley Childs on IRC, I am removing the needinfo flag, as he explained the workaround.

Rather than necessarily having one node per zone that has a master, the workaround should be understood as "have at least one master per zone with nodes". This is because the provisioner for GCE runs on the master and therefore only knows about the zone that master is configured for. This means that if you have nodes in zones without a master, there will be nothing in that zone to tell GCE that it needs storage there.
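For reference, one quick way to see which zones actually have nodes is to list the zone label on each node (a sketch; failure-domain.beta.kubernetes.io/zone is the standard zone label applied to nodes in this Kubernetes release line):

# oc get nodes -L failure-domain.beta.kubernetes.io/zone

Any zone that does not appear in the output is a zone where a dynamically provisioned volume would have no node to run its pod.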

Comment 14 Tomas Smetana 2018-01-12 09:04:59 UTC
https://github.com/kubernetes/kubernetes/pull/55039

Comment 16 Liang Xia 2018-01-18 08:56:56 UTC
Tested on OCP 3.9 with version v3.9.0-0.20.0, and the issue has been fixed.

# openshift version
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8


Verification steps:
ENV of OCP 3.9:
   total: 1 master, 1 node.
   1 master in zone us-central1-a.
   1 node   in zone us-central1-a.

Scenario 1:
1. Create a storage class without setting parameters.zones.
2. Create 100 PVCs using the above storage class, and check the dynamically provisioned volumes (a sketch of the objects used follows below).
Result: all 100 volumes were provisioned in zone us-central1-a.
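For reference, a minimal sketch of the objects used in this scenario (the class name "no-zones" is illustrative, not taken from the bug; the pvcname prefix matches the naming visible in the describe output further down):

cat <<EOF | oc create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: no-zones                  # illustrative name; parameters.zones intentionally omitted
provisioner: kubernetes.io/gce-pd
EOF

for i in $(seq -w 1 100); do
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvcname$i
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: no-zones
  resources:
    requests:
      storage: 1Gi
EOF
done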

Scenario 2:
1. Create a storage class with parameters.zones set to "us-central1-a,us-central1-b".
2. Create 100 PVCs using the above storage class, and check the dynamically provisioned volumes (a sketch of the storage class follows below).
Result: 51 volumes were provisioned in zone us-central1-a.
51 PVCs were in Bound status, 49 in Pending status (the provisioner spreads volumes across the listed zones; us-central1-b has no node, so those provisioning attempts fail).
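The storage class for this scenario would look roughly like this (a sketch; the class name "zones" matches the StorageClass field in the describe output below):

cat <<EOF | oc create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zones
provisioner: kubernetes.io/gce-pd
parameters:
  zones: us-central1-a,us-central1-b    # us-central1-b has no nodes in this environment
EOF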

Checking one of the pending PVCs:
# oc describe pvc pvcname095
Name:          pvcname095
Namespace:     default
StorageClass:  zones
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:      
Access Modes:  
Events:
  Type     Reason              Age              From                         Message
  ----     ------              ----             ----                         -------
  Warning  ProvisioningFailed  3s (x7 over 1m)  persistentvolume-controller  Failed to provision volume with StorageClass "zones": kubernetes does not have a node in zone "us-central1-b"

The error message clearly shows why the volume failed to provision.
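For completeness, one quick way to get the Bound/Pending counts above (a sketch; column 2 of the "oc get pvc" output is the status column):

# oc get pvc --no-headers | awk '{print $2}' | sort | uniq -c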


Overall, the result is acceptable from QE's perspective.

Feel free to move back if you do not agree.

Comment 19 errata-xmlrpc 2018-03-28 14:09:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489