Description of problem:
After creating a dynamic PVC, the PVC stays in Pending status. Describing the PVC shows:
Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones

Version-Release number of selected component (if applicable):
openshift v3.9.0-0.15.0
kubernetes v1.9.0-beta1
etcd 3.2.8

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP 3.9 cluster on GCP.
2. Check that there is a default storage class.
3. Create a PVC.
4. Check the PVC status.

Actual results:
The PVC stays Pending; describing the PVC shows:
Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones

Expected results:
The PVC binds to a dynamically provisioned PV.

Additional info:

$ cat pvc.json
{
    "kind": "PersistentVolumeClaim",
    "apiVersion": "v1",
    "metadata": {
        "name": "pvc"
    },
    "spec": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}

# oc get nodes
NAME                             STATUS                     ROLES     AGE       VERSION
qe-lxia-master-etcd-1            Ready,SchedulingDisabled   <none>    3h        v1.9.0-beta1
qe-lxia-node-registry-router-1   Ready                      <none>    3h        v1.9.0-beta1

# oc get sc -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "storage.k8s.io/v1",
            "kind": "StorageClass",
            "metadata": {
                "annotations": {
                    "storageclass.beta.kubernetes.io/is-default-class": "true"
                },
                "creationTimestamp": "2018-01-05T03:14:21Z",
                "name": "standard",
                "namespace": "",
                "resourceVersion": "1581",
                "selfLink": "/apis/storage.k8s.io/v1/storageclasses/standard",
                "uid": "867710a2-f1c6-11e7-89e3-42010af00002"
            },
            "parameters": {
                "type": "pd-standard"
            },
            "provisioner": "kubernetes.io/gce-pd",
            "reclaimPolicy": "Delete"
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

# oc describe pvc pvc
Name:          pvc
Namespace:     default
StorageClass:  standard
Status:        Pending
Volume:
Labels:        name=dynamic-pvc
Annotations:   volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:
Access Modes:
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  1h (x478 over 3h)  persistentvolume-controller  Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones
All features related to dynamic provisioning are blocked from testing.
Reproduced on GCE; no extra configuration is required beyond the cloud-config and cloud-provider settings in master-config.yaml.
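For reference, the relevant piece of master-config.yaml looks roughly like this. This is a minimal sketch assuming the documented GCE setup; the /etc/origin/cloudprovider/gce.conf path is illustrative and not taken from the affected cluster:

    kubernetesMasterConfig:
      apiServerArguments:
        cloud-provider:
          - "gce"
        cloud-config:
          - "/etc/origin/cloudprovider/gce.conf"
      controllerArguments:
        cloud-provider:
          - "gce"
        cloud-config:
          - "/etc/origin/cloudprovider/gce.conf"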
The GCE cloud provider uses a shared node informer to get events about nodes, but the OpenShift controller manager never starts it. See https://github.com/openshift/ose/blob/e0f1109f8d3b42b9f4fcfa29296f616e4c0449ab/pkg/cmd/server/start/start_kube_controller_manager.go#L32

OpenShift calls Kubernetes controllerapp.CreateControllerContext:

    ret, err := oldContextFunc(s, rootClientBuilder, clientBuilder, stop)
    if err != nil {
        return controllerapp.ControllerContext{}, err
    }

This creates a new shared informer factory, which is passed to the GCE cloud provider at https://github.com/openshift/ose/blob/e0f1109f8d3b42b9f4fcfa29296f616e4c0449ab/vendor/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go#L449

The GCE cloud provider creates its own node informer from that factory. But when the code returns to newKubeControllerContext(), the informer factory is replaced with a new one:

    // Overwrite the informers. Since nothing accessed the existing informers that we're overwriting, they are inert.
    // TODO Remove this. It keeps in-process memory utilization down, but we shouldn't do it.
    ret.InformerFactory = newGenericInformers(informers)

-> the informer that the GCE cloud provider holds never starts and never syncs.

I removed newGenericInformers(informers) just to check that this theory is correct, and GCE started provisioning. I did not check quota, though.
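The failure mode can be reproduced outside the controller manager with a small, self-contained Go sketch. This is illustrative only: the cloudProvider type below stands in for the real GCE provider's SetInformers, and a fake clientset stands in for the API server. An informer created from a factory that is never started can never sync, which is exactly the "Node informer is not synced" symptom:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes/fake"
        "k8s.io/client-go/tools/cache"
    )

    // cloudProvider stands in for the GCE provider: it grabs a node informer
    // from whatever factory it is handed (analogous to SetInformers).
    type cloudProvider struct {
        nodeInformer cache.SharedIndexInformer
    }

    func (c *cloudProvider) SetInformers(f informers.SharedInformerFactory) {
        c.nodeInformer = f.Core().V1().Nodes().Informer()
    }

    func main() {
        client := fake.NewSimpleClientset()

        // Factory #1 is handed to the cloud provider, like CreateControllerContext
        // does before OpenShift overwrites the context's InformerFactory.
        original := informers.NewSharedInformerFactory(client, 0)
        provider := &cloudProvider{}
        provider.SetInformers(original)

        // Factory #2 is the replacement (the newGenericInformers overwrite).
        // Only this one is ever started; factory #1 never runs.
        replacement := informers.NewSharedInformerFactory(client, 0)
        stop := make(chan struct{})
        defer close(stop)
        replacement.Start(stop)

        time.Sleep(time.Second)

        // The provider's informer belongs to the unstarted factory, so HasSynced()
        // stays false forever -- the "Node informer is not synced" symptom.
        fmt.Println("node informer synced:", provider.nodeInformer.HasSynced())
    }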
The GCE cloud provider uses the node informer to make sure PVs do not get provisioned in zones that have no nodes. I was actually backporting the fix (the backport is not going to be used after all), and the easiest option seemed to be to re-initialize the cloud provider in pkg/cmd/server/start/start_kube_controller_manager.go newKubeControllerContext(). This is not really pretty, however, since it might cause trouble in the future. Ideally we would not rewrite the informer factory at all...
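For the record, the discarded backport looked roughly like the following sketch. This is not the actual patch; it only illustrates the "re-initialize the cloud provider" idea, assuming the s.CloudProvider/s.CloudConfigFile options from the upstream CMServer config and the cloudprovider.InformerUser interface the GCE provider implements in 1.9:

    // In newKubeControllerContext(), after the informers have been overwritten
    // (sketch of the discarded backport, not the actual patch):
    ret.InformerFactory = newGenericInformers(informers)

    // Re-initialize the cloud provider and register the final informer factory
    // with it, so its node informer belongs to a factory that actually gets started.
    cloud, err := cloudprovider.InitCloudProvider(s.CloudProvider, s.CloudConfigFile)
    if err != nil {
        return controllerapp.ControllerContext{}, err
    }
    if informerUser, ok := cloud.(cloudprovider.InformerUser); ok {
        informerUser.SetInformers(ret.InformerFactory)
    }

The downside is that this creates a second cloud provider instance behind the back of the upstream code, which is why it is not pretty.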
It's not a good idea to make use of the context while it is still being initialized. See https://github.com/openshift/origin/pull/18097 for an option to address this.
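The general shape of such a fix, as an illustrative sketch only (the actual PR may do it differently), is to build the shared informer factory once, up front, and hand that same factory to both the cloud provider and the controller context, so nothing has to be overwritten after consumers have already registered with it:

    // Illustrative only: one factory, created first, passed everywhere, never replaced.
    factory := informers.NewSharedInformerFactory(client, resyncPeriod)

    cloud, err := cloudprovider.InitCloudProvider(cloudProviderName, cloudConfigFile)
    if err != nil {
        return err
    }
    if informerUser, ok := cloud.(cloudprovider.InformerUser); ok {
        // The node informer is registered with the factory that will be started.
        informerUser.SetInformers(factory)
    }

    ctx := controllerapp.ControllerContext{
        InformerFactory: factory, // same factory for every consumer
        Stop:            stop,
        // ... remaining fields as in CreateControllerContext
    }
    factory.Start(ctx.Stop)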
Checked on v3.9.0-0.20.0; the issue described in comment 0 has been fixed.

# openshift version
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8
============================================================================
# oc get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc       Bound     pvc-61d73cca-fbf8-11e7-85aa-42010af00007   1Gi        RWO            standard       3m
============================================================================
# oc describe pvc pvc
Name:          pvc
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-61d73cca-fbf8-11e7-85aa-42010af00007
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:      1Gi
Access Modes:  RWO
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  ProvisioningSucceeded  3m    persistentvolume-controller  Successfully provisioned volume pvc-61d73cca-fbf8-11e7-85aa-42010af00007 using kubernetes.io/gce-pd