Bug 1531444 - Dynamic pvc provision keeps pending with error "Node informer is not synced when trying to GetAllCurrentZones"
Summary: Dynamic pvc provision keeps pending with error "Node informer is not synced when trying to GetAllCurrentZones"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: David Eads
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks: 1509028
 
Reported: 2018-01-05 06:43 UTC by Liang Xia
Modified: 2018-06-18 18:27 UTC (History)
12 users

Fixed In Version: openshift v3.9.0-0.20.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-18 17:36:52 UTC
Target Upstream Version:
Embargoed:



Description Liang Xia 2018-01-05 06:43:12 UTC
Description of problem:
After creating a dynamic PVC, the PVC stays in Pending status.
Describing the PVC shows:
Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones

Version-Release number of selected component (if applicable):
openshift v3.9.0-0.15.0
kubernetes v1.9.0-beta1
etcd 3.2.8

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP 3.9 cluster on GCP.
2. Check there is a default storage class.
3. Create a pvc.
4. Check the pvc status.

Actual results:
The PVC stays Pending; describing the PVC shows:
Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones


Expected results:
The PVC binds to a dynamically provisioned PV.


Additional info:
$ cat pvc.json
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "pvc"
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}

# oc get nodes
NAME                             STATUS                     ROLES     AGE       VERSION
qe-lxia-master-etcd-1            Ready,SchedulingDisabled   <none>    3h        v1.9.0-beta1
qe-lxia-node-registry-router-1   Ready                      <none>    3h        v1.9.0-beta1

# oc get sc -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "storage.k8s.io/v1",
            "kind": "StorageClass",
            "metadata": {
                "annotations": {
                    "storageclass.beta.kubernetes.io/is-default-class": "true"
                },
                "creationTimestamp": "2018-01-05T03:14:21Z",
                "name": "standard",
                "namespace": "",
                "resourceVersion": "1581",
                "selfLink": "/apis/storage.k8s.io/v1/storageclasses/standard",
                "uid": "867710a2-f1c6-11e7-89e3-42010af00002"
            },
            "parameters": {
                "type": "pd-standard"
            },
            "provisioner": "kubernetes.io/gce-pd",
            "reclaimPolicy": "Delete"
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

# oc describe pvc pvc
Name:          pvc
Namespace:     default
StorageClass:  standard
Status:        Pending
Volume:        
Labels:        name=dynamic-pvc
Annotations:   volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:      
Access Modes:  
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  1h (x478 over 3h)  persistentvolume-controller  Failed to provision volume with StorageClass "standard": Node informer is not synced when trying to GetAllCurrentZones

Comment 1 Liang Xia 2018-01-05 09:46:50 UTC
All features related to dynamic provisioning are blocked from testing.

Comment 2 Jan Safranek 2018-01-08 14:51:43 UTC
Reproduced on GCE. No extra configuration is required, just cloud-config and cloud-provider set in master-config.yaml.

Comment 3 Jan Safranek 2018-01-08 16:16:44 UTC
The GCE cloud provider uses a shared informer to receive node events, but the OpenShift controller manager never starts it.

See https://github.com/openshift/ose/blob/e0f1109f8d3b42b9f4fcfa29296f616e4c0449ab/pkg/cmd/server/start/start_kube_controller_manager.go#L32

OpenShift calls the Kubernetes function controllerapp.CreateControllerContext:

	ret, err := oldContextFunc(s, rootClientBuilder, clientBuilder, stop)
	if err != nil {
		return controllerapp.ControllerContext{}, err
	}

This creates a new shared informer factory, which is passed to the GCE cloud provider at https://github.com/openshift/ose/blob/e0f1109f8d3b42b9f4fcfa29296f616e4c0449ab/vendor/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go#L449

The GCE cloud provider creates its own node informer from that factory.

But when the code returns to newKubeControllerContext(), the informer factory is replaced with a new one:

	// Overwrite the informers.  Since nothing accessed the existing informers that we're overwriting, they are inert.
	// TODO Remove this.  It keeps in-process memory utilization down, but we shouldn't do it.
	ret.InformerFactory = newGenericInformers(informers)

As a result, the informer that the GCE cloud provider holds is never started and never syncs.

I removed newGenericInformers(informers) just to check that this theory is correct, and GCE provisioning started working. I did not check quota, though.
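
To illustrate the mechanics, here is a minimal, self-contained sketch using client-go with a fake clientset (not the actual controller-manager wiring): an informer obtained from one SharedInformerFactory never syncs if only a different factory is started, which is exactly the situation once InformerFactory is overwritten.

// Minimal sketch (not the actual controller-manager code) of why the GCE
// provider's node informer never syncs: the informer is registered on the
// first factory, but only the replacement factory is ever started.
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	client := fake.NewSimpleClientset()
	stopCh := make(chan struct{})
	defer close(stopCh)

	// Factory A stands in for the factory that CreateControllerContext builds
	// and hands to the GCE cloud provider, which registers a node informer on it.
	factoryA := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factoryA.Core().V1().Nodes().Informer()

	// Factory B stands in for the factory that newGenericInformers() swaps in;
	// it is the only one that ever gets started.
	factoryB := informers.NewSharedInformerFactory(client, 0)
	factoryB.Start(stopCh)
	factoryB.WaitForCacheSync(stopCh)

	// Factory A was never started, so the node informer it created never
	// syncs -- the condition GetAllCurrentZones keeps failing on.
	fmt.Println("node informer synced:", nodeInformer.HasSynced()) // prints: false
}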

Comment 4 Tomas Smetana 2018-01-09 11:29:37 UTC
The GCE cloud provider uses the node informer to make sure PVs do not get provisioned in zones with no nodes. I was actually backporting the fix (the backport is not going to be used after all), and it seemed easiest to re-initialize the cloud provider in pkg/cmd/server/start/start_kube_controller_manager.go newKubeControllerContext(). This is, however, not really pretty since it might cause trouble in the future. Ideally we would not rewrite the informer factory at all...
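
For context, a simplified sketch of the guard that emits this error. This is illustrative only; the names cloud, getAllCurrentZones, and zoneLabel are paraphrased here, and the real code lives in the vendored GCE cloud provider. The provider answers zone queries from its node lister, but refuses to answer until the node informer has synced, so an informer that never starts makes every provisioning attempt fail.

// Simplified sketch of the informer-synced guard behind the error in this bug.
package gcesketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/sets"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// zoneLabel is the zone label in use on 3.9-era clusters.
const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

type cloud struct {
	nodeLister         corelisters.NodeLister
	nodeInformerSynced cache.InformerSynced
}

// getAllCurrentZones returns the set of zones that currently have nodes.
// If the informer backing nodeLister was never started (the situation in
// this bug), nodeInformerSynced never becomes true and every call fails,
// which in turn makes every dynamic provisioning attempt fail.
func (c *cloud) getAllCurrentZones() (sets.String, error) {
	if c.nodeInformerSynced == nil || !c.nodeInformerSynced() {
		return nil, fmt.Errorf("node informer is not synced when trying to GetAllCurrentZones")
	}
	nodes, err := c.nodeLister.List(labels.Everything())
	if err != nil {
		return nil, err
	}
	zones := sets.NewString()
	for _, n := range nodes {
		if z, ok := n.Labels[zoneLabel]; ok {
			zones.Insert(z)
		}
	}
	return zones, nil
}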

Comment 5 David Eads 2018-01-12 16:22:07 UTC
It's not a good idea to make use of the context while it is still being initialized.

See https://github.com/openshift/origin/pull/18097 for an option to address this.

Comment 6 Liang Xia 2018-01-18 02:41:52 UTC
Checked on v3.9.0-0.20.0; the issue described in comment 0 has been fixed.

# openshift version
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8
============================================================================
# oc get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc       Bound     pvc-61d73cca-fbf8-11e7-85aa-42010af00007   1Gi        RWO            standard       3m
============================================================================
# oc describe pvc pvc
Name:          pvc
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-61d73cca-fbf8-11e7-85aa-42010af00007
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd
Finalizers:    []
Capacity:      1Gi
Access Modes:  RWO
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  ProvisioningSucceeded  3m    persistentvolume-controller  Successfully provisioned volume pvc-61d73cca-fbf8-11e7-85aa-42010af00007 using kubernetes.io/gce-pd

