Description of problem:
On a GCE environment, create a PD and a pod that uses a GCE persistent volume. The pod can't run because the GCE PD could not be attached. The node log shows:

Error attaching PD "mypd-1": googleapi: Error 400: Invalid value 'lxia-ose32.c.openshift-gce-devel.internal'. Values must match the following regular expression: '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?', invalidParameter

It seems the instanceID gets a wrong value, so the Google API can't parse it.

Version-Release number of selected component (if applicable):
openshift v3.2.0.3
kubernetes v1.2.0-origin-41-g91d3e75
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Set up a GCE environment with ansible, configure the cloud provider on master and node, and restart the services.
2. Create a PD and a pod:
# gcloud compute disks create --size=500GB --zone=us-central1-a my-data-disk
# vi test-pd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: gcr.io/google_containers/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    # This GCE PD must already exist.
    gcePersistentDisk:
      pdName: my-data-disk
      fsType: ext4
# oc create -f test-pd.yaml
3. Check the pod.

Actual results:
3. [root@ose-32-dma-master us]# oc describe pod test-pd
Name:           test-pd
Namespace:      qwang1
Image(s):       gcr.io/google_containers/test-webserver
Node:           ose-32-dma-node-1.c.openshift-gce-devel.internal/10.240.0.11
Start Time:     Wed, 16 Mar 2016 05:39:04 -0400
Labels:         <none>
Status:         Pending
Reason:
Message:
IP:
Controllers:    <none>
Containers:
  test-container:
    Container ID:
    Image:              gcr.io/google_containers/test-webserver
    Image ID:
    Port:
    QoS Tier:
      cpu:              BestEffort
      memory:           BestEffort
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Environment Variables:
Conditions:
  Type          Status
  Ready         False
Volumes:
  test-volume:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     my-data-disk
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
  default-token-v216z:
    Type:       Secret (a secret that should populate this volume)
    SecretName: default-token-v216z
Events:
  FirstSeen  LastSeen  Count  From                                                        SubobjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                                        -------------  ------   ------       -------
  4m         4m        1      {default-scheduler }                                                       Normal   Scheduled    Successfully assigned test-pd to ose-32-dma-node-1.c.openshift-gce-devel.internal
  3m         21s       4      {kubelet ose-32-dma-node-1.c.openshift-gce-devel.internal}                 Warning  FailedMount  Unable to mount volumes for pod "test-pd_qwang1(ec2af27a-eb5a-11e5-8d6e-42010af00009)": Could not attach GCE PD "my-data-disk". Timeout waiting for mount paths to be created.
  3m         21s       4      {kubelet ose-32-dma-node-1.c.openshift-gce-devel.internal}                 Warning  FailedSync   Error syncing pod, skipping: Could not attach GCE PD "my-data-disk". Timeout waiting for mount paths to be created.

[root@ose-32-dma-node-1 ~]# journalctl -f -u atomic-openshift-node
Mar 16 05:39:46 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[19926]: W0316 05:39:46.210981 19926 gce_util.go:176] Retrying attach for GCE PD "my-data-disk" (retry count=8).
Mar 16 05:39:46 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[19926]: E0316 05:39:46.366276 19926 gce_util.go:180] Error attaching PD "my-data-disk": googleapi: Error 400: Invalid value 'ose-32-dma-node-1.c.openshift-gce-devel.internal'. Values must match the following regular expression: '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?', invalidParameter

Expected results:
The GCE PD should be attached.

Additional info:
Created attachment 1137724 [details] node configuration
From the node config.yaml:
> nodeName: ose-32-dma-node-1.c.openshift-gce-devel.internal

This tells the GCE cloud provider to use "ose-32-dma-node-1.c.openshift-gce-devel.internal" as the GCE instance name. GCE does not allow dots in instance names, hence the error "Values must match the following regular expression ...".

IMO we should fix the ansible playbook to put the GCE instance name here; it may be quite different from the hostname.
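For illustration, a minimal standalone Go sketch (not kubelet code) showing why the FQDN fails the instance-name pattern quoted in the error message, while the bare instance name passes:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Instance-name pattern taken from the GCE error message in this bug.
	valid := regexp.MustCompile(`^[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?$`)

	for _, name := range []string{
		"ose-32-dma-node-1", // bare instance name: matches
		"ose-32-dma-node-1.c.openshift-gce-devel.internal", // FQDN from nodeName: dots are rejected
	} {
		fmt.Printf("%-55s %v\n", name, valid.MatchString(name))
	}
}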
Umm, reading the OpenShift sources:

// NodeName is the value used to identify this particular node in the cluster. If possible, this should be your fully qualified hostname.
// If you're describing a set of static nodes to the master, this value must match one of the values in the list
NodeName string

And in node_config.go:

cfg.NodeName = options.NodeName

That's KubeletConfig.NodeName. It's not only the "value used to identify this particular node in the cluster"; it's also used to identify the instance in the external cloud and thus must be equal to the GCE or OpenStack instance name.
Jan, does this behave correctly when NodeName is not defined in the config file? If so, we can probably update the config to not set nodeName when the user has not overridden the value of openshift_hostname for the node.
If nodeName isn't defined in the node config file, the node service won't start. Commenting out "nodeName: ose-32-dma-node-1.c.openshift-gce-devel.internal" or setting "nodeName: null" and then restarting the atomic-openshift-node service gives:

Mar 22 01:28:45 ose-32-dma-node-1.c.openshift-gce-devel.internal systemd[1]: Starting Atomic OpenShift Node...
Mar 22 01:28:45 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[5699]: Invalid NodeConfig /etc/origin/node/node-config.yaml
Mar 22 01:28:45 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[5699]: nodeName: Required value

Did I get your point?
Hmm, I don't think so. Perhaps I need to change node_config.go. Please ignore the above comments.
Jason, when NodeName is not defined in the kubelet config, kcfg.Hostname is used instead in most cloud operations:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L584

To make things even more complicated, the GCE PD volume plugin ignores any NodeName and uses the machine hostname (or hostname-override) as the instance name:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gce_pd/gce_util.go#L187

All this is very confusing and probably buggy; every volume plugin and cloud provider works slightly differently.

Conclusion: with the current code, the GCE instance name must be the same as the hostname. Either the ansible scripts or the GCE volume plugin (or both) must be fixed.
Jan, I think ideally we would want all of the cloud providers to ignore the value of NodeName altogether. It is possible to get the data needed to query the API directly from the host metadata, rather than relying on a user-provided setting matching the value queried through the cloud provider API. That said, we can definitely work around it in ansible until we can get the upstream code decoupled from NodeName.
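To illustrate the idea (this is not the actual Kubernetes cloud-provider code, just a minimal Go sketch): the instance name can be read from the standard GCE metadata server instead of from a user-supplied nodeName. The endpoint and required header below are the standard GCE metadata API; the function name is made up for the example.

package main

import (
	"fmt"
	"io"
	"net/http"
)

// gceInstanceName asks the GCE metadata server for this machine's instance
// name. It only works when run on a GCE instance.
func gceInstanceName() (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/name", nil)
	if err != nil {
		return "", err
	}
	// The metadata server requires this header on all requests.
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("metadata server returned %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	name, err := gceInstanceName()
	if err != nil {
		fmt.Println("not on GCE or metadata unavailable:", err)
		return
	}
	fmt.Println("instance name:", name)
}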
(In reply to Jason DeTiberus from comment #8)
> I think ideally we would want all of the cloud providers to ignore the
> value of NodeName altogether.

Agreed, the cloud providers need some refactoring in Kubernetes 1.3. What's the right BZ component & team to take care of this?
Good question on component and team. I suspect it might be best to use the Kubernetes component. Andy, would your team be the proper team to take a look at fixes to the cloud providers to decouple the NodeName setting from the api lookup? Jan, We'll probably want to clone this bug for the upstream changes and keep this current bug around for implementing a workaround for 3.2.
Yes, Origin product, Kubernetes component for cloud provider code in Kube for post 3.2 work.
Could someone look into this? It is blocking storage testing.
I can confirm that using the instance name rather than the FQDN of the instance for nodeName works around this issue.

We had a similar situation for the OpenStack cloud provider: https://bugzilla.redhat.com/show_bug.cgi?id=1321964

I opened a Trello card to get the cloud provider code detached from nodeName in the OpenStack case: https://trello.com/c/dyHpMQw9/335-as-a-user-i-want-to-the-installer-to-configure-for-the-openstack-cloudprovider-without-having-to-manually-edit-the-node-configs

I believe that setting openshift_hostname=<instance name> in the installer should avoid the need to manually edit node-config.yaml after installation (see the example below). But, yes, we need to get all of the cloud provider code away from using nodeName.
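For example, a hypothetical openshift-ansible inventory entry (the host and variable value below are taken from this report and are only illustrative) that pins openshift_hostname to the bare GCE instance name, so the generated node-config.yaml ends up with a nodeName that GCE accepts:

[nodes]
ose-32-dma-node-1.c.openshift-gce-devel.internal openshift_hostname=ose-32-dma-node-1

# resulting /etc/origin/node/node-config.yaml (relevant line only)
nodeName: ose-32-dma-node-1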
Changing the semantics of cloud provider and nodename is a very involved task -- not one that can be done for 3.2. Seth's workaround is the solution for now.
We can get the workaround working, so I removed the testblocker keyword and lowered the Severity/Priority. Assigning the bug back since we still need the issue fixed, either in code or in the ansible playbook.
*** Bug 1339086 has been marked as a duplicate of this bug. ***
This won't make 3.3 and needs deeper redesign work with the Kube community. We'll cover that work in Trello.
The current feature is working as designed: the node name and the instance name must match. If we want to remove this constraint, and we have a compelling reason for removing it, please open a new RFE that captures that detail. Until then, closing this bug as working as designed.
Just for a reference, bug 1367201 is the RFE to decouple node and machine name.