Bug 1311840

Summary: Node becomes NotReady after adding gce as cloud provider to kubeletArguments in node-config.yaml
Product: OpenShift Container Platform
Reporter: Liang Xia <lxia>
Component: Installer
Assignee: Andrew Butcher <abutcher>
Status: CLOSED CURRENTRELEASE
QA Contact: Johnny Liu <jialiu>
Severity: low
Priority: high
Version: 3.1.0
CC: abutcher, agoldste, akostadi, aos-bugs, bleanhar, dma, jhou, jialiu, jokerman, lxia, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2017-02-03 06:23:45 UTC
Type: Bug
Attachments:
atomic-openshift-node.log

Description Liang Xia 2016-02-25 07:32:42 UTC
Description of problem:
Following the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html#gce-configuring-nodes,
the node log shows "Unable to construct api.Node object" and the node becomes NotReady.

Version-Release number of selected component (if applicable):
openshift v3.1.1.906
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Launch an OSE 3.2 environment on GCE.
2. Follow the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html
3. Restart the master and node services.
4. Check the node service:
systemctl status atomic-openshift-node

Actual results:
Feb 25 02:04:45 qe-lxia-ose32-node-1.c.openshift-gce-devel.internal atomic-openshift-node[53650]: E0225 02:04:45.255470   53650 gce.go:2228] Failed to retrieve TargetInstance resource for instance: qe-lxia-ose32-node-1
Feb 25 02:04:45 qe-lxia-ose32-node-1.c.openshift-gce-devel.internal atomic-openshift-node[53650]: E0225 02:04:45.255503   53650 kubelet.go:1085] Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider: Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel/zones/us-central1-a/instances/qe-lxia-ose32-node-1?alt=json: metadata: GCE metadata "instance/service-accounts/default/token" not defined

And the node becomes NotReady.

Expected results:
The service restarts successfully, and the node becomes Ready.

Additional info:

Comment 1 Paul Morie 2016-03-01 22:57:22 UTC
Would you please provide the kubelet config being used?

Comment 2 Jianwei Hou 2016-03-10 03:27:39 UTC
@pmorie, I think @lxia intended to report a bug for the ansible installer. 

There is an option 'Allow API access to all Google Cloud services in the same project.' in the GCE console when launching an instance. However, this option is not available in our ansible installer, so the OSE cluster set up by ansible always has this problem. The workaround we have now is to launch the instances ourselves, making sure this option is always checked.

I think the aim of this bug is to track this issue.

Comment 3 Liang Xia 2016-03-15 06:59:57 UTC
The issue mentioned in comment 2 is tracked in bug 1311878.

This bug is used to track "adding gce as cloud provider to kubeletArguments in node-config.yaml makes the node become NotReady".

There is no workaround so far.

Comment 4 Liang Xia 2016-03-15 07:02:17 UTC
# cat /etc/origin/node/node-config.yaml 
allowDisabledDocker: false
apiVersion: v1
dnsDomain: cluster.local
dockerConfig:
  execHandlerName: ""
iptablesSyncPeriod: "5s"
imageConfig:
  format: registry.qe.openshift.com/openshift3/ose-${component}:${version}
  latest: false
kind: NodeConfig
kubeletArguments: 
  cloud-provider:
    - "gce"
masterKubeConfig: system:node:10.240.0.9.kubeconfig
networkPluginName: redhat/openshift-ovs-subnet
# networkConfig struct introduced in origin 1.0.6 and OSE 3.0.2 which
# deprecates networkPluginName above. The two should match.
networkConfig:
   mtu: 1410
   networkPluginName: redhat/openshift-ovs-subnet
nodeName: 10.240.0.9
podManifestConfig:
servingInfo:
  bindAddress: 0.0.0.0:10250
  certFile: server.crt
  clientCA: ca.crt
  keyFile: server.key
volumeDirectory: /var/lib/origin/openshift.local.volumes
proxyArguments:
  proxy-mode:
     - iptables

Comment 5 Liang Xia 2016-03-15 07:04:04 UTC
Raising the priority/severity since it is blocking some user stories from testing.

Comment 6 DeShuai Ma 2016-03-15 09:42:46 UTC
1. When the kubelet fetches the token from the computeMetadata server, it fails.
1) atomic-openshift-node logs:
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727464    2322 gce.go:2228] Failed to retrieve TargetInstance resource for instance: ose-32-dma-node-1
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727494    2322 kubelet.go:1085] Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider:  Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel/zones/us-central1-a/instances/ose-32-dma-node-1?alt=json: status code 403 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

2. On the GCE instance, fetching the token directly with curl also returns an error.

https://cloud.google.com/compute/docs/authentication?hl=en_US#applications

[root@ose-32-dma-node-1 ~]# curl "http://metadata/computeMetadata/v1/instance/service-accounts/default/token" \
>   -H "Metadata-Flavor: Google"
<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 403 (Forbidden)!!1</title>
  <p><b>403.</b> <ins>That’s an error.</ins>
  <p>Your client does not have permission to get URL <code>/computeMetadata/v1/instance/service-accounts/default/token</code> from this server.  <ins>That’s all we know.</ins>

Additional info:
Per the GCE docs linked above, this curl example requires "the https://www.googleapis.com/auth/compute.readonly scope or the roles/compute.instanceAdmin IAM role."
It seems that our GCE instance does not have this scope or role.
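
One way to confirm this is to inspect the scopes granted to the instance's service account (a sketch, assuming the gcloud CLI is installed and authenticated; the instance name and zone are taken from the log above):

$ gcloud compute instances describe ose-32-dma-node-1 --zone us-central1-a --format="yaml(serviceAccounts)"

If the listed scopes do not include compute or cloud-platform access, the kubelet cannot fetch a token from the metadata server.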

Comment 11 Liang Xia 2016-03-16 05:13:14 UTC
Created attachment 1136845 [details]
atomic-openshift-node.log

Comment 12 Liang Xia 2016-03-16 05:17:34 UTC
Tried again with the following steps:
1. Launch the instance via the web console with project API access enabled.
2. Prepare the gce-hosts file manually.
3. Set up the environment using playbooks/byo/config.yml from the openshift-ansible repo:
$ ansible-playbook -i ~/gce-hosts playbooks/byo/config.yml
4. Follow the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html
5. Restart the master and node services.
6. Check the node service:
$ journalctl -u atomic-openshift-node

The gce-hosts file from step 2:
$ cat ~/gce-hosts
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=root
ansible_sudo=True
deployment_type=openshift-enterprise
oreg_url=registry.qe.openshift.com/openshift3/ose-${component}:${version}
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/htpasswd'}]
openshift_use_openshift_sdn=true
os_sdn_network_plugin_name=redhat/openshift-ovs-subnet
osm_default_subdomain=aep-appxhhv.com.cn
use_cluster_metrics=true
# host group for masters
[masters]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP
[etcd]
[nodes]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP openshift_node_labels="{'region': u'us-central1', 'type': u'compute'}"

The output from step 6 is attached as atomic-openshift-node.log.

Comment 13 Liang Xia 2016-03-16 05:21:56 UTC
Forgot to mention that after step 6, checking the node status shows it is NotReady:
# oc get nodes
NAME                                        STATUS     AGE
lxia-ose32.c.openshift-gce-devel.internal   NotReady   2h

Comment 14 DeShuai Ma 2016-03-16 08:50:12 UTC
There are two issues causing the node to be NotReady:
1) Reason one: Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider
2) Reason two: Unable to delete old node: User "system:node:lxia-ose32.c.openshift-gce-devel.internal" cannot delete nodes at the cluster scope

For 1), we hit it when using our Jenkins job to install the environment on GCE; the kubelet cannot get a token from the metadata server. It has been fixed by Aleksandar in comment 7.
(If we create the instance on GCE with `Allow API access to all Google Cloud services in the same project.` manually enabled and then install with ansible, this issue does not occur.)

For 2), we need to delete the old node and then restart the node service, after which the node becomes Ready. This should be a bug and is still not fixed. See comment 11 for the detailed log.
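
A minimal sketch of that workaround for issue 2 (assuming cluster-admin credentials on the master; the node name is the one from comment 13):

# oc delete node lxia-ose32.c.openshift-gce-devel.internal
# systemctl restart atomic-openshift-node
# oc get nodes

Delete the stale node object on the master, restart the node service on the node so it re-registers, then confirm the node reports Ready.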

Comment 15 Andy Goldstein 2016-03-16 10:51:51 UTC
Issue 2 is https://github.com/kubernetes/kubernetes/issues/17731

Comment 16 Andy Goldstein 2016-03-16 16:26:39 UTC
Here is what you'll need to do to work around this for now:

1) Create your VMs
2) Place the appropriate cloud-specific configuration file on every node in the same location, if needed (e.g. /etc/aws/aws.conf)
3) Set openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args in your inventory to contain the appropriate JSON to set up the cloud-provider and cloud-config (if needed); see the inventory sketch after these steps.
4) Run ansible
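
For the GCE case in this bug, a minimal inventory sketch might look like the following (values illustrative; GCE does not need a separate cloud-config file, so only cloud-provider is set):

[OSEv3:vars]
openshift_node_kubelet_args={'cloud-provider': ['gce']}
osm_api_server_args={'cloud-provider': ['gce']}
osm_controller_args={'cloud-provider': ['gce']}

These map to the kubeletArguments, apiServerArguments, and controllerArguments blocks that end up in node-config.yaml and master-config.yaml (compare the output shown in comment 21).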

There will be work in the future to add GCE support to openshift-ansible out of the box. Marking UpcomingRelease as this won't get into 3.2.

Comment 17 Brenton Leanhardt 2016-05-18 12:17:57 UTC
I don't think we know exactly when the work will land in openshift-ansible. Since there is a workaround, I'm lowering the severity.

Comment 19 Andrew Butcher 2016-09-22 17:42:29 UTC
https://github.com/openshift/openshift-ansible/pull/2484 adds openshift_cloudprovider_kind=gce as a shortcut for adding the gce parameters to openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args.
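
With that change, the inventory from comment 16 can be reduced to a single variable (a sketch; the variable name comes from the PR above and comment 21):

[OSEv3:vars]
openshift_cloudprovider_kind=gce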

Comment 20 Scott Dodson 2017-01-27 15:51:19 UTC
This should be fixed in all builds newer than openshift-ansible-3.3.30-1. Moving to ON_QA; if it's verified, please move it to CLOSED CURRENTRELEASE.

Comment 21 Johnny Liu 2017-02-03 06:21:54 UTC
Verified this bug with openshift-ansible-3.3.61-1.git.0.27743e6.el7.noarch, and PASS.


1. Add openshift_cloudprovider_kind=gce to the inventory host file.
2. Trigger the installation.
3. After installation, check master-config.yaml:
<--snip-->
kubernetesMasterConfig:
  admissionConfig:
    pluginConfig:
      {}
  apiServerArguments:
    cloud-provider:
    - gce
  controllerArguments:
    cloud-provider:
    - gce
<--snip-->
4. After installation, check node-config.yaml:
<--snip-->
kubeletArguments:
  cloud-provider:
  - gce
<--snip-->

Comment 22 Johnny Liu 2017-02-03 06:23:45 UTC
According to comment 20, move it to CLOSED CURRENTRELEASE.