Bug 1311840 - Node becomes NotReady after adding gce as cloud provider to kubeletArguments in node-config.yaml
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.1.0
Priority: high  Severity: low
Assigned To: Andrew Butcher
QA Contact: Johnny Liu
Reported: 2016-02-25 02:32 EST by Liang Xia
Modified: 2017-03-08 EST
CC: 11 users

Doc Type: Bug Fix
Last Closed: 2017-02-03 01:23:45 EST
Type: Bug

Attachments
atomic-openshift-node.log (49.72 KB, text/x-vhdl)
2016-03-16 01:13 EDT, Liang Xia

Description Liang Xia 2016-02-25 02:32:42 EST
Description of problem:
Following the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html#gce-configuring-nodes,
the log shows "unable to construct api.Node object" and the node becomes NotReady.

Version-Release number of selected component (if applicable):
openshift v3.1.1.906
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Launch an OSE 3.2 environment on GCE.
2. Follow the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html
3. Restart the master and node services.
4. Check the node service:
# systemctl status atomic-openshift-node

Actual results:
Feb 25 02:04:45 qe-lxia-ose32-node-1.c.openshift-gce-devel.internal atomic-openshift-node[53650]: E0225 02:04:45.255470   53650 gce.go:2228] Failed to retrieve TargetInstance resource for instance: qe-lxia-ose32-node-1
Feb 25 02:04:45 qe-lxia-ose32-node-1.c.openshift-gce-devel.internal atomic-openshift-node[53650]: E0225 02:04:45.255503   53650 kubelet.go:1085] Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider: Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel/zones/us-central1-a/instances/qe-lxia-ose32-node-1?alt=json: metadata: GCE metadata "instance/service-accounts/default/token" not defined

And the node becomes NotReady.

Expected results:
The service restarts successfully, and the node becomes Ready.

Additional info:
Comment 1 Paul Morie 2016-03-01 17:57:22 EST
Would you please provide the kubelet config being used?
Comment 2 Jianwei Hou 2016-03-09 22:27:39 EST
@pmorie, I think @lxia intended to report a bug against the Ansible installer.

There is an option 'Allow API access to all Google Cloud services in the same project.' in the GCE console when launching an instance. However, this option is not available in our Ansible installer, so OSE clusters set up by Ansible always have this problem. The workaround we have now is to start the instances ourselves, making sure this option is always checked.

I think the aim of this bug is to track this issue.
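For reference, a rough CLI equivalent of checking that console option when creating the instance (instance name and zone are placeholders, and mapping the checkbox to the cloud-platform scope is an assumption):

$ gcloud compute instances create qe-node-1 \
    --zone us-central1-a \
    --scopes https://www.googleapis.com/auth/cloud-platform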
Comment 3 Liang Xia 2016-03-15 02:59:57 EDT
The issue mentioned in #comment 2 is tracked in bug 1311878

This bug is used to track "adding gce as the cloud provider to kubeletArguments in node-config.yaml makes the node become NotReady".

There is no workaround so far.
Comment 4 Liang Xia 2016-03-15 03:02:17 EDT
# cat /etc/origin/node/node-config.yaml 
allowDisabledDocker: false
apiVersion: v1
dnsDomain: cluster.local
dockerConfig:
  execHandlerName: ""
iptablesSyncPeriod: "5s"
imageConfig:
  format: registry.qe.openshift.com/openshift3/ose-${component}:${version}
  latest: false
kind: NodeConfig
kubeletArguments: 
  cloud-provider:
    - "gce"
masterKubeConfig: system:node:10.240.0.9.kubeconfig
networkPluginName: redhat/openshift-ovs-subnet
# networkConfig struct introduced in origin 1.0.6 and OSE 3.0.2 which
# deprecates networkPluginName above. The two should match.
networkConfig:
   mtu: 1410
   networkPluginName: redhat/openshift-ovs-subnet
nodeName: 10.240.0.9
podManifestConfig:
servingInfo:
  bindAddress: 0.0.0.0:10250
  certFile: server.crt
  clientCA: ca.crt
  keyFile: server.key
volumeDirectory: /var/lib/origin/openshift.local.volumes
proxyArguments:
  proxy-mode:
     - iptables
Comment 5 Liang Xia 2016-03-15 03:04:04 EDT
Raising the priority/severity since it is blocking some user stories from testing.
Comment 6 DeShuai Ma 2016-03-15 05:42:46 EDT
1. When the kubelet gets a token from the computeMetadata server, it fails.
1) atomic-openshift-node logs:
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727464    2322 gce.go:2228] Failed to retrieve TargetInstance resource for instance: ose-32-dma-node-1
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727494    2322 kubelet.go:1085] Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider:  Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel/zones/us-central1-a/instances/ose-32-dma-node-1?alt=json: status code 403 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

2. On the GCE instance, getting the token directly with curl still returns an error.

https://cloud.google.com/compute/docs/authentication?hl=en_US#applications

[root@ose-32-dma-node-1 ~]# curl "http://metadata/computeMetadata/v1/instance/service-accounts/default/token" \
>   -H "Metadata-Flavor: Google"
<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 403 (Forbidden)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>403.</b> <ins>That’s an error.</ins>
  <p>Your client does not have permission to get URL <code>/computeMetadata/v1/instance/service-accounts/default/token</code> from this server.  <ins>That’s all we know.</ins>

Additional info:
In this curl example, "the instance requires the https://www.googleapis.com/auth/compute.readonly scope or the roles/compute.instanceAdmin IAM role."
It seems that our GCE instance doesn't have this scope or role.
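One way to check which scopes the instance actually has is the service-accounts scopes metadata endpoint (a sketch using the standard GCE metadata path; output varies per instance):

# curl -H "Metadata-Flavor: Google" \
    "http://metadata/computeMetadata/v1/instance/service-accounts/default/scopes"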
Comment 11 Liang Xia 2016-03-16 01:13 EDT
Created attachment 1136845 [details]
atomic-openshift-node.log
Comment 12 Liang Xia 2016-03-16 01:17:34 EDT
Tried again with the following steps:
1. Launch instance via web console with project access enabled.
2. Prepare gce-hosts file manually.
3. Set up the environment using playbooks/byo/config.yml in the openshift-ansible repo:
$ ansible-playbook -i ~/gce-hosts playbooks/byo/config.yml
4. Follow doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html
5. Restart the master and node services.
6. Check the node service:
$ journalctl -u atomic-openshift-node

The gce-hosts file from step 2:
$ cat ~/gce-hosts
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=root
ansible_sudo=True
deployment_type=openshift-enterprise
oreg_url=registry.qe.openshift.com/openshift3/ose-${component}:${version}
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/htpasswd'}]
openshift_use_openshift_sdn=true
os_sdn_network_plugin_name=redhat/openshift-ovs-subnet
osm_default_subdomain=aep-appxhhv.com.cn
use_cluster_metrics=true
# host group for masters
[masters]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP
[etcd]
[nodes]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP openshift_node_labels="{'region': u'us-central1', 'type': u'compute'}"

The output from step 6 is attached as atomic-openshift-node.log.
Comment 13 Liang Xia 2016-03-16 01:21:56 EDT
Forgot to mention that after step 6, checking the node status shows NotReady.
# oc get nodes
NAME                                        STATUS     AGE
lxia-ose32.c.openshift-gce-devel.internal   NotReady   2h
Comment 14 DeShuai Ma 2016-03-16 04:50:12 EDT
There are two issues behind the node being NotReady:
1) not ready reason one: Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider
2) not ready reason two: Unable to delete old node: User "system:node:lxia-ose32.c.openshift-gce-devel.internal" cannot delete nodes at the cluster scope

For 1), we hit it when using our Jenkins job to install the environment on GCE; the kubelet can't get a token from the metadata server. It has been fixed by Aleksandar in comment 7.
(If we create the instance on GCE with `Allow API access to all Google Cloud services in the same project.` manually enabled and then install with Ansible, this issue does not occur.)

For 2), we need to delete the old node and then restart the node service, after which the node can become Ready. This should be a bug and is still not fixed. For detailed logs see comment 11.
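A rough sketch of that issue-2 workaround, using the node name from comment 13 (run the delete from a host with cluster-admin access, then restart the node service on the affected node):

# oc delete node lxia-ose32.c.openshift-gce-devel.internal
# systemctl restart atomic-openshift-node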
Comment 15 Andy Goldstein 2016-03-16 06:51:51 EDT
Issue 2 is https://github.com/kubernetes/kubernetes/issues/17731
Comment 16 Andy Goldstein 2016-03-16 12:26:39 EDT
Here is what you'll need to do to work around this for now:

1) Create your VMs
2) Place the appropriate cloud-specific configuration file on every node in the same location, if needed (e.g. /etc/aws/aws.conf)
3) Set openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args in your inventory to contain the appropriate JSON to set up the cloud-provider and cloud-config (if needed); a sketch of what this might look like follows the list.
4) Run ansible
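
A minimal sketch of step 3 for GCE, assuming these variables take the usual dict-of-lists structure that ends up in kubeletArguments/apiServerArguments and that no cloud-config file is needed for GCE (values are illustrative):

[OSEv3:vars]
# Illustrative inventory entries; adjust to your environment.
openshift_node_kubelet_args={'cloud-provider': ['gce']}
osm_api_server_args={'cloud-provider': ['gce']}
osm_controller_args={'cloud-provider': ['gce']}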

There will be work in the future to add GCE support to openshift-ansible out of the box. Marking UpcomingRelease as this won't get into 3.2.
Comment 17 Brenton Leanhardt 2016-05-18 08:17:57 EDT
I don't think we know exactly when the work will land in openshift-ansible. Since there is a workaround, I'm lowering the severity.
Comment 19 Andrew Butcher 2016-09-22 13:42:29 EDT
https://github.com/openshift/openshift-ansible/pull/2484 adds openshift_cloudprovider_kind=gce as a shortcut for adding the gce parameters to openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args.
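With that change, configuring GCE in the inventory reduces to a single variable (sketch; the variable name comes from the PR above, and placing it under [OSEv3:vars] follows the usual inventory layout):

[OSEv3:vars]
# Replaces spelling out the three *_args variables by hand.
openshift_cloudprovider_kind=gce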
Comment 20 Scott Dodson 2017-01-27 10:51:19 EST
This should be fixed in all builds newer than openshift-ansible-3.3.30-1. Moving to ON_QA; if it's verified, please move it to CLOSED CURRENTRELEASE.
Comment 21 Johnny Liu 2017-02-03 01:21:54 EST
Verified this bug with openshift-ansible-3.3.61-1.git.0.27743e6.el7.noarch, and it passes.


1. Add openshift_cloudprovider_kind=gce to the inventory host file.
2. Trigger the installation.
3. After installation, check master-config.yaml:
<--snip-->
kubernetesMasterConfig:
  admissionConfig:
    pluginConfig:
      {}
  apiServerArguments:
    cloud-provider:
    - gce
  controllerArguments:
    cloud-provider:
    - gce
<--snip-->
4. After installation, check node-config.yaml
<--snip-->
kubeletArguments:
  cloud-provider:
  - gce
<--snip-->
Comment 22 Johnny Liu 2017-02-03 01:23:45 EST
According to comment 20, moving it to CLOSED CURRENTRELEASE.
