Bug 1311840
Summary: Node becomes NotReady after adding gce as cloud provider to kubeletArguments in node-config.yaml

Product: OpenShift Container Platform
Component: Installer
Version: 3.1.0
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: high
Reporter: Liang Xia <lxia>
Assignee: Andrew Butcher <abutcher>
QA Contact: Johnny Liu <jialiu>
CC: abutcher, agoldste, akostadi, aos-bugs, bleanhar, dma, jhou, jialiu, jokerman, lxia, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2017-02-03 06:23:45 UTC
Description
Liang Xia
2016-02-25 07:32:42 UTC
Would you please provide the kubelet config being used?

@pmorie, I think @lxia intended to report a bug for the ansible installer. There is an option "Allow API access to all Google Cloud services in the same project." in the GCE console when launching an instance. However, this option is not available in our ansible installer, so OSE clusters set up by ansible always have this problem. The workaround we have now is to start the instances ourselves, making sure this option is always checked. I think the aim of this bug is to track that issue.

The issue mentioned in comment 2 is tracked in bug 1311878. This bug is used to track "adding gce as cloud provider to kubeletArguments in node-config.yaml makes the node become NotReady", and there is no workaround so far.

```yaml
# cat /etc/origin/node/node-config.yaml
allowDisabledDocker: false
apiVersion: v1
dnsDomain: cluster.local
dockerConfig:
  execHandlerName: ""
iptablesSyncPeriod: "5s"
imageConfig:
  format: registry.qe.openshift.com/openshift3/ose-${component}:${version}
  latest: false
kind: NodeConfig
kubeletArguments:
  cloud-provider:
  - "gce"
masterKubeConfig: system:node:10.240.0.9.kubeconfig
networkPluginName: redhat/openshift-ovs-subnet
# networkConfig struct introduced in origin 1.0.6 and OSE 3.0.2 which
# deprecates networkPluginName above. The two should match.
networkConfig:
  mtu: 1410
  networkPluginName: redhat/openshift-ovs-subnet
nodeName: 10.240.0.9
podManifestConfig:
servingInfo:
  bindAddress: 0.0.0.0:10250
  certFile: server.crt
  clientCA: ca.crt
  keyFile: server.key
volumeDirectory: /var/lib/origin/openshift.local.volumes
proxyArguments:
  proxy-mode:
  - iptables
```

Raising the priority/severity since it is blocking some user stories from testing.

1. When the kubelet gets a token from the computeMetadata server, it fails.
1) atomic-openshift-node logs:

```
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727464 2322 gce.go:2228] Failed to retrieve TargetInstance resource for instance: ose-32-dma-node-1
Mar 15 05:29:04 ose-32-dma-node-1.c.openshift-gce-devel.internal atomic-openshift-node[2322]: E0315 05:29:04.727494 2322 kubelet.go:1085] Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider: Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel/zones/us-central1-a/instances/ose-32-dma-node-1?alt=json: status code 403 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
```

2. On the gce instance, fetching the token directly with curl (per https://cloud.google.com/compute/docs/authentication?hl=en_US#applications) also errors:

```
[root@ose-32-dma-node-1 ~]# curl "http://metadata/computeMetadata/v1/instance/service-accounts/default/token" \
>     -H "Metadata-Flavor: Google"
```

The response is Google's "Error 403 (Forbidden)" page: "Your client does not have permission to get URL /computeMetadata/v1/instance/service-accounts/default/token from this server. That's all we know."

Additional info: in this curl example, "the instance requires the https://www.googleapis.com/auth/compute.readonly scope or the roles/compute.instanceAdmin IAM role." It seems that our gce instance doesn't have this scope or role.

Created attachment 1136845 [details]
atomic-openshift-node.log
Tried again with the following steps:
1. Launch the instance via the web console with project access enabled.
2. Prepare the gce-hosts file manually.
3. Set up the environment using playbooks/byo/config.yml in the openshift-ansible repo:
   $ ansible-playbook -i ~/gce-hosts playbooks/byo/config.yml
4. Follow the doc https://docs.openshift.com/enterprise/3.1/install_config/configuring_gce.html
5. Restart the master and node services.
6. Check the node service:
   $ journalctl -u atomic-openshift-node

The gce-hosts file in step 2:

```
$ cat ~/gce-hosts
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user=root
ansible_sudo=True
deployment_type=openshift-enterprise
oreg_url=registry.qe.openshift.com/openshift3/ose-${component}:${version}
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/htpasswd'}]
openshift_use_openshift_sdn=true
os_sdn_network_plugin_name=redhat/openshift-ovs-subnet
osm_default_subdomain=aep-appxhhv.com.cn
use_cluster_metrics=true

# host group for masters
[masters]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP

[etcd]

[nodes]
PUBLIC_IP openshift_public_hostname=PUBLIC_IP openshift_node_labels="{'region': u'us-central1', 'type': u'compute'}"
```

The output in step 6 is attached as atomic-openshift-node.log.

Forgot to mention that, after step 6, checking the node status shows it is NotReady.
```
# oc get nodes
NAME                                        STATUS     AGE
lxia-ose32.c.openshift-gce-devel.internal   NotReady   2h
```

There are two issues behind the node being NotReady:
1) Reason one: "Unable to construct api.Node object for kubelet: failed to get instance ID from cloud provider"
2) Reason two: "Unable to delete old node: User \"system:node:lxia-ose32.c.openshift-gce-devel.internal\" cannot delete nodes at the cluster scope"

For 1), we hit it when using our jenkins job to install the env on gce; the kubelet can't get a token from the metadata server. It has been fixed by Aleksandar in comment 7. (If we create the instance on gce and manually enable "Allow API access to all Google Cloud services in the same project.", then installing with ansible does not hit this issue.)

For 2), we need to delete the old node and then restart the node service, after which the node becomes ready. This should be a bug and is still not fixed. For detailed logs see comment 11.

Here is what you'll need to do to work around this for now:
1) Create your VMs.
2) Place the appropriate cloud-specific configuration file on every node in the same location, if needed (e.g. /etc/aws/aws.conf).
3) Set openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args in your inventory to contain the appropriate JSON to set up the cloud-provider and cloud-config (if needed).
4) Run ansible.

There will be work in the future to add GCE support to openshift-ansible out of the box. Marking UpcomingRelease as this won't get into 3.2. I don't think we know exactly when the work will land in openshift-ansible. Since there is a workaround, I'm lowering the severity.

https://github.com/openshift/openshift-ansible/pull/2484 adds openshift_cloudprovider_kind=gce as a shortcut for adding the gce parameters to openshift_node_kubelet_args, osm_api_server_args, and osm_controller_args. This should be fixed in all builds newer than openshift-ansible-3.3.30-1.
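For step 3 of the workaround, the inventory variables might look roughly like the following. This is a hypothetical sketch using the AWS path mentioned in step 2; the actual provider name and cloud-config path depend on your cloud, and cloud-config can be dropped where it is not needed:

```
# Hypothetical inventory fragment (sketch only; values are illustrative):
# pass cloud-provider/cloud-config through to the kubelet, API server,
# and controller arguments.
[OSEv3:vars]
openshift_node_kubelet_args={'cloud-provider': ['aws'], 'cloud-config': ['/etc/aws/aws.conf']}
osm_api_server_args={'cloud-provider': ['aws'], 'cloud-config': ['/etc/aws/aws.conf']}
osm_controller_args={'cloud-provider': ['aws'], 'cloud-config': ['/etc/aws/aws.conf']}
```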
Moving ON_QA; if it's verified, please move it to CLOSED CURRENTRELEASE.

Verified this bug with openshift-ansible-3.3.61-1.git.0.27743e6.el7.noarch, and it PASSED.
1. Add openshift_cloudprovider_kind=gce to the inventory host file.
2. Trigger the installation.
3. After installation, check master-config.yaml:

```
<--snip-->
kubernetesMasterConfig:
  admissionConfig:
    pluginConfig: {}
  apiServerArguments:
    cloud-provider:
    - gce
  controllerArguments:
    cloud-provider:
    - gce
<--snip-->
```

4. After installation, check node-config.yaml:

```
<--snip-->
kubeletArguments:
  cloud-provider:
  - gce
<--snip-->
```

According to comment 20, moving it to CLOSED CURRENTRELEASE.
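The checks in steps 3-4 can be scripted. The following is a minimal, self-contained sketch that greps a node-config fragment for the cloud-provider value; it writes a sample heredoc file, whereas on a real node you would grep /etc/origin/node/node-config.yaml directly:

```shell
# Write a sample node-config fragment (stand-in for /etc/origin/node/node-config.yaml)
cat > /tmp/node-config-sample.yaml <<'EOF'
kubeletArguments:
  cloud-provider:
  - gce
EOF

# Extract the value listed under the cloud-provider key
grep -A1 'cloud-provider:' /tmp/node-config-sample.yaml | tail -n1 | tr -d ' -'
# prints "gce"
```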