1390160 – Couldn't deploy multizone OCP-3.4 env on GCE

Summary: Couldn't deploy multizone OCP-3.4 env on GCE

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Scott Dodson
QA Contact:	Gaoyun Pei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-10-31 11:21 UTC by Gaoyun Pei
Modified:	2017-03-08 18:43 UTC
CC List:	10 users
Fixed In Version:	openshift-ansible-3.4.20-1
Doc Type:	Bug Fix
Doc Text:	Previously openshift-ansible did not configure environments using GCE as multizone clusters. This prevented nodes from different zones registering against masters. Now GCE based clusters are multizone enabled allowing nodes from other zones to register themselves.
Clone Of:
Environment:
Last Closed:	2017-01-18 12:48:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0066	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.4 RPM Release Advisory	2017-01-18 17:23:26 UTC

Comment 2 Derek Carr 2016-10-31 14:39:04 UTC

The log on the attached node appears different than what is shown in the bug report.

Oct 31 06:29:44 gpei-34-gce-private-777-node-zone1-primary-1 atomic-openshift-node[54232]: I1031 06:29:44.480564   54232 kubelet_node_status.go:67] Successfully registered node gpei-34-gce-private-777-node-zone1-primary-1

According to that log message, the node actually registered.  The logs also do not appear to go beyond 06:29.  Is the proper log attached?

I am moving this to the install component to see if they can reproduce in the interim until updated logs are supplied.

Comment 6 Scott Dodson 2016-11-01 15:06:06 UTC

Additional logs have been attached. It looks like nodes that aren't in the same region are being removed. Reassigning to Kubernetes.

Comment 8 Seth Jennings 2016-11-04 15:16:04 UTC

Ok I have a lead.

By default, the GCE cloudprovider in kubernetes uses a single-zone mode.

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce.go#L2906

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce.go#L307

using the getInstanceByName/single-zone endpoint with the zone of the requesting node.
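As a rough illustration of the difference between the two modes, here is a minimal Python sketch. It is not the Kubernetes code (which is Go); the instance names, zone names, and the `find_instance` helper are all hypothetical, and the real lookup goes through the GCE API rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    name: str
    zone: str


# Hypothetical inventory: two nodes in different zones of one region.
INSTANCES = [
    Instance("node-zone1-primary-1", "us-central1-a"),
    Instance("node-zone2-primary-1", "us-central1-b"),
]


def find_instance(name: str, local_zone: str, multizone: bool = False) -> Optional[Instance]:
    """Look up an instance by name.

    In single-zone mode only the caller's own zone is searched, so an
    instance in another zone looks as if it does not exist.
    """
    for inst in INSTANCES:
        if inst.name != name:
            continue
        if multizone or inst.zone == local_zone:
            return inst
    return None


# A master in us-central1-a asking about a node in us-central1-b:
print(find_instance("node-zone2-primary-1", "us-central1-a"))
print(find_instance("node-zone2-primary-1", "us-central1-a", multizone=True))
```

If the master cannot find an instance, it presumably treats the node as gone, which would match the node removals seen in the logs.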

For the request to be made against all zones (multizone), one must set the following:

/etc/origin/master/master-config.yaml

  apiServerArguments:
    cloud-provider:
      - "gce"
    cloud-config:
      - "/etc/gce.conf"
  controllerArguments:
    cloud-provider:
      - "gce"
    cloud-config:
      - "/etc/gce.conf"

and in /etc/gce.conf

mutlizone = true

You can do this on the nodes as well, but AFAICT they don't make requests for any instance information other than their own.

I'm verifying that this works now.

Comment 9 Seth Jennings 2016-11-04 16:05:45 UTC

Confirmed.  Typo in my last comment.  s/mutlizone/multizone/ in the gce.conf.  It should look like this:

/etc/gce.conf

[Global]
multizone = true

It doesn't look like this is supported by the ansible installer.

Two ways I'm thinking for the installer to support this:

1) a new openshift_cloudprovider_gce_multizone=true variable
2) always set up the gce.conf with multizone=true if the openshift_cloudprovider_kind=gce

Here is a rough PR for option 2

https://github.com/openshift/openshift-ansible/pull/2728
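To make option 2 concrete, here is a minimal Python sketch of the idea. It is an assumption-laden illustration, not the content of the PR: the installer is Ansible, and `render_cloud_config` is a hypothetical helper that only shows "emit a multizone gce.conf whenever the provider kind is gce".

```python
import configparser
import io
from typing import Optional


def render_cloud_config(kind: str) -> Optional[str]:
    """Return gce.conf contents when kind is 'gce', otherwise None.

    Hypothetical helper; mirrors option 2 (multizone is always on for GCE).
    """
    if kind != "gce":
        return None
    parser = configparser.ConfigParser()
    parser["Global"] = {"multizone": "true"}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


print(render_cloud_config("gce"))
```

Option 1 would differ only in gating the `multizone` key on a separate inventory variable instead of setting it unconditionally.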

Comment 11 Gaoyun Pei 2016-11-09 07:36:34 UTC

Tested with openshift-ansible-3.4.18-1.git.0.ed7dac0.el7.noarch.rpm.

Once the gce cloudprovider is enabled by setting openshift_cloudprovider_kind=gce in the ansible inventory, the installation fails at the "Set cloud provider facts" task:


TASK [openshift_cloud_provider : Set cloud provider facts] *********************
Wednesday 09 November 2016  06:16:11 +0000 (0:00:02.131)       0:04:01.616 **** 
fatal: [146.148.52.78]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to 146.148.52.78 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n  File \"/tmp/ansible_xGtwL_/ansible_module_openshift_facts.py\", line 2293, in <module>\r\n    main()\r\n  File \"/tmp/ansible_xGtwL_/ansible_module_openshift_facts.py\", line 2274, in main\r\n    protected_facts_to_overwrite)\r\n  File \"/tmp/ansible_xGtwL_/ansible_module_openshift_facts.py\", line 1730, in __init__\r\n    protected_facts_to_overwrite)\r\n  File \"/tmp/ansible_xGtwL_/ansible_module_openshift_facts.py\", line 1788, in generate_facts\r\n    facts = build_controller_args(facts)\r\n  File \"/tmp/ansible_xGtwL_/ansible_module_openshift_facts.py\", line 1113, in build_controller_args\r\n    kubelet_args['cloud-config'] = [cloud_cfg_path + '/gce.conf']\r\nNameError: global name 'kubelet_args' is not defined\r\n", "msg": "MODULE FAILURE"}
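The traceback points at build_controller_args assigning to kubelet_args, a name that is not defined in that function. The sketch below is a simplified, hypothetical reconstruction of what the corrected logic might look like; it is not the actual openshift_facts.py source, and the shape of the `facts` dictionary is assumed for illustration.

```python
import posixpath


def build_controller_args(facts: dict) -> dict:
    """Simplified sketch: add GCE cloud provider args for the controllers."""
    cloud_cfg_path = posixpath.join(facts["common"]["config_base"], "cloudprovider")
    controller_args = {}
    if facts.get("cloudprovider", {}).get("kind") == "gce":
        controller_args["cloud-provider"] = ["gce"]
        # The failing line wrote to kubelet_args here, which raises
        # NameError because only controller_args exists in this scope.
        controller_args["cloud-config"] = [cloud_cfg_path + "/gce.conf"]
    if controller_args:
        facts.setdefault("master", {})["controller_args"] = controller_args
    return facts


facts = build_controller_args(
    {"common": {"config_base": "/etc/origin"}, "cloudprovider": {"kind": "gce"}}
)
print(facts["master"]["controller_args"])
```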

Comment 12 Johnny Liu 2016-11-09 11:04:57 UTC

It seems that comment 11 is a bug in the openshift-ansible installer; a new bug needs to be opened to track that issue.

Moving this bug back to ON_QA status; will retest it later using a workaround for comment 11.

Comment 13 Johnny Liu 2016-11-09 11:10:19 UTC

After going through comment 9, it seems the fix landed in the openshift-ansible installer, so ignore comment 12.

Comment 14 openshift-github-bot 2016-11-09 22:29:51 UTC

Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/5a752437e6e852c514d0698c9241be759c03c6c7
Fix typos in openshift_facts gce cloud provider

Fixes Bug 1390160
Fixes BZ1390160

Comment 15 Gaoyun Pei 2016-11-10 06:03:01 UTC

Set up an OCP 3.4 cluster with 2 masters + 3 etcd + 6 nodes using openshift-ansible-3.4.20-1.git.0.2031d1e.el7.noarch.rpm; the master/etcd/node groups are all spread across zones.

Set openshift_cloudprovider_kind=gce in the ansible inventory. After installation, all nodes are available on both masters, and the cluster also works well when atomic-openshift-master-controllers switches over between masters.

The cloud provider configuration is correct in master-config.yaml, node-config.yaml and /etc/origin/cloudprovider/gce.conf:

/etc/origin/master/master-config.yaml
...
  apiServerArguments:
    cloud-config:
    - /etc/origin/cloudprovider/gce.conf
    cloud-provider:
    - gce
  controllerArguments:
    cloud-config:
    - /etc/origin/cloudprovider/gce.conf
    cloud-provider:
    - gce

/etc/origin/node/node-config.yaml
...
kubeletArguments:
  cloud-config:
  - /etc/origin/cloudprovider/gce.conf
  cloud-provider:
  - gce

/etc/origin/cloudprovider/gce.conf
[Global]
multizone = true
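A check like the one above could be scripted. The following is a minimal sketch, assuming Python's configparser is a close enough approximation of the INI parser Kubernetes uses for this file; `is_multizone` is a hypothetical helper name.

```python
import configparser


def is_multizone(conf_text: str) -> bool:
    """Return True if the [Global] section sets multizone to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=False covers both a missing section and a missing key.
    return parser.getboolean("Global", "multizone", fallback=False)


good = "[Global]\nmultizone = true\n"
typo = "[Global]\nmutlizone = true\n"  # the misspelling from comment 8

print(is_multizone(good))
print(is_multizone(typo))
```

On a real host, the text would come from reading /etc/origin/cloudprovider/gce.conf rather than from an inline string.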

Comment 17 errata-xmlrpc 2017-01-18 12:48:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066