Description of problem:
The cluster host name has been changed and the installer has not been rerun. When scaling up, the node scale-up playbook uses old cached values.

Version-Release number of selected component (if applicable):
3.2+

How reproducible:
100%

Steps to Reproduce:
1. Install
2. Change master_url
3. Scale up a node

Actual results:
The health test is done using the old master_url.

Expected results:
The new values set in the hosts file are used.

Additional info:
Scale up fails on this task:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/main.yml#L128-L146

It fails because the wrong value is used for "openshift_node_master_api_url", which is set by:

"openshift_node_master_api_url": "{{ hostvars[groups.oo_first_master.0].openshift.master.api_url }}"

The issue is that setting the following in the inventory does not update the cached master facts:

openshift_master_api_url="openshift.api.url.com"

This is because this task is not rerun:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_master_facts/tasks/main.yml
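For context, a rough way to see where the stale value comes from (a sketch only; master-1.example.com is a placeholder hostname, and the grep pattern assumes the cached facts carry an api_url entry alongside cluster_hostname):

  # Inspect the locally cached master facts on the first master. The node
  # scale-up reads openshift.master.api_url from this cache rather than from
  # the updated inventory, so a stale entry here means a stale health-check URL.
  ssh master-1.example.com \
    "python -m json.tool /etc/ansible/facts.d/openshift.fact | grep -E 'api_url|cluster_hostname'"

Until the openshift_master_facts tasks are rerun against the masters, that cached api_url keeps whatever hostname was set at install time.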
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/c7d9c63088f58a3aa338981083a9fb21a8c5c7f5
Merge pull request #2555 from abutcher/node-scaleup-facts

Bug 1381335 - Scale up playbook does not rerun master-facts.
Still blocked by bug 1382887.
@Ryan, could you help check my reproduction steps? I am not sure whether my reproduction step for "Change master_url" is right, because the original cluster cannot work correctly after the lb's hostname is changed. I am still confused about step 2 in your description, "Change master_url": how do you change the master_url in your env?
@Andrew, Ryan

As mentioned in the last comment, I am not sure about my reproduction steps, so I have attached the steps I used for verification on the latest 3.3 puddle in this comment. Could you help check my verification?

Version:
atomic-openshift-utils-3.3.38-1.git.0.2637ed5.el7.noarch
openshift-ansible-3.3.38-1.git.0.2637ed5.el7.noarch

Steps:
1. Install OCP in an HA env.
   cat /etc/ansible/facts.d/openshift.fact on one master host shows:
   "cluster_hostname": "openshift-139.lab.eng.nay.redhat.com"
2. Change the lb hostname to openshift-149.lab.eng.nay.redhat.com in my env.
3. Edit the original hosts file:
   1) Change cluster_hostname:
      openshift_master_cluster_hostname=openshift-149.lab.eng.nay.redhat.com
   2) Add a new node:
      [OSEv3:children]
      nodes
      nfs
      masters
      lb
      etcd
      new_nodes
      ...
      [new_nodes]
      openshift-180.lab.eng.nay.redhat.com openshift_public_ip=10.66.147.180 openshift_ip=192.168.2.4 openshift_public_hostname=10.66.147.180 openshift_hostname=192.168.2.4
4. Run the scaleup playbook with the new hosts file:
   # ansible-playbook -i .config/openshift/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml

Result:
It still failed, with a new error about certificates:

Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Starting Atomic OpenShift Node...
-- Subject: Unit atomic-openshift-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has begun starting up.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com atomic-openshift-node[26301]: F1021 06:05:38.365403 26301 start_node.go:126] cannot fetch "default" cluster network: Get https://openshift-149.lab.eng.nay.redhat.com:8443/oapi/v1/clusternetworks/default: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift-139.lab.eng.nay.redhat.com, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 10.66.147.128, 172.30.0.1, 192.168.2.183, not openshift-149.lab.eng.nay.redhat.com
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has failed.
--
-- The result is failed.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service failed.

I checked that the old cached fact file on the first master has been updated to the new hostname, openshift-149.lab.eng.nay.redhat.com:

<--snip-->
"cluster_hostname": "openshift-149.lab.eng.nay.redhat.com"
<--snip-->
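The x509 error above is consistent with the master serving certificate still containing only the original hostnames. A quick way to confirm what names the master API presents behind the new lb hostname (a sketch only, assuming standard openssl tooling is available and using the hostnames from this comment; it does not cover regenerating the certificates):

  # Show the Subject Alternative Names presented on port 8443; the new
  # openshift-149.lab.eng.nay.redhat.com name is expected to be absent,
  # matching the x509 failure reported by the node service.
  echo | openssl s_client -connect openshift-149.lab.eng.nay.redhat.com:8443 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'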
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2122