Description of problem:
The cluster host name has been changed and the installer has not been rerun. When scaling up, the node scale-up playbook uses old cached values.

Version-Release number of selected component (if applicable):
3.2+

How reproducible:
100%

Steps to Reproduce:
1. Install
2. Change master_url
3. Scale up a node

Actual results:
The health test is done using the old master_url.

Expected results:
The new values set in the hosts file are used.

Additional info:
Scale up fails on this task:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/main.yml#L128-L146

It fails because the wrong value is used for "openshift_node_master_api_url", which is set by:

"openshift_node_master_api_url": "{{ hostvars[groups.oo_first_master.0].openshift.master.api_url }}"

The issue is that setting the following in the inventory does not update the cached master facts:

openshift_master_api_url="openshift.api.url.com"

This is because this task is not rerun:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_master_facts/tasks/main.yml
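For context, a rough way to see where the stale value comes from (a sketch only; master-1.example.com is a placeholder hostname, and the grep pattern assumes the cached facts carry an api_url entry alongside cluster_hostname):

  # Inspect the locally cached master facts on the first master. The node
  # scale-up reads openshift.master.api_url from this cache rather than from
  # the updated inventory, so a stale entry here means a stale health-check URL.
  ssh master-1.example.com \
    "python -m json.tool /etc/ansible/facts.d/openshift.fact | grep -E 'api_url|cluster_hostname'"

Until the openshift_master_facts tasks are rerun against the masters, that cached api_url keeps whatever hostname was set at install time.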
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/c7d9c63088f58a3aa338981083a9fb21a8c5c7f5
Merge pull request #2555 from abutcher/node-scaleup-facts

Bug 1381335 - Scale up playbook does not rerun master-facts.
Still blocked by bug 1382887.
@Ryan, could you help check my reproduction steps? I am not sure whether my reproduction step for "Change master_url" is right, because the original cluster cannot work correctly after the lb's hostname is changed. I am still confused about step 2 in your description, "Change master_url": how do you change the master_url in your env?
@Andrew, Ryan

As mentioned in the last comment, I am not sure about my reproduction steps, so I have attached the steps I used for verification on the latest 3.3 puddle in this comment. Could you help check my verification?

Version:
atomic-openshift-utils-3.3.38-1.git.0.2637ed5.el7.noarch
openshift-ansible-3.3.38-1.git.0.2637ed5.el7.noarch

Steps:
1. Install OCP in an HA env.
   cat /etc/ansible/facts.d/openshift.fact on one master host shows:
   "cluster_hostname": "openshift-139.lab.eng.nay.redhat.com"
2. Change the lb hostname to openshift-149.lab.eng.nay.redhat.com in my env.
3. Edit the original hosts file:
   1) Change cluster_hostname:
      openshift_master_cluster_hostname=openshift-149.lab.eng.nay.redhat.com
   2) Add a new node:
      [OSEv3:children]
      nodes
      nfs
      masters
      lb
      etcd
      new_nodes
      ...
      [new_nodes]
      openshift-180.lab.eng.nay.redhat.com openshift_public_ip=10.66.147.180 openshift_ip=192.168.2.4 openshift_public_hostname=10.66.147.180 openshift_hostname=192.168.2.4
4. Run the scaleup playbook with the new hosts file:
   # ansible-playbook -i .config/openshift/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml

Result:
It still failed, with a new error about certificates:

Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Starting Atomic OpenShift Node...
-- Subject: Unit atomic-openshift-node.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has begun starting up.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com atomic-openshift-node[26301]: F1021 06:05:38.365403 26301 start_node.go:126] cannot fetch "default" cluster network: Get https://openshift-149.lab.eng.nay.redhat.com:8443/oapi/v1/clusternetworks/default: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift-139.lab.eng.nay.redhat.com, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 10.66.147.128, 172.30.0.1, 192.168.2.183, not openshift-149.lab.eng.nay.redhat.com
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has failed.
--
-- The result is failed.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 21 06:05:38 openshift-180.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service failed.

I checked that the old cached fact file on the first master has been updated to the new hostname, openshift-149.lab.eng.nay.redhat.com:

<--snip-->
"cluster_hostname": "openshift-149.lab.eng.nay.redhat.com"
<--snip-->
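The x509 error above is consistent with the master serving certificate still containing only the original hostnames. A quick way to confirm what names the master API presents behind the new lb hostname (a sketch only, assuming standard openssl tooling is available and using the hostnames from this comment; it does not cover regenerating the certificates):

  # Show the Subject Alternative Names presented on port 8443; the new
  # openshift-149.lab.eng.nay.redhat.com name is expected to be absent,
  # matching the x509 failure reported by the node service.
  echo | openssl s_client -connect openshift-149.lab.eng.nay.redhat.com:8443 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'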
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2122