Description of problem:
The overcloud nodes' /etc/hosts file contains an entry pointing to 127.0.0.1 for the node's own hostname:

127.0.0.1   overcloud-controller-0.localdomain overcloud-controller-0
192.0.2.22  overcloud-controller-0.localdomain overcloud-controller-0

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-5.el7ost.noarch

How reproducible:
100%

Actual results:
The loopback entry makes corosync bind to 127.0.0.1, so the node is unable to join the cluster:

systemd[1]: Starting Corosync Cluster Engine...
corosync[14022]: [TOTEM ] Initializing transport (UDP/IP Unicast).
corosync[14022]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
corosync[14022]: [TOTEM ] The network interface [127.0.0.1] is now up.
corosync[14022]: [SERV ] Service engine loaded: corosync configuration map access [0]
corosync[14022]: [QB ] server name: cmap
corosync[14022]: [SERV ] Service engine loaded: corosync configuration service [1]
corosync[14022]: [QB ] server name: cfg
corosync[14022]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
corosync[14022]: [QB ] server name: cpg
corosync[14022]: [SERV ] Service engine loaded: corosync profile loading service [4]
corosync[14022]: [QUORUM] Using quorum provider corosync_votequorum
corosync[14022]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
corosync[14022]: [QB ] server name: votequorum
corosync[14022]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
corosync[14022]: [QB ] server name: quorum
corosync[14022]: [TOTEM ] adding new UDPU member {127.0.0.1}
corosync[14022]: [TOTEM ] adding new UDPU member {192.0.2.19}
corosync[14022]: [TOTEM ] adding new UDPU member {192.0.2.23}
corosync[14022]: [TOTEM ] A new membership (127.0.0.1:8) was formed. Members joined: 1
corosync[14022]: [QUORUM] Members[1]: 1
corosync[14022]: [MAIN ] Completed service synchronization, ready to provide service.

Expected results:
There is no entry pointing the node's hostname at 127.0.0.1.
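The failure mode follows from name resolution: corosync resolves its own ring address from the node name, and the resolver returns the first matching /etc/hosts line, which here is the loopback one (hence "The network interface [127.0.0.1] is now up" and the UDPU member {127.0.0.1}). A quick check on an affected node, with the output reproduced from the duplicate entries quoted above:

[root@overcloud-controller-0 ~]# grep overcloud-controller-0 /etc/hosts
127.0.0.1 overcloud-controller-0.localdomain overcloud-controller-0
192.0.2.22 overcloud-controller-0.localdomain overcloud-controller-0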
Deployment failed with RHEL-OSP director 9.0 puddle 2016-05-17.1. After applying the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1337537, this issue (#1337465) is blocking the deployment.
A dirty workaround: during the early stage of the deployment (right after the nodes boot), ssh to each controller node and comment out the /etc/hosts entry pointing the hostname at the loopback address; see the sketch below.
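A minimal sketch of that edit, assuming the default heat-admin user and the controller names from this deployment (both are assumptions; adjust to the environment):

# comment out the "127.0.0.1 <fqdn> <hostname>" line on each controller
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
  ssh heat-admin@$node "sudo sed -i '/^127\.0\.0\.1.*$node/ s/^/#/' /etc/hosts"
done

The pattern only matches lines that carry the node name, so the legitimate 127.0.0.1 localhost entries are left alone.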
Checking this issue further, I think it's caused by cloud-init:

[root@overcloud-controller-0 ~]# grep hosts /var/lib/cloud/instances/9c06be1a-ebd7-4280-970f-907085714f01/obj.pkl
aS'update_etc_hosts'
asS'manage_etc_hosts'

Comparing it to a node on a RHOS 8 deployment, where only update_etc_hosts appears:

[root@overcloud-controller-0 heat-admin]# grep hosts /var/lib/cloud/instances/85ba7bc7-04ab-4e83-b3e5-2dcdff7974f7/obj.pkl
aS'update_etc_hosts'

Checking the hosts template, it actually matches the format of the /etc/hosts file:

[root@overcloud-controller-0 ~]# cat /etc/cloud/templates/hosts.redhat.tmpl
## template:jinja
{#
This file /etc/cloud/templates/hosts.redhat.tmpl is only utilized
if enabled in cloud-config. Specifically, in order to enable it
you need to add the following to config:
  manage_etc_hosts: True
-#}
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.redhat.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
# The following lines are desirable for IPv4 capable hosts
127.0.0.1 {{fqdn}} {{hostname}}
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4

# The following lines are desirable for IPv6 capable hosts
::1 {{fqdn}} {{hostname}}
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

# HEAT_HOSTS_START - Do not edit manually within this section!
10.0.0.12 overcloud-novacompute-0.localdomain overcloud-novacompute-0
192.168.0.17 overcloud-novacompute-0-external
10.0.0.12 overcloud-novacompute-0-internalapi
10.0.0.140 overcloud-novacompute-0-storage
192.168.0.17 overcloud-novacompute-0-storagemgmt
10.0.1.138 overcloud-novacompute-0-tenant
172.16.17.161 overcloud-novacompute-0-management
10.0.0.15 overcloud-controller-0.localdomain overcloud-controller-0
172.16.18.28 overcloud-controller-0-external
10.0.0.15 overcloud-controller-0-internalapi
10.0.0.143 overcloud-controller-0-storage
10.0.1.14 overcloud-controller-0-storagemgmt
10.0.1.141 overcloud-controller-0-tenant
172.16.17.164 overcloud-controller-0-management
10.0.0.13 overcloud-controller-1.localdomain overcloud-controller-1
172.16.18.26 overcloud-controller-1-external
10.0.0.13 overcloud-controller-1-internalapi
10.0.0.142 overcloud-controller-1-storage
10.0.1.12 overcloud-controller-1-storagemgmt
10.0.1.139 overcloud-controller-1-tenant
172.16.17.163 overcloud-controller-1-management
10.0.0.14 overcloud-controller-2.localdomain overcloud-controller-2
172.16.18.27 overcloud-controller-2-external
10.0.0.14 overcloud-controller-2-internalapi
10.0.0.141 overcloud-controller-2-storage
10.0.1.13 overcloud-controller-2-storagemgmt
10.0.1.140 overcloud-controller-2-tenant
172.16.17.162 overcloud-controller-2-management
10.0.0.139 overcloud-cephstorage-0.localdomain overcloud-cephstorage-0
192.168.0.16 overcloud-cephstorage-0-external
192.168.0.16 overcloud-cephstorage-0-internalapi
10.0.0.139 overcloud-cephstorage-0-storage
10.0.1.11 overcloud-cephstorage-0-storagemgmt
192.168.0.16 overcloud-cephstorage-0-tenant
172.16.17.160 overcloud-cephstorage-0-management
# HEAT_HOSTS_END
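For context, cloud-init's update_etc_hosts module behaves differently depending on the value of manage_etc_hosts (my reading of the cloud-init version shipped in these images; worth double-checking against the installed package): with True it rewrites /etc/hosts from the template above on every boot, while with the value localhost it only ensures the node's FQDN resolves to 127.0.0.1. The second mode is enough to produce the bad entry; a cloud-config as small as:

#cloud-config
manage_etc_hosts: localhost

would make cloud-init maintain a "127.0.0.1 <fqdn> <hostname>" line alongside whatever Heat writes into the HEAT_HOSTS section.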
I checked further with Giulio and this is caused by /etc/cloud/cloud.cfg.d/10_etc_hosts.cfg:

manage_etc_hosts: localhost

which comes prepackaged inside the image. This can be controlled by setting DIB_CLOUD_INIT_ETC_HOSTS="" or DIB_CLOUD_INIT_ETC_HOSTS=false before building the images; support for this was also added to tripleoclient in https://review.openstack.org/#/c/222539/

We need to figure out whether the variable was set during the image build process.
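For reference, a rebuild with the variable set would look roughly like this (the image build invocation shown is the standard tripleoclient one for this release; exact arguments are an assumption and may differ per environment):

export DIB_CLOUD_INIT_ETC_HOSTS=false
openstack overcloud image build --all

With the variable set to false, the resulting image should no longer ship 10_etc_hosts.cfg with manage_etc_hosts enabled, so cloud-init never writes the loopback entry.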
DIB_CLOUD_INIT_ETC_HOSTS is not set as part of the image building process (kickstart, ...).
A test puddle (2016-05-23.1) with an image built with DIB_CLOUD_INIT_ETC_HOSTS=false has been provided to Omri Hochman yesterday.
Omri, did you get any success with the images I provided you?
I've managed to complete the deployment with the latest images and the problem didn't reproduce. We need pm_ack so we can switch this BZ to ON_QA and then to Verified.
Unable to reproduce with the latest images from the new RHEL-OSP director 9.0 puddle (2016-05-26.1).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1598.html