Description of problem: During overcloud deployment the ceph monitor gets started only on controller-0 which delays the deployment as the ceph client is trying to reach the rest of 2 monitors(set in the conf file) which are not reachable: During the following step: overcloud-ControllerNodesPostDeployment-r42rlskbzdeb-ControllerServicesBaseDeployment_Step2-sezqtmy5qs4r on controller-0 os-collect-config shows that it's stuck with: Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/101ffc97-51c5-4ceb-bb3e-00a537697ad3.json Running the puppet command manually with debug enabled: FACTER_heat_outputs_path="/var/run/heat-config/heat-config-puppet/101ffc97-51c5-4ceb-bb3e-00a537697ad3" FACTER_fqdn="overcloud-controller-0.localdomain" FACTER_deploy_config_name="ControllerServicesBaseDeployment_Step2" puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/101ffc97-51c5-4ceb-bb3e-00a537697ad3.pp -d Shows that it's stuck at: Debug: Executing 'service ceph status mon.overcloud-controller-0' Debug: Exec[rm-keyring-overcloud-controller-0](provider=posix): Executing check '/bin/true # comment to satisfy puppet syntax requirements set -ex test ! -e /tmp/ceph-mon-keyring-overcloud-controller-0 ' Debug: Executing '/bin/true # comment to satisfy puppet syntax requirements set -ex test ! -e /tmp/ceph-mon-keyring-overcloud-controller-0 ' Debug: /Stage[main]/Ceph::Profile::Mon/Ceph::Mon[overcloud-controller-0]/Exec[rm-keyring-overcloud-controller-0]/unless: + test '!' -e /tmp/ceph-mon-keyring-overcloud-controller-0 Debug: Exec[ceph-injectkey-client.admin](provider=posix): Executing check '/bin/true # comment to satisfy puppet syntax requirements set -ex ceph --name 'mon.' --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring' auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg==' Debug: Executing '/bin/true # comment to satisfy puppet syntax requirements set -ex ceph --name 'mon.' --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring' auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg==' If I run the ceph command manually I get the following output: [root@overcloud-controller-0 heat-admin]# ceph --name 'mon.' --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring' auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg== 2016-06-21 09:53:33.873039 7f1bce6fb700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0004ef0).fault 2016-06-21 09:53:36.873332 7f1bce7fc700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000c520).fault 2016-06-21 09:53:42.874206 7f1bd4184700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0007790).fault 2016-06-21 09:53:45.874477 7f1bce7fc700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0005160).fault 2016-06-21 09:53:51.875374 7f1bce6fb700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0005ab0).fault 2016-06-21 09:53:57.875759 7f1bd4184700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000e270).fault 2016-06-21 09:54:00.876063 7f1bce6fb700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000e770).fault 2016-06-21 09:54:03.876455 7f1bd4184700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0013090).fault 10.0.0.143 and 10.0.0.140 are the IP addresses of overcloud-controller-2 and overcloud-controller-1 respectively where the ceph service is not listening as there is no ceph.conf file: [root@overcloud-controller-0 heat-admin]# facter ipaddress_vlan300 10.0.0.139 [heat-admin@overcloud-controller-1 ~]$ facter ipaddress_vlan300 10.0.0.140 [root@overcloud-controller-2 heat-admin]# facter ipaddress_vlan300 10.0.0.143 [root@overcloud-controller-2 ~]# ls -l /etc/ceph/ total 4 -rwxr-xr-x. 1 root root 92 May 23 21:30 rbdmap systemctl status ceph reports that there is no /etc/ceph/ceph.conf file Version-Release number of selected component (if applicable): How reproducible: 100% Steps to Reproduce: 1. Deploy overcloud with HA controllers 2. Wait for ControllerServicesBaseDeployment_Step2 3. SSH to controller-0 4. Check the os-collect-config journal 5. Run puppet manually with debug mode Actual results: The ceph client is trying to reach the other controllers where the ceph service is not running: ceph --name 'mon.' --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring' auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg== 2016-06-21 09:53:33.873039 7f1bce6fb700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0004ef0).fault 2016-06-21 09:53:36.873332 7f1bce7fc700 0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000c520).fault Expected results: The ceph service is configured on the other controllers as well and the client can reach them. Additional info:
Just an update with current findings, puppet seems to take much longer to complete step 2 on non-bootstrap nodes, up to 980seconds. This blocks ceph-mon on the bootstrap node too, waiting for the other two nodes to make up the cluster. Still investigating what causes the delay/timeout on the non-bootstrap nodes.
The deployment eventually completes with ceph-mon running on all nodes when puppet completes step 2. During the puppet run, the corosync logs show a 15mins gap with no lines logged occurring during the time in which pacemaker is bringing up the VIPs. This *could* be due to a timeout or a networking problem, continuing investigation.
I tested the patch and step 2 completed in approx 2 minutes: 2016-07-20 08:52:15 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerLoadBalancerDeployment_Step1-qzmfbd6kirmq]: CREATE_COMPLETE Stack CREATE completed successfully 2016-07-20 08:52:16 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerServicesBaseDeployment_Step2-y7mugjbxvuvl]: CREATE_IN_PROGRESS Stack CREATE started 2016-07-20 08:52:16 [1]: CREATE_IN_PROGRESS state changed 2016-07-20 08:52:17 [0]: CREATE_IN_PROGRESS state changed 2016-07-20 08:52:18 [2]: CREATE_IN_PROGRESS state changed 2016-07-20 08:53:05 [1]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-20 08:53:06 [1]: CREATE_COMPLETE state changed 2016-07-20 08:53:06 [2]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-20 08:53:08 [2]: CREATE_COMPLETE state changed 2016-07-20 08:54:03 [0]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-20 08:54:04 [ControllerServicesBaseDeployment_Step2]: CREATE_COMPLETE state changed 2016-07-20 08:54:04 [ControllerRingbuilderDeployment_Step3]: CREATE_IN_PROGRESS state changed 2016-07-20 08:54:04 [0]: CREATE_COMPLETE state changed 2016-07-20 08:54:04 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerServicesBaseDeployment_Step2-y7mugjbxvuvl]: CREATE_COMPLETE Stack CREATE completed successfully 2016-07-20 08:54:05 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerRingbuilderDeployment_Step3-gjazamyx7ktj]: CREATE_IN_PROGRESS Stack CREATE started
openstack-tripleo-heat-templates-2.0.0-24.el7ost.noarch Step2 completed in approx. 2 minutes: 2016-07-29 07:50:57 [overcloud-ControllerNodesPostDeployment-ys5skqitpsta-ControllerServicesBaseDeployment_Step2-pvvopijeq3ry]: CREATE_IN_PROGRESS Stack CREATE started 2016-07-29 07:50:57 [1]: CREATE_IN_PROGRESS state changed 2016-07-29 07:50:57 [0]: CREATE_IN_PROGRESS state changed 2016-07-29 07:50:58 [2]: CREATE_IN_PROGRESS state changed 2016-07-29 07:51:51 [2]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-29 07:51:51 [1]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-29 07:51:52 [1]: CREATE_COMPLETE state changed 2016-07-29 07:51:52 [2]: CREATE_COMPLETE state changed 2016-07-29 07:52:43 [0]: SIGNAL_IN_PROGRESS Signal: deployment succeeded 2016-07-29 07:52:44 [0]: CREATE_COMPLETE state changed 2016-07-29 07:52:44 [overcloud-ControllerNodesPostDeployment-ys5skqitpsta-ControllerServicesBaseDeployment_Step2-pvvopijeq3ry]: CREATE_COMPLETE Stack CREATE completed successfully
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1599.html