Bug 1349456

Summary: During overcloud deployment the ceph monitor gets started only on one of the controllers, which causes the deployment to get stuck for some time
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 9.0 (Mitaka)
CC: athomas, dbecker, jjoyce, mburns, morazi, ohochman, rhel-osp-director-maint, tvignaud, yrabl
Target Milestone: ga
Keywords: Regression
Target Release: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-2.0.0-22.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1348489
Environment:
Last Closed: 2016-08-11 11:33:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1348489
Bug Blocks:

Description Marius Cornea 2016-06-23 13:31:46 UTC
Description of problem:
During overcloud deployment the ceph monitor gets started only on controller-0, which delays the deployment because the ceph client keeps trying to reach the other two monitors (set in the conf file), which are not reachable:

During the following step:
overcloud-ControllerNodesPostDeployment-r42rlskbzdeb-ControllerServicesBaseDeployment_Step2-sezqtmy5qs4r

on controller-0 os-collect-config shows that it's stuck with:
Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/101ffc97-51c5-4ceb-bb3e-00a537697ad3.json

Running the puppet command manually with debug enabled:
FACTER_heat_outputs_path="/var/run/heat-config/heat-config-puppet/101ffc97-51c5-4ceb-bb3e-00a537697ad3"  FACTER_fqdn="overcloud-controller-0.localdomain"  FACTER_deploy_config_name="ControllerServicesBaseDeployment_Step2"  puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/101ffc97-51c5-4ceb-bb3e-00a537697ad3.pp -d


Shows that it's stuck at:

Debug: Executing 'service ceph status mon.overcloud-controller-0'
Debug: Exec[rm-keyring-overcloud-controller-0](provider=posix): Executing check '/bin/true # comment to satisfy puppet syntax requirements
set -ex
test ! -e /tmp/ceph-mon-keyring-overcloud-controller-0
'
Debug: Executing '/bin/true # comment to satisfy puppet syntax requirements
set -ex
test ! -e /tmp/ceph-mon-keyring-overcloud-controller-0
'
Debug: /Stage[main]/Ceph::Profile::Mon/Ceph::Mon[overcloud-controller-0]/Exec[rm-keyring-overcloud-controller-0]/unless: + test '!' -e /tmp/ceph-mon-keyring-overcloud-controller-0
Debug: Exec[ceph-injectkey-client.admin](provider=posix): Executing check '/bin/true # comment to satisfy puppet syntax requirements
set -ex
ceph   --name 'mon.'   --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring'  auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg=='
Debug: Executing '/bin/true # comment to satisfy puppet syntax requirements
set -ex
ceph   --name 'mon.'   --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring'  auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg=='


If I run the ceph command manually I get the following output:
[root@overcloud-controller-0 heat-admin]# ceph   --name 'mon.'   --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring'  auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg==
2016-06-21 09:53:33.873039 7f1bce6fb700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0004ef0).fault
2016-06-21 09:53:36.873332 7f1bce7fc700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000c520).fault
2016-06-21 09:53:42.874206 7f1bd4184700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0007790).fault
2016-06-21 09:53:45.874477 7f1bce7fc700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0005160).fault
2016-06-21 09:53:51.875374 7f1bce6fb700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0005ab0).fault
2016-06-21 09:53:57.875759 7f1bd4184700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000e270).fault
2016-06-21 09:54:00.876063 7f1bce6fb700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000e770).fault
2016-06-21 09:54:03.876455 7f1bd4184700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0013090).fault

10.0.0.143 and 10.0.0.140 are the IP addresses of overcloud-controller-2 and overcloud-controller-1 respectively, where the ceph service is not listening because there is no ceph.conf file:

[root@overcloud-controller-0 heat-admin]# facter ipaddress_vlan300
10.0.0.139

[heat-admin@overcloud-controller-1 ~]$ facter ipaddress_vlan300
10.0.0.140

[root@overcloud-controller-2 heat-admin]# facter ipaddress_vlan300
10.0.0.143
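
The monitor addresses that the client keeps retrying are the ones written into the conf file on controller-0 (the only node that got one); assuming the default /etc/ceph/ceph.conf path, they can be checked with:

[root@overcloud-controller-0 heat-admin]# grep -E 'mon_initial_members|mon_host' /etc/ceph/ceph.conf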


[root@overcloud-controller-2 ~]# ls -l /etc/ceph/
total 4
-rwxr-xr-x. 1 root root 92 May 23 21:30 rbdmap

systemctl status ceph reports that there is no /etc/ceph/ceph.conf file
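
As an extra cross-check (port 6789 is the default ceph monitor port; the prompt is only illustrative), one can confirm that nothing is listening for monitor traffic on the other controllers with something like:

[root@overcloud-controller-2 ~]# ss -tlnp | grep 6789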

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with HA controllers
2. Wait for ControllerServicesBaseDeployment_Step2 
3. SSH to controller-0 
4. Check the os-collect-config journal (see the example after this list)
5. Run puppet manually with debug mode
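
For steps 4 and 5, something along these lines can be used (the unit name assumes the standard os-collect-config service running on the overcloud nodes; the exact puppet apply invocation, including the per-deployment UUID, is the one quoted in the description above):

journalctl -u os-collect-config -f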

Actual results:
The ceph client is trying to reach the other controllers where the ceph service is not running:
ceph   --name 'mon.'   --keyring '/var/lib/ceph/mon/ceph-overcloud-controller-0/keyring'  auth get client.admin | grep AQDqCGlXAAAAABAAAk3LxlGiCOI4D0xpBpHxIg==
2016-06-21 09:53:33.873039 7f1bce6fb700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.143:6789/0 pipe(0x7f1bc0000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc0004ef0).fault
2016-06-21 09:53:36.873332 7f1bce7fc700  0 -- 10.0.0.139:0/1025709 >> 10.0.0.140:6789/0 pipe(0x7f1bc0008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1bc000c520).fault


Expected results:
The ceph service is configured on the other controllers as well and the client can reach them.

Additional info:

Comment 1 Giulio Fidente 2016-06-30 14:41:59 UTC
Just an update with the current findings: puppet seems to take much longer to complete step 2 on the non-bootstrap nodes, up to 980 seconds.

This blocks ceph-mon on the bootstrap node too, as it waits for the other two nodes to form the cluster.

Still investigating what causes the delay/timeout on the non-bootstrap nodes.

Comment 2 Giulio Fidente 2016-06-30 17:11:11 UTC
The deployment eventually completes with ceph-mon running on all nodes when puppet completes step 2.

During the puppet run, the corosync logs show a 15-minute gap with no lines logged, occurring while pacemaker is bringing up the VIPs. This *could* be due to a timeout or a networking problem; investigation is continuing.
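
To help correlate that gap with the VIP bring-up, the pacemaker side can be inspected with pcs; the corosync log path below assumes the default RHEL 7 cluster logging location:

pcs status
tail -n 200 /var/log/cluster/corosync.log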

Comment 4 Marius Cornea 2016-07-20 09:10:34 UTC
I tested the patch and step 2 completed in approximately 2 minutes:

2016-07-20 08:52:15 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerLoadBalancerDeployment_Step1-qzmfbd6kirmq]: CREATE_COMPLETE Stack CREATE completed successfully
2016-07-20 08:52:16 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerServicesBaseDeployment_Step2-y7mugjbxvuvl]: CREATE_IN_PROGRESS Stack CREATE started
2016-07-20 08:52:16 [1]: CREATE_IN_PROGRESS state changed
2016-07-20 08:52:17 [0]: CREATE_IN_PROGRESS state changed
2016-07-20 08:52:18 [2]: CREATE_IN_PROGRESS state changed
2016-07-20 08:53:05 [1]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-20 08:53:06 [1]: CREATE_COMPLETE state changed
2016-07-20 08:53:06 [2]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-20 08:53:08 [2]: CREATE_COMPLETE state changed
2016-07-20 08:54:03 [0]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-20 08:54:04 [ControllerServicesBaseDeployment_Step2]: CREATE_COMPLETE state changed
2016-07-20 08:54:04 [ControllerRingbuilderDeployment_Step3]: CREATE_IN_PROGRESS state changed
2016-07-20 08:54:04 [0]: CREATE_COMPLETE state changed
2016-07-20 08:54:04 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerServicesBaseDeployment_Step2-y7mugjbxvuvl]: CREATE_COMPLETE Stack CREATE completed successfully
2016-07-20 08:54:05 [overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerRingbuilderDeployment_Step3-gjazamyx7ktj]: CREATE_IN_PROGRESS Stack CREATE started
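
For reference, events like the above can also be listed from the undercloud after the fact with python-heatclient, e.g. (the nested stack name is specific to each deployment and only illustrative):

heat event-list overcloud-ControllerNodesPostDeployment-ovygbbvoegp7-ControllerServicesBaseDeployment_Step2-y7mugjbxvuvl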

Comment 7 Marius Cornea 2016-07-29 08:06:31 UTC
openstack-tripleo-heat-templates-2.0.0-24.el7ost.noarch

Step 2 completed in approximately 2 minutes:

2016-07-29 07:50:57 [overcloud-ControllerNodesPostDeployment-ys5skqitpsta-ControllerServicesBaseDeployment_Step2-pvvopijeq3ry]: CREATE_IN_PROGRESS Stack CREATE started
2016-07-29 07:50:57 [1]: CREATE_IN_PROGRESS state changed
2016-07-29 07:50:57 [0]: CREATE_IN_PROGRESS state changed
2016-07-29 07:50:58 [2]: CREATE_IN_PROGRESS state changed
2016-07-29 07:51:51 [2]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-29 07:51:51 [1]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-29 07:51:52 [1]: CREATE_COMPLETE state changed
2016-07-29 07:51:52 [2]: CREATE_COMPLETE state changed
2016-07-29 07:52:43 [0]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-07-29 07:52:44 [0]: CREATE_COMPLETE state changed
2016-07-29 07:52:44 [overcloud-ControllerNodesPostDeployment-ys5skqitpsta-ControllerServicesBaseDeployment_Step2-pvvopijeq3ry]: CREATE_COMPLETE Stack CREATE completed successfully

Comment 10 errata-xmlrpc 2016-08-11 11:33:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html