Bug 1551397

Summary: FFU: deploy_steps_playbook.yaml playbook fails because the rabbitmq_init_bundle container is unable to successfully run 'rabbitmqctl status | grep -F "{rabbit,"'
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Emilien Macchi <emacchi>
Status: CLOSED WORKSFORME
QA Contact: Gurenko Alex <agurenko>
Severity: urgent
Priority: urgent
Version: 13.0 (Queens)
CC: dbecker, jeckersb, jschluet, jstransk, lbezdick, mbracho, mbultel, mburns, mcornea, michele, morazi, rhel-osp-director-maint, sathlang, sclewis
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-05-02 15:15:12 UTC
Type: Bug

Description Marius Cornea 2018-03-05 03:25:22 UTC
Description of problem:
FFU: the deploy_steps_playbook.yaml playbook fails because the rabbitmq_init_bundle container is unable to successfully run 'rabbitmqctl status | grep -F "{rabbit,"'


[root@controller-0 ~]# docker logs --tail 10 rabbitmq_init_bundle
Debug: Executing: 'rabbitmqctl status | grep -F "{rabbit,"'
Debug: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: Sleeping for 10 seconds between tries
Debug: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: Exec try 47/180
Debug: Exec[rabbitmq-ready](provider=posix): Executing 'rabbitmqctl status | grep -F "{rabbit,"'
Debug: Executing: 'rabbitmqctl status | grep -F "{rabbit,"'
Debug: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: Sleeping for 10 seconds between tries
Debug: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: Exec try 48/180
Debug: Exec[rabbitmq-ready](provider=posix): Executing 'rabbitmqctl status | grep -F "{rabbit,"'
Debug: Executing: 'rabbitmqctl status | grep -F "{rabbit,"'
Debug: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: Sleeping for 10 seconds between tries
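
For context, this retry loop comes from the Exec[rabbitmq-ready] resource in puppet-tripleo's tripleo::profile::pacemaker::rabbitmq_bundle. A minimal shell sketch of what the probe is effectively doing (an illustration reconstructed from the log above, not the actual Puppet code):

# Sketch: poll until 'rabbitmqctl status' reports the rabbit application,
# trying up to 180 times with 10 seconds between tries, as in the log.
for try in $(seq 1 180); do
    if rabbitmqctl status | grep -F '{rabbit,'; then
        exit 0          # broker is up: the status output lists {rabbit,...}
    fi
    echo "Exec try ${try}/180; sleeping for 10 seconds between tries"
    sleep 10
done
exit 1                  # never became ready, so the deploy step times out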

This step eventually times out. When trying to run the rabbitmqctl status command manually I get:

[root@controller-0 ~]# docker exec -it rabbitmq_init_bundle rabbitmqctl status
Error: Failed to initialize erlang distribution: {{shutdown,
                                                   {failed_to_start_child,
                                                    net_kernel,
                                                    {'EXIT',nodistribution}}},
                                                  {child,undefined,
                                                   net_sup_dynamic,
                                                   {erl_distribution,
                                                    start_link,
                                                    [['rabbitmq-cli-85',
                                                      shortnames]]},
                                                   permanent,1000,supervisor,
                                                   [erl_distribution]}}.


Version-Release number of selected component (if applicable):
rhosp13/openstack-rabbitmq:2018-03-02.2
pacemaker-1.1.18-11.el7.x86_64
resource-agents-3.9.5-124.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 1 controller + 1 compute
2. Run through the FFU steps
3. Run deploy_steps_playbook.yaml

Actual results:
The playbook eventually times out because the rabbitmq_init_bundle container cannot exit successfully.

Expected results:
deploy_steps_playbook.yaml finishes successfully.

Additional info:

Comment 2 John Eckersberg 2018-03-05 21:06:32 UTC
This is (somewhat) intentional; we explicitly bind-mount /bin/true on top of /bin/epmd to prevent epmd from running.  See https://review.openstack.org/#/c/527404/5.

However, if epmd is not running, it means that rabbitmqctl can't start erlang distribution, so the error message above is produced.

Sadly, I'm not sure how this ever worked; I'll keep digging.
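
One way to confirm that the neutralized epmd is what breaks the CLI (hedged: the commands are standard, but this assumes epmd resolves to the bind-mounted path inside the container; epmd normally listens on TCP 4369):

[root@controller-0 ~]# docker exec rabbitmq_init_bundle grep epmd /proc/mounts
# the bind mount shows up here, with /bin/epmd as the mount point
[root@controller-0 ~]# docker exec rabbitmq_init_bundle epmd -names
# with /bin/true mounted over it, this exits 0 and prints nothing:
# there is no daemon for rabbitmqctl's distribution to register with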

Comment 3 Michele Baldessari 2018-03-06 15:22:16 UTC
Aren't we missing https://review.openstack.org/#/c/531448/ ?

Comment 4 Michele Baldessari 2018-03-06 15:25:23 UTC
(In reply to John Eckersberg from comment #2)
> This is (somewhat) intentional; we explicitly bind-mount /bin/true on top
> of /bin/epmd to prevent epmd from running.  See
> https://review.openstack.org/#/c/527404/5.
> 
> However, if epmd is not running, it means that rabbitmqctl can't start
> erlang distribution, so the error message above is produced.

Right, that is because epmd must never be spawned by the init_bundle. It
has to be started by pacemaker in the proper docker-rabbitmq-bundle.

What we might be missing here are some caveats that FFU imposes on us that break the rabbitmq_ready tag assumption.
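
A hedged way to see that distinction on a node where the bundle is running (the bundle container name below is illustrative; pacemaker names it along the lines of rabbitmq-bundle-docker-0):

[root@controller-0 ~]# docker ps --format '{{.Names}}' | grep rabbitmq
# expect the pacemaker-managed bundle container next to rabbitmq_init_bundle
[root@controller-0 ~]# docker exec rabbitmq-bundle-docker-0 pgrep -af epmd
# epmd belongs here, spawned under pacemaker, never in the init bundle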

Comment 5 Michele Baldessari 2018-03-06 15:32:30 UTC
(In reply to Michele Baldessari from comment #3)
> Aren't we missing https://review.openstack.org/#/c/531448/ ?

Ah, never mind. This is a deployment with 1 controller, so the patch is surely there (it's just that we're hitting the first if clause). So the question becomes:
Why is rabbitmq not coming up? puppet-pacemaker must have created the rabbitmq bundle resource. Are there any errors in the pacemaker logs? Could you drop a sosreport for the controller node somewhere?

Comment 6 Michele Baldessari 2018-03-06 15:49:06 UTC
OK, so the reason is that when the init_bundle runs, the cluster is in maintenance mode:
4 nodes configured
16 resources configured

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ controller-0 ]

Full list of resources:

 ip-172.17.3.17 (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-172.17.4.11 (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-192.168.24.9        (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 Docker container: rabbitmq-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest] (unmanaged)
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped (unmanaged)
 Docker container: galera-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest] (unmanaged)
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped (unmanaged)
 Docker container: redis-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest] (unmanaged)
   redis-bundle-0       (ocf::heartbeat:redis): Stopped (unmanaged)
 Docker container: haproxy-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest] (unmanaged)
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped (unmanaged)


So we're actually waiting for the rabbitmq bundle to come up, but it never will because the cluster is in maintenance mode.
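
For reference, the property can be checked and (when appropriate) cleared with standard pcs commands:

[root@controller-0 ~]# pcs property show maintenance-mode
# reports maintenance-mode: true in the state above
[root@controller-0 ~]# pcs property set maintenance-mode=false
# would let the cluster manage and start the bundles again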

Comment 7 Michele Baldessari 2018-03-06 16:11:43 UTC
OK, so here is what is setting the maintenance-mode=true property:
cluster/corosync.log:Mar 06 15:50:35 [550161] controller-0        cib:     info: cib_perform_op:        +  /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode']:  @value=false
cluster/corosync.log:Mar 06 15:50:35 [550166] controller-0       crmd:     info: abort_transition_graph:        Transition aborted by cib-bootstrap-options-maintenance-mode doing modify maintenance-mode=false: Configuration change | cib=0.86.0 source=te_update_diff:456 path=/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode'] complete=true
messages:Mar  5 10:55:15 controller-0 os-collect-config: [2018-03-05 10:55:14,997] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-47qa75eiwwox-ControllerPrePuppet-cm2fkki6x4sl-ControllerPrePuppetMaintenanceModeDeployment-mxfczy4ibsvo/68ec7495-3218-4d98-8212-0c1c892666c0 
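
(For anyone retracing this, the lines above can be found with straightforward greps on the controller, e.g.:)

[root@controller-0 ~]# grep maintenance-mode /var/log/cluster/corosync.log
[root@controller-0 ~]# grep MaintenanceModeDeployment /var/log/messages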

Jiri, is this stuff still needed in this context (FFU 10-13)?
I think that especially during upgrades (and FFU) we probably do not want this. (Although we might want to reevaluate whether we even want this at all during deployment anymore.)

Comment 8 Jiri Stransky 2018-03-07 10:56:23 UTC
The main reason for that was to prevent Puppet and Pacemaker from fighting over the same services when Puppet ran during a stack update and restarted some services that were normally under Pacemaker's control. So the intent was to have it for stack updates that delivered config changes.

It should not be enabled when we're trying to stop->update/upgrade->start the services, and AFAIK we never had maintenance mode kick in during minor/major updates/upgrades previously. This FFU case looks similar to me, so +1 on your conclusion that Pacemaker shouldn't be under maint mode at that time.

Regarding phasing out the maint mode hooks completely: they're actually not present in containerized deployments [1]. TripleO currently has them only in the deprecated, upstream-only non-containerized deployments [2]. So as we drop the non-containerized service variants from tripleo-heat-templates, we'll be able to drop the maintenance mode hooks too.

[1] https://github.com/openstack/tripleo-heat-templates/blob/3004c31d72e2f5963bda6821f7bc3da47940ea75/environments/docker-ha.yaml#L8-L9
[2] https://github.com/openstack/tripleo-heat-templates/blob/3004c31d72e2f5963bda6821f7bc3da47940ea75/environments/puppet-pacemaker.yaml#L4-L5
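
(A quick way to check which hooks a given environment wires up; the grep pattern here is illustrative, keyed off the PrePuppet/PostPuppet deployment names seen in the logs above:)

$ grep -n 'PrePuppet\|PostPuppet' environments/docker-ha.yaml environments/puppet-pacemaker.yaml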

Comment 13 Sofer Athlan-Guyot 2018-03-15 17:21:32 UTC
So the current workaround is to remove those lines [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_maintenance_mode.sh#L8..L11
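
A sketch of what the removal amounts to (paraphrased, not copied from the tree): the referenced lines guard the property flip, roughly

pacemaker_status=$(systemctl is-active pacemaker || :)
if [ "$pacemaker_status" = "active" ]; then
    pcs property set maintenance-mode=true    # deleting this branch is the workaround
fi

so deleting them means the FFU run never puts the cluster into maintenance mode in the first place.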

Comment 16 Lukas Bezdicka 2018-05-02 15:15:12 UTC
We have not hit this issue recently.