Bug 1551397
Summary: | FFU: deploy_steps_playbook.yaml playbook fails because rabbitmq_init_bundle container is unable to successfully run Executing: 'rabbitmqctl status | grep -F "{rabbit,"' | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | openstack-tripleo-heat-templates | Assignee: | Emilien Macchi <emacchi> |
Status: | CLOSED WORKSFORME | QA Contact: | Gurenko Alex <agurenko> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 13.0 (Queens) | CC: | dbecker, jeckersb, jschluet, jstransk, lbezdick, mbracho, mbultel, mburns, mcornea, michele, morazi, rhel-osp-director-maint, sathlang, sclewis |
Target Milestone: | beta | Keywords: | Triaged |
Target Release: | 13.0 (Queens) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-05-02 15:15:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Marius Cornea
2018-03-05 03:25:22 UTC
This is (somewhat) intentional; we explicitly bind mount /bin/true overtop of /bin/epmd to prevent epmd from running. See https://review.openstack.org/#/c/527404/5.

However, if epmd is not running, rabbitmqctl can't start Erlang distribution, so the error message above is produced. Sadly, I'm not sure how this ever worked; I'll keep digging.

Aren't we missing https://review.openstack.org/#/c/531448/ ?

(In reply to John Eckersberg from comment #2)
> This is (somewhat) intentional; we explicitly bind mount /bin/true overtop
> of /bin/epmd to prevent epmd from running. See
> https://review.openstack.org/#/c/527404/5.
>
> However, if epmd is not running, it means that rabbitmqctl can't start
> erlang distribution, so the error message above is produced.

Right, that is because epmd must never be spawned by the init_bundle. It has to be started by pacemaker in the proper docker-rabbitmq-bundle. What we might be missing here are some caveats that FFU imposes on us and that break the rabbitmq_ready tag assumption.

(In reply to Michele Baldessari from comment #3)
> Aren't we missing https://review.openstack.org/#/c/531448/ ?

Ah, never mind: this is a deployment with 1 controller, so the patch is surely there (it's just that we're hitting the first if clause). So the question becomes: why is rabbitmq not coming up? puppet-pacemaker must have created the rabbitmq bundle resource. Are there any errors in the pacemaker logs? Could you drop a sosreport for the ctrl node somewhere?
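The readiness check named in the summary (`rabbitmqctl status | grep -F "{rabbit,"`) can be sketched as a bounded polling loop. This is only an illustration: the function names, the number of attempts, and the delay are assumptions, and the actual retry logic of the init_bundle is not shown in this report. The status text is supplied by a caller-provided function so the sketch does not depend on a running broker.

```python
import time
from typing import Callable


def rabbit_app_running(status_output: str) -> bool:
    """Fixed-string match, like `grep -F "{rabbit,"` on the status output."""
    return "{rabbit," in status_output


def wait_for_rabbit(
    get_status: Callable[[], str],
    attempts: int = 10,            # hypothetical default, not from the report
    delay_seconds: float = 1.0,    # hypothetical default, not from the report
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the rabbit application is listed, or give up after `attempts`."""
    for attempt in range(attempts):
        if rabbit_app_running(get_status()):
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

With a bound on the number of attempts, a broker that never starts produces a clear failure instead of an indefinite wait.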
OK, so the reason is that when the init_bundle runs, the cluster is in maintenance mode:

4 nodes configured
16 resources configured

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ controller-0 ]

Full list of resources:

 ip-172.17.3.17 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-172.17.4.11 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-192.168.24.9 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 Docker container: rabbitmq-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest] (unmanaged)
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped (unmanaged)
 Docker container: galera-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest] (unmanaged)
   galera-bundle-0 (ocf::heartbeat:galera): Stopped (unmanaged)
 Docker container: redis-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest] (unmanaged)
   redis-bundle-0 (ocf::heartbeat:redis): Stopped (unmanaged)
 Docker container: haproxy-bundle [rhos-qe-mirror-brq.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest] (unmanaged)
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)

So we're actually waiting for the rabbitmq bundle to come up, but it never will because the cluster is in maintenance mode.
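As a rough sketch, the two symptoms in the status output above (the maintenance banner and resources that are both stopped and unmanaged) could be detected from captured status text as follows. The function names are hypothetical, and the line format is assumed from the output quoted in this comment; other Pacemaker versions may print it differently.

```python
import re
from typing import List

# Banner text as it appears in the status output quoted in this report.
MAINTENANCE_BANNER = "*** Resource management is DISABLED ***"

# Matches lines such as:
#   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped (unmanaged)
_RESOURCE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>ocf::[^)]+)\):\s+(?P<state>\S+)(?P<rest>.*)$"
)


def in_maintenance_mode(status_text: str) -> bool:
    """True if the status text carries the maintenance-mode banner."""
    return MAINTENANCE_BANNER in status_text


def stopped_unmanaged(status_text: str) -> List[str]:
    """Names of resources reported as Stopped and flagged (unmanaged)."""
    names = []
    for line in status_text.splitlines():
        match = _RESOURCE_LINE.match(line)
        if (
            match
            and match.group("state") == "Stopped"
            and "(unmanaged)" in match.group("rest")
        ):
            names.append(match.group("name"))
    return names
```

A resource that shows up in `stopped_unmanaged` while `in_maintenance_mode` is true will not be started by the cluster, so waiting on it cannot succeed.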
OK, so here is what is setting the maintenance-mode=true property:

cluster/corosync.log:Mar 06 15:50:35 [550161] controller-0 cib: info: cib_perform_op: + /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode']: @value=false
cluster/corosync.log:Mar 06 15:50:35 [550166] controller-0 crmd: info: abort_transition_graph: Transition aborted by cib-bootstrap-options-maintenance-mode doing modify maintenance-mode=false: Configuration change | cib=0.86.0 source=te_update_diff:456 path=/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode'] complete=true
messages:Mar 5 10:55:15 controller-0 os-collect-config: [2018-03-05 10:55:14,997] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-47qa75eiwwox-ControllerPrePuppet-cm2fkki6x4sl-ControllerPrePuppetMaintenanceModeDeployment-mxfczy4ibsvo/68ec7495-3218-4d98-8212-0c1c892666c0

Jiri, is this stuff still needed in this context (FFU 10-13)? I think that especially during upgrades (and FFU) we probably do not want this. (Although we might want to reevaluate whether we want this at all during deployment anymore.)

The main reason for that was to prevent Puppet vs. Pacemaker fighting over the same services when Puppet ran during a stack update and restarted some services which were normally under Pacemaker's control. So the intent was to have it for stack updates that delivered config changes. It should not be enabled when we're trying to stop->update/upgrade->start the services, and AFAIK we never had maintenance mode kick in during minor/major updates/upgrades previously. This FFU case looks similar to me, so +1 on your conclusion that Pacemaker shouldn't be under maint mode at that time.

Regarding phasing out the maint mode hooks completely: it's actually not present in containerized deployments [1].
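To line up maintenance-mode changes against the heat-config deployment timestamps, the `cib_perform_op` entries can be pulled out of corosync.log. The sketch below is a minimal example under stated assumptions: the function name is hypothetical, and the pattern is derived only from the single `cib_perform_op` line quoted in this report, so other Pacemaker log formats may need a different expression.

```python
import re
from typing import Iterable, List, Tuple

# Pattern assumed from the cib_perform_op line quoted in this report.
_MAINT_CHANGE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}).*?"
    r"cib_perform_op:.*?maintenance-mode'\]:\s+@value=(?P<value>\w+)"
)


def maintenance_mode_changes(lines: Iterable[str]) -> List[Tuple[str, bool]]:
    """Return (timestamp, enabled) for each logged maintenance-mode change."""
    changes = []
    for line in lines:
        match = _MAINT_CHANGE.search(line)
        if match:
            changes.append((match.group("stamp"), match.group("value") == "true"))
    return changes
```

Each tuple records when the property changed and whether maintenance mode was switched on or off at that point; lines from other subsystems, such as `abort_transition_graph`, are ignored.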
TripleO has it in the deprecated upstream-only non-containerized deployments currently [2]. So as we drop the non-containerized service variants from tripleo-heat-templates, we'll be able to drop the maintenance mode hooks too.

[1] https://github.com/openstack/tripleo-heat-templates/blob/3004c31d72e2f5963bda6821f7bc3da47940ea75/environments/docker-ha.yaml#L8-L9
[2] https://github.com/openstack/tripleo-heat-templates/blob/3004c31d72e2f5963bda6821f7bc3da47940ea75/environments/puppet-pacemaker.yaml#L4-L5

So the current workaround is to remove those lines [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_maintenance_mode.sh#L8..L11

Recently we didn't hit this issue.