When ControllerOvercloudServicesDeployment_Step6 is triggered on the controllers, it re-executes all of steps 1-5 in overcloud_controller_pacemaker.pp. The issue is that there is an Exec under the if step >= 4 part that does:

    exec { 'neutron-server-start-wait-stop' :
      command => "systemctl start neutron-server && \
                  sleep 5s && \
                  systemctl stop neutron-server",
      path    => ["/usr/bin", "/usr/sbin"],
    } ->

That means puppet is also going to start and then stop neutron-server if step >= 5, which means pacemaker is trying to manage neutron-server at the same time. You get into a race with puppet trying to stop neutron-server, and pacemaker trying to keep it running.

Here's what the errors looked like:

From os-collect-config:

Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: systemctl start neutron-server && sleep 5s && systemctl stop neutron-server returned 1 instead of one
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: /Stage[main]/Main/Exec[neutron-server-start-wait-stop]/returns: change from notrun to 0 failed: systemctl start neutron-server &&
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Resource::Service[neutron-server]/Pacemaker::Resource::Systemd[neutron-server]/Pcmk_resource[neutron-server]: S
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[keystone-to-neutron-server-constraint]/Exec[Creating order constraint keystone-to-neutron-serv
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[neutron-server-to-openvswitch-agent-constraint]/Exec[Creating order constraint neutron-server-
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,940] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/4b90f354-65ff-4a5f-9ef3-ebb440d90233.pp. [6]
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,943] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,944] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/4b90f354-65ff-4a5f-9ef3-ebb440d90233.json <

Pacemaker trying to start it:

Dec 10 15:05:04 overcloud-controller-0.localdomain crmd[25535]: notice: Operation neutron-server_start_0: ok (node=overcloud-controller-0, call=227, rc=0, cib-update=94, confirmed=true)

[root@overcloud-controller-0 ~]# journalctl -u neutron-server
-- Logs begin at Thu 2015-12-10 14:49:17 EST, end at Thu 2015-12-10 16:07:03 EST. --
Dec 10 15:02:39 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:02:41 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:02:46 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:02:48 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Server.
Dec 10 15:04:53 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:04:54 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:05:00 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:05:03 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Notice that neutron-server never stopped at puppet's request:

Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...

There's never a "Stopped..." after that, because pacemaker immediately restarted it. So puppet reports an error.
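For anyone who wants to check their own journal for the same pattern, here is a rough sketch of a script that looks for a "Stopping" message that is followed by "Starting" without a "Stopped" in between. The function name interrupted_stops and the regular expression are my own (hypothetical) choices, and the sample lines are copied from the journalctl output in this bug. It only looks at the systemd state messages and does not prove who restarted the service.

```python
import re

# Matches the systemd state-change messages from `journalctl -u neutron-server`.
# The pattern is an assumption based on the log lines quoted in this bug.
STATE_RE = re.compile(
    r"systemd\[\d+\]: (Starting|Started|Stopping|Stopped) OpenStack Neutron Server"
)


def interrupted_stops(lines):
    """Return 'Stopping' lines that were followed by 'Starting' before any 'Stopped'."""
    interrupted = []
    pending_stop = None  # most recent 'Stopping' line not yet confirmed by 'Stopped'
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        state = match.group(1)
        if state == "Stopping":
            pending_stop = line
        elif state == "Stopped":
            pending_stop = None
        elif state == "Starting" and pending_stop is not None:
            # The stop never completed; something started the unit again.
            interrupted.append(pending_stop)
            pending_stop = None
    return interrupted


# Sample lines taken from the journalctl output in this report.
JOURNAL = """\
Dec 10 15:02:39 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:02:41 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:02:46 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:02:48 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Server.
Dec 10 15:04:53 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:04:54 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:05:00 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:05:03 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
"""

if __name__ == "__main__":
    for line in interrupted_stops(JOURNAL.splitlines()):
        print("stop was interrupted:", line)
```

Against the sample above, the first stop (15:02:46) completes normally, and only the 15:04:59 stop is flagged.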
I'm requesting blocker on this one just so that we can get it discussed. We've hit this several times in CI, although this is the first time we've seen the error this way. Usually neutron-server is just down on the overcloud, and then the tempest run fails against the overcloud. In that scenario I suspect puppet has won the race and stopped neutron-server before pacemaker could restart it (that's my theory anyway). When pacemaker wins and restarts it before puppet can stop it, you get the error as shown in this BZ. We have seen this with some regularity in CI (maybe 20%), which is why I'm asking for blocker.
No specific steps are needed to verify this one. Basically, if various deployments have been done (HA, non-HA, net-iso, non-net-iso, virt, baremetal), and we're not seeing errors with neutron-server failing to stop/start on the overcloud, then this bug fix is likely doing the right thing.
Successfully updated from 7.1 to 7.2 and saw no issue. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2650