Bug 1290582 - puppet / pacemaker race stopping and starting neutron-server on Step6 of puppet apply
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: y2
: 7.0 (Kilo)
Assignee: James Slagle
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-12-10 21:07 UTC by James Slagle
Modified: 2023-02-22 23:02 UTC
7 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-92.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, during the initial Overcloud deployment, there was a race condition between Puppet trying to stop neutron-server and Pacemaker trying to start it. As a result, neutron-server was often left stopped on the Overcloud controllers even though the deployment reported success. This happened because the request to stop neutron-server eventually succeeded, but the stopped state was not reported to Orchestration. With this update, the Puppet manifest stops neutron-server only if Pacemaker is not already managing the neutron-server resource. As a result, initial deployments succeed and neutron-server is running in the Overcloud.
Clone Of:
Environment:
Last Closed: 2015-12-21 16:53:18 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:2650 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update 2015-12-21 21:44:54 UTC

Description James Slagle 2015-12-10 21:07:28 UTC
When ControllerOvercloudServicesDeployment_Step6 is triggered on controllers, it re-executes all of the steps 1-5 in overcloud_controller_pacemaker.pp. The issue is that there is an Exec under the if step >= 4 part that does:

    exec { 'neutron-server-start-wait-stop' :
      command   => "systemctl start neutron-server && \
                    sleep 5s && \
                    systemctl stop neutron-server",
      path      => ["/usr/bin", "/usr/sbin"],
    } ->

That means puppet is still going to attempt to start and then stop neutron-server when step >= 5, at which point pacemaker is also trying to manage neutron-server.

You get into a race with puppet trying to stop neutron-server, and pacemaker trying to keep it running.
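
A minimal sketch of one way to avoid the race, following the approach described in the Doc Text (only stop neutron-server if pacemaker is not already managing it). This is an illustration, not necessarily the exact change that shipped. It assumes that `pcs resource show neutron-server` exits 0 only when the neutron-server resource is already defined in pacemaker, and that pcs is found in the given path:

```puppet
# Sketch only: the same Exec, guarded so that it is skipped once pacemaker
# owns the neutron-server resource.
exec { 'neutron-server-start-wait-stop':
  command => 'systemctl start neutron-server && sleep 5s && systemctl stop neutron-server',
  path    => ['/usr/bin', '/usr/sbin'],
  # Assumption: this command returns 0 only if the pacemaker resource
  # exists. Puppet skips the exec whenever the `unless` command returns 0.
  unless  => 'pcs resource show neutron-server',
}
```

On the first pass through step 4 the resource does not exist yet, so the Exec runs as before. When Step6 re-executes the earlier steps, the guard succeeds and puppet leaves neutron-server to pacemaker. In the original manifest the Exec is chained to the next resource with `->`; the chain is omitted here so the fragment stands alone.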

Here's what the errors looked like:

From os-collect-config:

Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: systemctl start neutron-server &&                     sleep 5s &&                     systemctl stop neutron-server returned 1 instead of one 
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: /Stage[main]/Main/Exec[neutron-server-start-wait-stop]/returns: change from notrun to 0 failed: systemctl start neutron-server &&             
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Resource::Service[neutron-server]/Pacemaker::Resource::Systemd[neutron-server]/Pcmk_resource[neutron-server]: S
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[keystone-to-neutron-server-constraint]/Exec[Creating order constraint keystone-to-neutron-serv
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[neutron-server-to-openvswitch-agent-constraint]/Exec[Creating order constraint neutron-server-
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,940] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/4b90f354-65ff-4a5f-9ef3-ebb440d90233.pp. [6]
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,943] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,944] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/4b90f354-65ff-4a5f-9ef3-ebb440d90233.json < 

pacemaker trying to start it:

Dec 10 15:05:04 overcloud-controller-0.localdomain crmd[25535]:   notice: Operation neutron-server_start_0: ok (node=overcloud-controller-0, call=227, rc=0, cib-update=94, confirmed=true)


[root@overcloud-controller-0 ~]# journalctl -u neutron-server
-- Logs begin at Thu 2015-12-10 14:49:17 EST, end at Thu 2015-12-10 16:07:03 EST. --
Dec 10 15:02:39 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:02:41 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:02:46 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:02:48 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Server.
Dec 10 15:04:53 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:04:54 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:05:00 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:05:03 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.

Comment 1 James Slagle 2015-12-10 21:09:28 UTC
Notice that neutron-server never stopped at puppet's request:
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...

There's never a "Stopped..." after that, because pacemaker immediately restarted it. So then puppet reports an error.

Comment 2 James Slagle 2015-12-10 21:29:18 UTC
I'm requesting blocker on this one just so that we can get it discussed. We've hit this several times in CI, although this is the first time we've seen the error this way.

Usually neutron-server is just down on the overcloud, and then the tempest run fails against the overcloud. In that scenario I suspect it's because puppet has won the race to stop neutron-server before pacemaker can restart it (that's my theory anyway). We have seen this with some regularity in CI (maybe 20%), so that's why I'm asking for blocker.

When pacemaker wins the race and restarts neutron-server before puppet can stop it, you get the error as shown in this bz.

Comment 3 James Slagle 2015-12-11 00:55:48 UTC
No specific steps are needed to verify this one. Basically, if various deployments have been done (HA, non-HA, net-iso, non-net-iso, virt, baremetal), and we're not seeing errors with neutron-server failing to stop/start on the overcloud, then this bug fix is likely doing the right thing.

Comment 5 Udi Kalifon 2015-12-16 15:13:20 UTC
Successfully updated from 7.1 to 7.2 and saw no issue. Verified.

Comment 7 errata-xmlrpc 2015-12-21 16:53:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650

