Bug 1290582

Summary:	puppet / pacemaker race stopping and starting neutron-server on Step6 of puppet apply
Product:	Red Hat OpenStack	Reporter:	James Slagle <jslagle>
Component:	openstack-tripleo-heat-templates	Assignee:	James Slagle <jslagle>
Status:	CLOSED ERRATA	QA Contact:	Amit Ugol <augol>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.0 (Kilo)	CC:	dnavale, emacchi, jcoufal, mburns, rhel-osp-director-maint, sasha, ukalifon
Target Milestone:	y2
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-0.8.6-92.el7ost	Doc Type:	Bug Fix
Doc Text:	Previously, during the initial Overcloud deployment, there was a race condition between Puppet trying to stop neutron-server and Pacemaker trying to start it. neutron-server would often be left stopped on the Overcloud controllers, even though the deployment indicated it was successful. This happened because the request to stop neutron-server eventually succeeded, although this was not reported to Orchestration. With this update, the Puppet manifest stops neutron-server only if Pacemaker is not already managing the neutron-server resource. As a result, initial deployments succeed and neutron-server is running in the Overcloud.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-12-21 16:53:18 UTC	Type:	Bug

Description James Slagle 2015-12-10 21:07:28 UTC

When ControllerOvercloudServicesDeployment_Step6 is triggered on controllers, it re-executes all of the steps 1-5 in overcloud_controller_pacemaker.pp. The issue is that there is an Exec under the if step >= 4 part that does:

    exec { 'neutron-server-start-wait-stop' :
      command   => "systemctl start neutron-server && \
                    sleep 5s && \
                    systemctl stop neutron-server",
      path      => ["/usr/bin", "/usr/sbin"],
    } ->

That means puppet is going to attempt to start and then stop neutron-server when step >= 5, at which point pacemaker is also trying to manage neutron-server.

You get into a race with puppet trying to stop neutron-server, and pacemaker trying to keep it running.
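
One way to avoid the race, as a minimal sketch: guard the Exec so that it is skipped once pacemaker already has a neutron-server resource. The "unless" check using "pcs resource show" is an assumption made for illustration (it relies on pcs exiting non-zero when the resource does not exist, and on pcs being found under /usr/sbin); it is not necessarily the exact change that shipped.

```puppet
# Sketch only: run the start/wait/stop sequence just when pacemaker
# does not yet know about a neutron-server resource.
exec { 'neutron-server-start-wait-stop' :
  command => "systemctl start neutron-server && \
              sleep 5s && \
              systemctl stop neutron-server",
  path    => ["/usr/bin", "/usr/sbin"],
  # Assumption: 'pcs resource show <id>' returns 0 only if the resource
  # exists, so puppet skips the Exec whenever pacemaker already manages it.
  unless  => "pcs resource show neutron-server",
}
```

With a guard like this, the first run (before the pacemaker resource exists) still does the start/wait/stop, while later re-runs of steps 1-5 leave neutron-server to pacemaker.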

Here's what the errors looked like:

From os-collect-config:

Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: systemctl start neutron-server &&                     sleep 5s &&                     systemctl stop neutron-server returned 1 instead of one 
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Error: /Stage[main]/Main/Exec[neutron-server-start-wait-stop]/returns: change from notrun to 0 failed: systemctl start neutron-server &&             
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Resource::Service[neutron-server]/Pacemaker::Resource::Systemd[neutron-server]/Pcmk_resource[neutron-server]: S
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[keystone-to-neutron-server-constraint]/Exec[Creating order constraint keystone-to-neutron-serv
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[neutron-server-to-openvswitch-agent-constraint]/Exec[Creating order constraint neutron-server-
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,940] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/4b90f354-65ff-4a5f-9ef3-ebb440d90233.pp. [6]
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,943] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet
Dec 10 15:05:31 overcloud-controller-0.localdomain os-collect-config[4580]: [2015-12-10 15:05:31,944] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/4b90f354-65ff-4a5f-9ef3-ebb440d90233.json < 

pacemaker trying to start it:

Dec 10 15:05:04 overcloud-controller-0.localdomain crmd[25535]:   notice: Operation neutron-server_start_0: ok (node=overcloud-controller-0, call=227, rc=0, cib-update=94, confirmed=true)


[root@overcloud-controller-0 ~]# journalctl -u neutron-server
-- Logs begin at Thu 2015-12-10 14:49:17 EST, end at Thu 2015-12-10 16:07:03 EST. --
Dec 10 15:02:39 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:02:41 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:02:46 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:02:48 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Server.
Dec 10 15:04:53 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:04:54 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...
Dec 10 15:05:00 overcloud-controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 10 15:05:03 overcloud-controller-0.localdomain systemd[1]: Started OpenStack Neutron Server.

Comment 1 James Slagle 2015-12-10 21:09:28 UTC

Notice that neutron-server never stopped at puppet's request:
Dec 10 15:04:59 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Server...

There's never a "Stopped..." after that, because pacemaker immediately restarted it. So puppet then reports an error.

Comment 2 James Slagle 2015-12-10 21:29:18 UTC

I'm requesting blocker on this one just so that we can get it discussed. We've hit this several times in CI, although this is the first time we've seen the error this way.

Usually neutron-server is just down on the overcloud, and then the tempest run fails against the overcloud. In that scenario I suspect puppet has won the race and stopped neutron-server before pacemaker could restart it (that's my theory anyway). We have seen this with some regularity in CI (maybe 20%), which is why I'm asking for blocker.

When pacemaker wins and restarts it before puppet can stop it, you get the error as shown in this bz.

Comment 3 James Slagle 2015-12-11 00:55:48 UTC

No specific steps are needed to verify this one. Basically, if various deployments have been done (HA, non-HA, net-iso, non-net-iso, virt, baremetal), and we're not seeing errors with neutron-server failing to stop/start on the overcloud, then this bug fix is likely doing the right thing.

Comment 5 Udi Kalifon 2015-12-16 15:13:20 UTC

Successfully updated from 7.1 to 7.2 and saw no issue. Verified.

Comment 7 errata-xmlrpc 2015-12-21 16:53:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650