1292555 – Quickstack pacemaker puppet modules for nova causing deployment failures with rhel-osp-installer

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-foreman-installer
Sub Component:
Version:	6.0 (Juno)
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	Installer
Assignee:	Jason Guiditta
QA Contact:	yeylon@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:	1290684
Blocks:

Reported:	2015-12-17 18:57 UTC by Dave Cain
Modified:	2016-04-18 07:14 UTC
CC List:	13 users
Fixed In Version:	openstack-foreman-installer-3.0.27-1.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-02-22 12:32:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:0284	0	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform Installer update	2016-02-22 17:32:05 UTC

Description Dave Cain 2015-12-17 18:57:33 UTC

Description: Hi folks.  When using the rhel-osp-installer on RHEL-OSP6 (RHEL7.2) to build an OpenStack deployment, it seems that the Quickstack Puppet modules for nova are causing deployment failures whenever creating the cloned Pacemaker resources via "pcs resource create".  This used to work when we were doing the FlexPod OpenStack CVD validation earlier this summer.

Getting the following errors on the Controllers:

Notice: /Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Resource::Generic[openstack-nova-novncproxy]/Exec[create openstack-nova-novncproxy resource]/returns: Error: When using 'op' you must specify an operation name and at least one option
Error: /usr/sbin/pcs resource create openstack-nova-novncproxy     systemd:openstack-nova-novncproxy clone interleave=true  op monitor start-delay=10s returned 1 instead of one of [0]
Error: /Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Resource::Generic[openstack-nova-novncproxy]/Exec[create openstack-nova-novncproxy resource]/returns: change from notrun to 0 failed: /usr/sbin/pcs resource create openstack-nova-novncproxy     systemd:openstack-nova-novncproxy clone interleave=true  op monitor start-delay=10s returned 1 instead of one of [0]

I believe the following resources are affected by this:
openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
openstack-nova-conductor-clone [openstack-nova-conductor]
openstack-nova-api-clone [openstack-nova-api]
openstack-nova-scheduler-clone [openstack-nova-scheduler]

If one patches line 26 of the following file on the Installer (Puppet master):
/etc/puppet/environments/production/modules/quickstack/manifests/pacemaker/resource/generic.pp

to read as follows, removing the "op":
#$_operation_opts = "${operation_opts}"

then the installation progresses further.  Whether or not this is the correct action, I cannot say.  We need someone to take a look at this.

Installer Node package versions:
openstack-puppet-modules-2014.2.15-4.el7ost.noarch
openstack-foreman-installer-3.0.26-1.el7ost.noarch
foreman-discovery-image-7.0-20150227.0.el7ost.noarch
foreman-proxy-1.6.0.30-6.el7ost.noarch
rubygem-foreman_api-0.1.11-6.el7sat.noarch
foreman-postgresql-1.6.0.49-6.el7ost.noarch
foreman-installer-1.6.0-0.4.RC1.el7ost.noarch
foreman-selinux-1.6.0.14-1.el7sat.noarch
foreman-1.6.0.49-6.el7ost.noarch

Controller Nodes package versions:
openstack-nova-common-2014.2.3-35.el7ost.noarch
openstack-nova-api-2014.2.3-35.el7ost.noarch
openstack-nova-scheduler-2014.2.3-35.el7ost.noarch
openstack-nova-console-2014.2.3-35.el7ost.noarch
openstack-nova-novncproxy-2014.2.3-35.el7ost.noarch
openstack-nova-cert-2014.2.3-35.el7ost.noarch
openstack-nova-conductor-2014.2.3-35.el7ost.noarch

External links:

Severity (U/H/M/L): H

Business Priority: Urgent

Comment 2 Jason Guiditta 2015-12-17 22:03:12 UTC

The suggested change seems likely harmless enough, with the one potential caveat being neutron.  This installer implemented the reference architecture found here [1], and I would have some concern that, since this neutron timeout setting would get dropped, there could be other issues if neutron is slow to start.

[1] https://github.com/beekhof/osp-ha-deploy/blob/Juno-RDO6/pcmk/neutron-server.scenario#L176

Comment 3 Dave Cain 2015-12-21 14:18:48 UTC

For what it's worth, we did get a successful deployment using the rhel-osp-installer after implementing the above change.  However, that deployment was contingent on Bugzilla 1290684 being resolved as well.

@Jason -- if there is a different change you can recommend that would not affect Neutron, please let me know and I will try it.  In our environment at least, Neutron came up fine.

Comment 4 Mike Orazi 2015-12-22 15:44:48 UTC

David,

Any chance you could submit a patch to the kilo branch for astapor?

Comment 5 Mark Davidson 2016-01-07 12:21:21 UTC

I just hit this on my system on CentOS 7.

The latest pacemaker has a new error message "When using 'op' you must specify an operation name and at least one option" which was not reported before.

I checked the command that was running:
/usr/sbin/pcs resource create httpd systemd:httpd clone interleave=true op monitor start-delay=10s

which looked ok, so I then checked how pcs was parsing this. As soon as pcs sees the clone argument, it thinks everything is a clone arg. So previously the above command was not setting up the operation correctly (start-delay was added as a clone param).

Changing the command to
/usr/sbin/pcs resource create httpd systemd:httpd op monitor start-delay=10s clone interleave=true

works correctly. (The alternative is to fix the pcs argument parsing.)
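
To illustrate the ordering issue described above, here is a minimal Python sketch of keyword grouping. It is a toy model of the behaviour reported in this comment, not the actual pcs parser; the function name `group_args` and the section names are hypothetical.

```python
# Toy model of the argument grouping described in this comment.
# NOT the real pcs code: it only mimics the reported behaviour that,
# once "clone" is seen, every later token is taken as a clone argument.

def group_args(args):
    """Group the tokens that follow the resource agent into sections."""
    sections = {"resource": [], "op": [], "clone": []}
    current = "resource"
    for token in args:
        if token == "clone":
            current = "clone"
        elif token == "op" and current != "clone":
            # "op" only opens an operation section before "clone" appears
            current = "op"
        else:
            sections[current].append(token)
    return sections


# Ordering used by the failing command: "op" comes after "clone"
bad = group_args("clone interleave=true op monitor start-delay=10s".split())

# Ordering suggested in this comment: "op" comes before "clone"
good = group_args("op monitor start-delay=10s clone interleave=true".split())
```

In this model, `bad["op"]` is empty and `start-delay=10s` lands in `bad["clone"]`, while `good` keeps `monitor start-delay=10s` under `op` and `interleave=true` under `clone`.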

Comment 6 Dave Cain 2016-01-11 16:19:57 UTC

Pull request to fix this bug here:
https://github.com/redhat-openstack/astapor/pull/567

Comment 8 Mike Orazi 2016-01-12 14:41:44 UTC

David, 

If we are able to produce an updated rpm would you be able to help verify?

Comment 9 Dave Cain 2016-01-12 14:50:19 UTC

Hi Mike,

Sure can, and would be happy to kick off a new build to verify, assuming 1290684 is patched too. That bug and this one prevent deployments from occurring.

Comment 10 Jason Guiditta 2016-01-12 16:42:51 UTC

Merged

Comment 12 Mark Davidson 2016-01-16 01:04:28 UTC

The patch above that removes the 'op' word means that any options passed that should be 'op' actions and parameters are now clone parameters.

It will allow the install, but the pacemaker resource setup will be wrong.

You need to move the 'op' setup before the 'clone' setup (because pcs has some issues parsing its input).

The version of astapor I am using also sometimes sends the clone parameter via the resource_params, so I stopped using operation_opts and added 'op xxx xxx=yy' into the resource_params wherever I was calling generic.pp.
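
As a rough sketch of the ordering recommended above, the following Python helper assembles a `pcs resource create` command line with the `op` section always placed before `clone`. The helper name `build_pcs_create` and its parameters are hypothetical and only illustrate the ordering; this is not how generic.pp in astapor is implemented.

```python
# Hypothetical helper: builds the argument list for "pcs resource create"
# so that operation options always precede the clone options.

def build_pcs_create(name, agent, op_opts=None, clone_opts=None):
    """Return the command as a list of arguments."""
    cmd = ["/usr/sbin/pcs", "resource", "create", name, agent]
    if op_opts:
        # Keep the "op" keyword; dropping it would turn these
        # options into plain resource or clone parameters.
        cmd += ["op"] + list(op_opts)
    if clone_opts is not None:
        # "clone" goes last, so everything after it is a clone option.
        cmd += ["clone"] + list(clone_opts)
    return cmd


cmd = build_pcs_create(
    "httpd",
    "systemd:httpd",
    op_opts=["monitor", "start-delay=10s"],
    clone_opts=["interleave=true"],
)
command_line = " ".join(cmd)
```

Here `command_line` matches the reordered command from Comment 5, with `op monitor start-delay=10s` ahead of `clone interleave=true`.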

Comment 13 Dave Cain 2016-01-28 17:54:29 UTC

I was able to test this with the assistance of a colleague of mine and can say that builds are successful now with the rhel-osp-installer.  Builds no longer fail.  We even did a clean install of the database, as I seem to remember that it was required when updating the openstack-foreman-installer package.  

Not sure about Comment #12 from Mark, which has merit.  I will leave that analysis to others.

Marking bug VERIFIED.  Hope that's the right state.

Comment 15 errata-xmlrpc 2016-02-22 12:32:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0284.html
