Bug 1257414
Summary: | [HA] critical resource constraints missing from pacemaker config make things go kaboom | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> |
Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 7.0 (Kilo) | CC: | calfonso, cfeist, clasohm, dh3, dmacpher, fdinitto, jcoufal, jdonohue, jliberma, jraju, jstransk, kgaillot, mburns, mcornea, michele, mori, oblaut, ohochman, rhel-osp-director-maint, rscarazz, sjeuk |
Target Milestone: | y1 | Keywords: | Triaged |
Target Release: | 7.0 (Kilo) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-0.8.6-50.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Missing constraints between Pacemaker resources caused issues when starting or stopping the Controller cluster. This fix adds the missing constraints, so that Pacemaker resources now have the ordering relationships required to start and stop correctly.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2015-10-08 12:17:25 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1259232 | ||
Bug Blocks: | 1185030, 1191185, 1243520, 1257753, 1261487, 1262263 |
Description
Fabio Massimo Di Nitto
2015-08-27 04:50:28 UTC
(In reply to Fabio Massimo Di Nitto from comment #0)
> After a clean HA installation at least the following constraints are missing
> from the pacemaker configuration.
>
> I am not entirely sure which package is in charge here and how to get its
> version. This is using OSP7 GA.
>
> pcs constraint order start openstack-nova-novncproxy-clone then openstack-nova-api-clone
>
> pcs constraint order start rabbitmq-clone then openstack-keystone-clone
> pcs constraint order promote galera-master then openstack-keystone-clone
> pcs constraint order start haproxy-clone then openstack-keystone-clone
> pcs constraint order start memcached-clone then keystone-clone
>
> pcs constraint order promote redis-master then start openstack-ceilometer-central-clone

One minor correction:

    pcs constraint order promote redis-master then start openstack-ceilometer-central-clone

should be:

    pcs constraint order promote redis-master then start openstack-ceilometer-central-clone require-all=false

One more that is missing:

    pcs resource defaults resource-stickiness=INFINITY

---

Thanks Fabio. One of the commands fails though:

    pcs constraint order promote redis-master then start openstack-ceilometer-central-clone require-all=false
    Adding redis-master openstack-ceilometer-central-clone (kind: Mandatory) (Options: require-all=false first-action=promote then-action=start)
    Error: Unable to update cib
    Call cib_replace failed (-203): Update does not conform to the configured schema

After some googling I was able to do this:

    pcs constraint order promote redis-master then start openstack-ceilometer-central-clone
    pcs constraint order set redis-master openstack-ceilometer-central-clone require-all=false

Is it the same thing? Can the original command be somehow altered to make it work just with `pcs constraint order` (not `pcs constraint order set`)? I'm asking because puppet-pacemaker currently doesn't support `pcs constraint order set`.

---

(In reply to Jiri Stransky from comment #7)
I am not sure they are the same thing. Chris, can you confirm? Jiri: can you please verify the pcs version deployed on those nodes?

---

The commands are not identical:

    # pcs constraint order promote redis-master then start openstack-ceilometer-central-clone

This says: start openstack-ceilometer-central-clone after (and only after) promoting redis-master. require-all is an option for resource sets, which is why you get an error when attempting to use it with a normal order constraint.

    pcs constraint order set redis-master openstack-ceilometer-central-clone require-all=false

operates with resource sets, and using require-all=false really only makes sense if you have multiple sets. So for this command to be useful you'd have several resources in a few different sets (my guess is that this isn't what you're looking for). There's a description of how ordered sets with require-all work here:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_resource_set_or_logic

What do you want to happen with that command?
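The distinction drawn above shows up directly in the CIB XML each form generates. Below is a minimal sketch of the two shapes; the fragments are hand-written to illustrate the schema (the resource names follow this bug, but the set-constraint IDs are hypothetical, not captured from a live cluster):

```shell
# A plain ordering constraint, as created by:
#   pcs constraint order promote redis-master then start openstack-ceilometer-central-clone
# renders as a single rsc_order element with attributes:
cat > /tmp/plain.xml <<'EOF'
<rsc_order first="redis-master" first-action="promote"
           id="order-redis-master-openstack-ceilometer-central-clone-mandatory"
           then="openstack-ceilometer-central-clone" then-action="start"/>
EOF

# A set-based constraint, as created by:
#   pcs constraint order set redis-master openstack-ceilometer-central-clone require-all=false
# nests the resources inside a resource_set element instead
# (IDs below are illustrative):
cat > /tmp/set.xml <<'EOF'
<rsc_order id="pcs_rsc_order_set_redis_ceilometer">
  <resource_set id="pcs_rsc_set_redis_ceilometer" require-all="false">
    <resource_ref id="redis-master"/>
    <resource_ref id="openstack-ceilometer-central-clone"/>
  </resource_set>
</rsc_order>
EOF

# The set form contains a resource_set; the plain form does not,
# which is why require-all behaves differently between the two.
grep -q '<resource_set' /tmp/set.xml && ! grep -q '<resource_set' /tmp/plain.xml \
  && echo "set form uses resource_set; plain form does not"
```

This is why require-all was rejected on the plain form by the schema in use here: in the documented schema it belongs on `resource_set`, per the ClusterLabs link above.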
And what effect were you hoping require-all would have when added to the resource constraint?

---

This was an upstream environment with Fedora 21, pcs version 0.9.137.

If I understand correctly what Chris wrote, it looks like we don't need the resource set, because this would be the only resource set defined anyway. So I guess I should add the constraint like this:

    pcs constraint order promote redis-master then start openstack-ceilometer-central-clone

(without require-all=false).

---

(In reply to Jiri Stransky from comment #15)
No, we need the require-all=false, and it has to be tested in RHEL 7. A failure in Fedora is not valuable input here. Without require-all=false, all ceilometer and heat services will run A/P, which is not what we want.

---

Submitted a patch to puppet-pacemaker to add support for pcs resource defaults:

https://github.com/redhat-openstack/puppet-pacemaker/pull/61

After this gets plus'ed on review, I'll file a BZ for a backport to OPM.

Another 2 patches were submitted to t-h-t (the 4 keystone constraints, and the resource-stickiness default).

The fourth patch (the vncproxy constraint) will just uncomment a bit that is commented out upstream due to missing dependencies for vncproxy. Along with ordering, it will also add colocation with nova-api. This might be a downstream-only patch for the time being (I'll submit it upstream too, but it might not get merged very soon).

Regarding the fifth patch (redis/ceilometer), the constraint only worked on RHEL 7 as Fabio suggested, so the best solution I can think of is to make a condition in Puppet based on the OS: RHEL/CentOS would create the constraint with "require-all=false", and Fedora without. This would allow us to keep upstream/downstream from diverging, on the code level at least.

---

(In reply to Jiri Stransky from comment #17)
Chris and I identified that the version of pacemaker in Fedora might be old, and that's why it failed. We will work with Andrew to have a new pacemaker build, but it should work.

---

Does this need to be tested?

---

(In reply to jliberma from comment #19)
> Does this need to be tested?

What exactly? I am not sure what you are referring to.

---

require-all=false in RHEL 7:

https://bugzilla.redhat.com/show_bug.cgi?id=1257414#c16

---

(In reply to jliberma from comment #21)
> require-all=false in RHEL 7

Oh I see, my comment was not clear. We already use require-all=false for other resources in this deployment.
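The per-OS branching Jiri describes can be sketched in plain shell for clarity. The real fix lives in Puppet/t-h-t, not in a script like this, and `constraint_opts` is a hypothetical helper introduced only for illustration:

```shell
# Hypothetical helper: given an OS ID (as in /etc/os-release),
# decide which extra option the redis/ceilometer ordering
# constraint should carry.
constraint_opts() {
  case "$1" in
    # RHEL/CentOS pcs supported require-all on plain orderings,
    # so the constraint gets it there.
    rhel|centos) echo "require-all=false" ;;
    # Fedora (with the older pacemaker at the time) gets the
    # constraint without the option.
    *)           echo "" ;;
  esac
}

echo "rhel:   $(constraint_opts rhel)"
echo "fedora: $(constraint_opts fedora)"
```

The point of the design is that the constraint definition itself stays identical upstream and downstream, with only the option string varying by OS.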
My "it has to be tested in RHEL7" was referring to the code/fix that has been developed. A failure in Fedora is less relevant to what we deliver to customers, and we can investigate why the original fix failed at a later stage.

---

All needed patches are submitted upstream (5 of them; the OPM one is merged already). I'm hoping to have bandwidth tomorrow to give all of these a test on a downstream environment.

Fabio, is there a way to verify that the constraint indeed has the require-all=false option set? `pcs constraint order show --full` only shows kind and ID, not options. I know it's printed when creating the constraint, but I don't have that output available when the constraint is created via Puppet. Maybe it would be visible in `crm_<something>`?

Thanks for all the help.

---

(In reply to Jiri Stransky from comment #24)

    [root@overcloud-controller-0 ~]# pcs constraint order promote redis-master then start openstack-ceilometer-central-clone require-all=false
    Adding redis-master openstack-ceilometer-central-clone (kind: Mandatory) (Options: require-all=false first-action=promote then-action=start)
    [root@overcloud-controller-0 ~]# echo $?
    0

That's all you need; pcs would return != 0 on error. For 7.1, pcs does not display options such as require-all, but it's already fixed in 7.2.

7.1 output:

    promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory) (id:order-redis-master-openstack-ceilometer-central-clone-mandatory)

and you can verify (though it's redundant if you check return codes):

    [root@overcloud-controller-0 ~]# pcs cluster cib | grep require
    <rsc_order first="redis-master" first-action="promote" id="order-redis-master-openstack-ceilometer-central-clone-mandatory" require-all="false" then="openstack-ceilometer-central-clone" then-action="start"/>

For 7.2 you will see something similar to:

    [root@rhel7-ha-node4 ~]# pcs config | grep require
    start clusterfs-clone then start webfarm-clone (kind:Mandatory) (Options: require-all=false) (id:order-clusterfs-clone-webfarm-clone-mandatory)

---

Applied the patches on a downstream environment and deployed; all seems well:

    [root@overcloud-controller-0 ~]# pcs constraint show | grep 'then start openstack-keystone'
    start memcached-clone then start openstack-keystone-clone (kind:Mandatory)
    start rabbitmq-clone then start openstack-keystone-clone (kind:Mandatory)
    promote galera-master then start openstack-keystone-clone (kind:Mandatory)
    start haproxy-clone then start openstack-keystone-clone (kind:Mandatory)
    [root@overcloud-controller-0 ~]# pcs constraint show | grep 'then start openstack-ceilometer-central'
    start mongod-clone then start openstack-ceilometer-central-clone (kind:Mandatory)
    start openstack-keystone-clone then start openstack-ceilometer-central-clone (kind:Mandatory)
    promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory)
    [root@overcloud-controller-0 ~]# pcs constraint show | grep 'then start openstack-nova-api'
    start openstack-nova-novncproxy-clone then start openstack-nova-api-clone (kind:Mandatory)
    [root@overcloud-controller-0 ~]# pcs resource defaults
    resource-stickiness: INFINITY
    [root@overcloud-controller-0 ~]# pcs cluster cib | grep require-all
    <rsc_order first="redis-master" first-action="promote" id="order-redis-master-openstack-ceilometer-central-clone-mandatory" require-all="false" then="openstack-ceilometer-central-clone" then-action="start"/>
    [root@overcloud-controller-0 ~]# pcs status | grep -i stopped

New constraints are marked with "here >>":

    [root@overcloud-controller-0 ~]# pcs constraint order show --full
    Ordering Constraints:
      start ip-10.19.104.11 then start haproxy-clone (kind:Optional) (id:order-ip-10.19.104.11-haproxy-clone-Optional)
      start ip-192.168.0.6 then start haproxy-clone (kind:Optional) (id:order-ip-192.168.0.6-haproxy-clone-Optional)
      start ip-192.168.201.10 then start haproxy-clone (kind:Optional) (id:order-ip-192.168.201.10-haproxy-clone-Optional)
      start ip-10.19.184.201 then start haproxy-clone (kind:Optional) (id:order-ip-10.19.184.201-haproxy-clone-Optional)
      start ip-10.19.104.10 then start haproxy-clone (kind:Optional) (id:order-ip-10.19.104.10-haproxy-clone-Optional)
      start ip-10.19.105.10 then start haproxy-clone (kind:Optional) (id:order-ip-10.19.105.10-haproxy-clone-Optional)
      here >> start memcached-clone then start openstack-keystone-clone (kind:Mandatory) (id:order-memcached-clone-openstack-keystone-clone-mandatory)
      start mongod-clone then start openstack-ceilometer-central-clone (kind:Mandatory) (id:order-mongod-clone-openstack-ceilometer-central-clone-mandatory)
      start openstack-glance-registry-clone then start openstack-glance-api-clone (kind:Mandatory) (id:order-openstack-glance-registry-clone-openstack-glance-api-clone-mandatory)
      here >> start rabbitmq-clone then start openstack-keystone-clone (kind:Mandatory) (id:order-rabbitmq-clone-openstack-keystone-clone-mandatory)
      start openstack-heat-api-clone then start openstack-heat-api-cfn-clone (kind:Mandatory) (id:order-openstack-heat-api-clone-openstack-heat-api-cfn-clone-mandatory)
      start delay-clone then start openstack-ceilometer-alarm-evaluator-clone (kind:Mandatory) (id:order-delay-clone-openstack-ceilometer-alarm-evaluator-clone-mandatory)
      start openstack-keystone-clone then start openstack-ceilometer-central-clone (kind:Mandatory) (id:order-openstack-keystone-clone-openstack-ceilometer-central-clone-mandatory)
      start openstack-keystone-clone then start openstack-glance-registry-clone (kind:Mandatory) (id:order-openstack-keystone-clone-openstack-glance-registry-clone-mandatory)
      start openstack-keystone-clone then start openstack-cinder-api-clone (kind:Mandatory) (id:order-openstack-keystone-clone-openstack-cinder-api-clone-mandatory)
      start openstack-cinder-scheduler-clone then start openstack-cinder-volume (kind:Mandatory) (id:order-openstack-cinder-scheduler-clone-openstack-cinder-volume-mandatory)
      here >> promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory) (id:order-redis-master-openstack-ceilometer-central-clone-mandatory)
      start openstack-nova-scheduler-clone then start openstack-nova-conductor-clone (kind:Mandatory) (id:order-openstack-nova-scheduler-clone-openstack-nova-conductor-clone-mandatory)
      start openstack-nova-consoleauth-clone then start openstack-nova-novncproxy-clone (kind:Mandatory) (id:order-openstack-nova-consoleauth-clone-openstack-nova-novncproxy-clone-mandatory)
      start neutron-l3-agent-clone then start neutron-metadata-agent-clone (kind:Mandatory) (id:order-neutron-l3-agent-clone-neutron-metadata-agent-clone-mandatory)
      here >> start openstack-nova-novncproxy-clone then start openstack-nova-api-clone (kind:Mandatory) (id:order-openstack-nova-novncproxy-clone-openstack-nova-api-clone-mandatory)
      start openstack-heat-api-cloudwatch-clone then start openstack-heat-engine-clone (kind:Mandatory) (id:order-openstack-heat-api-cloudwatch-clone-openstack-heat-engine-clone-mandatory)
      start openstack-ceilometer-notification-clone then start openstack-heat-api-clone (kind:Mandatory) (id:order-openstack-ceilometer-notification-clone-openstack-heat-api-clone-mandatory)
      start openstack-keystone-clone then start neutron-server-clone (kind:Mandatory) (id:order-openstack-keystone-clone-neutron-server-clone-mandatory)
      start neutron-dhcp-agent-clone then start neutron-l3-agent-clone (kind:Mandatory) (id:order-neutron-dhcp-agent-clone-neutron-l3-agent-clone-mandatory)
      start openstack-ceilometer-alarm-notifier-clone then start openstack-ceilometer-notification-clone (kind:Mandatory) (id:order-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-notification-clone-mandatory)
      start openstack-keystone-clone then start openstack-nova-consoleauth-clone (kind:Mandatory) (id:order-openstack-keystone-clone-openstack-nova-consoleauth-clone-mandatory)
      start openstack-nova-api-clone then start openstack-nova-scheduler-clone (kind:Mandatory) (id:order-openstack-nova-api-clone-openstack-nova-scheduler-clone-mandatory)
      start openstack-heat-api-cfn-clone then start openstack-heat-api-cloudwatch-clone (kind:Mandatory) (id:order-openstack-heat-api-cfn-clone-openstack-heat-api-cloudwatch-clone-mandatory)
      start neutron-server-clone then start neutron-ovs-cleanup-clone (kind:Mandatory) (id:order-neutron-server-clone-neutron-ovs-cleanup-clone-mandatory)
      start neutron-openvswitch-agent-clone then start neutron-dhcp-agent-clone (kind:Mandatory) (id:order-neutron-openvswitch-agent-clone-neutron-dhcp-agent-clone-mandatory)
      start openstack-ceilometer-api-clone then start delay-clone (kind:Mandatory) (id:order-openstack-ceilometer-api-clone-delay-clone-mandatory)
      here >> promote galera-master then start openstack-keystone-clone (kind:Mandatory) (id:order-galera-master-openstack-keystone-clone-mandatory)
      start openstack-cinder-api-clone then start openstack-cinder-scheduler-clone (kind:Mandatory) (id:order-openstack-cinder-api-clone-openstack-cinder-scheduler-clone-mandatory)
      start neutron-netns-cleanup-clone then start neutron-openvswitch-agent-clone (kind:Mandatory) (id:order-neutron-netns-cleanup-clone-neutron-openvswitch-agent-clone-mandatory)
      start openstack-ceilometer-central-clone then start openstack-ceilometer-collector-clone (kind:Mandatory) (id:order-openstack-ceilometer-central-clone-openstack-ceilometer-collector-clone-mandatory)
      here >> start haproxy-clone then start openstack-keystone-clone (kind:Mandatory) (id:order-haproxy-clone-openstack-keystone-clone-mandatory)
      start neutron-ovs-cleanup-clone then start neutron-netns-cleanup-clone (kind:Mandatory) (id:order-neutron-ovs-cleanup-clone-neutron-netns-cleanup-clone-mandatory)
      start openstack-ceilometer-collector-clone then start openstack-ceilometer-api-clone (kind:Mandatory) (id:order-openstack-ceilometer-collector-clone-openstack-ceilometer-api-clone-mandatory)
      start openstack-keystone-clone then start openstack-heat-api-clone (kind:Mandatory) (id:order-openstack-keystone-clone-openstack-heat-api-clone-mandatory)
      start openstack-ceilometer-alarm-evaluator-clone then start openstack-ceilometer-alarm-notifier-clone (kind:Mandatory) (id:order-openstack-ceilometer-alarm-evaluator-clone-openstack-ceilometer-alarm-notifier-clone-mandatory)
    [root@overcloud-controller-0 ~]#
    [root@overcloud-controller-0 ~]# pcs resource defaults
    resource-stickiness: INFINITY

Thanks for the help Jiri Stransky

---

Verified with:

    [root@undercloud ~]# rpm -qa | grep openstack-tripleo-heat-templates
    openstack-tripleo-heat-templates-0.8.6-62.el7ost.noarch

---

Since this bug is really bad and we want to be on the safe side, Fabio asked for another pair of eyes to make sure the verification described in comment #29 is valid. If more work is needed, please specify the steps and reopen the bug or return it to ON_QA.

---

Omri, thanks for the ping.
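The verification above boils down to checking that each of the new orderings is present. A small sketch of that check follows; the here-doc is a stand-in populated from the verification output in this bug so the script runs anywhere, but on a real controller you would instead capture `pcs constraint order show --full > /tmp/constraints.txt`:

```shell
# Stand-in dump of the relevant constraint lines (copied from the
# verification output in this bug report).
cat > /tmp/constraints.txt <<'EOF'
start memcached-clone then start openstack-keystone-clone (kind:Mandatory)
start rabbitmq-clone then start openstack-keystone-clone (kind:Mandatory)
promote galera-master then start openstack-keystone-clone (kind:Mandatory)
start haproxy-clone then start openstack-keystone-clone (kind:Mandatory)
promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory)
start openstack-nova-novncproxy-clone then start openstack-nova-api-clone (kind:Mandatory)
EOF

# Check that every ordering this bug adds appears in the dump.
missing=0
for c in \
  "start memcached-clone then start openstack-keystone-clone" \
  "start rabbitmq-clone then start openstack-keystone-clone" \
  "promote galera-master then start openstack-keystone-clone" \
  "start haproxy-clone then start openstack-keystone-clone" \
  "promote redis-master then start openstack-ceilometer-central-clone" \
  "start openstack-nova-novncproxy-clone then start openstack-nova-api-clone"
do
  grep -qF "$c" /tmp/constraints.txt || { echo "MISSING: $c"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all expected constraints present"
```

Note that this only confirms the orderings exist; as discussed above, the require-all=false option on the redis-master ordering is only visible in the CIB XML (`pcs cluster cib | grep require-all`) on 7.1.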
So Raoul and I did verify that starting with openstack-tripleo-heat-templates-0.8.6-50.el7ost the constraints are all correct and that we're on par with the refarch here:

https://github.com/beekhof/osp-ha-deploy/

(modulo two different bugs: https://bugzilla.redhat.com/show_bug.cgi?id=1262425 and https://bugzilla.redhat.com/show_bug.cgi?id=1262409)

We have not done any functional tests, and are currently relying on the work of the refarch. We will start the functional part in the next weeks though, so we will be better equipped in the future. So, I'll leave this on VERIFIED.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1862

---

To clarify the issue with "require-all=false" raised in the comments: require-all is only documented with respect to resource sets, but it is an undocumented option for ordering constraints involving clones. It was superseded by the clone-min=1 option, which is equivalent.

RHEL 7.1 has supported require-all since a z-stream released on 2015-03-05; RHEL 7.2 was released with require-all support. Both support clone-min since a z-stream released after 2015-07-22. No earlier versions support either option, although 6.8 will support both at release. I believe support for both options was added to Fedora 21+ sometime around October 2015.