Bug 1423001 - Openstack Director updates to an even number of dedicated Ceph monitor nodes
Summary: Openstack Director updates to an even number of dedicated Ceph monitor nodes
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 11.0 (Ocata)
Hardware: x86_64
OS: Linux
Target Milestone: beta
Target Release: 11.0 (Ocata)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
Depends On:
Reported: 2017-02-16 19:59 UTC by Yogev Rabl
Modified: 2017-05-17 20:00 UTC
CC List: 12 users

Fixed In Version: openstack-tripleo-validations-5.4.0-6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-05-17 20:00:07 UTC
Target Upstream Version:


System ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit 436194 | 0 | None | None | None | 2017-02-23 16:38:52 UTC
Red Hat Product Errata RHEA-2017:1245 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory | 2017-05-17 23:01:50 UTC

Description Yogev Rabl 2017-02-16 19:59:34 UTC
Description of problem:
Openstack Director's templates were set to deploy the Ceph monitor as the only service on block-storage nodes. An overcloud update that grew the cluster from 3 block-storage nodes passed, and the director created 4 monitors in quorum in the cluster.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy an overcloud with a dedicated Ceph monitor node (as described in Bug 1232958).
2. Add another node to the deployment command and update the overcloud.

Actual results:
The update creates another Ceph monitor and adds it to the quorum.

Expected results:
OSP director won't run the update and instead shows a message: cannot deploy an even number of Ceph monitor services.
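A pre-deployment check of the kind requested here could be sketched as follows. This is a hypothetical helper, not the actual openstack-tripleo-validations code; the function name and messages are illustrative only:

```python
def validate_mon_count(mon_count):
    """Return a list of warnings for a requested Ceph monitor count.

    Hypothetical sketch of the validation requested in this bug; the
    real openstack-tripleo-validations checks differ.
    """
    warnings = []
    if mon_count < 3:
        warnings.append(
            "%d monitor(s) cannot provide HA; at least 3 are required."
            % mon_count)
    elif mon_count % 2 == 0:
        warnings.append(
            "An even number of monitors (%d) tolerates no more failures "
            "than %d would; use an odd count." % (mon_count, mon_count - 1))
    return warnings
```

Run before the stack update, such a check would flag the 3-to-4 monitor transition described in this report while still allowing odd counts through.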

Additional info:

Comment 2 Alex Schultz 2017-02-16 22:18:37 UTC
We don't support this kind of validation for roles or node counts, so this would need to be an RFE.

Comment 3 Giulio Fidente 2017-02-17 08:44:59 UTC
Yogev, from the 'ceph status' output you sent me, we appear to have completed the update successfully, producing a 4-node cluster in a healthy state. So I agree it would be better to use an uneven number of monitors, but Ceph itself doesn't prevent you from using an even number, so I am not sure director should.

Comment 4 Yogev Rabl 2017-02-20 15:28:05 UTC
In addition to the description:
A fresh deployment with two dedicated Ceph monitor nodes ended successfully, with both of them in quorum.
The templates were set to use the block-storage node as a dedicated Ceph monitor node. The Overcloud topology was:
 - 3 Control nodes
 - 2 Block storage nodes
 - 3 Ceph storage nodes (each with 10 OSDs) 
 - 2 Compute nodes
The deployment started without any warning or other indication that there would be an even number of Ceph monitors in the cluster.

Comment 5 Giulio Fidente 2017-02-20 21:26:19 UTC
I am adding a warning message in the post-deployment validations printed if the cluster is in HEALTH_WARN state.

If Ceph returns HEALTH_OK with two or any other even number of ceph-mon instances, I don't think we should stop deployment of an even number of nodes in TripleO.
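The post-deployment warning described above could be implemented along these lines. This is a minimal sketch, not the actual validation shipped in openstack-tripleo-validations; it assumes the input is the output of `ceph status --format json`, whose health key layout changed across Ceph releases:

```python
import json


def ceph_health_warning(status_json):
    """Return a warning string if the cluster is not HEALTH_OK, else None.

    status_json is assumed to be the output of `ceph status --format json`;
    both known locations of the health state are checked, since the layout
    moved between Ceph releases.
    """
    status = json.loads(status_json)
    health = status.get("health", {})
    state = health.get("status") or health.get("overall_status")
    if state != "HEALTH_OK":
        return "Ceph cluster reports %s; inspect 'ceph status' output." % state
    return None
```

A post-install task would feed the live `ceph status` output into such a function and print the returned message, which matches the warning-only behaviour chosen here.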

Comment 7 Kyle Bader 2017-03-27 17:09:34 UTC
The problem isn't so much even or odd; it's that three monitors are required for HA. A transitional state of 4 monitors is not problematic. The problem arises during leader election (Paxos). There are situations where you would want to have an even number:

* Scaling from 3 monitors to an eventual 5
* Provisioning a 4th monitor with the intention of retiring an old monitor
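The quorum arithmetic behind this point: a Paxos majority needs floor(n/2)+1 monitors, so the number of failures tolerated is n minus that majority, which comes out the same for 3 and 4 monitors. A small illustration (the helper name is made up for this sketch):

```python
def mon_fault_tolerance(n):
    """Monitors that can fail while a majority (Paxos) quorum survives."""
    majority = n // 2 + 1
    return n - majority

# 3 and 4 monitors both tolerate a single failure: the 4th monitor adds
# quorum overhead without adding resilience, which is why odd counts are
# recommended outside the transitional cases listed above.
```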

When an operator goes to deploy an HA OSP control plane, is this something that is enforced programmatically?

For example, if HA OSP requires 3 controller nodes, and the templates contain 2, do we block installation?

If we do block installation, then we should have a way of enforcing similar requirements for other components (e.g., Ceph).

Comment 8 Giulio Fidente 2017-03-27 19:19:37 UTC
Hi Kyle, thanks for commenting.

Currently OSPd does not enforce (or block) the deployment of a specific number of MONs, OSDs, or even MDSs. Instead, this BZ adds a post-install validation task that prints a warning message if the Ceph cluster is not in HEALTH_OK at the end of the deployment.

My goal is to make OSPd more verbose when Ceph is in a warning state; given that deploying an even number of monitors doesn't even trigger a warning in Ceph, I don't think OSPd should prevent it.

Comment 12 Yogev Rabl 2017-04-19 02:26:45 UTC
verified on openstack-tripleo-validations-5.4.0-7.el7ost.noarch

Comment 13 errata-xmlrpc 2017-05-17 20:00:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

