Description of problem:

A customer reported a race condition related to Pacemaker during deployment of the Overcloud controllers. Unfortunately, logs were not retained at the time, and they cannot afford to leave the Overcloud sitting idle for investigation, as the deployment is already behind its deadlines.

What was observed: the Overcloud deployment failed because, on one of the three controllers, Puppet was unable to create the pacemaker::resource::filesystem resource used by Glance (NFS driver). As part of the deployment process, the following Puppet fragment is run:

    if $glance_backend == 'file' and hiera('glance_file_pcmk_manage', false) {
      pacemaker::resource::filesystem { "glance-fs":
        device       => hiera('glance_file_pcmk_device'),
        directory    => hiera('glance_file_pcmk_directory'),
        fstype       => hiera('glance_file_pcmk_fstype'),
        fsoptions    => hiera('glance_file_pcmk_options', ''),
        clone_params => '',
      }
    }

This Puppet fragment failed on just one of our three controllers. Looking at the logs of all controllers:

- On controller #0, the os-collect-config logs show that a 'pcs resource create' command for the glance-fs resource was attempted and succeeded.
- On controller #1, there is no trace of 'pcs resource create'.
- On controller #2, 'pcs resource create' was called and failed with an error claiming that the glance-fs resource already exists.

The customer's impression is that there is a race condition in the pacemaker::resource::filesystem code. Let me explain how I believe it all happened, first by showing a fragment of the pcmk_resource provider:

    Puppet::Type.type(:pcmk_resource).provide(:default) do
      desc 'A base resource definition for a pacemaker resource'

      ### overloaded methods
      def create
        ...
        # Build the 'pcs resource create' command. Check out the pcs man page :-)
        cmd = 'resource create ' + @resource[:name] + ' ' + @resource[:resource_type]
        if not_empty_string(resource_params)
          cmd += ' ' + resource_params
        end
        ...
        # do pcs create
        pcs('create', cmd)
      end
      ...

      def exists?
        cmd = 'resource show ' + @resource[:name] + ' > /dev/null 2>&1'
        pcs('show', cmd)
      end

From their perspective, the underlying resource creation logic calls the "exists?" method, and if it returns false, the "create" method is called. However, there is a race condition here. It can happen that after "exists?" returns false, the local corosync daemon replicates the resource (which was created almost at the same time on another controller), so by the time "pcs resource create" is called, the glance-fs resource already exists and the command fails as a duplicate.

A potential proper fix is to parse the "pcs resource create" output to deal with this race condition. At the moment, the customer believes the output from "pcs resource create" is never parsed to handle this case. Another approach would be to run the Puppet code that sets up Glance on only one controller node. In any case, it looks like a very annoying failure mode, so this request is about reducing this kind of race condition, as other similar ones probably exist.

How reproducible:

Rarely. Race conditions of this kind are hard to reproduce. The only currently effective workaround is redeploying the Overcloud and hoping the timing works out. The customer does not have enough time and resources to invest in full testing, so Red Hat QE is expected to cover this.
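The "parse the output" fix described above can be sketched roughly as follows. This is a hypothetical helper, not the actual puppet-pacemaker provider code; the `runner` callable stands in for whatever executes pcs and returns its exit status and stderr:

```ruby
# Hypothetical sketch: tolerate the duplicate-resource race by inspecting
# the pcs error output. Another node may have created the resource between
# our exists? check and this create call; treat that as success.
def create_resource(name, type, runner)
  status, stderr = runner.call("resource create #{name} #{type}")
  return :created if status == 0
  # Lost the race: the resource was created elsewhere. That is fine.
  return :already_exists if stderr =~ /already exists/
  raise "pcs create failed: #{stderr}"
end
```

With this approach, the "already exists" failure seen on controller #2 would be downgraded from a Puppet error to a no-op.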
Looking at how TripleO works [1], Pacemaker::Resource::Filesystem['glance-fs'] is created on all controller nodes. Indeed, this can lead to race conditions during deployment if Puppet runs on the nodes at the same time, because each node will try to create the same filesystem resource. I see 2 different options that would help avoid this issue (maybe there are more):

* set verify_on_create to true on the Pacemaker::Resource::Filesystem['glance-fs'] resource (in tripleo-heat-templates).
* manage Pacemaker::Resource::Filesystem['glance-fs'] only in the "if $pacemaker_master" block.

Either or both solutions could work, but we need some testing; I have not been able to reproduce the bug yet.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/manifests/overcloud_controller_pacemaker.pp#L637-L646
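The second option could look roughly like the following (a sketch only, assuming the $pacemaker_master variable already used elsewhere in overcloud_controller_pacemaker.pp), so that only one node ever issues the pcs create:

```puppet
# Sketch: guard the resource so only the designated master node creates it.
# The other controllers pick it up via CIB replication.
if $glance_backend == 'file' and hiera('glance_file_pcmk_manage', false) {
  if $pacemaker_master {
    pacemaker::resource::filesystem { 'glance-fs':
      device       => hiera('glance_file_pcmk_device'),
      directory    => hiera('glance_file_pcmk_directory'),
      fstype       => hiera('glance_file_pcmk_fstype'),
      fsoptions    => hiera('glance_file_pcmk_options', ''),
      clone_params => '',
    }
  }
}
```

This trades the race for a single point of creation: if the master node's Puppet run fails, no node creates the resource, but there is no duplicate-create error either.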
I think doing both is the right way to go.

For some context, when the verify_on_create code was written, it was under the assumption that only one node would attempt to create a given pcs resource (other nodes could create other pcs resources or properties at the same time; that would be OK). One would expect that if only one node was trying to create the resource, and the call to pcs create succeeded, there would be no need to verify and possibly retry. But in our experience, this turned out not to be true in rare cases. I forget the exact mechanism, but it was clearly visible in the logs that the cluster agreed the latest cib.xml should not include a resource that pcs said it had created. This is probably more likely to occur when other nodes are also editing the cluster definition through pcs calls, e.g. updating pcs properties, but I think there is a possibility of it occurring regardless. So, verify_on_create is a good idea when creating a pcs resource.

However, creating the same resource with (or without) verify_on_create on two nodes could lead to one of the nodes hitting a Puppet error, i.e. what Paul wrote is correct:

"From their perspective, the underlying resource creation logic involves calling the "exists?" method, and if it returns false, the "create" method is called. However, there is a race condition here. It could happen that by the time "exists?" returns false, the local corosync daemon replicates the resource (which was created almost at the same time on another controller), and by the time "pcs resource create" is called, the glance-fs resource already exists and the command fails as a duplicate. A potential proper fix is to parse the "pcs resource create" output to deal with this race condition. At the moment, the customer thinks the output from "pcs resource create" is never parsed to deal with this race condition."
Specifically, as verify_on_create is currently written: https://github.com/openstack/puppet-pacemaker/blob/master/lib/puppet/provider/pcmk_resource/default.rb#L152 if the pcs command fails because the resource already exists (or for any other reason), it won't even attempt to verify with "pcs resource show".
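A sketch of the missing behavior, not the code at the link above: even when the create command fails, the provider could fall back to checking whether the resource is now present in the cluster before declaring failure. The `create` and `exists` callables are hypothetical stand-ins for the pcs invocations:

```ruby
# Hypothetical sketch: verify against cluster state even when create fails.
# If another node created the resource first, the desired state is already
# satisfied and Puppet should not report an error.
def ensure_present(name, create, exists)
  return :created if create.call == 0
  # create failed -- possibly because the resource already exists.
  # Check before raising, instead of skipping verification entirely.
  return :already_present if exists.call == 0
  raise "resource #{name} could not be created and is not present"
end
```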
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
May I ask you to reconsider fixing this in OSP 7 or OSP 8? I personally don't think it's so hard that it needs to be deferred for more than 4 months.
moving needinfo to HA PM
Morning, could you give us an update about this RFE? Thanks very much.
This seems like a pacemaker config issue. Moving to HA team.
I've started testing this one but noticed that we need NFS backend so I moved it to the storage team.
I am going to verify on openstack-tripleo-heat-templates-2.0.0-44.el7ost. I was able to reproduce the problem where the fs-varlibglanceimages resource creation was triggered multiple times; it happened in 3 of 10 overcloud deploys in my environment. I was not able to reproduce it with the fixed package during 10 overcloud deploys. I did not observe a situation where the resource could be created but not yet present. The error I got when reproducing on the older package (deploy_status_code: 6):

    Error: pcs create failed: Error: unable to create resource/fence device 'fs-varlibglanceimages', 'fs-varlibglanceimages' already exists on this system
    Error: /Stage[main]/Main/Pacemaker::Resource::Filesystem[glance-fs]/Pcmk_resource[fs-varlibglanceimages]/ensure: change from absent to present failed: pcs create failed: Error: unable to create resource/fence device 'fs-varlibglanceimages', 'fs-varlibglanceimages' already exists on this system
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0470.html