
Bug 1482116

Summary: Ansible-pacemaker lacks retry logic to deal with CIB update concurrency
Product: Red Hat OpenStack
Reporter: Damien Ciabrini <dciabrin>
Component: ansible-pacemaker
Assignee: mathieu bultel <mbultel>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: high
Priority: urgent
Version: 12.0 (Pike)
CC: aherr, jschluet, mcornea
Target Milestone: beta
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ansible-pacemaker-1.0.3-0.20170929170820.1279294.el7ost openstack-tripleo-heat-templates-7.0.1-0.20170927205938.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2017-12-13 21:52:15 UTC
Type: Bug
Bug Blocks: 1481987

Description Damien Ciabrini 2017-08-16 13:41:32 UTC
Description of problem:
As detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1481987, during an Ocata -> Pike upgrade, some of the ansible steps implemented for the rabbitmq service upgrade delete the existing rabbitmq-clone [1] resource and create a new containerized resource, rabbitmq-bundle.

While the resource disabling task succeeded, the resource deletion task [2] did not, and the journal logged the following error:

Aug 15 22:48:24 messaging-0 ansible-pacemaker_resource[175956]: Invoked with check_mode=False state=delete resource=rabbitmq timeout=300 wait_for_resource=True
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Updated CIB does not validate against pacemaker-2.8 schema/dtd
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Local-only Change (client:cibadmin, call: 2): 0.78.0 (Update does not conform to the configured schema)
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Completed cib_delete operation for section //clone/primitive[@id="rabbitmq"]/..: Update does not conform to the configured schema (rc=-203, 

These logs most probably indicate that another pcs command ran on the cluster and updated the CIB while the pcs "resource delete" command issued by the ansible task was still in progress.

In this situation, ansible-pacemaker should retry the requested command (e.g. delete) to make sure that it succeeds. If the command still fails once the retries are exhausted, it should report the error to ansible appropriately.
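
To illustrate the kind of retry logic requested here, the following is a minimal sketch, not the code that ended up in ansible-pacemaker. The function name run_with_retry, its parameters, and the linear backoff are all assumptions made for illustration; the sketch retries on any non-zero exit code rather than trying to recognize a CIB conflict specifically.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, delay=1.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or the attempts are exhausted.

    Hypothetical helper: returns (succeeded, attempts_made, last_stderr).
    runner and sleep are injectable so the loop can be exercised
    without a real cluster.
    """
    last_err = ""
    for attempt in range(1, retries + 1):
        result = runner(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE, universal_newlines=True)
        if result.returncode == 0:
            return True, attempt, ""
        # Keep the most recent error so the caller can report it.
        last_err = result.stderr
        if attempt < retries:
            # Linear backoff gives a competing CIB update time to settle.
            sleep(delay * attempt)
    return False, retries, last_err
```

A caller could invoke it as run_with_retry(["pcs", "resource", "delete", "rabbitmq"]) and, when the first element of the result is False, pass the returned stderr to the Ansible module's fail_json so that the failure is visible to ansible rather than only in the journal.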

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/rabbitmq.yaml#L175
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/rabbitmq.yaml#L182


Version-Release number of selected component (if applicable):


How reproducible:
Random (when pcs commands are competing to update the CIB)

Steps to Reproduce:
1. Install Ocata
2. Upgrade to Pike

Actual results:
Some upgrade tasks may fail to execute properly if another pcs command updates the CIB at the same time.


Expected results:
The ansible pacemaker module should retry the requested action when it detects that a concurrent update to the CIB prevented the action from finishing.

Additional info:

Comment 1 Marius Cornea 2017-09-14 12:29:18 UTC
Cherry pick on stable/pike: https://review.openstack.org/#/c/504044/

Comment 4 Chris Jones 2017-10-25 09:43:41 UTC
*** Bug 1481987 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2017-12-13 21:52:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462