Bug 1482116 - Ansible-pacemaker lacks retry logic to deal with CIB update concurrency
Summary: Ansible-pacemaker lacks retry logic to deal with CIB update concurrency
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ansible-pacemaker
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: mathieu bultel
QA Contact: Marius Cornea
URL:
Whiteboard:
Duplicates: 1481987
Depends On:
Blocks: 1481987
Reported: 2017-08-16 13:41 UTC by Damien Ciabrini
Modified: 2018-02-05 19:12 UTC (History)

Fixed In Version: ansible-pacemaker-1.0.3-0.20170929170820.1279294.el7ost openstack-tripleo-heat-templates-7.0.1-0.20170927205938.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 21:52:15 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gerrithub.io | 375982 | 0 | None | None | None | 2017-08-28 15:18:06 UTC
OpenStack gerrit | 498499 | 0 | None | None | None | 2017-08-28 15:28:18 UTC
OpenStack gerrit | 504044 | 0 | None | None | None | 2017-10-10 14:16:10 UTC
Red Hat Product Errata | RHEA-2017:3462 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 12.0 Enhancement Advisory | 2018-02-16 01:43:25 UTC

Description Damien Ciabrini 2017-08-16 13:41:32 UTC
Description of problem:
As detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1481987, during an Ocata -> Pike upgrade, some of the ansible steps implemented for the rabbitmq service upgrade delete the existing rabbitmq-clone resource [1] and create a new containerized resource, rabbitmq-bundle.

While the resource disabling task succeeded, the resource deletion task [2] did not, and the journal logged the following error:

Aug 15 22:48:24 messaging-0 ansible-pacemaker_resource[175956]: Invoked with check_mode=False state=delete resource=rabbitmq timeout=300 wait_for_resource=True
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Updated CIB does not validate against pacemaker-2.8 schema/dtd
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Local-only Change (client:cibadmin, call: 2): 0.78.0 (Update does not conform to the configured schema)
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Completed cib_delete operation for section //clone/primitive[@id="rabbitmq"]/..: Update does not conform to the configured schema (rc=-203, 

These logs most likely indicate that another pcs command ran on the cluster and updated the CIB while the "resource delete" pcs command issued by the ansible task was still in progress.

In that situation, ansible-pacemaker should retry the requested command (e.g. delete) so that it eventually succeeds. If the command still fails once the retries are exhausted, it should report the error to ansible.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/rabbitmq.yaml#L175
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/pacemaker/rabbitmq.yaml#L182
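
The retry behaviour requested above could look roughly like the sketch below. It is not the actual ansible-pacemaker fix. The helper name run_with_retry, the retries and delay parameters, and their defaults are hypothetical. The only assumption about `module` is that it behaves like Ansible's AnsibleModule: run_command(cmd) returns (rc, stdout, stderr) and fail_json(msg=...) reports a failure to Ansible.

```python
import time


def run_with_retry(module, cmd, retries=5, delay=2.0, sleep=time.sleep):
    """Run cmd via module.run_command, retrying on a non-zero exit code.

    Hypothetical helper: `module` is any object exposing run_command(cmd)
    -> (rc, stdout, stderr) and fail_json(msg=...), as AnsibleModule does.
    """
    rc, out, err = 1, "", ""
    for attempt in range(1, retries + 1):
        rc, out, err = module.run_command(cmd)
        if rc == 0:
            return out
        if attempt < retries:
            # Give the competing CIB update time to settle before retrying.
            sleep(delay)
    # Every attempt failed: surface the last error instead of hiding it.
    module.fail_json(
        msg="%s failed after %d attempts (rc=%d): %s"
        % (" ".join(cmd), retries, rc, err.strip())
    )
```

With a real AnsibleModule, a call such as run_with_retry(module, ["pcs", "resource", "delete", "rabbitmq"]) would either return the command output or end the task through fail_json, so a lost race against another CIB writer no longer passes silently.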


Version-Release number of selected component (if applicable):


How reproducible:
Random (when pcs commands are competing to update the CIB)

Steps to Reproduce:
1. Install Ocata
2. Upgrade to Pike

Actual results:
Some upgrade tasks may fail to execute properly if another pcs command updates the CIB at the same time.


Expected results:
The ansible pacemaker module should retry the requested action when it detects that a concurrent update to the CIB prevented the action from finishing.
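
One possible way to "detect" a concurrent update is to classify a failure as transient before retrying. The sketch below is an illustration under a loud assumption: that the schema-validation message seen in the cib journal entries of this report also shows up in the output of the failing command. That has not been verified here, and the names TRANSIENT_MARKERS and is_transient_cib_error are hypothetical.

```python
# Assumption: this message, copied from the cib journal entries in this
# report, also appears in the failing command's output. Verify before use.
TRANSIENT_MARKERS = (
    "Update does not conform to the configured schema",
)


def is_transient_cib_error(rc, output):
    """Return True when a failed command looks like a lost CIB update race."""
    if rc == 0:
        return False  # the command succeeded, nothing to retry
    return any(marker in output for marker in TRANSIENT_MARKERS)
```

Retrying only when this predicate is true keeps genuine errors (for example, a resource that does not exist) from being retried needlessly.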

Additional info:

Comment 1 Marius Cornea 2017-09-14 12:29:18 UTC
Cherry pick on stable/pike: https://review.openstack.org/#/c/504044/

Comment 4 Chris Jones 2017-10-25 09:43:41 UTC
*** Bug 1481987 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2017-12-13 21:52:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

