Bug 1463033 - attrd_updater returns before the update is applied anywhere
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
 
Reported: 2017-06-19 20:52 EDT by Andrew Beekhof
Modified: 2017-08-01 12:23 EDT
CC List: 3 users

Type: Bug

Attachments: None

Description Andrew Beekhof 2017-06-19 20:52:54 EDT
Description of problem:

A timing window exists whereby attrd_updater -QA may fail to report values that were previously set by attrd_updater -U.

Version-Release number of selected component (if applicable):

all

How reproducible:

Unclear; timing- and hardware-dependent. Reported by several customers.

Steps to Reproduce:
1. Set a value with attrd_updater -U.
2. Immediately query the same attribute with attrd_updater -QA (see the example below).
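
For illustration, a minimal shell reproduction along these lines (the attribute name "test_attr" and the value are made up for the example; whether a given run actually hits the window depends on timing):

  # set a node attribute, then immediately query it on all nodes
  attrd_updater -n test_attr -U 1
  attrd_updater -n test_attr -QA   # may intermittently not report test_attr yet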

Actual results:

Value not found

Expected results:

Value is always visible

Additional info:


The current sequence is:

t1. receive message over local IPC
t2. send local ACK
t3. forward message over CPG (to peers and itself)
t4. all nodes get the CPG message and apply the update

CPG ensures that all active peers (including ourselves) receive the message, and in the same order (any peer that cannot is evicted); however, the timing depends on the (corosync) token timeouts.
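
For reference, the token timeout corosync is actually using can be inspected with corosync-cmapctl; treat this as a sketch, since the exact cmap key can vary by corosync version:

  # runtime token timeout in milliseconds (corosync 2.x)
  corosync-cmapctl -g runtime.config.totem.token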

The minimum bar is to send the ack only after the update has been applied locally. While this would make the window much shorter, it would still exist, since other nodes may not yet have applied the update: querying a slow or comparatively more overloaded node could result in the same (bad) behaviour.

This may push us towards adding some kind of internal ack phase.
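
In the meantime, as an illustration of the consequence for callers (not a fix proposed in this report), a client that needs read-after-write visibility has to poll until the value shows up, roughly like this (the attribute name is hypothetical, and this assumes attrd_updater -Q exits non-zero while the attribute is not yet visible):

  # hypothetical workaround: retry the query briefly until the attribute appears
  for i in 1 2 3 4 5; do
      attrd_updater -n test_attr -QA && break
      sleep 1
  done
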
Comment 2 Ken Gaillot 2017-08-01 12:23:31 EDT
Due to capacity constraints, this is unlikely to be addressed in the 7.5 timeframe.
