Bug 837289

Summary: Sending durable messages via a federation link to a destination cluster causes cluster de-sync in the long term
Product: Red Hat Enterprise MRG
Reporter: Pavel Moravec <pmoravec>
Component: qpid-cpp
Assignee: messaging-bugs <messaging-bugs>
Status: NEW
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: high
Version: 2.1
CC: jross, jthomas
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Type: Bug
Attachments: reproducer

Description Pavel Moravec 2012-07-03 11:28:00 UTC
Description of problem:
Long-term use of a federation link between two clusters over which _durable_ messages are frequently sent leads to a cluster de-sync (of the dst. cluster). See Additional info for the reasoning.


Version-Release number of selected component (if applicable):
any (verified with qpid 0.14)


How reproducible:
Depends on run time; roughly 10% for the default scenario.


Steps to Reproduce:
(Ideally, if possible, rebuild qpid with cpp/src/qpid/SessionState.cpp changed to static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64; this speeds up the reproducer very significantly.)
(Also note that you can get false bug-like results due to bz836141.)

1) Have a source broker (or cluster, it does not matter) and a destination cluster, with queue state replication of just one queue from the source to the dst. cluster.
2) On the federation route, set --ack to some low number to speed up replication (I used --ack 5); a possible setup is sketched after the trace example below.
3) Randomly produce and consume _durable_ messages on the src. broker to/from the replicated queue - ideally alternate the enqueues and dequeues as much as possible.
4) Either wait a very long time (usually millions of enqueues and dequeues) for the cluster de-sync, or monitor these trace logs on the whole dst. cluster:

trace @QPID.9593a43d-0450-4bea-b510-a315e726bb63: sent cmd 131015: {MessageAcceptBody: transfers={ [5077482,5077486] }; }

(The trace is generated by adding "log-enable=trace+:senderRecord" to /etc/qpidd.conf; the UUID is the ID of the session on the federation link.)
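
For reference only (not part of the original steps): a rough sketch of one possible setup, loosely following the MRG queue state replication tooling. The broker addresses, queue and exchange names are placeholders, and the exact plugin options and tool flags are assumptions that may differ per qpid version.

  # (optional, per the note above: rebuild qpid with SPONTANEOUS_REQUEST_INTERVAL
  #  lowered to 64 in cpp/src/qpid/SessionState.cpp to hit the issue much sooner)

  # source broker: replication event queue (replicating_listener plugin must be
  # loaded) and a test queue that generates enqueue+dequeue events
  qpidd --replication-queue replq --create-replication-queue yes
  qpid-config -a srcbroker:5672 add queue testq --durable --generate-queue-events 2

  # destination cluster: replication exchange (type "replication", provided by
  # the replication_exchange plugin)
  qpid-config -a dstnode1:5672 add exchange replication repl-ex

  # step 2: federation queue route acknowledging in small batches
  qpid-route --ack 5 queue add dstnode1:5672 srcbroker:5672 repl-ex replq

  # step 3: alternate durable enqueues and dequeues on the source queue
  while true; do
      qpid-send    -b srcbroker:5672 -a testq --durable yes --messages 10
      qpid-receive -b srcbroker:5672 -a testq --messages 10
  done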

  
Actual results:
- After some random time (i.e. around "sent cmd 98000:" or so), the session points in "transfers" start to differ among the cluster members (note that due to bz836141 they differ by 3 from the very beginning, so watch for a _different_ difference that appears at random).
- After that, there is roughly a 10% probability that after every SPONTANEOUS_REQUEST_INTERVAL-th message sent (by default after 65536, 65536*2, ...), the cluster will de-sync, logging something like:

critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: admin: confirmed < (65535+0) but only sent < (1625+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)

(The difference between "confirmed" and "sent" corresponds to the difference between the session points; see also the grep sketch below.)
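
Illustration only: the de-sync itself can be spotted by checking the dst. cluster members' logs for that error (assuming log files named like qpidd.*.log):

  # list any dst. cluster member whose log contains the de-sync error
  grep -l 'local error did not occur on all cluster members' qpidd.*.log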


Expected results:
Neither a session point difference nor a cluster de-sync.


Additional info:
- The reproduction steps use queue state replication, but the bug applies to more general use of a federation link; queue state replication just makes the reproducer simpler and faster at hitting the bug.

- The root cause of the bug is in the asynchronous completion of received messages. In particular, the session difference starts to appear when SessionState::IncompleteIngressMsgXfer::completed(bool sync) is called with sync=false on the node holding the federation link connection but with sync=true on the other nodes (which have the federation link as a shadow connection only).
This is caused by the differing asynchronous processing / storing of durable messages in the store. The asynchronous storing eventually completes the same group of messages in a different order on different nodes. Each completion triggers the async completion machinery (see the ./qpid/broker/AsyncCompletion.h file), which causes different brokers to accept (or, for shadow connections, record the accept of) different sets / ranges of messages.

This hidden asynchronicity can lead to a state where one broker sends a whole bunch of ranges in a single message.accept while another broker sends (or, for shadow connections, replays) multiple message.accept AMQP messages. This directly de-syncs the session points, which can then - with some "luck" - be revealed when the dst. broker sends a spontaneous sync request (that the other brokers do not replay, having their session point set differently).

Comment 1 Pavel Moravec 2013-02-13 16:41:49 UTC
Created attachment 696913 [details]
reproducer

1) Compile the attached cpp.
2) Run the attached script and monitor whether the "sent cmd 117328" trace log (for whatever number appears there) is the same across all qpidd.500*.log files - that directly indicates whether the bug can occur or not (hitting the bug itself is a little bit more tricky); see the sketch below.
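
Illustration only, not part of the attachment: one way to compare those trace entries across the members' logs, assuming the qpidd.500*.log naming used by the attached script:

  # compare the "transfers" ranges for one command number across all members;
  # differing ranges mean the members' session points have already diverged
  for f in qpidd.500*.log; do
      echo "== $f"
      grep 'sent cmd 117328:' "$f"
  done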