Bug 557138

Summary: Management updates not predictable for cluster.
Product: Red Hat Enterprise MRG Reporter: Alan Conway <aconway>
Component: qpid-cpp Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA QA Contact: Jan Sarenik <jsarenik>
Severity: urgent Docs Contact:
Priority: urgent    
Version: Development CC: freznice, jsarenik
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-20 11:29:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 501015    

Description Alan Conway 2010-01-20 14:48:29 UTC
Description of problem:

Management updates are triggered by a timer. They are not predictable for the cluster and so can cause cluster shut-downs and inconsistent message delivery.

We have a hack in place that suppresses exceptions when the session receives
completions for transfers not yet sent (which is the usual manifestation of the
unpredictability). I.e. we have in essence disabled consistency checking for
management sessions. This solved immediate problems but would quickly stop
working if sessions/connections could be used for management and other things
(as will be more likely with QMFv2 where using management becomes quite
straightforward).
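
A toy model of the problem described above, as a sketch only: the Member type and replay function are hypothetical and are not part of qpid-cpp. It assumes every cluster member sees the same ordered messages, while the management timer fires at a member-specific point in that stream.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one cluster member: it records, in order,
// everything it does.
struct Member {
    std::vector<std::string> log;
    void onMessage(const std::string& m) { log.push_back("msg:" + m); }
    void onManagementUpdate() { log.push_back("mgmt-update"); }
};

// Replay the same cluster-ordered messages, letting a local timer fire
// the management update just before message number `timerFiresAfter`
// (or at the end if that index is past the last message).
std::vector<std::string> replay(const std::vector<std::string>& messages,
                                std::size_t timerFiresAfter) {
    Member member;
    for (std::size_t i = 0; i < messages.size(); ++i) {
        if (i == timerFiresAfter) member.onManagementUpdate();
        member.onMessage(messages[i]);
    }
    if (timerFiresAfter >= messages.size()) member.onManagementUpdate();
    return member.log;
}
```

Two members whose timers fire at different offsets, for example replay(msgs, 1) and replay(msgs, 2), end up with different logs even though the message stream is identical. That is the kind of divergence a cluster consistency check would flag.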

Comment 1 Alan Conway 2010-01-27 22:32:18 UTC
This is fixed by the following revisions:

903826 Fix cluster elder calculation to ensure unique elder.
903869 QPID_2634 Management updates in timer create inconsistencies in a cluster.
903868 Test for management + cluster: run management tools in parallel with regular clients.
903867 Cluster implementation of PeriodicTimer.
903866 Added PeriodicTimer interface for periodic tasks that need cluster synchronization.
903864 In clustered broker: move construction of broker::Connections to the cluster dispatch thread.

It's been pointed out that the current solution is not well integrated with the existing Timer class.
A follow-up rename/refactor will be done to:
 - define a single abstract Timer interface with named tasks
 - provide two implementations, LocalTimer and ClusterTimer
 - add two broker accessors returning Timer:
   - getLocalTimer always returns a LocalTimer instance
   - getClusterTimer returns a ClusterTimer in a cluster, else the same as getLocalTimer
 - rework management timer initialization to happen after plugin init, and drop DelegatedTimer
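
The proposed shape could look roughly like the sketch below. It is not the actual qpid-cpp code: the method names add, tick and deliver, the Broker constructor argument, and the idea that a ClusterTimer runs a task when a cluster-ordered "timer fired" event is delivered are all assumptions made for illustration.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical abstract interface: tasks are registered under a name so
// that cluster members can refer to the same task.
class Timer {
public:
    using Task = std::function<void()>;
    virtual ~Timer() = default;
    virtual void add(const std::string& name, Task task) = 0;
};

// Runs tasks off the local clock; the clock is modelled by tick().
class LocalTimer : public Timer {
public:
    void add(const std::string& name, Task task) override {
        tasks[name] = std::move(task);
    }
    void tick() { for (auto& entry : tasks) entry.second(); }
private:
    std::map<std::string, Task> tasks;
};

// Assumed behaviour: a task never runs off the local clock. It runs only
// when deliver() is called for a "timer fired" event from the cluster's
// ordered event stream, so all members run it at the same point.
class ClusterTimer : public Timer {
public:
    void add(const std::string& name, Task task) override {
        tasks[name] = std::move(task);
    }
    void deliver(const std::string& name) {
        auto it = tasks.find(name);
        if (it != tasks.end()) it->second();
    }
private:
    std::map<std::string, Task> tasks;
};

// The two accessors from the list above.
class Broker {
public:
    explicit Broker(bool clustered) {
        if (clustered) clusterTimer = std::make_unique<ClusterTimer>();
    }
    Timer& getLocalTimer() { return localTimer; }
    Timer& getClusterTimer() {
        if (clusterTimer) return *clusterTimer;
        return localTimer;  // stand-alone broker: same as getLocalTimer
    }
private:
    LocalTimer localTimer;
    std::unique_ptr<ClusterTimer> clusterTimer;
};
```

With this layout, management code would always call getClusterTimer and would get cluster-synchronized behaviour only when the broker is actually in a cluster.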

Comment 2 Frantisek Reznicek 2010-02-10 10:06:55 UTC
Hello Alan, could you possibly specify the recommended way we should test it, please? It seems difficult.

Putting NEEDINFO.

Comment 3 Alan Conway 2010-02-22 16:35:34 UTC
Can't easily reproduce this on 1.2 because the hack mentioned above suppresses the error.

You can reproduce on revision 903717 (before the fixes):
 - start a cluster broker (one is enough)
 - run qpid-queue-stats -a host:port
 - wait for management interval to pass.

The broker will exit with:
 - critical Modified cluster state outside of cluster context

Comment 4 Jan Sarenik 2010-03-30 10:18:21 UTC
Reproduced on manually-built qpid-903717

Comment 5 Jan Sarenik 2010-03-30 10:52:31 UTC
Verified on qpid-cpp-server-cluster-0.7.916826-2.el5, RHEL5 i386

Comment 6 Jan Sarenik 2010-03-30 14:00:14 UTC
Verified on qpid-cpp-server-cluster-0.7.916826-2.el5, RHEL5 x86_64