Description of problem: While running extended failover_soak test (BZ 583526, 601895, 601841) I noticed that clustered broker reports multiple thousands of errors like this: soak-6: 2010-08-06 10:04:33 error Couldn't setup next timer firing: -270804103000ns[600000000000ns] The whole failover soak test pass the exit code checks, but a lot of errors are found in the broker log files. The issue is seen on RHEL 5.5 both i386 / x86_64. This problem might be related to the bug 619765 fix as noted by Gordon. Version-Release number of selected component (if applicable): python-qmf-0.7.946106-8.el5 python-qpid-0.7.946106-11.el5 qmf-0.7.946106-11.el5 qmf-devel-0.7.946106-11.el5 qpid-cpp-client*-0.7.946106-11.el5 qpid-cpp-mrg-debuginfo-0.7.946106-11.el5 qpid-cpp-server*-0.7.946106-11.el5 qpid-java-client-0.7.946106-7.el5 qpid-java-common-0.7.946106-7.el5 qpid-tools-0.7.946106-8.el5 rh-qpid-cpp-tests-0.7.946106-11.el5 ruby-qmf-0.7.946106-11.el5 ruby-qpid-0.7.946106-2.el5 How reproducible: 80% Steps to Reproduce: 1. set-up qpidd and clustering 2. run failover_soak test alternatively run qpid_ptest_cluster_failover_soak with TEST_PARAM_RHTS_TEST_MRG_QPID_PTEST_CLUSTER_FAILOVER_SOAK_LOOPS=100 make rb Actual results: Broker reports very huge number of errors. Expected results: Broker should not report above errors. Additional info (broker log from failover_soak test): soak-11: 2010-08-06 05:53:48 notice Initializing CPG soak-11: 2010-08-06 05:53:48 notice cluster(10.34.45.2:24111 PRE_INIT) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 1 0.34.45.2:24090 10.34.45.2:24111 soak-11: 2010-08-06 05:53:48 notice cluster(10.34.45.2:24111 PRE_INIT) Members joined: 10.34.45.2:24111 soak-11: 2010-08-06 05:53:48 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24111 soak-9: 2010-08-06 05:54:04 error Couldn't setup next timer firing: -312675313000ns[600000000000ns] soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow) soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34 .45.2:24090 soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) Members left: 10.34.45.2:24008 soak-9: 2010-08-06 05:54:10 info cluster(10.34.45.2:24069 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:2406 9(member) 10.34.45.2:24090(member) frameSeq=5013371 soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow) soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow) soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow) soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34 .45.2:24090 10.34.45.2:24111 soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) Members joined: 10.34.45.2:24111 soak-11: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24069 soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow) soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow) soak-10: 2010-08-06 05:54:29 error Couldn't setup next timer firing: -348309454000ns[600000000000ns] soak-8: 2010-08-06 05:54:30 error Couldn't setup next timer firing: -153713914000ns[600000000000ns] soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 local) soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.3 4.45.2:24090 soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) Members left: 10.34.45.2:24008 soak-10: 2010-08-06 05:54:32 info cluster(10.34.45.2:24090 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:240 69(member) 10.34.45.2:24090(member) frameSeq=5013371 soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow) soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow) soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow) soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.3 4.45.2:24090 10.34.45.2:24111 soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) Members joined: 10.34.45.2:24111 soak-11: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24090 soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow) soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow) soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow) soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090 soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) Members left: 10.34.45.2:24008 soak-8: 2010-08-06 05:54:33 info cluster(10.34.45.2:24051 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:24069(member) 10.34.45.2:24090(member) frameSeq=5013371 soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow) soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow) soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow) soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090 10.34.45.2:24111 soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) Members joined: 10.34.45.2:24111 soak-11: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24051 soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow) soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow) soak-7: 2010-08-06 05:54:34 error Couldn't setup next timer firing: -34363811000ns[600000000000ns] soak-7: 2010-08-06 05:54:37 debug cluster(10.34.45.2:24030 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow) soak-7: 2010-08-06 05:54:37 notice cluster(10.34.45.2:24030 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090 ... Listener has received 440000 messages on queue failover_soak_23897_1 Killing oldest broker... killFrontBroker: waiting for newbie sync... 2010-08-06 05:54:52 info Connection connected to tcp:127.0.0.1:56993 soak-11: 2010-08-06 05:54:52 debug cluster(10.34.45.2:24111 READY) local connection 127.0.0.1:45306(10.34.45.2:24111-4 local) Listener has received 440000 messages on queue failover_soak_23897_0 soak-9: 2010-08-06 05:54:53 error Couldn't setup next timer firing: -263725065000ns[600000000000ns] soak-10: 2010-08-06 05:54:54 error Couldn't setup next timer firing: -323226689000ns[600000000000ns] soak-8: 2010-08-06 05:54:55 error Couldn't setup next timer firing: -128658816000ns[600000000000ns] soak-11: 2010-08-06 05:54:55 error Couldn't setup next timer firing: -581869566000ns[600000000000ns] soak-7: 2010-08-06 05:54:56 error Couldn't setup next timer firing: -13240677000ns[600000000000ns] soak-9: 2010-08-06 05:54:57 debug cluster(10.34.45.2:24069 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow) soak-10: 2010-08-06 05:54:59 debug cluster(10.34.45.2:24090 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow) soak-8: 2010-08-06 05:55:00 debug cluster(10.34.45.2:24051 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow) soak-11: 2010-08-06 05:55:00 debug cluster(10.34.45.2:24111 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 local) killFrontBroker: newbie synced.
I detected very few above errors messages also in usual multihost cluster environment, all brokers ran as service (running for 3 hours). The issue is not linked to the cluster width change, because it happened around couple of minutes after cluster start: 1st in cluster: no such error started at 2010-08-09 11:23:26 2nd in cluster: one occurence: 2010-08-09 11:30:04 error Couldn't setup next timer firing: -814894000ns[600000000000ns] 3rd in cluster: one occurence: 2010-08-09 11:32:38 error Couldn't setup next timer firing: -3876683747ns[600000000000ns] 4th in cluster: one occurence: 2010-08-09 11:33:29 error Couldn't setup next timer firing: -4902666378ns[600000000000ns]
I think it is caused by a periodic timer event firing during an update (specifically the queue cleaner task). I can reproduce it on almost every node addition by setting the queue purge interval to 10 secs.
Fixed on trunk () and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=95f384d368d1442bf483d2ab302019b584fa323c) by reverting the original fix for bug 619765 (where the queue cleaner was moved to the clusters timer) and instead avoid hitting the assertion by disabling periodic purging of LVQs. This was preferred over the removal of the assertion as it is in fact the case that there is a small risk of cluster inconsistencies in an LVQ if messages are purged on a distinct thread.
The issue has been fixed, tested on RHEL 5.5 i386 / x86_64 on packages python-qmf-0.7.946106-8.el5 python-qpid-0.7.946106-12.el5 qmf-0.7.946106-12.el5 qmf-devel-0.7.946106-12.el5 qpid-cpp-client*-0.7.946106-12.el5 qpid-cpp-server*-0.7.946106-12.el5 qpid-java-client-0.7.946106-7.el5 qpid-java-common-0.7.946106-7.el5 qpid-tools-0.7.946106-8.el5 ruby-qmf-0.7.946106-12.el5 ruby-qpid-0.7.946106-2.el5 -> VERIFIED