Bug 622422 - clustered qpidd during failover_soak test reports tons of 'error Couldn't setup next timer firing: -[0-9]+ns\[[0-9]+ns\]' errors
Summary: clustered qpidd during failover_soak test reports tons of 'error Couldn't set...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Gordon Sim
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-08-09 10:14 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:12 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:29:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Frantisek Reznicek 2010-08-09 10:14:45 UTC
Description of problem:

While running extended failover_soak test (BZ 583526, 601895, 601841) I noticed that clustered broker reports multiple thousands of errors like this:
  soak-6: 2010-08-06 10:04:33 error Couldn't setup next timer firing: -270804103000ns[600000000000ns]

The whole failover soak test pass the exit code checks, but a lot of errors are found in the broker log files.

The issue is seen on RHEL 5.5 both i386 / x86_64.

This problem might be related to the bug 619765 fix as noted by Gordon.

Version-Release number of selected component (if applicable):
python-qmf-0.7.946106-8.el5
python-qpid-0.7.946106-11.el5
qmf-0.7.946106-11.el5
qmf-devel-0.7.946106-11.el5
qpid-cpp-client*-0.7.946106-11.el5
qpid-cpp-mrg-debuginfo-0.7.946106-11.el5
qpid-cpp-server*-0.7.946106-11.el5
qpid-java-client-0.7.946106-7.el5
qpid-java-common-0.7.946106-7.el5
qpid-tools-0.7.946106-8.el5
rh-qpid-cpp-tests-0.7.946106-11.el5
ruby-qmf-0.7.946106-11.el5
ruby-qpid-0.7.946106-2.el5

How reproducible:
80%

Steps to Reproduce:
1. set-up qpidd and clustering
2. run failover_soak test

alternatively run qpid_ptest_cluster_failover_soak with 
TEST_PARAM_RHTS_TEST_MRG_QPID_PTEST_CLUSTER_FAILOVER_SOAK_LOOPS=100 make rb
  
Actual results:
Broker reports very huge number of errors.

Expected results:
Broker should not report above errors.

Additional info (broker log from failover_soak test):

soak-11: 2010-08-06 05:53:48 notice Initializing CPG
soak-11: 2010-08-06 05:53:48 notice cluster(10.34.45.2:24111 PRE_INIT) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 1
0.34.45.2:24090 10.34.45.2:24111
soak-11: 2010-08-06 05:53:48 notice cluster(10.34.45.2:24111 PRE_INIT) Members joined: 10.34.45.2:24111
soak-11: 2010-08-06 05:53:48 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24111
soak-9: 2010-08-06 05:54:04 error Couldn't setup next timer firing: -312675313000ns[600000000000ns]
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow)
soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34
.45.2:24090
soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) Members left: 10.34.45.2:24008
soak-9: 2010-08-06 05:54:10 info cluster(10.34.45.2:24069 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:2406
9(member) 10.34.45.2:24090(member) frameSeq=5013371
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow)
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow)
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow)
soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34
.45.2:24090 10.34.45.2:24111
soak-9: 2010-08-06 05:54:10 notice cluster(10.34.45.2:24069 READY) Members joined: 10.34.45.2:24111
soak-11: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24069
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow)
soak-9: 2010-08-06 05:54:10 debug cluster(10.34.45.2:24069 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow)
soak-10: 2010-08-06 05:54:29 error Couldn't setup next timer firing: -348309454000ns[600000000000ns]
soak-8: 2010-08-06 05:54:30 error Couldn't setup next timer firing: -153713914000ns[600000000000ns]
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 local)
soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.3
4.45.2:24090
soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) Members left: 10.34.45.2:24008
soak-10: 2010-08-06 05:54:32 info cluster(10.34.45.2:24090 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:240
69(member) 10.34.45.2:24090(member) frameSeq=5013371
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow)
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow)
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow)
soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.3
4.45.2:24090 10.34.45.2:24111
soak-10: 2010-08-06 05:54:32 notice cluster(10.34.45.2:24090 READY) Members joined: 10.34.45.2:24111
soak-11: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24090
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow)
soak-10: 2010-08-06 05:54:32 debug cluster(10.34.45.2:24090 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow)
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow)
soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090
soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) Members left: 10.34.45.2:24008
soak-8: 2010-08-06 05:54:33 info cluster(10.34.45.2:24051 READY) member update: 10.34.45.2:24030(member) 10.34.45.2:24051(member) 10.34.45.2:24069(member) 10.34.45.2:24090(member) frameSeq=5013371
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57205(10.34.45.2:24008-7 shadow)
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57213(10.34.45.2:24008-9 shadow)
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) closed connection 10.34.33.55:57214(10.34.45.2:24008-10 shadow)
soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090 10.34.45.2:24111
soak-8: 2010-08-06 05:54:33 notice cluster(10.34.45.2:24051 READY) Members joined: 10.34.45.2:24111
soak-11: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24111 PRE_INIT) received initial status from 10.34.45.2:24051
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) replicated connection 10.34.33.55:46736(10.34.45.2:24030-8 shadow)
soak-8: 2010-08-06 05:54:33 debug cluster(10.34.45.2:24051 READY) replicated connection 10.34.33.55:46735(10.34.45.2:24030-7 shadow)
soak-7: 2010-08-06 05:54:34 error Couldn't setup next timer firing: -34363811000ns[600000000000ns]
soak-7: 2010-08-06 05:54:37 debug cluster(10.34.45.2:24030 READY) closed connection 127.0.0.1:37252(10.34.45.2:24090-6 shadow)
soak-7: 2010-08-06 05:54:37 notice cluster(10.34.45.2:24030 READY) configuration change: 10.34.45.2:24030 10.34.45.2:24051 10.34.45.2:24069 10.34.45.2:24090
...
Listener has received 440000 messages on queue failover_soak_23897_1
Killing oldest broker...
killFrontBroker: waiting for newbie sync...
2010-08-06 05:54:52 info Connection  connected to tcp:127.0.0.1:56993
soak-11: 2010-08-06 05:54:52 debug cluster(10.34.45.2:24111 READY) local connection 127.0.0.1:45306(10.34.45.2:24111-4 local)
Listener has received 440000 messages on queue failover_soak_23897_0
soak-9: 2010-08-06 05:54:53 error Couldn't setup next timer firing: -263725065000ns[600000000000ns]
soak-10: 2010-08-06 05:54:54 error Couldn't setup next timer firing: -323226689000ns[600000000000ns]
soak-8: 2010-08-06 05:54:55 error Couldn't setup next timer firing: -128658816000ns[600000000000ns]
soak-11: 2010-08-06 05:54:55 error Couldn't setup next timer firing: -581869566000ns[600000000000ns]
soak-7: 2010-08-06 05:54:56 error Couldn't setup next timer firing: -13240677000ns[600000000000ns]
soak-9: 2010-08-06 05:54:57 debug cluster(10.34.45.2:24069 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow)
soak-10: 2010-08-06 05:54:59 debug cluster(10.34.45.2:24090 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow)
soak-8: 2010-08-06 05:55:00 debug cluster(10.34.45.2:24051 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 shadow)
soak-11: 2010-08-06 05:55:00 debug cluster(10.34.45.2:24111 READY) replicated connection 127.0.0.1:45306(10.34.45.2:24111-4 local)
killFrontBroker: newbie synced.

Comment 1 Frantisek Reznicek 2010-08-09 12:04:35 UTC
I detected very few above errors messages also in usual multihost cluster environment, all brokers ran as service (running for 3 hours).
The issue is not linked to the cluster width change, because it happened around couple of minutes after cluster start:

1st in cluster: no such error started at 2010-08-09 11:23:26
2nd in cluster: one occurence: 2010-08-09 11:30:04 error Couldn't setup next timer firing: -814894000ns[600000000000ns]
3rd in cluster: one occurence: 2010-08-09 11:32:38 error Couldn't setup next timer firing: -3876683747ns[600000000000ns]
4th in cluster: one occurence: 2010-08-09 11:33:29 error Couldn't setup next timer firing: -4902666378ns[600000000000ns]

Comment 2 Gordon Sim 2010-08-09 12:29:18 UTC
I think it is caused by a periodic timer event firing during an update (specifically the queue cleaner task). I can reproduce it on almost every node addition by setting the queue purge interval to 10 secs.

Comment 3 Gordon Sim 2010-08-11 13:28:40 UTC
Fixed on trunk () and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=95f384d368d1442bf483d2ab302019b584fa323c) by reverting the original fix for bug 619765 (where the queue cleaner was moved to the clusters timer) and instead avoid hitting the assertion by disabling periodic purging of LVQs.

This was preferred over the removal of the assertion as it is in fact the case that there is a small risk of cluster inconsistencies in an LVQ if messages are purged on a distinct thread.

Comment 4 Frantisek Reznicek 2010-08-17 08:21:34 UTC
The issue has been fixed, tested on RHEL 5.5 i386 / x86_64 on packages 
python-qmf-0.7.946106-8.el5
python-qpid-0.7.946106-12.el5
qmf-0.7.946106-12.el5
qmf-devel-0.7.946106-12.el5
qpid-cpp-client*-0.7.946106-12.el5
qpid-cpp-server*-0.7.946106-12.el5
qpid-java-client-0.7.946106-7.el5
qpid-java-common-0.7.946106-7.el5
qpid-tools-0.7.946106-8.el5
ruby-qmf-0.7.946106-12.el5
ruby-qpid-0.7.946106-2.el5

-> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.