Description of problem:
Running a durable perftest results in a never-ending stream of late/overrun journal inactive timer messages, with delays up to 6-7 seconds.

This was observed on 2 hosts:
mrg-qe-09.lab.eng.brq.redhat.com: x86_64, 8053Gb, 2 cpus, 8 cores, 2294.252MHz Quad-Core AMD Opteron(tm) Processor 2376, Red Hat Enterprise Linux Server release 5.5 (Tikanga)
mrg-qe-12.lab.eng.brq.redhat.com: i386, 8117Gb, 2 cpus, 8 cores, 2294.411MHz Quad-Core AMD Opteron(tm) Processor 2376, Red Hat Enterprise Linux Server release 5.5 (Tikanga)

The same test ran correctly on the following hosts:
rolf: x86_64, 3891Gb, 1 cpu, 2 cores, 2200.000MHz Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz, Fedora release 13 (Goddard)
mrg33.lab.bos.redhat.com: 16051Mb, 8 cores, 2 cpus, 2493.751MHz Intel(R) Xeon(R) CPU E5420 @ 2.50GHz, Red Hat Enterprise Linux Server release 5.5 (Tikanga)

So far it appears that the problem is AMD-specific, but there could be other factors.

Version-Release number of selected component (if applicable):
Observed on: qpid-cpp-server-0.7.946106-12.el5

How reproducible:
100%

Steps to Reproduce:
1. qpid-perftest --durable yes --count 10000 --qt 40 -s
2. watch the qpidd log for late/overrun warnings

Note: on trunk, timer warnings now include the timer name and, in the case of journal timers, the queue name. I suspect the issue still exists on trunk but haven't tested yet.

Actual results:
Lots of late/overrun journal timer warnings. Note the warnings continue even after the perftest has stopped.

Expected results:
A moderate number of warnings, delays of <10ms, maybe one or two in the 100s of ms. Certainly nothing over 1000ms.

Additional info:
This was discovered while investigating bug 623660, made a separate bug as this issue does not require a cluster.
According to later comments in bug 623660, late/overrun journal timer warnings can also be produced on Intel boxes; it is just more difficult.
You can also provoke the storm of warnings in a non-durable cluster with: qpid-cpp-benchmark -q10 --send-arg --ttl=1 aggregating the timer warnings gives reasonable behaviour, see https://issues.apache.org/jira/browse/QPID-2938 however we may still want to review whether the store and cluster are using timers excessively.
Timer warning aggregation committed as r1036169 seems to resolve the problem. There are still overruns reported, but without increasing multi-second delays. Fixed in r1036169; needs to be back-ported to the release branch.
I've tested the same clustering procedure/scenario on the latest packages (see below) and I can see different warnings (as the code introduced another class with another message). Sadly, I am still able to see very long overruns; see details below. I believe I should see those warnings significantly reduced, and lateness should not be greater than a second. My observation shows that there are fewer warnings, but some of them (as can be seen below) are greater than 10 seconds (machine mrg-qe-10, where qmf traffic is highest). Let me raise NEEDINFO for crosschecking the bugfix criteria.

The current observation:

[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
51 [2011-01-11 03:52:14 warning task late 1 times by 51ms on average.]
51 [2011-01-11 21:48:30 warning task late 1 times by 51ms on average.]
52 [2011-01-08 10:55:38 warning task late 1 times by 52ms on average.]
53 [2011-01-08 05:14:06 warning task late 1 times by 53ms on average.]
53 [2011-01-08 10:26:44 warning task late 1 times by 53ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
55 [2011-01-09 14:30:05 warning task late 1 times by 55ms on average.]
55 [2011-01-10 03:26:33 warning task late 1 times by 55ms on average.]
56 [2011-01-09 14:01:16 warning task late 1 times by 56ms on average.]
56 [2011-01-12 04:17:15 warning task late 1 times by 56ms on average.]
56 [2011-01-12 12:07:54 warning task late 1 times by 56ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
13
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-12 04:18:27 warning task late 1 times by 101ms on average.]
10 [2011-01-12 06:28:41 warning task late 1 times by 101ms on average.]
10 [2011-01-12 07:54:58 warning task late 1 times by 101ms on average.]
50 [2011-01-09 06:20:16 warning task late 1 times by 50ms on average.]
50 [2011-01-09 13:28:08 warning task late 1 times by 50ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
558 [2011-01-11 12:40:47 warning task late 3 times by 558ms on average.]
3813 [2011-01-12 04:54:28 warning task late 5 times by 3813ms on average.]
11948 [2011-01-12 04:54:47 warning task late 1 times by 11948ms on average.]
11977 [2011-01-11 12:41:53 warning task late 1 times by 11977ms on average.]
14443 [2011-01-11 12:41:34 warning task late 2 times by 14443ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
200
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 17:28:11 warning task late 1 times by 101ms on average.]
10 [2011-01-12 04:43:16 warning task late 1 times by 101ms on average.]
10 [2011-01-12 05:09:04 warning task late 1 times by 101ms on average.]
10 [2011-01-12 05:12:01 warning task late 1 times by 101ms on average.]
10 [2011-01-12 12:27:36 warning task late 1 times by 101ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
258 [2011-01-12 03:59:35 warning task late 1 times by 258ms on average.]
260 [2011-01-11 18:34:24 warning task late 1 times by 260ms on average.]
260 [2011-01-12 03:44:58 warning task late 1 times by 260ms on average.]
280 [2011-01-12 03:55:31 warning task late 1 times by 280ms on average.]
826 [2011-01-11 15:29:07 warning task late 3 times by 826ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
238
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 20:55:09 warning task late 1 times by 101ms on average.]
10 [2011-01-12 10:18:00 warning task late 1 times by 100ms on average.]
10 [2011-01-12 12:05:38 warning task late 1 times by 100ms on average.]
50 [2011-01-11 13:06:05 warning task late 1 times by 50ms on average.]
50 [2011-01-11 13:19:54 warning task late 1 times by 50ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
330 [2011-01-12 04:25:02 warning task late 1 times by 330ms on average.]
340 [2011-01-12 17:43:45 warning task late 1 times by 340ms on average.]
422 [2011-01-12 12:40:26 warning task late 1 times by 422ms on average.]
646 [2011-01-11 12:46:39 warning task late 2 times by 646ms on average.]
1828 [2011-01-11 12:46:59 warning task late 1 times by 1828ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
371

The packages:
[root@mrg-qe-12 ~]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-26.el5
qpid-cpp-client-devel-0.7.946106-26.el5
qpid-cpp-client-devel-docs-0.7.946106-26.el5
qpid-cpp-client-rdma-0.7.946106-26.el5
qpid-cpp-client-ssl-0.7.946106-26.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-cpp-server-0.7.946106-26.el5
qpid-cpp-server-cluster-0.7.946106-26.el5
qpid-cpp-server-devel-0.7.946106-26.el5
qpid-cpp-server-rdma-0.7.946106-26.el5
qpid-cpp-server-ssl-0.7.946106-26.el5
qpid-cpp-server-store-0.7.946106-26.el5
qpid-cpp-server-xml-0.7.946106-26.el5
qpid-java-client-0.7.946106-12.el5
qpid-java-common-0.7.946106-12.el5
qpid-java-example-0.7.946106-12.el5
qpid-tools-0.7.946106-11.el5
rh-qpid-cpp-tests-0.7.946106-26.el5
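As a side note on the pipeline used above: the awk expression `substr($(NF-2), 1,length($(NF-2)-2))` appears to contain a misplaced parenthesis (presumably `length($(NF-2))-2` was intended), so three-or-more-digit delays are truncated; e.g. "101ms" is coerced to 99 by the numeric subtraction, whose length is 2, so the line sorts under "10". A small illustrative Python helper (`lateness_ms` is my own hypothetical name, not part of any Qpid tooling) that extracts the millisecond value exactly:

```python
import re

def lateness_ms(line):
    """Extract the millisecond lateness from a 'task late' warning line.

    Returns None for lines that do not match. Matches e.g.
    '2011-01-12 04:54:47 warning task late 1 times by 11948ms on average.'
    """
    m = re.search(r"late \d+ times by (\d+)ms on average", line)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    sample = [
        "2011-01-12 04:18:27 warning task late 1 times by 101ms on average.",
        "2011-01-11 03:52:14 warning task late 1 times by 51ms on average.",
        "2011-01-12 04:54:47 warning task late 1 times by 11948ms on average.",
    ]
    # Sort numerically by the extracted delay: 51, 101, 11948
    for ms in sorted(m for m in map(lateness_ms, sample) if m is not None):
        print(ms)
```

With this extraction, the "10 [...101ms...]" rows in the listings above would sort under 101, so the head/tail summaries would group by true delay.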
How do you reproduce the result in comment 6? The cases in the Description and comment 4 seem to be OK.
Hi Alan,

I ran the original scenario: a 4-node cluster, where every node loops connection clients (checking whether c++ / python / qmf c++ / qmf python clients are able to connect to and disconnect from the broker). On top of that, tsxtest/perftest/qpid-stat|qpid-cluster are looped on a specific node to create transient/durable traffic (approx. 50/50 %) - in fact exactly the same scenario as described here: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c13

I have not tried comment 4's qpid-cpp-benchmark yet.

I'd like to know:
- Is the change of log level for such messages a strict requirement? (see comment 3)
- It is proven that the broker now produces far fewer messages (factor >10^4):
  the -26 build produced about N * 800 warnings (N ~ cluster width)
  the -12 build produced about N * 12M warnings
  Is this improvement enough?
- How should a task late by more than a second be treated? I think there is a link/dependency to the missing flow control: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27 (which is not implemented yet, right?)
Responses to comment 9:

> Is the change of log level for such messages a strict requirement? (see comment 3)

The fix reduces the warnings to at most 1 every 5 seconds, which should avoid excessive flooding of the logs but at the same time leaves the warning visible.

> It is proven that the broker now produces far fewer messages (factor >10^4):
> the -26 build produced about N * 800 warnings (N ~ cluster width)
> the -12 build produced about N * 12M warnings
> Is this improvement enough?

I think so - as above, it's now once every 5 seconds. We could make that interval longer.

> How should a task late by more than a second be treated? I think there is a link/dependency to the missing flow control: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27 (which is not implemented yet, right?)

Right. That probably is the source of the overrun. I haven't had time to get back to that yet.
Hi Alan, thanks for your comments. I agree that logging is improved enough and log message flooding is avoided. There are, however, two other linked points which should be settled for this BZ:
- there is a suggestion to change the logging level for such messages (which I believe is not a strict requirement, as the main goal is reached and flooding is avoided)
- there are a few delay values mentioned in the description which are currently violated in some cases:
  "Expected results: moderate number of warnings, delays of <10ms, maybe one or two in 100s of ms. Certainly nothing over 1000ms."

Would you agree to solving such problems (overrun timing) in bug 623660 or possibly in a new BZ apart from this defect, please? Once we agree on the two points summarized above, I'm ready to finish this defect.
Yes, let's leave the overrun timing issue to bug 623660 and close this one.
The issue has been resolved. The flood of overrun messages is now very significantly reduced.

Tested on RHEL 4.8 / 5.6 i386 / x86_64 on packages:
[root@mrg-qe-02 bz656488]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-14.el5.noarch
qpid-cpp-client-0.7.946106-26.el5.x86_64
qpid-cpp-client-devel-0.7.946106-26.el5.x86_64
qpid-cpp-client-devel-docs-0.7.946106-26.el5.x86_64
qpid-cpp-client-rdma-0.7.946106-26.el5.x86_64
qpid-cpp-client-ssl-0.7.946106-26.el5.x86_64
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5.x86_64
qpid-cpp-server-0.7.946106-26.el5.x86_64
qpid-cpp-server-cluster-0.7.946106-26.el5.x86_64
qpid-cpp-server-devel-0.7.946106-26.el5.x86_64
qpid-cpp-server-rdma-0.7.946106-26.el5.x86_64
qpid-cpp-server-ssl-0.7.946106-26.el5.x86_64
qpid-cpp-server-store-0.7.946106-26.el5.x86_64
qpid-cpp-server-xml-0.7.946106-26.el5.x86_64
qpid-dotnet-0.4.738274-2.el5.x86_64
qpid-java-client-0.7.946106-14.el5.noarch
qpid-java-common-0.7.946106-14.el5.noarch
qpid-java-example-0.7.946106-14.el5.noarch
qpid-tests-0.7.946106-1.el5.noarch
qpid-tools-0.7.946106-11.el5.noarch
rh-qpid-cpp-tests-0.7.946106-26.el5.x86_64

-> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: In some circumstances the broker generated a large number of timer late/overrun warnings.
Consequence: The warnings themselves are usually harmless, but the time taken to log them individually caused long delays (up to several seconds) and bloated log files.
Fix: Timer warnings are batched and logged periodically.
Result: Large numbers of timer late/overrun events only create a controlled number of warnings in the log.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html