Bug 629756
| Summary: | Lots of Journal inactive late/overrun timer warnings | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Alan Conway <aconway> |
| Component: | qpid-cpp | Assignee: | Alan Conway <aconway> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | beta | CC: | esammons, freznice, gsim, iboverma, jneedle, rrajaram, tross |
| Target Milestone: | 1.3.2-RC1 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | qpid-cpp-mrg-0.7.946106-26 | Doc Type: | Bug Fix |
| Doc Text: |
Cause: in some circumstances the broker generated a large number of timer late/overrun warnings.
Consequence: The warnings themselves are usually inoffensive but the time to log them individually caused long (up to several seconds) delays and bloated log files.
Fix: Timer warnings are batched and logged periodically
Result: Large numbers of timer late/overrun events only create a controlled number of warnings in the log.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-02-15 12:11:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Alan Conway
2010-09-02 20:57:54 UTC
This was discovered while investigating bug 623660, made a separate bug as this issue does not require a cluster. According to later comments in bug 623660, late/overrun journal timer warnings can also be produced on intel boxes, it's just more difficult. You can also provoke the storm of warnings in a non-durable cluster with: qpid-cpp-benchmark -q10 --send-arg --ttl=1 aggregating the timer warnings gives reasonable behaviour, see https://issues.apache.org/jira/browse/QPID-2938 however we may still want to review whether the store and cluster are using timers excessively. Timer warning aggregation comitted as r1036169 seems to resolve the problem. There are still overruns reported but without increasing multi-second delays. Fixed on r1036169, needs to be back-ported to release branch. I've tested the same clustering procedure/scenario on latest packages (see below) and I can see different warnings (as code introduced another class with another message).
Sadly I'm able to see very long overruns, see details below.
I believe I should see those warnings significantly reduced and lateness should not be greather than a second.
My observation shows that there are less warnings, but some of them (as can be seen below) are greather than 10 seconds (machine mrg-qe-10 where is highest qmf traffic).
Let me raise NEEDINFO for crosschecking the bugfix criteria.
The current observation:
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
51 [2011-01-11 03:52:14 warning task late 1 times by 51ms on average.]
51 [2011-01-11 21:48:30 warning task late 1 times by 51ms on average.]
52 [2011-01-08 10:55:38 warning task late 1 times by 52ms on average.]
53 [2011-01-08 05:14:06 warning task late 1 times by 53ms on average.]
53 [2011-01-08 10:26:44 warning task late 1 times by 53ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
55 [2011-01-09 14:30:05 warning task late 1 times by 55ms on average.]
55 [2011-01-10 03:26:33 warning task late 1 times by 55ms on average.]
56 [2011-01-09 14:01:16 warning task late 1 times by 56ms on average.]
56 [2011-01-12 04:17:15 warning task late 1 times by 56ms on average.]
56 [2011-01-12 12:07:54 warning task late 1 times by 56ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
13
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-12 04:18:27 warning task late 1 times by 101ms on average.]
10 [2011-01-12 06:28:41 warning task late 1 times by 101ms on average.]
10 [2011-01-12 07:54:58 warning task late 1 times by 101ms on average.]
50 [2011-01-09 06:20:16 warning task late 1 times by 50ms on average.]
50 [2011-01-09 13:28:08 warning task late 1 times by 50ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
558 [2011-01-11 12:40:47 warning task late 3 times by 558ms on average.]
3813 [2011-01-12 04:54:28 warning task late 5 times by 3813ms on average.]
11948 [2011-01-12 04:54:47 warning task late 1 times by 11948ms on average.]
11977 [2011-01-11 12:41:53 warning task late 1 times by 11977ms on average.]
14443 [2011-01-11 12:41:34 warning task late 2 times by 14443ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
200
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 17:28:11 warning task late 1 times by 101ms on average.]
10 [2011-01-12 04:43:16 warning task late 1 times by 101ms on average.]
10 [2011-01-12 05:09:04 warning task late 1 times by 101ms on average.]
10 [2011-01-12 05:12:01 warning task late 1 times by 101ms on average.]
10 [2011-01-12 12:27:36 warning task late 1 times by 101ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
258 [2011-01-12 03:59:35 warning task late 1 times by 258ms on average.]
260 [2011-01-11 18:34:24 warning task late 1 times by 260ms on average.]
260 [2011-01-12 03:44:58 warning task late 1 times by 260ms on average.]
280 [2011-01-12 03:55:31 warning task late 1 times by 280ms on average.]
826 [2011-01-11 15:29:07 warning task late 3 times by 826ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
238
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 20:55:09 warning task late 1 times by 101ms on average.]
10 [2011-01-12 10:18:00 warning task late 1 times by 100ms on average.]
10 [2011-01-12 12:05:38 warning task late 1 times by 100ms on average.]
50 [2011-01-11 13:06:05 warning task late 1 times by 50ms on average.]
50 [2011-01-11 13:19:54 warning task late 1 times by 50ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
330 [2011-01-12 04:25:02 warning task late 1 times by 330ms on average.]
340 [2011-01-12 17:43:45 warning task late 1 times by 340ms on average.]
422 [2011-01-12 12:40:26 warning task late 1 times by 422ms on average.]
646 [2011-01-11 12:46:39 warning task late 2 times by 646ms on average.]
1828 [2011-01-11 12:46:59 warning task late 1 times by 1828ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
371
The packages:
[root@mrg-qe-12 ~]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-26.el5
qpid-cpp-client-devel-0.7.946106-26.el5
qpid-cpp-client-devel-docs-0.7.946106-26.el5
qpid-cpp-client-rdma-0.7.946106-26.el5
qpid-cpp-client-ssl-0.7.946106-26.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-cpp-server-0.7.946106-26.el5
qpid-cpp-server-cluster-0.7.946106-26.el5
qpid-cpp-server-devel-0.7.946106-26.el5
qpid-cpp-server-rdma-0.7.946106-26.el5
qpid-cpp-server-ssl-0.7.946106-26.el5
qpid-cpp-server-store-0.7.946106-26.el5
qpid-cpp-server-xml-0.7.946106-26.el5
qpid-java-client-0.7.946106-12.el5
qpid-java-common-0.7.946106-12.el5
qpid-java-example-0.7.946106-12.el5
qpid-tools-0.7.946106-11.el5
rh-qpid-cpp-tests-0.7.946106-26.el5
How do you reproduce the result in comment 6? The cases in the Description and comment 4 seem to be OK. Hi Alan, I run the original scenario: 4-node cluster, every node loops connection clients (checking whether c++/python/qmf c++/qmf python is able to connect and disconnect to /from broker) on top of that tsxtest/perftest/qpid-stat|qpid-cluster are looped on specific node to create transient / durable traffic (approx. 50 / 50 %) in fact exactly the same scenario as described here: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c13 I have not tried comment's 4 qpid-cpp-benchmark yet. I'd like to know: - Is the change of log level for such messages a strict requirement? (see comment 3) - It is proved that broker now produces much less messages (factor >10^4) -26 build produced about N * 800 warnings (N ~ cluster width) -12 build produced about N * 12M warnings Is this improvement enough? - How to treat the task late by more than second? I think there is link/dependency to missing flow control: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27 (which is not implemented yet right?) Responses to comment 9 - Is the change of log level for such messages a strict requirement? (see comment 3) The fix reduces the warnings to at most 1 every 5 seconds, which should avoid excessive flooding of the logs but at the same time leaves the warning visible. - It is proved that broker now produces much less messages (factor >10^4) -26 build produced about N * 800 warnings (N ~ cluster width) -12 build produced about N * 12M warnings Is this improvement enough? I think so - as above its now once every 5 seconds. We could make that interval longer. - How to treat the task late by more than second? I think there is link/dependency to missing flow control: https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27 (which is not implemented yet right?) Right. That probably is the source of the overrun. I haven't had time to get back to that yet. Hi Alan,
thanks for your comments.
I agree that logging is improved enough and log message flooding is avoided.
There are, however, two other linked facts which should be done for this BZ:
- there is suggestion to change logging level for such messages (at very least which I believe is not strict requirement as the main goal is reached and flooding avoided)
- there are few values of delays mentioned in the description which are currently violated in some cases:
Expected results:
moderate number of warnings, delays of <10ms, maybe one or two in
100s of ms. Certainly nothing over 1000ms.
Would you agree to let solve such problems (overrun timing) in bug 623660 or in possibly new BZ apart of this defect, please?
Once we agree on the summarized two points, I'm ready to finish this defect.
Yes, let's leave the overrun timing issue to bug 623660 and close this one. The issue has been resolved. The flood of overrun messages is now very significantly reduced. Tested on RHEL 4.8 / 5.6 i386 / x86_64 on packages: [root@mrg-qe-02 bz656488]# rpm -qa | grep qpid | sort python-qpid-0.7.946106-14.el5.noarch qpid-cpp-client-0.7.946106-26.el5.x86_64 qpid-cpp-client-devel-0.7.946106-26.el5.x86_64 qpid-cpp-client-devel-docs-0.7.946106-26.el5.x86_64 qpid-cpp-client-rdma-0.7.946106-26.el5.x86_64 qpid-cpp-client-ssl-0.7.946106-26.el5.x86_64 qpid-cpp-mrg-debuginfo-0.7.946106-26.el5.x86_64 qpid-cpp-server-0.7.946106-26.el5.x86_64 qpid-cpp-server-cluster-0.7.946106-26.el5.x86_64 qpid-cpp-server-devel-0.7.946106-26.el5.x86_64 qpid-cpp-server-rdma-0.7.946106-26.el5.x86_64 qpid-cpp-server-ssl-0.7.946106-26.el5.x86_64 qpid-cpp-server-store-0.7.946106-26.el5.x86_64 qpid-cpp-server-xml-0.7.946106-26.el5.x86_64 qpid-dotnet-0.4.738274-2.el5.x86_64 qpid-java-client-0.7.946106-14.el5.noarch qpid-java-common-0.7.946106-14.el5.noarch qpid-java-example-0.7.946106-14.el5.noarch qpid-tests-0.7.946106-1.el5.noarch qpid-tools-0.7.946106-11.el5.noarch rh-qpid-cpp-tests-0.7.946106-26.el5.x86_64 -> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: in some circumstances the broker generated a large number of timer late/overrun warnings.
Consequence: The warnings themselves are usually inoffensive but the time to log them individually caused long (up to several seconds) delays and bloated log files.
Fix: Timer warnings are batched and logged periodically
Result: Large numbers of timer late/overrun events only create a controlled number of warnings in the log.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0217.html |