Bug 629756 - Lots of Journal inactive late/overrun timer warnings
Summary: Lots of Journal inactive late/overrun timer warnings
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
high
medium
Target Milestone: 1.3.2-RC1
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-09-02 20:57 UTC by Alan Conway
Modified: 2015-11-16 01:12 UTC (History)
7 users (show)

Fixed In Version: qpid-cpp-mrg-0.7.946106-26
Doc Type: Bug Fix
Doc Text:
Cause: in some circumstances the broker generated a large number of timer late/overrun warnings. Consequence: The warnings themselves are usually inoffensive but the time to log them individually caused long (up to several seconds) delays and bloated log files. Fix: Timer warnings are batched and logged periodically Result: Large numbers of timer late/overrun events only create a controlled number of warnings in the log.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:11:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0217 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update 2011-02-15 12:10:15 UTC

Description Alan Conway 2010-09-02 20:57:54 UTC
Description of problem:

Running a durable perftest results in a never ending stream of late/overrun journal inactive timer messages, with delays up to 6-7 seconds. This was observed on 2 hosts: 

mrg-qe-09.lab.eng.brq.redhat.com: x86_64, 8053Gb, 2 cpus, 8 cores,  2294.252MHz
  Quad-Core AMD Opteron(tm) Processor 2376
  Red Hat Enterprise Linux Server release 5.5 (Tikanga)

mrg-qe-12.lab.eng.brq.redhat.com: i386, 8117Gb, 2 cpus, 8 cores,  2294.411MHz
  Quad-Core AMD Opteron(tm) Processor 2376
  Red Hat Enterprise Linux Server release 5.5 (Tikanga)

The same test ran correctly on the following hosts:

rolf: x86_64, 3891Gb, 1 cpus, 2 cores,  2200.000MHz
  Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz
  Fedora release 13 (Goddard)

mrg33.lab.bos.redhat.com: 16051Mb 8 cores, 2 cpus,  2493.751MHz
  Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz
  Red Hat Enterprise Linux Server release 5.5 (Tikanga)

So far it appears that the problem is AMD specific but there could be other factors.

Version-Release number of selected component (if applicable):

Observed on: qpid-cpp-server-0.7.946106-12.el5

How reproducible: 100%


Steps to Reproduce:
1. qpid-perftest --durable yes --count 10000 --qt 40 -s
2. watch qpidd log for late/overrun warnings

Note: on trunk, timer warnings now include the timer name and, in the case of journal timers, the queue name. I suspect the issue still exists on trunk but haven't tested yet.
  
Actual results:

Lots of late/overrun journal timer warnings. Note the warnings continue even after the perftest has stopped.

Expected results: 

moderate number of warnings, delays of <10ms, maybe one or two in 100s of ms. Certainly nothing over 1000ms.


Additional info:

Comment 1 Alan Conway 2010-09-02 20:59:05 UTC
This was discovered while investigating bug 623660, made a separate bug as this issue does not require a cluster.

Comment 2 Alan Conway 2010-09-03 12:53:44 UTC
According to later comments in bug 623660, late/overrun journal timer warnings can also be produced on intel boxes, it's just more difficult.

Comment 4 Alan Conway 2010-11-10 18:55:51 UTC
You can also provoke the storm of warnings in a non-durable cluster with:

qpid-cpp-benchmark -q10 --send-arg --ttl=1 

aggregating the timer warnings gives reasonable behaviour, see

https://issues.apache.org/jira/browse/QPID-2938

however we may still want to review whether the store and cluster are using timers excessively.

Comment 5 Alan Conway 2010-11-17 19:31:57 UTC
Timer warning aggregation comitted as r1036169 seems to resolve the problem. There are still overruns reported but without increasing multi-second delays.

Fixed on r1036169, needs to be back-ported to release branch.

Comment 6 Frantisek Reznicek 2011-01-12 17:00:27 UTC
I've tested the same clustering procedure/scenario on latest packages (see below) and I can see different warnings (as code introduced another class with another message).

Sadly I'm able to see very long overruns, see details below.
I believe I should see those warnings significantly reduced and lateness should not be greather than a second.


My observation shows that there are less warnings, but some of them (as can be seen below) are greather than 10 seconds (machine mrg-qe-10 where is highest qmf traffic).


Let me raise NEEDINFO for crosschecking the bugfix criteria.

The current observation:

[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
51 [2011-01-11 03:52:14 warning  task late 1 times by 51ms on average.]
51 [2011-01-11 21:48:30 warning  task late 1 times by 51ms on average.]
52 [2011-01-08 10:55:38 warning  task late 1 times by 52ms on average.]
53 [2011-01-08 05:14:06 warning  task late 1 times by 53ms on average.]
53 [2011-01-08 10:26:44 warning  task late 1 times by 53ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
55 [2011-01-09 14:30:05 warning  task late 1 times by 55ms on average.]
55 [2011-01-10 03:26:33 warning  task late 1 times by 55ms on average.]
56 [2011-01-09 14:01:16 warning  task late 1 times by 56ms on average.]
56 [2011-01-12 04:17:15 warning  task late 1 times by 56ms on average.]
56 [2011-01-12 12:07:54 warning  task late 1 times by 56ms on average.]
[root@mrg-qe-09 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
13


[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-12 04:18:27 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 06:28:41 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 07:54:58 warning  task late 1 times by 101ms on average.]
50 [2011-01-09 06:20:16 warning  task late 1 times by 50ms on average.]
50 [2011-01-09 13:28:08 warning  task late 1 times by 50ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
558 [2011-01-11 12:40:47 warning  task late 3 times by 558ms on average.]
3813 [2011-01-12 04:54:28 warning  task late 5 times by 3813ms on average.]
11948 [2011-01-12 04:54:47 warning  task late 1 times by 11948ms on average.]
11977 [2011-01-11 12:41:53 warning  task late 1 times by 11977ms on average.]
14443 [2011-01-11 12:41:34 warning  task late 2 times by 14443ms on average.]
[root@mrg-qe-10 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
200


[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 17:28:11 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 04:43:16 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 05:09:04 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 05:12:01 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 12:27:36 warning  task late 1 times by 101ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
258 [2011-01-12 03:59:35 warning  task late 1 times by 258ms on average.]
260 [2011-01-11 18:34:24 warning  task late 1 times by 260ms on average.]
260 [2011-01-12 03:44:58 warning  task late 1 times by 260ms on average.]
280 [2011-01-12 03:55:31 warning  task late 1 times by 280ms on average.]
826 [2011-01-11 15:29:07 warning  task late 3 times by 826ms on average.]
[root@mrg-qe-11 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
238

[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | head -5
10 [2011-01-11 20:55:09 warning  task late 1 times by 101ms on average.]
10 [2011-01-12 10:18:00 warning  task late 1 times by 100ms on average.]
10 [2011-01-12 12:05:38 warning  task late 1 times by 100ms on average.]
50 [2011-01-11 13:06:05 warning  task late 1 times by 50ms on average.]
50 [2011-01-11 13:19:54 warning  task late 1 times by 50ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | tail -5
330 [2011-01-12 04:25:02 warning  task late 1 times by 330ms on average.]
340 [2011-01-12 17:43:45 warning  task late 1 times by 340ms on average.]
422 [2011-01-12 12:40:26 warning  task late 1 times by 422ms on average.]
646 [2011-01-11 12:46:39 warning  task late 2 times by 646ms on average.]
1828 [2011-01-11 12:46:59 warning  task late 1 times by 1828ms on average.]
[root@mrg-qe-12 ~]# grep "late" /tmp/qpidd.log | awk '{print substr($(NF-2), 1,length($(NF-2)-2)) " [" $0 "]" }' | sort -n | wc -l
371


The packages:
[root@mrg-qe-12 ~]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-26.el5
qpid-cpp-client-devel-0.7.946106-26.el5
qpid-cpp-client-devel-docs-0.7.946106-26.el5
qpid-cpp-client-rdma-0.7.946106-26.el5
qpid-cpp-client-ssl-0.7.946106-26.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-cpp-server-0.7.946106-26.el5
qpid-cpp-server-cluster-0.7.946106-26.el5
qpid-cpp-server-devel-0.7.946106-26.el5
qpid-cpp-server-rdma-0.7.946106-26.el5
qpid-cpp-server-ssl-0.7.946106-26.el5
qpid-cpp-server-store-0.7.946106-26.el5
qpid-cpp-server-xml-0.7.946106-26.el5
qpid-java-client-0.7.946106-12.el5
qpid-java-common-0.7.946106-12.el5
qpid-java-example-0.7.946106-12.el5
qpid-tools-0.7.946106-11.el5
rh-qpid-cpp-tests-0.7.946106-26.el5

Comment 8 Alan Conway 2011-01-14 15:17:36 UTC
How do you reproduce the result in comment 6? The cases in the Description and comment 4 seem to be OK.

Comment 9 Frantisek Reznicek 2011-01-17 12:00:44 UTC
Hi Alan,
I run the original scenario:
4-node cluster,
every node loops connection clients (checking whether c++/python/qmf c++/qmf python is able to connect and disconnect to /from broker) on top of that tsxtest/perftest/qpid-stat|qpid-cluster are looped on specific node to create transient / durable traffic (approx. 50 / 50 %)

in fact exactly the same scenario as described here:
https://bugzilla.redhat.com/show_bug.cgi?id=623660#c13


I have not tried comment's 4 qpid-cpp-benchmark yet.

I'd like to know:
- Is the change of log level for such messages a strict requirement? (see comment 3)
- It is proved that broker now produces much less messages (factor >10^4)
  -26 build produced about N * 800 warnings (N ~ cluster width)
  -12 build produced about N * 12M warnings
  Is this improvement enough?
- How to treat the task late by more than second?
  I think there is link/dependency to missing flow control:
  https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27
  (which is not implemented yet right?)

Comment 10 Alan Conway 2011-01-17 14:06:30 UTC
Responses to comment 9

- Is the change of log level for such messages a strict requirement? (see
comment 3)

The fix reduces the warnings to at most 1 every 5 seconds, which should avoid excessive flooding of the logs but at the same time leaves the warning visible.

- It is proved that broker now produces much less messages (factor >10^4)
  -26 build produced about N * 800 warnings (N ~ cluster width)
  -12 build produced about N * 12M warnings
  Is this improvement enough?

I think so - as above its now once every 5 seconds. We could make that interval longer.

- How to treat the task late by more than second?
  I think there is link/dependency to missing flow control:
  https://bugzilla.redhat.com/show_bug.cgi?id=623660#c27
  (which is not implemented yet right?)

Right. That probably is the source of the overrun. I haven't had time to get back to that yet.

Comment 11 Frantisek Reznicek 2011-01-19 09:08:59 UTC
Hi Alan,
thanks for your comments.

I agree that logging is improved enough and log message flooding is avoided.

There are, however, two other linked facts which should be done for this BZ:
- there is suggestion to change logging level for such messages (at very least which I believe is not strict requirement as the main goal is reached and flooding avoided)
- there are few values of delays mentioned in the description which are currently violated in some cases:

  Expected results: 

    moderate number of warnings, delays of <10ms, maybe one or two in 
    100s of ms. Certainly nothing over 1000ms.

  
  Would you agree to let solve such problems (overrun timing) in bug 623660 or in possibly new BZ apart of this defect, please?

Once we agree on the summarized two points, I'm ready to finish this defect.

Comment 12 Alan Conway 2011-01-19 14:23:47 UTC
Yes, let's leave the overrun timing issue to bug 623660 and close this one.

Comment 13 Frantisek Reznicek 2011-01-20 09:38:22 UTC
The issue has been resolved.
The flood of overrun messages is now very significantly reduced.

Tested on RHEL 4.8 / 5.6 i386 / x86_64 on packages:
[root@mrg-qe-02 bz656488]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-14.el5.noarch
qpid-cpp-client-0.7.946106-26.el5.x86_64
qpid-cpp-client-devel-0.7.946106-26.el5.x86_64
qpid-cpp-client-devel-docs-0.7.946106-26.el5.x86_64
qpid-cpp-client-rdma-0.7.946106-26.el5.x86_64
qpid-cpp-client-ssl-0.7.946106-26.el5.x86_64
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5.x86_64
qpid-cpp-server-0.7.946106-26.el5.x86_64
qpid-cpp-server-cluster-0.7.946106-26.el5.x86_64
qpid-cpp-server-devel-0.7.946106-26.el5.x86_64
qpid-cpp-server-rdma-0.7.946106-26.el5.x86_64
qpid-cpp-server-ssl-0.7.946106-26.el5.x86_64
qpid-cpp-server-store-0.7.946106-26.el5.x86_64
qpid-cpp-server-xml-0.7.946106-26.el5.x86_64
qpid-dotnet-0.4.738274-2.el5.x86_64
qpid-java-client-0.7.946106-14.el5.noarch
qpid-java-common-0.7.946106-14.el5.noarch
qpid-java-example-0.7.946106-14.el5.noarch
qpid-tests-0.7.946106-1.el5.noarch
qpid-tools-0.7.946106-11.el5.noarch
rh-qpid-cpp-tests-0.7.946106-26.el5.x86_64

-> VERIFIED

Comment 14 Alan Conway 2011-01-26 19:48:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: in some circumstances the broker generated a large number of timer late/overrun warnings. 
Consequence: The warnings themselves are usually inoffensive but the time to log them individually caused long (up to several seconds) delays and bloated log files.
Fix: Timer warnings are batched and logged periodically
Result: Large numbers of timer late/overrun events only create a controlled number of warnings in the log.

Comment 15 errata-xmlrpc 2011-02-15 12:11:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html


Note You need to log in before you can comment on or make changes to this bug.