Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 623660

Summary: clustered qpidd resolves 'Channel exception: not-attached: Channel 1 is not attached' extremely slowly
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: beta
Reporter: Frantisek Reznicek <freznice>
Assignee: messaging-bugs <messaging-bugs>
QA Contact: Messaging QE <messaging-qe-bugs>
Status: CLOSED UPSTREAM
Severity: high
Priority: medium
CC: gsim, rrajaram
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Doc Type: Bug Fix
Doc Text:
Cause: Nodes in a cluster are unbalanced such that one node runs significantly faster than another. This could be because nodes use different hardware, or because of other CPU- or IO-intensive processes running continuously on some nodes but not others. Consequence: Under continuous heavy load, the slower nodes may lag behind the cluster and become slow to respond to their clients. Workaround: Use similar hardware for cluster nodes and avoid other long-running CPU/IO-intensive tasks on cluster nodes. Occasional spikes of activity are not a problem. The problem only occurs when qpidd is heavily loaded for a long period (several hours) on a continuously unbalanced cluster.
Last Closed: 2025-02-10 03:13:31 UTC
Bug Depends On: 579127    
Bug Blocks:    
Attachments:
The tool for reporting duration of channel exception resolution (flags: none)

Description Frantisek Reznicek 2010-08-12 13:11:47 UTC
Description of problem:

A clustered broker often needs to resolve
  2010-08-09 16:58:26 error Channel exception: not-attached: Channel 1 is not attached

errors, which can occur due to malfunction or crash of the client(s) attached to the broker nodes.

This situation is more common than we realize, due to the fact that QMF clients are very slow to connect and/or exchange data with the broker (linked to bug 581006).

Consider the situation that occurred in a four-node cluster under load: every cluster node runs two different clients in parallel, loading the cluster that way. The clients are perftest, tsxtest, qpid-cluster, qpid-stat -b, qpid-config, and simple connection clients (qc_client) in C++ / Python.

The runners (programs which launch the clients against the broker) are configured so that if a particular client stops responding or hangs, then after approximately ten times the average client run duration the script kills the client and continues looping.

This kill/cleanup effect creates a cluster environment which produces the 'Channel not attached' exceptions above.

The above exception needs to be resolved with all cluster members so that they are in sync with the fact that the specific client disappeared. This should happen automatically within a few seconds, depending on network speed, CPU load, etc.


The detected issue shows that the 'error Channel exception: not-attached:' exception is not resolved quickly; in some cases these exceptions are resolved only after a few hours, which indicates a problem.

Degraded performance is the first logical consequence of the problem described above.


Version-Release number of selected component (if applicable):
python-qmf-0.7.946106-8.el5
python-qpid-0.7.946106-11.el5
qmf-0.7.946106-11.el5
qmf-devel-0.7.946106-11.el5
qpid-cpp-*-0.7.946106-11.el5
qpid-java-*-0.7.946106-7.el5
qpid-tools-0.7.946106-8.el5
ruby-qmf-0.7.946106-11.el5


How reproducible:
quite rapidly

Steps to Reproduce:
See the steps in bug 612458 and monitor the broker logs to see how soon the exceptions are resolved.
The key is to run the clients under a timeout of, for example, 3-5 minutes.
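As an illustration only (the time limit value and the `sleep` stand-in for a real client are made up for the demo), coreutils timeout is one way to enforce such a per-client limit:

```shell
# Run a client under a hard time limit; timeout sends SIGTERM when it expires.
# 'sleep 5' stands in for a real client; a 1-second limit keeps the demo short
# (in the reproducer this would be on the order of 180-300 seconds).
timeout -s TERM 1 sleep 5
echo "exit=$?"   # exit status 124 means the timeout fired
```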

Actual results:
Channel exceptions are getting resolved incredibly slowly.

Expected results:
Channel exceptions should get resolved much more quickly.

Additional info:

The part of the broker log with one process of exception resolving (watch the timestamps):
  2010-08-09 16:58:16 warning Timer callback overran by 2ms [taking 3848ns]
  2010-08-09 16:58:24 debug cluster(10.34.45.12:29768 READY) closed connection 10.34.33.62:52775(10.34.45.10:1629-58 shadow)
  2010-08-09 16:58:26 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-09 16:58:26 debug cluster(10.34.45.12:29768 READY/error) channel error 1084918 on 10.34.33.63:35407(10.34.45.11:22967-67 shadow) must be resolved with: 10.34.45.9:29034 10.34.45.10:1629 10.34.45.11:22967 10.34.45.12:29768 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-09 16:58:26 debug cluster(10.34.45.12:29768 READY/error) error 1084918 resolved with 10.34.45.9:29034
  2010-08-09 16:58:26 debug cluster(10.34.45.12:29768 READY/error) error 1084918 must be resolved with 10.34.45.10:1629 10.34.45.11:22967 10.34.45.12:29768
  2010-08-09 16:58:26 debug cluster(10.34.45.12:29768 READY/error) error 1084918 resolved with 10.34.45.12:29768
  2010-08-09 16:58:26 debug cluster(10.34.45.12:29768 READY/error) error 1084918 must be resolved with 10.34.45.10:1629 10.34.45.11:22967
  2010-08-09 16:58:40 warning Timer callback overran by 2ms [taking 3941ns]
  ...
  2010-08-09 19:51:08 debug cluster(10.34.45.12:29768 READY/error) error 1084918 resolved with 10.34.45.11:22967
  2010-08-09 19:51:08 debug cluster(10.34.45.12:29768 READY/error) error 1084918 must be resolved with 10.34.45.10:1629
  2010-08-09 19:51:14 warning Timer callback overran by 2ms [taking 4385ns]
  2010-08-09 19:51:14 warning Timer callback overran by 2ms [taking 3916ns]
  2010-08-09 19:51:26 debug cluster(10.34.45.12:29768 READY/error) local connection 10.34.33.65:33361(10.34.45.12:29768-383 local)
  2010-08-09 19:51:33 debug cluster(10.34.45.12:29768 READY/error) local connection 127.0.0.1:50500(10.34.45.12:29768-384 local)
  2010-08-09 19:51:38 warning Timer callback overran by 2ms [taking 5947ns]
  2010-08-09 19:51:38 warning Timer callback overran by 2ms [taking 5543ns]
  2010-08-09 19:51:50 warning Timer callback overran by 2ms [taking 4313ns]
  2010-08-09 19:52:14 warning Timer callback overran by 2ms [taking 3802ns]
  2010-08-09 19:52:14 warning Timer callback overran by 2ms [taking 4533ns]
  2010-08-09 19:52:33 debug cluster(10.34.45.12:29768 READY/error) local connection 127.0.0.1:50501(10.34.45.12:29768-385 local)
  2010-08-09 19:52:38 warning Timer callback overran by 2ms [taking 4637ns]
  2010-08-09 19:52:38 warning Timer callback overran by 2ms [taking 3974ns]
  2010-08-09 19:53:02 warning Timer callback overran by 2ms [taking 4000ns]
  2010-08-09 19:53:36 debug cluster(10.34.45.12:29768 READY/error) local connection 127.0.0.1:50502(10.34.45.12:29768-386 local)
  2010-08-09 19:53:39 warning Timer callback overran by 2ms [taking 4071ns]
  2010-08-09 19:53:39 warning Timer callback overran by 2ms [taking 3928ns]
  2010-08-09 19:53:42 debug cluster(10.34.45.12:29768 READY/error) error 1084918 resolved with 10.34.45.10:1629
  2010-08-09 19:53:42 debug cluster(10.34.45.12:29768 READY) error 1084918 resolved.

Broker config:
[root@mrg-qe-10 fcluster]# tail /etc/qpidd.conf
#
# (Note: no spaces on either side of '='). Using default settings:
# "qpidd --help" or "man qpidd" for more details.
cluster-mechanism=ANONYMOUS
auth=no
log-to-file=/tmp/qpidd.log
log-enable=info+
log-enable=debug+:cluster
cluster-name=fcluster

Comment 2 Frantisek Reznicek 2010-08-13 07:48:10 UTC
Created attachment 438615 [details]
The tool for reporting duration of channel exception resolution

The attachment makes it possible to measure the channel exception resolution time.

Usage:

gawk -f qpidd_channel_error_duration_reporter.awk /tmp/qpidd.log | \
  sort -nr -k 2 > qpidd_channel_error_duration_reporter.log
[root@mrg-qe-12 fcluster]# head -3 qpidd_channel_error_duration_reporter.log
197589822 10862 (17:58:41->20:59:43)
197591234 53 (20:59:43->21:00:36)
198951004 51 (21:07:52->21:08:43)

# ^ ID    ^ duration[secs]
#             ^ timestamps, to double check

Note:
The attached tool is very simple and does not handle timestamps that cross from one day into the next; that is why you need to double-check the timestamps.
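A minimal sketch of the missing day-rollover handling (the helper name and sample timestamps are illustrative, not part of the attached tool): when the resolution time of day is smaller than the error time of day, the interval crossed midnight, so 86400 seconds must be added.

```shell
# Convert HH:MM:SS to seconds since midnight, then correct negative durations.
to_secs() { echo "$1" | awk -F: '{print $1*3600 + $2*60 + $3}'; }

start=$(to_secs "23:58:10")    # error logged just before midnight
end=$(to_secs "00:03:05")      # resolved just after midnight
dur=$(( end - start ))
if [ "$dur" -lt 0 ]; then dur=$(( dur + 86400 )); fi
echo "duration=${dur}s"        # duration=295s
```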

Comment 3 Frantisek Reznicek 2010-08-13 08:08:33 UTC
The issue was retested with results detailed below.

Another long case was seen in which the exception was resolved only after about 3 hours (one occurrence); the other cases were resolved much more quickly, some within a few seconds, some within a few tens of seconds.

Simple statistics are also summarized below:


gawk -f cherror_dur.awk /tmp/qpidd.log | sort -nr -k 2 > cherror_dur.log
head cherror_dur.log
wc -l cherror_dur.log
awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
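For reference, the four per-bin passes above can be collapsed into a single awk pass; this is just an equivalent sketch over a made-up sample file (/tmp/durs.log and its contents are not part of the bug data):

```shell
# Count durations (column 2) into 10-second-wide bins in a single pass.
printf '%s\n' 'id1 5' 'id2 15' 'id3 25' 'id4 25' 'id5 45' > /tmp/durs.log
awk '{ bin = int(($2+0)/10)*10; cnt[bin]++ }
     END { for (b in cnt) print b "-" b+9 "s: " cnt[b] }' /tmp/durs.log | sort -n
```

With the sample data this prints one line per occupied bin, e.g. `20-29s: 2`.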


[09-first in cluster]
  [root@mrg-qe-09 fcluster]# head cherror_dur.log
  197589822 10865 (17:58:38->20:59:43)
  225824437 56 (07:36:39->07:37:35)
  200130231 53 (21:21:10->21:22:03)
  229187039 50 (08:53:47->08:54:37)
  224992378 50 (07:10:03->07:10:53)
  219963899 49 (04:55:22->04:56:11)
  213121023 48 (01:44:49->01:45:37)
  207347485 45 (23:08:37->23:09:22)
  205867402 45 (22:31:41->22:32:26)
  204581138 45 (22:00:29->22:01:14)
  [root@mrg-qe-09 fcluster]# wc -l cherror_dur.log
  1334 cherror_dur.log
  [root@mrg-qe-09 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  44
  [root@mrg-qe-09 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  84
  [root@mrg-qe-09 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  148
  [root@mrg-qe-09 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  33

[10-second in cluster]
  [root@mrg-qe-10 fcluster]# head cherror_dur.log
  197589822 10863 (17:58:40->20:59:43)
  197591234 53 (20:59:43->21:00:36)
  13459025 29 (16:59:58->17:00:27)
  140739016 26 (17:39:05->17:39:31)
  93301270 25 (17:24:37->17:25:02)
  28830231 25 (17:05:11->17:05:36)
  155411800 25 (17:43:46->17:44:11)
  115423132 24 (17:31:19->17:31:43)
  196914882 23 (17:56:23->17:56:46)
  54510456 21 (17:12:52->17:13:13)
  [root@mrg-qe-10 fcluster]# wc -l cherror_dur.log
  1334 cherror_dur.log
  [root@mrg-qe-10 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  76
  [root@mrg-qe-10 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  11
  [root@mrg-qe-10 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  0
  [root@mrg-qe-10 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  2

[11-third in cluster]
  [root@mrg-qe-11 fcluster]# head cherror_dur.log
  225824437 38 (07:36:02->07:36:40)
  219963899 27 (04:54:50->04:55:17)
  200130231 24 (21:20:40->21:21:04)
  224992378 23 (07:09:35->07:09:58)
  213121023 23 (01:44:19->01:44:42)
  228312913 22 (08:34:51->08:35:13)
  204581138 22 (21:59:58->22:00:20)
  226224279 21 (07:50:13->07:50:34)
  214836440 21 (02:41:16->02:41:37)
  213955315 21 (02:15:25->02:15:46)
  [root@mrg-qe-11 fcluster]# wc -l cherror_dur.log
  1334 cherror_dur.log
  [root@mrg-qe-11 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  178
  [root@mrg-qe-11 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  22
  [root@mrg-qe-11 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  1
  [root@mrg-qe-11 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  0
  [root@mrg-qe-11 fcluster]# grep 197589822 cherror_dur.log
  197589822 0 (20:58:49->20:58:49)


[12-fourth in cluster]
  [root@mrg-qe-12 fcluster]# head cherror_dur.log
  197589822 10862 (17:58:41->20:59:43)
  197591234 53 (20:59:43->21:00:36)
  198951004 51 (21:07:52->21:08:43)
  197602148 47 (21:03:59->21:04:46)
  197609774 40 (21:05:20->21:06:00)
  13459025 29 (16:59:58->17:00:27)
  140739016 26 (17:39:05->17:39:31)
  93301270 25 (17:24:37->17:25:02)
  28830231 25 (17:05:11->17:05:36)
  155411800 25 (17:43:46->17:44:11)
  [root@mrg-qe-12 fcluster]# wc -l cherror_dur.log
  1334 cherror_dur.log
  [root@mrg-qe-12 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  10
  [root@mrg-qe-12 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  10
  [root@mrg-qe-12 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  0
  [root@mrg-qe-12 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
  5



The above results show an interesting fact:

The single extremely long channel exception occurs on mrg-qe-11 only at 20:58:49, but the other nodes have known about it for hours (from about 17:58:40).


The cluster engine (openais) does not show any problems in the log file.

Comment 4 Frantisek Reznicek 2010-08-13 08:31:50 UTC
After a re-run I performed, I found another interesting fact which might be related, and that is why I believe it is worth mentioning here (although it might be a different issue).

The description details that clients run on the cluster are killed; the default kill signal (SIGTERM) is used for this purpose. Surprisingly, the clients did not exit after SIGTERM and hung in qpid::client::LocalQueueImpl::get() -> pop() -> pthread_cond_wait@@GLIBC_2.3.2() -> __kernel_vsyscall().

So for instance qpid-perftest after being killed stays here:
(*): Shared library is missing debugging information.
(gdb)   4 Thread 0xb7fadb90 (LWP 12958)  0x00411410 in __kernel_vsyscall ()
  3 Thread 0xb69feb90 (LWP 15550)  0x00411410 in __kernel_vsyscall ()
  2 Thread 0xb55fcb90 (LWP 15791)  0x00411410 in __kernel_vsyscall ()
* 1 Thread 0xb7fae940 (LWP 12955)  0x00411410 in __kernel_vsyscall ()
(gdb)
Thread 4 (Thread 0xb7fadb90 (LWP 12958)):
#0  0x00411410 in __kernel_vsyscall ()
#1  0x005cd486 in epoll_wait () from /lib/libc.so.6
#2  0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
    at qpid/sys/epoll/EpollPoller.cpp:563
#3  0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
    at qpid/sys/epoll/EpollPoller.cpp:515
#4  0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
    at qpid/sys/posix/Thread.cpp:35
#5  0x0039b832 in start_thread () from /lib/libpthread.so.0
#6  0x005cce0e in clone () from /lib/libc.so.6

Thread 3 (Thread 0xb69feb90 (LWP 15550)):
#0  0x00411410 in __kernel_vsyscall ()
#1  0x005cd486 in epoll_wait () from /lib/libc.so.6
#2  0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
    at qpid/sys/epoll/EpollPoller.cpp:563
#3  0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
    at qpid/sys/epoll/EpollPoller.cpp:515
#4  0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
    at qpid/sys/posix/Thread.cpp:35
#5  0x0039b832 in start_thread () from /lib/libpthread.so.0
#6  0x005cce0e in clone () from /lib/libc.so.6

Thread 2 (Thread 0xb55fcb90 (LWP 15791)):
#0  0x00411410 in __kernel_vsyscall ()
#1  0x005cd486 in epoll_wait () from /lib/libc.so.6
#2  0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
    at qpid/sys/epoll/EpollPoller.cpp:563
#3  0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
    at qpid/sys/epoll/EpollPoller.cpp:515
#4  0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
    at qpid/sys/posix/Thread.cpp:35
#5  0x0039b832 in start_thread () from /lib/libpthread.so.0
#6  0x005cce0e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb7fae940 (LWP 12955)):
#0  0x00411410 in __kernel_vsyscall ()
#1  0x0039fbc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x005d94dd in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3  0x00bbaf7a in pop (this=0xb74f5e28, result=..., timeout=...)
    at ../include/qpid/sys/posix/Condition.h:63
#4  qpid::client::LocalQueueImpl::get (this=0xb74f5e28, result=...,
    timeout=...) at qpid/client/LocalQueueImpl.cpp:49
#5  0x00bbbbee in qpid::client::LocalQueueImpl::get (this=0xb74f5e28,
    timeout=...) at qpid/client/LocalQueueImpl.cpp:40
#6  0x00bbbdd7 in qpid::client::LocalQueueImpl::pop (this=0xb74f5e28,
    timeout=...) at qpid/client/LocalQueueImpl.cpp:36
#7  0x00bb87f9 in qpid::client::LocalQueue::pop (this=0xbfade9a4, timeout=...)
    at qpid/client/LocalQueue.cpp:43
#8  0x08059ad3 in qpid::tests::Controller::process(size_t, qpid::client::LocalQueue, std::string, boost::function<void ()(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&),std::allocator<void> >) (
    this=0xbfadeae8, n=1, lq=..., queue=..., msgFn=...)
    at qpid-perftest.cpp:382
#9  0x0805b702 in qpid::tests::Controller::run (this=0xbfadeae8)
    at qpid-perftest.cpp:423
#10 0x0805352f in main (argc=Cannot access memory at address 0x80
) at qpid-perftest.cpp:719
(gdb) Detaching from program: /usr/bin/qpid-perftest, process 12955
(gdb) quit

Comment 5 Frantisek Reznicek 2010-08-13 08:32:42 UTC
I'm going to retest on -12 with SIGTERM followed by SIGKILL...
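A minimal sketch of that SIGTERM-then-SIGKILL escalation (the function name and grace period are illustrative, not taken from the actual test runner):

```shell
# Send SIGTERM; if the process survives the grace period, force SIGKILL.
term_then_kill() {
  pid=$1; grace=${2:-10}
  kill -TERM "$pid" 2>/dev/null
  i=0
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1; i=$((i+1))
  done
  kill -KILL "$pid" 2>/dev/null              # still alive: force it
}

# Demo: a child that ignores SIGTERM, so only the SIGKILL removes it.
sh -c 'trap "" TERM; sleep 60' & victim=$!
sleep 1                                      # let the child install its trap
term_then_kill "$victim" 2
wait "$victim"
echo "exit=$?"                               # 137 = 128 + SIGKILL(9)
```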

Comment 6 Frantisek Reznicek 2010-08-13 14:52:52 UTC
Testing the updated -12 build shows no such issue so far:
  [root@mrg-qe-09 fcluster]# gawk -f cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
  41768242 53 (13:30:06->13:30:59)
  46267278 48 (16:04:19->16:05:07)
  46546959 46 (16:12:53->16:13:39)
  46344840 44 (16:07:13->16:07:57)
  42801369 44 (14:12:45->14:13:29)
  45834192 43 (15:50:02->15:50:45)
  47686762 42 (16:47:33->16:48:15)
  47463278 42 (16:42:02->16:42:44)
  45613620 41 (15:44:11->15:44:52)
  43352306 41 (14:32:00->14:32:41)
  47251459 40 (16:36:19->16:36:59)
  46908472 40 (16:24:43->16:25:23)
  46085609 40 (15:58:43->15:59:23)
  43549886 40 (14:39:17->14:39:57)
  47097462 39 (16:30:46->16:31:25)
  46703165 39 (16:18:50->16:19:29)
  45193575 39 (15:32:34->15:33:13)
  47015123 38 (16:27:39->16:28:17)
  44555018 38 (15:15:04->15:15:42)
  41772500 38 (13:39:57->13:40:35)
  [root@mrg-qe-09 fcluster]# ssh mrg-qe-10
  [root@mrg-qe-10 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
  45622496 20 (15:45:07->15:45:27)
  46344840 16 (16:07:41->16:07:57)
  41770771 16 (13:36:41->13:36:57)
  41768242 16 (13:30:44->13:31:00)
  45291764 15 (15:36:06->15:36:21)
  44748414 15 (15:21:21->15:21:36)
  46546959 13 (16:13:26->16:13:39)
  45834192 13 (15:50:32->15:50:45)
  45613620 13 (15:44:39->15:44:52)
  45193575 13 (15:33:00->15:33:13)
  44242598 13 (15:04:17->15:04:30)
  42801369 13 (14:13:16->14:13:29)
  47463278 12 (16:42:32->16:42:44)
  46085609 12 (15:59:11->15:59:23)
  47015123 11 (16:28:06->16:28:17)
  45932750 11 (15:53:24->15:53:35)
  42297367 11 (14:03:38->14:03:49)
  47573383 10 (16:45:20->16:45:30)
  47251459 10 (16:36:49->16:36:59)
  46636154 10 (16:16:28->16:16:38)
  [root@mrg-qe-10 ~]# ssh mrg-qe-11
  [root@mrg-qe-11 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
  41770771 70 (13:34:52->13:36:02)
  41768242 67 (13:28:57->13:30:04)
  43549886 65 (14:37:56->14:39:01)
  42801369 65 (14:11:29->14:12:34)
  43162564 63 (14:22:23->14:23:26)
  41777630 62 (13:45:05->13:46:07)
  43777481 61 (14:44:14->14:45:15)
  43188933 61 (14:23:27->14:24:28)
  42297367 61 (14:01:50->14:02:51)
  42021685 61 (13:47:57->13:48:58)
  41772500 61 (13:38:39->13:39:40)
  43909823 60 (14:47:23->14:48:23)
  43285492 60 (14:28:00->14:29:00)
  42609812 60 (14:03:57->14:04:57)
  42245469 60 (13:58:51->13:59:51)
  42177079 60 (13:54:29->13:55:29)
  41774499 60 (13:41:45->13:42:45)
  41769288 60 (13:31:49->13:32:49)
  43352306 59 (14:30:47->14:31:46)
  43638814 58 (14:40:48->14:41:46)
  [root@mrg-qe-11 ~]# ssh mrg-qe-12
  [root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
  45630787 19 (15:45:10->15:45:29)
  45309933 11 (15:36:27->15:36:38)
  41779772 11 (13:47:57->13:48:08)
  46910684 7 (16:25:37->16:25:44)
  46273371 7 (16:05:09->16:05:16)
  41772649 7 (13:41:01->13:41:08)
  44660918 5 (15:18:52->15:18:57)
  46793447 4 (16:22:28->16:22:32)
  46447328 4 (16:10:51->16:10:55)
  44993250 4 (15:27:45->15:27:49)
  43188933 4 (14:25:19->14:25:23)
  43007176 4 (14:19:21->14:19:25)
  41774520 4 (13:43:44->13:43:48)
  45392977 3 (15:39:15->15:39:18)
  43213503 3 (14:26:33->14:26:36)
  41772573 3 (13:40:44->13:40:47)
  44111902 2 (14:58:58->14:59:00)
  42899508 2 (14:16:19->14:16:21)
  42297368 2 (14:03:47->14:03:49)
  47692009 1 (16:48:16->16:48:17)

Comment 7 Frantisek Reznicek 2010-08-13 15:06:58 UTC
The hanging client described above in comment 4 is now being retested, because the client seems not to receive the SIGTERM, only the parent does, and the kernel seems to have moved the client under the init process; more data should be available next week.

Comment 8 Alan Conway 2010-08-16 14:57:21 UTC
I ran the following on a 4-node cluster over the weekend, 64 hours so far with no error. top shows the qpidd processes are stable at between 700M and 1024M.

node1: info_thread.sh
node2: tsxtest_thread.sh
node3&4: perftest_thread.sh

No core files, grep 'ecode:[^0]' in logs found nothing, tail of logs looks normal.

This was with trunk r985356 which corresponds to release tags
 mrg1.3-beta5
 qpid-cpp-mrg-0.7.946106-12

Comment 9 Frantisek Reznicek 2010-08-17 09:36:09 UTC
Hello Alan,
I performed the retest over the weekend, and I need to say that the 4-node cluster still behaves such that one channel exception takes an extremely long time to be resolved:
[root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 16080 (17:33:36->22:01:36)
4032459338 113 (01:33:53->01:35:46)
4059762049 102 (02:00:50->02:02:32)
2721094641 99 (00:26:09->00:27:48)
934589368 92 (15:05:54->15:07:26)

[root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 16078 (17:33:37->22:01:35)
16752199 73 (17:19:49->17:21:02)
43067894 67 (22:25:14->22:26:21)
16958035 67 (17:29:59->17:31:06)
16950935 67 (17:22:57->17:24:04)

[root@mrg-qe-11 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 1175 (21:41:05->22:00:40)
4059762049 55 (02:00:37->02:01:32)
3048658051 40 (06:41:10->06:41:50)
2484491772 40 (19:54:10->19:54:50)
3581085133 38 (17:51:01->17:51:39)

[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
157871336 19 (00:28:05->00:28:24)
63413322 17 (07:21:48->07:22:05)
2683434374 16 (23:20:38->23:20:54)
934585372 12 (14:57:47->14:57:59)
4103141964 12 (03:02:52->03:03:04)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | grep "17305969 "
17305969 1 (22:01:35->22:01:36)

Time synchronization is as follows:
[root@mrg-qe-09 fcluster]# date
Tue Aug 17 11:34:46 CEST 2010
[root@mrg-qe-10 fcluster]# date
Tue Aug 17 11:34:47 CEST 2010
[root@mrg-qe-11 ~]# date
Tue Aug 17 11:33:47 CEST 2010
[root@mrg-qe-12 ~]# date
Tue Aug 17 11:34:48 CEST 2010


The issue with clients not being correctly killed was a test issue which is now fixed, so I can guarantee for the above results that at most 10 clients are running against the cluster at any moment (timed-out clients are killed).


[root@mrg-qe-12 ~]# rpm -qa | grep -E '(qpid|qmf)' | sort
python-qmf-0.7.946106-8.el5
python-qpid-0.7.946106-12.el5
qmf-0.7.946106-12.el5
qmf-devel-0.7.946106-12.el5
qpid-cpp-client-*0.7.946106-12.el5
qpid-cpp-server-*0.7.946106-12.el5
qpid-java-*-0.7.946106-7.el5
qpid-tools-0.7.946106-8.el5
ruby-qmf-0.7.946106-12.el5


I believe that the issue is still there; that is why I am moving the bug back to ASSIGNED.

Comment 10 Frantisek Reznicek 2010-08-17 10:07:17 UTC
Additional note:
The clustered broker is configured as a persistent cluster.

The clients are:
1] tsxtest looping
2] perftest to localhost using unique queue names (*) looping
3] perftest to randomly taken cluster node using unique queue names (*) looping
4] qmf qpid-tools clients run in the loop against localhost and after that against all cluster nodes (qpid-cluster; qpid-config; qpid-stat -b) looping
5a] connection clients in python and c++ using old API connection (connect & disconnect) looping
5b] qmf python console client doing just connect and disconnect looping


Nodes are 9...12 and the above actions are assigned in the following way:
9 ... 5a+b] + 1]
10 ... 5a+b] + 4]
11 ... 5a+b] + 2]
12 ... 5a+b] + 3]

Comment 11 Alan Conway 2010-08-17 13:10:50 UTC
Comment #2 mentions an issue with clients being killed in the test scripts. Can you attach the latest test scripts with this issue fixed?

Comment 12 Alan Conway 2010-08-17 14:24:45 UTC
Could you also attach the logs from the latest reproduction of the bug?

Comment 14 Frantisek Reznicek 2010-08-19 09:28:08 UTC
Currently retesting again with a special set of packages (-12 plus Alan's logging patch) to be able to provide more detailed information about what is happening there.

The issue seems to be hardware dependent.

The machine specs where the issue is observed are:
[root@mrg-qe-12 ~]# mrg_get_cpu_info
mrg_get_cpu_info():CPU information:
processor       : 0 1 2 3 4 5 6 7
vendor_id       : AuthenticAMD
model name      : Quad-Core AMD Opteron(tm) Processor 2376
cpu MHz         : 2294.411
cpu cores       : 4
bogomips        : 4588.82 4588.49 4588.49 4588.47 4588.51 4588.46 4588.43 4588.49

Comment 15 Frantisek Reznicek 2010-08-20 07:47:50 UTC
Retesting with Alan's 0001-DEBUG-logging-for-slow-error-resolution-issue.patch, recompiled on the target machines, suppressed the issue.
There are two possible reasons:
a] the patch changed the timing so much that the issue went away
b] compilation (following exactly the same steps as the normal build) on the target AMD machines produces slightly different code which suppresses the issue

Next steps:
A new patch (0001-WIP-flow-control-limit-total-unprocessed-events.patch), which is believed to fix the core of the issue, is going to be applied and the test run over the weekend.
If the tests pass as they do now, -12 still needs to be rebuilt without any additional patches on AMD to confirm that point b] above is not true.

Comment 16 Frantisek Reznicek 2010-08-20 10:14:01 UTC
The test with 0001-WIP-flow-control-limit-total-unprocessed-events.patch produces far fewer channel exceptions (all resolved within 20 seconds) than the previous case, and more rapidly exposes another problem which might be linked.

After a couple of minutes, all clients trying to attach to the broker time out. I believe we need to evaluate this together with the channel exception issue.

For instance, the simple connect/disconnect client (qc_client) gets killed after 60 seconds of trying to connect to one of the brokers. The usual C++ client state at the moment the test kills the client is:
(*): Shared library is missing debugging information.
(gdb)   2 Thread 0x40b64940 (LWP 10395)  0x00002b13f0325108 in epoll_wait ()
   from /lib64/libc.so.6
* 1 Thread 0x2b13f314f5b0 (LWP 10392)  0x00002b13f14f4ee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb)
Thread 2 (Thread 0x40b64940 (LWP 10395)):
#0  0x00002b13f0325108 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b13f06d22b1 in qpid::sys::Poller::wait (this=0xc8d06c0,
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x00002b13f06d2d47 in qpid::sys::Poller::run (this=0xc8d06c0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x00002b13f06c8f9a in qpid::sys::(anonymous namespace)::runRunnable (p=0x5)
    at qpid/sys/posix/Thread.cpp:35
#4  0x00002b13f14f073d in start_thread () from /lib64/libpthread.so.0
#5  0x00002b13f0324d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b13f314f5b0 (LWP 10392)):
#0  0x00002b13f14f4ee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002b13ef87979d in wait (this=0xc8d84a0)
    at ../include/qpid/sys/posix/Condition.h:63
#2  wait (this=0xc8d84a0) at ../include/qpid/sys/Monitor.h:41
#3  qpid::sys::Waitable::wait (this=0xc8d84a0) at qpid/sys/Waitable.h:88
#4  0x00002b13ef874b09 in waitFor (this=0xc8d8430, timeout=0)
    at qpid/sys/StateMonitor.h:65
#5  waitFor (this=0xc8d8430, timeout=0) at qpid/client/SessionImpl.cpp:769
#6  qpid::client::SessionImpl::open (this=0xc8d8430, timeout=0)
    at qpid/client/SessionImpl.cpp:109
#7  0x00002b13ef841598 in qpid::client::ConnectionImpl::newSession (
    this=0xc8d0e60, name=..., timeout=0, channel=65535)
    at qpid/client/ConnectionImpl.cpp:437
#8  0x00002b13ef832489 in qpid::client::Connection::newSession (
    this=0xc8d0af0, name=..., timeout=0) at qpid/client/Connection.cpp:141
#9  0x0000000000414eb7 in qc_lib::qc_client<qc_lib::qc_optparser_client>::connect (this=0x7fff475baca0) at qc_client.hpp:359
#10 0x0000000000408b05 in main (argc=3, argv=0x7fff475baff8)
    at qc_client.cpp:51
(gdb) Detaching from program: /root/fcluster/qc_client, process 10392
(gdb) quit


Looking at perftest at approximately the same moment, it times out here:


(gdb)   2 Thread 0xb7f05b90 (LWP 18268)  0x0049a410 in __kernel_vsyscall ()
* 1 Thread 0xb7f06940 (LWP 18265)  0x0049a410 in __kernel_vsyscall ()
(gdb)
Thread 2 (Thread 0xb7f05b90 (LWP 18268)):
#0  0x0049a410 in __kernel_vsyscall ()
#1  0x00a70486 in epoll_wait () from /lib/libc.so.6
#2  0x00583caa in qpid::sys::Poller::wait (this=0x9f597b0, timeout=...)
    at qpid/sys/epoll/EpollPoller.cpp:563
#3  0x005848d3 in qpid::sys::Poller::run (this=0x9f597b0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#4  0x00579d81 in qpid::sys::(anonymous namespace)::runRunnable (p=0x9f597b0)
    at qpid/sys/posix/Thread.cpp:35
#5  0x0014a832 in start_thread (arg=0xb7f05b90) at pthread_create.c:301
#6  0x00a6fe0e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb7f06940 (LWP 18265)):
#0  0x0049a410 in __kernel_vsyscall ()
#1  0x0014ebc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00a7c4dd in __pthread_cond_wait (cond=0x9f5a3a8, mutex=0x9f5a390)
    at forward.c:138
#3  0x002abd07 in wait (this=0x9f5a390)
    at ../include/qpid/sys/posix/Condition.h:63
#4  wait (this=0x9f5a390) at ../include/qpid/sys/Monitor.h:41
#5  qpid::sys::Waitable::wait (this=0x9f5a390) at qpid/sys/Waitable.h:88
#6  0x002a5878 in waitFor (this=0x9f5a358, timeout=0)
    at qpid/sys/StateMonitor.h:65
#7  waitFor (this=0x9f5a358, timeout=0) at qpid/client/SessionImpl.cpp:769
#8  qpid::client::SessionImpl::open (this=0x9f5a358, timeout=0)
    at qpid/client/SessionImpl.cpp:109
#9  0x0026c17d in qpid::client::ConnectionImpl::newSession (this=0x9f59b08,
    name=..., timeout=0, channel=65535) at qpid/client/ConnectionImpl.cpp:437
#10 0x0025df2e in qpid::client::Connection::newSession (this=0xbffaace8,
    name=..., timeout=0) at qpid/client/Connection.cpp:141
#11 0x080594f8 in qpid::tests::Client::Client (this=0xbffaace0)
    at qpid-perftest.cpp:248
#12 0x0805338b in main (argc=Cannot access memory at address 0x80
) at qpid-perftest.cpp:263

So both hang at the same call: qpid::client::SessionImpl::open() -> waitFor() -> qpid::sys::Waitable::wait().

I'm currently not able to tell whether this is a symptom of the applied patch; I was occasionally seeing this behavior on a clean -12 as well, I'm just not sure the client was in the same state at timeout/kill.
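As a rough illustration of the wait pattern visible in the backtrace (a sketch only; the class and method names are hypothetical, not the actual qpid::client code): open() blocks on a condition variable until the broker confirms the attach, and timeout=0 means wait forever, which is why a lagging broker leaves the client stuck in waitFor().

```python
import threading

class SessionState:
    """Sketch of the waitFor() pattern from the backtrace (hypothetical names).
    open() blocks until the broker's attach confirmation arrives."""
    def __init__(self):
        self.cond = threading.Condition()
        self.attached = False

    def on_attached(self):
        # Called from the network thread when the broker confirms the attach.
        with self.cond:
            self.attached = True
            self.cond.notify_all()

    def open(self, timeout=0):
        # timeout=0 mirrors the backtrace: wait indefinitely.
        with self.cond:
            if timeout:
                if not self.cond.wait_for(lambda: self.attached, timeout):
                    raise TimeoutError("session attach timed out")
            else:
                self.cond.wait_for(lambda: self.attached)
```

With a slow broker, on_attached() is simply never (or very late) invoked, so the client thread stays parked in the condition wait, matching the pthread_cond_wait frames above.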

Comment 17 Frantisek Reznicek 2010-08-23 07:02:29 UTC
The issue was retested with Alan's proposed potential patch 0001-WIP-flow-control-limit-total-unprocessed-events.patch, without applying the previous logging patch.

After a couple of minutes the test started to suffer from extremely long channel exception resolutions:

[root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 13346 (10:29:41->14:12:07)
10243005 29 (14:12:07->14:12:36)
95939 11 (10:24:19->10:24:30)
10243339 3 (14:12:46->14:12:49)
10246182 2 (14:14:42->14:14:44)

[root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 698 (14:00:29->14:12:07)
95939 1 (10:24:29->10:24:30)
10248322 1 (14:16:07->14:16:08)
10248295 1 (14:16:06->14:16:07)
10248265 1 (14:16:05->14:16:06)

[root@mrg-qe-11 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 13345 (10:28:36->14:11:01)
10243005 29 (14:11:01->14:11:30)
95939 4 (10:23:20->10:23:24)
10243339 3 (14:11:40->14:11:43)
10246391 2 (14:13:44->14:13:46)

[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10244299 2 (14:13:28->14:13:30)
10248295 1 (14:16:06->14:16:07)
10248266 1 (14:16:05->14:16:06)
10248248 1 (14:16:04->14:16:05)
10248110 1 (14:15:58->14:15:59)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | grep 10242677
10242677 0 (14:12:07->14:12:07)


This exercise confirmed the issue and proved that it is timing dependent (reverting/excluding the logging patch was able to suppress it).
Moreover, compilation on an AMD or Intel machine seems to produce the same code (see hypothesis in comment 15).
The proposed patch has an effect on channel exception resolution (all durations were reduced) but does not fix the issue.
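The attached cherror_dur.awkk tool itself is not reproduced here; a hypothetical Python equivalent that derives the same "id duration (start->end)" figures from the broker debug log might look like this (regexes are assumptions based on the log excerpts below):

```python
import re
from datetime import datetime

# First occurrence of a channel error, and its final resolution line.
START = re.compile(r'^(\S+ \S+) debug .*channel error (\d+) on')
END = re.compile(r'^(\S+ \S+) debug .*error (\d+) resolved\.$')

def error_durations(lines):
    """Map each error id to (seconds, start HH:MM:SS, end HH:MM:SS)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    starts, out = {}, {}
    for line in lines:
        m = START.search(line)
        if m:
            # Keep only the first sighting of each error id.
            starts.setdefault(m.group(2), datetime.strptime(m.group(1), fmt))
            continue
        m = END.search(line)
        if m and m.group(2) in starts:
            t0 = starts[m.group(2)]
            t1 = datetime.strptime(m.group(1), fmt)
            out[m.group(2)] = (int((t1 - t0).total_seconds()),
                               t0.strftime("%H:%M:%S"),
                               t1.strftime("%H:%M:%S"))
    return out
```

Note the awk tool apparently prints only HH:MM:SS, which explains the negative durations in the output when an error spans midnight.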

Comment 18 Frantisek Reznicek 2010-08-31 07:38:48 UTC
Just an update...

I was retesting because it seemed the issue was no longer reproducible, but I was able to trigger it again (tsxtest --flow 0, as suggested).

The data below show two extremely long exceptions:
  [root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3
  102690519 4321 (04:53:23->06:05:24)
  236649358 59 (08:48:13->08:49:12)
  66476831 56 (00:05:19->00:06:15)
  ...
  102690577 0 (06:05:26->06:05:26)
  102690521 0 (06:05:25->06:05:25)
  66476376 -76876 (21:26:35->00:05:19)
  [root@mrg-qe-09 fcluster]# grep 102690519 /tmp/qpidd.log
  2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 local) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.10:27131
  2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.9:7335
  2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 05:38:50 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.11:11734
  2010-08-31 05:38:50 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.9:7335 READY) error 102690519 resolved.
  [root@mrg-qe-09 fcluster]# grep 66476376 /tmp/qpidd.log
  2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.10:27131
  2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.9:7335
  2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 23:52:37 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.11:11734
  2010-08-30 23:52:37 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.9:7335 READY) error 66476376 resolved.


  [root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3
  102690519 19666 (00:37:38->06:05:24)
  66476831 56 (00:05:19->00:06:15)
  236650386 32 (08:52:03->08:52:35)
  ...
  102690577 0 (06:05:26->06:05:26)
  102690521 0 (06:05:25->06:05:25)
  66476376 -76875 (21:26:34->00:05:19)
  [root@mrg-qe-10 fcluster]# grep 102690519 /tmp/qpidd.log
  2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.10:27131
  2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 04:53:23 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.9:7335
  2010-08-31 04:53:23 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 05:38:50 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.11:11734
  2010-08-31 05:38:50 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.10:27131 READY) error 102690519 resolved.
  [root@mrg-qe-10 fcluster]# grep 66476376 /tmp/qpidd.log
  2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.10:27131
  2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 21:26:35 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.9:7335
  2010-08-30 21:26:35 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 23:52:37 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.11:11734
  2010-08-30 23:52:37 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.10:27131 READY) error 66476376 resolved.


  [root@mrg-qe-11 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3
  102690519 1595 (05:37:27->06:04:02)
  236649358 37 (08:47:15->08:47:52)
  236650386 31 (08:50:41->08:51:12)
  ...
  102690581 0 (06:04:06->06:04:06)
  102690520 0 (06:04:02->06:04:02)
  66476376 -85638 (23:51:15->00:03:57)
  [root@mrg-qe-11 fcluster]# grep 102690519 /tmp/qpidd.log
  2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.10:27131
  2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.9:7335
  2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 05:37:28 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.11:11734
  2010-08-31 05:37:28 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
  2010-08-31 06:04:02 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.12:25014
  2010-08-31 06:04:02 debug cluster(10.34.45.11:11734 READY) error 102690519 resolved.
  [root@mrg-qe-11 fcluster]# grep 66476376 /tmp/qpidd.log
  2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.10:27131
  2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.9:7335
  2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.11:11734
  2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
  2010-08-31 00:03:57 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.12:25014
  2010-08-31 00:03:57 debug cluster(10.34.45.11:11734 READY) error 66476376 resolved.
  
  [root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3
  236650386 21 (08:52:14->08:52:35)
  159294587 19 (06:54:04->06:54:23)
  236652727 16 (08:58:32->08:58:48)
  ...
  102690577 0 (06:05:26->06:05:26)
  102690521 0 (06:05:25->06:05:25)
  102690520 0 (06:05:25->06:05:25)
  [root@mrg-qe-12 ~]# grep 102690519 /tmp/qpidd.log
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.10:27131
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.9:7335
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.11:11734
  2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
  2010-08-31 06:05:25 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.12:25014
  2010-08-31 06:05:25 debug cluster(10.34.45.12:25014 READY) error 102690519 resolved.
  [root@mrg-qe-12 ~]# grep 66476376 /tmp/qpidd.log
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 local) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.10:27131
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.9:7335
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.11:11734
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.12:25014
  2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY) error 66476376 resolved.

Comment 19 Frantisek Reznicek 2010-09-01 13:26:53 UTC
Currently retesting on different machines in another lab (mrg33-6); so far the issue has not been seen, as not a single 'Channel 1 is not attached' exception was detected.

Keeping the test busy. Feedback soon.

Comment 20 Frantisek Reznicek 2010-09-02 14:59:38 UTC
The test on mrg33-6 shows no such issue over an additional day and a half.

It really seems that less powerful AMD Opterons (and potentially mixing architectures in one cluster) cause the issue to appear.

The other important condition is presence of message store.


The test is still running on mrg33-6, and mrg36 is manually put behind by running unrelated tasks there, to get closer to the situation where machines in the cluster are not homogeneous. My thanks to Mick.

Comment 21 Alan Conway 2010-09-02 21:03:07 UTC
The journal inactive timer late/overrun issue does not require a cluster. I've raised bug 629756 to separate this issue. 

I will leave this bug open as we still need to
- ensure there cannot be an unbounded time lag between cluster members with different processing speeds
- understand the not-attached warnings

Comment 22 Frantisek Reznicek 2010-09-03 10:15:57 UTC
An update on unbalancing mrg33-6 run.

A similar effect is seen on Intel machines; it is just very difficult to reach that state (the cluster has to run for a long time and one of the machines also has to execute other tasks).

Unbalancing mrg36 was able to trigger a similar case:

  [root@mrg33 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
  2178553487 1 (16:07:07->16:07:08)
  [root@mrg33 ~]# grep -c "overrunning" /tmp/qpidd.log
  4447
  
  [root@mrg34 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
  2178553487 38 (16:06:29->16:07:07)
  [root@mrg34 ~]# grep -c "overrunning" /tmp/qpidd.log
  3319
  
  [root@mrg35 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
  2178553487 38 (16:06:29->16:07:07)
  [root@mrg35 ~]# grep -c "overrunning" /tmp/qpidd.log
  4258
  
  [root@mrg36 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
  2178553487 38 (16:06:29->16:07:07)
  [root@mrg36 ~]# grep 2178553487 /tmp/qpidd.log
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) channel error 2178553487 on 10.16.44.241:53227(20.0.100.33:18339-160511 shadow) must be resolved with: 20.0.100.33:18339 20.0.100.34:27453 20.0.100.35:23092 20.0.100.36:9011 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.35:23092
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339 20.0.100.34:27453 20.0.100.36:9011
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.34:27453
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339 20.0.100.36:9011
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.36:9011
  2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339
  2010-09-02 16:07:07 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.33:18339
  2010-09-02 16:07:07 debug cluster(20.0.100.36:9011 READY) error 2178553487 resolved.
  [root@mrg36 ~]# grep -c "overrunning" /tmp/qpidd.log
  10656
  
  
The data above show that there was a single channel exception which took a second for the three machines that were not executing anything else and 38 seconds for the one under random load.

The 'Timer woken up overrunning' message counts also reflect the load.

I'll try to unbalance half of the cluster to see if I can get closer to the AMD data.

Comment 23 Frantisek Reznicek 2010-09-03 10:19:23 UTC
A correction to the previous post: three channel exception durations of 38 seconds were observed, and one of one second. It appears that machine 33 got notified late.

Comment 25 Alan Conway 2010-09-03 13:58:56 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Nodes in a cluster are unbalanced such that one node runs significantly faster than another. This could be because nodes use different hardware or because of other CPU or IO intensive processes running continuously on some nodes but not others.

Consequence: Under continuous heavy load, the slower nodes may lag behind the cluster and become slow to respond to their clients.

Workaround: Use similar hardware for cluster nodes and avoid other long-running CPU/IO intensive tasks on cluster nodes. Occasional spikes of activity are not a problem. The problem only occurs when qpidd is heavily loaded for a long period (several hours) on a continuously unbalanced cluster.

Comment 26 Frantisek Reznicek 2010-09-03 14:04:48 UTC
I'm fine with the tech. note formulation.

Comment 27 Alan Conway 2010-09-03 15:31:31 UTC
The underlying problem is that there's no flow control to slow faster brokers down if a slow broker gets behind. To implement this we need flow control in corosync which is currently not there. Bug 579127 will add the required flow control to corosync, at that point we can implement the flow control in qpidd.
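To illustrate the missing mechanism, here is a minimal credit-window flow-control sketch (hypothetical, not the actual corosync or qpidd API): a sender may have at most `window` events outstanding, and it blocks until the slowest receiver acknowledges, so a fast node cannot run unboundedly ahead of a slow one.

```python
import threading

class FlowControl:
    """Credit-based flow control sketch (hypothetical names).
    send() blocks once `window` events are unacknowledged."""
    def __init__(self, window):
        self.window = window
        self.outstanding = 0
        self.cond = threading.Condition()

    def send(self):
        with self.cond:
            # A lagging receiver leaves events unacknowledged,
            # so the fast sender blocks here instead of racing ahead.
            while self.outstanding >= self.window:
                self.cond.wait()
            self.outstanding += 1

    def ack(self, n=1):
        # Called when the slowest receiver has processed n events.
        with self.cond:
            self.outstanding -= n
            self.cond.notify_all()
```

Without such a bound (the situation in this bug), the backlog on the slow broker, and hence the error-resolution latency, can grow without limit under sustained load.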

Comment 28 Frantisek Reznicek 2011-01-20 09:32:21 UTC
Symptoms of channel exceptions / missing flow control are also visible in the broker log file as https://bugzilla.redhat.com/show_bug.cgi?id=629756#c7
messages.


Overrun messages should not indicate delays greater than those defined in bug 629756
  description's Expected results:

  moderate number of warnings, <- solved in bug 629756
  
  delays of <10ms, maybe one or two in 100s of ms. Certainly nothing over 1000ms. <- part of this defect now


These timing requirements are now excluded from bug 629756 and added to this defect, as bug 629756 targets the mechanism for limiting the number of overrun messages.

Comment 32 Red Hat Bugzilla 2025-02-10 03:13:31 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.