Bug 623660
| Summary: | clustered qpidd resolves 'Channel exception: not-attached: Channel 1 is not attached' extremely slowly | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> | ||||
| Component: | qpid-cpp | Assignee: | messaging-bugs <messaging-bugs> | ||||
| Status: | CLOSED UPSTREAM | QA Contact: | Messaging QE <messaging-qe-bugs> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | beta | CC: | gsim, rrajaram | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Cause: Nodes in a cluster are unbalanced such that one node runs significantly faster than another. This could be because nodes use different hardware or because of other CPU or IO intensive processes running continuously on some nodes but not others.
Consequence: Under continuous heavy load, the slower nodes may lag behind the cluster and become slow to respond to their clients.
Workaround: Use similar hardware for cluster nodes, and avoid other long-running CPU/IO intensive tasks on cluster nodes. Occasional spikes of activity are not a problem. The problem only occurs when qpidd is heavily loaded for a long period (several hours) on a continuously unbalanced cluster.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2025-02-10 03:13:31 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 579127 | ||||||
| Bug Blocks: | |||||||
| Attachments: | |
Description
Frantisek Reznicek
2010-08-12 13:11:47 UTC
Created attachment 438615 [details]
The tool for reporting duration of channel exception resolution
The attachment allows measuring the duration of channel exception resolution.
Usage:
gawk -f qpidd_channel_error_duration_reporter.awk /tmp/qpidd.log | \
sort -nr -k 2 > qpidd_channel_error_duration_reporter.log
[root@mrg-qe-12 fcluster]# head -3 qpidd_channel_error_duration_reporter.log
197589822 10862 (17:58:41->20:59:43)
197591234 53 (20:59:43->21:00:36)
198951004 51 (21:07:52->21:08:43)
# ^ ID ^ duration[secs]
# ^ timestamps, to double check
Note:
The attached tool is very simple and does not handle time rolling over from one day to the next; that is why you need to double-check the timestamps.
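The attached script itself is not inlined in this bug. For orientation only, a minimal gawk sketch of such a duration reporter (our own illustration, not the attachment; it assumes the /tmp/qpidd.log line format quoted later in this report, and it shares the same midnight limitation) could look like:

```shell
# Pair each "channel error <ID> ... must be resolved with:" log line with the
# final "error <ID> resolved." line and print: <ID> <duration[s]> (<t1>-><t2>)
cat > /tmp/cherror_dur_sketch.awk <<'EOF'
function to_secs(t,  a) { split(t, a, ":"); return a[1]*3600 + a[2]*60 + a[3] }
# opening line of an error; $2 is the HH:MM:SS timestamp
/channel error [0-9]+ on .* must be resolved with:/ {
    for (i = 1; i <= NF; i++) if ($i == "error") { start[$(i+1)] = $2; break }
}
# final resolution line of an error
/ error [0-9]+ resolved\.$/ {
    for (i = 1; i <= NF; i++) if ($i == "error") { id = $(i+1); break }
    if (id in start)
        printf "%s %d (%s->%s)\n", id, to_secs($2) - to_secs(start[id]), start[id], $2
}
EOF
# Usage: awk -f /tmp/cherror_dur_sketch.awk /tmp/qpidd.log | sort -nr -k 2
```

Intermediate "error <ID> resolved with <node>" lines are deliberately ignored; only the final "resolved." line closes an error, which is what makes the long resolutions visible.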
The issue was retested, with the results detailed below.
There was again one long case in which the exception took about 3 hours to resolve (a single occurrence); the other cases resolved much more quickly, some within a few seconds, some within a few tens of seconds.
Simple statistics are also summarized below:
gawk -f cherror_dur.awk /tmp/qpidd.log | sort -nr -k 2 > cherror_dur.log
head cherror_dur.log
wc -l cherror_dur.log
awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
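For convenience, the four bucket counts above can also be produced in a single awk pass (a sketch using the same field layout; the script name is ours):

```shell
# One-pass histogram over cherror_dur.log: field 2 is the duration in seconds,
# bucketed with the same edges as the four awk invocations above.
cat > /tmp/dur_hist.sh <<'EOF'
#!/bin/sh
awk '{ d = $2 + 0 }
     d >= 10 && d < 20 { a++ }
     d >= 20 && d < 30 { b++ }
     d >= 30 && d < 40 { c++ }
     d >= 40           { e++ }
     END { printf "10-20:%d 20-30:%d 30-40:%d 40+:%d\n", a, b, c, e }' "$1"
EOF
# Usage: sh /tmp/dur_hist.sh cherror_dur.log
```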
[09-first in cluster]
[root@mrg-qe-09 fcluster]# head cherror_dur.log
197589822 10865 (17:58:38->20:59:43)
225824437 56 (07:36:39->07:37:35)
200130231 53 (21:21:10->21:22:03)
229187039 50 (08:53:47->08:54:37)
224992378 50 (07:10:03->07:10:53)
219963899 49 (04:55:22->04:56:11)
213121023 48 (01:44:49->01:45:37)
207347485 45 (23:08:37->23:09:22)
205867402 45 (22:31:41->22:32:26)
204581138 45 (22:00:29->22:01:14)
[root@mrg-qe-09 fcluster]# wc -l cherror_dur.log
1334 cherror_dur.log
[root@mrg-qe-09 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
44
[root@mrg-qe-09 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
84
[root@mrg-qe-09 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
148
[root@mrg-qe-09 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
33
[10-second in cluster]
[root@mrg-qe-10 fcluster]# head cherror_dur.log
197589822 10863 (17:58:40->20:59:43)
197591234 53 (20:59:43->21:00:36)
13459025 29 (16:59:58->17:00:27)
140739016 26 (17:39:05->17:39:31)
93301270 25 (17:24:37->17:25:02)
28830231 25 (17:05:11->17:05:36)
155411800 25 (17:43:46->17:44:11)
115423132 24 (17:31:19->17:31:43)
196914882 23 (17:56:23->17:56:46)
54510456 21 (17:12:52->17:13:13)
[root@mrg-qe-10 fcluster]# wc -l cherror_dur.log
1334 cherror_dur.log
[root@mrg-qe-10 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
76
[root@mrg-qe-10 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
11
[root@mrg-qe-10 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
0
[root@mrg-qe-10 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
2
[11-third in cluster]
[root@mrg-qe-11 fcluster]# head cherror_dur.log
225824437 38 (07:36:02->07:36:40)
219963899 27 (04:54:50->04:55:17)
200130231 24 (21:20:40->21:21:04)
224992378 23 (07:09:35->07:09:58)
213121023 23 (01:44:19->01:44:42)
228312913 22 (08:34:51->08:35:13)
204581138 22 (21:59:58->22:00:20)
226224279 21 (07:50:13->07:50:34)
214836440 21 (02:41:16->02:41:37)
213955315 21 (02:15:25->02:15:46)
[root@mrg-qe-11 fcluster]# wc -l cherror_dur.log
1334 cherror_dur.log
[root@mrg-qe-11 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
178
[root@mrg-qe-11 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
22
[root@mrg-qe-11 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
1
[root@mrg-qe-11 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
0
[root@mrg-qe-11 fcluster]# grep 197589822 cherror_dur.log
197589822 0 (20:58:49->20:58:49)
[12-fourth in cluster]
[root@mrg-qe-12 fcluster]# head cherror_dur.log
197589822 10862 (17:58:41->20:59:43)
197591234 53 (20:59:43->21:00:36)
198951004 51 (21:07:52->21:08:43)
197602148 47 (21:03:59->21:04:46)
197609774 40 (21:05:20->21:06:00)
13459025 29 (16:59:58->17:00:27)
140739016 26 (17:39:05->17:39:31)
93301270 25 (17:24:37->17:25:02)
28830231 25 (17:05:11->17:05:36)
155411800 25 (17:43:46->17:44:11)
[root@mrg-qe-12 fcluster]# wc -l cherror_dur.log
1334 cherror_dur.log
[root@mrg-qe-12 fcluster]# awk -v min=10 -v max=20 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
10
[root@mrg-qe-12 fcluster]# awk -v min=20 -v max=30 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
10
[root@mrg-qe-12 fcluster]# awk -v min=30 -v max=40 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
0
[root@mrg-qe-12 fcluster]# awk -v min=40 -v max=999999 'BEGIN{cnt=0}{if((($2+0)>=min)&&(($2+0)<max)){cnt++}}END{print cnt}' cherror_dur.log
5
The above results show an interesting fact:
The single extremely long channel exception appears on mrg-qe-11 only at 20:58:49, yet the other nodes had known about it for hours (since about 17:58:40).
The cluster engine (openais) does not show any problems in the log file.
After the re-run I performed, I found another interesting fact that might be related, so I believe it is worth mentioning here (although it might be a different issue).
As the description details, clients run against the cluster are killed, using the default kill signal (SIGTERM). Surprisingly, the clients did not exit after SIGTERM and hung in qpid::client::LocalQueueImpl::get() -> pop() -> pthread_cond_wait@@GLIBC_2.3.2() -> __kernel_vsyscall().
For instance, qpid-perftest remains stuck here after being killed:
(*): Shared library is missing debugging information.
(gdb) 4 Thread 0xb7fadb90 (LWP 12958) 0x00411410 in __kernel_vsyscall ()
3 Thread 0xb69feb90 (LWP 15550) 0x00411410 in __kernel_vsyscall ()
2 Thread 0xb55fcb90 (LWP 15791) 0x00411410 in __kernel_vsyscall ()
* 1 Thread 0xb7fae940 (LWP 12955) 0x00411410 in __kernel_vsyscall ()
(gdb)
Thread 4 (Thread 0xb7fadb90 (LWP 12958)):
#0 0x00411410 in __kernel_vsyscall ()
#1 0x005cd486 in epoll_wait () from /lib/libc.so.6
#2 0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
at qpid/sys/epoll/EpollPoller.cpp:563
#3 0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
at qpid/sys/epoll/EpollPoller.cpp:515
#4 0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
at qpid/sys/posix/Thread.cpp:35
#5 0x0039b832 in start_thread () from /lib/libpthread.so.0
#6 0x005cce0e in clone () from /lib/libc.so.6
Thread 3 (Thread 0xb69feb90 (LWP 15550)):
#0 0x00411410 in __kernel_vsyscall ()
#1 0x005cd486 in epoll_wait () from /lib/libc.so.6
#2 0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
at qpid/sys/epoll/EpollPoller.cpp:563
#3 0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
at qpid/sys/epoll/EpollPoller.cpp:515
#4 0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
at qpid/sys/posix/Thread.cpp:35
#5 0x0039b832 in start_thread () from /lib/libpthread.so.0
#6 0x005cce0e in clone () from /lib/libc.so.6
Thread 2 (Thread 0xb55fcb90 (LWP 15791)):
#0 0x00411410 in __kernel_vsyscall ()
#1 0x005cd486 in epoll_wait () from /lib/libc.so.6
#2 0x001f8c2a in qpid::sys::Poller::wait (this=0x8172828, timeout=...)
at qpid/sys/epoll/EpollPoller.cpp:563
#3 0x001f9853 in qpid::sys::Poller::run (this=0x8172828)
at qpid/sys/epoll/EpollPoller.cpp:515
#4 0x001eed01 in qpid::sys::(anonymous namespace)::runRunnable (p=0x8172828)
at qpid/sys/posix/Thread.cpp:35
#5 0x0039b832 in start_thread () from /lib/libpthread.so.0
#6 0x005cce0e in clone () from /lib/libc.so.6
Thread 1 (Thread 0xb7fae940 (LWP 12955)):
#0 0x00411410 in __kernel_vsyscall ()
#1 0x0039fbc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x005d94dd in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x00bbaf7a in pop (this=0xb74f5e28, result=..., timeout=...)
at ../include/qpid/sys/posix/Condition.h:63
#4 qpid::client::LocalQueueImpl::get (this=0xb74f5e28, result=...,
timeout=...) at qpid/client/LocalQueueImpl.cpp:49
#5 0x00bbbbee in qpid::client::LocalQueueImpl::get (this=0xb74f5e28,
timeout=...) at qpid/client/LocalQueueImpl.cpp:40
#6 0x00bbbdd7 in qpid::client::LocalQueueImpl::pop (this=0xb74f5e28,
timeout=...) at qpid/client/LocalQueueImpl.cpp:36
#7 0x00bb87f9 in qpid::client::LocalQueue::pop (this=0xbfade9a4, timeout=...)
at qpid/client/LocalQueue.cpp:43
#8 0x08059ad3 in qpid::tests::Controller::process(size_t, qpid::client::LocalQueue, std::string, boost::function<void ()(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&),std::allocator<void> >) (
this=0xbfadeae8, n=1, lq=..., queue=..., msgFn=...)
at qpid-perftest.cpp:382
#9 0x0805b702 in qpid::tests::Controller::run (this=0xbfadeae8)
at qpid-perftest.cpp:423
#10 0x0805352f in main (argc=Cannot access memory at address 0x80
) at qpid-perftest.cpp:719
(gdb) Detaching from program: /usr/bin/qpid-perftest, process 12955
(gdb) quit
I'm going to retest on -12 with SIGTERM followed by SIGKILL...

Testing the updated -12 build shows no such issue so far:
[root@mrg-qe-09 fcluster]# gawk -f cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
41768242 53 (13:30:06->13:30:59)
46267278 48 (16:04:19->16:05:07)
46546959 46 (16:12:53->16:13:39)
46344840 44 (16:07:13->16:07:57)
42801369 44 (14:12:45->14:13:29)
45834192 43 (15:50:02->15:50:45)
47686762 42 (16:47:33->16:48:15)
47463278 42 (16:42:02->16:42:44)
45613620 41 (15:44:11->15:44:52)
43352306 41 (14:32:00->14:32:41)
47251459 40 (16:36:19->16:36:59)
46908472 40 (16:24:43->16:25:23)
46085609 40 (15:58:43->15:59:23)
43549886 40 (14:39:17->14:39:57)
47097462 39 (16:30:46->16:31:25)
46703165 39 (16:18:50->16:19:29)
45193575 39 (15:32:34->15:33:13)
47015123 38 (16:27:39->16:28:17)
44555018 38 (15:15:04->15:15:42)
41772500 38 (13:39:57->13:40:35)
[root@mrg-qe-09 fcluster]# ssh mrg-qe-10
[root@mrg-qe-10 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
45622496 20 (15:45:07->15:45:27)
46344840 16 (16:07:41->16:07:57)
41770771 16 (13:36:41->13:36:57)
41768242 16 (13:30:44->13:31:00)
45291764 15 (15:36:06->15:36:21)
44748414 15 (15:21:21->15:21:36)
46546959 13 (16:13:26->16:13:39)
45834192 13 (15:50:32->15:50:45)
45613620 13 (15:44:39->15:44:52)
45193575 13 (15:33:00->15:33:13)
44242598 13 (15:04:17->15:04:30)
42801369 13 (14:13:16->14:13:29)
47463278 12 (16:42:32->16:42:44)
46085609 12 (15:59:11->15:59:23)
47015123 11 (16:28:06->16:28:17)
45932750 11 (15:53:24->15:53:35)
42297367 11 (14:03:38->14:03:49)
47573383 10 (16:45:20->16:45:30)
47251459 10 (16:36:49->16:36:59)
46636154 10 (16:16:28->16:16:38)
[root@mrg-qe-10 ~]# ssh mrg-qe-11
[root@mrg-qe-11 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
41770771 70 (13:34:52->13:36:02)
41768242 67 (13:28:57->13:30:04)
43549886 65 (14:37:56->14:39:01)
42801369 65 (14:11:29->14:12:34)
43162564 63 (14:22:23->14:23:26)
41777630 62 (13:45:05->13:46:07)
43777481 61 (14:44:14->14:45:15)
43188933 61 (14:23:27->14:24:28)
42297367 61 (14:01:50->14:02:51)
42021685 61 (13:47:57->13:48:58)
41772500 61 (13:38:39->13:39:40)
43909823 60 (14:47:23->14:48:23)
43285492 60 (14:28:00->14:29:00)
42609812 60 (14:03:57->14:04:57)
42245469 60 (13:58:51->13:59:51)
42177079 60 (13:54:29->13:55:29)
41774499 60 (13:41:45->13:42:45)
41769288 60 (13:31:49->13:32:49)
43352306 59 (14:30:47->14:31:46)
43638814 58 (14:40:48->14:41:46)
[root@mrg-qe-11 ~]# ssh mrg-qe-12
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -20
45630787 19 (15:45:10->15:45:29)
45309933 11 (15:36:27->15:36:38)
41779772 11 (13:47:57->13:48:08)
46910684 7 (16:25:37->16:25:44)
46273371 7 (16:05:09->16:05:16)
41772649 7 (13:41:01->13:41:08)
44660918 5 (15:18:52->15:18:57)
46793447 4 (16:22:28->16:22:32)
46447328 4 (16:10:51->16:10:55)
44993250 4 (15:27:45->15:27:49)
43188933 4 (14:25:19->14:25:23)
43007176 4 (14:19:21->14:19:25)
41774520 4 (13:43:44->13:43:48)
45392977 3 (15:39:15->15:39:18)
43213503 3 (14:26:33->14:26:36)
41772573 3 (13:40:44->13:40:47)
44111902 2 (14:58:58->14:59:00)
42899508 2 (14:16:19->14:16:21)
42297368 2 (14:03:47->14:03:49)
47692009 1 (16:48:16->16:48:17)

The hanging client described above in comment 4 is now being retested: the client appears not to receive the SIGTERM at all, only the parent does, and the kernel seems to have reparented the client under the init process. More data should be available next week.

I ran the following on a 4 node cluster over the weekend, 64 hours so far with no error. Top shows the qpidd processes are stable at between 700M and 1024M.
node1: info_thread.sh
node2: tsxtest_thread.sh
node3&4: perftest_thread.sh
No core files, grep 'ecode:[^0]' in the logs found nothing, and the tail of the logs looks normal. This was with trunk r985356, which corresponds to release tags mrg1.3-beta5 qpid-cpp-mrg-0.7.946106-12.

Hello Alan,
I performed the retest over the weekend and I have to say that the 4 node cluster still behaves such that one channel exception takes an extremely long time to be resolved:
[root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 16080 (17:33:36->22:01:36)
4032459338 113 (01:33:53->01:35:46)
4059762049 102 (02:00:50->02:02:32)
2721094641 99 (00:26:09->00:27:48)
934589368 92 (15:05:54->15:07:26)
[root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 16078 (17:33:37->22:01:35)
16752199 73 (17:19:49->17:21:02)
43067894 67 (22:25:14->22:26:21)
16958035 67 (17:29:59->17:31:06)
16950935 67 (17:22:57->17:24:04)
[root@mrg-qe-11 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
17305969 1175 (21:41:05->22:00:40)
4059762049 55 (02:00:37->02:01:32)
3048658051 40 (06:41:10->06:41:50)
2484491772 40 (19:54:10->19:54:50)
3581085133 38 (17:51:01->17:51:39)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | sort -r -n -k2 | head -5
157871336 19 (00:28:05->00:28:24)
63413322 17 (07:21:48->07:22:05)
2683434374 16 (23:20:38->23:20:54)
934585372 12 (14:57:47->14:57:59)
4103141964 12 (03:02:52->03:03:04)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awk /tmp/qpidd.log | grep "17305969 "
17305969 1 (22:01:35->22:01:36)
Time synchronization is as follows:
[root@mrg-qe-09 fcluster]# date
Tue Aug 17 11:34:46 CEST 2010
[root@mrg-qe-10 fcluster]# date
Tue Aug 17 11:34:47 CEST 2010
[root@mrg-qe-11 ~]# date
Tue Aug 17 11:33:47 CEST 2010
[root@mrg-qe-12 ~]# date
Tue Aug 17 11:34:48 CEST 2010
The issue with clients not being correctly killed was a test issue that is now fixed, so I can guarantee with the above results that at most 10 clients were running against the cluster at any moment (timed-out clients are killed).
[root@mrg-qe-12 ~]# rpm -qa | grep -E '(qpid|qmf)' | sort
python-qmf-0.7.946106-8.el5
python-qpid-0.7.946106-12.el5
qmf-0.7.946106-12.el5
qmf-devel-0.7.946106-12.el5
qpid-cpp-client-*0.7.946106-12.el5
qpid-cpp-server-*0.7.946106-12.el5
qpid-java-*-0.7.946106-7.el5
qpid-tools-0.7.946106-8.el5
ruby-qmf-0.7.946106-12.el5
I believe the issue is still there, which is why I am moving the bug back to ASSIGNED.
Additional note: the clustered broker is configured as a persistent cluster. The clients are:
1] tsxtest, looping
2] perftest to localhost using unique queue names (*), looping
3] perftest to a randomly chosen cluster node using unique queue names (*), looping
4] qmf qpid-tools clients run in a loop against localhost and then against all cluster nodes (qpid-cluster; qpid-config; qpid-stat -b), looping
5a] connection clients in Python and C++ using the old API connection (connect & disconnect), looping
5b] qmf Python console client doing just connect and disconnect, looping
Nodes are 9...12 and the above actions are assigned as follows:
9 ... 5a+b] + 1]
10 ... 5a+b] + 4]
11 ... 5a+b] + 2]
12 ... 5a+b] + 3]

comment #2 mentions an issue with clients being killed in the test scripts. Can you attach the latest test scripts with this issue fixed? Could you also attach the logs from the latest reproduction of the bug?

Currently retesting again with a special set of packages (-12 + Alan's logging patch) to be able to provide more detailed information about what is happening there. The issue seems to be hardware dependent.
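The TERM-then-KILL escalation mentioned earlier could be scripted along these lines (a minimal sketch; the helper name kill_with_escalation and the grace period are ours, not part of the actual test scripts):

```shell
# Send SIGTERM, wait up to $2 seconds for the process to exit, then SIGKILL
# it if it is still around (covers clients stuck in pthread_cond_wait that
# never act on SIGTERM).
kill_with_escalation() {
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null                # still alive: force it
}
# Usage: kill_with_escalation "$client_pid" 10
```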
The specs of the machine where the issue is observed are:
[root@mrg-qe-12 ~]# mrg_get_cpu_info
mrg_get_cpu_info():CPU information:
processor : 0 1 2 3 4 5 6 7
vendor_id : AuthenticAMD
model name : Quad-Core AMD Opteron(tm) Processor 2376
cpu MHz : 2294.411
cpu cores : 4
bogomips : 4588.82 4588.49 4588.49 4588.47 4588.51 4588.46 4588.43 4588.49

Retesting with Alan's 0001-DEBUG-logging-for-slow-error-resolution-issue.patch applied and recompilation on the target machines suppressed the issue. There are two possible reasons:
a] the patch changed the timing so much that the issue went away
b] compilation (using exactly the same steps as the normal build) on the target AMD machine produces slightly different code that suppresses the issue

Next steps: a new patch (0001-WIP-flow-control-limit-total-unprocessed-events.patch), which is believed to fix the core of the issue, is going to be applied and the test run over the weekend. If the test passes as it does now, we still need to rebuild -12 without any additional patches on AMD to confirm that point b] above is not true.

The test with 0001-WIP-flow-control-limit-total-unprocessed-events.patch produces far fewer channel exceptions (all resolved within 20 seconds) than the previous case, and more quickly exposes a possibly linked problem.
After a couple of minutes, all clients trying to attach to the broker time out. I believe we need to evaluate this together with the channel exception issue.
For instance, a simple connect/disconnect client (qc_client) gets killed after 60 seconds of trying to connect to one of the brokers. The usual C++ client state at the moment the test kills the client is:
(*): Shared library is missing debugging information.
(gdb) 2 Thread 0x40b64940 (LWP 10395) 0x00002b13f0325108 in epoll_wait ()
from /lib64/libc.so.6
* 1 Thread 0x2b13f314f5b0 (LWP 10392) 0x00002b13f14f4ee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb)
Thread 2 (Thread 0x40b64940 (LWP 10395)):
#0 0x00002b13f0325108 in epoll_wait () from /lib64/libc.so.6
#1 0x00002b13f06d22b1 in qpid::sys::Poller::wait (this=0xc8d06c0,
timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2 0x00002b13f06d2d47 in qpid::sys::Poller::run (this=0xc8d06c0)
at qpid/sys/epoll/EpollPoller.cpp:515
#3 0x00002b13f06c8f9a in qpid::sys::(anonymous namespace)::runRunnable (p=0x5)
at qpid/sys/posix/Thread.cpp:35
#4 0x00002b13f14f073d in start_thread () from /lib64/libpthread.so.0
#5 0x00002b13f0324d1d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b13f314f5b0 (LWP 10392)):
#0 0x00002b13f14f4ee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x00002b13ef87979d in wait (this=0xc8d84a0)
at ../include/qpid/sys/posix/Condition.h:63
#2 wait (this=0xc8d84a0) at ../include/qpid/sys/Monitor.h:41
#3 qpid::sys::Waitable::wait (this=0xc8d84a0) at qpid/sys/Waitable.h:88
#4 0x00002b13ef874b09 in waitFor (this=0xc8d8430, timeout=0)
at qpid/sys/StateMonitor.h:65
#5 waitFor (this=0xc8d8430, timeout=0) at qpid/client/SessionImpl.cpp:769
#6 qpid::client::SessionImpl::open (this=0xc8d8430, timeout=0)
at qpid/client/SessionImpl.cpp:109
#7 0x00002b13ef841598 in qpid::client::ConnectionImpl::newSession (
this=0xc8d0e60, name=..., timeout=0, channel=65535)
at qpid/client/ConnectionImpl.cpp:437
#8 0x00002b13ef832489 in qpid::client::Connection::newSession (
this=0xc8d0af0, name=..., timeout=0) at qpid/client/Connection.cpp:141
#9 0x0000000000414eb7 in qc_lib::qc_client<qc_lib::qc_optparser_client>::connect (this=0x7fff475baca0) at qc_client.hpp:359
#10 0x0000000000408b05 in main (argc=3, argv=0x7fff475baff8)
at qc_client.cpp:51
(gdb) Detaching from program: /root/fcluster/qc_client, process 10392
(gdb) quit
Looking at perftest at approximately the same moment, it times out here:
(gdb) 2 Thread 0xb7f05b90 (LWP 18268) 0x0049a410 in __kernel_vsyscall ()
* 1 Thread 0xb7f06940 (LWP 18265) 0x0049a410 in __kernel_vsyscall ()
(gdb)
Thread 2 (Thread 0xb7f05b90 (LWP 18268)):
#0 0x0049a410 in __kernel_vsyscall ()
#1 0x00a70486 in epoll_wait () from /lib/libc.so.6
#2 0x00583caa in qpid::sys::Poller::wait (this=0x9f597b0, timeout=...)
at qpid/sys/epoll/EpollPoller.cpp:563
#3 0x005848d3 in qpid::sys::Poller::run (this=0x9f597b0)
at qpid/sys/epoll/EpollPoller.cpp:515
#4 0x00579d81 in qpid::sys::(anonymous namespace)::runRunnable (p=0x9f597b0)
at qpid/sys/posix/Thread.cpp:35
#5 0x0014a832 in start_thread (arg=0xb7f05b90) at pthread_create.c:301
#6 0x00a6fe0e in clone () from /lib/libc.so.6
Thread 1 (Thread 0xb7f06940 (LWP 18265)):
#0 0x0049a410 in __kernel_vsyscall ()
#1 0x0014ebc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x00a7c4dd in __pthread_cond_wait (cond=0x9f5a3a8, mutex=0x9f5a390)
at forward.c:138
#3 0x002abd07 in wait (this=0x9f5a390)
at ../include/qpid/sys/posix/Condition.h:63
#4 wait (this=0x9f5a390) at ../include/qpid/sys/Monitor.h:41
#5 qpid::sys::Waitable::wait (this=0x9f5a390) at qpid/sys/Waitable.h:88
#6 0x002a5878 in waitFor (this=0x9f5a358, timeout=0)
at qpid/sys/StateMonitor.h:65
#7 waitFor (this=0x9f5a358, timeout=0) at qpid/client/SessionImpl.cpp:769
#8 qpid::client::SessionImpl::open (this=0x9f5a358, timeout=0)
at qpid/client/SessionImpl.cpp:109
#9 0x0026c17d in qpid::client::ConnectionImpl::newSession (this=0x9f59b08,
name=..., timeout=0, channel=65535) at qpid/client/ConnectionImpl.cpp:437
#10 0x0025df2e in qpid::client::Connection::newSession (this=0xbffaace8,
name=..., timeout=0) at qpid/client/Connection.cpp:141
#11 0x080594f8 in qpid::tests::Client::Client (this=0xbffaace0)
at qpid-perftest.cpp:248
#12 0x0805338b in main (argc=Cannot access memory at address 0x80
) at qpid-perftest.cpp:263
So both hang at the same call: qpid::client::SessionImpl::open() -> waitFor() -> qpid::sys::Waitable::wait().
I am currently unable to tell whether this is a symptom of the applied patch; I occasionally saw this behavior on clean -12 as well, I just cannot be sure the client state at timeout/kill was the same.
The issue was retested with Alan's proposed potential fix, 0001-WIP-flow-control-limit-total-unprocessed-events.patch, without applying the previous logging patch. After a couple of minutes the test started to suffer from the extremely long channel exception resolution:
[root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 13346 (10:29:41->14:12:07)
10243005 29 (14:12:07->14:12:36)
95939 11 (10:24:19->10:24:30)
10243339 3 (14:12:46->14:12:49)
10246182 2 (14:14:42->14:14:44)
[root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 698 (14:00:29->14:12:07)
95939 1 (10:24:29->10:24:30)
10248322 1 (14:16:07->14:16:08)
10248295 1 (14:16:06->14:16:07)
10248265 1 (14:16:05->14:16:06)
[root@mrg-qe-11 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10242677 13345 (10:28:36->14:11:01)
10243005 29 (14:11:01->14:11:30)
95939 4 (10:23:20->10:23:24)
10243339 3 (14:11:40->14:11:43)
10246391 2 (14:13:44->14:13:46)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -5
10244299 2 (14:13:28->14:13:30)
10248295 1 (14:16:06->14:16:07)
10248266 1 (14:16:05->14:16:06)
10248248 1 (14:16:04->14:16:05)
10248110 1 (14:15:58->14:15:59)
[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | grep 10242677
10242677 0 (14:12:07->14:12:07)
This exercise confirmed the issue and showed that it is timing dependent (the reverted/excluded logging patch was able to suppress it). Moreover, compilation on an AMD or Intel machine seems to produce the same code (see hypothesis #15). The proposed patch does affect channel exception resolution, as all the durations were reduced, but it does not fix the issue.

Just an update... I was retesting because it seemed the issue was no longer reproducible, and I was able to trigger it again (tsxtest --flow 0, as suggested).
The data below show two extremely long exceptions: [root@mrg-qe-09 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3 102690519 4321 (04:53:23->06:05:24) 236649358 59 (08:48:13->08:49:12) 66476831 56 (00:05:19->00:06:15) ... 102690577 0 (06:05:26->06:05:26) 102690521 0 (06:05:25->06:05:25) 66476376 -76876 (21:26:35->00:05:19) [root@mrg-qe-09 fcluster]# grep 102690519 /tmp/qpidd.log 2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 local) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39) 2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.10:27131 2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014 2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.9:7335 2010-08-31 04:53:23 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014 2010-08-31 05:38:50 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.11:11734 2010-08-31 05:38:50 debug cluster(10.34.45.9:7335 READY/error) error 102690519 must be resolved with 10.34.45.12:25014 2010-08-31 06:05:24 debug cluster(10.34.45.9:7335 READY/error) error 102690519 resolved with 10.34.45.12:25014 2010-08-31 06:05:24 debug cluster(10.34.45.9:7335 READY) error 102690519 resolved. 
[root@mrg-qe-09 fcluster]# grep 66476376 /tmp/qpidd.log 2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39) 2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.10:27131 2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014 2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.9:7335 2010-08-30 21:26:35 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014 2010-08-30 23:52:37 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.11:11734 2010-08-30 23:52:37 debug cluster(10.34.45.9:7335 READY/error) error 66476376 must be resolved with 10.34.45.12:25014 2010-08-31 00:05:19 debug cluster(10.34.45.9:7335 READY/error) error 66476376 resolved with 10.34.45.12:25014 2010-08-31 00:05:19 debug cluster(10.34.45.9:7335 READY) error 66476376 resolved. [root@mrg-qe-10 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3 102690519 19666 (00:37:38->06:05:24) 66476831 56 (00:05:19->00:06:15) 236650386 32 (08:52:03->08:52:35) ... 
102690577 0 (06:05:26->06:05:26) 102690521 0 (06:05:25->06:05:25) 66476376 -76875 (21:26:34->00:05:19) [root@mrg-qe-10 fcluster]# grep 102690519 /tmp/qpidd.log 2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39) 2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.10:27131 2010-08-31 00:37:38 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014 2010-08-31 04:53:23 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.9:7335 2010-08-31 04:53:23 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014 2010-08-31 05:38:50 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.11:11734 2010-08-31 05:38:50 debug cluster(10.34.45.10:27131 READY/error) error 102690519 must be resolved with 10.34.45.12:25014 2010-08-31 06:05:24 debug cluster(10.34.45.10:27131 READY/error) error 102690519 resolved with 10.34.45.12:25014 2010-08-31 06:05:24 debug cluster(10.34.45.10:27131 READY) error 102690519 resolved. 
[root@mrg-qe-10 fcluster]# grep 66476376 /tmp/qpidd.log 2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39) 2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.10:27131 2010-08-30 21:26:34 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014 2010-08-30 21:26:35 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.9:7335 2010-08-30 21:26:35 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014 2010-08-30 23:52:37 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.11:11734 2010-08-30 23:52:37 debug cluster(10.34.45.10:27131 READY/error) error 66476376 must be resolved with 10.34.45.12:25014 2010-08-31 00:05:19 debug cluster(10.34.45.10:27131 READY/error) error 66476376 resolved with 10.34.45.12:25014 2010-08-31 00:05:19 debug cluster(10.34.45.10:27131 READY) error 66476376 resolved. [root@mrg-qe-11 fcluster]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3 102690519 1595 (05:37:27->06:04:02) 236649358 37 (08:47:15->08:47:52) 236650386 31 (08:50:41->08:51:12) ... 
102690581 0 (06:04:06->06:04:06)
102690520 0 (06:04:02->06:04:02)
66476376 -85638 (23:51:15->00:03:57)

[root@mrg-qe-11 fcluster]# grep 102690519 /tmp/qpidd.log
2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.10:27131
2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.9:7335
2010-08-31 05:37:27 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 05:37:28 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.11:11734
2010-08-31 05:37:28 debug cluster(10.34.45.11:11734 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
2010-08-31 06:04:02 debug cluster(10.34.45.11:11734 READY/error) error 102690519 resolved with 10.34.45.12:25014
2010-08-31 06:04:02 debug cluster(10.34.45.11:11734 READY) error 102690519 resolved.
[root@mrg-qe-11 fcluster]# grep 66476376 /tmp/qpidd.log
2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.10:27131
2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
2010-08-30 23:51:15 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.9:7335
2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.11:11734
2010-08-30 23:51:16 debug cluster(10.34.45.11:11734 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
2010-08-31 00:03:57 debug cluster(10.34.45.11:11734 READY/error) error 66476376 resolved with 10.34.45.12:25014
2010-08-31 00:03:57 debug cluster(10.34.45.11:11734 READY) error 66476376 resolved.

[root@mrg-qe-12 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | head -3 ; echo "..." ; gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log | sort -r -n -k2 | tail -3
236650386 21 (08:52:14->08:52:35)
159294587 19 (06:54:04->06:54:23)
236652727 16 (08:58:32->08:58:48)
...
102690577 0 (06:05:26->06:05:26)
102690521 0 (06:05:25->06:05:25)
102690520 0 (06:05:25->06:05:25)

[root@mrg-qe-12 ~]# grep 102690519 /tmp/qpidd.log
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) channel error 102690519 on 127.0.0.1:48244(10.34.45.9:7335-2927 shadow) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.10:27131
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.9:7335
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.11:11734
2010-08-31 06:05:24 debug cluster(10.34.45.12:25014 READY/error) error 102690519 must be resolved with 10.34.45.12:25014
2010-08-31 06:05:25 debug cluster(10.34.45.12:25014 READY/error) error 102690519 resolved with 10.34.45.12:25014
2010-08-31 06:05:25 debug cluster(10.34.45.12:25014 READY) error 102690519 resolved.
[root@mrg-qe-12 ~]# grep 66476376 /tmp/qpidd.log
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) channel error 66476376 on 127.0.0.1:35258(10.34.45.12:25014-3099 local) must be resolved with: 10.34.45.9:7335 10.34.45.10:27131 10.34.45.11:11734 10.34.45.12:25014 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.10:27131
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.9:7335 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.9:7335
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.11:11734 10.34.45.12:25014
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.11:11734
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 must be resolved with 10.34.45.12:25014
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY/error) error 66476376 resolved with 10.34.45.12:25014
2010-08-31 00:05:19 debug cluster(10.34.45.12:25014 READY) error 66476376 resolved.

Currently retesting on different machines in another lab (mrg33-6); so far no such issue has been seen, and not a single 'Channel 1 is not attached' exception has been detected. Keeping the test busy. Feedback soon.

The test on mrg33-6 shows no such issue over the additional day and a half. It really seems that less powerful AMD Opterons (and potentially mixing architectures in one cluster) cause the issue to appear. The other important condition is the presence of the message store. The test is still running on mrg33-6, and mrg36 is being manually held back by running unrelated actions on it, to get closer to the situation where the machines in the cluster are not homogeneous. My thanks to Mick.
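The cherror_dur.awkk helper used above is not attached to this bug, so its exact workings are unknown; based on the log excerpts, it appears to emit, per error ID, the duration in seconds between the first line mentioning the error and the final "error <id> resolved." line. A rough Python equivalent under those assumptions (the function name and parsing details are invented) might look like this. Note that the negative durations in the output above (e.g. -76875) suggest the helper compares times of day only; the sketch parses the full date, which avoids midnight wraparound.

```python
import re
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S"
# Match a qpidd debug line and capture its timestamp and error ID.
PAT = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) debug .*\berror (\d+)\b")

def channel_error_durations(lines):
    """Map error ID -> seconds from first mention to final 'resolved.' line."""
    first, last = {}, {}
    for line in lines:
        m = PAT.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), TS_FMT)
        err = m.group(2)
        first.setdefault(err, ts)          # remember the earliest mention
        if line.rstrip().endswith("resolved."):
            last[err] = ts                 # final resolution line
    return {e: int((last[e] - first[e]).total_seconds())
            for e in first if e in last}
```

Fed the 102690519 excerpt from mrg-qe-10 above, this yields roughly the 19666-second figure that matches the reported multi-hour resolution time.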
The journal inactive timer late/overrun issue does not require a cluster. I've raised bug 629756 to separate this issue. I will leave this bug open as we still need to:
- ensure there cannot be an unbounded time lag between cluster members with different processing speeds
- understand the not-attached warnings

An update on the unbalanced mrg33-6 run. A similar effect is seen on Intel machines; it is just very difficult to reach that state (the cluster has to run for a long time and one of the machines also has to execute other actions). Unbalancing mrg36 was able to trigger a similar case:

[root@mrg33 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
2178553487 1 (16:07:07->16:07:08)
[root@mrg33 ~]# grep -c "overrunning" /tmp/qpidd.log
4447
[root@mrg34 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
2178553487 38 (16:06:29->16:07:07)
[root@mrg34 ~]# grep -c "overrunning" /tmp/qpidd.log
3319
[root@mrg35 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
2178553487 38 (16:06:29->16:07:07)
[root@mrg35 ~]# grep -c "overrunning" /tmp/qpidd.log
4258
[root@mrg36 ~]# gawk -f ~/fcluster/cherror_dur.awkk /tmp/qpidd.log
2178553487 38 (16:06:29->16:07:07)
[root@mrg36 ~]# grep 2178553487 /tmp/qpidd.log
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) channel error 2178553487 on 10.16.44.241:53227(20.0.100.33:18339-160511 shadow) must be resolved with: 20.0.100.33:18339 20.0.100.34:27453 20.0.100.35:23092 20.0.100.36:9011 : not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.35:23092
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339 20.0.100.34:27453 20.0.100.36:9011
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.34:27453
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339 20.0.100.36:9011
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.36:9011
2010-09-02 16:06:29 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 must be resolved with 20.0.100.33:18339
2010-09-02 16:07:07 debug cluster(20.0.100.36:9011 READY/error) error 2178553487 resolved with 20.0.100.33:18339
2010-09-02 16:07:07 debug cluster(20.0.100.36:9011 READY) error 2178553487 resolved.
[root@mrg36 ~]# grep -c "overrunning" /tmp/qpidd.log
10656

The above data show that there was a single channel exception, which took a second for the three machines that were not executing anything else and 38 seconds for the one under random load. The 'Timer woken up overrunning' counts also reflect the load. I'll try to unbalance half of the cluster to see whether I can get closer to the AMD data.

A correction to the previous post: three channel exception durations of 38 seconds are observed and one of one second. It appears that machine 33 was notified late.
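The reproduction above holds one node back by running "unrelated actions" on it. A minimal, hypothetical load generator for that purpose (not a tool actually used in this bug) could be as simple as a timed busy loop:

```python
import time

def burn(seconds):
    """Busy-loop for `seconds`, stealing CPU time from qpidd on this node."""
    deadline = time.monotonic() + seconds
    x = 0
    while time.monotonic() < deadline:
        # Pointless arithmetic; the only goal is to keep one core busy.
        x = (x * 1103515245 + 12345) % (2 ** 31)
    return x
```

Running one such loop per core (plus some disk-write activity) on a single node approximates the continuously unbalanced cluster that triggers the slow error resolution.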
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: Nodes in a cluster are unbalanced such that one node runs significantly faster than another. This can happen because the nodes use different hardware, or because other CPU- or I/O-intensive processes run continuously on some nodes but not others.
Consequence: Under continuous heavy load, the slower nodes may lag behind the cluster and become slow to respond to their clients.
Workaround: Use similar hardware for cluster nodes and avoid other long-running CPU/IO-intensive tasks on cluster nodes. Occasional spikes of activity are not a problem. The problem only occurs when qpidd is heavily loaded for a long period (several hours) on a continuously unbalanced cluster.
I'm fine with the tech. note formulation.

The underlying problem is that there is no flow control to slow faster brokers down if a slow broker gets behind. To implement this we need flow control in corosync, which is currently not there. Bug 579127 will add the required flow control to corosync; at that point we can implement the flow control in qpidd.

Symptoms of channel exceptions / missing flow control are also visible in the broker log file as https://bugzilla.redhat.com/show_bug.cgi?id=629756#c7 messages. Overrun messages should not indicate delays greater than defined in bug 629756 description's Expected results:
- moderate number of warnings <- solved in bug 629756
- delays of <10ms, maybe one or two in 100s of ms; certainly nothing over 1000ms <- part of this defect now

These timing requirements are now excluded from bug 629756 and added to this defect, as bug 629756 targets the mechanism for limiting the number of overrun messages.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.
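The flow-control direction described above can be illustrated with a conceptual sketch. This is not the corosync or qpidd implementation; the names CreditWindow and WINDOW are invented for illustration. The idea is simply that a sender may be at most a fixed window of messages ahead of the slowest cluster member's processed count, so a fast node cannot run unboundedly ahead of a lagging one.

```python
WINDOW = 100  # max messages the slowest member may be behind (illustrative)

class CreditWindow:
    """Toy model of cluster-wide flow control bounded by the slowest member."""

    def __init__(self, members):
        self.sent = 0
        self.processed = {m: 0 for m in members}  # per-member progress

    def can_send(self):
        # Sending is allowed only while the slowest member is within WINDOW.
        return self.sent - min(self.processed.values()) < WINDOW

    def send(self):
        if not self.can_send():
            raise RuntimeError("blocked: waiting for slowest member")
        self.sent += 1

    def ack(self, member):
        # A member reports that it has processed one more message.
        self.processed[member] += 1
```

With such a bound in place, the hours-long error-resolution lag seen in the logs would instead surface as back-pressure on the fast brokers.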