Bug 604567

Summary: qpidd broker rarely aborts in ~ManagementAgent
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: messaging-bugs <messaging-bugs>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: esammons, gsim
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-20 11:29:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Frantisek Reznicek 2010-06-16 09:01:52 UTC
Description of problem:

There was observed singlebroker abort when ran under stress of restarting broker with different message store parameters and sending and receiving messages in every run to see whether restarted broker keep the messages. No clustering because the observation was seen on RHEL4)

The abort in ~ManagementAgent:

  Thread 1 (process 16520):
  #0  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
  )
      at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
  #1  0x0000003f3002fb2e in *__GI_abort () at ../sysdeps/generic/abort.c:88
  #2  0x000000332c66bfb6 in ~ManagementAgent (this=0x2a96182010)
      at ../include/qpid/sys/posix/Mutex.h:112
  #3  0x000000332c552a80 in ~Broker (this=0x5662a0)
      at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:260
  #4  0x000000000040e48e in QpiddBroker::execute (this=Variable "this" is not available.
  )
      at ./qpid/RefCounted.h:42
  #5  0x0000000000408d34 in main (argc=21, argv=0x7fbfffee18)
      at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:301
  #6  0x0000003f3001c4cb in __libc_start_main (main=0x4055a0 <main>, argc=21,
      ubp_av=0x7fbfffee18, init=0x41ae50 <__libc_csu_init>, fini=Variable "fini" is not available.
  )
      at ../sysdeps/generic/libc-start.c:209

occurred during broker shutdown as detailed on bottom.


Version-Release number of selected component (if applicable):
python-qmf-0.7.946106-3.el4
python-qpid-0.7.946106-1.el4
qmf-0.7.946106-2.el4
qmf-devel-0.7.946106-2.el4
qpid-cpp-client-0.7.946106-2.el4
qpid-cpp-client-devel-0.7.946106-2.el4
qpid-cpp-client-devel-docs-0.7.946106-2.el4
qpid-cpp-client-ssl-0.7.946106-2.el4
qpid-cpp-mrg-debuginfo-0.7.946106-2.el4
qpid-cpp-server-0.7.946106-2.el4
qpid-cpp-server-devel-0.7.946106-2.el4
qpid-cpp-server-ssl-0.7.946106-2.el4
qpid-cpp-server-store-0.7.946106-2.el4
qpid-cpp-server-xml-0.7.946106-2.el4
qpid-dotnet-0.4.738274-2.el4
qpid-java-client-0.7.946106-3.el4
qpid-java-common-0.7.946106-3.el4
qpid-tests-0.7.946106-1.el4
qpid-tools-0.7.946106-4.el4


How reproducible:
extremely hard


Steps to Reproduce:
1. run qpid_ptest_restart_with_msgstore_sweep
see http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/distribution/MRG/Messaging/qpid_ptest_restart_with_msgstore_sweep/runtest.sh?rev=1.9 for more details or:
  1a. start broker with next set of store parameters
  1b. receive messages from previous step
  1c. create queue and bind it to exchange
  1d. send messages to the exchange (to be propagated to the created queue)
  1e. stop broker
2. wait for the core file
  
Actual results:
qpidd c++ broker aborts.

Expected results:
qpidd c++ broker should not abort.

Additional info:

GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `/usr/sbin/qpidd --data-dir /root/MRG/Messaging/qpid_ptest_restart_with_msgstore'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib64/libqpidbroker.so.2...Reading symbols from /usr/lib/debug/usr/lib64/libqpidbroker.so.2.0.0.debug...done.
done.
...
Reading symbols from /usr/lib64/libaio.so.1...done.
Loaded symbols for /usr/lib64/libaio.so.1
#0  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
67        int res = INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) rax            0x0        0
rbx            0x2a96182118     182906790168
rcx            0xffffffffffffffff       -1
rdx            0x6      6
rsi            0x4088   16520
rdi            0x4088   16520
rbp            0x2a9557ea80     0x2a9557ea80
rsp            0x7fbfffe228     0x7fbfffe228
r8             0x6      6
r9             0x4088   16520
r10            0x8      8
r11            0x206    518
r12            0x2a96182010     182906789904
r13            0x0      0
r14            0x10     16
r15            0x2a961820f0     182906790128
rip            0x3f3002e32d     0x3f3002e32d <*__GI_raise+61>
eflags         0x206    518
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
Current language:  auto; currently c
(gdb) There are no memory regions defined.
(gdb) From                To                  Syms Read   Shared Object Library
0x000000332c4c5a00  0x000000332c6975a8  Yes         /usr/lib64/libqpidbroker.so.2
0x000000332c148f40  0x000000332c262058  Yes         /usr/lib64/libqpidcommon.so.2
0x0000003f30914160  0x0000003f30935e68  Yes         /usr/lib64/libboost_program_options.so.1
0x0000003a3de05cc0  0x0000003a3de0d7a8  Yes         /usr/lib64/libboost_filesystem.so.1
0x0000003f30701670  0x0000003f307028c8  Yes         /lib64/tls/libuuid.so.1
0x0000003f35c02c90  0x0000003f35c07b38  Yes         /lib64/tls/librt.so.1
0x0000003f37c044f0  0x0000003f37c11ec8  Yes         /usr/lib64/libsasl2.so.2
0x0000003f30500f80  0x0000003f30501918  Yes         /lib64/libdl.so.2
0x0000003f331036a0  0x0000003f3310ee28  Yes         /lib64/libresolv.so.2
0x0000003f37800bf0  0x0000003f37805668  Yes         /lib64/libcrypt.so.1
0x0000003f3444d090  0x0000003f344b5498  Yes         /usr/lib64/libstdc++.so.6
0x0000003f30303df0  0x0000003f30346b98  Yes         /lib64/tls/libm.so.6
0x0000003f34202060  0x0000003f3420ab08  Yes         /lib64/libgcc_s.so.1
0x0000003f3001c2d0  0x0000003f300fe2ac  Yes         /lib64/tls/libc.so.6
0x0000003f30d05100  0x0000003f30d0d228  Yes         /lib64/tls/libpthread.so.0
0x0000003f2fe00a80  0x0000003f2fe10de7  Yes         /lib64/ld-linux-x86-64.so.2
0x0000002a95587990  0x0000002a95591888  Yes         /usr/lib64/qpid/daemon/replicating_listener.so
0x0000002a956a62a0  0x0000002a956c5a58  Yes         /usr/lib64/qpid/daemon/acl.so
0x0000002a957d6490  0x0000002a957dc428  Yes         /usr/lib64/qpid/daemon/replication_exchange.so
0x0000002a958e8f60  0x0000002a958f3228  Yes         /usr/lib64/qpid/daemon/xml.so
0x00000037da471330  0x00000037da5ff098  Yes         /usr/lib64/libxerces-c.so.28
0x00000037daa69cf0  0x00000037dabfb3c8  Yes         /usr/lib64/libxqilla.so.3
0x0000002a95a01c70  0x0000002a95a0cab8  Yes         /usr/lib64/qpid/daemon/ssl.so
0x0000002a95b417c0  0x0000002a95b59be8  Yes         /usr/lib64/libsslcommon.so.2
0x0000003f38418d50  0x0000003f384fed38  Yes         /usr/lib64/libnss3.so
0x0000003f39108ca0  0x0000003f3912c018  Yes         /usr/lib64/libssl3.so
0x0000003f37e0d4f0  0x0000003f37e2bec8  Yes         /usr/lib64/libnspr4.so
0x0000003f393085f0  0x0000003f39312c08  Yes         /usr/lib64/libnssutil3.so
0x0000003f382015d0  0x0000003f38202cb8  Yes         /usr/lib64/libplc4.so
0x0000003f38001010  0x0000003f38001e68  Yes         /usr/lib64/libplds4.so
0x0000003f30b01f20  0x0000003f30b0b278  Yes         /usr/lib64/libz.so.1
0x0000002a95c9ae80  0x0000002a95d421d8  Yes         /usr/lib64/qpid/daemon/msgstore.so
0x0000002a95eb4ea0  0x0000002a95f59f68  Yes         /usr/lib64/tls/libdb_cxx-4.2.so
0x0000002a960815b0  0x0000002a96081776  Yes         /usr/lib64/libaio.so.1
(gdb)   2 process 16525  ?? ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:59
   from /lib64/tls/libpthread.so.0
* 1 process 16520  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
(gdb)
Thread 2 (process 16525):
#0  ?? () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:59
   from /lib64/tls/libpthread.so.0
#1  0x00000000409df608 in ?? ()
#2  0x0000000000000003 in ?? ()
#3  0x0000003f30d07f4f in ?? () from /lib64/tls/libpthread.so.0
#4  0x0000000000000000 in ?? ()
Current language:  auto; currently asm

Thread 1 (process 16520):
#0  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
#1  0x0000003f3002fb2e in *__GI_abort () at ../sysdeps/generic/abort.c:88
#2  0x000000332c66bfb6 in ~ManagementAgent (this=0x2a96182010)
    at ../include/qpid/sys/posix/Mutex.h:112
#3  0x000000332c552a80 in ~Broker (this=0x5662a0)
    at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:260
#4  0x000000000040e48e in QpiddBroker::execute (this=Variable "this" is not available.
)
    at ./qpid/RefCounted.h:42
#5  0x0000000000408d34 in main (argc=21, argv=0x7fbfffee18)
    at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:301
#6  0x0000003f3001c4cb in __libc_start_main (main=0x4055a0 <main>, argc=21,
    ubp_av=0x7fbfffee18, init=0x41ae50 <__libc_csu_init>, fini=Variable "fini" is not available.
)
    at ../sysdeps/generic/libc-start.c:209
#7  0x000000000040550a in _start ()
#8  0x0000007fbfffee08 in ?? ()
#9  0x000000000000001c in ?? ()
#10 0x0000000000000015 in ?? ()
#11 0x0000007fbffff1fe in ?? ()
#12 0x0000007fbffff20e in ?? ()
#13 0x0000007fbffff219 in ?? ()
#14 0x0000007fbffff273 in ?? ()
#15 0x0000007fbffff27a in ?? ()
#16 0x0000007fbffff280 in ?? ()
#17 0x0000007fbffff287 in ?? ()
#18 0x0000007fbffff28a in ?? ()
#19 0x0000007fbffff297 in ?? ()
#20 0x0000007fbffff29d in ?? ()
#21 0x0000007fbffff2aa in ?? ()
#22 0x0000007fbffff2ad in ?? ()
#23 0x0000007fbffff2be in ?? ()
#24 0x0000007fbffff2c1 in ?? ()
#25 0x0000007fbffff2d4 in ?? ()
#26 0x0000007fbffff2d7 in ?? ()
#27 0x0000007fbffff2e8 in ?? ()
#28 0x0000007fbffff2eb in ?? ()
#29 0x0000007fbffff300 in ?? ()
#30 0x0000007fbffff303 in ?? ()
#31 0x0000007fbffff31a in ?? ()
#32 0x0000000000000000 in ?? ()
Current language:  auto; currently c
(gdb)

# timestamp of the core file:
# ###########################

# stat core.16520
  File: `core.16520'
  Size: 44904448        Blocks: 4368       IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 3868243     Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-06-16 04:22:16.000000000 -0400
Modify: 2010-06-15 09:48:37.000000000 -0400
Change: 2010-06-15 09:48:37.000000000 -0400

# rough test transcript:
# ######################

[09:47:51] ..parameters: cluster_width:1
[09:47:51]   *_num_jfiles:16,*_jfile_size_pgs:12,*_wcache_page_size:32
[09:47:51]   *_tpl_num_jfiles:16,*_tpl_jfile_size_pgs:12,*_tpl_wcache_page_size:32
[09:47:51] .Launching 1 qpidd instance[s]... [err_cnt:0]
[09:47:52] mrg_qpidd_start: qpidd launched normal bg way (port:36858,log:qpidd_23_1_1_t16_12_32.log,params: --auth no --log-enable info+ --num-jf
iles 16 --jfile-size-pgs 12 --wcache-page-size 32 --tpl-num-jfiles 16 --tpl-jfile-size-pgs 12 --tpl-wcache-page-size 32 --data-dir /root/MRG/Mess
aging/qpid_ptest_restart_with_msgstore_sweep/rhts_qpidd/qpidd_data_dir_23_1,lockfile:qpidd_23_1_1_t16_12_32.lock)
[09:47:56] mrg_qpidd_wait_on_settle: qpidd started-up (dur:0sec)
[09:48:00] ..Checking stored messages from last run
[09:48:00] ..OK, received data are same and correct ecode=0
[09:48:00] ..Create and bind queue
[09:48:00] ..ecode=0
[09:48:00] ..Send messages
[09:48:02] ..ecode=0
[09:48:02] .Stopping all running qpidd instances...
[09:48:02] mrg_qpidd_stop: following qpidd brokers are found to stop (13104 )
[09:48:06] mrg_qpidd_status: 0 instance[s] running (pids:,ports:)
[09:48:08] ..qpidd instances stopped ok
[09:48:08] ..parameters: cluster_width:1
[09:48:08]   *_num_jfiles:16,*_jfile_size_pgs:12,*_wcache_page_size:64
[09:48:08]   *_tpl_num_jfiles:16,*_tpl_jfile_size_pgs:12,*_tpl_wcache_page_size:64
[09:48:09] mrg_qpidd_start: qpidd launched normal bg way (port:48921,log:qpidd_23_1_1_t16_12_64.log,params: --auth no --log-enable info+ --num-jf
iles 16 --jfile-size-pgs 12 --wcache-page-size 64 --tpl-num-jfiles 16 --tpl-jfile-size-pgs 12 --tpl-wcache-page-size 64 --data-dir /root/MRG/Mess
aging/qpid_ptest_restart_with_msgstore_sweep/rhts_qpidd/qpidd_data_dir_23_1,lockfile:qpidd_23_1_1_t16_12_64.lock)
[09:48:12] mrg_qpidd_wait_on_settle: qpidd started-up (dur:0sec)
[09:48:17] ..Checking stored messages from last run
[09:48:17] ..OK, received data are same and correct ecode=0
[09:48:17] ..Create and bind queue
[09:48:18] ..ecode=0
[09:48:18] ..Send messages
[09:48:20] ..ecode=0
[09:48:20] .Stopping all running qpidd instances...
[09:48:20] mrg_qpidd_stop: following qpidd brokers are found to stop (14816 )
[09:48:23] mrg_qpidd_status: 0 instance[s] running (pids:,ports:)
[09:48:26] ..qpidd instances stopped ok
[09:48:26] ..parameters: cluster_width:1
[09:48:26]   *_num_jfiles:16,*_jfile_size_pgs:24,*_wcache_page_size:16
[09:48:26]   *_tpl_num_jfiles:16,*_tpl_jfile_size_pgs:24,*_tpl_wcache_page_size:16
[09:48:26] .Launching 1 qpidd instance[s]... [err_cnt:0]
[09:48:26] mrg_qpidd_start: qpidd launched normal bg way (port:41581,log:qpidd_23_1_1_t16_24_16.log,params: --auth no --log-enable info+ --num-jf
iles 16 --jfile-size-pgs 24 --wcache-page-size 16 --tpl-num-jfiles 16 --tpl-jfile-size-pgs 24 --tpl-wcache-page-size 16 --data-dir /root/MRG/Mess
aging/qpid_ptest_restart_with_msgstore_sweep/rhts_qpidd/qpidd_data_dir_23_1,lockfile:qpidd_23_1_1_t16_24_16.lock)
[09:48:30] mrg_qpidd_wait_on_settle: qpidd started-up (dur:0sec)
[09:48:35] ..Checking stored messages from last run
[09:48:35] ..OK, received data are same and correct ecode=0
[09:48:35] ..Create and bind queue
[09:48:35] ..ecode=0
[09:48:35] ..Send messages
[09:48:36] ..ecode=0
[09:48:36] .Stopping all running qpidd instances...
[09:48:36] mrg_qpidd_stop: following qpidd brokers are found to stop (16520 )
/root/MRG/Messaging/qpid_common/qpid_test_functions_broker.sh: line 51: 16520 Aborted                 (core dumped) ${qpidd_wrapper} ${QPIDD_DIR}
/${MRG_CPP_BROKER} ${datadir_sw} --port ${port} ${test_params} >&${transcript}
[09:48:40] mrg_qpidd_status: 0 instance[s] running (pids:,ports:)
[09:48:43] ..qpidd instances stopped ok

# broker log file (info+):
# ########################

qpidd qpidd_23_1_1_t16_24_16.log log:
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2010-06-15 09:48:26 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2010-06-15 09:48:26 info No message store configured, persistence is disabled.
2010-06-15 09:48:26 info Management enabled
2010-06-15 09:48:26 info Registered replication exchange
2010-06-15 09:48:26 info Registered xml exchange
2010-06-15 09:48:27 notice Journal "TplStore": Created
2010-06-15 09:48:27 notice Store module initialized; store-dir=/root/MRG/Messaging/qpid_ptest_restart_with_msgstore_sweep/rhts_qpidd/qpidd_data_d
ir_23_1
2010-06-15 09:48:27 info > Default files per journal: 16
2010-06-15 09:48:27 info > Default journal file size: 24 (wpgs)
2010-06-15 09:48:27 info > Default write cache page size: 16 (Kib)
2010-06-15 09:48:27 info > Default number of write cache pages: 32
2010-06-15 09:48:27 info > TPL files per journal: 16
2010-06-15 09:48:27 info > TPL journal file size: 24 (wpgs)
2010-06-15 09:48:27 info > TPL write cache page size: 16 (Kib)
2010-06-15 09:48:27 info > TPL number of write cache pages: 32
2010-06-15 09:48:27 info Queue "queue": Policy created: type=reject; maxCount=0; maxSize=104857600
2010-06-15 09:48:27 notice Journal "queue": Created
2010-06-15 09:48:27 warning Journal "queue": Recovery found 4 files (different from --num-jfiles value of 16).
2010-06-15 09:48:27 warning Journal "queue": Recovery found file size = 12 (different from --jfile-size-pgs value of 24).
2010-06-15 09:48:27 info Recovered queue "queue": 101 messages recovered; 0 messages in-doubt.
2010-06-15 09:48:27 info Most recent persistence id found: 0x102d
2010-06-15 09:48:27 notice SASL disabled: No Authentication Performed
2010-06-15 09:48:27 notice Listening on TCP port 41581
2010-06-15 09:48:27 info Policy file not specified. ACL Disabled, no ACL checking being done!
2010-06-15 09:48:27 info SSL plugin not enabled, you must set --ssl-cert-db to enable it.
2010-06-15 09:48:27 notice Broker running
2010-06-15 09:48:37 notice Shut down
Device or resource busy

Comment 1 Frantisek Reznicek 2010-06-16 09:03:04 UTC
Currently retesting on qpid-cpp-*0.7.946106-3.*

Comment 2 Frantisek Reznicek 2010-06-22 08:38:57 UTC
Tests on qpid-cpp-*0.7.946106-3 succeeded, but more extended testing needed due to very low reproducibility. Currently retesting on qpid-cpp-*0.7.946106-4, in the middle of the test no failure seen.

Comment 3 Gordon Sim 2010-06-23 13:41:41 UTC
Suspect fixed by fix to bug 566691 i.e. r952307 on trunk and http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=79804fa332a5c1b7a2ca7da5cbee843f91f7fa3b in release repo (which occured after the -2 packages were built).

Comment 4 Frantisek Reznicek 2010-06-23 13:45:36 UTC
The long term testing on 4 ... 6 parallel machines plus couple of test suite runs on RHEL 4.8 / 5.5 i386 / x86_64 show that issue is fixed on packages:
python-qmf-0.7.946106-4.el5
python-qpid-0.7.946106-2.el5
qmf-0.7.946106-4.el5
qmf-devel-0.7.946106-4.el5
qpid-cpp-client-0.7.946106-4.el5
qpid-cpp-client-devel-0.7.946106-4.el5
qpid-cpp-client-devel-docs-0.7.946106-4.el5
qpid-cpp-client-ssl-0.7.946106-4.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-4.el5
qpid-cpp-server-cluster-0.7.946106-4.el5
qpid-cpp-server-devel-0.7.946106-4.el5
qpid-cpp-server-ssl-0.7.946106-4.el5
qpid-cpp-server-store-0.7.946106-4.el5
qpid-cpp-server-xml-0.7.946106-4.el5
qpid-java-client-0.7.946106-4.el5
qpid-java-common-0.7.946106-4.el5
qpid-tools-0.7.946106-4.el5

-> VERIFIED