Description of problem:

The broker can segfault when:
- the EFP is empty, and
- two journals concurrently request a new journal file from the EFP.

There is a race condition, described in Additional info, that leads to the segfault.

Version-Release number of selected component (if applicable):
qpid-cpp-server-linearstore 0.22-49

How reproducible:
??? (reproducer to be created)

Steps to Reproduce:
Most probably, produce durable messages via a fanout exchange to many durable queues for a long time. A proper reproducer will be provided.

Actual results:
segfault with backtrace:

Thread 1 (Thread 0x7ff85b3f1700 (LWP 17810)):
#0  0x00007ff9927104f3 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libstdc++.so.6
No symbol table info available.
#1  0x00007ff98e59d6a1 in operator= (this=0x1ab3480) at /usr/include/c++/4.4.7/bits/basic_string.h:511
No locals.
#2  qpid::linearstore::journal::EmptyFilePool::popEmptyFile (this=0x1ab3480) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/EmptyFilePool.cpp:213
        l = {_sm = @0x1ab34f8}
        emptyFileName = ""
        isEmpty = true
#3  0x00007ff98e59ddec in qpid::linearstore::journal::EmptyFilePool::takeEmptyFile (this=0x1ab3480, destDirectory="/var/lib/qpidd/qls/jrnl/DurableQueue") at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/EmptyFilePool.cpp:108
        emptyFileName = ""
        newFileName = ""

Expected results:
no segfault

Additional info:
Relevant source code:

std::string EmptyFilePool::popEmptyFile() {
    std::string emptyFileName;
    bool isEmpty = false;
    {
        slock l(emptyFileListMutex_);
        isEmpty = emptyFileList_.empty();
    }
    if (isEmpty) {
        createEmptyFile();
    }
    {
        slock l(emptyFileListMutex_);
        emptyFileName = emptyFileList_.front();    <-- line 213
        emptyFileList_.pop_front();
    }
    return emptyFileName;
}

If two requests (R1 and R2) are made concurrently while the EFP is empty, the following can happen:
- R1 runs most of the function up to line 212 (the second lock); along the way it creates one empty file.
- R2 does the same, but because the EFP now holds one file, it does not create a new one.
- R1 (or R2, it does not matter) continues from line 212 onward and takes the single empty file.
- The other request then tries to take an empty file from the now-empty EFP and triggers the segfault.
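To make the window concrete, here is a minimal standalone sketch of the same unlock-between-check-and-pop pattern (illustrative only, not the broker code: it assumes C++11, replaces the store's slock with std::lock_guard<std::mutex>, and fakes createEmptyFile(); all names are made up for the example). Run repeatedly, it intermittently calls front() on an empty list, which is the same undefined behaviour seen in frame #2 above:

#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

std::deque<std::string> emptyFileList;   // stands in for emptyFileList_
std::mutex listMutex;                    // stands in for emptyFileListMutex_

void createEmptyFile() {                 // fake: just pushes one "file" onto the list
    std::lock_guard<std::mutex> l(listMutex);
    emptyFileList.push_back("new-file.jrnl");
}

std::string popEmptyFile() {             // same structure as the buggy method
    bool isEmpty;
    {
        std::lock_guard<std::mutex> l(listMutex);
        isEmpty = emptyFileList.empty();
    }                                    // lock released: the other thread can interleave here
    if (isEmpty)
        createEmptyFile();               // only the thread that saw "empty" creates a file
    std::string name;
    {
        std::lock_guard<std::mutex> l(listMutex);
        name = emptyFileList.front();    // undefined behaviour if the list is empty again
        emptyFileList.pop_front();
    }
    return name;
}

int main() {
    std::thread r1([] { popEmptyFile(); });   // the two concurrent requests R1 and R2
    std::thread r2([] { popEmptyFile(); });
    r1.join();
    r2.join();
    std::cout << "done (may crash intermittently)" << std::endl;
    return 0;
}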
Trivial reproducer: enqueue to 10 topics, each having 10 queues bound. It is worth decreasing efp-file-size to e.g. 256 in /etc/qpid/qpidd.conf.

Reproducer script:

topics=10
queues_per_topic=10

rm -rf /var/lib/qpidd/* /tmp/qpidd.log
service qpidd restart

echo "$(date): creating $(($((topics))*$((queues_per_topic)))) queues"
for i in $(seq 1 $topics); do
  for j in $(seq 1 $queues_per_topic); do
    qpid-receive -a "Durable_${i}_${j}; {create:always, node:{durable:true, x-bindings:[{exchange:'amq.direct', queue:'Durable_${i}_${j}', key:'${i}'}] }}" &
  done
done
wait
echo "$(date): queues created"

while true; do
  echo "$(date): publishing messages.."
  for i in $(seq 1 $topics); do
    qpid-send -a "amq.direct/${i}" -m 1000000 --durable=yes --content-size=1000 &
  done
  wait
  echo "$(date): consuming messages.."
  for i in $(seq 1 $topics); do
    for j in $(seq 1 $queues_per_topic); do
      qpid-receive -a "Durable_${i}_${j}" -m 1000000 --print-content=no &
    done
  done
  wait
done
# end of the script

The broker segfaults within a few minutes.
Potential workaround (not applicable every time): populate the EFP with a sufficient number of empty journal files before broker start, so that the EFP is never empty.
Kim, please assess.
Kim, I need an update here.
Root cause: the method popEmptyFile needs to perform an _atomic_ test-and-create-and-grab of an EFP file, otherwise the scenario at the end of "Additional info:" can happen. So let's put the whole method body under the lock. That requires moving the lock out of pushEmptyFile.

Very immature patch:

Index: src/qpid/linearstore/journal/EmptyFilePool.cpp
===================================================================
--- src/qpid/linearstore/journal/EmptyFilePool.cpp	(revision 1631725)
+++ src/qpid/linearstore/journal/EmptyFilePool.cpp	(working copy)
@@ -59,7 +59,10 @@
         if (i->substr(dotPos).compare(".jrnl") == 0 && i->length() == 41) {
             std::string emptyFile(efpDirectory_ + "/" + (*i));
             if (validateEmptyFile(emptyFile)) {
-                pushEmptyFile(emptyFile);
+                {
+                    slock l(emptyFileListMutex_);
+                    pushEmptyFile(emptyFile);
+                }
             }
         }
     }
@@ -111,7 +114,10 @@
         // Try again with new UUID for file name
         newFileName = destDirectory + "/" + getEfpFileName();
         if (moveEmptyFile(emptyFileName.c_str(), newFileName.c_str())) {
-            pushEmptyFile(emptyFileName);
+            {
+                slock l(emptyFileListMutex_);
+                pushEmptyFile(emptyFileName);
+            }
             std::ostringstream oss;
             oss << "file=\"" << emptyFileName << "\" dest=\"" << newFileName << "\"" << FORMAT_SYSERR(errno);
             throw jexception(jerrno::JERR_JDIR_FMOVE, oss.str(), "EmptyFilePool", "takeEmptyFile");
@@ -135,6 +141,7 @@
     if (partitionPtr_->getOverwriteBeforeReturnFlag()) {
         overwriteFileContents(emptyFileName);
     }
+    slock l(emptyFileListMutex_);
     pushEmptyFile(emptyFileName);
 }
 
@@ -211,12 +218,12 @@
     {
         slock l(emptyFileListMutex_);
         isEmpty = emptyFileList_.empty();
-    }
+//    }
     if (isEmpty) {
         createEmptyFile();
     }
-    {
-        slock l(emptyFileListMutex_);
+//    {
+//        slock l(emptyFileListMutex_);
         emptyFileName = emptyFileList_.front();
         emptyFileList_.pop_front();
     }
@@ -224,7 +231,7 @@
 }
 
 void EmptyFilePool::pushEmptyFile(const std::string fqFileName) {
-    slock l(emptyFileListMutex_);
+//    slock l(emptyFileListMutex_);
     emptyFileList_.push_back(fqFileName);
 }

/me successfully testing the patch.
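For readability, here is roughly what popEmptyFile() looks like with the patch above applied (a sketch reconstructed from the diff, not the committed code). The test, the creation of a replacement file and the pop all happen inside one critical section, and since pushEmptyFile() no longer takes the lock itself, createEmptyFile() (which presumably pushes the new file onto the list) can be called while the lock is held without self-deadlocking:

std::string EmptyFilePool::popEmptyFile() {
    std::string emptyFileName;
    slock l(emptyFileListMutex_);            // held across test, create and pop
    if (emptyFileList_.empty()) {
        createEmptyFile();                   // safe: pushEmptyFile() no longer locks internally
    }
    emptyFileName = emptyFileList_.front();  // the freshly created file can no longer be stolen
    emptyFileList_.pop_front();
    return emptyFileName;
}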
Patch sent to review (https://reviews.apache.org/r/26806/)
Committed revision 1632504.
A checkin for BZ1160367 - QPID-5671 (adding the ability to use disk partitions), r.1636598, has replaced Pavel's checkin r.1632504 for this bug. However, because this new checkin also adds new features unrelated to this issue (including a new directory structure for the store), it is not suitable for backporting to existing product branches. This means that both Pavel's fix and this new fix will need to be tested for this defect, but that Pavel's fix should be used for any backporting.
Created attachment 956790 [details]
Backtrace of core-dump

RHEL 2.6.32-504.el6.x86_64
Bug has been tested on both RHEL6 32- and 64-bit VMs with 4 CPUs and 6 GB RAM.
Bug occurred again on both platforms. Coredump backtrace has been attached.
I can provide whole core-dumps if needed (~2 GB).
Bug was reproduced in variable time (5 to 16 to 63 minutes).

qpid-cpp-client-0.22-50.el6.x86_64
qpid-cpp-debuginfo-0.22-50.el6.x86_64
qpid-cpp-server-0.22-50.el6.x86_64
qpid-cpp-server-linearstore-0.22-50.el6.x86_64
(In reply to Michal Toth from comment #13)
> Bug has been tested on both RHEL6 32- and 64-bit VMs with 4 CPUs and 6 GB RAM.
> Bug occurred again on both platforms. Coredump backtrace has been attached.
> I can provide whole core-dumps if needed (~2 GB).
> Bug was reproduced in variable time (5 to 16 to 63 minutes).
> 
> qpid-cpp-client-0.22-50.el6.x86_64
> qpid-cpp-debuginfo-0.22-50.el6.x86_64
> qpid-cpp-server-0.22-50.el6.x86_64
> qpid-cpp-server-linearstore-0.22-50.el6.x86_64

As discussed over IRC, it seems to be caused by ulimit -n being too low. Could you please re-test with a sufficiently high ulimit -n (i.e. 10000 or 100000)?

For the "segfault with low ulimit" case, I raised https://bugzilla.redhat.com/show_bug.cgi?id=1163650.
Tested on 2 more powerful Beaker machines, both x86_64 arch with RHEL6.6:
1) 60 GB RAM, 8 cores
2) 78 GB RAM, 16 cores
Tested with "ulimit -n 100000".

The second machine ran out of memory after running for 8 hours:
Nov 13 18:07:39 ibm-hs22-03 kernel: Out of memory: Kill process 13922 (qpidd) score 283 or sacrifice child

The first machine performed well and successfully finished the reproducer.

Halfway there. Now testing with i386 arch.
Tested with another two i386 machines:
1) 16 cores, 16 GB RAM
2) 8 cores, 32 GB RAM
Results were either out of memory (qc2_spout invoked the oom-killer) or passing the reproducer.

Verified:
qpid-cpp-server-0.22-50.el6.i686
qpid-cpp-server-linearstore-0.22-50.el6.i686
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2014-1952.html