Bug 1561819 - qpidd segfault at 10 ip 00007f27fbbabe19 sp 00007ffe478c05a0 error 4 in linearstore.so
Summary: qpidd segfault at 10 ip 00007f27fbbabe19 sp 00007ffe478c05a0 error 4 in linea...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.3.0
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Harshad More
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-03-29 00:28 UTC by Paul Dudley
Modified: 2021-12-10 15:53 UTC (History)
18 users (show)

Fixed In Version: qpid-cpp-1.36.0-18
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-19 20:17:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
patch proposal (sufficient, incomplete) (695 bytes, patch)
2018-05-07 09:53 UTC, Pavel Moravec
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-8184 0 None None None 2018-05-07 20:42:21 UTC
Red Hat Knowledge Base (Solution) 3433901 0 None None None 2018-05-07 10:26:28 UTC
Red Hat Product Errata RHBA-2018:1950 0 None None None 2018-06-19 20:17:33 UTC

Comment 4 Neal Kim 2018-03-29 19:15:06 UTC
#0  qpid::linearstore::journal::EmptyFilePool::getIdentity (this=0x0) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/EmptyFilePool.cpp:139
#1  0x00007f863a6dbe34 in qpid::linearstore::journal::LinearFileController::restoreEmptyFile (this=this@entry=0x2f4520f8, 
    fileName="/var/lib/qpidd/.qpidd/qls/jrnl2/pulp.agent.4a01d18a-542a-4dd2-994e-ab91ee0ca00d/754b3db5-4f9c-4fbb-ad75-478715cf70d0.jrnl")
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/LinearFileController.cpp:109
#2  0x00007f863a6e0c3b in qpid::linearstore::journal::RecoveryManager::setLinearFileControllerJournals (this=this@entry=0x2f452640, fnPtr=<optimized out>, lfcPtr=lfcPtr@entry=0x2f4520f8)
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/RecoveryManager.cpp:305
#3  0x00007f863a6cacc8 in qpid::linearstore::journal::jcntl::recover (this=this@entry=0x2f4520c8, efpmp=<optimized out>, wcache_num_pages=wcache_num_pages@entry=16, 
    wcache_pgsize_sblks=wcache_pgsize_sblks@entry=1, cbp=cbp@entry=0x2f452980, prep_txn_list_ptr=prep_txn_list_ptr@entry=0x7ffdf4db9060, highest_rid=@0x7ffdf4db9760: 0)
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/jcntl.cpp:124
#4  0x00007f863a6fc130 in qpid::linearstore::JournalImpl::recover (this=this@entry=0x2f4520c0, efpm=..., wcache_num_pages=<optimized out>, wcache_pgsize_sblks=1, cbp=0x2f452980, 
    prep_tx_list_ptr=prep_tx_list_ptr@entry=0x7ffdf4db9ea0, highest_rid=@0x7ffdf4db9760: 0, queue_id=9912) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:236
#5  0x00007f863a714454 in recover (queue_id=<optimized out>, highest_rid=@0x7ffdf4db9760: 0, prep_tx_list_ptr=0x7ffdf4db9ea0, wcache_pgsize_sblks=<optimized out>, 
    wcache_num_pages=<optimized out>, efpm=..., this=0x2f4520c0) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.h:140
#6  qpid::linearstore::MessageStoreImpl::recoverQueues (this=this@entry=0x1da0b70, txn=..., registry=..., queue_index=std::map with 7999 elements = {...}, prepared=..., 
    messages=std::map with 0 elements) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/MessageStoreImpl.cpp:758
#7  0x00007f863a715344 in qpid::linearstore::MessageStoreImpl::recover (this=0x1da0b70, registry_=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/MessageStoreImpl.cpp:627
#8  0x00007f8640215415 in qpid::broker::MessageStoreModule::recover (this=<optimized out>, registry=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/MessageStoreModule.cpp:87
#9  0x00007f8640173208 in qpid::broker::Broker::Broker (this=0x1d7acd0, conf=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Broker.cpp:326
#10 0x0000000000405b52 in qpid::broker::QpiddBroker::execute (this=this@entry=0x7ffdf4dbb99e, options=0x1d6fd70) at /usr/src/debug/qpid-cpp-1.36.0/src/posix/QpiddBroker.cpp:229
#11 0x00000000004099c4 in qpid::broker::run_broker (argc=3, argv=0x7ffdf4dbbd38, hidden=<optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpidd.cpp:108
#12 0x00007f863ecbec05 in __libc_start_main (main=0x404c10 <main(int, char**)>, argc=3, ubp_av=0x7ffdf4dbbd38, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffdf4dbbd28) at ../csu/libc-start.c:274
#13 0x0000000000404e81 in _start ()

Comment 6 Pavel Moravec 2018-04-03 15:50:49 UTC
my 2c:

(gdb) frame 1
#1  0x00007f863a6dbe34 in qpid::linearstore::journal::LinearFileController::restoreEmptyFile (this=this@entry=0x2f4520f8, 
    fileName="/var/lib/qpidd/.qpidd/qls/jrnl2/pulp.agent.4a01d18a-542a-4dd2-994e-ab91ee0ca00d/754b3db5-4f9c-4fbb-ad75-478715cf70d0.jrnl")
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/LinearFileController.cpp:109
109	    addJournalFile(fileName, emptyFilePoolPtr_->getIdentity(), getNextFileSeqNum(), 0);
(gdb) p emptyFilePoolPtr_
$6 = (qpid::linearstore::journal::EmptyFilePool *) 0x0

- here is the technical problem, efp ptr cant be null

(gdb) frame 4
#4  0x00007f863a6fc130 in qpid::linearstore::JournalImpl::recover (this=this@entry=0x2f4520c0, efpm=..., wcache_num_pages=<optimized out>, wcache_pgsize_sblks=1, cbp=0x2f452980, 
    prep_tx_list_ptr=prep_tx_list_ptr@entry=0x7ffdf4db9ea0, highest_rid=@0x7ffdf4db9760: 0, queue_id=9912) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:236
236	        jcntl::recover(efpm.get(), wcache_num_pages, wcache_pgsize_sblks, cbp, &prep_xid_list, highest_rid);
(gdb) p efpm.px
$7 = (boost::shared_ptr<qpid::linearstore::journal::EmptyFilePoolManager>::element_type *) 0x1da57f0

- the efp ptr should equal to this ptr


This is strange:

(gdb) frame 3
#3  0x00007f863a6cacc8 in qpid::linearstore::journal::jcntl::recover (this=this@entry=0x2f4520c8, efpmp=<optimized out>, wcache_num_pages=wcache_num_pages@entry=16, 
    wcache_pgsize_sblks=wcache_pgsize_sblks@entry=1, cbp=cbp@entry=0x2f452980, prep_txn_list_ptr=prep_txn_list_ptr@entry=0x7ffdf4db9060, highest_rid=@0x7ffdf4db9760: 0)
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/jcntl.cpp:124
124	    _recoveryManager.setLinearFileControllerJournals(&qpid::linearstore::journal::LinearFileController::addJournalFile, &_linearFileController);
(gdb) list
118         _recoveryManager.analyzeJournals(prep_txn_list_ptr, efpmp, &_emptyFilePoolPtr);
119	    assert(_emptyFilePoolPtr != 0);
120	
121	    highest_rid = _recoveryManager.getHighestRecordId();
122	    _jrnl_log.log(/*LOG_DEBUG*/JournalLog::LOG_INFO, _jid, _recoveryManager.toString(_jid, 5U));
123	    _linearFileController.initialize(_jdir.dirname(), _emptyFilePoolPtr, _recoveryManager.getHighestFileNumber());
124	    _recoveryManager.setLinearFileControllerJournals(&qpid::linearstore::journal::LinearFileController::addJournalFile, &_linearFileController);
125	    if (_recoveryManager.isLastFileFull()) {
126	        _linearFileController.getNextJournalFile();
127	    }
128	    _wmgr.initialize(cbp, wcache_pgsize_sblks, wcache_num_pages, QLS_WMGR_MAXDTOKPP, QLS_WMGR_MAXWAITUS,
(gdb) p _emptyFilePoolPtr
$9 = (qpid::linearstore::journal::EmptyFilePool *) 0x0
(gdb) 

so:
- _emptyFilePoolPtr was populated on line 118
- .. asserted as not null on 119
- used to initialize _linearFileController that has it initialized to null already (I speculate)
- now it is null

what could null-ize the _linearFileController ??? If it isnt assigned anywhere??? and no concurrency could cause re-initializing this class?

Comment 8 Pavel Moravec 2018-04-03 18:32:58 UTC
Even more peculiar information deduced:

1) bt starts with:

#0  qpid::linearstore::journal::EmptyFilePool::getIdentity (this=0x0) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/EmptyFilePool.cpp:139

being called from:

#1  0x00007f863a6dbe34 in qpid::linearstore::journal::LinearFileController::restoreEmptyFile (this=this@entry=0x2f4520f8, 
    fileName="/var/lib/qpidd/.qpidd/qls/jrnl2/pulp.agent.4a01d18a-542a-4dd2-994e-ab91ee0ca00d/754b3db5-4f9c-4fbb-ad75-478715cf70d0.jrnl")
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/journal/LinearFileController.cpp:109

That means, LinearFileController.cpp:109 :
    addJournalFile(fileName, emptyFilePoolPtr_->getIdentity(), getNextFileSeqNum(), 0);

was able to dereference emptyFilePoolPtr_ properly, to call its method. So _at_this_time_, at the beginning of getIdentity() method, emptyFilePoolPtr_ was a valid pointer.


2) So the valid pointer had to be null-ified during execution of getIdentity() method. What the method does? Almost nothing:

const efpIdentity_t EmptyFilePool::getIdentity() const {
    return efpIdentity_t(partitionPtr_->getPartitionNumber(), efpDataSize_kib_);
}

- partitionPtr_->getPartitionNumber() just returns partitionNum_
- efpDataSize_kib_ is a number
- efpIdentity_t is a copy constructor that just assigns both values

How either of the three steps could ever cause the class instance becomes null???

Comment 13 Pavel Moravec 2018-05-04 16:49:59 UTC
I might spotted something: when I copied /var/lib/qpidd/.qpidd to my system and started qpidd with info+ logs, I noticed before the sgefault log:

2018-05-04 18:40:45 [Store] warning Linear Store: EmptyFilePool create failed: jexception 0x0d03 EmptyFilePool::fileSizeKbFromDirName() threw JERR_EFP_BADEFPDIRNAME: Bad Empty File Pool directory name (must be 'NNNk', where NNN is a number which is a multiple of 4) (Partition: 1; EFP directory: '6k')


The wrong EFP directory name differs each time ('6k' or '10k' or '83k'), I *think* nothing changed under the .qpidd dir meantime, so the value must come from some random uninitialized vortex.

So if I am right, the fix should be 2fold:
1) prevent populating the EFP dir.name from some random place (this is critical)
2) raising JERR_EFP_BADEFPDIRNAME shall not cause segfault but normal program termination

Comment 15 Pavel Moravec 2018-05-05 19:01:50 UTC
My 2$ (not only cents):

adding some debugs to src.code, this happens in linearstore/journal/RecoveryManager.cpp :

void RecoveryManager::analyzeJournalFileHeaders(efpIdentity_t& efpIdentity) {
..
        } else if (hdrEmpty) {
            // Read symlink, find efp directory name which is efp size in KiB
            // TODO: place this bit into a common function as it is also used in EmptyFilePool.cpp::deleteSymlink()
            char buff[1024];
            ssize_t len = ::readlink((*i).c_str(), buff, 1024);
            if (len < 0) {
                std::ostringstream oss;
                oss << "symlink=\"" << (*i) << "\"" << FORMAT_SYSERR(errno);
                throw jexception(jerrno::JERR__SYMLINK, oss.str(), "RecoveryManager", "analyzeJournalFileHeaders");
            }
            // Find second and third '/' from back of string, which contains the EFP directory name
            *(::strrchr(buff, '/')) = '\0';
            *(::strrchr(buff, '/')) = '\0';
            int efpDataSize_kib = atoi(::strrchr(buff, '/') + 1);
            uninitFileList_.push_back(*i);
            efpIdentity.pn_ = fileHeader._efp_partition;
            efpIdentity.ds_ = efpDataSize_kib;
            printf("PavelM: hdrOk and hdrEmpty: link=%s buff=%s len=%li efpDataSize_kib=%i\n", (*i).c_str(), buff, len, efpDataSize_kib);
..

Just before the segfault, this is printed:

2018-05-05 20:34:37 [Store] info Linear Store: Recovered queue "pulp.agent.9a905acf-26c0-415e-b0c4-399211c4f3c6": 0 messages recovered; 0 messages in-doubt.
PavelM: hdrOk and hdrEmpty: link=/root/.qpidd/qls/jrnl2/pulp.agent.c5e62b44-ebf8-40f1-b262-ce28080ec5a5/378bf898-09d0-4ddd-91c3-e85da595dbc2.jrnl buff=/var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use len=89 efpDataSize_kib=0
2018-05-05 20:34:37 [Store] info Linear Store: Journal "pulp.agent.c5e62b44-ebf8-40f1-b262-ce28080ec5a5": 
     Journal recovery analysis (jid="pulp.agent.c5e62b44-ebf8-40f1-b262-ce28080ec5a5"):
     <Journal empty, no journal files found>

2018-05-05 20:34:37 [Store] info Linear Store: Recovered queue "pulp.agent.c5e62b44-ebf8-40f1-b262-ce28080ec5a5": 0 messages recovered; 0 messages in-doubt.
PavelM: hdrOk and hdrEmpty: link=/root/.qpidd/qls/jrnl2/pulp.agent.da90ab27-572c-4e02-a8ba-ed5aeb451eb7/6d7056d4-1e0b-4fc9-8a40-e1d946073820.jrnl buff=/var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use/6d7056d4-1e0b-4fc9-8a40-e1d946073820.jrnl len=89 efpDataSize_kib=6
2018-05-05 20:34:37 [Store] warning Linear Store: EmptyFilePool create failed: jexception 0x0d03 EmptyFilePool::fileSizeKbFromDirName() threw JERR_EFP_BADEFPDIRNAME: Bad Empty File Pool directory name (must be 'NNNk', where NNN is a number which is a multiple of 4) (Partition: 1; EFP directory: '6k')
2018-05-05 20:34:37 [Store] info Linear Store: Journal "pulp.agent.da90ab27-572c-4e02-a8ba-ed5aeb451eb7": 
     Journal recovery analysis (jid="pulp.agent.da90ab27-572c-4e02-a8ba-ed5aeb451eb7"):
     <Journal empty, no journal files found>

The 1st bunch of logs is OK - EFP identity data size properly identified from the symlink target.

For the 2nd bunch of logs, the buff is properly populated (symlink followed), but the strrchr commands didnt work there, resulting in buff not truncated. Hence instead of taking "2048" from 2048k dirname, 6 from 6d7056d4-.. filename is taken. I.e. strrchr did not apply here.

This "strrchr not applied to a string" happens:
- for a random string (as radnom queue is the latest recovered / problematic)
- for that string, in every call of the strrchr (since the method is applied either 3times or never)


Million dollar question: why strrchr sometimes doesnt apply to the string with (path&)filename ?

Comment 16 Pavel Moravec 2018-05-05 20:16:15 UTC
.. anyway when I create a simple program to traverse /root/.qpidd/qls/jrnl2/ dir, follow symlinks and apply the above code, it applies strrchr properly every time.

Comment 17 Pavel Moravec 2018-05-06 06:52:09 UTC
Workaround:

Before starting the broker, rename all the symlinks and filenames such that they start with 2048a (or any other hexadecimal digit that isnt decimal digit). Like for a symlink:

/var/lib/qpidd/.qpidd/qls/jrnl2/pulp.agent.da90ab27-572c-4e02-a8ba-ed5aeb451eb7/6d7056d4-1e0b-4fc9-8a40-e1d946073820.jrnl  symbolic link to  /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use/6d7056d4-1e0b-4fc9-8a40-e1d946073820.jrnl

rename both the symlink and the target to:

/var/lib/qpidd/.qpidd/qls/jrnl2/pulp.agent.da90ab27-572c-4e02-a8ba-ed5aeb451eb7/2048aaaa-1e0b-4fc9-8a40-e1d946073820.jrnl  symbolic link to  /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use/2048aaaa-1e0b-4fc9-8a40-e1d946073820.jrnl

One has to modify so for _all_ /var/lib/qpidd/.qpidd/qls/jrnl2/*/*jrnl symlinks (and their targets), to be sure.


Then even if strrchr fails to truncate '/' and tries to take filename like "2048k" dirname, it wont further hurt (the 'k' from the dirname is ignored, this works well).

Comment 18 Pavel Moravec 2018-05-07 09:53:54 UTC
Created attachment 1432547 [details]
patch proposal (sufficient, incomplete)

Reasoning:

(*) Method readlink used in

https://github.com/apache/qpid-cpp/blob/master/src/qpid/linearstore/journal/RecoveryManager.cpp#L395

does not append trailing '\0' to buff (see e.g. http://man7.org/linux/man-pages/man2/readlink.2.html).


(*) That means, buff string can continue behind the ".jrnl" fragment with arbitrary data (until first null character appears).

(*) In the extra continuation contains '/', methods strrchr applied to truncate the path fails to disect the proper directory, and filename (<uuid>.jrnl) is often taken instead of directory name with DS ("2048k").

(*) Converting the leading digits of this string (atoi command on line 404) shall return 2048 (or any other valid DataSize) but it returns arbitrary number when applied to UUID.

(*) For invalid DataSize, linearstore raises JERR_EFP_BADEFPDIRNAME.

(*) That exception is somehow wrongly caught / processed, such that:
- it assumes whole recovery should terminate
- but recovery of next journal starts and hits the nullpointer


The patch ensures the buffer is properly terminated by '\0'. Then the JERR_EFP_BADEFPDIRNAME does not happen.

The patch is incomplete as it also *should* properly catch/process the exception to stop recovery of other journals.

Comment 19 Kim van der Riet 2018-05-07 13:52:16 UTC
Excellent work analyzing the problem. I came to the conclusion analyzing the code only that the problem was either with ::readlink() or ::strrchr(), but I was unable to compile a test sore on the test machine in c#10 as the RHEL7 deps could not be met.

If this is sufficiently urgent, I suggest we submit the patch as a first step to the clients that need it so that they can get started in a more reliable way, and I will provide a way to handle the error should it occur in a way that does not blow up the broker as a second update. Otherwise, wait for both patches, I'll have it done later today.

I suggest one minor update to the patch to ensure that no buffer overwrites can occur:

char buff[1024];
ssize_t len = ::readlink((*i).c_str(), buff, 1023);
                                             ^^^^
if (len < 0)) {...}
buff[len] = '\0';

Comment 35 errata-xmlrpc 2018-06-19 20:17:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1950


Note You need to log in before you can comment on or make changes to this bug.