Red Hat Bugzilla – Bug 1443470
Orphaned qpid journal files when unregistering a content host
Last modified: 2017-08-10 13:02:29 EDT
Description of problem: Unregistering a content host (with goferd running) triggers deletion of pulp.agent.* qpid queue on Satellite. But this does not remove the underlying journal file for the durable queue - see https://bugzilla.redhat.com/show_bug.cgi?id=1067429 That means, recycling Content Hosts leaves orphaned files under /var/lib/qpidd. Version-Release number of selected component (if applicable): qpid-cpp-server-linearstore (any version incl. 0.30-* and 0.34-21) How reproducible: 100% Steps to Reproduce: 1. (on Sat - on RHEL7) ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 2. Register a client system, install katello-agent, run goferd, unregister it. 3. ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 4. for that client system, register via sub-man and unregister it in a loop (just to mimic new system registered, used and discarded) - ensure goferd is restarted meantime 5. ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use Actual results: The directory content is growing over time. Expected results: The directory content remains stable. Additional info:
Kim, as journal files in /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use (or similar) directory are left orphaned, is it a safe workaround to: - identify what such files dont have a symlink from /var/lib/qpidd/.qpidd/qls/jrnl2/*/* - delete such files Is this procedure safe? Does it require qpidd being stopped?
FYI this seems not to be _current_ problem in Sat6.2 that uses qpid-cpp 0.30. That qpid-cpp version isnt affected. Only 0.34 version is affected, where we plan to upgrade to, due to bz 1367735.
(In reply to Pavel Moravec from comment #1) > Kim, > as journal files in > > /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use > > (or similar) directory are left orphaned, is it a safe workaround to: > - identify what such files dont have a symlink from > > /var/lib/qpidd/.qpidd/qls/jrnl2/*/* > > - delete such files > > Is this procedure safe? Does it require qpidd being stopped? Yes, this should be safe. No, qpidd does not need to be stopped, as it has lost all references and interest in the orphaned files. I suggest trying this procedure out on a running broker to be sure. If it works, a simple clean-up script could be run periodically from cron to do this. This is only a workaround, this behavior is a bug and should be fixed. As an alternative to deleting these files, they can be moved into the emtpy file pool for later re-use. Their size should determine which EFP directory to place them in. IIRC, if there has been a stoppage or sudden crash, it sometimes can happen that orphaned files are not of the correct size (as they were interrupted during creation). These files should be deleted.
> As an alternative to deleting these files, they can be moved into the emtpy > file pool for later re-use. To clarify, these files can be moved up one directory out of NNNNk/in_use into the NNNNk directory (by default 2048k) so that they can be re-used. However, if this is done while the broker is running, the moved files will not be seen until a restart as the store initializes its internal list of unused files only on startup.
Thanks Kim. I was wrong earlier about the version affected. 0.34 is affected but 0.30 isnt. So an upgrade to 0.34 would have this BZ as a regression.
Kim, can you please review my patch (against upstream)? I did some testing (like delete (non)empty queue with 1 or more files), but nothing more serious. Reasons behind the code changes: - LinearFileController::purgeEmptyFilesToEfp (when called from jcntl::delete_jrnl_files) needs to return _really_ all files, including the latest one - but it can do so at the right time, since the sequence must be: - first close currentJournalFilePtr_ (otherwise segfault in atomic counters for AIO) - purge files - finalize the L.F.C. - remove the directory (in jcntl::delete_jrnl_files) The above explains both split of LinearFileController::finalize and argument for LinearFileController::purgeEmptyFilesToEfp. Patch itself: diff -rup journal-orig/jcntl.cpp journal/jcntl.cpp --- journal-orig/jcntl.cpp 2017-05-13 20:37:09.193101767 +0200 +++ journal/jcntl.cpp 2017-05-16 14:05:11.780167885 +0200 @@ -145,7 +145,8 @@ void jcntl::delete_jrnl_files() { stop(true); // wait for AIO to complete - _linearFileController.purgeEmptyFilesToEfp(); + _linearFileController.closeCurrentJournal(); + _linearFileController.purgeEmptyFilesToEfp(true); _jdir.delete_dir(); } diff -rup journal-orig/LinearFileController.cpp journal/LinearFileController.cpp --- journal-orig/LinearFileController.cpp 2017-05-13 20:37:09.192101773 +0200 +++ journal/LinearFileController.cpp 2017-05-16 14:17:11.589132413 +0200 @@ -49,16 +49,20 @@ void LinearFileController::initialize(co } void LinearFileController::finalize() { - if (currentJournalFilePtr_) { - currentJournalFilePtr_->close(); - currentJournalFilePtr_ = 0; - } + closeCurrentJournal(); while (!journalFileList_.empty()) { delete journalFileList_.front(); journalFileList_.pop_front(); } } +void LinearFileController::closeCurrentJournal() { + if (currentJournalFilePtr_) { + currentJournalFilePtr_->close(); + currentJournalFilePtr_ = 0; + } +} + void LinearFileController::addJournalFile(JournalFile* journalFilePtr, const uint32_t completedDblkCount, const bool makeCurrentFlag) { @@ -105,9 +109,10 @@ void LinearFileController::restoreEmptyF addJournalFile(fileName, emptyFilePoolPtr_->getIdentity(), getNextFileSeqNum(), 0); } -void LinearFileController::purgeEmptyFilesToEfp() { +void LinearFileController::purgeEmptyFilesToEfp(bool force_all) { slock l(journalFileListMutex_); - while (journalFileList_.front()->isNoEnqueuedRecordsRemaining() && journalFileList_.size() > 1) { // Can't purge last file, even if it has no enqueued records + while ((force_all && (journalFileList_.size() > 0)) || // when deleting the queue, remove really all journal files, otherwise .. + ((journalFileList_.size() > 1) && (journalFileList_.front()->isNoEnqueuedRecordsRemaining()))) { // .. dont purge last file, even if it has no enqueued records emptyFilePoolPtr_->returnEmptyFileSymlink(journalFileList_.front()->getFqFileName()); delete journalFileList_.front(); journalFileList_.pop_front(); diff -rup journal-orig/LinearFileController.h journal/LinearFileController.h --- journal-orig/LinearFileController.h 2017-05-13 20:37:09.192101773 +0200 +++ journal/LinearFileController.h 2017-05-16 14:04:52.817274436 +0200 @@ -60,6 +60,7 @@ public: EmptyFilePool* emptyFilePoolPtr, uint64_t initialFileNumberVal); void finalize(); + void closeCurrentJournal(); void addJournalFile(JournalFile* journalFilePtr, const uint32_t completedDblkCount, @@ -71,7 +72,7 @@ public: uint64_t getNextRecordId(); void removeFileToEfp(const std::string& fileName); void restoreEmptyFile(const std::string& fileName); - void purgeEmptyFilesToEfp(); + void purgeEmptyFilesToEfp(bool force_all=false); // Functions for manipulating counts of non-current JournalFile instances in journalFileList_ uint32_t getEnqueuedRecordCount(const uint64_t fileSeqNumber);
Pavel, The patch looks good to me. Coincidentally, I submitted a similar patch (though not quite as elegant) on BZ1067429.
Thanks for feedback. I run some stress tests (concurrent creating+enqueueing+dequeueing+deleting of queues) and: - no stability issue noticed - all files returned properly - no jrnl file leak elsewhere so commited my patch to upstream.
Build: Satellite 6.2.11 snap1 On Satellite for i in {1..1000};do ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc; sleep 5;done 12 12 504 11 11 462 12 12 504 12 12 504 12 12 504 11 11 462 12 12 504 12 12 504 12 12 504 11 11 462 12 12 504 12 12 504 12 12 504 11 11 462 12 12 504 12 12 504 12 12 504 12 12 504 12 12 504 12 12 504 12 12 504 On client system approx for 30 times for i in {1..100}; do bash apple.sh echo $i;done This system is already registered. Use --force to override System has been unregistered. The system has been registered with ID: 5a658ee2--2fa893e060fd Installed Product Current Status: Product Name: Red Hat Enterprise Linux Server Status: Subscribed Redirecting to /bin/systemctl restart goferd.service The contents of the folder did not increase and were stable.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2466