1443470 – Orphaned qpid journal files when unregistering a content host

Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1443470 - Orphaned qpid journal files when unregistering a content host

Summary: Orphaned qpid journal files when unregistering a content host

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Qpid
Sub Component:
Version:	6.2.8
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	Unspecified
Assignee:	Pavel Moravec
QA Contact:	Sanket Jagtap
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-04-19 10:00 UTC by Pavel Moravec
Modified:	2023-09-07 18:52 UTC (History)
CC List:	8 users (show)
Fixed In Version:	qpid-cpp-0.34-25
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1067429
Clones:	1463819 (view as bug list)
Environment:
Last Closed:	2017-08-10 17:02:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Apache JIRA	QPID-7788	None	None	None	2017-05-18 11:29:03 UTC
Red Hat Bugzilla	1367735	urgent	CLOSED	qpidd consuming huge memory during scale	2021-09-09 11:54:43 UTC
Red Hat Product Errata	RHBA-2017:2466	normal	SHIPPED_LIVE	Satellite 6.2.11 Async Release	2017-08-10 21:01:20 UTC

Internal Links: 1367735

Description Pavel Moravec 2017-04-19 10:00:18 UTC

Description of problem:
Unregistering a content host (with goferd running) triggers deletion of pulp.agent.* qpid queue on Satellite. But this does not remove the underlying journal file for the durable queue - see

https://bugzilla.redhat.com/show_bug.cgi?id=1067429

That means, recycling Content Hosts leaves orphaned files under /var/lib/qpidd.


Version-Release number of selected component (if applicable):
qpid-cpp-server-linearstore (any version incl. 0.30-* and 0.34-21)


How reproducible:
100%


Steps to Reproduce:
1. (on Sat - on RHEL7)
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use

2. Register a client system, install katello-agent, run goferd, unregister it.

3.
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use

4. for that client system, register via sub-man and unregister it in a loop (just to mimic new system registered, used and discarded) - ensure goferd is restarted meantime

5.
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use


Actual results:
The directory content is growing over time.


Expected results:
The directory content remains stable.


Additional info:

Comment 1 Pavel Moravec 2017-04-19 10:02:57 UTC

Kim,
as journal files in

/var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 

(or similar) directory are left orphaned, is it a safe workaround to:
- identify what such files dont have a symlink from 

/var/lib/qpidd/.qpidd/qls/jrnl2/*/*

- delete such files

Is this procedure safe? Does it require qpidd being stopped?

Comment 4 Pavel Moravec 2017-04-19 21:03:00 UTC

FYI this seems not to be _current_ problem in Sat6.2 that uses qpid-cpp 0.30. That qpid-cpp version isnt affected.

Only 0.34 version is affected, where we plan to upgrade to, due to bz 1367735.

Comment 5 Kim van der Riet 2017-05-10 14:51:12 UTC

(In reply to Pavel Moravec from comment #1)
> Kim,
> as journal files in
> 
> /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 
> 
> (or similar) directory are left orphaned, is it a safe workaround to:
> - identify what such files dont have a symlink from 
> 
> /var/lib/qpidd/.qpidd/qls/jrnl2/*/*
> 
> - delete such files
> 
> Is this procedure safe? Does it require qpidd being stopped?

Yes, this should be safe. No, qpidd does not need to be stopped, as it has lost all references and interest in the orphaned files. I suggest trying this procedure out on a running broker to be sure. If it works, a simple clean-up script could be run periodically from cron to do this. This is only a workaround, this behavior is a bug and should be fixed.

As an alternative to deleting these files, they can be moved into the emtpy file pool for later re-use. Their size should determine which EFP directory to place them in. IIRC, if there has been a stoppage or sudden crash, it sometimes can happen that orphaned files are not of the correct size (as they were interrupted during creation). These files should be deleted.

Comment 6 Kim van der Riet 2017-05-10 15:03:41 UTC

> As an alternative to deleting these files, they can be moved into the emtpy
> file pool for later re-use.

To clarify, these files can be moved up one directory out of NNNNk/in_use into the NNNNk directory (by default 2048k) so that they can be re-used. However, if this is done while the broker is running, the moved files will not be seen until a restart as the store initializes its internal list of unused files only on startup.

Comment 7 Pavel Moravec 2017-05-10 20:33:23 UTC

Thanks Kim.

I was wrong earlier about the version affected. 0.34 is affected but 0.30 isnt. So an upgrade to 0.34 would have this BZ as a regression.

Comment 8 Pavel Moravec 2017-05-16 12:27:27 UTC

Kim,
can you please review my patch (against upstream)?

I did some testing (like delete (non)empty queue with 1 or more files), but nothing more serious.

Reasons behind the code changes:
- LinearFileController::purgeEmptyFilesToEfp (when called from jcntl::delete_jrnl_files) needs to return _really_ all files, including the latest one
- but it can do so at the right time, since the sequence must be:
  - first close currentJournalFilePtr_ (otherwise segfault in atomic counters for AIO)
  - purge files
  - finalize the L.F.C.
  - remove the directory (in jcntl::delete_jrnl_files)

The above explains both split of LinearFileController::finalize and argument for LinearFileController::purgeEmptyFilesToEfp.

Patch itself:

diff -rup journal-orig/jcntl.cpp journal/jcntl.cpp
--- journal-orig/jcntl.cpp	2017-05-13 20:37:09.193101767 +0200
+++ journal/jcntl.cpp	2017-05-16 14:05:11.780167885 +0200
@@ -145,7 +145,8 @@ void
 jcntl::delete_jrnl_files()
 {
     stop(true); // wait for AIO to complete
-    _linearFileController.purgeEmptyFilesToEfp();
+    _linearFileController.closeCurrentJournal();
+    _linearFileController.purgeEmptyFilesToEfp(true);
     _jdir.delete_dir();
 }
 
diff -rup journal-orig/LinearFileController.cpp journal/LinearFileController.cpp
--- journal-orig/LinearFileController.cpp	2017-05-13 20:37:09.192101773 +0200
+++ journal/LinearFileController.cpp	2017-05-16 14:17:11.589132413 +0200
@@ -49,16 +49,20 @@ void LinearFileController::initialize(co
 }
 
 void LinearFileController::finalize() {
-    if (currentJournalFilePtr_) {
-        currentJournalFilePtr_->close();
-        currentJournalFilePtr_ = 0;
-    }
+    closeCurrentJournal();
     while (!journalFileList_.empty()) {
         delete journalFileList_.front();
         journalFileList_.pop_front();
     }
 }
 
+void LinearFileController::closeCurrentJournal() {
+    if (currentJournalFilePtr_) {
+        currentJournalFilePtr_->close();
+        currentJournalFilePtr_ = 0;
+    }
+}
+
 void LinearFileController::addJournalFile(JournalFile* journalFilePtr,
                                           const uint32_t completedDblkCount,
                                           const bool makeCurrentFlag) {
@@ -105,9 +109,10 @@ void LinearFileController::restoreEmptyF
     addJournalFile(fileName, emptyFilePoolPtr_->getIdentity(), getNextFileSeqNum(), 0);
 }
 
-void LinearFileController::purgeEmptyFilesToEfp() {
+void LinearFileController::purgeEmptyFilesToEfp(bool force_all) {
     slock l(journalFileListMutex_);
-    while (journalFileList_.front()->isNoEnqueuedRecordsRemaining() && journalFileList_.size() > 1) { // Can't purge last file, even if it has no enqueued records
+    while ((force_all && (journalFileList_.size() > 0)) || // when deleting the queue, remove really all journal files, otherwise ..
+          ((journalFileList_.size() > 1) && (journalFileList_.front()->isNoEnqueuedRecordsRemaining()))) { // .. dont purge last file, even if it has no enqueued records
         emptyFilePoolPtr_->returnEmptyFileSymlink(journalFileList_.front()->getFqFileName());
         delete journalFileList_.front();
         journalFileList_.pop_front();
diff -rup journal-orig/LinearFileController.h journal/LinearFileController.h
--- journal-orig/LinearFileController.h	2017-05-13 20:37:09.192101773 +0200
+++ journal/LinearFileController.h	2017-05-16 14:04:52.817274436 +0200
@@ -60,6 +60,7 @@ public:
                     EmptyFilePool* emptyFilePoolPtr,
                     uint64_t initialFileNumberVal);
     void finalize();
+    void closeCurrentJournal();
 
     void addJournalFile(JournalFile* journalFilePtr,
                         const uint32_t completedDblkCount,
@@ -71,7 +72,7 @@ public:
     uint64_t getNextRecordId();
     void removeFileToEfp(const std::string& fileName);
     void restoreEmptyFile(const std::string& fileName);
-    void purgeEmptyFilesToEfp();
+    void purgeEmptyFilesToEfp(bool force_all=false);
 
     // Functions for manipulating counts of non-current JournalFile instances in journalFileList_
     uint32_t getEnqueuedRecordCount(const uint64_t fileSeqNumber);

Comment 9 Kim van der Riet 2017-05-17 17:19:25 UTC

Pavel,
The patch looks good to me.

Coincidentally, I submitted a similar patch (though not quite as elegant) on BZ1067429.

Comment 10 Pavel Moravec 2017-05-18 11:29:04 UTC

Thanks for feedback.

I run some stress tests (concurrent creating+enqueueing+dequeueing+deleting of queues) and:
- no stability issue noticed
- all files returned properly
- no jrnl file leak elsewhere

so commited my patch to upstream.

Comment 12 Sanket Jagtap 2017-07-24 12:45:17 UTC

Build: Satellite 6.2.11 snap1 


On Satellite
for i in {1..1000};do ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc; sleep 5;done
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504


On client system approx for 30 times
for i in {1..100}; do bash apple.sh echo $i;done
This system is already registered. Use --force to override
System has been unregistered.
The system has been registered with ID: 5a658ee2--2fa893e060fd

Installed Product Current Status:
Product Name: Red Hat Enterprise Linux Server
Status:       Subscribed
Redirecting to /bin/systemctl restart  goferd.service

The contents of the folder did not increase and were stable.

Comment 14 errata-xmlrpc 2017-08-10 17:02:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2466

Note You need to log in before you can comment on or make changes to this bug.