Bug 1443470

Summary: Orphaned qpid journal files when unregistering a content host
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Qpid
Assignee: Pavel Moravec <pmoravec>
Status: CLOSED ERRATA
QA Contact: Sanket Jagtap <sjagtap>
Severity: high
Docs Contact:
Priority: high
Version: 6.2.8
CC: andrew.schofield, bbuckingham, bkearney, jcallaha, kim.vdriet, mcressma, sjagtap, zhunting
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qpid-cpp-0.34-25
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1067429
: 1463819 (view as bug list)
Environment:
Last Closed: 2017-08-10 17:02:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pavel Moravec 2017-04-19 10:00:18 UTC
Description of problem:
Unregistering a content host (with goferd running) triggers deletion of its pulp.agent.* qpid queue on the Satellite. But this does not remove the underlying journal file for the durable queue; see

https://bugzilla.redhat.com/show_bug.cgi?id=1067429

That means recycling Content Hosts leaves orphaned files under /var/lib/qpidd.


Version-Release number of selected component (if applicable):
qpid-cpp-server-linearstore (any version incl. 0.30-* and 0.34-21)


How reproducible:
100%


Steps to Reproduce:
1. On the Satellite (RHEL 7), record the baseline:
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use

2. Register a client system, install katello-agent, run goferd, unregister it.

3.
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use

4. For that client system, register via subscription-manager and unregister it in a loop (to mimic a new system being registered, used, and discarded); ensure goferd is restarted in the meantime.

5.
ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc
du -ks /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use
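Step 4 above can be sketched as a small POSIX-shell helper. This is only an illustration: the function name is made up, and the register/unregister command lines are placeholders for your actual subscription-manager invocations.

```shell
# Sketch of the register/unregister loop in step 4. REG and UNREG are
# placeholders for your real subscription-manager command lines; the
# goferd restart mimics the agent reconnecting as on a fresh system.
recycle() {
    n=$1; reg=$2; unreg=$3
    i=1
    while [ "$i" -le "$n" ]; do
        $reg
        systemctl restart goferd 2>/dev/null || true
        $unreg
        i=$((i + 1))
    done
}

# Example (options depend on your org / activation key):
#   recycle 30 'subscription-manager register --force --org=X --activationkey=Y' \
#              'subscription-manager unregister'
```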


Actual results:
The directory content grows over time.


Expected results:
The directory content remains stable.


Additional info:

Comment 1 Pavel Moravec 2017-04-19 10:02:57 UTC
Kim,
as journal files in

/var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 

(or similar) directory are left orphaned, is it a safe workaround to:
- identify which such files don't have a symlink pointing to them from

/var/lib/qpidd/.qpidd/qls/jrnl2/*/*

- delete those files

Is this procedure safe? Does it require qpidd being stopped?

Comment 4 Pavel Moravec 2017-04-19 21:03:00 UTC
FYI, this does not seem to be a _current_ problem in Sat 6.2, which uses qpid-cpp 0.30; that qpid-cpp version isn't affected.

Only the 0.34 version, to which we plan to upgrade due to bz 1367735, is affected.

Comment 5 Kim van der Riet 2017-05-10 14:51:12 UTC
(In reply to Pavel Moravec from comment #1)
> Kim,
> as journal files in
> 
> /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use 
> 
> (or similar) directory are left orphaned, is it a safe workaround to:
> - identify what such files dont have a symlink from 
> 
> /var/lib/qpidd/.qpidd/qls/jrnl2/*/*
> 
> - delete such files
> 
> Is this procedure safe? Does it require qpidd being stopped?

Yes, this should be safe. No, qpidd does not need to be stopped, as it has lost all references to, and interest in, the orphaned files. I suggest trying this procedure on a running broker to be sure. If it works, a simple clean-up script could be run periodically from cron. This is only a workaround, though; the behavior is a bug and should be fixed.

As an alternative to deleting these files, they can be moved into the empty file pool for later re-use. Their size should determine which EFP directory to place them in. IIRC, if there has been a stoppage or sudden crash, it can sometimes happen that orphaned files are not of the correct size (as they were interrupted during creation). Those files should be deleted.
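The check described above (find in_use files that no jrnl2 symlink resolves to) could be sketched as a shell function. This is a hypothetical helper, not part of qpid: the function name is made up, the directory layout is taken from this report, and the default 2048k pool is assumed.

```shell
# Hypothetical helper; layout assumed from this report:
#   $QLS/p001/efp/2048k/in_use/  - journal files handed out by the pool
#   $QLS/jrnl2/<queue>/          - per-queue symlinks into in_use
# Prints every in_use file that no jrnl2 symlink resolves to.
list_orphaned() {
    qls=$1
    for f in "$qls"/p001/efp/2048k/in_use/*; do
        [ -e "$f" ] || continue
        target=$(readlink -f "$f")
        # A file is orphaned when no symlink under jrnl2 points at it
        if ! find "$qls/jrnl2" -type l -exec readlink -f {} \; 2>/dev/null |
                grep -qxF "$target"; then
            echo "$f"
        fi
    done
}

# Inspect the output first, then delete (or return to the pool) once satisfied:
#   list_orphaned /var/lib/qpidd/.qpidd/qls
```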

Comment 6 Kim van der Riet 2017-05-10 15:03:41 UTC
> As an alternative to deleting these files, they can be moved into the empty
> file pool for later re-use.

To clarify, these files can be moved up one directory, out of NNNNk/in_use into the NNNNk directory (by default 2048k), so that they can be re-used. However, if this is done while the broker is running, the moved files will not be seen until a restart, as the store initializes its internal list of unused files only on startup.
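The move described above amounts to one `mv` per file. A minimal sketch, with a made-up helper name; remember the caveats from the comments above (restart required before re-use, and wrongly sized files should be deleted instead):

```shell
# Hypothetical helper: move a journal file up one directory, out of
# NNNNk/in_use into NNNNk, so the pool can re-use it. The broker only
# sees the file after a restart, and the file must be of the correct
# pool size; wrongly sized files should be deleted instead.
return_to_efp() {
    f=$1
    pool_dir=$(dirname "$(dirname "$f")")   # .../2048k/in_use/x -> .../2048k
    mv "$f" "$pool_dir/"
}

# Example:
#   return_to_efp /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use/SOMEFILE.jrnl
```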

Comment 7 Pavel Moravec 2017-05-10 20:33:23 UTC
Thanks Kim.

I was wrong earlier about the affected versions: 0.34 is affected but 0.30 isn't. So an upgrade to 0.34 would introduce this BZ as a regression.

Comment 8 Pavel Moravec 2017-05-16 12:27:27 UTC
Kim,
can you please review my patch (against upstream)?

I did some testing (such as deleting a (non-)empty queue with one or more files), but nothing more serious.

Reasons behind the code changes:
- LinearFileController::purgeEmptyFilesToEfp (when called from jcntl::delete_jrnl_files) needs to return _really_ all files, including the latest one
- but it must do so at the right time, since the sequence has to be:
  - first close currentJournalFilePtr_ (otherwise a segfault in the atomic counters for AIO)
  - purge the files
  - finalize the LinearFileController
  - remove the directory (in jcntl::delete_jrnl_files)

The above explains both the split of LinearFileController::finalize and the new argument to LinearFileController::purgeEmptyFilesToEfp.

Patch itself:

diff -rup journal-orig/jcntl.cpp journal/jcntl.cpp
--- journal-orig/jcntl.cpp	2017-05-13 20:37:09.193101767 +0200
+++ journal/jcntl.cpp	2017-05-16 14:05:11.780167885 +0200
@@ -145,7 +145,8 @@ void
 jcntl::delete_jrnl_files()
 {
     stop(true); // wait for AIO to complete
-    _linearFileController.purgeEmptyFilesToEfp();
+    _linearFileController.closeCurrentJournal();
+    _linearFileController.purgeEmptyFilesToEfp(true);
     _jdir.delete_dir();
 }
 
diff -rup journal-orig/LinearFileController.cpp journal/LinearFileController.cpp
--- journal-orig/LinearFileController.cpp	2017-05-13 20:37:09.192101773 +0200
+++ journal/LinearFileController.cpp	2017-05-16 14:17:11.589132413 +0200
@@ -49,16 +49,20 @@ void LinearFileController::initialize(co
 }
 
 void LinearFileController::finalize() {
-    if (currentJournalFilePtr_) {
-        currentJournalFilePtr_->close();
-        currentJournalFilePtr_ = 0;
-    }
+    closeCurrentJournal();
     while (!journalFileList_.empty()) {
         delete journalFileList_.front();
         journalFileList_.pop_front();
     }
 }
 
+void LinearFileController::closeCurrentJournal() {
+    if (currentJournalFilePtr_) {
+        currentJournalFilePtr_->close();
+        currentJournalFilePtr_ = 0;
+    }
+}
+
 void LinearFileController::addJournalFile(JournalFile* journalFilePtr,
                                           const uint32_t completedDblkCount,
                                           const bool makeCurrentFlag) {
@@ -105,9 +109,10 @@ void LinearFileController::restoreEmptyF
     addJournalFile(fileName, emptyFilePoolPtr_->getIdentity(), getNextFileSeqNum(), 0);
 }
 
-void LinearFileController::purgeEmptyFilesToEfp() {
+void LinearFileController::purgeEmptyFilesToEfp(bool force_all) {
     slock l(journalFileListMutex_);
-    while (journalFileList_.front()->isNoEnqueuedRecordsRemaining() && journalFileList_.size() > 1) { // Can't purge last file, even if it has no enqueued records
+    while ((force_all && (journalFileList_.size() > 0)) || // when deleting the queue, remove really all journal files, otherwise ..
+          ((journalFileList_.size() > 1) && (journalFileList_.front()->isNoEnqueuedRecordsRemaining()))) { // .. dont purge last file, even if it has no enqueued records
         emptyFilePoolPtr_->returnEmptyFileSymlink(journalFileList_.front()->getFqFileName());
         delete journalFileList_.front();
         journalFileList_.pop_front();
diff -rup journal-orig/LinearFileController.h journal/LinearFileController.h
--- journal-orig/LinearFileController.h	2017-05-13 20:37:09.192101773 +0200
+++ journal/LinearFileController.h	2017-05-16 14:04:52.817274436 +0200
@@ -60,6 +60,7 @@ public:
                     EmptyFilePool* emptyFilePoolPtr,
                     uint64_t initialFileNumberVal);
     void finalize();
+    void closeCurrentJournal();
 
     void addJournalFile(JournalFile* journalFilePtr,
                         const uint32_t completedDblkCount,
@@ -71,7 +72,7 @@ public:
     uint64_t getNextRecordId();
     void removeFileToEfp(const std::string& fileName);
     void restoreEmptyFile(const std::string& fileName);
-    void purgeEmptyFilesToEfp();
+    void purgeEmptyFilesToEfp(bool force_all=false);
 
     // Functions for manipulating counts of non-current JournalFile instances in journalFileList_
     uint32_t getEnqueuedRecordCount(const uint64_t fileSeqNumber);

Comment 9 Kim van der Riet 2017-05-17 17:19:25 UTC
Pavel,
The patch looks good to me.

Coincidentally, I submitted a similar patch (though not quite as elegant) on BZ1067429.

Comment 10 Pavel Moravec 2017-05-18 11:29:04 UTC
Thanks for feedback.

I ran some stress tests (concurrent creating + enqueueing + dequeueing + deleting of queues) and:
- no stability issue noticed
- all files returned properly
- no jrnl file leak elsewhere

so I committed my patch upstream.

Comment 12 Sanket Jagtap 2017-07-24 12:45:17 UTC
Build: Satellite 6.2.11 snap1 


On Satellite
for i in {1..1000};do ls /var/lib/qpidd/.qpidd/qls/p001/efp/2048k/in_use | wc; sleep 5;done
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     11      11     462
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504
     12      12     504


On the client system, approximately 30 iterations of:
for i in {1..100}; do bash apple.sh echo $i;done
This system is already registered. Use --force to override
System has been unregistered.
The system has been registered with ID: 5a658ee2--2fa893e060fd

Installed Product Current Status:
Product Name: Red Hat Enterprise Linux Server
Status:       Subscribed
Redirecting to /bin/systemctl restart  goferd.service

The contents of the directory did not increase and remained stable.

Comment 14 errata-xmlrpc 2017-08-10 17:02:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2466