Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 754202 - History file and index management during condor_job_server runtime
History file and index management during condor_job_server runtime
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf (Show other bugs)
2.1
Unspecified Linux
unspecified Severity unspecified
: 2.1.1
: ---
Assigned To: Pete MacKinnon
Daniel Horák
:
Depends On:
Blocks: 765607
  Show dependency treegraph
 
Reported: 2011-11-15 12:12 EST by Pete MacKinnon
Modified: 2012-07-24 09:53 EDT (History)
4 users (show)

See Also:
Fixed In Version: condor-7.6.5-0.10
Doc Type: Bug Fix
Doc Text:
Cause: History file configuration settings set to very low values. Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server. Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur. Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-06 13:18:02 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:0100 normal SHIPPED_LIVE Moderate: MRG Grid security, bug fix, and enhancement update 2012-02-06 18:15:47 EST

  None (edit)
Description Pete MacKinnon 2011-11-15 12:12:09 EST
Using a test configuration like the following:

MAX_HISTORY_ROTATIONS = 2
MAX_HISTORY_LOG = 10240

it is possible to artificially create a high rate of inode reuse between the history files and their indices in the spool directory. When this occurs, QMF requests for full job details (aka their ClassAds) can find "corrupted" results:

e.g.: 
  qpid: call 429 GetJobAd "1.0" returned me classad with u'GlobalJobId':
'#4.0#1320851443'

  qpid: call 429 GetJobAd "1.1"
  qpid: OK (0) - {u'JobAd': {u'CompletionDate': 1320851794, u'!!descriptors':
{u'CurrentTime': 'com.redhat.grid.Expression'}, u'CurrentTime': 1320852825,
u'ipal': 'unauthenticated@unmapped/10.34.33.173', u'LastJobStatus': 2,
u'MyType': 'Job'}}

  qpid: call 429 GetJobAd "3.3"
  qpid: malformed ad for job '3.3' in /var/lib/condor/spool//history (65537) -
{}
(Classad for job "3.3" should be there because of he is in condor_history
output)


Approach will be to refactor history file management code and index garbage collection.

Also, impacts aviary query_server due to shared code heritage.
Comment 1 Matthew Farrellee 2011-11-21 17:41:05 EST
Immediate workaround is to restart the condor_job_server.
Comment 2 Daniel Horák 2011-11-23 02:18:46 EST
For testing progress see also bug 699737, comment 12.
Comment 4 Pete MacKinnon 2011-12-08 13:47:14 EST
Fixed upstream for BOTH condor_job_server (QMF) and aviary_query_server.

Previously defined test procedure used to:
a) confirm idx cleanup
b) primary history file rotation triggers a rescan of all job data in HISTORY across all valid history.<isodate> files
Comment 5 Pete MacKinnon 2011-12-12 13:12:45 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: History file configuration settings set to very low values.

Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server.

Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur.

Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.
Comment 7 Daniel Horák 2012-01-11 07:28:19 EST
Reproduced on RHEL 6.2 i386 with condor 7.6.5-0.8 with both condor_job_server (QMF) and aviary_query_server:
# rpm -qa | grep condor
  condor-classads-7.6.5-0.8.el6.i686
  condor-qmf-7.6.5-0.8.el6.i686
  condor-aviary-7.6.5-0.8.el6.i686
  condor-7.6.5-0.8.el6.i686

After few completed jobs:
# pwd
  /var/lib/condor/spool
# ls -AlFi *history*
  141474 -rw-r--r--. 1 condor condor 6517 Jan 11 11:33 history
  137450 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141445.idx
  141470 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141456.idx
  141484 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141471.idx
  141486 -rw-r--r--. 1 condor condor  299 Jan 11 11:33 history.141474.idx
  141487 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141484.idx
  141493 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141485.idx
  139570 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141497.idx
  141456 -rw-r--r--. 1 condor condor 9641 Jan 11 11:32 history.20120111T113250
  141471 -rw-r--r--. 1 condor condor 9815 Jan 11 11:32 history.20120111T113300


Verified on RHEL 6.2 i386 with condor-7.6.5-0.11.
In both cases (with condor_job_server and aviary_query_server) index files are correctly cleaned up related to history file rotation. Jobs in cumin correspond to output of command condor_history.

# rpm -qa | grep condor
  condor-7.6.5-0.11.el6.i686
  condor-qmf-7.6.5-0.11.el6.i686
  condor-classads-7.6.5-0.11.el6.i686
  condor-aviary-7.6.5-0.11.el6.i686

# ls -AlFi *history*
  33668 -rw-r--r-- 1 condor condor 6519 Jan 11 12:32 history
  33653 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123229
  33659 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123245
  33658 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33653.idx
  33665 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33659.idx
  33671 -rw-r--r-- 1 condor condor  299 Jan 11 12:33 history.33668.idx

 
Verified on all platforms (RHEL 5.7 and RHEL 6.2, i386 and x86_64) with similar result.

>>>VERIFIED
Comment 8 errata-xmlrpc 2012-02-06 13:18:02 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html

Note You need to log in before you can comment on or make changes to this bug.