Bug 754202

Summary: History file and index management during condor_job_server runtime
Product: Red Hat Enterprise MRG Reporter: Pete MacKinnon <pmackinn>
Component: condor-qmfAssignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA QA Contact: Daniel Horák <dahorak>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2.1CC: dahorak, matt, mkudlej, tstclair
Target Milestone: 2.1.1   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.6.5-0.10 Doc Type: Bug Fix
Doc Text:
Cause: History file configuration settings set to very low values. Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server. Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur. Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-06 18:18:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 765607    

Description Pete MacKinnon 2011-11-15 17:12:09 UTC
Using a test configuration like the following:

MAX_HISTORY_ROTATIONS = 2
MAX_HISTORY_LOG = 10240

it is possible to artificially create a high rate of inode reuse between the history files and their indices in the spool directory. When this occurs, QMF requests for full job details (aka their ClassAds) can find "corrupted" results:

e.g.: 
  qpid: call 429 GetJobAd "1.0" returned me classad with u'GlobalJobId':
'#4.0#1320851443'

  qpid: call 429 GetJobAd "1.1"
  qpid: OK (0) - {u'JobAd': {u'CompletionDate': 1320851794, u'!!descriptors':
{u'CurrentTime': 'com.redhat.grid.Expression'}, u'CurrentTime': 1320852825,
u'ipal': 'unauthenticated@unmapped/10.34.33.173', u'LastJobStatus': 2,
u'MyType': 'Job'}}

  qpid: call 429 GetJobAd "3.3"
  qpid: malformed ad for job '3.3' in /var/lib/condor/spool//history (65537) -
{}
(Classad for job "3.3" should be there because of he is in condor_history
output)


Approach will be to refactor history file management code and index garbage collection.

Also, impacts aviary query_server due to shared code heritage.

Comment 1 Matthew Farrellee 2011-11-21 22:41:05 UTC
Immediate workaround is to restart the condor_job_server.

Comment 2 Daniel Horák 2011-11-23 07:18:46 UTC
For testing progress see also bug 699737, comment 12.

Comment 4 Pete MacKinnon 2011-12-08 18:47:14 UTC
Fixed upstream for BOTH condor_job_server (QMF) and aviary_query_server.

Previously defined test procedure used to:
a) confirm idx cleanup
b) primary history file rotation triggers a rescan of all job data in HISTORY across all valid history.<isodate> files

Comment 5 Pete MacKinnon 2011-12-12 18:12:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: History file configuration settings set to very low values.

Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server.

Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur.

Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.

Comment 7 Daniel Horák 2012-01-11 12:28:19 UTC
Reproduced on RHEL 6.2 i386 with condor 7.6.5-0.8 with both condor_job_server (QMF) and aviary_query_server:
# rpm -qa | grep condor
  condor-classads-7.6.5-0.8.el6.i686
  condor-qmf-7.6.5-0.8.el6.i686
  condor-aviary-7.6.5-0.8.el6.i686
  condor-7.6.5-0.8.el6.i686

After few completed jobs:
# pwd
  /var/lib/condor/spool
# ls -AlFi *history*
  141474 -rw-r--r--. 1 condor condor 6517 Jan 11 11:33 history
  137450 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141445.idx
  141470 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141456.idx
  141484 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141471.idx
  141486 -rw-r--r--. 1 condor condor  299 Jan 11 11:33 history.141474.idx
  141487 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141484.idx
  141493 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141485.idx
  139570 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141497.idx
  141456 -rw-r--r--. 1 condor condor 9641 Jan 11 11:32 history.20120111T113250
  141471 -rw-r--r--. 1 condor condor 9815 Jan 11 11:32 history.20120111T113300


Verified on RHEL 6.2 i386 with condor-7.6.5-0.11.
In both cases (with condor_job_server and aviary_query_server) index files are correctly cleaned up related to history file rotation. Jobs in cumin correspond to output of command condor_history.

# rpm -qa | grep condor
  condor-7.6.5-0.11.el6.i686
  condor-qmf-7.6.5-0.11.el6.i686
  condor-classads-7.6.5-0.11.el6.i686
  condor-aviary-7.6.5-0.11.el6.i686

# ls -AlFi *history*
  33668 -rw-r--r-- 1 condor condor 6519 Jan 11 12:32 history
  33653 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123229
  33659 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123245
  33658 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33653.idx
  33665 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33659.idx
  33671 -rw-r--r-- 1 condor condor  299 Jan 11 12:33 history.33668.idx

 
Verified on all platforms (RHEL 5.7 and RHEL 6.2, i386 and x86_64) with similar result.

>>>VERIFIED

Comment 8 errata-xmlrpc 2012-02-06 18:18:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html