Bug 754202 - History file and index management during condor_job_server runtime
Summary: History file and index management during condor_job_server runtime
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: 2.1
Hardware: Unspecified
OS: Linux
unspecified
unspecified
Target Milestone: 2.1.1
: ---
Assignee: Pete MacKinnon
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On:
Blocks: 765607
TreeView+ depends on / blocked
 
Reported: 2011-11-15 17:12 UTC by Pete MacKinnon
Modified: 2012-07-24 13:53 UTC (History)
4 users (show)

Fixed In Version: condor-7.6.5-0.10
Doc Type: Bug Fix
Doc Text:
Cause: History file configuration settings set to very low values. Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server. Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur. Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.
Clone Of:
Environment:
Last Closed: 2012-02-06 18:18:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 628611 0 medium CLOSED jobserver: maintenance of history files and their indices needs clarity 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 699737 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 842665 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Product Errata RHSA-2012:0100 0 normal SHIPPED_LIVE Moderate: MRG Grid security, bug fix, and enhancement update 2012-02-06 23:15:47 UTC

Internal Links: 628611 699737 842665

Description Pete MacKinnon 2011-11-15 17:12:09 UTC
Using a test configuration like the following:

MAX_HISTORY_ROTATIONS = 2
MAX_HISTORY_LOG = 10240

it is possible to artificially create a high rate of inode reuse between the history files and their indices in the spool directory. When this occurs, QMF requests for full job details (aka their ClassAds) can find "corrupted" results:

e.g.: 
  qpid: call 429 GetJobAd "1.0" returned me classad with u'GlobalJobId':
'#4.0#1320851443'

  qpid: call 429 GetJobAd "1.1"
  qpid: OK (0) - {u'JobAd': {u'CompletionDate': 1320851794, u'!!descriptors':
{u'CurrentTime': 'com.redhat.grid.Expression'}, u'CurrentTime': 1320852825,
u'ipal': 'unauthenticated@unmapped/10.34.33.173', u'LastJobStatus': 2,
u'MyType': 'Job'}}

  qpid: call 429 GetJobAd "3.3"
  qpid: malformed ad for job '3.3' in /var/lib/condor/spool//history (65537) -
{}
(Classad for job "3.3" should be there because of he is in condor_history
output)


Approach will be to refactor history file management code and index garbage collection.

Also, impacts aviary query_server due to shared code heritage.

Comment 1 Matthew Farrellee 2011-11-21 22:41:05 UTC
Immediate workaround is to restart the condor_job_server.

Comment 2 Daniel Horák 2011-11-23 07:18:46 UTC
For testing progress see also bug 699737, comment 12.

Comment 4 Pete MacKinnon 2011-12-08 18:47:14 UTC
Fixed upstream for BOTH condor_job_server (QMF) and aviary_query_server.

Previously defined test procedure used to:
a) confirm idx cleanup
b) primary history file rotation triggers a rescan of all job data in HISTORY across all valid history.<isodate> files

Comment 5 Pete MacKinnon 2011-12-12 18:12:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: History file configuration settings set to very low values.

Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server.

Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur.

Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.

Comment 7 Daniel Horák 2012-01-11 12:28:19 UTC
Reproduced on RHEL 6.2 i386 with condor 7.6.5-0.8 with both condor_job_server (QMF) and aviary_query_server:
# rpm -qa | grep condor
  condor-classads-7.6.5-0.8.el6.i686
  condor-qmf-7.6.5-0.8.el6.i686
  condor-aviary-7.6.5-0.8.el6.i686
  condor-7.6.5-0.8.el6.i686

After few completed jobs:
# pwd
  /var/lib/condor/spool
# ls -AlFi *history*
  141474 -rw-r--r--. 1 condor condor 6517 Jan 11 11:33 history
  137450 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141445.idx
  141470 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141456.idx
  141484 -rw-r--r--. 1 condor condor  450 Jan 11 11:33 history.141471.idx
  141486 -rw-r--r--. 1 condor condor  299 Jan 11 11:33 history.141474.idx
  141487 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141484.idx
  141493 -rw-r--r--. 1 condor condor  450 Jan 11 11:22 history.141485.idx
  139570 -rw-r--r--. 1 condor condor  450 Jan 11 11:32 history.141497.idx
  141456 -rw-r--r--. 1 condor condor 9641 Jan 11 11:32 history.20120111T113250
  141471 -rw-r--r--. 1 condor condor 9815 Jan 11 11:32 history.20120111T113300


Verified on RHEL 6.2 i386 with condor-7.6.5-0.11.
In both cases (with condor_job_server and aviary_query_server) index files are correctly cleaned up related to history file rotation. Jobs in cumin correspond to output of command condor_history.

# rpm -qa | grep condor
  condor-7.6.5-0.11.el6.i686
  condor-qmf-7.6.5-0.11.el6.i686
  condor-classads-7.6.5-0.11.el6.i686
  condor-aviary-7.6.5-0.11.el6.i686

# ls -AlFi *history*
  33668 -rw-r--r-- 1 condor condor 6519 Jan 11 12:32 history
  33653 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123229
  33659 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123245
  33658 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33653.idx
  33665 -rw-r--r-- 1 condor condor  450 Jan 11 12:32 history.33659.idx
  33671 -rw-r--r-- 1 condor condor  299 Jan 11 12:33 history.33668.idx

 
Verified on all platforms (RHEL 5.7 and RHEL 6.2, i386 and x86_64) with similar result.

>>>VERIFIED

Comment 8 errata-xmlrpc 2012-02-06 18:18:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html


Note You need to log in before you can comment on or make changes to this bug.