Using a test configuration like the following: MAX_HISTORY_ROTATIONS = 2 MAX_HISTORY_LOG = 10240 it is possible to artificially create a high rate of inode reuse between the history files and their indices in the spool directory. When this occurs, QMF requests for full job details (aka their ClassAds) can find "corrupted" results: e.g.: qpid: call 429 GetJobAd "1.0" returned me classad with u'GlobalJobId': '#4.0#1320851443' qpid: call 429 GetJobAd "1.1" qpid: OK (0) - {u'JobAd': {u'CompletionDate': 1320851794, u'!!descriptors': {u'CurrentTime': 'com.redhat.grid.Expression'}, u'CurrentTime': 1320852825, u'ipal': 'unauthenticated@unmapped/10.34.33.173', u'LastJobStatus': 2, u'MyType': 'Job'}} qpid: call 429 GetJobAd "3.3" qpid: malformed ad for job '3.3' in /var/lib/condor/spool//history (65537) - {} (Classad for job "3.3" should be there because of he is in condor_history output) Approach will be to refactor history file management code and index garbage collection. Also, impacts aviary query_server due to shared code heritage.
Immediate workaround is to restart the condor_job_server.
For testing progress see also bug 699737, comment 12.
Fixed upstream for BOTH condor_job_server (QMF) and aviary_query_server. Previously defined test procedure used to: a) confirm idx cleanup b) primary history file rotation triggers a rescan of all job data in HISTORY across all valid history.<isodate> files
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: History file configuration settings set to very low values. Consequence: The index file for the current history file could eventually become orphaned but its inode was quickly reused by the file system and incorrectly maintained in an active list by the condor_job_server and aviary_query_server. Fix: The chief problem was with the rotation of the current history file and its inode being quickly reclaimed by the file system. Thus, the fix was to detect the rotation and rescan the directory so that dangling index files would be removed and correct ones regenerated from the current set of history backup files BEFORE inode reuse could occur. Result: Index files are correctly cleaned up after a history file rotation and point to valid archived history backups based on inode.
Reproduced on RHEL 6.2 i386 with condor 7.6.5-0.8 with both condor_job_server (QMF) and aviary_query_server: # rpm -qa | grep condor condor-classads-7.6.5-0.8.el6.i686 condor-qmf-7.6.5-0.8.el6.i686 condor-aviary-7.6.5-0.8.el6.i686 condor-7.6.5-0.8.el6.i686 After few completed jobs: # pwd /var/lib/condor/spool # ls -AlFi *history* 141474 -rw-r--r--. 1 condor condor 6517 Jan 11 11:33 history 137450 -rw-r--r--. 1 condor condor 450 Jan 11 11:32 history.141445.idx 141470 -rw-r--r--. 1 condor condor 450 Jan 11 11:33 history.141456.idx 141484 -rw-r--r--. 1 condor condor 450 Jan 11 11:33 history.141471.idx 141486 -rw-r--r--. 1 condor condor 299 Jan 11 11:33 history.141474.idx 141487 -rw-r--r--. 1 condor condor 450 Jan 11 11:22 history.141484.idx 141493 -rw-r--r--. 1 condor condor 450 Jan 11 11:22 history.141485.idx 139570 -rw-r--r--. 1 condor condor 450 Jan 11 11:32 history.141497.idx 141456 -rw-r--r--. 1 condor condor 9641 Jan 11 11:32 history.20120111T113250 141471 -rw-r--r--. 1 condor condor 9815 Jan 11 11:32 history.20120111T113300 Verified on RHEL 6.2 i386 with condor-7.6.5-0.11. In both cases (with condor_job_server and aviary_query_server) index files are correctly cleaned up related to history file rotation. Jobs in cumin correspond to output of command condor_history. # rpm -qa | grep condor condor-7.6.5-0.11.el6.i686 condor-qmf-7.6.5-0.11.el6.i686 condor-classads-7.6.5-0.11.el6.i686 condor-aviary-7.6.5-0.11.el6.i686 # ls -AlFi *history* 33668 -rw-r--r-- 1 condor condor 6519 Jan 11 12:32 history 33653 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123229 33659 -rw-r--r-- 1 condor condor 9818 Jan 11 12:32 history.20120111T123245 33658 -rw-r--r-- 1 condor condor 450 Jan 11 12:32 history.33653.idx 33665 -rw-r--r-- 1 condor condor 450 Jan 11 12:32 history.33659.idx 33671 -rw-r--r-- 1 condor condor 299 Jan 11 12:33 history.33668.idx Verified on all platforms (RHEL 5.7 and RHEL 6.2, i386 and x86_64) with similar result. >>>VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0100.html