Bug 832331

Summary: corrupted log file
Product: Red Hat Enterprise MRG
Component: condor
Version: Development
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Martin Kudlej <mkudlej>
Assignee: Erik Erlandson <eerlands>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: dahorak, eerlands, iboverma, ltoscano, matt, rrati, sgraf, tstclair
Target Milestone: 2.2
Target Release: ---
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-06-22 21:54:46 UTC
Attachments: Corrupted job queue log

Description Martin Kudlej 2012-06-15 08:00:58 UTC
Description of problem:
During testing of HA Schedd, I've found this error in the logs, and the scheduler doesn't run because of it:

06/15/12 02:09:38 (pid:29918) ERROR "Error: corrupt log record 17631 (byte offset 562952) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp
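
For context, the failed check distinguishes where the damage sits: a record damaged after the last committed transaction can be dropped (the write was simply cut off), but a damaged record inside an already-closed transaction means committed state has been lost, so the daemon aborts. Below is a schematic sketch of that rule in Python; it illustrates the logic the error message implies, not condor's actual classad_log.cpp code (in this log format, record type 106 closes a transaction):

---------------------------------------
# Schematic recovery rule implied by the error above; an illustration,
# not HTCondor's implementation.  Opcode 106 closes (commits) a
# transaction in the ClassAd log format.
END_TXN = "106"

class CorruptLog(Exception):
    pass

def recover(lines):
    committed, pending = [], []
    for n, line in enumerate(lines, 1):
        fields = line.split()
        # Simplistic stand-in for condor's real per-record parse:
        ok = bool(fields) and fields[0].isdigit()
        pending.append((ok, n, line))
        if fields and fields[0] == END_TXN:
            # The transaction is closed: damage inside it is fatal.
            for rec_ok, rec_no, _ in pending:
                if not rec_ok:
                    raise CorruptLog("corrupt log record %d occurred inside"
                                     " closed transaction, recovery failed"
                                     % rec_no)
            committed.extend(pending)
            pending = []
    # Anything still pending was never committed (e.g. a write cut off
    # by a kill), so it is safe to discard it and continue.
    return [line for _, _, line in committed]
---------------------------------------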

Version-Release number of selected component (if applicable):
condor-7.6.5-0.15.el6.x86_64
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-wallaby-client-4.1.2-1.el6.noarch
python-condorutils-1.5-4.el6.noarch
python-qpid-0.14-8.el6.noarch
python-qpid-qmf-0.14-7.el6_2.x86_64
python-wallabyclient-4.1.2-1.el6.noarch
qpid-cpp-client-0.14-16.el6.x86_64
qpid-qmf-0.14-7.el6_2.x86_64
ruby-qpid-qmf-0.14-7.el6_2.x86_64
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install and set up a pool (3 nodes) with HA schedulers
2. periodically kill these schedulers on all nodes (a rough automation sketch follows below)
3. wait for the error in the logs
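
A rough automation sketch for step 2, in Python; the node names, the 60-second interval, and the use of ssh + pkill are assumptions to adjust to the actual pool:

---------------------------------------
# Sketch for step 2: SIGKILL the HA schedds on every node in a loop so
# they die without a chance to flush job_queue.log cleanly.
# NODES and INTERVAL are made up; adjust to the pool under test.
import subprocess
import time

NODES = ["node1", "node2", "node3"]   # hypothetical pool members
INTERVAL = 60                         # seconds between kill rounds

while True:
    for node in NODES:
        subprocess.call(["ssh", node, "pkill", "-9", "condor_schedd"])
    time.sleep(INTERVAL)
    # for step 3, grep SchedLog on each node for "corrupt log record"
---------------------------------------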
  
Actual results:
Condor daemon cannot recover from problems with log files.

Expected results:
Condor daemon can recover from problems with log files.

Additional info:

Comment 2 Luigi Toscano 2012-06-19 10:37:16 UTC
Issue independently reproduced during triggerd testing (the message is in a different log, but it seems the same code path is hit).

Configure a machine with triggerd, execute the trigger test:
condor_trigger_config -s localhost
Then restart condor.

Triggerd won't start again; /var/log/condor/triggerd.log shows:

06/19/12 06:27:05 main_init() called
06/19/12 06:27:05 WARNING: Encountered corrupt log record 12 (byte offset 335)
06/19/12 06:27:05 Lines following corrupt log record 12 (up to 3):
06/19/12 06:27:05     106 
06/19/12 06:27:05 ERROR "Error: corrupt log record 12 (byte offset 335) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp



After removing the content of /var/lib/condor/spool/triggers.log, triggerd is able to start again (temporarily). This is the content generated by the aforementioned steps:

---------------------------------------
107 1 CreationTimestamp 1340100986
105 
101 1340101063 EventTrigger Trigger
103 1340101063 TriggerText "$(Machine) has a slot 1"
103 1340101063 TriggerName "TestTrigger"
103 1340101063 TriggerQuery "(SlotID == 1)"
103 1340101063 TargetType "Trigger"
103 1340101063 CurrentTime time()
103 1340101063 MyType "EventTrigger"
106 
105 
103 1340101063 TriggerName Changed Test Trigger
106 
105 
103 1340101063 TriggerQuery (SlotID > 0)
106 
105 
103 1340101063 TriggerText $(Machine) has a slot $(SlotID)
106 
105 
102 1340101063
106
---------------------------------------
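
For reference, the opcodes above are the standard ClassAd log operations: 101 NewClassAd, 102 DestroyClassAd, 103 SetAttribute, 105 BeginTransaction, 106 EndTransaction, 107 LogHistoricalSequenceNumber. Note that record 12 of this dump is the line "103 1340101063 TriggerName Changed Test Trigger", whose multi-word value is unquoted (compare the quoted TriggerText in the first transaction); that is exactly the record the triggerd log above flags as corrupt. Below is a small decoder for eyeballing dumps like this one; the opcode table reflects the ClassAd log format, but the checker itself is a hypothetical helper, not condor's recovery code:

---------------------------------------
# Decode a ClassAd log dump and sanity-check 105/106 pairing.
# A hypothetical inspection helper, not condor's recovery code.
OPS = {
    "101": "NewClassAd",
    "102": "DestroyClassAd",
    "103": "SetAttribute",
    "104": "DeleteAttribute",
    "105": "BeginTransaction",
    "106": "EndTransaction",
    "107": "LogHistoricalSequenceNumber",
}

def dump(path):
    open_txn = False
    with open(path) as fh:
        for n, line in enumerate(fh, 1):
            fields = line.split(None, 1)
            op = fields[0] if fields else ""
            name = OPS.get(op, "?? unknown opcode %r" % op)
            if op == "105":
                open_txn = True
            elif op == "106":
                if not open_txn:
                    print("record %d: EndTransaction without Begin" % n)
                open_txn = False
            rest = fields[1].rstrip() if len(fields) > 1 else ""
            print("record %3d: %-27s %s" % (n, name, rest))
    if open_txn:
        print("unterminated transaction at end of log (cut-off write?)")

dump("/var/lib/condor/spool/triggers.log")
---------------------------------------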

Reproduced on RHEL5.8/i386 and RHEL6.3/x86_64, condor 7.6.5-0.15.

Raising the severity and priority of the bug.

Comment 3 Timothy St. Clair 2012-06-19 13:29:30 UTC
Does this only happen when you kill the daemons @ once?

Comment 4 Luigi Toscano 2012-06-19 13:49:05 UTC
(In reply to comment #3)
> Does this only happen when you kill the daemons @ once?

At least in the triggerd case, I noticed the error after a
service condor restart
But I then tried killing only triggerd, and when the master respawns it, it can't start again because of the error.

Comment 5 Martin Kudlej 2012-06-19 14:13:16 UTC
I've tried removing all temporary condor files, including locks, logs, address files and so on. It hasn't helped.

Comment 6 Robert Rati 2012-06-19 14:22:44 UTC
When I looked at Martin's system, I removed job_queue.log and the schedd was able to start back up.
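
A sketch of that workaround, moving the corrupt log aside instead of deleting it so it stays available for analysis; the spool path is the usual default and an assumption, and note that the queued jobs recorded in the old log are lost:

---------------------------------------
# Move the corrupt job queue log aside so the schedd starts fresh.
# /var/lib/condor/spool is the assumed default SPOOL directory.
import os
import time

log = "/var/lib/condor/spool/job_queue.log"
os.rename(log, "%s.corrupt.%d" % (log, int(time.time())))
# then: service condor restart  (jobs recorded in the old log are lost)
---------------------------------------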

Comment 7 Robert Rati 2012-06-19 17:10:45 UTC
Created attachment 593016 [details]
Corrupted job queue log