Bug 832331 - corrupted log file
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Sub Component: Development
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Version: 2.2
Target Milestone: ---
Assigned To: Erik Erlandson
QA Contact: MRG Quality Engineering
Keywords: Regression
Depends On:
Blocks:
Reported: 2012-06-15 04:00 EDT by Martin Kudlej
Modified: 2012-07-20 05:36 EDT
CC: 8 users

Doc Type: Bug Fix
Last Closed: 2012-06-22 17:54:46 EDT
Type: Bug


Attachments
Corrupted job queue log (564.87 KB, application/octet-stream)
2012-06-19 13:10 EDT, Robert Rati
Description Martin Kudlej 2012-06-15 04:00:58 EDT
Description of problem:
During testing of HA Schedd, I found this error in the logs, and the scheduler does not run because of it:

06/15/12 02:09:38 (pid:29918) ERROR "Error: corrupt log record 17631 (byte offset 562952) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp
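The error text points at the recovery rule in classad_log.cpp: a partial transaction left at the tail of the log (e.g. after a crash mid-write) can simply be discarded on restart, but a corrupt record inside an already-committed (closed) transaction leaves no consistent state to recover to. A minimal sketch of that rule follows; the opcodes 105/106 (begin/end transaction) are inferred from the log dump later in this bug, and the `recover()` helper is an illustration, not the actual implementation:

```python
# Hedged sketch of the recovery rule suggested by the error message:
# records inside a committed (closed) transaction must all parse; a
# partial transaction at the tail can be dropped. Opcodes are assumptions:
# 105 = begin transaction, 106 = end transaction; None marks a corrupt record.
BEGIN_TXN, END_TXN = "105", "106"

def recover(records):
    """records: list of raw record strings, or None for a corrupt record.
    Returns the recoverable prefix, or raises if a closed transaction
    contains a corrupt record."""
    recovered, pending, in_txn = [], [], False
    for i, rec in enumerate(records):
        opcode = rec.split()[0] if rec is not None else None
        if opcode == BEGIN_TXN:
            pending, in_txn = [rec], True
        elif opcode == END_TXN:
            # commit point: every record in the transaction must be intact
            if any(r is None for r in pending):
                raise RuntimeError(
                    f"corrupt log record inside closed transaction (record {i})")
            recovered.extend(pending + [rec])
            pending, in_txn = [], False
        elif in_txn:
            pending.append(rec)  # corruption here is judged at commit time
        elif rec is None:
            break  # corruption outside any transaction: truncate here
        else:
            recovered.append(rec)
    # an open transaction at EOF is silently discarded (crash mid-write)
    return recovered
```

Under this model, a corrupt record after the last end-transaction is survivable (`recover(["105", "103 1 A 1", "106", None])` keeps the closed transaction), while the same corruption between a 105 and its 106 is fatal, which matches the "occurred inside closed transaction, recovery failed" message.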

Version-Release number of selected component (if applicable):
condor-7.6.5-0.15.el6.x86_64
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-wallaby-client-4.1.2-1.el6.noarch
python-condorutils-1.5-4.el6.noarch
python-qpid-0.14-8.el6.noarch
python-qpid-qmf-0.14-7.el6_2.x86_64
python-wallabyclient-4.1.2-1.el6.noarch
qpid-cpp-client-0.14-16.el6.x86_64
qpid-qmf-0.14-7.el6_2.x86_64
ruby-qpid-qmf-0.14-7.el6_2.x86_64
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install and set up a pool (3 nodes) with HA schedulers
2. periodically kill these schedulers on all nodes
3. wait for the error in the logs
  
Actual results:
Condor daemon cannot recover from problems with log files.

Expected results:
Condor daemon recovers from problems with log files.

Additional info:
Comment 2 Luigi Toscano 2012-06-19 06:37:16 EDT
Issue independently reproduced during triggerd testing (the message appears in a different log, but the same code path seems to be hit).

Configure a machine with triggerd, execute the trigger test:
condor_trigger_config -s localhost
Then restart condor.

Triggerd won't start again, /var/log/condor/triggerd.log shows:

06/19/12 06:27:05 main_init() called
06/19/12 06:27:05 WARNING: Encountered corrupt log record 12 (byte offset 335)
06/19/12 06:27:05 Lines following corrupt log record 12 (up to 3):
06/19/12 06:27:05     106 
06/19/12 06:27:05 ERROR "Error: corrupt log record 12 (byte offset 335) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp



After removing the content of /var/lib/condor/spool/triggers.log, triggerd is able to start again (temporarily). This is the content generated by the aforementioned steps:

---------------------------------------
107 1 CreationTimestamp 1340100986
105 
101 1340101063 EventTrigger Trigger
103 1340101063 TriggerText "$(Machine) has a slot 1"
103 1340101063 TriggerName "TestTrigger"
103 1340101063 TriggerQuery "(SlotID == 1)"
103 1340101063 TargetType "Trigger"
103 1340101063 CurrentTime time()
103 1340101063 MyType "EventTrigger"
106 
105 
103 1340101063 TriggerName Changed Test Trigger
106 
105 
103 1340101063 TriggerQuery (SlotID > 0)
106 
105 
103 1340101063 TriggerText $(Machine) has a slot $(SlotID)
106 
105 
102 1340101063
106
---------------------------------------
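For reference, the opcodes in the dump above appear to follow a simple record format; the meanings below are inferred from the dump itself, not taken from the source: 101 creates a ClassAd, 102 destroys it, 103 sets an attribute, 105/106 begin/end a transaction, and 107 records a sequence number. A small sketch that scans such a log and flags an unterminated or stray transaction:

```python
# Sketch: scan a ClassAd-style log (as dumped above) and report whether
# every transaction opened by a 105 record is closed by a matching 106.
# Opcode meanings are assumptions inferred from the dump, not from source.
def check_transactions(text):
    open_count, transactions = 0, 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        op = fields[0]
        if op == "105":            # begin transaction
            open_count += 1
        elif op == "106":          # end transaction
            if open_count == 0:
                return f"stray end-transaction at line {lineno}"
            open_count -= 1
            transactions += 1
        elif op not in {"101", "102", "103", "104", "107"}:
            return f"unknown opcode {op!r} at line {lineno}"
    if open_count:
        return f"{open_count} unterminated transaction(s)"
    return f"ok: {transactions} closed transaction(s)"
```

Run over the dump above, this reading yields five closed transactions (the 107 sequence record stands outside any transaction); a log truncated between a 105 and its 106 would report the unterminated transaction instead.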

Reproduced on RHEL5.8/i386 and RHEL6.3/x86_64, condor 7.6.5-0.15.

Raising the severity and priority of the bug.
Comment 3 Timothy St. Clair 2012-06-19 09:29:30 EDT
Does this only happen when you kill the daemons at once?
Comment 4 Luigi Toscano 2012-06-19 09:49:05 EDT
(In reply to comment #3)
> Does this only happen when you kill the daemons at once?

At least in the triggerd case, I noticed the error after a
service condor restart
But then I tried to kill only triggerd; when the master respawns it, it can't restart because of the error.
Comment 5 Martin Kudlej 2012-06-19 10:13:16 EDT
I've tried to remove all temporary condor files, including locks, logs, address files and so on. It didn't help.
Comment 6 Robert Rati 2012-06-19 10:22:44 EDT
When I looked at Martin's system, I removed job_queue.log and the schedd was able to start back up.
Comment 7 Robert Rati 2012-06-19 13:10:45 EDT
Created attachment 593016 [details]
Corrupted job queue log
