Bug 832331 - corrupted log file
Summary: corrupted log file
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: 2.2
Target Release: ---
Assignee: Erik Erlandson
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-06-15 08:00 UTC by Martin Kudlej
Modified: 2012-07-20 09:36 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-22 21:54:46 UTC
Target Upstream Version:
Embargoed:


Attachments
Corrupted job queue log (564.87 KB, application/octet-stream)
2012-06-19 17:10 UTC, Robert Rati


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 834659 0 medium CLOSED triggerd injects improper classad expressions into classad log file 2021-02-22 00:41:40 UTC

Internal Links: 834659

Description Martin Kudlej 2012-06-15 08:00:58 UTC
Description of problem:
During testing of the HA schedd, I found this error in the logs; the scheduler does not run because of it:

06/15/12 02:09:38 (pid:29918) ERROR "Error: corrupt log record 17631 (byte offset 562952) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp

Version-Release number of selected component (if applicable):
condor-7.6.5-0.15.el6.x86_64
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-wallaby-client-4.1.2-1.el6.noarch
python-condorutils-1.5-4.el6.noarch
python-qpid-0.14-8.el6.noarch
python-qpid-qmf-0.14-7.el6_2.x86_64
python-wallabyclient-4.1.2-1.el6.noarch
qpid-cpp-client-0.14-16.el6.x86_64
qpid-qmf-0.14-7.el6_2.x86_64
ruby-qpid-qmf-0.14-7.el6_2.x86_64
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install and set up a pool (3 nodes) with HA schedulers
2. periodically kill these schedulers on all nodes (a rough sketch follows this list)
3. wait for the error in the logs
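
For illustration, step 2 can be approximated with a loop like the one below; the node names, interval, and use of pkill are assumptions, not the exact harness used in testing:

---------------------------------------
# Hypothetical sketch of step 2: repeatedly kill the schedd on each
# node so the HA failover logic is exercised. Node names and the
# sleep interval are placeholders.
while true; do
    for node in node1 node2 node3; do
        ssh "$node" 'pkill -9 condor_schedd'
    done
    sleep 60   # give the HA failover machinery time to react
done
---------------------------------------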
  
Actual results:
Condor daemon cannot recover from problems with log files.

Expected results:
Condor daemon can recover from problems with log files.

Additional info:

Comment 2 Luigi Toscano 2012-06-19 10:37:16 UTC
Issue independently reproduced during triggerd testing (the message is in a different log, but it seems the same code path is hit).

Configure a machine with triggerd, execute the trigger test:
condor_trigger_config -s localhost
Then restart condor.

Triggerd won't start again; /var/log/condor/triggerd.log shows:

06/19/12 06:27:05 main_init() called
06/19/12 06:27:05 WARNING: Encountered corrupt log record 12 (byte offset 335)
06/19/12 06:27:05 Lines following corrupt log record 12 (up to 3):
06/19/12 06:27:05     106 
06/19/12 06:27:05 ERROR "Error: corrupt log record 12 (byte offset 335) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp



After removing the content of /var/lib/condor/spool/triggers.log, triggerd is able to start again (temporarily). This is the content generated by the aforementioned steps:

---------------------------------------
107 1 CreationTimestamp 1340100986
105 
101 1340101063 EventTrigger Trigger
103 1340101063 TriggerText "$(Machine) has a slot 1"
103 1340101063 TriggerName "TestTrigger"
103 1340101063 TriggerQuery "(SlotID == 1)"
103 1340101063 TargetType "Trigger"
103 1340101063 CurrentTime time()
103 1340101063 MyType "EventTrigger"
106 
105 
103 1340101063 TriggerName Changed Test Trigger
106 
105 
103 1340101063 TriggerQuery (SlotID > 0)
106 
105 
103 1340101063 TriggerText $(Machine) has a slot $(SlotID)
106 
105 
102 1340101063
106
---------------------------------------
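
For reference, each record above starts with a numeric opcode, and 105/106 appear to open and close a transaction (the error messages complain about a corrupt record "inside closed transaction"). Assuming that reading of the format, a quick check for an unterminated transaction would be:

---------------------------------------
# Rough sanity check, assuming the first field of each record is an
# opcode and that 105/106 bracket a transaction (as the dump above
# suggests): report any transaction left open at end of file.
awk '$1 == 105 { open++ }
     $1 == 106 { open-- }
     END { print (open ? open " transaction(s) left open" : "all transactions closed") }' \
    /var/lib/condor/spool/triggers.log
---------------------------------------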

Reproduced on RHEL5.8/i386 and RHEL6.3/x86_64, condor 7.6.5-0.15.

Raising the severity and priority of the bug.

Comment 3 Timothy St. Clair 2012-06-19 13:29:30 UTC
Does this only happen when you kill the daemons at once?

Comment 4 Luigi Toscano 2012-06-19 13:49:05 UTC
(In reply to comment #3)
> Does this only happen when you kill the daemons at once?

At least in the triggerd case, I noticed the error after a "service condor restart".
I then tried killing only triggerd; when the master respawns it, it cannot restart because of the error.

Comment 5 Martin Kudlej 2012-06-19 14:13:16 UTC
I tried removing all temporary condor files, including locks, logs, address files and so on. It did not help.

Comment 6 Robert Rati 2012-06-19 14:22:44 UTC
When I looked at Martin's system, I removed job_queue.log and the schedd was able to start back up.
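
In shell terms, the workaround amounts to something like this; the spool path is the default location and an assumption here, and note that moving job_queue.log aside also discards the persisted job queue state:

---------------------------------------
# Workaround sketch; the spool path is the RHEL default and may differ.
# Moving job_queue.log aside discards the persisted job queue state.
service condor stop
mv /var/lib/condor/spool/job_queue.log \
   /var/lib/condor/spool/job_queue.log.corrupt
service condor start
---------------------------------------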

Comment 7 Robert Rati 2012-06-19 17:10:45 UTC
Created attachment 593016 [details]
Corrupted job queue log

