Description of problem:
During testing of the HA schedd, I found this error in the logs, and the scheduler does not run because of it:

06/15/12 02:09:38 (pid:29918) ERROR "Error: corrupt log record 17631 (byte offset 562952) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp

Version-Release number of selected component (if applicable):
condor-7.6.5-0.15.el6.x86_64
condor-classads-7.6.5-0.15.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
condor-wallaby-client-4.1.2-1.el6.noarch
python-condorutils-1.5-4.el6.noarch
python-qpid-0.14-8.el6.noarch
python-qpid-qmf-0.14-7.el6_2.x86_64
python-wallabyclient-4.1.2-1.el6.noarch
qpid-cpp-client-0.14-16.el6.x86_64
qpid-qmf-0.14-7.el6_2.x86_64
ruby-qpid-qmf-0.14-7.el6_2.x86_64
ruby-wallaby-0.12.5-1.el6.noarch
wallaby-utils-0.12.5-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and set up a pool (3 nodes) with HA schedulers.
2. Periodically kill these schedulers on all nodes (see the sketch below).
3. Wait for the error in the logs.

Actual results:
The Condor daemon cannot recover from problems with its log files.

Expected results:
The Condor daemon recovers from problems with its log files.

Additional info:
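A minimal sketch of step 2 is below. The process name, the use of pkill -9, and the 60-second interval are assumptions, not part of the report; the idea is only that the schedd is killed hard so it never closes its transaction log cleanly, and something like this would be run on every node in the pool.

---------------------------------------
#!/usr/bin/env python
# Hedged sketch of step 2: repeatedly SIGKILL the schedd so it never gets a
# chance to close its transaction log cleanly.  Process name, pkill usage
# and interval are assumed values, not taken from the original report.
import subprocess
import time

INTERVAL = 60  # seconds between kills (assumed value)

while True:
    # -9 (SIGKILL) prevents any clean shutdown of job_queue.log
    subprocess.call(["pkill", "-9", "condor_schedd"])
    time.sleep(INTERVAL)  # let the master respawn the schedd before killing it again
---------------------------------------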
Issue independently reproduced during triggerd testing (the message is in a different log, but the same code path appears to be hit). Configure a machine with triggerd and run the trigger test:

condor_trigger_config -s localhost

Then restart condor. Triggerd won't start again; /var/log/condor/triggerd.log shows:

06/19/12 06:27:05 main_init() called
06/19/12 06:27:05 WARNING: Encountered corrupt log record 12 (byte offset 335)
06/19/12 06:27:05 Lines following corrupt log record 12 (up to 3):
06/19/12 06:27:05 106
06/19/12 06:27:05 ERROR "Error: corrupt log record 12 (byte offset 335) occurred inside closed transaction, recovery failed" at line 1104 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp

After removing the content of /var/lib/condor/spool/triggers.log, triggerd is able to start again (temporarily). This is the content generated by the steps above:

---------------------------------------
107 1 CreationTimestamp 1340100986
105
101 1340101063 EventTrigger Trigger
103 1340101063 TriggerText "$(Machine) has a slot 1"
103 1340101063 TriggerName "TestTrigger"
103 1340101063 TriggerQuery "(SlotID == 1)"
103 1340101063 TargetType "Trigger"
103 1340101063 CurrentTime time()
103 1340101063 MyType "EventTrigger"
106
105
103 1340101063 TriggerName Changed Test Trigger
106
105
103 1340101063 TriggerQuery (SlotID > 0)
106
105
103 1340101063 TriggerText $(Machine) has a slot $(SlotID)
106
105
102 1340101063
106
---------------------------------------

Reproduced on RHEL5.8/i386 and RHEL6.3/x86_64, condor 7.6.5-0.15. Raising the severity and priority of the bug.
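For reference, the numeric prefixes in the dump appear to be the classad_log record opcodes (101 new classad, 102 destroy classad, 103 set attribute, 104 delete attribute, 105 begin transaction, 106 end transaction, 107 historical sequence number). Below is a minimal, hedged sketch of scanning such a log for a record that cannot be parsed inside an already-committed transaction, the situation named in the error message. The opcode list is inferred from the dump above, and the structural checks are much looser than the real parser in classad_log.cpp, so this is an illustration of the failure mode rather than the actual recovery code.

---------------------------------------
#!/usr/bin/env python
# Hedged sketch: scan a classad_log-style transaction log and report the first
# record that looks malformed.  This is NOT the real classad_log.cpp parser;
# it only tracks whether a suspect record falls inside a transaction that is
# later committed (opcode 106), i.e. the "occurred inside closed transaction"
# case from the error message above.
import sys

BEGIN_TXN, END_TXN = "105", "106"
KEYED_OPS = ("101", "102", "103", "104", "107")  # opcodes that need a key field

def scan(path):
    in_txn = False
    suspects = []  # (record number, reason) seen since the last 105
    for recno, line in enumerate(open(path), 1):
        fields = line.split()
        if not fields:
            continue  # blank line, ignore
        op = fields[0]
        if op == BEGIN_TXN:
            in_txn, suspects = True, []
        elif op == END_TXN:
            if suspects:
                return suspects[0]  # bad record inside a committed transaction
            in_txn = False
        elif op in KEYED_OPS and len(fields) >= 2:
            pass  # structurally plausible record
        else:
            reason = "unparsable record starting with %r" % op
            if in_txn:
                suspects.append((recno, reason))
            else:
                return (recno, reason)
    return None

if __name__ == "__main__":
    print(scan(sys.argv[1]))
---------------------------------------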
Does this only happen when you kill the daemons at once?
(In reply to comment #3)
> Does this only happen when you kill the daemons at once?

At least in the triggerd case, I noticed the error after a "service condor restart". But then I tried killing only triggerd; when the master respawns it, it can't start again because of the error.
I've tried removing all temporary condor files, including locks, logs, address files and so on. It hasn't helped.
When I looked at Martin's system, I removed job_queue.log and the schedd was able to start back up.
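A hedged sketch of that workaround is below: it moves the suspect queue log aside (rather than deleting it) so the schedd rebuilds it on the next start while a copy is kept for attaching to this bug. The spool path and the rename-instead-of-delete choice are assumptions, not necessarily what was done on the system.

---------------------------------------
#!/usr/bin/env python
# Hedged workaround sketch: move the corrupt job_queue.log aside so the schedd
# rebuilds it on the next start, keeping a timestamped copy for later analysis.
# The spool path is the usual default and is assumed here.
import os
import time

SPOOL = "/var/lib/condor/spool"  # assumed default spool directory
log = os.path.join(SPOOL, "job_queue.log")

if os.path.exists(log):
    backup = "%s.corrupt.%d" % (log, int(time.time()))
    os.rename(log, backup)  # preserve the file instead of deleting it
    print("moved %s to %s; restart condor to let the schedd rebuild it" % (log, backup))
---------------------------------------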
Created attachment 593016 [details]
Corrupted job queue log