Description of problem: For a long time, classad logs did not enforce successful parsing of RHS expressions in 'set attribute' operation (103) when loading in a classad log file. This allowed some undefined set of instances where RHS expressions were created with improper classad syntax to creep in over time. When I introduced strict parsing for Bug 784051, I kicked over some of these instances. Although it passed upstream regression testing, we have discovered a few more cases from external components. As a transition measure, we are going to add a new flag, which can be used to turn on strict parsing. For our purposes we will default this flag to FALSE, to maintain backward compatible behavior with legacy job queue logs. How reproducible: 100% Steps to Reproduce: 1. introduce an invalid RHS expression to a 'set attr' operation (op code 103), for example: 103 1340101063 TriggerName Changed Test Trigger Actual results: Exception will be thrown (or a log message showing read error on last classad transaction will occur, if last transaction was not closed) Expected results: improper RHS will be accepted in legacy fashion.
Downstream we are defaulting CLASSAD_LOG_STRICT_PARSING to false, to prevent any customer legacy classad log files from failing. TESTING: to test the fix, copy this content to job_queue.log 107 1 CreationTimestamp 1334245749 101 0.0 Job Machine 103 0.0 NextClusterNum 1 105 103 1.0 JobStatus 2 ZOMG! 103 1.0 EnteredCurrentStatus 1334245771 103 1.0 LastSuspensionTime 0 103 1.0 CurrentHosts 1 106 105 103 1.1 LastJobStatus 1 103 1.1 JobStatus 2 If you let the value of CLASSAD_LOG_STRICT_PARSING default, you should see the scheduler allow the incorrect line to pass with a warning, in the SchedLog: 06/22/12 13:24:43 (pid:32102) WARNING: strict classad parsing failed for expression: "2 ZOMG!" 06/22/12 13:24:43 (pid:32102) Detected unterminated transaction in ClassAd Log/home/eje/condor/local/spool/job_queue.log. Forcing rotation. 06/22/12 13:24:43 (pid:32102) About to rotate ClassAd log /home/eje/condor/local/spool/job_queue.log 06/22/12 13:24:43 (pid:32102) About to save historical log /home/eje/condor/local/spool/job_queue.log.1 To test functionality when flag is enabled, set: CLASSAD_LOG_STRICT_PARSING = True If you attempt to repeat the above test, it should now fail with an exception: 06/22/12 12:59:46 (pid:31585) WARNING: Encountered corrupt log record 5 (byte offset 89) 06/22/12 12:59:46 (pid:31585) Lines following corrupt log record 5 (up to 3): 06/22/12 12:59:46 (pid:31585) 103 1.0 EnteredCurrentStatus 1334245771 06/22/12 12:59:46 (pid:31585) 103 1.0 LastSuspensionTime 0 06/22/12 12:59:46 (pid:31585) 103 1.0 CurrentHosts 1 06/22/12 12:59:46 (pid:31585) ERROR "Error: corrupt log record 5 (byte offset 89) occurred inside closed transaction, recovery failed" at line 1141 in file /home/eje/git/grid/src/condor_utils/classad_log.cpp
pushed fix to UPSTREAM-7.9.0-BZ834669-strict-parsing-flag
(In reply to comment #2) > > 06/22/12 13:24:43 (pid:32102) WARNING: strict classad parsing failed for > expression: "2 ZOMG!" > 06/22/12 13:24:43 (pid:32102) Detected unterminated transaction in ClassAd > Log/home/eje/condor/local/spool/job_queue.log. Forcing rotation. > 06/22/12 13:24:43 (pid:32102) About to rotate ClassAd log > /home/eje/condor/local/spool/job_queue.log > 06/22/12 13:24:43 (pid:32102) About to save historical log > /home/eje/condor/local/spool/job_queue.log.1 Is it the rotation a new behavior? If yes, could it be a problem, if it happens everytime a bad classad is found?
(In reply to comment #4) > Is it the rotation a new behavior? If yes, could it be a problem, if it > happens everytime a bad classad is found? The rotation is a standard response to detecting an unterminated transaction in the job queue log (which is the case in the test job_queue.log). When it happens, it only happens once, after it finishes reading the log.
Tested on RHEL 5.8 and 6.3 - i386, x86_64 according to comment 2 (following output is from RHEL 6.3 i386, on other platforms is similar): # rpm -qa | grep condor condor-7.6.5-0.16.el6.i686 condor-classads-7.6.5-0.16.el6.i686 # cat /var/lib/condor/spool/job_queue.log 107 1 CreationTimestamp 1334245749 101 0.0 Job Machine 103 0.0 NextClusterNum 1 105 103 1.0 JobStatus 2 ZOMG! 103 1.0 EnteredCurrentStatus 1334245771 103 1.0 LastSuspensionTime 0 103 1.0 CurrentHosts 1 106 105 103 1.1 LastJobStatus 1 103 1.1 JobStatus 2 # condor_config_val SCHEDD_DEBUG D_FULLDEBUG # condor_config_val CLASSAD_LOG_STRICT_PARSING false # service condor start Starting Condor daemons: [ OK ] # tail -F /var/log/condor/SchedLog 06/28/12 12:36:30 WARNING: strict classad parsing failed for expression: "2 ZOMG!" 06/28/12 12:36:30 Detected unterminated transaction in ClassAd Log/var/lib/condor/spool/job_queue.log. Forcing rotation. 06/28/12 12:36:30 About to rotate ClassAd log /var/lib/condor/spool/job_queue.log 06/28/12 12:36:30 About to save historical log /var/lib/condor/spool/job_queue.log.1 # condor_config_val CLASSAD_LOG_STRICT_PARSING True # tail -F /var/log/condor/SchedLog 06/28/12 12:41:10 WARNING: Encountered corrupt log record 5 (byte offset 89) 06/28/12 12:41:10 Lines following corrupt log record 5 (up to 3): 06/28/12 12:41:10 103 1.0 EnteredCurrentStatus 1334245771 06/28/12 12:41:10 103 1.0 LastSuspensionTime 0 06/28/12 12:41:10 103 1.0 CurrentHosts 1 06/28/12 12:41:10 ERROR "Error: corrupt log record 5 (byte offset 89) occurred inside closed transaction, recovery failed" at line 1108 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp Also retested all other scenarios from bug 784051 with expected behaviour. >>> VERIFIED