Bug 834669

Summary: introduction of strict log queue log parsing rejects some legacy classad log exprs
Product: Red Hat Enterprise MRG Reporter: Erik Erlandson <eerlands>
Component: condorAssignee: Erik Erlandson <eerlands>
Status: CLOSED ERRATA QA Contact: Daniel Horák <dahorak>
Severity: high Docs Contact:
Priority: high    
Version: 2.2CC: dahorak, iboverma, ltoscano, matt, mkudlej, tstclair
Target Milestone: 2.2Keywords: Regression
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: condor-7.6.5-0.16 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-19 18:26:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Erik Erlandson 2012-06-22 18:05:11 UTC
Description of problem:
For a long time, classad logs did not enforce successful parsing of RHS expressions in 'set attribute' operation (103) when loading in a classad log file. This allowed some undefined set of instances where RHS expressions were created with improper classad syntax to creep in over time.

When I introduced strict parsing for Bug 784051, I kicked over some of these instances. Although it passed upstream regression testing, we have discovered a few more cases from external components.

As a transition measure, we are going to add a new flag, which can be used to turn on strict parsing.  For our purposes we will default this flag to FALSE, to maintain backward compatible behavior with legacy job queue logs.


How reproducible:
100%

Steps to Reproduce:
1. introduce an invalid RHS expression to a 'set attr' operation (op code 103), for example:

103 1340101063 TriggerName Changed Test Trigger


Actual results:
Exception will be thrown (or a log message showing read error on last classad transaction will occur, if last transaction was not closed)

Expected results:
improper RHS will be accepted in legacy fashion.

Comment 2 Erik Erlandson 2012-06-22 22:53:05 UTC
Downstream we are defaulting CLASSAD_LOG_STRICT_PARSING to false, to prevent any customer legacy classad log files from failing.

TESTING:

to test the fix, copy this content to job_queue.log

107 1 CreationTimestamp 1334245749
101 0.0 Job Machine
103 0.0 NextClusterNum 1
105 
103 1.0 JobStatus 2  ZOMG!
103 1.0 EnteredCurrentStatus 1334245771
103 1.0 LastSuspensionTime 0
103 1.0 CurrentHosts 1
106 
105 
103 1.1 LastJobStatus 1
103 1.1 JobStatus 2


If you let the value of CLASSAD_LOG_STRICT_PARSING default, you should see the scheduler allow the incorrect line to pass with a warning, in the SchedLog:

06/22/12 13:24:43 (pid:32102) WARNING: strict classad parsing failed for expression: "2  ZOMG!"
06/22/12 13:24:43 (pid:32102) Detected unterminated transaction in ClassAd Log/home/eje/condor/local/spool/job_queue.log. Forcing rotation.
06/22/12 13:24:43 (pid:32102) About to rotate ClassAd log /home/eje/condor/local/spool/job_queue.log
06/22/12 13:24:43 (pid:32102) About to save historical log /home/eje/condor/local/spool/job_queue.log.1


To test functionality when flag is enabled, set:
CLASSAD_LOG_STRICT_PARSING = True

If you attempt to repeat the above test, it should now fail with an exception:

06/22/12 12:59:46 (pid:31585) WARNING: Encountered corrupt log record 5 (byte offset 89)
06/22/12 12:59:46 (pid:31585) Lines following corrupt log record 5 (up to 3):
06/22/12 12:59:46 (pid:31585)     103 1.0 EnteredCurrentStatus 1334245771
06/22/12 12:59:46 (pid:31585)     103 1.0 LastSuspensionTime 0
06/22/12 12:59:46 (pid:31585)     103 1.0 CurrentHosts 1
06/22/12 12:59:46 (pid:31585) ERROR "Error: corrupt log record 5 (byte offset 89) occurred inside closed transaction, recovery failed" at line 1141 in file /home/eje/git/grid/src/condor_utils/classad_log.cpp

Comment 3 Erik Erlandson 2012-06-22 22:54:39 UTC
pushed fix to UPSTREAM-7.9.0-BZ834669-strict-parsing-flag

Comment 4 Luigi Toscano 2012-06-26 09:24:09 UTC
(In reply to comment #2)
> 
> 06/22/12 13:24:43 (pid:32102) WARNING: strict classad parsing failed for
> expression: "2  ZOMG!"
> 06/22/12 13:24:43 (pid:32102) Detected unterminated transaction in ClassAd
> Log/home/eje/condor/local/spool/job_queue.log. Forcing rotation.
> 06/22/12 13:24:43 (pid:32102) About to rotate ClassAd log
> /home/eje/condor/local/spool/job_queue.log
> 06/22/12 13:24:43 (pid:32102) About to save historical log
> /home/eje/condor/local/spool/job_queue.log.1

Is it the rotation a new behavior? If yes, could it be a problem, if it happens everytime a bad classad is found?

Comment 5 Erik Erlandson 2012-06-26 14:38:29 UTC
(In reply to comment #4)

> Is it the rotation a new behavior? If yes, could it be a problem, if it
> happens everytime a bad classad is found?

The rotation is a standard response to detecting an unterminated transaction in the job queue log (which is the case in the test job_queue.log).  When it happens, it only happens once, after it finishes reading the log.

Comment 6 Daniel Horák 2012-06-29 06:51:38 UTC
Tested on RHEL 5.8 and 6.3 - i386, x86_64 according to comment 2 (following output is from RHEL 6.3 i386, on other platforms is similar):

# rpm -qa | grep condor
condor-7.6.5-0.16.el6.i686
condor-classads-7.6.5-0.16.el6.i686

# cat /var/lib/condor/spool/job_queue.log 
107 1 CreationTimestamp 1334245749
101 0.0 Job Machine
103 0.0 NextClusterNum 1
105 
103 1.0 JobStatus 2  ZOMG!
103 1.0 EnteredCurrentStatus 1334245771
103 1.0 LastSuspensionTime 0
103 1.0 CurrentHosts 1
106 
105 
103 1.1 LastJobStatus 1
103 1.1 JobStatus 2

# condor_config_val SCHEDD_DEBUG
D_FULLDEBUG

# condor_config_val CLASSAD_LOG_STRICT_PARSING
false

# service condor start
Starting Condor daemons:                                   [  OK  ]

# tail -F /var/log/condor/SchedLog 
06/28/12 12:36:30 WARNING: strict classad parsing failed for expression: "2  ZOMG!"
06/28/12 12:36:30 Detected unterminated transaction in ClassAd Log/var/lib/condor/spool/job_queue.log. Forcing rotation.
06/28/12 12:36:30 About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/28/12 12:36:30 About to save historical log /var/lib/condor/spool/job_queue.log.1


# condor_config_val CLASSAD_LOG_STRICT_PARSING
True

# tail -F /var/log/condor/SchedLog 
06/28/12 12:41:10 WARNING: Encountered corrupt log record 5 (byte offset 89)
06/28/12 12:41:10 Lines following corrupt log record 5 (up to 3):
06/28/12 12:41:10     103 1.0 EnteredCurrentStatus 1334245771
06/28/12 12:41:10     103 1.0 LastSuspensionTime 0
06/28/12 12:41:10     103 1.0 CurrentHosts 1
06/28/12 12:41:10 ERROR "Error: corrupt log record 5 (byte offset 89) occurred inside closed transaction, recovery failed" at line 1108 in file /builddir/build/BUILD/condor-7.6.4/src/condor_utils/classad_log.cpp

Also retested all other scenarios from bug 784051 with expected behaviour.

>>> VERIFIED