Bug 477008 - Gridmanager (with amazon_gahp) cause Schedd EXCEPT
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.0
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: 1.1
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: Jeff Needle
Reported: 2008-12-18 11:23 EST by Matthew Farrellee
Modified: 2009-02-04 11:05 EST (History)
Doc Type: Bug Fix
Last Closed: 2009-02-04 11:05:57 EST
Description Matthew Farrellee 2008-12-18 11:23:36 EST
It appears the gridmanager tries to write a HoldReason with a newline in it, causing the Schedd to EXCEPT.

SchedLog:
12/18 00:25:13 (pid:20899) Refusing attempt to add 'HoldReason' = '"SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"' to record '34.0' as it contains a newline, which is not allowed.
12/18 00:25:13 (pid:20899) ERROR "write failed here!!!" at line 253 in file log_transaction.cpp
Stack dump for process 20899 at timestamp 1229577913 (20 frames)
condor_schedd(dprintf_dump_stack+0xc0)[0x560adf]
condor_schedd[0x560db2]
/lib64/libc.so.6[0x33cf8301b0]
/lib64/libc.so.6(gsignal+0x35)[0x33cf830155]
/lib64/libc.so.6(abort+0x110)[0x33cf831bf0]
condor_schedd(_EXCEPT_+0x1a5)[0x55f2cb]
condor_schedd[0x5bda71]
condor_schedd(_ZN11Transaction6CommitEP8_IO_FILEPvb+0xbc)[0x5bddf4]
condor_schedd(_ZN10ClassAdLog17CommitTransactionEv+0x91)[0x5baf87]
condor_schedd(_ZN17ClassAdCollection17CommitTransactionEv+0x15)[0x4eddcd]
condor_schedd(_Z15CloseConnectionv+0x27)[0x4ec40f]
condor_schedd(_Z12do_Q_requestP8ReliSockRb+0x11cf)[0x4ef775]
condor_schedd(_Z8handle_qP7ServiceiP6Stream+0xef)[0x4eb45d]
condor_schedd(_ZN10DaemonCore9HandleReqEP6Stream+0x3959)[0x54e061]
condor_schedd(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x94)[0x550324]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x28c)[0x54e9ae]
condor_schedd(_ZN10DaemonCore6DriverEv+0x172e)[0x550234]
condor_schedd(main+0x1898)[0x558cb4]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33cf81d8b4]
condor_schedd(__gxx_personality_v0+0x479)[0x4ad509]

GridmanagerLog:
12/18 00:25:13 [21698] GAHP[21711] -> 'S'
12/18 00:25:13 [21698] GAHP[21711] <- 'RESULTS'
12/18 00:25:13 [21698] GAHP[21711] -> 'R'
12/18 00:25:13 [21698] GAHP[21711] -> 'S' '1'
12/18 00:25:13 [21698] GAHP[21711] -> '60' '1' 'Client' 'SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed'
12/18 00:25:13 [21698] (34.0) doEvaluateState called: gmState GM_PROBE_JOB, condorState 1
12/18 00:25:13 [21698] (34.0) job probe failed: Client: SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
12/18 00:25:13 [21698] (34.0) gm state change: GM_PROBE_JOB -> GM_HOLD
12/18 00:25:13 [21698] FileLock object is updating timestamp on: /home/testmonkey/matt/ulog.28.7
12/18 00:25:13 [21698] (34.0) Writing hold record to user logfile
12/18 00:25:13 [21698] FileLock::obtain(1) - @1229577913.754998 lock on /home/testmonkey/matt/ulog.28.7 now WRITE
12/18 00:25:13 [21698] FileLock::obtain(2) - @1229577913.757659 lock on /home/testmonkey/matt/ulog.28.7 now UNLOCKED
12/18 00:25:13 [21698] (34.0) gm state change: GM_HOLD -> GM_DELETE
12/18 00:25:13 [21698] in doContactSchedd()
12/18 00:25:13 [21698] querying for removed/held jobs
12/18 00:25:13 [21698] Using constraint ((Owner=?="testmonkey"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
12/18 00:25:13 [21698] Fetched 0 job ads from schedd
12/18 00:25:13 [21698] Updating classad values for 34.0:
12/18 00:25:13 [21698]    JobStatus = 5
12/18 00:25:13 [21698]    EnteredCurrentStatus = 1229577913
12/18 00:25:13 [21698]    HoldReason = "SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
12/18 00:25:13 [21698]    HoldReasonCode = 0
12/18 00:25:13 [21698]    HoldReasonSubCode = 0
12/18 00:25:13 [21698]    ReleaseReason = UNDEFINED
12/18 00:25:13 [21698]    NumSystemHolds = 1
12/18 00:25:13 [21698]    Managed = "Schedd"
12/18 00:25:13 [21698] condor_read(): Socket closed when trying to read 21 bytes from <10.16.32.114:36592>
12/18 00:25:13 [21698] IO: EOF reading packet header
12/18 00:25:13 [21698] condor_write(): Socket closed when trying to write 29 bytes to <10.16.32.114:36592>, fd is 6
12/18 00:25:13 [21698] Buf::write(): condor_write() failed
12/18 00:25:13 [21698] Schedd connection error during updates at line 1014! Will retry
12/18 00:25:27 Reading condor configuration from '/etc/condor/condor_config'
12/18 00:25:27 ******************************************************
12/18 00:25:27 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP

AmazonGahpLog:
12/18 00:25:13 Call to DescribeInstance failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
"SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
Detail: SSL connect failed in tcp_connect()

12/18 00:25:13 Command(AMAZON_VM_STATUS) got error(code:Client, msg:SSL_ERROR_SSL
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
Comment 1 Matthew Farrellee 2008-12-18 11:29:00 EST
This appears to be 2 issues: 1) the EXCEPT should not be in log_transaction.cpp, and 2) the gridmanager should not be trying to set an attribute with an invalid value.
Comment 2 Will Benton 2008-12-18 11:42:40 EST
The EXCEPT is spurious; I have removed it and committed the change upstream (d217618).

For Jeff's amusement, here's a diff (no context):

-    EXCEPT("write failed here!!!");
Comment 3 Matthew Farrellee 2008-12-18 12:12:58 EST
There's a 3rd issue: 3) the Schedd should protect itself by rejecting invalid attribute values well before it tries to write them to the log, which would also allow for error propagation to the setter.
Comment 4 Matthew Farrellee 2008-12-18 13:16:47 EST
Fix for issue (2) will be present after 7.2.0-0.13

commit a88d6cfddd9df809e00733ce6f0afdf5dda512e4
Author: Matthew Farrellee <matt>
Date:   Thu Dec 18 12:12:24 2008 -0600

    Fixed gridmanager sending of invalid attribute values to schedd
    
    The amazon_gahp hit an openssl error that had a newline in it. The
    error was sent back to the gridmanager who used it as a
    HoldReason. The schedd accepted the HoldReason but the transaction log
    bailed out because newlines are not allowed in attribute values.
    
    This fix escapes \n's in error messages, which should also make for
    nicer logging.

This can be tested (as Jaime did) by modifying GahpClient::amazon_vm_create_keypair() in the gridmanager to append an extra "\nSecond line bad" to error_string on an error, and then submitting an EC2 job with a bad keypair filename. (Thanks, Jaime.)
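
For illustration only, the escaping described in the commit message could look roughly like the sketch below. This is not the code from commit a88d6cfd; escape_newlines is a hypothetical helper name, and the sketch assumes that only '\n' needs to be rewritten (as the two-character sequence backslash, 'n') so that the message fits on one line.

```cpp
#include <string>

// Hypothetical helper, NOT the actual gridmanager code: rewrite every
// newline in an error message as the two characters '\' and 'n', so the
// result can be stored in a single-line attribute value such as HoldReason.
std::string escape_newlines(const std::string &msg)
{
    std::string out;
    out.reserve(msg.size());
    for (char c : msg) {
        if (c == '\n') {
            out += "\\n";   // literal backslash followed by 'n'
        } else {
            out += c;
        }
    }
    return out;
}
```

Applied to the openssl error from the logs above, the two physical lines would become one line with a visible "\n" between "SSL_ERROR_SSL" and "error:14090086...".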
Comment 5 Matthew Farrellee 2008-12-18 14:21:37 EST
Fix for issue (3) will be present after 7.2.0-0.13

commit 147c31cd07f56435794ae71792f19f420e53d616
Author: Matthew Farrellee <matt>
Date:   Thu Dec 18 13:03:35 2008 -0600

    Added invalid attribute value protection (Schedd and client side)
    
    As discovered when the gridmanager wrote a HoldReason with a newline
    in it. The Schedd will happily accept attribute values that it cannot
    write to the log. A failure that results in an EXCEPT.
    
    This fix introduces AttrList::IsValidAttrValue and uses it to protect
    the Schedd when setting attributes, similar to the IsValidAttrName
    check. IsValidAttrValue is also used in SetAttribute, along with
    IsValidAttrName, to notify clients of the error they are making,
    before even contacting the Schedd.

 src/classad.old/attrlist.cpp          |   27 +++++++++++++++++++++++++--
 src/condor_includes/condor_attrlist.h |    1 +
 src/condor_schedd.V6/qmgmt.cpp        |   10 ++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)
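
As a rough sketch of the idea in this commit (not the actual AttrList::IsValidAttrValue or qmgmt.cpp code), a value check and a setter that rejects bad values up front might look like the following. The names is_valid_attr_value, JobAd and set_attribute are hypothetical, and the sketch assumes the only rule is "no newline in the value"; the real check may enforce more.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a check like AttrList::IsValidAttrValue.
// Assumption: a value is invalid only if it contains a newline.
bool is_valid_attr_value(const std::string &value)
{
    return value.find('\n') == std::string::npos;
}

// Toy job ad for the sketch: attribute name -> value.
using JobAd = std::map<std::string, std::string>;

// Hypothetical setter: returns 0 on success and -1 on rejection.
// A rejected value never reaches the ad, so nothing invalid is left
// to be written to the transaction log later, and the caller learns
// about the problem immediately.
int set_attribute(JobAd &ad, const std::string &name, const std::string &value)
{
    if (name.empty() || !is_valid_attr_value(value)) {
        return -1;
    }
    ad[name] = value;
    return 0;
}
```

With this shape, a HoldReason containing a raw newline is refused at the point of the set call, and the error reaches the setter instead of surfacing as a failed log write.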
Comment 8 errata-xmlrpc 2009-02-04 11:05:57 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html
