Bug 615313 - condor_chirp fails when querying the value of a non-existing attribute
Summary: condor_chirp fails when querying the value of a non-existing attribute
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-07-16 13:48 UTC by Jon Thomas
Modified: 2018-10-27 13:33 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when querying the value of a non-existing attribute, the "condor_chirp get_job_attr" command aborted, returning "abnormal program termination" on a Windows system and a core from SIGABRT on a non-Windows system. With this update, these errors no longer occur and "condor_chirp" works as expected.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:13:49 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHSA-2010:0773
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3
Last Updated: 2010-10-14 15:56:44 UTC

Description Jon Thomas 2010-07-16 13:48:50 UTC
Running

/usr/libexec/condor/condor_chirp get_job_attr ResidentSetSize

dumps core if ResidentSetSize isn't set yet.

I also see:

ERROR getting rss: invalid literal for int(): chirp: couldn't get response from server: Illegal seek

Thread 1 (process 24257):
#0  0x00000030b0030265 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000030b0031d10 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000000004517cb in chirp_fatal_response () at chirp_client.c:561
No locals.
#3  0x0000000000451915 in convert_result (result=24257) at chirp_client.c:522
No locals.
#4  0x0000000000451acd in simple_command (c=0x942b10,
   fmt=0x4dc043 "get_job_attr %s\n") at chirp_client.c:678
       result = <value optimized out>
       command = "get_job_attr ResidentSetSize\n" followed by uninitialized stack bytes (binary garbage omitted)
       args = {{gp_offset = 24, fp_offset = 2054513515,
   overflow_arg_area = 0x7fffffffe000, reg_save_area = 0x7fffffffdf20}}
#5  0x0000000000451e9d in chirp_client_get_job_attr (c=0x5ec1,
   name=0x6 <Address 0x6 out of bounds>, expr=0x7fffffffe038)
   at chirp_client.c:374
       result = <value optimized out>
#6  0x00000000004510c1 in chirp_get_job_attr (argc=<value optimized out>,
   argv=0x7fffffffe128) at condor_chirp.cpp:306
       client = (struct chirp_client *) 0x0
       p = 0x0
#7  0x00000030b001d994 in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#8  0x0000000000450d29 in _start ()

Comment 1 Jon Thomas 2010-07-16 13:49:49 UTC
upstream

 https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=522,0

Comment 2 Matthew Farrellee 2010-07-16 23:44:11 UTC
Ticket #522: condor_chirp fails when querying the value of a non-existing attribute

    When querying the value of a non-existing attribute the condor_chirp get_job_attr command aborts, returning "abnormal program termination" on Windows and a core from SIGABRT on non-Windows.

Remarks:

    2010-Jul-15 00:05:24 by matt:
    Here's the skinny...

    On the shadow side,

    A get_job_attr for an attribute that does not exist hits pseudo_ops.cpp:pseudo_get_job_attr(name, expr) and returns -1: e = ad->Lookup(name); if(e) { ... } else { ...; return -1; }

    The -1 gets returned to the "case CONDOR_get_job_attr" in NTreceivers.cpp, which happily handles it by encoding a response of "code(-1); code(0);" -- the -1 is the return value and 0 is the default errno.

    On the starter side,

    The receiver in io_proxy_handler.cpp eventually calls IOProxyHandler::convert to translate the errno (remember it was 0) into a CHIRP_ERROR code to send to condor_chirp. However, 0 is not a known errno value there, resulting in CHIRP_ERROR_UNKNOWN and a dprintf of "Starter ioproxy server got unknown unix errno:0"

    On the condor_chirp side,

    Result of CHIRP_ERROR_UNKNOWN is received, which triggers an unceremonious fprintf to stderr of "chirp: couldn't get response from server: Success" followed swiftly by abort(). The "Success" is from strerror(errno) and is meaningless.

    This behavior is definitely broken.
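
    The chain described above can be sketched roughly as follows. This is only an illustration: ChirpResult, ShadowReply and the three functions are hypothetical stand-ins, not the real Condor code, and the ENOENT mapping is an assumption made so that there is at least one "known" errno to contrast with 0.

```cpp
#include <cerrno>

// Hypothetical stand-ins; the real enum values and function bodies live in
// the Condor sources and are not reproduced here.
enum class ChirpResult { Ok, DoesntExist, Unknown };

struct ShadowReply {
    int rval;        // return value of the pseudo syscall
    int unix_errno;  // errno encoded alongside it
};

// Shadow side, before the fix: a missing attribute yields -1 with errno 0.
ShadowReply shadow_get_job_attr_missing() {
    return ShadowReply{-1, 0};
}

// Starter side: translate a unix errno into a chirp result code.
// Assumption: ENOENT is one of the known values; 0 is not.
ChirpResult convert_errno(int unix_errno) {
    switch (unix_errno) {
        case ENOENT: return ChirpResult::DoesntExist;
        default:     return ChirpResult::Unknown;  // errno 0 lands here
    }
}

// Client side: an unknown result is treated as a fatal protocol error.
bool client_would_abort(ChirpResult r) {
    return r == ChirpResult::Unknown;
}
```

    Feeding the errno from shadow_get_job_attr_missing() through convert_errno() ends in the "unknown" branch, which is the path that leads to abort() in the client.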

    2010-Jul-15 00:28:05 by matt:
    Options for resolving this broken behavior -

    condor_chirp/PROTOCOL equates get_job_attr with getenv, which returns NULL if the env name isn't present

        * Stop aborting, return non-zero - however, abort() is actually triggered by a problem in the protocol
        * Make unix_errno=0 known to IOProxyHandler::convert - however, requires picking an error code for 0, maybe CHIRP_ERROR_DOESNT_EXIST, changing all chirp client implementations to handle the new code, results in breaking wire protocol between new starter and old chirp clients
        * Change pseudo_get_job_attr to set errno, maybe to ENOENT - better than converting errno=0 to CHIRP_ERROR_DOESNT_EXIST in IOProxyHandler::convert, but has all the same drawbacks
        * Change pseudo_get_job_attr to return UNDEFINED - requires no protocol changes and no client changes, aligns well with ClassAd semantics and getenv("DOESNT_EXIST") -> NULL (Lookup("DOESNT_EXIST") -> UNDEFINED) 
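
    The last option can be illustrated with a minimal sketch, assuming a plain std::map as a stand-in for the job ad. JobAd and get_job_attr_sketch are hypothetical names for illustration only; the real code works on a ClassAd.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the job ClassAd: attribute name -> expression text.
using JobAd = std::map<std::string, std::string>;

// Always succeeds (returns 0). A missing attribute is reported as the literal
// expression "UNDEFINED" instead of as an error, much like getenv() returning
// NULL for a name that is not present.
int get_job_attr_sketch(const JobAd& ad, const std::string& name, std::string& expr) {
    JobAd::const_iterator it = ad.find(name);
    if (it != ad.end()) {
        expr = it->second;
    } else {
        expr = "UNDEFINED";
    }
    return 0;
}
```

    Because the call never fails, no error code has to cross the wire, which is why this option needs no protocol or client changes.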

    2010-Jul-15 00:29:29 by matt:

    diff --git a/src/condor_shadow.V6.1/pseudo_ops.cpp b/src/condor_shadow.V6.1/pseudo_ops.cpp
    index c71e1c2..c80230f 100644
    --- a/src/condor_shadow.V6.1/pseudo_ops.cpp
    +++ b/src/condor_shadow.V6.1/pseudo_ops.cpp
    @@ -705,8 +705,9 @@ pseudo_get_job_attr( const char *name, MyString &expr )
                    dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) = %s\n",name,expr.Value());
                    return 0;
            } else {
    -               dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) failed\n",name);
    -               return -1;
    +               dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) is UNDEFINED\n",name);
    +               expr = "UNDEFINED";
    +               return 0;
            }
     }
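
    With this change, a caller that expects a number (as in the "invalid literal for int()" error from the description) has to be prepared for the literal UNDEFINED. A minimal caller-side sketch, assuming the value has already been read from condor_chirp and stripped of trailing whitespace; parse_int_attr is a hypothetical helper, not part of Condor:

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Returns true and stores the value in 'out' if 'expr' is a plain decimal
// integer. Returns false for "UNDEFINED" (attribute not set yet) and for
// anything else that does not parse cleanly.
bool parse_int_attr(const std::string& expr, long& out) {
    if (expr == "UNDEFINED") {
        return false;  // attribute is not set; not an error, just no value
    }
    errno = 0;
    char* end = nullptr;
    long v = std::strtol(expr.c_str(), &end, 10);
    if (end == expr.c_str() || *end != '\0' || errno == ERANGE) {
        return false;  // empty, trailing junk, or out of range
    }
    out = v;
    return true;
}
```

    A monitoring script would then skip the sample when parse_int_attr() returns false and try again later.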

Comment 3 Matthew Farrellee 2010-07-19 21:01:01 UTC
Resolved upstream, will be built post 7.4.4-0.4

Comment 4 Lubos Trilety 2010-08-06 09:26:58 UTC
Tested with (version):
condor-7.4.4-0.8

Tested on:
RHEL5 x86_64  - passed
RHEL5 i386    - passed
RHEL4 x86_64  - passed
RHEL4 i386    - passed

>>> VERIFIED

Comment 6 Martin Prpič 2010-10-07 16:47:00 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when querying the value of a non-existing attribute, the "condor_chirp get_job_attr" command aborted, returning "abnormal program termination" on a Windows system and a core from SIGABRT on a non-Windows system. With this update, these errors no longer occur and "condor_chirp" works as expected.

Comment 8 errata-xmlrpc 2010-10-14 16:13:49 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

