Bug 615313
| Summary: | condor_chirp fails when querying the value of a non-existing attribute | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Jon Thomas <jthomas> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 1.2 | CC: | fnadge, ltrilety, matt |
| Target Milestone: | 1.3 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, when querying the value of a non-existing attribute the "condor_chirp get_job_attr" command aborted, returning "abnormal program termination" on a Windows system and a core from SIGABRT on a non-Windows system. With this update, these errors no longer occur and 'condor_chirp' works as expected. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-10-14 16:13:49 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Ticket #522: condor_chirp fails when querying the value of a non-existing attribute
When querying the value of a non-existing attribute the condor_chirp get_job_attr command aborts, returning "abnormal program termination" on Windows and a core from SIGABRT on non-Windows.
Remarks:
2010-Jul-15 00:05:24 by matt:
Here's the skinny...
On the shadow side,
A get_job_attr for an attribute that does not exist hits pseudo_ops.cpp:pseudo_get_job_attr(name, expr) and returns -1: e = ad->Lookup(name); if(e) { ... } else { ...; return -1; }
The -1 gets returned to the "case CONDOR_get_job_attr" in NTreceivers.cpp, which happily handles it by encoding a response of "code(-1); code(0);" -- the -1 is the return value and 0 is the default errno.
On the starter side,
The receiver in io_proxy_handler.cpp eventually calls IOProxyHandler::convert to translate the errno (remember it was 0) into a CHIRP_ERROR code to send to condor_chirp. However, an errno of 0 is not a known code, resulting in CHIRP_ERROR_UNKNOWN and a dprintf of "Starter ioproxy server got unknown unix errno:0".
On the condor_chirp side,
Result of CHIRP_ERROR_UNKNOWN is received, which triggers an unceremonious fprintf to stderr of "chirp: couldn't get response from server: Success" followed swiftly by abort(). The "Success" is from strerror(errno) and is meaningless.
This behavior is definitely broken.
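The chain described above can be modelled in a few lines. This is only a sketch under stated assumptions: the enum, struct and function names are hypothetical stand-ins rather than the real Condor symbols, and the ENOENT mapping is assumed for illustration. It shows why a return value of -1 paired with errno 0 ends up as an unknown error on the client.

```cpp
#include <cerrno>

// Hypothetical, simplified stand-ins for the Chirp result codes named above.
enum class ChirpResult { Ok, DoesntExist, Unknown };

// What the shadow encodes on the wire: a return value followed by an errno.
struct ShadowReply {
    int rval;
    int unix_errno;
};

// Shadow side (pre-fix behavior): a missing attribute yields -1,
// and errno is left at its default of 0.
ShadowReply shadow_get_job_attr(bool attr_exists) {
    if (attr_exists) return {0, 0};
    return {-1, 0};
}

// Starter side: translate a unix errno into a Chirp code.
// ASSUMPTION: ENOENT maps to DoesntExist; 0 has no mapping at all.
ChirpResult starter_convert(int unix_errno) {
    switch (unix_errno) {
        case ENOENT: return ChirpResult::DoesntExist;
        default:     return ChirpResult::Unknown;  // errno 0 lands here
    }
}

// Starter side: only failed calls go through the errno translation.
ChirpResult starter_handle(const ShadowReply& reply) {
    if (reply.rval >= 0) return ChirpResult::Ok;
    return starter_convert(reply.unix_errno);
}

// Client side: an unknown result is treated as fatal (the abort() above).
bool client_would_abort(ChirpResult result) {
    return result == ChirpResult::Unknown;
}
```

In this model, client_would_abort(starter_handle(shadow_get_job_attr(false))) is true, which corresponds to the abort seen in condor_chirp.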
2010-Jul-15 00:28:05 by matt:
Options for resolving this broken behavior -
condor_chirp/PROTOCOL equates get_job_attr with getenv, which returns NULL if the env name isn't present
* Stop aborting, return non-zero - however, abort() is actually triggered by a problem in the protocol
* Make unix_errno=0 known to IOProxyHandler::convert - however, requires picking an error code for 0, maybe CHIRP_ERROR_DOESNT_EXIST, changing all chirp client implementations to handle the new code, results in breaking wire protocol between new starter and old chirp clients
* Change pseudo_get_job_attr to set errno, maybe to ENOENT - better than converting errno=0 to CHIRP_ERROR_DOESNT_EXIST in IOProxyHandler::convert, but has all the same drawbacks
* Change pseudo_get_job_attr to return UNDEFINED - requires no protocol changes and no client changes, aligns well with ClassAd semantics and getenv("DOESNT_EXIST") -> NULL (Lookup("DOESNT_EXIST") -> UNDEFINED)
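The last option can be sketched as follows. This is a minimal illustration, not the shadow's code: JobAd is a hypothetical std::map stand-in for the job ClassAd, and get_job_attr_undefined and env_is_set are names invented here. The point is that a missing attribute becomes a successful lookup whose value is the text UNDEFINED, much as getenv reports a missing name with NULL instead of failing.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical stand-in for the job ClassAd: attribute name -> expression text.
using JobAd = std::map<std::string, std::string>;

// A missing attribute is not an error: the call succeeds (returns 0) and
// the expression is the literal text "UNDEFINED".
int get_job_attr_undefined(const JobAd& ad, const std::string& name,
                           std::string& expr) {
    auto it = ad.find(name);
    if (it != ad.end()) {
        expr = it->second;
    } else {
        expr = "UNDEFINED";
    }
    return 0;
}

// The getenv analogue: a missing name is reported as "not set",
// with no error path involved.
bool env_is_set(const char* name) {
    return std::getenv(name) != nullptr;
}
```

Because the return value is always 0 here, no errno has to cross the wire, which is why this option needs neither protocol changes nor client changes.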
2010-Jul-15 00:29:29 by matt:
diff --git a/src/condor_shadow.V6.1/pseudo_ops.cpp b/src/condor_shadow.V6.1/pseudo_ops.cpp
index c71e1c2..c80230f 100644
--- a/src/condor_shadow.V6.1/pseudo_ops.cpp
+++ b/src/condor_shadow.V6.1/pseudo_ops.cpp
@@ -705,8 +705,9 @@ pseudo_get_job_attr( const char *name, MyString &expr )
dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) = %s\n",name,expr.Value());
return 0;
} else {
- dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) failed\n",name);
- return -1;
+ dprintf(D_SYSCALLS,"pseudo_get_job_attr(%s) is UNDEFINED\n",name);
+ expr = "UNDEFINED";
+ return 0;
}
}
Resolved upstream, will be built post 7.4.4-0.4
Tested with (version):
condor-7.4.4-0.8
Tested on:
RHEL5 x86_64 - passed
RHEL5 i386 - passed
RHEL4 x86_64 - passed
RHEL4 i386 - passed
>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html
Running /usr/libexec/condor/condor_chirp get_job_attr ResidentSetSize will coredump if ResidentSetSize isn't yet set.
Also see:
ERROR getting rss: invalid literal for int(): chirp: couldn't get response from server: Illegal seek
Thread 1 (process 24257):
#0 0x00000030b0030265 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000030b0031d10 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00000000004517cb in chirp_fatal_response () at chirp_client.c:561
No locals.
#3 0x0000000000451915 in convert_result (result=24257) at chirp_client.c:522
No locals.
#4 0x0000000000451acd in simple_command (c=0x942b10, fmt=0x4dc043 "get_job_attr %s\n") at chirp_client.c:678
result = <value optimized out>
command = "get_job_attr ResidentSetSize\n\000\000\000"...
args = {{gp_offset = 24, fp_offset = 2054513515, overflow_arg_area = 0x7fffffffe000, reg_save_area = 0x7fffffffdf20}}
#5 0x0000000000451e9d in chirp_client_get_job_attr (c=0x5ec1, name=0x6 <Address 0x6 out of bounds>, expr=0x7fffffffe038) at chirp_client.c:374
result = <value optimized out>
#6 0x00000000004510c1 in chirp_get_job_attr (argc=<value optimized out>, argv=0x7fffffffe128) at condor_chirp.cpp:306
client = (struct chirp_client *) 0x0
p = 0x0
#7 0x00000030b001d994 in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#8 0x0000000000450d29 in _start ()
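The "invalid literal for int()" message suggests a caller that parses the command output as an integer without checking it first. A defensive parse might look like the sketch below. It assumes a build containing the fix, where an unset attribute is reported as the text UNDEFINED; the helper name parse_int_attr is hypothetical and is not part of condor_chirp.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Parse the text printed for an integer attribute.
// Returns false, leaving `out` untouched, when the attribute is UNDEFINED
// or the text is not a plain decimal integer (for example an error message).
bool parse_int_attr(const std::string& text, long& out) {
    std::string s = text;
    // Drop trailing newline, carriage return and space characters.
    while (!s.empty() &&
           (s.back() == '\n' || s.back() == '\r' || s.back() == ' ')) {
        s.pop_back();
    }
    if (s.empty() || s == "UNDEFINED") return false;

    errno = 0;
    char* end = nullptr;
    long value = std::strtol(s.c_str(), &end, 10);
    // Reject overflow, empty matches and trailing non-digit characters.
    if (errno != 0 || end == s.c_str() || *end != '\0') return false;

    out = value;
    return true;
}
```

A caller polling ResidentSetSize could then treat a false return as "not set yet" and retry later rather than failing on the conversion.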