| Summary: | JobServer SEGV during GetJobAd | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> | ||||
| Component: | condor-qmf | Assignee: | Matthew Farrellee <matt> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | Development | CC: | freznice, jneedle, matt, pmackinn, trusnak | ||||
| Target Milestone: | 2.0 | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | condor-7.6.0-0.5 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-06-27 15:34:14 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
Created attachment 489094 [details]
log file
0) There may be an issue in how arguments are serialized from qpid-tool 0.10. 1) JobServer is SEGV in GetJobAd - $ rpm -qa | grep -e condor -e qpid -e qmf -e cumin | sort -u condor-7.6.0-0.4.el6.x86_64 condor-aviary-7.6.0-0.4.el6.x86_64 condor-debuginfo-7.6.0-0.2.el6.x86_64 condor-qmf-7.6.0-0.4.el6.x86_64 cumin-0.1.4675-1.el6.noarch python-qpid-0.10-1.el6.noarch python-qpid-qmf-0.10-4.el6.x86_64 qpid-cpp-client-0.10-2.el6.x86_64 qpid-cpp-client-devel-0.10-2.el6.x86_64 qpid-cpp-server-0.10-2.el6.x86_64 qpid-qmf-0.10-4.el6.x86_64 qpid-qmf-devel-0.10-4.el6.x86_64 qpid-tools-0.10-1.el6.noarch - Without condor_job_server... From qpid-tool: (call 123 GetJobAd '246.0' and "246.0" SchedLog - (with SCHEDD_DEBUG = D_FULLDEBUG) 03/31/11 10:04:22 Calling GetJobAd for ''246.0'' 03/31/11 10:04:22 Fetch: Failed to parse id: '246.0' ... 03/31/11 10:04:35 Calling GetJobAd for '"246.0"' 03/31/11 10:04:35 Fetch: Failed to parse id: "246.0" Navigating to the Submission via Cumin: (details are displayed) ScheddLog - (with QMF_PUBLISH_SUBMISSIONS = TRUE) 03/31/11 10:17:20 GetJobSummaries for submission: Submission == "eeyore.local#246" 03/31/11 10:17:20 Warning: failed to lookup attribute 'Args' from ad 03/31/11 10:17:20 Warning: Args attribute not found for job of Submission == "eeyore.local#246" 03/31/11 10:17:20 Warning: failed to lookup attribute 'ReleaseReason' from ad 03/31/11 10:17:20 Warning: ReleaseReason attribute not found for job of Submission == "eeyore.local#246" 03/31/11 10:17:28 Calling GetJobAd for '246.0' 03/31/11 10:17:28 PROC_ID is '246.0' There may be an issue in how arguments are serialized from qpid-tool 0.10. - With condor_job_server... JOB_SERVER = $(SBIN)/condor_job_server JOB_SERVER_ARGS = -f JOB_SERVER_DEBUG = D_FULLDEBUG JOB_SERVER_LOG = $(LOG)/JobServerLog SCHEDD.QMF_PUBLISH_SUBMISSIONS = FALSE DAEMON_LIST = $(DAEMON_LIST), JOB_SERVER 03/31/11 10:27:50 warning: failed to lookup attribute Args in job '246.0' 03/31/11 10:27:50 warning: failed to lookup attribute ReleaseReason in job '246.0' Stack dump for process 14900 at timestamp 1301581674 (18 frames) condor_job_server(dprintf_dump_stack+0x63)[0x534983] condor_job_server(linux_sig_coredump(int)+0x40)[0x495f00] /lib64/libpthread.so.0[0x376300f4a0] /lib64/libc.so.6[0x3762d2091f] condor_job_server(com::redhat::grid::JobServerObject::GetJobAd(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0xd7)[0x46a287] condor_job_server(com::redhat::grid::JobServerObject::ManagementMethod(unsigned int, qpid::management::Args&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x93)[0x46a383] condor_job_server(qmf::com::redhat::grid::JobServer::doMethod(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > > const&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x710)[0x473350] /usr/lib64/libqmf.so.4(qpid::management::ManagementAgentImpl::invokeMethodRequest(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x95f)[0x326761e73f] /usr/lib64/libqmf.so.4(qpid::management::ManagementAgentImpl::pollCallbacks(unsigned int)+0xbc)[0x326762029c] condor_job_server(HandleMgmtSocket(Service*, Stream*)+0x14)[0x4639d4] condor_job_server(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x285)[0x490de5] condor_job_server(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x1a)[0x4912da] condor_job_server(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x40)[0x51df70] condor_job_server(DaemonCore::CallSocketHandler(int&, bool)+0x135)[0x486bd5] condor_job_server(DaemonCore::Driver()+0x2013)[0x48b7c3] condor_job_server(main+0x10eb)[0x4978eb] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3762c1ec5d] condor_job_server[0x463509] Navigating to the job details in Cumin results in a SEGV when navigating to job list of eeyore.local#246. Note: When the condor_job_server SEGV, Cumin will report "Request timed out" Testing uncovered bug 692741 The JobServer SEGV fix is pushed to V7_6-branch as commit 8450a28be8531e76982b8b6c6063dd4bd8caa1e8 Author: Matthew Farrellee <matt@redhat> Date: Thu Mar 31 16:06:56 2011 -0400 Corrected JOB_AD_ERROR lookup logic, removed SEGV from string = NULL Please check to see if there is a regression in qpid-tool. You should be able to get details with, qpid: call 322 GetJobAd 2525.0 However, qpid-tool in 0.10 insists on, qpid: call 322 GetJobAd "2525.0" and passes "2525.0", not 2525.0, to the GetJobAd method, which fails as expected. I think it should work with and without quotes. We've tried another method of another object in broker:
qpid: call 246 reroute 5 True "amq.topic"
qpid: OK (0) - {}
qpid: call 246 reroute 5 True 'amq.topic'
qpid: OK (0) - {}
qpid: call 246 reroute 5 True amq.topic
qpid: OK (0) - {}
for condor-qmf I've got:
qpid: call 11414 GetJobSummaries
qpid: OK (0) - {u'Jobs': [{u'ProcId': 0, u'CurrentTime': 1301909239, u'QDate': 1301908659, u'Cmd': '/tmp/.condor_run.28044', u'ClusterId': 2531, u'JobStatus': 2, u'Arguments': '""', u'EnteredCurrentStatus': 1301908670, u'GlobalJobId': 'mrg-qe-05.lab.eng.brq.redhat.com#2531.0#1301908659'}]}
qpid: call 322 GetJobAd '2531.0'
qpid: Unknown Job Id (65536) - {}
qpid:
qpid: call 322 GetJobAd 2531.0
qpid: invalid conversion: Variant is not a string; use asString() if conversion is required. (qpid/types/Variant.cpp:569) (7) - {}
qpid: call 322 GetJobAd "2531.0"
qpid: Unknown Job Id (65536) - {}
qpid:
qpid: call 322 GetJobAd '2531.0'
qpid: Unknown Job Id (65536) - {}
All tests done with:
condor-7.6.0-0.4.el5
condor-qmf-7.6.0-0.4.el5
qpid-qmf-0.10-2.el5
qpid-cpp-server-0.10-1.el5
qpid-qmf-devel-0.10-2.el5
qpid-tools-0.10-1.el5
qpid-cpp-client-0.10-1.el5
ruby-qpid-qmf-0.10-2.el5
qpid-cpp-client-devel-0.10-1.el5
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-2.el5
(0) from comment 2 demonstrates a regression in qpid-tool The behavior from comment 7 is now tracked separately as bug 693847. Retested over all supported platforms x86, x86_64 with:
RHEL5:
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-8.el5
qpid-cpp-client-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-client-devel-docs-0.10-7.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-xml-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-8.el5
qpid-qmf-devel-0.10-8.el5
qpid-tools-0.10-5.el5
ruby-qpid-qmf-0.10-8.el5
RHEL6:
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-7.el6.x86_64
qpid-cpp-client-0.10-4.el6.x86_64
qpid-cpp-client-devel-0.10-4.el6.x86_64
qpid-cpp-client-devel-docs-0.10-4.el6.noarch
qpid-cpp-client-rdma-0.10-4.el6.x86_64
qpid-cpp-client-ssl-0.10-4.el6.x86_64
qpid-cpp-server-0.10-4.el6.x86_64
qpid-cpp-server-cluster-0.10-4.el6.x86_64
qpid-cpp-server-devel-0.10-4.el6.x86_64
qpid-cpp-server-rdma-0.10-4.el6.x86_64
qpid-cpp-server-ssl-0.10-4.el6.x86_64
qpid-cpp-server-store-0.10-4.el6.x86_64
qpid-cpp-server-xml-0.10-4.el6.x86_64
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-7.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-4.el6.x86_64
ruby-qpid-0.7.946106-2.el6.x86_64
ruby-qpid-qmf-0.10-7.el6.x86_64
sesame-debuginfo-0.10-1.el6.x86_64
=====
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-7.el6.i686
qpid-cpp-client-0.10-4.el6.i686
qpid-cpp-client-devel-0.10-4.el6.i686
qpid-cpp-client-devel-docs-0.10-4.el6.noarch
qpid-cpp-client-rdma-0.10-4.el6.i686
qpid-cpp-client-ssl-0.10-4.el6.i686
qpid-cpp-server-0.10-4.el6.i686
qpid-cpp-server-cluster-0.10-4.el6.i686
qpid-cpp-server-devel-0.10-4.el6.i686
qpid-cpp-server-rdma-0.10-4.el6.i686
qpid-cpp-server-ssl-0.10-4.el6.i686
qpid-cpp-server-store-0.10-4.el6.i686
qpid-cpp-server-xml-0.10-4.el6.i686
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-7.el6.i686
qpid-qmf-devel-0.10-7.el6.i686
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-4.el6.i686
ruby-qpid-0.7.946106-2.el6.i686
ruby-qpid-qmf-0.10-7.el6.i686
sesame-0.10-1.el6.i686
sesame-debuginfo-0.10-1.el6.i686
qpid: list
Summary of Objects by Type:
Package Class Active Deleted
==============================================
com.redhat.grid scheduler 1 0
com.redhat.grid slot 1 0
com.redhat.grid master 1 0
com.redhat.grid submitter 1 0
com.redhat.grid negotiator 1 0
com.redhat.grid submission 1 0
com.redhat.grid jobserver 1 0
com.redhat.grid collector 1 0
qpid: list jobserver
Object Summary:
ID Created Destroyed Index
==========================================================================
106 12:56:06 - jobs.lab.eng.brq.redhat.com
qpid: call 106 GetJobAd "246.0"
qpid: OK (0) - {...}
qpid: call 106 GetJobAd 246.0
qpid: invalid conversion: Variant is not a string; use asString() if conversion
is required. (qpid/types/Variant.cpp:569) (7) - {}
JobserverLog:
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='0.0'
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='246.0'
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='0246.-1'
05/24/11 21:09:10 Job::Job of '0246.-1'
05/24/11 21:09:10 LiveJobImpl created for '0246.-1'
05/24/11 21:09:10 warning: failed to lookup attribute JobStatus in job '0246.-1'
05/24/11 21:09:10 ProcessHistoryTimer() called
05/24/11 21:09:20 TimerHandler_JobLogPolling() called
05/24/11 21:09:20 === Current Probing Information ===
05/24/11 21:09:20 fsize: 5535 mtime: 1305222134
05/24/11 21:09:20 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:30 TimerHandler_JobLogPolling() called
05/24/11 21:09:30 === Current Probing Information ===
05/24/11 21:09:30 fsize: 5535 mtime: 1305222134
05/24/11 21:09:30 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:40 TimerHandler_JobLogPolling() called
05/24/11 21:09:40 === Current Probing Information ===
05/24/11 21:09:40 fsize: 5535 mtime: 1305222134
05/24/11 21:09:40 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:50 TimerHandler_JobLogPolling() called
05/24/11 21:09:50 === Current Probing Information ===
05/24/11 21:09:50 fsize: 5535 mtime: 1305222134
05/24/11 21:09:50 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:10:00 TimerHandler_JobLogPolling() called
05/24/11 21:10:00 === Current Probing Information ===
05/24/11 21:10:00 fsize: 5535 mtime: 1305222134
05/24/11 21:10:00 first log entry: 4 CreationTimestamp 1301426975
Original issue with calling jobserver methods solved. New bugzilla was created for an issue from comment 7.
>>> VERIFIED
|
Description of problem: $ qpid-tool qpid: list submission ... 3225 06:34:02 - hostname#2525 qpid: call 3225 GetJobSummaries qpid: OK (0) - {u'Jobs': [{u'ProcId': 0, u'Args': '6', u'Submission': 'mrg-qe-05.lab.eng.brq.redhat.com#2525', u'CurrentTime': 1301570306, u'QDate': 1301553224, u'Cmd': '/bin/sleep', u'ClusterId': 2525, u'JobStatus': 4, u'Owner': 'condor', u'EnteredCurrentStatus': 1301553236, u'GlobalJobId': 'hostname#2525.0#1301553224'}]} qpid: list ... com.redhat.grid jobserver 1 0 qpid: list jobserver 322 14:48:00 - jobs@hostname qpid: call 322 GetJobAd 'hostname#2525' qpid: Unknown Job Id (65536) - {} qpid: call 322 GetJobAd '2525' qpid: Unknown Job Id (65536) - {} qpid: call 322 GetJobAd '2525.0' qpid: Unknown Job Id (65536) - {} qpid: call 322 GetJobAd 'hostname#2525.0' qpid: Unknown Job Id (65536) - {} qpid: call 322 GetJobAd 'hostname#2525.0#1301553224' qpid: Unknown Job Id (65536) - {} I think this is a bug. Or am I doing anything wrong? Version-Release number of selected component (if applicable): condor-qmf-7.6.0-0.4.el5 qpid-qmf-0.10-2.el5 qpid-cpp-server-0.10-1.el5 How reproducible: 100% Steps to Reproduce: 1. install condor 2. install condor remote configuration 3. set up: condor_configure_pool -n `hostname` -l Node "hostname": Last Check-in Time: Thu Mar 31 06:51:42 2011 Group Memberships: Internal Default Group Features Applied: CentralManager Scheduler ConsoleExecuteNode ConsoleMaster ConsoleScheduler ConsoleCollector ConsoleNegotiator QMF JobServer ExecuteNode Master NodeAccess Explicitly Set Parameters: QMF_BROKER_HOST = 127.0.0.1 Also I've tried this configuration with reordered features: condor_configure_pool -n `hostname` -l Node "hostname": Last Check-in Time: Thu Mar 31 07:57:54 2011 Group Memberships: Internal Default Group Features Applied: QMF JobServer CentralManager Scheduler ConsoleExecuteNode ConsoleMaster ConsoleScheduler ConsoleCollector ConsoleNegotiator ExecuteNode Master NodeAccess Explicitly Set Parameters: QMF_BROKER_HOST = 127.0.0.1 $ condor_history 2525.0 2525.0 condor 3/31 02:33 0+00:00:07 C 3/31 02:33 /bin/sleep 6 Actual results: QMF function jobserver.GetJobAd doesn't get right data. Expected results: QMF function jobserver.GetJobAd will get right data. Additional info: