Bug 692485 - JobServer SEGV during GetJobAd
Summary: JobServer SEGV during GetJobAd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: Development
Hardware: Unspecified
OS: Linux
medium
medium
Target Milestone: 2.0
: ---
Assignee: Matthew Farrellee
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-03-31 12:38 UTC by Martin Kudlej
Modified: 2011-06-27 15:34 UTC (History)
5 users (show)

Fixed In Version: condor-7.6.0-0.5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-27 15:34:14 UTC
Target Upstream Version:


Attachments (Terms of Use)
log file (2.63 KB, application/x-gzip)
2011-03-31 14:26 UTC, Martin Kudlej
no flags Details

Description Martin Kudlej 2011-03-31 12:38:56 UTC
Description of problem:
$ qpid-tool
qpid: list submission
...
3225  06:34:02  -          hostname#2525

qpid: call 3225 GetJobSummaries

qpid: OK (0) - {u'Jobs': [{u'ProcId': 0, u'Args': '6', u'Submission': 'mrg-qe-05.lab.eng.brq.redhat.com#2525', u'CurrentTime': 1301570306, u'QDate': 1301553224, u'Cmd': '/bin/sleep', u'ClusterId': 2525, u'JobStatus': 4, u'Owner': 'condor', u'EnteredCurrentStatus': 1301553236, u'GlobalJobId': 'hostname#2525.0#1301553224'}]}

qpid: list
...
com.redhat.grid         jobserver     1       0

qpid: list jobserver
322  14:48:00  -          jobs@hostname

qpid: call 322 GetJobAd 'hostname#2525'
qpid: Unknown Job Id (65536) - {}

qpid: call 322 GetJobAd '2525'
qpid: Unknown Job Id (65536) - {}

qpid: call 322 GetJobAd '2525.0'
qpid: Unknown Job Id (65536) - {}

qpid: call 322 GetJobAd 'hostname#2525.0'
qpid: Unknown Job Id (65536) - {}

qpid: call 322 GetJobAd 'hostname#2525.0#1301553224'
qpid: Unknown Job Id (65536) - {}

I think this is a bug. Or am I doing anything wrong?

Version-Release number of selected component (if applicable):
condor-qmf-7.6.0-0.4.el5
qpid-qmf-0.10-2.el5
qpid-cpp-server-0.10-1.el5

How reproducible:
100%

Steps to Reproduce:
1. install condor
2. install condor remote configuration
3. set up:
condor_configure_pool -n `hostname` -l
Node "hostname":
Last Check-in Time: Thu Mar 31 06:51:42 2011
Group Memberships:
  Internal Default Group
Features Applied:
  CentralManager
  Scheduler
  ConsoleExecuteNode
  ConsoleMaster
  ConsoleScheduler
  ConsoleCollector
  ConsoleNegotiator
  QMF
  JobServer
  ExecuteNode
  Master
  NodeAccess
Explicitly Set Parameters:
  QMF_BROKER_HOST = 127.0.0.1

Also I've tried this configuration with reordered features:
condor_configure_pool -n `hostname` -l
Node "hostname":
Last Check-in Time: Thu Mar 31 07:57:54 2011
Group Memberships:
  Internal Default Group
Features Applied:
  QMF
  JobServer
  CentralManager
  Scheduler
  ConsoleExecuteNode
  ConsoleMaster
  ConsoleScheduler
  ConsoleCollector
  ConsoleNegotiator
  ExecuteNode
  Master
  NodeAccess
Explicitly Set Parameters:
  QMF_BROKER_HOST = 127.0.0.1

$ condor_history 2525.0
  2525.0   condor          3/31 02:33   0+00:00:07 C   3/31 02:33 /bin/sleep 6

Actual results:
QMF function jobserver.GetJobAd doesn't get right data.

Expected results:
QMF function jobserver.GetJobAd will get right data.

Additional info:

Comment 1 Martin Kudlej 2011-03-31 14:26:34 UTC
Created attachment 489094 [details]
log file

Comment 2 Matthew Farrellee 2011-03-31 15:47:16 UTC
0) There may be an issue in how arguments are serialized from qpid-tool 0.10.
1) JobServer is SEGV in GetJobAd

-

$ rpm -qa | grep -e condor -e qpid -e qmf -e cumin | sort -u
condor-7.6.0-0.4.el6.x86_64
condor-aviary-7.6.0-0.4.el6.x86_64
condor-debuginfo-7.6.0-0.2.el6.x86_64
condor-qmf-7.6.0-0.4.el6.x86_64
cumin-0.1.4675-1.el6.noarch
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-4.el6.x86_64
qpid-cpp-client-0.10-2.el6.x86_64
qpid-cpp-client-devel-0.10-2.el6.x86_64
qpid-cpp-server-0.10-2.el6.x86_64
qpid-qmf-0.10-4.el6.x86_64
qpid-qmf-devel-0.10-4.el6.x86_64
qpid-tools-0.10-1.el6.noarch

-

Without condor_job_server...

From qpid-tool: (call 123 GetJobAd '246.0' and "246.0"

SchedLog - (with SCHEDD_DEBUG = D_FULLDEBUG)
03/31/11 10:04:22 Calling GetJobAd for ''246.0''
03/31/11 10:04:22 Fetch: Failed to parse id: '246.0'
...
03/31/11 10:04:35 Calling GetJobAd for '"246.0"'
03/31/11 10:04:35 Fetch: Failed to parse id: "246.0"

Navigating to the Submission via Cumin: (details are displayed)

ScheddLog - (with QMF_PUBLISH_SUBMISSIONS = TRUE)
03/31/11 10:17:20 GetJobSummaries for submission: Submission == "eeyore.local#246"
03/31/11 10:17:20 Warning: failed to lookup attribute 'Args' from ad
03/31/11 10:17:20 Warning: Args attribute not found for job of Submission == "eeyore.local#246"
03/31/11 10:17:20 Warning: failed to lookup attribute 'ReleaseReason' from ad
03/31/11 10:17:20 Warning: ReleaseReason attribute not found for job of Submission == "eeyore.local#246"
03/31/11 10:17:28 Calling GetJobAd for '246.0'
03/31/11 10:17:28 PROC_ID is '246.0'

There may be an issue in how arguments are serialized from qpid-tool 0.10.

-

With condor_job_server...

JOB_SERVER = $(SBIN)/condor_job_server
JOB_SERVER_ARGS = -f
JOB_SERVER_DEBUG = D_FULLDEBUG
JOB_SERVER_LOG = $(LOG)/JobServerLog
SCHEDD.QMF_PUBLISH_SUBMISSIONS = FALSE
DAEMON_LIST = $(DAEMON_LIST), JOB_SERVER

03/31/11 10:27:50 warning: failed to lookup attribute Args in job '246.0'
03/31/11 10:27:50 warning: failed to lookup attribute ReleaseReason in job '246.0'
Stack dump for process 14900 at timestamp 1301581674 (18 frames)
condor_job_server(dprintf_dump_stack+0x63)[0x534983]
condor_job_server(linux_sig_coredump(int)+0x40)[0x495f00]
/lib64/libpthread.so.0[0x376300f4a0]
/lib64/libc.so.6[0x3762d2091f]
condor_job_server(com::redhat::grid::JobServerObject::GetJobAd(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0xd7)[0x46a287]
condor_job_server(com::redhat::grid::JobServerObject::ManagementMethod(unsigned int, qpid::management::Args&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x93)[0x46a383]
condor_job_server(qmf::com::redhat::grid::JobServer::doMethod(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > > const&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x710)[0x473350]
/usr/lib64/libqmf.so.4(qpid::management::ManagementAgentImpl::invokeMethodRequest(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x95f)[0x326761e73f]
/usr/lib64/libqmf.so.4(qpid::management::ManagementAgentImpl::pollCallbacks(unsigned int)+0xbc)[0x326762029c]
condor_job_server(HandleMgmtSocket(Service*, Stream*)+0x14)[0x4639d4]
condor_job_server(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x285)[0x490de5]
condor_job_server(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x1a)[0x4912da]
condor_job_server(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x40)[0x51df70]
condor_job_server(DaemonCore::CallSocketHandler(int&, bool)+0x135)[0x486bd5]
condor_job_server(DaemonCore::Driver()+0x2013)[0x48b7c3]
condor_job_server(main+0x10eb)[0x4978eb]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3762c1ec5d]
condor_job_server[0x463509]

Navigating to the job details in Cumin results in a SEGV when navigating to job list of eeyore.local#246.

Comment 3 Matthew Farrellee 2011-03-31 16:01:55 UTC
Note: When the condor_job_server SEGV, Cumin will report "Request timed out"

Comment 4 Matthew Farrellee 2011-04-01 01:24:10 UTC
Testing uncovered bug 692741

The JobServer SEGV fix is pushed to V7_6-branch as

commit 8450a28be8531e76982b8b6c6063dd4bd8caa1e8
Author: Matthew Farrellee <matt@redhat>
Date:   Thu Mar 31 16:06:56 2011 -0400

    Corrected JOB_AD_ERROR lookup logic, removed SEGV from string = NULL

Comment 5 Matthew Farrellee 2011-04-01 01:26:51 UTC
Please check to see if there is a regression in qpid-tool. You should be able to get details with,

qpid: call 322 GetJobAd 2525.0

However, qpid-tool in 0.10 insists on,

qpid: call 322 GetJobAd "2525.0"

and passes "2525.0", not 2525.0, to the GetJobAd method, which fails as expected.

Comment 7 Martin Kudlej 2011-04-04 09:38:10 UTC
I think it should work with and without quotes. We've tried another method of another object in broker:

qpid: call 246 reroute 5 True "amq.topic"
qpid: OK (0) - {}

qpid: call 246 reroute 5 True 'amq.topic'
qpid: OK (0) - {}

qpid: call 246 reroute 5 True amq.topic
qpid: OK (0) - {}

for condor-qmf I've got:
qpid: call 11414 GetJobSummaries
qpid: OK (0) - {u'Jobs': [{u'ProcId': 0, u'CurrentTime': 1301909239, u'QDate': 1301908659, u'Cmd': '/tmp/.condor_run.28044', u'ClusterId': 2531, u'JobStatus': 2, u'Arguments': '""', u'EnteredCurrentStatus': 1301908670, u'GlobalJobId': 'mrg-qe-05.lab.eng.brq.redhat.com#2531.0#1301908659'}]}
qpid: call 322 GetJobAd '2531.0'
qpid: Unknown Job Id (65536) - {}

qpid:
qpid: call 322 GetJobAd 2531.0
qpid: invalid conversion: Variant is not a string; use asString() if conversion is required. (qpid/types/Variant.cpp:569) (7) - {}

qpid: call 322 GetJobAd "2531.0"
qpid: Unknown Job Id (65536) - {}
qpid: 
qpid: call 322 GetJobAd '2531.0'
qpid: Unknown Job Id (65536) - {}

All tests done with:
condor-7.6.0-0.4.el5
condor-qmf-7.6.0-0.4.el5
qpid-qmf-0.10-2.el5
qpid-cpp-server-0.10-1.el5
qpid-qmf-devel-0.10-2.el5
qpid-tools-0.10-1.el5
qpid-cpp-client-0.10-1.el5
ruby-qpid-qmf-0.10-2.el5
qpid-cpp-client-devel-0.10-1.el5
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-2.el5

Comment 8 Matthew Farrellee 2011-04-04 12:54:12 UTC
(0) from comment 2 demonstrates a regression in qpid-tool

Comment 10 Frantisek Reznicek 2011-04-05 18:10:00 UTC
The behavior from comment 7 is now tracked separately as bug 693847.

Comment 11 Tomas Rusnak 2011-05-24 18:11:51 UTC
Retested over all supported platforms x86, x86_64 with:

RHEL5:
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-8.el5
qpid-cpp-client-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-client-devel-docs-0.10-7.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-xml-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-8.el5
qpid-qmf-devel-0.10-8.el5
qpid-tools-0.10-5.el5
ruby-qpid-qmf-0.10-8.el5

RHEL6:

python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-7.el6.x86_64
qpid-cpp-client-0.10-4.el6.x86_64
qpid-cpp-client-devel-0.10-4.el6.x86_64
qpid-cpp-client-devel-docs-0.10-4.el6.noarch
qpid-cpp-client-rdma-0.10-4.el6.x86_64
qpid-cpp-client-ssl-0.10-4.el6.x86_64
qpid-cpp-server-0.10-4.el6.x86_64
qpid-cpp-server-cluster-0.10-4.el6.x86_64
qpid-cpp-server-devel-0.10-4.el6.x86_64
qpid-cpp-server-rdma-0.10-4.el6.x86_64
qpid-cpp-server-ssl-0.10-4.el6.x86_64
qpid-cpp-server-store-0.10-4.el6.x86_64
qpid-cpp-server-xml-0.10-4.el6.x86_64
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-7.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-4.el6.x86_64
ruby-qpid-0.7.946106-2.el6.x86_64
ruby-qpid-qmf-0.10-7.el6.x86_64
sesame-debuginfo-0.10-1.el6.x86_64
=====
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-7.el6.i686
qpid-cpp-client-0.10-4.el6.i686
qpid-cpp-client-devel-0.10-4.el6.i686
qpid-cpp-client-devel-docs-0.10-4.el6.noarch
qpid-cpp-client-rdma-0.10-4.el6.i686
qpid-cpp-client-ssl-0.10-4.el6.i686
qpid-cpp-server-0.10-4.el6.i686
qpid-cpp-server-cluster-0.10-4.el6.i686
qpid-cpp-server-devel-0.10-4.el6.i686
qpid-cpp-server-rdma-0.10-4.el6.i686
qpid-cpp-server-ssl-0.10-4.el6.i686
qpid-cpp-server-store-0.10-4.el6.i686
qpid-cpp-server-xml-0.10-4.el6.i686
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-7.el6.i686
qpid-qmf-devel-0.10-7.el6.i686
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-4.el6.i686
ruby-qpid-0.7.946106-2.el6.i686
ruby-qpid-qmf-0.10-7.el6.i686
sesame-0.10-1.el6.i686
sesame-debuginfo-0.10-1.el6.i686


qpid: list
Summary of Objects by Type:
    Package          Class       Active  Deleted
    ==============================================
    com.redhat.grid  scheduler   1       0
    com.redhat.grid  slot        1       0
    com.redhat.grid  master      1       0
    com.redhat.grid  submitter   1       0
    com.redhat.grid  negotiator  1       0
    com.redhat.grid  submission  1       0
    com.redhat.grid  jobserver   1       0
    com.redhat.grid  collector   1       0
qpid: list jobserver
Object Summary:
    ID   Created   Destroyed  Index
    ==========================================================================
    106  12:56:06  -          jobs.lab.eng.brq.redhat.com
qpid: call 106 GetJobAd "246.0"
qpid: OK (0) - {...}

qpid: call 106 GetJobAd 246.0
qpid: invalid conversion: Variant is not a string; use asString() if conversion
is required. (qpid/types/Variant.cpp:569) (7) - {}

JobserverLog:
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='0.0'
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='246.0'
05/24/11 21:09:10 JobServerJobLogConsumer::NewClassAd processing _key='0246.-1'
05/24/11 21:09:10 Job::Job of '0246.-1'
05/24/11 21:09:10 LiveJobImpl created for '0246.-1'
05/24/11 21:09:10 warning: failed to lookup attribute JobStatus in job '0246.-1'
05/24/11 21:09:10 ProcessHistoryTimer() called
05/24/11 21:09:20 TimerHandler_JobLogPolling() called
05/24/11 21:09:20 === Current Probing Information ===
05/24/11 21:09:20 fsize: 5535           mtime: 1305222134
05/24/11 21:09:20 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:30 TimerHandler_JobLogPolling() called
05/24/11 21:09:30 === Current Probing Information ===
05/24/11 21:09:30 fsize: 5535           mtime: 1305222134
05/24/11 21:09:30 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:40 TimerHandler_JobLogPolling() called
05/24/11 21:09:40 === Current Probing Information ===
05/24/11 21:09:40 fsize: 5535           mtime: 1305222134
05/24/11 21:09:40 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:09:50 TimerHandler_JobLogPolling() called
05/24/11 21:09:50 === Current Probing Information ===
05/24/11 21:09:50 fsize: 5535           mtime: 1305222134
05/24/11 21:09:50 first log entry: 4 CreationTimestamp 1301426975
05/24/11 21:10:00 TimerHandler_JobLogPolling() called
05/24/11 21:10:00 === Current Probing Information ===
05/24/11 21:10:00 fsize: 5535           mtime: 1305222134
05/24/11 21:10:00 first log entry: 4 CreationTimestamp 1301426975

Original issue with calling jobserver methods solved. New bugzilla was created for an issue from comment 7. 

>>> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.