Bug 707618

Summary: aviary doesn't return answer to client from calling getData function + coredump of condor_preen
Product: Red Hat Enterprise MRG Reporter: Martin Kudlej <mkudlej>
Component: condor-aviary Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA QA Contact: Martin Kudlej <mkudlej>
Severity: urgent Docs Contact:
Priority: urgent    
Version: Development CC: iboverma, jneedle, matt, pmackinn
Target Milestone: 2.0   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.6.1-0.8 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-27 14:20:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Attachments:
Description: Patch to create separate JobDataType ptr for return
Flags: none

Description Martin Kudlej 2011-05-25 14:23:58 UTC
Version-Release number of selected component (if applicable):
condor-7.6.1-0.6.el6.i686
condor-aviary-7.6.1-0.6.el6.i686
condor-classads-7.6.1-0.6.el6.i686
condor-debuginfo-7.6.1-0.6.el6.i686
condor-qmf-7.6.1-0.6.el6.i686
condor-wallaby-base-db-1.12-1.el6.noarch
condor-wallaby-client-4.0-6.el6.noarch
condor-wallaby-tools-4.0-6.el6.noarch
python-condorutils-1.5-3.el6.noarch
python-qpid-qmf-0.10-7.el6.i686
qpid-qmf-0.10-7.el6.i686
ruby-qpid-qmf-0.10-7.el6.i686
wso2-axis2-2.1.0-3.el6.i686
wso2-rampart-2.1.0-3.el6.i686
wso2-wsf-cpp-2.1.0-3.el6.i686
wso2-wsf-cpp-debuginfo-2.1.0-3.el6.i686
Red Hat Enterprise Linux Server release 6.1 (Santiago)

How reproducible:
100%

Steps to Reproduce:
1. install aviary
2. submit 10 jobs via aviary
3. once the jobs have finished, call getData on each of them, requesting the data types in this order: ['ERR', 'LOG', 'OUT']
4. the suds-based client hangs on the first getData call; after interrupting it manually, I see this:
...
      result = client.service.getJobDetails(ids_avia)
  File "/usr/lib/python2.4/site-packages/suds/client.py", line 539, in __call__
    return client.invoke(args, kwargs)
  File "/usr/lib/python2.4/site-packages/suds/client.py", line 598, in invoke
    result = self.send(msg)
  File "/usr/lib/python2.4/site-packages/suds/client.py", line 623, in send
    reply = transport.send(request)
  File "/usr/lib/python2.4/site-packages/suds/transport/https.py", line 64, in send
    return  HttpTransport.send(self, request)
  File "/usr/lib/python2.4/site-packages/suds/transport/http.py", line 77, in send
    fp = self.u2open(u2request)
  File "/usr/lib/python2.4/site-packages/suds/transport/http.py", line 116, in u2open
    return url.open(u2request)
  File "/usr/lib/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.4/urllib2.py", line 1118, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.4/urllib2.py", line 1090, in do_open
    r = h.getresponse()
KeyboardInterrupt
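
The reproduction steps above can be sketched as a small driver. This is a minimal sketch, not the reporter's actual test script: `fetch_all`, `fake_get_data`, and the job ids are hypothetical names, and the real request would go through the suds client, whose exact getData arguments are not shown in this report. The only library call relied on is the standard `socket.setdefaulttimeout`, so that a server which never answers surfaces as an error instead of blocking until the client is interrupted by hand.

```python
import socket

# Order of data types requested for each job, as given in step 3.
DATA_TYPES = ['ERR', 'LOG', 'OUT']


def fetch_all(job_ids, get_data, timeout_s=30):
    """Call get_data(job_id, data_type) for every job, in the step 3 order.

    get_data is a hypothetical callable standing in for the real
    Aviary getData request made through the SOAP client.
    """
    # Sockets created after this call time out instead of blocking forever.
    socket.setdefaulttimeout(timeout_s)
    results = {}
    for job_id in job_ids:
        for data_type in DATA_TYPES:
            try:
                results[(job_id, data_type)] = get_data(job_id, data_type)
            except socket.timeout:
                # No reply within timeout_s; record it and move on.
                results[(job_id, data_type)] = None
    return results


def fake_get_data(job_id, data_type):
    # Hypothetical stand-in so the sketch runs without an Aviary server.
    if data_type == 'LOG':
        raise socket.timeout('simulated stalled response')
    return '%s for job %s' % (data_type, job_id)


if __name__ == '__main__':
    for key, value in sorted(fetch_all(['1.0', '2.0'], fake_get_data).items()):
        print(key, value)
```

With a real client, the exception raised on a stalled read may be wrapped by the HTTP layer, so the `except` clause would likely need to be widened; that detail is an assumption left open here.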

Actual results:
Calling getData hangs; the client never receives a response.

Expected results:
Calling getData via aviary works and no coredump is produced.

Comment 4 Pete MacKinnon 2011-05-25 16:28:24 UTC
After the valgrind cleanup, it looks like I'm zigging while Axis2/C is zagging...

#16 <signal handler called>
#17 0x005de16a in malloc_consolidate () from /lib/libc.so.6
#18 0x005e0c85 in _int_malloc () from /lib/libc.so.6
#19 0x005e1efe in malloc () from /lib/libc.so.6
#20 0x0027e522 in xmlBufferCreate () from /usr/lib/libxml2.so.2
#21 0x0076d35c in axiom_xml_writer_create_for_memory () from /usr/lib/libaxis2_parser.so.0
#22 0x0087d179 in axis2_http_transport_sender_invoke () from /usr/lib/libaxis2_http_sender.so.0


Possibly mismatched malloc/delete.

Comment 5 Pete MacKinnon 2011-05-25 16:30:00 UTC
The hang appears to be due to the fact that the stack has gotten catastrophically whacked.

#0  0x00946424 in __kernel_vsyscall ()
#1  0x0065d1a3 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x005e4131 in _L_lock_9450 () from /lib/libc.so.6
#3  0x005e1ef4 in malloc () from /lib/libc.so.6

Comment 6 Pete MacKinnon 2011-05-25 20:48:12 UTC
*** Bug 707543 has been marked as a duplicate of this bug. ***

Comment 7 Pete MacKinnon 2011-05-25 20:51:32 UTC
Created attachment 500938 [details]
Patch to create separate JobDataType ptr for return

Diffed from upstream 7.6 branch to up-to-date FH master

Comment 9 Pete MacKinnon 2011-05-25 20:54:23 UTC
Memory was corrupted, leaving the runtime stuck in a low-level libc lock in malloc. This looked like a hang when in fact a SEGV had occurred.
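
The sequence described in this comment can be illustrated by analogy. The sketch below is not the Condor, Axis2/C, or glibc code: a plain, non-reentrant `threading.Lock` stands in for the allocator's internal lock, and `allocate` and `crash_handler` are hypothetical names. It shows why a fault handler that re-enters an allocator whose lock is already held ends up waiting on that lock, so the crash looks like a hang from the outside.

```python
import threading

# Stand-in for the allocator's internal, non-reentrant lock.
arena_lock = threading.Lock()


def crash_handler(wait_s):
    """Hypothetical fault handler that itself needs the allocator."""
    # The interrupted code still holds arena_lock, so this cannot succeed.
    # Without the timeout this call would block forever.
    got_lock = arena_lock.acquire(timeout=wait_s)
    if got_lock:
        arena_lock.release()
    return got_lock


def allocate(heap_is_corrupt, wait_s=0.1):
    """Hypothetical allocator entry point; returns what a caller observes."""
    with arena_lock:
        if heap_is_corrupt:
            # The fault happens while the lock is held, and the handler
            # runs on top of this frame, as in the backtraces above.
            if not crash_handler(wait_s):
                return 'handler blocked on lock (looks like a hang)'
        return 'allocation ok'


if __name__ == '__main__':
    print(allocate(heap_is_corrupt=False))
    print(allocate(heap_is_corrupt=True))
```

The timeout exists only so the sketch terminates; in the real process there is no timeout, which is why the client waited until it was interrupted.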

Comment 11 Martin Kudlej 2011-06-01 09:03:38 UTC
Tested on RHEL 5.6/6.1 x x86_64/i386 with condor-7.6.1-0.8 and it works. -->VERIFIED