Bug 737614

Summary: glibc backtrace during repo sync
Product: Red Hat Update Infrastructure for Cloud Providers
Component: RHUA
Version: 2.0
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: James Slagle <jslagle>
Assignee: John Matthews <jmatthew>
QA Contact: wes hayutin <whayutin>
CC: jmatthew, kbidarka, sghai, tsanders
Doc Type: Bug Fix
Last Closed: 2012-07-31 17:56:42 UTC
Attachments:
apache error log
pulp log
apache error log

Description James Slagle 2011-09-12 17:55:01 UTC
I saw a glibc backtrace in the apache error log when the scheduled repo syncs kicked off this morning at 7:37 am.  I'm attaching the error_log and pulp.log.

This was from ap-southeast.

Comment 1 James Slagle 2011-09-12 17:55:43 UTC
Created attachment 522747 [details]
apache error log

Comment 2 James Slagle 2011-09-12 17:56:23 UTC
Created attachment 522748 [details]
pulp log

Comment 3 John Matthews 2011-09-12 18:09:38 UTC
*** glibc detected *** (wsgi:pulp)    : double free or corruption (!prev): 0x00007fb6ec0b5740 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75716)[0x7fb72cf90716]
/usr/lib64/libnsspem.so(+0x18e72)[0x7fb700689e72]
/usr/lib64/libnsspem.so(+0xa4f4)[0x7fb70067b4f4]
/usr/lib64/libnsspem.so(+0xa64d)[0x7fb70067b64d]
/usr/lib64/libnsspem.so(+0x17e0f)[0x7fb700688e0f]
/usr/lib64/libnsspem.so(+0x13df1)[0x7fb700684df1]
/usr/lib64/libnss3.so(PK11_Sign+0xb1)[0x7fb729eb3b71]
/usr/lib64/libssl3.so(+0xe620)[0x7fb72a3e2620]
/usr/lib64/libssl3.so(+0xf159)[0x7fb72a3e3159]
/usr/lib64/libssl3.so(+0x12860)[0x7fb72a3e6860]
/usr/lib64/libssl3.so(+0x13e30)[0x7fb72a3e7e30]
/usr/lib64/libssl3.so(+0x148cc)[0x7fb72a3e88cc]
/usr/lib64/libssl3.so(+0x17169)[0x7fb72a3eb169]
/usr/lib64/libssl3.so(+0x1cf15)[0x7fb72a3f0f15]
/usr/lib64/libssl3.so(SSL_ForceHandshake+0x9f)[0x7fb72a3f267f]
/usr/lib64/libcurl.so.4(Curl_nss_connect+0xb70)[0x7fb71b0b3e10]
/usr/lib64/libcurl.so.4(Curl_ssl_connect+0x2b)[0x7fb71b0ab34b]
/usr/lib64/libcurl.so.4(Curl_http_connect+0x5b)[0x7fb71b089e5b]
/usr/lib64/libcurl.so.4(Curl_protocol_connect+0xd2)[0x7fb71b090612]
/usr/lib64/libcurl.so.4(Curl_connect+0x2fc)[0x7fb71b096abc]
/usr/lib64/libcurl.so.4(Curl_perform+0x80)[0x7fb71b09ea50]
/usr/lib64/python2.6/site-packages/pycurl.so(+0x864b)[0x7fb71b2ce64b]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5176)[0x7fb721e8ead6]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalCodeEx+0x92c)[0x7fb721e9081c]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5304)[0x7fb721e8ec64]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x63ef)[0x7fb721e8fd4f]

[Mon Sep 12 07:44:25 2011] [notice] child pid 7684 exit signal Aborted (6)

Comment 4 James Slagle 2011-09-16 17:57:36 UTC
Saw this again last night on my test RHUA.  Looks like it happened at 9:21 pm.  Disk space and memory usage were not an issue at the time.

Memory usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       6864         99          0         15       4778
-/+ buffers/cache:       2070       4893
Swap:            0          0          0

Thu Sep 15 21:25:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       5109       1854          0         20       4730
-/+ buffers/cache:        358       6605
Swap:            0          0          0

Notice the big drop after apache dies.

Disk space usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  47% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Thu Sep 15 21:25:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  48% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Attaching the error_log from this box.

Comment 5 James Slagle 2011-09-16 17:58:13 UTC
Created attachment 523594 [details]
apache error log

Comment 6 John Matthews 2011-10-12 15:50:56 UTC
I have a hunch this may be related to our downloading of .treeinfo files.
Reasoning is:
 We have seen this problem in the past when multiple threads were performing pycurl transfers of protected repos.  The issue came down to the fact that using NSS through pycurl was not thread-safe, which caused an infrequent crash.

 If multiple repos sync at the same time, there can be multiple instances of the main grinder thread.  If two or more main grinder threads perform a fetch concurrently, this could result in a crash.

 For fetching packages this is not an issue: the fetch happens under an "ActiveObject", in its own process.  Yet there are 2 cases I know of where pycurl fetches happen in the main grinder thread.

 Grinder main thread will perform pycurl requests when:
 1) Yum fetch of metadata
 2) Fetch of .treeinfo file

For #1 we have a lock and only allow one thread in the process to perform a yum operation at a time.

For #2 we are not doing anything to protect this; a sketch of the unprotected path is below.
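
To make the race concrete, here is a minimal sketch of the unprotected path (the names and URLs are illustrative, not the actual grinder code): each repo sync thread drives its own pycurl transfer of .treeinfo, and nothing stops two threads from being inside the NSS handshake at the same moment.

import io
import threading

import pycurl

def fetch_treeinfo(base_url):
    # Hypothetical helper standing in for the main grinder thread's fetch.
    # Nothing serializes concurrent callers, so two threads can be inside
    # libcurl -> NSS at the same time.
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, base_url + "/.treeinfo")
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()  # the SSL handshake seen in the backtrace happens here
    c.close()
    return buf.getvalue()

# Two scheduled syncs kicking off together, as in the original report.
threads = [threading.Thread(target=fetch_treeinfo, args=(url,))
           for url in ("https://cdn.example/repo1", "https://cdn.example/repo2")]
for t in threads:
    t.start()
for t in threads:
    t.join()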

Comment 7 John Matthews 2011-10-13 18:27:43 UTC
I think the change below will fix this crash.

Spoke with Pradeep about the problem and we decided to go for a simple solution.
We introduced a lock that will serialize all requests for fetching the Distribution Tree metadata.

This restricts .treeinfo fetches to one thread at a time.

http://git.fedorahosted.org/git?p=grinder.git;a=commitdiff;h=99332bcfcac4005fd0933abbbd84931d8f5ba921
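
The shape of the fix, as a minimal sketch (illustrative names; the real change is in the commit above): a single module-level lock that every grinder thread must hold while fetching .treeinfo, mirroring the existing lock around yum metadata operations.

import threading

# One lock per process, shared by all grinder threads (illustrative name).
TREEINFO_LOCK = threading.Lock()

def fetch_treeinfo_serialized(base_url):
    # Only one thread at a time may drive the pycurl/NSS transfer.
    with TREEINFO_LOCK:
        return fetch_treeinfo(base_url)  # helper from the earlier sketch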


Moving to ActiveObject for the .treeinfo fetching is left for future work.  


Going to let this run for a few days before moving bug to MODIFIED.  
Need some more test data to ensure it's fixed.

Comment 8 John Matthews 2011-10-26 13:28:52 UTC
Moving this to MODIFIED.  We isolated the downloading of distribution files to its own ActiveObject process.  This should avoid the issue of multiple downloads in the same process resulting in a possible crash.
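
As an illustration of the ActiveObject idea (a sketch using the standard multiprocessing module, not the actual ActiveObject implementation): running the download in a child process gives it its own NSS state, and a crash there surfaces as a nonzero exit code instead of aborting the parent wsgi process.

import multiprocessing

def download_distribution(base_url, dest):
    # Runs entirely in the child process; fetch_treeinfo is the helper
    # from the first sketch.  Any NSS state (and any crash) stays here.
    data = fetch_treeinfo(base_url)
    with open(dest, "wb") as f:
        f.write(data)

if __name__ == "__main__":
    p = multiprocessing.Process(target=download_distribution,
                                args=("https://cdn.example/repo1", "/tmp/treeinfo"))
    p.start()
    p.join()
    if p.exitcode != 0:
        # e.g. the SIGABRT (signal 6) from this bug would land here,
        # leaving the parent process intact.
        print("distribution download failed, exit code %s" % p.exitcode)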