Bug 737614 - glibc backtrace during repo sync
Summary: glibc backtrace during repo sync
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Update Infrastructure for Cloud Providers
Classification: Red Hat
Component: RHUA
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: John Matthews
QA Contact: wes hayutin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-09-12 17:55 UTC by James Slagle
Modified: 2012-07-31 17:56 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-31 17:56:42 UTC


Attachments (Terms of Use)
apache error log (75.73 KB, application/octet-stream), 2011-09-12 17:55 UTC, James Slagle
pulp log (2.00 MB, application/octet-stream), 2011-09-12 17:56 UTC, James Slagle
apache error log (145.21 KB, application/octet-stream), 2011-09-16 17:58 UTC, James Slagle

Description James Slagle 2011-09-12 17:55:01 UTC
I saw a glibc backtrace in the apache error log when the scheduled repo syncs kicked off this morning at 7:37 am.  I'm attaching the error_log and pulp.log.

This was from ap-southeast.

Comment 1 James Slagle 2011-09-12 17:55:43 UTC
Created attachment 522747 [details]
apache error log

Comment 2 James Slagle 2011-09-12 17:56:23 UTC
Created attachment 522748 [details]
pulp log

Comment 3 John Matthews 2011-09-12 18:09:38 UTC
*** glibc detected *** (wsgi:pulp)    : double free or corruption (!prev): 0x00007fb6ec0b5740 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75716)[0x7fb72cf90716]
/usr/lib64/libnsspem.so(+0x18e72)[0x7fb700689e72]
/usr/lib64/libnsspem.so(+0xa4f4)[0x7fb70067b4f4]
/usr/lib64/libnsspem.so(+0xa64d)[0x7fb70067b64d]
/usr/lib64/libnsspem.so(+0x17e0f)[0x7fb700688e0f]
/usr/lib64/libnsspem.so(+0x13df1)[0x7fb700684df1]
/usr/lib64/libnss3.so(PK11_Sign+0xb1)[0x7fb729eb3b71]
/usr/lib64/libssl3.so(+0xe620)[0x7fb72a3e2620]
/usr/lib64/libssl3.so(+0xf159)[0x7fb72a3e3159]
/usr/lib64/libssl3.so(+0x12860)[0x7fb72a3e6860]
/usr/lib64/libssl3.so(+0x13e30)[0x7fb72a3e7e30]
/usr/lib64/libssl3.so(+0x148cc)[0x7fb72a3e88cc]
/usr/lib64/libssl3.so(+0x17169)[0x7fb72a3eb169]
/usr/lib64/libssl3.so(+0x1cf15)[0x7fb72a3f0f15]
/usr/lib64/libssl3.so(SSL_ForceHandshake+0x9f)[0x7fb72a3f267f]
/usr/lib64/libcurl.so.4(Curl_nss_connect+0xb70)[0x7fb71b0b3e10]
/usr/lib64/libcurl.so.4(Curl_ssl_connect+0x2b)[0x7fb71b0ab34b]
/usr/lib64/libcurl.so.4(Curl_http_connect+0x5b)[0x7fb71b089e5b]
/usr/lib64/libcurl.so.4(Curl_protocol_connect+0xd2)[0x7fb71b090612]
/usr/lib64/libcurl.so.4(Curl_connect+0x2fc)[0x7fb71b096abc]
/usr/lib64/libcurl.so.4(Curl_perform+0x80)[0x7fb71b09ea50]
/usr/lib64/python2.6/site-packages/pycurl.so(+0x864b)[0x7fb71b2ce64b]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5176)[0x7fb721e8ead6]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalCodeEx+0x92c)[0x7fb721e9081c]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5304)[0x7fb721e8ec64]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x63ef)[0x7fb721e8fd4f]

[Mon Sep 12 07:44:25 2011] [notice] child pid 7684 exit signal Aborted (6)

Comment 4 James Slagle 2011-09-16 17:57:36 UTC
Saw this again last night on my test RHUA.  Looks like it happened at 9:21 pm.  Disk space and memory usage were not an issue at the time.

Memory usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       6864         99          0         15       4778
-/+ buffers/cache:       2070       4893
Swap:            0          0          0

Thu Sep 15 21:25:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       5109       1854          0         20       4730
-/+ buffers/cache:        358       6605
Swap:            0          0          0

Notice the big drop in used memory after apache dies.

Disk space usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  47% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Thu Sep 15 21:25:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  48% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Attaching the error_log from this box.

Comment 5 James Slagle 2011-09-16 17:58:13 UTC
Created attachment 523594 [details]
apache error log

Comment 6 John Matthews 2011-10-12 15:50:56 UTC
I have a hunch this may be related to our downloading of .treeinfo files.
Reasoning is:
 We have seen this problem in the past when multiple threads were performing pycurl transfers of protected repos.  The issue came down to the fact that using NSS through pycurl was not thread safe, which led to an infrequent crash.

 If multiple repos are syncing at the same time, there can be multiple instances of the main grinder thread.  If two or more main grinder threads perform a fetch at the same time, this could result in a crash.

 Fetching packages is not an issue: the fetch of packages happens under an "ActiveObject", in its own process.  Yet there are 2 cases I know of where pycurl fetches happen in the main grinder thread.

 The grinder main thread will perform pycurl requests when:
 1) fetching yum metadata
 2) fetching the .treeinfo file

For #1 we have a lock: only one thread in the process may perform a yum operation at a time.

For #2 we are not doing anything to protect this; the suspected race is sketched below.
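
A minimal sketch of the suspected race in Python (the URLs are hypothetical and this is not grinder code): two threads stand in for two main grinder threads, each driving pycurl over NSS at the same time.

import threading
from io import BytesIO

import pycurl

def fetch(url):
    # Each thread performs its own pycurl transfer; with a
    # non-thread-safe NSS stack underneath, concurrent perform()
    # calls like these can trigger the double free in the backtrace.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()

# Hypothetical protected-repo URLs standing in for two concurrent repo syncs.
threads = [threading.Thread(target=fetch,
                            args=("https://cds.example.com/repo%d/.treeinfo" % i,))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()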

Comment 7 John Matthews 2011-10-13 18:27:43 UTC
I think the change below will fix this crash.

Spoke with Pradeep about the problem and we decided to go for a simple solution.
We introduced a lock that will serialize all requests for fetching the Distribution Tree metadata.

This restricts .treeinfo fetches to one thread at a time.

http://git.fedorahosted.org/git?p=grinder.git;a=commitdiff;h=99332bcfcac4005fd0933abbbd84931d8f5ba921
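
A minimal sketch of this serialization in Python (hypothetical names; the commit above has the actual grinder change): one module-level lock shared by every repo sync in the process, mirroring the existing yum-operation lock.

import threading

# One lock per process: all .treeinfo fetches are serialized through it,
# the same pattern already used for yum metadata operations (#1 above).
_treeinfo_lock = threading.Lock()

def fetch_treeinfo(repo_url):
    # Only one thread at a time may drive the pycurl/NSS transfer;
    # other repo syncs block here until the fetch finishes.
    with _treeinfo_lock:
        return _pycurl_fetch(repo_url + "/.treeinfo")  # hypothetical helper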


Moving the .treeinfo fetching to an ActiveObject is left as future work.


Going to let this run for a few days before moving the bug to MODIFIED.
Need some more test data to ensure it's fixed.

Comment 8 John Matthews 2011-10-26 13:28:52 UTC
Moving this to MODIFIED.  We isolated the downloading of distribution files to its own ActiveObject process.  This should avoid the issue of multiple downloads in the same process resulting in a possible crash.
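
A minimal sketch of that isolation in Python, using multiprocessing to stand in for pulp's ActiveObject (hypothetical names; not the actual pulp/grinder API):

import multiprocessing

def _fetch_distribution_files(repo_url, queue):
    # Runs in a child process, so the NSS/pycurl state is private to it;
    # a crash here takes down only the worker, not the wsgi:pulp parent.
    queue.put(_pycurl_fetch(repo_url + "/.treeinfo"))  # hypothetical helper

def fetch_in_own_process(repo_url):
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_fetch_distribution_files,
                                     args=(repo_url, queue))
    worker.start()
    result = queue.get()
    worker.join()
    return result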

