Bug 737614

Summary: glibc backtrace during repo sync
Product: Red Hat Update Infrastructure for Cloud Providers
Component: RHUA
Version: 2.0
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: James Slagle <jslagle>
Assignee: John Matthews <jmatthew>
QA Contact: wes hayutin <whayutin>
CC: jmatthew, kbidarka, sghai, tsanders
Doc Type: Bug Fix
Last Closed: 2012-07-31 17:56:42 UTC
Attachments:
apache error log
pulp log
apache error log

Description James Slagle 2011-09-12 17:55:01 UTC
I saw a glibc backtrace in the apache error log when the scheduled repo syncs kicked off this morning at 7:37 am.  I'm attaching the error_log and pulp.log.

This was from ap-southeast.

Comment 1 James Slagle 2011-09-12 17:55:43 UTC
Created attachment 522747 [details]
apache error log

Comment 2 James Slagle 2011-09-12 17:56:23 UTC
Created attachment 522748 [details]
pulp log

Comment 3 John Matthews 2011-09-12 18:09:38 UTC
*** glibc detected *** (wsgi:pulp)    : double free or corruption (!prev): 0x00007fb6ec0b5740 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75716)[0x7fb72cf90716]
/usr/lib64/libnsspem.so(+0x18e72)[0x7fb700689e72]
/usr/lib64/libnsspem.so(+0xa4f4)[0x7fb70067b4f4]
/usr/lib64/libnsspem.so(+0xa64d)[0x7fb70067b64d]
/usr/lib64/libnsspem.so(+0x17e0f)[0x7fb700688e0f]
/usr/lib64/libnsspem.so(+0x13df1)[0x7fb700684df1]
/usr/lib64/libnss3.so(PK11_Sign+0xb1)[0x7fb729eb3b71]
/usr/lib64/libssl3.so(+0xe620)[0x7fb72a3e2620]
/usr/lib64/libssl3.so(+0xf159)[0x7fb72a3e3159]
/usr/lib64/libssl3.so(+0x12860)[0x7fb72a3e6860]
/usr/lib64/libssl3.so(+0x13e30)[0x7fb72a3e7e30]
/usr/lib64/libssl3.so(+0x148cc)[0x7fb72a3e88cc]
/usr/lib64/libssl3.so(+0x17169)[0x7fb72a3eb169]
/usr/lib64/libssl3.so(+0x1cf15)[0x7fb72a3f0f15]
/usr/lib64/libssl3.so(SSL_ForceHandshake+0x9f)[0x7fb72a3f267f]
/usr/lib64/libcurl.so.4(Curl_nss_connect+0xb70)[0x7fb71b0b3e10]
/usr/lib64/libcurl.so.4(Curl_ssl_connect+0x2b)[0x7fb71b0ab34b]
/usr/lib64/libcurl.so.4(Curl_http_connect+0x5b)[0x7fb71b089e5b]
/usr/lib64/libcurl.so.4(Curl_protocol_connect+0xd2)[0x7fb71b090612]
/usr/lib64/libcurl.so.4(Curl_connect+0x2fc)[0x7fb71b096abc]
/usr/lib64/libcurl.so.4(Curl_perform+0x80)[0x7fb71b09ea50]
/usr/lib64/python2.6/site-packages/pycurl.so(+0x864b)[0x7fb71b2ce64b]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5176)[0x7fb721e8ead6]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalCodeEx+0x92c)[0x7fb721e9081c]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5304)[0x7fb721e8ec64]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x63ef)[0x7fb721e8fd4f]

[Mon Sep 12 07:44:25 2011] [notice] child pid 7684 exit signal Aborted (6)

Comment 4 James Slagle 2011-09-16 17:57:36 UTC
Saw this again last night on my test RHUA.  Looks like it happened at 9:21 pm.  Disk space and memory usage were not an issue at the time.

Memory usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       6864         99          0         15       4778
-/+ buffers/cache:       2070       4893
Swap:            0          0          0

Thu Sep 15 21:25:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       5109       1854          0         20       4730
-/+ buffers/cache:        358       6605
Swap:            0          0          0

Notice the big drop after apache dies.

Disk space usage before and after the backtrace:
Thu Sep 15 21:20:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  47% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Thu Sep 15 21:25:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  48% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Attaching the error_log from this box.

Comment 5 James Slagle 2011-09-16 17:58:13 UTC
Created attachment 523594 [details]
apache error log

Comment 6 John Matthews 2011-10-12 15:50:56 UTC
I have a hunch this may be related to our downloading of .treeinfo files.
Reasoning is:
 We have seen this problem in the past when multiple threads were performing pycurl transfers of protected repos.  The issue came down to the fact that using NSS through pycurl was not thread-safe, which caused an infrequent crash.

 If multiple repos sync at the same time, there can be multiple instances of the main grinder thread.  If two or more main grinder threads perform a fetch concurrently, this could result in a crash.

 For fetching packages this is not an issue: the fetch happens under an "ActiveObject", in its own process.  Yet there are 2 cases I know of where pycurl fetches happen in the main grinder thread.

 Grinder main thread will perform pycurl requests when:
 1) Yum fetch of metadata
 2) Fetch of .treeinfo file

For #1 we have a lock and only allow one thread in the process to perform a yum operation at a time.

For #2 we are not doing anything to protect this; a sketch of the unprotected path is below.
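
To make the race concrete, here is a minimal sketch of the unprotected path (the names and URLs are illustrative, not the actual grinder code): each repo sync thread drives its own pycurl transfer of .treeinfo, and nothing stops two threads from being inside the NSS handshake at the same moment.

import io
import threading

import pycurl

def fetch_treeinfo(base_url):
    # Hypothetical helper standing in for the main grinder thread's fetch.
    # Nothing serializes concurrent callers, so two threads can be inside
    # libcurl -> NSS at the same time.
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, base_url + "/.treeinfo")
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()  # the SSL handshake seen in the backtrace happens here
    c.close()
    return buf.getvalue()

# Two scheduled syncs kicking off together, as in the original report.
threads = [threading.Thread(target=fetch_treeinfo, args=(url,))
           for url in ("https://cdn.example/repo1", "https://cdn.example/repo2")]
for t in threads:
    t.start()
for t in threads:
    t.join()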

Comment 7 John Matthews 2011-10-13 18:27:43 UTC
I think the change below will fix this crash.

Spoke with Pradeep about the problem and we decided to go for a simple solution.
We introduced a lock that will serialize all requests for fetching the Distribution Tree metadata.

This restricts .treeinfo fetches to one thread at a time.

http://git.fedorahosted.org/git?p=grinder.git;a=commitdiff;h=99332bcfcac4005fd0933abbbd84931d8f5ba921
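
The shape of the fix, as a minimal sketch (illustrative names; the real change is in the commit above): a single module-level lock that every grinder thread must hold while fetching .treeinfo, mirroring the existing lock around yum metadata operations.

import threading

# One lock per process, shared by all grinder threads (illustrative name).
TREEINFO_LOCK = threading.Lock()

def fetch_treeinfo_serialized(base_url):
    # Only one thread at a time may drive the pycurl/NSS transfer.
    with TREEINFO_LOCK:
        return fetch_treeinfo(base_url)  # helper from the earlier sketch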


Moving to ActiveObject for the .treeinfo fetching is left for future work.  


Going to let this run for a few days before moving bug to MODIFIED.  
Need some more test data to ensure it's fixed.

Comment 8 John Matthews 2011-10-26 13:28:52 UTC
Moving this to MODIFIED.  We isolated the downloading of distribution files to its own ActiveObject process.  This should avoid the issue of multiple downloads in the same process resulting in a possible crash.
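
As an illustration of the ActiveObject idea (a sketch using the standard multiprocessing module, not the actual ActiveObject implementation): running the download in a child process gives it its own NSS state, and a crash there surfaces as a nonzero exit code instead of aborting the parent wsgi process.

import multiprocessing

def download_distribution(base_url, dest):
    # Runs entirely in the child process; fetch_treeinfo is the helper
    # from the first sketch.  Any NSS state (and any crash) stays here.
    data = fetch_treeinfo(base_url)
    with open(dest, "wb") as f:
        f.write(data)

if __name__ == "__main__":
    p = multiprocessing.Process(target=download_distribution,
                                args=("https://cdn.example/repo1", "/tmp/treeinfo"))
    p.start()
    p.join()
    if p.exitcode != 0:
        # e.g. the SIGABRT (signal 6) from this bug would land here,
        # leaving the parent process intact.
        print("distribution download failed, exit code %s" % p.exitcode)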