I saw a glibc backtrace in the Apache error log when the scheduled repo syncs kicked off this morning at 7:37 am. I'm attaching the error_log and pulp.log. This was from ap-southeast.
Created attachment 522747: apache error log
Created attachment 522748: pulp log
*** glibc detected *** (wsgi:pulp) : double free or corruption (!prev): 0x00007fb6ec0b5740 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75716)[0x7fb72cf90716]
/usr/lib64/libnsspem.so(+0x18e72)[0x7fb700689e72]
/usr/lib64/libnsspem.so(+0xa4f4)[0x7fb70067b4f4]
/usr/lib64/libnsspem.so(+0xa64d)[0x7fb70067b64d]
/usr/lib64/libnsspem.so(+0x17e0f)[0x7fb700688e0f]
/usr/lib64/libnsspem.so(+0x13df1)[0x7fb700684df1]
/usr/lib64/libnss3.so(PK11_Sign+0xb1)[0x7fb729eb3b71]
/usr/lib64/libssl3.so(+0xe620)[0x7fb72a3e2620]
/usr/lib64/libssl3.so(+0xf159)[0x7fb72a3e3159]
/usr/lib64/libssl3.so(+0x12860)[0x7fb72a3e6860]
/usr/lib64/libssl3.so(+0x13e30)[0x7fb72a3e7e30]
/usr/lib64/libssl3.so(+0x148cc)[0x7fb72a3e88cc]
/usr/lib64/libssl3.so(+0x17169)[0x7fb72a3eb169]
/usr/lib64/libssl3.so(+0x1cf15)[0x7fb72a3f0f15]
/usr/lib64/libssl3.so(SSL_ForceHandshake+0x9f)[0x7fb72a3f267f]
/usr/lib64/libcurl.so.4(Curl_nss_connect+0xb70)[0x7fb71b0b3e10]
/usr/lib64/libcurl.so.4(Curl_ssl_connect+0x2b)[0x7fb71b0ab34b]
/usr/lib64/libcurl.so.4(Curl_http_connect+0x5b)[0x7fb71b089e5b]
/usr/lib64/libcurl.so.4(Curl_protocol_connect+0xd2)[0x7fb71b090612]
/usr/lib64/libcurl.so.4(Curl_connect+0x2fc)[0x7fb71b096abc]
/usr/lib64/libcurl.so.4(Curl_perform+0x80)[0x7fb71b09ea50]
/usr/lib64/python2.6/site-packages/pycurl.so(+0x864b)[0x7fb71b2ce64b]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5176)[0x7fb721e8ead6]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalCodeEx+0x92c)[0x7fb721e9081c]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x5304)[0x7fb721e8ec64]
/usr/lib64/libpython2.6.so.1.0(PyEval_EvalFrameEx+0x63ef)[0x7fb721e8fd4f]

[Mon Sep 12 07:44:25 2011] [notice] child pid 7684 exit signal Aborted (6)
Saw this again last night on my test RHUA. Looks like it happened at 9:21 pm. Disk space and memory usage were not an issue at the time.

Memory usage before and after the backtrace:

Thu Sep 15 21:20:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       6864         99          0         15       4778
-/+ buffers/cache:        2070       4893
Swap:            0          0          0

Thu Sep 15 21:25:01 EDT 2011
             total       used       free     shared    buffers     cached
Mem:          6964       5109       1854          0         20       4730
-/+ buffers/cache:         358       6605
Swap:            0          0          0

Notice the big drop after apache dies.

Disk space usage before and after the backtrace:

Thu Sep 15 21:20:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  47% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Thu Sep 15 21:25:01 EDT 2011
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1            5.7G  2.7G  3.0G  48% /
none                  3.5G     0  3.5G   0% /dev/shm
/dev/xvdl1            296G   99G  182G  36% /var/lib/pulp
/dev/xvdm1             20G  1.2G   18G   7% /var/lib/mongodb
/root/deploy/RHEL-6.1-RHUI-2.0-20110727.2-Server-x86_64-DVD1.iso
                       62M   62M     0 100% /mnt/rhuiso

Attaching the error_log from this box.
Created attachment 523594: apache error log
I have a hunch this may be related to our downloading of .treeinfo files. My reasoning: we have seen this problem in the past when multiple threads were performing pycurl transfers of protected repos. The issue came down to the fact that using NSS through pycurl is not thread safe, and it caused an infrequent crash. If multiple repos sync at the same time, there can be multiple instances of the main grinder thread, and if two or more of those threads perform a fetch at the same time it could result in a crash. For fetching of packages this is not an issue, because the package fetch happens under an "ActiveObject" in its own process. However, there are 2 cases I know of where pycurl fetches happen in the main grinder thread:

1) Yum fetch of metadata
2) Fetch of the .treeinfo file

For #1 we have a lock and only allow one thread in the process to perform a yum operation at a time. For #2 we are not doing anything to protect this.
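To make the unprotected case concrete, here is a minimal sketch (not grinder code; the URLs and the fetch() helper name are placeholders) of the concurrent pycurl usage described above. Two threads in the same process each drive their own HTTPS transfer, so both can end up inside the NSS handshake path at the same time with nothing serializing them:

    import threading
    import pycurl
    from StringIO import StringIO

    def fetch(url):
        # Each thread drives its own pycurl handle; with NSS-backed
        # libcurl the TLS handshake path is not thread safe.
        buf = StringIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.perform()
        c.close()
        return buf.getvalue()

    # Two repos syncing at once -> two main-grinder-thread style fetches
    # running concurrently in one process (placeholder URLs).
    urls = ["https://cdn.example.com/repo1/.treeinfo",
            "https://cdn.example.com/repo2/.treeinfo"]
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()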
I think the change below will fix this crash. I spoke with Pradeep about the problem and we decided to go with a simple solution: we introduced a lock that serializes all requests for fetching the distribution tree metadata, so only one thread at a time may fetch .treeinfo files.

http://git.fedorahosted.org/git?p=grinder.git;a=commitdiff;h=99332bcfcac4005fd0933abbbd84931d8f5ba921

Moving to ActiveObject for the .treeinfo fetching is left for future work. Going to let this run for a few days before moving the bug to MODIFIED; I need some more test data to ensure it's fixed.
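For reference, the shape of the fix is roughly the following sketch (the actual function and module names in the grinder commit differ): a single module-level lock that every .treeinfo fetch must take, so at most one thread per process is inside pycurl/NSS on that code path at a time.

    import threading

    # One lock shared by all threads in the process (module level).
    TREEINFO_LOCK = threading.Lock()

    def fetch_treeinfo(url):
        # Serialize all distribution tree metadata fetches; only one
        # thread at a time may drive pycurl/NSS through this path.
        TREEINFO_LOCK.acquire()
        try:
            return fetch(url)  # the pycurl fetch from the sketch above
        finally:
            TREEINFO_LOCK.release()

The trade-off is that .treeinfo downloads for concurrent syncs now happen one at a time, but .treeinfo files are small, so the serialization cost is negligible next to the package fetches, which already run in their own processes.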
Moving this to MODIFIED. We isolated the downloading of distribution files into its own ActiveObject process, which should avoid the issue of multiple downloads in the same process leading to a possible crash.
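As an illustration of the idea only (grinder's ActiveObject machinery has its own API; this sketch just uses the standard multiprocessing module, and the function names are placeholders): running each distribution-file download in a child process keeps the pycurl/NSS state private to that process, so concurrent syncs in the parent can no longer trample it.

    import multiprocessing

    def download_distribution(url, dest, result_queue):
        # Runs in a child process, so pycurl/NSS state lives and dies
        # with this process and is never shared between syncs.
        data = fetch(url)  # the pycurl fetch from the earlier sketch
        with open(dest, "wb") as f:
            f.write(data)
        result_queue.put((url, len(data)))

    def fetch_distribution_in_child(url, dest):
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=download_distribution,
                                    args=(url, dest, q))
        p.start()
        p.join()
        # The result is a small tuple, so reading it after join is fine.
        return q.get() if p.exitcode == 0 else None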