Bug 624559 - reposync core dumps on large downloads
Summary: reposync core dumps on large downloads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: python-pycurl
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Karel Klíč
QA Contact: Petr Šplíchal
URL:
Whiteboard:
Depends On:
Blocks: 637911
Reported: 2010-08-16 23:02 UTC by Stephen John Smoogen
Modified: 2016-06-01 01:40 UTC

Fixed In Version: python-pycurl-7.19.0-6.el6
Doc Type: Bug Fix
Doc Text:
When the reset() method was called, the number of references to the "Py_None" object was not counted properly. Consequent to this, Python could terminate unexpectedly with the following error message: Fatal Python error: deallocating None Aborted (core dumped) With this update, the underlying source code has been modified to address this issue, and references to the "Py_None" object are now counted as expected.
Clone Of:
: 624580 637911
Environment:
Last Closed: 2011-02-23 12:07:35 UTC
Target Upstream Version:


Attachments
List of packages David Malcolm requested. (45.37 KB, text/plain)
2010-08-16 23:02 UTC, Stephen John Smoogen
One-liner version of the patch, plus a test case (734 bytes, patch)
2010-08-17 13:53 UTC, Dave Malcolm


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2011:0295 | 0 | normal | SHIPPED_LIVE | python-pycurl bug fix update | 2011-02-23 12:07:25 UTC

Description Stephen John Smoogen 2010-08-16 23:02:05 UTC
Created attachment 439024
List of packages David Malcolm requested.

Description of problem:

When doing something like 

reposync -c fedora.repo --repoid=fedora-source-13 --source

After the last package downloads I get a supposed 'core dump'

[fedora-source-13: 9152  of 9152  ] Downloading zzuf-0.13-1.fc13.src.rpm
zzuf-0.13-1.fc13.src.rpm           | 456 kB     00:00     
Fatal Python error: deallocating None
Aborted (core dumped)

No core seems to be dumped though.. not sure why.

Version-Release number of selected component (if applicable):
# rpm -qf /usr/bin/reposync /usr/bin/python
yum-utils-1.1.26-11.el6.noarch
python-2.6.5-3.el6.x86_64


How reproducible:
80% with EL6 beta2. Large downloads trigger it often. With short downloads of 10-30 packages it does not occur.

Comment 2 Dave Malcolm 2010-08-16 23:10:20 UTC
Something is messing up the reference counts for the "None" singleton: either libpython, or an extension module; my guess is an extension module.

This is the tp_dealloc for PyNone_Type:

static void
none_dealloc(PyObject* ignore)
{
    /* This should never get called, but we also don't want to SEGV if
     * we accidentally decref None out of existence.
     */
    Py_FatalError("deallocating None");
}

Comment 3 RHEL Program Management 2010-08-16 23:18:33 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 4 Dave Malcolm 2010-08-16 23:19:32 UTC
The None singleton normally has a fairly large reference count.

For example, starting a python process under gdb shows this refcount at the first interactive prompt:
(gdb) p _Py_NoneStruct 
$2 = {ob_refcnt = 877, ob_type = 0x77f2a0}

What I suspect is happening is that something isn't doing a Py_INCREF when it should on None, and so when you have enough of these in a loop, None ends up with a lower refcount than it should.

At some point you're presumably issuing lots of Py_DECREF on the singleton, and this drives it below one, hitting the Py_FatalError.

My hope is that a backtrace might suggest the rough location of the bug (by analogy between the cleanup vs the setup)
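
The same count can be read from inside Python, without gdb. The following is a minimal sketch rather than anything posted in this thread; the helper name none_refcount is made up for illustration. It assumes CPython. On CPython 3.12 and later None is immortal, so the value stays at a large fixed number and does not move the way it does on the python-2.6.5 discussed here.

```python
import sys


def none_refcount():
    """Return CPython's current reference count for the None singleton."""
    return sys.getrefcount(None)


before = none_refcount()
# Each list slot holds one reference to None. On interpreters without
# immortal objects this raises the count by about 1000.
holders = [None] * 1000
after = none_refcount()

print("before:", before)
print("after: ", after)
```

On an interpreter like the one in this bug, the second number is larger than the first. A missing Py_INCREF in an extension module moves the count the other way, one step per call.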

Comment 5 James Antill 2010-08-17 05:01:24 UTC
Having a look at the obvious candidate, pycurl. I see that the .reset() function does:


    /* Last, free the options */
    for (i = 0; i < OPTIONS_SIZE; i++) {
        if (self->options[i] != NULL) {
            free(self->options[i]);
            self->options[i] = NULL;
        }
    }

    return Py_None;
}

...unlike every other usage which is the (I assume) correct:


    Py_INCREF(Py_None);
    return Py_None;

...urlgrabber calls .reset() for every download.
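
To illustrate why the missing Py_INCREF matters, here is a toy model in pure Python. It is only a sketch under stated assumptions: ToyObject, incref, decref, buggy_reset, fixed_reset and call_and_discard are all hypothetical names, not pycurl or CPython code. It models the caller dropping the returned reference after each call, which is what the interpreter does with an unused return value. The starting count of 877 is the value from the gdb session in comment 4.

```python
class FatalError(Exception):
    """Stands in for Py_FatalError in this toy model."""


class ToyObject:
    """Stand-in for a PyObject with an explicit reference count."""

    def __init__(self, name, refcnt):
        self.name = name
        self.refcnt = refcnt


def incref(obj):
    obj.refcnt += 1


def decref(obj):
    obj.refcnt -= 1
    if obj.refcnt == 0:
        # Mirrors none_dealloc(): the singleton must never be freed.
        raise FatalError("deallocating " + obj.name)


def buggy_reset(none):
    # Hands out a reference that was never counted.
    return none


def fixed_reset(none):
    # Counts the reference before handing it to the caller.
    incref(none)
    return none


def call_and_discard(reset, none, times):
    """Call reset() repeatedly, dropping each returned reference."""
    for _ in range(times):
        result = reset(none)
        decref(result)
    return none.refcnt


print(call_and_discard(fixed_reset, ToyObject("None", 877), 1000))
print(call_and_discard(buggy_reset, ToyObject("None", 877), 876))
```

With the fixed version the count is unchanged after any number of calls. With the buggy version each call costs one reference, so one more call in the second example would raise the toy FatalError.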

Comment 7 Dave Malcolm 2010-08-17 13:25:50 UTC
(In reply to comment #5)
> Having a look at the obvious candidate, pycurl. I see that the .reset()
> function does:
[snip]
> ...urlgrabber calls .reset() for every download.

Thanks!

Yes: a trivial reproducer:

>>> import pycurl
>>> c = pycurl.Curl()
>>> while True: c.reset()
... 
Fatal Python error: deallocating None
Aborted (core dumped)


This is with: python-pycurl-7.19.0-5.el6.x86_64

Comment 9 Dave Malcolm 2010-08-17 13:53:45 UTC
Created attachment 439117
One-liner version of the patch, plus a test case

Comment 10 Dave Malcolm 2010-08-17 14:00:28 UTC
I've attached a one-liner version of the patch, plus a test case.  Without this patch, it's trivial to crash python:
>>> import pycurl
>>> c = pycurl.Curl()
>>> for i in xrange(20000):
...   c.reset()
... 
Fatal Python error: deallocating None
Aborted (core dumped)

With the patch, the test case passes (it takes a minute or so to run; we could tone down the number of iterations from 100000 to 20000 if that's an issue, but I don't think it should be).
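
The attached test case is not reproduced in this report. As a rough sketch of what such a regression check could look like (an assumption, not the attachment itself), the hypothetical helper hammer_reset below calls reset() many times on whatever object it is given and verifies each return value. It is written for Python 3, so it uses range rather than xrange.

```python
def hammer_reset(curl, iterations=100000):
    """Call curl.reset() repeatedly and check that each call returns None.

    On an unpatched pycurl the interpreter aborts partway through the loop
    with "Fatal Python error: deallocating None" instead of returning.
    """
    for i in range(iterations):
        result = curl.reset()
        if result is not None:
            raise AssertionError("reset() returned %r on call %d" % (result, i))
    return iterations
```

With pycurl installed, the check would be run as hammer_reset(pycurl.Curl()). Because the failure mode is a process abort rather than an exception, a test harness would need to run it in a child process to report the failure cleanly.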

Comment 11 Dave Malcolm 2010-08-17 14:19:57 UTC
urlgrabber's PyCurlFileObject calls "reset" in its _do_open method, so the reference issue happens once per file downloaded by yum.

So it appears that a "yum update" or "yum install" that downloads many thousands of packages will trip this error, but N may need to be very high.

Testing "debuginfo-install" with a breakpoint on do_curl_reset shows:
(gdb) p _Py_NoneStruct
$1 = {ob_refcnt = 23146, ob_type = 0x3b73385ca0}

Having said that, I'm not sure about the structure of the reference graph keeping ob_refcnt that high

Comment 12 Stephen John Smoogen 2010-08-17 15:06:30 UTC
Fatal Python error: deallocating None

Program received signal SIGABRT, Aborted.
0x0000003fa2e329c5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-8.el6.x86_64 elfutils-libelf-0.148-1.el6.x86_64 file-libs-5.04-5.el6.x86_64 glib2-2.22.5-5.el6.x86_64 gpgme-1.1.8-3.el6.x86_64 keyutils-libs-1.4-1.el6.x86_64 krb5-libs-1.8.2-2.el6.x86_64 libacl-2.2.49-4.el6.x86_64 libattr-2.4.44-4.el6.x86_64 libcap-2.16-5.2.el6.x86_64 libcom_err-1.41.12-3.el6.x86_64 libcurl-7.19.7-16.el6.x86_64 libgcc-4.4.4-13.el6.x86_64 libgpg-error-1.7-3.el6.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-2.el6.x86_64 libssh2-1.2.2-7.el6.x86_64 libxml2-2.7.6-1.el6.x86_64 lua-5.1.4-4.1.el6.x86_64 nspr-4.8.4-2.el6.x86_64 nss-3.12.6-3.el6.x86_64 nss-softokn-3.12.4-19.el6.x86_64 nss-softokn-freebl-3.12.4-19.el6.x86_64 nss-util-3.12.6-1.el6.x86_64 openldap-2.4.19-15.el6.x86_64 popt-1.13-7.el6.x86_64 pygpgme-0.1-18.20090824bzr68.el6.x86_64 python-pycurl-7.19.0-5.el6.x86_64 rpm-libs-4.8.0-12.el6.x86_64 rpm-python-4.8.0-12.el6.x86_64 xz-libs-4.999.9-0.3.beta.20091007git.el6.x86_64 yum-metadata-parser-1.1.2-14.1.el6.x86_64
(gdb) where
#0  0x0000003fa2e329c5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003fa2e341a5 in abort () at abort.c:92
#2  0x0000003fa52fad7e in Py_FatalError (msg=<value optimized out>) at Python/pythonrun.c:1661
#3  0x0000003fa528124b in PyDict_Clear (op=<value optimized out>) at Objects/dictobject.c:817
#4  0x0000003fa52f2f80 in PyImport_Cleanup () at Python/import.c:516
#5  0x0000003fa52fc3eb in Py_Finalize () at Python/pythonrun.c:438
#6  0x0000003fa5308ce8 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:596
#7  0x0000003fa2e1ec5d in __libc_start_main (main=0x400710 <main>, argc=6, ubp_av=0x7fffffffe658, init=<value optimized out>, 
    fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe648) at libc-start.c:226
#8  0x0000000000400649 in _start ()

Comment 13 Dave Malcolm 2010-08-18 14:32:43 UTC
(In reply to comment #11)
> urlgrabber's PyCurlFileObject calls "reset" in its _do_open method, so the
> reference issue happens once per file downloaded by yum.
> 
> So it appears that a "yum update" or "yum install" that downloads many
> thousands of packages will trip this error, but N may need to be very high.

I attempted to trigger this with a very large "yum install"; this was on a RHEL6 box within our internal "beaker" lab, with lots of repositories.

# yum install * --skip-broken
  (snip)

Transaction Summary
=============================================================================================================================================================
Install    9042 Package(s)
Upgrade       0 Package(s)

Total download size: 7.7 G
Installed size: 25 G

During the downloads:
(gdb) p _Py_NoneStruct 
$1 = {ob_refcnt = 365835, ob_type = 0x3b73385ca0}

at (1287/9042):
(gdb) p _Py_NoneStruct
$1 = {ob_refcnt = 364095, ob_type = 0x3b73385ca0}

at (2186/9042):
(gdb) p _Py_NoneStruct
$2 = {ob_refcnt = 362644, ob_type = 0x3b73385ca0}

at (4143/9042): 
(gdb) p _Py_NoneStruct
$3 = {ob_refcnt = 359517, ob_type = 0x3b73385ca0}

at (5331/9042):
(gdb) p _Py_NoneStruct
$4 = {ob_refcnt = 357365, ob_type = 0x3b73385ca0}

at (8531/9042):
(gdb) p _Py_NoneStruct
$5 = {ob_refcnt = 351284, ob_type = 0x3b73385ca0}

So it's clearly slowly losing references to None.
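
A quick back-of-the-envelope calculation on the samples above gives the loss rate. This is only a sketch: the first sample is left out because it has no package index, and the estimate assumes the loss stays linear and that nothing else changes the count, which is unlikely to hold exactly.

```python
# (packages downloaded, ob_refcnt of None) from the gdb samples above
samples = [
    (1287, 364095),
    (2186, 362644),
    (4143, 359517),
    (5331, 357365),
    (8531, 351284),
]


def loss_per_package(points):
    """Average number of references to None lost per downloaded package."""
    first_pkg, first_ref = points[0]
    last_pkg, last_ref = points[-1]
    return (first_ref - last_ref) / (last_pkg - first_pkg)


rate = loss_per_package(samples)
# Rough number of further packages before the count would reach zero.
remaining_packages = samples[-1][1] / rate

print("references lost per package: %.2f" % rate)
print("packages until zero (rough): %d" % remaining_packages)
```

The rate comes out at roughly 1.77 references per package, so more than one reset() call per package on average, and at that rate it would take on the order of 200,000 further packages to exhaust the count seen in this session.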

As it happened, it failed with some 404s (misconfigured test repository), with
"Error Downloading Packages:" and it managed to exit cleanly (with sane error messages)

I wonder if it's possible to trigger the fatal error when using mirrors: if you have, say, 15 mirrors, most of which are failing, does this dramatically increase the number of PyCurlFileObject._do_open calls (and thus "Curl.reset" calls)?  That might be a situation in which the bug manifests.   Seth: does yum's mirror handling work this way?

Comment 14 seth vidal 2010-08-18 15:24:35 UTC
The mirror handling is all inside of urlgrabber. Yum just passes in the list of mirrors to the mirrorgrab instance for that repo.


It does retry, so if you have a bad mirror you'd get that many more retries.

Comment 17 Jaromir Hradilek 2011-01-27 16:25:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When the reset() method was called, the number of references to the "Py_None" object was not counted properly. Consequent to this, Python could terminate unexpectedly with the following error message:

  Fatal Python error: deallocating None
  Aborted (core dumped)

With this update, the underlying source code has been modified to address this issue, and references to the "Py_None" object are now counted as expected.

Comment 19 errata-xmlrpc 2011-02-23 12:07:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0295.html

