Bug 711095
Summary: | Python garbage collector assertion failure running pq_execute (cursorObject refcount issue) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Red Hat Case Diagnostics <case-diagnostics> | ||||||||||
Component: | python-psycopg2 | Assignee: | Tom Lane <tgl> | ||||||||||
Status: | CLOSED ERRATA | QA Contact: | qe-baseos-daemons | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | high | ||||||||||||
Version: | 6.0 | CC: | azelinka, cww, daniele.varrazzo, dmalcolm, hhorak, jscotka, julian, maor.conforti, psklenar, rdassen, syeghiay, ychavan | ||||||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||||||
Target Release: | --- | ||||||||||||
Hardware: | x86_64 | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | abrt_hash:f8a305d69951574f208e836a452f63fe1020d300 | ||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
Doc Text: |
Previously, if a second thread in a single application triggered the Python garbage collection while a copy operation was in progress, the copy operation terminated unexpectedly. With this update, appropriate object reference count adjustments have been added to the code, and this bug no longer occurs.
|
Story Points: | --- | ||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2013-02-21 08:41:25 UTC | Type: | --- | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 720306 | ||||||||||||
Attachments: |
|
Description
Red Hat Case Diagnostics
2011-06-06 14:19:45 UTC
Created attachment 503247 [details]
File: backtrace
Thanks for filing this bug report. How often is this problem occurring? I believe the crash is happening due to a bug in python-psycopg2. I am about to attach a proposed patch to that package which I believe will fix it. I don't have a minimal reproducer yet. Technical analysis follows: Based on reviewing the coredump, the crash is happening in thread #1, during one of python's periodic invocations of its garbage collector. The assertion failure is that an object has a lower reference count than expected (based on the reference-owning objects pointing to it). This object is a StringIO object, containing a series of lines containing dates and numbers (I'm deliberately not including it in a public bz comment, in case it contains private information). Meanwhile thread #4 is within python-psycopg2's pq_execute function executing a query. Specifically, it is within the function "psyco_curs_copy_from" in psycopg2-2.0.13/psycopg/cursor_type.c I notice that within thread #4, pcurs->copyfile contains the textual data seen in the crash in thread #1. What's happened is that the StringIO instance containing the query has too low a reference count. Further, pyscopg2's cursorObject participates in the Python garbage collector: psycopg2-2.0.13/psycopg/cursor_type.c: cursor_traverse has: Py_VISIT(self->copyfile); It acquires PyObject *file via: if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O&s|ss" CONV_CODE_PY_SSIZE_T "O", kwlist, _psyco_curs_has_read_check, &file, &table_name, &sep, &null, &bufsize, &columns)) i.e. from "O& (object) [converter, anything]" with converter "_psyco_curs_has_read_check": Looking at _psyco_curs_has_read_check, it has this code and comment: /* It's OK to store a borrowed reference, because it is only held for * the duration of psyco_curs_copy_from. */ *((PyObject**)var) = o; return 1; This is thus storing a borrowed reference to the StringIO object into the psyco's cursorObject without INCREFing it. This reference is is visible to the garbage collector. The comment in the code above is incorrect: within the call to pq_execute, there is an invocation of the Py_BEGIN_ALLOW_THREADS macro. This allows the GIL to transition to another thread, and it seems that this is what happened in this case (within the coredump, that thread is waiting at the subsequent Py_END_ALLOW_THREADS macro, waiting to reacquire the GIL). When the GC runs in the other thread, it detects that the refcount is too low, and bails out with an assertion failure. (I believe that if assertions were disabled, it could try to free up the memory, so we'd be seeing a different kind of crash) I believe the fix will be to add INCREF/DECREF pairs around the call to pq_execute within psyco_curs_copy_from, and I'll try to write a patch that does so. I don't have a minimal reproducer yet. I checked that latest upstream release of the code (2.4.1), and I believe this bug is still present there. Created attachment 503316 [details]
Proposed patch
I'm not really qualified to evaluate this. I'll see if I can get a comment from upstream. Created attachment 503327 [details]
Minimal reproducer
With the attached script, I'm reliably able to reproduce the error:
python: Modules/gcmodule.c:277: visit_decref: Assertion `gc->gc.gc_refs != 0' failed.
on my test box (with the same versions as in the report).
(In reply to comment #14) > Created attachment 503327 [details] > With the attached script, I'm reliably able to reproduce the error: Having said that, getting it to happen may be somewhat timing dependent (this is a bad interaction between two threads). I had to tweak the delay in the loop in the GCThread (currently 0.1) before I could get it to occur, this was with the postgres server on the same host as the script, so YMMV. The patch attached in comment #11 fixes the crash seen by running comment #14, so I believe we have a fix for this issue. (In reply to comment #14) > Created attachment 503327 [details] > Minimal reproducer > > With the attached script, I'm reliably able to reproduce the error: > python: Modules/gcmodule.c:277: visit_decref: Assertion `gc->gc.gc_refs != 0' > failed. > > on my test box (with the same versions as in the report). I've also now stepped through the crash in gdb, and this reproducer is indeed crashing python in the same way as in the original coredump. (gdb) t a a py-bt Thread 2 (Thread 0x7fffefd1a700 (LWP 9729)): #9 file 'test.py', in 'run' #12 file '/usr/lib64/python2.6/threading.py', in '__bootstrap_inner' #15 file '/usr/lib64/python2.6/threading.py', in '__bootstrap' Thread 1 (Thread 0x7ffff7fed700 (LWP 9727)): #7 file 'test.py', in '<module>' Thread 2 is garbage-collecting (and crashing with that assertion failure) Thread 1 is waiting to reacquire the GIL, within pq_execute, within psyco_curs_copy_from. For reference, the upstream bug Tom opened appears to be here: http://psycopg.lighthouseapp.com/projects/62710/tickets/58 Note to psycopg2 upstream folks: to reproduce the C level assertion failure in gcmodule.c, you need a build of Python with assertions enabled. Typically for an upstream build of python, this means configuring --with-pydebug. Thank you for the report and the patch. I have verified the bug and the fix. Note that the functions copy_to and copy_expert were affected by the same bug: I have fixed them as well. The devel branch of my repos (https://github.com/dvarrazzo/psycopg) contains the patches. They will be released in the upcoming release 2.4.2. Daniele: Thanks for the info. Your upstream fix appears to be: https://github.com/dvarrazzo/psycopg/commit/e9a485e30bf74c598536417819f196476e0dc406 Looking there, I think that commit may introduce a bug within psyco_curs_copy_expert, within the error handling paths. Specifically, the "file" variable is set using PyArg_ParseTupleAndKeywords with code "O", which gives a borrowed reference, not modifying the refcount. If an error occurs, and one of the first two "goto exit;" clauses is run, then my reading of that code is that you have a Py_XDECREF(file), without a matching Py_INCREF(file), so "file" loses a reference, which may lead to it being prematirely deallocated. Am I reading that correctly? I think the Py_INCREF(file); within in that function needs to be moved higher up, to before the first "goto exit" e.g. to directly below this comment: /* Any failure from here forward should 'goto exit' rather than 'return NULL' directly. */ The changes to the other two functions (copy_from/copy_to) look correct to me. [FWIW, I'm working on a static analysis tool to try to make it easier to deal with this kind of thing in the future, see: https://fedoraproject.org/wiki/Features/StaticAnalysisOfCPythonExtensions But that code isn't ready yet] (In reply to comment #24) > Looking there, I think that commit may introduce a bug within > psyco_curs_copy_expert, within the error handling paths. Uhm... right. I'd implemented a pattern typical of an object owned by the function, didn't spot that file was coming from the parsed args. I've modified copy_expert to make it more similar to copy_from/to. https://github.com/dvarrazzo/psycopg/commit/816b5dda33a8914532da160bded65a6f3afb9c9b Thank you again! > [FWIW, I'm working on a static analysis tool to try to make it easier to deal > with this kind of thing in the future I would *love* to use such a tool! (In reply to comment #25) > (In reply to comment #24) > > > Looking there, I think that commit may introduce a bug within > > psyco_curs_copy_expert, within the error handling paths. > > Uhm... right. I'd implemented a pattern typical of an object owned by the > function, didn't spot that file was coming from the parsed args. > > I've modified copy_expert to make it more similar to copy_from/to. > > https://github.com/dvarrazzo/psycopg/commit/816b5dda33a8914532da160bded65a6f3afb9c9b FWIW, I doublechecked that new implementation of psyco_curs_copy_expert, and it looks correct to me also. Thanks! Question: Do you have an idea why I see this error only on with the RHEL6 installation of Python? We've tried it on our own build of Python v2.6.5 + Psycopg2 v.2.0.13 (same versions as what comes with RHEL6), on the same RHEL6 machine, and this error does not happen. (It also didn't happen on CentOS 5.5) Any ideas? (In reply to comment #27) These assertions are typically disabled when building python, but, for better or worse, the build flags we use mean that they are enabled in RHEL 6's python build. There was some discussion within bug 614680 about changing this in RHEL 6.1. As noted in that bug, doing so could well turn errors of this kind into segfaults rather than asserts, hence in that bug I chose to keep the extra error-checking they provide, whilst improving the error messages (in python-2.6.6-5.el6) so that they indicate which object has reference-count issues. Whether or not the assertion is protecting against a segfault varies from case to case. My (tentative) reading of this case is that (for this specific case) the assertion is overzealous. The StringIO instance has an ob_refcnt of 2: presumably 2 frame's worth of dictionaries of local variables (the code is too optimized to directly verify this). The third reference is the one within the cursor_object within psyco_curs_copy_from. If those other two references were garbage-collected away by the other thread, then the cursorObject could have a borrowed reference to the StringIO, but the StringIO could have been deallocated from under it. However, I don't believe this would have been a problem, for two reasons: - assuming that the other two referrers are frame locals() dictionaries, then these will outlive the call to psyco_curs_copy_from - upon reacquisition of the GIL, psyco_curs_copy_from makes no further use of the value, immediately setting it back to NULL. Having said that, the benign nature here is merely due to the specifics of the user code and of psycopg2 in this case: invariants are being violated, but it happens to not matter. There are plenty of other situations in which similar bugs could lead to crashes of /usr/bin/python, and such crashes would be significantly less predictable and more obscure. There is perhaps an argument for reassessing bug 614680 (e.g. turning off the asserts, or making this a command-line flag or somesuch), but for now, the fix is to patch python-psycopg2 (using the patches referred to in comment #24 and comment #25). Hopefully the above makes sense. (In reply to comment #29) > There is perhaps an argument for reassessing bug 614680 (e.g. turning off the > asserts, or making this a command-line flag or somesuch), but for now, the fix > is to patch python-psycopg2 (using the patches referred to in comment #24 and > comment #25). I quite agree that turning off the assertions would be an inappropriate "fix". Created attachment 503578 [details] Backport of the two upstream patches to the 2.0.13 in RHEL6 Candidate patch: the attached patch is a combination of: https://github.com/dvarrazzo/psycopg/commit/e9a485e30bf74c598536417819f196476e0dc406 followed by: https://github.com/dvarrazzo/psycopg/commit/679af4a97528c140ac8e9cd85598f717b3f5cfb9 fixed up to apply cleanly to the 2.0.13 within RHEL. [It does not contain the change to tests/test_copy.py from 679af4a97528c140ac8e9cd85598f717b3f5cfb9 as that entire file only appeared after 2.0.13] As a smoketest, I have rebuilt RHEL6's python-psycopg2-2.0.13-2.el6 adding this patch, and tested with the psycopg2-2.0.13/scripts/ticket58.py (see the prepped source tree). Each of the three test cases within that script ("copy_from", "copy_to", and "copy_expert") fail with this error: python: Modules/gcmodule.c:277: visit_decref: Assertion `gc->gc.gc_refs != 0' failed. Aborted (core dumped) when run against the old package (with a suitably-configured postgres DB, see the comment in that file). All of them successfully run to completion with the patched version of the package. I have merely smoketested the proposed test cases. I have not tested other functionality of the package. BTW, I've pushed psycopg 2.4.2 into Fedora rawhide, so this is fixed over there. This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative. Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, if a second thread in a single application triggered the Python garbage collection while a copy operation was in progress, the copy operation terminated unexpectedly. With this update, appropriate object reference count adjustments have been added to the code, and this bug no longer occurs. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0327.html |