Bug 759200 - Memory leak in python-psycopg2
Memory leak in python-psycopg2
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: cumin (Show other bugs)
Development
Unspecified Unspecified
high Severity high
: 2.1.1
: ---
Assigned To: Trevor McKay
Stanislav Graf
:
Depends On:
Blocks: 765607
  Show dependency treegraph
 
Reported: 2011-12-01 11:35 EST by Trevor McKay
Modified: 2012-02-06 13:18 EST (History)
5 users (show)

See Also:
Fixed In Version: python-psycopg2-2.0.14-3.el5
Doc Type: Bug Fix
Doc Text:
Cause A bug in python-pyscopg2-2.0.14 causes a reference leak when cumin-data updates objects in the database. Specifically, the bug manifests when cursor.mogrify(operations, params) or cursor.execute(operations, params) is called and the 'operations' string references the same value from 'params' more than once. Consequence Long-running instances of cumin can leak significant amounts of memory. Fix The _mogrify() routine in cursor_type.c was patched in python-pscyopg2-2.0.14-3. Result This reference leak has been repaired and no longer is an issue when cumin-data updates obejcts.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-06 13:18:25 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Script that shows reference count problem with cursor.mogrify() (905 bytes, application/octet-stream)
2011-12-03 01:31 EST, Trevor McKay
no flags Details
Modify update.py to use two operations instead of one to work around leak (2.54 KB, text/plain)
2011-12-05 12:57 EST, Trevor McKay
no flags Details
Patch applied to cursor_type.c in python-psycopg2-2.0.14-3 (1020 bytes, application/octet-stream)
2011-12-12 12:42 EST, Trevor McKay
no flags Details

  None (edit)
Description Trevor McKay 2011-12-01 11:35:13 EST
Description of problem:

There appears to be a memory leak in cumin-data related to object updates written through to the database.  It is unclear at this point whether the leak is due to some kind of object cycle or retained reference in cumin, or whether it is coming from psycopg2.

On a test machine, the cumin-data instance that tracks slot objects grew to about 1.6 gigabytes in a week.

Ongoing investigation.

Version-Release number of selected component (if applicable):

I believe this probably exists in 2.0, not certain.  To know for sure, we would need a 2.0 box set up to run for an extended time.  It exists in current development, and code around database writes has not changed.

How reproducible:

100%

Steps to Reproduce:
1.  Set up cumin pointed at a pool.  The more slots in the pool, the better
2.  Let cumin run for a long time (days if you can)
3.  Check RSS of the cumin instance tracking slots (--section=data.grid-slots).  RSS should grow over time, eventually reaching into the gigabytes.
Comment 1 Trevor McKay 2011-12-03 01:29:03 EST
The cursor.mogrify(operation, parameters) method in psycopg2 produces query strings from format/value pairs.  The mogrify code is called from cursor.execute(operation, parameters) to produce a sql query before execution. Cumin-data uses this mechanism to fill in "template" queries with values from particular objects.

It appears that cursor.mogrify() will leak a reference to a value if that value is referred to more than once in the format string.  For example:

cursor.mogrify("%(a)s %(a)s", {"a": "foo"})

will incorrectly increment the reference count for the string "foo".

When an object is updated, cumin-data updates objects and sample data in the same sql operation string (an update and an insert, separated by a semicolon) passed to cursor.execute().  So whenever an object column is written through to the database and is also tracked in sample data, the value for that column is leaked because the format string contains multiple references to the value.

Workaround:

It should be possible to work around this leak in cumin-data by issuing the update and insert operations for object data and sample data as two separate operations and therefore two calls to mogrify() (but this still will require only a single commit).  This should avoid the same value being referenced more than once in the format string.

Ultimately, this needs to be fixed in pyscopg2 (will be submitting a bug against it).

See attachment for a reproducer script that shows the bogus reference count.  The reproducer requires a running Postgresql server (the script assumes the "cumin" database and user).
Comment 2 Trevor McKay 2011-12-03 01:31:23 EST
Created attachment 539925 [details]
Script that shows reference count problem with cursor.mogrify()

Illustrates the problem with cursor.mogrify().
Comment 3 Trevor McKay 2011-12-03 01:36:53 EST
Output of the test script.  Note in the last fragment that the reference count will be incremented incorrectly for EACH reference beyond the first!  b is referenced 3 times, so the count is incremented twice.

[tmckay@tmckaylt mint]$ ./psycopg_leak.py
sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 6

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 6

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 7
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 7

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 7
reference count for vals[b] = 7
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 9
Comment 4 Trevor McKay 2011-12-03 12:10:08 EST
hmmm,

  we package psycopg2 for RHEL (I think jross has built it).  Maybe we should attempt a patch ourselves, coincident with submitting a bug to upstream.
Comment 6 Trevor McKay 2011-12-05 12:57:33 EST
Created attachment 541037 [details]
Modify update.py to use two operations instead of one to work around leak

Running this patch as a test.  So far so good.  This is an alternative to a patched psycopg2.  I need to run a scale test to see what the impact is of an extra operation.
Comment 7 Trevor McKay 2011-12-05 14:15:56 EST
Note, archives of notices on releases can be found here:

http://initd.org/psycopg/articles/

Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through posts.  

Sounds to me like we could benefit in general from moving forward to 2.4.x in the near future -- a few leaks have been fixed, and performance in general has been improved.  As RHEL moves forward in the future and uses newer versions of Postgres, it makes sense to update pyscopg2 as well.  

However, aside from maybe having some more reference leaks we haven't discovered yet, there does not seem to be anything requiring us to move to a newer version at this time.  A patched 2.0.x would work for us.


two phase commits
a reference leak or two (or three)
better handling of a variety of date/time objects
better handling for decimals
optimizations for fewer copies, superfluous operations
fix leak related to libpq (part of postgresql-devel and/or postgresql-debuginfo)
doc updates
asynchronous queries
support for Postgresql 8.4 and 9.0
Comment 8 Trevor McKay 2011-12-05 14:16:54 EST
That was psycopg 2.0 to 2.4.2, not Python

(In reply to comment #7)
> 
> Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped
> from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through
> posts.
Comment 9 Trevor McKay 2011-12-06 10:00:23 EST
fyi, the patch to cumin-data appears to have taken care of the leak.  There may still be a slow leak somewhere else, however -- hard to tell whether the current increase is do to normal operation and changing grid landscape or a real leak.

Will run on another machine with latest psycopg2 release (there have been other leak fixes since 2.0.13)

Waiting for a response from upstream on the psycopg patch for the mogrify issue.

Abbreviated ps output, with RSS (unit is k) 12/5/2011 at 12:30 pm

34508  /usr/bin/cumin-data --section=data.grid --daemon --heapy
36832  /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
101904 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
32300  /usr/bin/cumin-data --section=data.sesame --daemon --heapy

Output on 12/6/2011 at about 10am.

34580  /usr/bin/cumin-data --section=data.grid --daemon --heapy
36964  /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
105428 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
32308  /usr/bin/cumin-data --section=data.sesame --daemon --heapy
Comment 10 Trevor McKay 2011-12-08 09:49:18 EST
Based on cumin-data stats output, 2 queries vs 1 query does not make a significant difference.  Still looks like 0.009-0.010 seconds to process an update.  This seems reasonable, since the difference is processing "update; insert" versus "update" followed by "insert" -- Postgresql I'm guessing treats the first as two separate queries anyway.  The difference would be in the overhead in psycopg2 in that case.

I would say that splitting the query is a viable patch for the leak that was found.

Update on cumin-data leakage (with psycopg2 patched)

New numbers on 12/8/2011, a few scale tests with lots of submissions and creation/deletion of groups of slots on afternoon of 12/7

39556  /usr/bin/cumin-data --section=data.grid --daemon --heapy
129644 /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
112624 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
37504  /usr/bin/cumin-data --section=data.sesame --daemon --heapy

Continue to investigate whether 2.4.2 will give better numbers...
Comment 11 Trevor McKay 2011-12-09 15:15:50 EST
fyi, we will attempt to correct this with a patched version of psycopg2
Comment 12 Trevor McKay 2011-12-12 12:42:04 EST
Created attachment 545841 [details]
Patch applied to cursor_type.c in python-psycopg2-2.0.14-3

Changed the check for an existing value to use PyDict_Contains() instead of creating a new temporary object with PyObject_GetItem(), this removes the "item" variable from the code.  

Added the reference decrement for "value" when the the value is already in the temporary dictionary from a previous reference.
Comment 13 Trevor McKay 2011-12-12 12:43:42 EST
Comment on attachment 541037 [details]
Modify update.py to use two operations instead of one to work around leak

This alternative is no longer needed given the patch to psycopg2
Comment 14 Trevor McKay 2011-12-12 12:59:37 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    A bug in python-pyscopg2-2.0.14 causes a reference leak when cumin-data updates objects in the database.  Specifically, the bug manifests when cursor.mogrify(operations, params) or cursor.execute(operations, params) is called and the 'operations' string references the same value from 'params' more than once.

Consequence
    Long-running instances of cumin can leak significant amounts of memory.

Fix
    The _mogrify() routine in cursor_type.c was patched in python-pscyopg2-2.0.14-3.

Result
    This reference leak has been repaired and no longer is an issue when cumin-data updates obejcts.
Comment 16 Stanislav Graf 2012-01-03 13:01:52 EST
Reproduced:
RHEL 5.7 i386/x86_64, MRG 2.1.0
cumin-0.1.5098-2.el5
python-psycopg2-2.0.14-2.el5

sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 5

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 6

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 8

Verified:
RHEL 5.7 i386/x86_64, MRG 2.1.x
cumin-0.1.5180-1.el5
python-psycopg2-2.0.14-3.el5

sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

VERIFIED


RHEL 6 part is tracked in Bug 765998
Comment 19 errata-xmlrpc 2012-02-06 13:18:25 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html

Note You need to log in before you can comment on or make changes to this bug.