Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 759200

Summary: Memory leak in python-psycopg2
Product: Red Hat Enterprise MRG Reporter: Trevor McKay <tmckay>
Component: cuminAssignee: Trevor McKay <tmckay>
Status: CLOSED ERRATA QA Contact: Stanislav Graf <sgraf>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: iboverma, jross, matt, mkudlej, sgraf
Target Milestone: 2.1.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-psycopg2-2.0.14-3.el5 Doc Type: Bug Fix
Doc Text:
Cause A bug in python-pyscopg2-2.0.14 causes a reference leak when cumin-data updates objects in the database. Specifically, the bug manifests when cursor.mogrify(operations, params) or cursor.execute(operations, params) is called and the 'operations' string references the same value from 'params' more than once. Consequence Long-running instances of cumin can leak significant amounts of memory. Fix The _mogrify() routine in cursor_type.c was patched in python-pscyopg2-2.0.14-3. Result This reference leak has been repaired and no longer is an issue when cumin-data updates obejcts.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-06 18:18:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 765607    
Attachments:
Description Flags
Script that shows reference count problem with cursor.mogrify()
none
Modify update.py to use two operations instead of one to work around leak
none
Patch applied to cursor_type.c in python-psycopg2-2.0.14-3 none

Description Trevor McKay 2011-12-01 16:35:13 UTC
Description of problem:

There appears to be a memory leak in cumin-data related to object updates written through to the database.  It is unclear at this point whether the leak is due to some kind of object cycle or retained reference in cumin, or whether it is coming from psycopg2.

On a test machine, the cumin-data instance that tracks slot objects grew to about 1.6 gigabytes in a week.

Ongoing investigation.

Version-Release number of selected component (if applicable):

I believe this probably exists in 2.0, not certain.  To know for sure, we would need a 2.0 box set up to run for an extended time.  It exists in current development, and code around database writes has not changed.

How reproducible:

100%

Steps to Reproduce:
1.  Set up cumin pointed at a pool.  The more slots in the pool, the better
2.  Let cumin run for a long time (days if you can)
3.  Check RSS of the cumin instance tracking slots (--section=data.grid-slots).  RSS should grow over time, eventually reaching into the gigabytes.

Comment 1 Trevor McKay 2011-12-03 06:29:03 UTC
The cursor.mogrify(operation, parameters) method in psycopg2 produces query strings from format/value pairs.  The mogrify code is called from cursor.execute(operation, parameters) to produce a sql query before execution. Cumin-data uses this mechanism to fill in "template" queries with values from particular objects.

It appears that cursor.mogrify() will leak a reference to a value if that value is referred to more than once in the format string.  For example:

cursor.mogrify("%(a)s %(a)s", {"a": "foo"})

will incorrectly increment the reference count for the string "foo".

When an object is updated, cumin-data updates objects and sample data in the same sql operation string (an update and an insert, separated by a semicolon) passed to cursor.execute().  So whenever an object column is written through to the database and is also tracked in sample data, the value for that column is leaked because the format string contains multiple references to the value.

Workaround:

It should be possible to work around this leak in cumin-data by issuing the update and insert operations for object data and sample data as two separate operations and therefore two calls to mogrify() (but this still will require only a single commit).  This should avoid the same value being referenced more than once in the format string.

Ultimately, this needs to be fixed in pyscopg2 (will be submitting a bug against it).

See attachment for a reproducer script that shows the bogus reference count.  The reproducer requires a running Postgresql server (the script assumes the "cumin" database and user).

Comment 2 Trevor McKay 2011-12-03 06:31:23 UTC
Created attachment 539925 [details]
Script that shows reference count problem with cursor.mogrify()

Illustrates the problem with cursor.mogrify().

Comment 3 Trevor McKay 2011-12-03 06:36:53 UTC
Output of the test script.  Note in the last fragment that the reference count will be incremented incorrectly for EACH reference beyond the first!  b is referenced 3 times, so the count is incremented twice.

[tmckay@tmckaylt mint]$ ./psycopg_leak.py
sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 6

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 6

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 7
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 7

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 7
reference count for vals[b] = 7
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 7
reference count for vals[b] = 9

Comment 4 Trevor McKay 2011-12-03 17:10:08 UTC
hmmm,

  we package psycopg2 for RHEL (I think jross has built it).  Maybe we should attempt a patch ourselves, coincident with submitting a bug to upstream.

Comment 6 Trevor McKay 2011-12-05 17:57:33 UTC
Created attachment 541037 [details]
Modify update.py to use two operations instead of one to work around leak

Running this patch as a test.  So far so good.  This is an alternative to a patched psycopg2.  I need to run a scale test to see what the impact is of an extra operation.

Comment 7 Trevor McKay 2011-12-05 19:15:56 UTC
Note, archives of notices on releases can be found here:

http://initd.org/psycopg/articles/

Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through posts.  

Sounds to me like we could benefit in general from moving forward to 2.4.x in the near future -- a few leaks have been fixed, and performance in general has been improved.  As RHEL moves forward in the future and uses newer versions of Postgres, it makes sense to update pyscopg2 as well.  

However, aside from maybe having some more reference leaks we haven't discovered yet, there does not seem to be anything requiring us to move to a newer version at this time.  A patched 2.0.x would work for us.


two phase commits
a reference leak or two (or three)
better handling of a variety of date/time objects
better handling for decimals
optimizations for fewer copies, superfluous operations
fix leak related to libpq (part of postgresql-devel and/or postgresql-debuginfo)
doc updates
asynchronous queries
support for Postgresql 8.4 and 9.0

Comment 8 Trevor McKay 2011-12-05 19:16:54 UTC
That was psycopg 2.0 to 2.4.2, not Python

(In reply to comment #7)
> 
> Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped
> from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through
> posts.

Comment 9 Trevor McKay 2011-12-06 15:00:23 UTC
fyi, the patch to cumin-data appears to have taken care of the leak.  There may still be a slow leak somewhere else, however -- hard to tell whether the current increase is do to normal operation and changing grid landscape or a real leak.

Will run on another machine with latest psycopg2 release (there have been other leak fixes since 2.0.13)

Waiting for a response from upstream on the psycopg patch for the mogrify issue.

Abbreviated ps output, with RSS (unit is k) 12/5/2011 at 12:30 pm

34508  /usr/bin/cumin-data --section=data.grid --daemon --heapy
36832  /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
101904 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
32300  /usr/bin/cumin-data --section=data.sesame --daemon --heapy

Output on 12/6/2011 at about 10am.

34580  /usr/bin/cumin-data --section=data.grid --daemon --heapy
36964  /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
105428 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
32308  /usr/bin/cumin-data --section=data.sesame --daemon --heapy

Comment 10 Trevor McKay 2011-12-08 14:49:18 UTC
Based on cumin-data stats output, 2 queries vs 1 query does not make a significant difference.  Still looks like 0.009-0.010 seconds to process an update.  This seems reasonable, since the difference is processing "update; insert" versus "update" followed by "insert" -- Postgresql I'm guessing treats the first as two separate queries anyway.  The difference would be in the overhead in psycopg2 in that case.

I would say that splitting the query is a viable patch for the leak that was found.

Update on cumin-data leakage (with psycopg2 patched)

New numbers on 12/8/2011, a few scale tests with lots of submissions and creation/deletion of groups of slots on afternoon of 12/7

39556  /usr/bin/cumin-data --section=data.grid --daemon --heapy
129644 /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy
112624 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy
37504  /usr/bin/cumin-data --section=data.sesame --daemon --heapy

Continue to investigate whether 2.4.2 will give better numbers...

Comment 11 Trevor McKay 2011-12-09 20:15:50 UTC
fyi, we will attempt to correct this with a patched version of psycopg2

Comment 12 Trevor McKay 2011-12-12 17:42:04 UTC
Created attachment 545841 [details]
Patch applied to cursor_type.c in python-psycopg2-2.0.14-3

Changed the check for an existing value to use PyDict_Contains() instead of creating a new temporary object with PyObject_GetItem(), this removes the "item" variable from the code.  

Added the reference decrement for "value" when the the value is already in the temporary dictionary from a previous reference.

Comment 13 Trevor McKay 2011-12-12 17:43:42 UTC
Comment on attachment 541037 [details]
Modify update.py to use two operations instead of one to work around leak

This alternative is no longer needed given the patch to psycopg2

Comment 14 Trevor McKay 2011-12-12 17:59:37 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    A bug in python-pyscopg2-2.0.14 causes a reference leak when cumin-data updates objects in the database.  Specifically, the bug manifests when cursor.mogrify(operations, params) or cursor.execute(operations, params) is called and the 'operations' string references the same value from 'params' more than once.

Consequence
    Long-running instances of cumin can leak significant amounts of memory.

Fix
    The _mogrify() routine in cursor_type.c was patched in python-pscyopg2-2.0.14-3.

Result
    This reference leak has been repaired and no longer is an issue when cumin-data updates obejcts.

Comment 16 Stanislav Graf 2012-01-03 18:01:52 UTC
Reproduced:
RHEL 5.7 i386/x86_64, MRG 2.1.0
cumin-0.1.5098-2.el5
python-psycopg2-2.0.14-2.el5

sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 5

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 6

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 6
reference count for vals[b] = 6
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 6
reference count for vals[b] = 8

Verified:
RHEL 5.7 i386/x86_64, MRG 2.1.x
cumin-0.1.5180-1.el5
python-psycopg2-2.0.14-3.el5

sql is %(a)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(a)s %(a)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

sql is %(b)s %(b)s %(b)s
vals {'a': 'unique string', 'b': 'some other string'}
reference count for vals[a] = 5
reference count for vals[b] = 5
after call to cursor.mogrify(sql, vals)...
reference count for vals[a] = 5
reference count for vals[b] = 5

VERIFIED


RHEL 6 part is tracked in Bug 765998

Comment 19 errata-xmlrpc 2012-02-06 18:18:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html