Description of problem: There appears to be a memory leak in cumin-data related to object updates written through to the database. It is unclear at this point whether the leak is due to some kind of object cycle or retained reference in cumin, or whether it is coming from psycopg2. On a test machine, the cumin-data instance that tracks slot objects grew to about 1.6 gigabytes in a week. Ongoing investigation. Version-Release number of selected component (if applicable): I believe this probably exists in 2.0, not certain. To know for sure, we would need a 2.0 box set up to run for an extended time. It exists in current development, and code around database writes has not changed. How reproducible: 100% Steps to Reproduce: 1. Set up cumin pointed at a pool. The more slots in the pool, the better 2. Let cumin run for a long time (days if you can) 3. Check RSS of the cumin instance tracking slots (--section=data.grid-slots). RSS should grow over time, eventually reaching into the gigabytes.
The cursor.mogrify(operation, parameters) method in psycopg2 produces query strings from format/value pairs. The mogrify code is called from cursor.execute(operation, parameters) to produce a sql query before execution. Cumin-data uses this mechanism to fill in "template" queries with values from particular objects. It appears that cursor.mogrify() will leak a reference to a value if that value is referred to more than once in the format string. For example: cursor.mogrify("%(a)s %(a)s", {"a": "foo"}) will incorrectly increment the reference count for the string "foo". When an object is updated, cumin-data updates objects and sample data in the same sql operation string (an update and an insert, separated by a semicolon) passed to cursor.execute(). So whenever an object column is written through to the database and is also tracked in sample data, the value for that column is leaked because the format string contains multiple references to the value. Workaround: It should be possible to work around this leak in cumin-data by issuing the update and insert operations for object data and sample data as two separate operations and therefore two calls to mogrify() (but this still will require only a single commit). This should avoid the same value being referenced more than once in the format string. Ultimately, this needs to be fixed in pyscopg2 (will be submitting a bug against it). See attachment for a reproducer script that shows the bogus reference count. The reproducer requires a running Postgresql server (the script assumes the "cumin" database and user).
Created attachment 539925 [details] Script that shows reference count problem with cursor.mogrify() Illustrates the problem with cursor.mogrify().
Output of the test script. Note in the last fragment that the reference count will be incremented incorrectly for EACH reference beyond the first! b is referenced 3 times, so the count is incremented twice. [tmckay@tmckaylt mint]$ ./psycopg_leak.py sql is %(a)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 6 reference count for vals[b] = 6 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 6 reference count for vals[b] = 6 sql is %(a)s %(a)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 6 reference count for vals[b] = 6 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 7 reference count for vals[b] = 6 sql is %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 7 reference count for vals[b] = 6 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 7 reference count for vals[b] = 7 sql is %(b)s %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 7 reference count for vals[b] = 7 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 7 reference count for vals[b] = 9
hmmm, we package psycopg2 for RHEL (I think jross has built it). Maybe we should attempt a patch ourselves, coincident with submitting a bug to upstream.
Here is a link to the psycopg ticket http://psycopg.lighthouseapp.com/projects/62710/tickets/81-_mogrify-leaks-reference-when-same-dict-value-referenced-multiple-times-with
Created attachment 541037 [details] Modify update.py to use two operations instead of one to work around leak Running this patch as a test. So far so good. This is an alternative to a patched psycopg2. I need to run a scale test to see what the impact is of an extra operation.
Note, archives of notices on releases can be found here: http://initd.org/psycopg/articles/ Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through posts. Sounds to me like we could benefit in general from moving forward to 2.4.x in the near future -- a few leaks have been fixed, and performance in general has been improved. As RHEL moves forward in the future and uses newer versions of Postgres, it makes sense to update pyscopg2 as well. However, aside from maybe having some more reference leaks we haven't discovered yet, there does not seem to be anything requiring us to move to a newer version at this time. A patched 2.0.x would work for us. two phase commits a reference leak or two (or three) better handling of a variety of date/time objects better handling for decimals optimizations for fewer copies, superfluous operations fix leak related to libpq (part of postgresql-devel and/or postgresql-debuginfo) doc updates asynchronous queries support for Postgresql 8.4 and 9.0
That was psycopg 2.0 to 2.4.2, not Python (In reply to comment #7) > > Here's a summary of changes from Python 2.0 to 2.4.2 (currrent version, jumped > from 2.0 straight to 2.2, not sure what happened to 2.1) after reading through > posts.
fyi, the patch to cumin-data appears to have taken care of the leak. There may still be a slow leak somewhere else, however -- hard to tell whether the current increase is do to normal operation and changing grid landscape or a real leak. Will run on another machine with latest psycopg2 release (there have been other leak fixes since 2.0.13) Waiting for a response from upstream on the psycopg patch for the mogrify issue. Abbreviated ps output, with RSS (unit is k) 12/5/2011 at 12:30 pm 34508 /usr/bin/cumin-data --section=data.grid --daemon --heapy 36832 /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy 101904 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy 32300 /usr/bin/cumin-data --section=data.sesame --daemon --heapy Output on 12/6/2011 at about 10am. 34580 /usr/bin/cumin-data --section=data.grid --daemon --heapy 36964 /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy 105428 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy 32308 /usr/bin/cumin-data --section=data.sesame --daemon --heapy
Based on cumin-data stats output, 2 queries vs 1 query does not make a significant difference. Still looks like 0.009-0.010 seconds to process an update. This seems reasonable, since the difference is processing "update; insert" versus "update" followed by "insert" -- Postgresql I'm guessing treats the first as two separate queries anyway. The difference would be in the overhead in psycopg2 in that case. I would say that splitting the query is a viable patch for the leak that was found. Update on cumin-data leakage (with psycopg2 patched) New numbers on 12/8/2011, a few scale tests with lots of submissions and creation/deletion of groups of slots on afternoon of 12/7 39556 /usr/bin/cumin-data --section=data.grid --daemon --heapy 129644 /usr/bin/cumin-data --section=data.grid-slots --daemon --heapy 112624 /usr/bin/cumin-data --section=data.grid-submissions --daemon --heapy 37504 /usr/bin/cumin-data --section=data.sesame --daemon --heapy Continue to investigate whether 2.4.2 will give better numbers...
fyi, we will attempt to correct this with a patched version of psycopg2
Created attachment 545841 [details] Patch applied to cursor_type.c in python-psycopg2-2.0.14-3 Changed the check for an existing value to use PyDict_Contains() instead of creating a new temporary object with PyObject_GetItem(), this removes the "item" variable from the code. Added the reference decrement for "value" when the the value is already in the temporary dictionary from a previous reference.
Comment on attachment 541037 [details] Modify update.py to use two operations instead of one to work around leak This alternative is no longer needed given the patch to psycopg2
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause A bug in python-pyscopg2-2.0.14 causes a reference leak when cumin-data updates objects in the database. Specifically, the bug manifests when cursor.mogrify(operations, params) or cursor.execute(operations, params) is called and the 'operations' string references the same value from 'params' more than once. Consequence Long-running instances of cumin can leak significant amounts of memory. Fix The _mogrify() routine in cursor_type.c was patched in python-pscyopg2-2.0.14-3. Result This reference leak has been repaired and no longer is an issue when cumin-data updates obejcts.
Reproduced: RHEL 5.7 i386/x86_64, MRG 2.1.0 cumin-0.1.5098-2.el5 python-psycopg2-2.0.14-2.el5 sql is %(a)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 5 reference count for vals[b] = 5 sql is %(a)s %(a)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 6 reference count for vals[b] = 5 sql is %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 6 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 6 reference count for vals[b] = 6 sql is %(b)s %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 6 reference count for vals[b] = 6 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 6 reference count for vals[b] = 8 Verified: RHEL 5.7 i386/x86_64, MRG 2.1.x cumin-0.1.5180-1.el5 python-psycopg2-2.0.14-3.el5 sql is %(a)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 5 reference count for vals[b] = 5 sql is %(a)s %(a)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 5 reference count for vals[b] = 5 sql is %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 5 reference count for vals[b] = 5 sql is %(b)s %(b)s %(b)s vals {'a': 'unique string', 'b': 'some other string'} reference count for vals[a] = 5 reference count for vals[b] = 5 after call to cursor.mogrify(sql, vals)... reference count for vals[a] = 5 reference count for vals[b] = 5 VERIFIED RHEL 6 part is tracked in Bug 765998
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0100.html