Bug 584990

Summary: Reference counting problems?
Product: [Fedora] Fedora Reporter: Miloslav Trmač <mitr>
Component: python-nssAssignee: John Dennis <jdennis>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 12CC: dcantrell, dmalcolm, jdennis
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: python-nss-0.9-8.fc12 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-09-08 02:27:12 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Proposed reference counting fixes
none
A few more refcounting changes none

Description Miloslav Trmač 2010-04-22 20:32:18 UTC
python-nss-0.8-1.fc12.x86_64

Just recording this in case you can notice a problem easily - from time to time I see weird errors, with some references to Python reference counts.

There are two processes using SSL sockets to communicate; the one with the client socket reports:

exception in SSLSocket.client_auth_data_callback
Traceback (most recent call last):
  File "/home/mitr/co/fedora/sigul/build/p/share/sigul/utils.py", line 215, in nss_client_auth_callback_single
    def nss_client_auth_callback_single(unused_ca_names, cert):
KeyboardInterrupt
python: Modules/gcmodule.c:277: visit_decref: Assertion `gc->gc.gc_refs != 0' failed.

(without any reason to see KeyboardInterrupt AFAIK.)  At least the assertion seems to point to a problem in this process.

The process with the server socket reports:

  File "/home/mitr/co/fedora/sigul/build/p/share/sigul/double_tls.py", line 836, in write
    self.__socket.send(utils.u32_pack(len(data)))
NSPRError: [Errno -12285] (SSL_ERROR_NO_CERTIFICATE) Unable to find the certificate or key necessary for authentication.

(self.__socket is the server NSS SSL socket); this error might be caused by the client-side error.


If there's nothing obvious, I'll try to find out more with a debugger.

Comment 1 John Dennis 2010-04-24 03:43:26 UTC
I have an idea as to what may be causing this problem. If I recall you're using threads, is that correct? In the current implementation the callback's retain their state in a global variable and thus are not thread safe. If there were more than one thread passing data through the callback then it seems entirely likely the reference counts could be corrupted.

Are you using threads with each thread setting a callback?

Comment 2 Miloslav Trmač 2010-04-24 13:45:48 UTC
Right, I am very sorry, I should have mentioned the threads.

In the "client", there is
- a parent process that does not initialize NSS
- "child1", started from this parent process
- "child2", started from child1
- child1 and child2 both initialize NSS (read-only, using the same database)
  after fork()ing child2
- child1 and child2 (separate processes) both use set_client_auth_data_callback
  and create SSL sockets.
- _after_ child1 successfully sets the callback and performs a handshake,
  it creates three threads.  These threads AFAIK do not use NSS sockets (they
  communicate with child2, which does use NSS).

Comment 3 John Dennis 2010-05-13 01:43:07 UTC
I've looked into this a bit more. Comment #1 was wrong, the state is maintained in each socket object as it should be. I've looked at the ref count manipulations in the callback code and I don't see an obvious problem from static source code inspection. The assertion occurs in the garbage collection code of Python which is periodically triggered when memory is allocated. Thus I would expect the point at which the assertion is triggered to be somewhat random (unless your data and program execution are identical each time). Is it random or do you see the assertion always inside the callback at the same place?

It does appear to be a ref counting problem, but we might need some more debugging support to track it down (e.g a special python build) because the error may have occurred long before the garbage collection code is invoked.

Comment 4 Miloslav Trmač 2010-05-13 03:00:42 UTC
AFAICS it has always occurred in the same place, but only randomly, and I haven't seen it happen very recently.

I have identified a few places where the reference counting in python-nss is not 100% clear - but I think there were mostly reference leaks rather than too many references.  I plan to check the reference counts more thoroughly.

Comment 5 John Dennis 2010-05-13 12:37:50 UTC
I've done further research in the area of Python's cyclic garbage collector which is where the problem shows up. There is a requirement for CPython objects which hold a reference to other (collection) objects to implement two special callback routines (tp_traverse, tp_clear) which are invoked by the cyclic garbage collector in order to track the references being held by the object.

Those special callbacks were never implemented for any of the CPython objects in python-nss. While there is no proof this implementation error is the root cause of the problem you're seeing it is a likely source. In any event the problem needs to be fixed irrespective. Hopefully this will eliminate the problem you're seeing.

I need to identify each object which holds a reference and add this special support code. I will start that today. I'm am planning on making a release of python-nss with significant new functionality and some bug fixes shortly. The fixes for the garbage collector will be part of this release. I'll let you know when it's available.

On a slightly different note, I discovered the example auth_certifcate_callback provided by python-nss (in doc/examples/ssl_example.py) and in the documentation was deficient and failed to fully validate a certificate. If you're using that callback you'll want to update your callback from the fixed version in the next release. Since it's likely you've haven't needed to modify the callback it should be a simple cut-n-paste.

Comment 6 Miloslav Trmač 2010-05-13 21:25:40 UTC
Created attachment 413899 [details]
Proposed reference counting fixes

Attached are some suggested refcounting fixes.  I did not do any thorough testing, so please check the patch - on my code it behaves exactly like the previous code (i.e. everything seems to work fine).

The reference counting conventions in py_nss.c (e.g. the goto out of Py_INCREF/Py_DECREF in FMT_OBJ_AND_APPEND) are rather complex, and I'm not using this code, so I did not check most of the formatting routines.

The changes in get_client_auth_data probably explain the behavior I've been seeing, py_cert_dist_names already has reference count 0 by the time Py_DECREF(py_cert_dist_names) is called.

One recurring pattern in the patch is replacing
  Py_XDECREF(static_var);
  Py_INCREF(val);
  static_var = val;
by
  old = static_var;
  Py_INCREF(val);
  static_var = val;
  Py_XDECREF(old);
to make sure that when descructors are run as part of the Py_XDECREF, they can't ever access the value "static_var" which does not have a corresponding reference.

Comment 7 John Dennis 2010-05-14 00:07:13 UTC
Thank you for the patch. I haven't had much of a chance to look at it and I won't until Monday. I wonder if rather than introducing the (visible) extra code and variables it would be better to us the Py_CLEAR macro, essentially just replacing the existing Py_XDECREF() with Py_CLEAR().

I think this provides the same safety while at the same time being less verbose and utilizes what seems to be the documented mechanism. What do you think?

BTW, Py_CLEAR is defined in object.h

Comment 8 Miloslav Trmač 2010-05-14 14:49:09 UTC
Py_CLEAR results in correct reference counting.  The NULL value is observable from other Python code, which is a bit problematic:

* In theory, the NULL value may not make any sense for the static variable (I don't think this is happening in python-nss).

* If the purpose of the function is to change the static variable, the replacement isn't atomic - Python code could observe when the static variable has neither the old nor the new value.

Comment 9 John Dennis 2010-05-17 13:34:12 UTC
The value will be NULL only during a very restricted interval in which the only possible code which can run is the object destruction code associated with the garbage collector. Control never returns to the top level. No other Python code executes aside from destruction code. Thus I don't follow your comments above. No Python code would ever see the NULL value other than the code used to dealloc, destroy, or visit the object during garbage collection. This code absolutely needs to check for NULL and it should reasonably expect the object to be NULL during collection.

Can you explain how the value can be observed as NULL other than during destruction and collection?

Also you keep making references to static variables. The code we're discussing doesn't use static variables, could you explain?

Comment 10 Miloslav Trmač 2010-05-17 21:45:51 UTC
When the reference count of an object "x" drops to 0, "x.__del__" is called if it exists, before the "physical" destruction of "x".  This can be arbitrary Python code - although I haven't checked if the types involved in python-nss can have an user-defined __del__ method.  I suspect the tuples can contain arbitrary objects.

At least one static variable involved is "password_callback", but the same problem holds for any variable (e.g. a struct field) - the important point is not being static, but being visible from Python code during the runtime of x.__del__.

Comment 11 John Dennis 2010-05-18 14:23:37 UTC
After researching this a bit more I believe the code sequence you suggest in comment #6 is correct. Thank you for pointing out the issues. FWIW I think some of the CPython documentation is misleading and I think I'll file a doc bug with upstream Python to address that.

I would prefer see the code sequence and local variable declaration encapsulated in a macro to make the code cleaner, easier to read, and to assure the proper sequence is observed in all code locations where a reference is taken. I will make that change.

With respect to the static variable in comment #10, this is an example of a value which should be thread local and not static. This issue is discussed in bug #580963. I will also update the code so the pin args are thread local and go through the code and identify any other candidates for thread local storage.

Comment 12 John Dennis 2010-05-28 20:33:58 UTC
I have prepared a new python-nss package which includes your suggested fixes as well as fixes for reference counting, thread local storage, cyclic garbage collection, and a host of new features. The release is 0.9. I have built it in rawhide (http://koji.fedoraproject.org/koji/buildinfo?buildID=175538).

I have extensively tested the new version with valgrind, with a special python interpreter build which adds reference counting assertions, and using a script I wrote to verify there are exact matching deallocs for each new object created. There are no objects uncollected, no excessive decref's, no memory accessed when it shouldn't be. Of course I may have missed something.

I would appreciate if you could both test the package and do a source code review and provide feedback. Thank you!

Comment 15 Miloslav Trmač 2010-05-31 23:09:15 UTC
Created attachment 418414 [details]
A few more refcounting changes

Thanks.

Attached is the result of merging my original patch with 0.9 - I think these changes are still needed.  I'm afraid I wasn't yet able to test sigul sufficiently with this version yet, for reasons unrelated to python-nss.

Some notes from code review:

* CertificateExtension_{traverse,clear} are defined, but not used in the method
  table.
  (This also applies to the NewType type.)

* There are some debugging printf()s left:
  +            printf(">>> item refcnt = %d\n", item->ob_refcnt);
  +    printf("callback refcnt=%d\n", callback->ob_refcnt);
  and dump_tuple/all its callers.

Comment 16 John Dennis 2010-06-01 19:33:29 UTC
Thank you for your review Miloslav and your patch. 

I applied your patch in it's entirety. Thank you for your careful review, you caught things I had missed.

I cleaned up the other issues you noted in comment #15.

I had inadvertently omitted crediting you for your patches and help in the ChangeLog and spec file changelog, I added credit and thanks to you in both places.

I did not bump the upstream release number after applying the patch, rather I just produced a new tarball and did a new RPM build (after bumping the release number).

I did the same series of testing as noted in comment #12, no issues were discovered.

There is a new version in rawhide, python-nss-0.9-2.fc14 with the above issues addressed. When you get a chance to test please let me know. If your testing is successful I will tag upstream CVS and push out into F-13 and F-12.

Thanks again.

Comment 17 John Dennis 2010-06-04 18:36:28 UTC
Following up. I discovered the way the python interpreter had been built was preventing valgrind from detecting a number of errors. I have rebuilt the interpreter (--without-pymalloc) and have been running the valgrind tests again. Valgrind is now reporting errors that had been previously masked by the use of optimized malloc library inside python. I've been fixing these errors. I wouldn't bother testing with python-nss-0.9-2.fc14 since that version contains some of these errors. I'll update this bugzilla when a new version is available with all the valgrind issues resolved.

Comment 18 John Dennis 2010-06-06 15:27:07 UTC
python-nss-0.9-3.fc14 has been built into rawhide and addresses the issues discovered in comment #17. It also adds printing support for a number of certificate extensions and adds support for the BasicConstraints certificate extension.

Hopefully this version addresses all the reference counting and other memory issues. :-)

Comment 19 John Dennis 2010-06-11 19:09:52 UTC
FYI: additional testing revealed an undiscovered bug in socket reads, this is unrelated to the reference counting issues but it will require me to respin. I already have the fix. The problem manifests itself when you combine readline calls with read calls on the same socket, this is supported and should work, however there was a bookeeping bug which caused a short read.

Comment 20 John Dennis 2010-06-14 15:07:39 UTC
python-nss-0.9-5.fc14 has been built into rawhide to address the issue described in comment #19

Changelog:
* Mon Jun 14 2010 John Dennis <jdennis> - 0.9-5
- Fix incomplete read bug (due to read ahead buffer bookkeeping).
- Remove python-nss specific httplib.py, no longer needed
  python-nss now compatible with standard library
- Rewrite httplib_example.py to use standard library and illustrate
  ssl, non-ssl, connection class, http class usage

Comment 21 Fedora Update System 2010-07-16 17:15:01 UTC
python-nss-0.9-8.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/python-nss-0.9-8.fc13

Comment 22 Fedora Update System 2010-07-16 17:16:09 UTC
python-nss-0.9-8.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/python-nss-0.9-8.fc12

Comment 23 Fedora Update System 2010-09-08 02:27:00 UTC
python-nss-0.9-8.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 24 Fedora Update System 2010-11-01 20:55:52 UTC
python-nss-0.9-8.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.