Description of problem:

s3backer is a FUSE filesystem that exposes an Amazon S3 bucket as a file in your local filesystem. It relies on libcurl for communication with Amazon, and RHEL's libcurl relies on NSS as its TLS provider. Subjecting s3backer to prolonged stress on RHEL/CentOS 6.4 with the stock libcurl results in unbounded memory consumption. Analysis with valgrind's memcheck tool shows no significant memory leaks. However, recompiling libcurl to use OpenSSL as its TLS provider instead of NSS makes the issue go away.

Version-Release number of selected component (if applicable):

nss-3.15.3-3.el6_5.x86_64
libcurl-7.19.7-37.el6_4.x86_64

How reproducible:

It happens every time.

Steps to Reproduce:

1. Spin up a RHEL/CentOS 6.x AWS instance and create an Amazon S3 bucket. In my case, the bucket has a 1M object size. Note that the only thing that is probably necessary here is an S3-like service, but to ensure reproducibility, I am providing precise parameters.

2. Install s3backer.

3. Mount the S3 bucket with a command such as the following (sensitive details such as keys have been sanitized):

sudo s3backer --readAhead=0 --blockSize=1M --size=500G --listBlocks --vhost --baseURL=https://s3.amazonaws.com/ --timeout=15 --blockCacheThreads=10 --accessId=<accessId> --accessKey=<accessKey> --blockCacheSize=10 <bucket> <mountpoint>

4. Run dd on the mounted S3 bucket:

sudo dd if=/dev/urandom of=<mountpoint>/file bs=1M count=102400 iflag=nonblock,noatime oflag=nonblock,noatime

The workload where I originally observed this involved running a filesystem on top of s3backer that is not supported by Red Hat, but the same behavior can be reproduced by running dd on top of s3backer. The growth becomes increasingly obvious when using /proc/$PID/status to view memory usage over the span of 2-4 hours. While the system running s3backer need not be on AWS, access to S3 from outside Amazon is quite slow and I highly recommend placing it there. A long-running program that exercises the same code paths in libcurl and NSS as s3backer should also exhibit this problem.

Actual results:

VmRSS slowly grows until either the dd stops or the OOM killer triggers.

Expected results:

VmRSS should stabilize, with VmHWM not changing after the first hour or so.

Additional info:

Analysis with valgrind's massif tool shows multiple backtraces containing NSS in the same snapshot, and their share of memory usage grows as a function of instructions executed. Here is an excerpt of one of the backtraces:

| ->21.93% (13,056,000B) 0xE91C59A: ???
| | ->21.93% (13,056,000B) 0xE9218C7: ???
| | ->21.93% (13,056,000B) 0xE928520: ???
| | ->21.93% (13,056,000B) 0x37C56470C8: PK11_CreateNewObject (pk11obj.c:378)
| | ->21.93% (13,056,000B) 0x37C5647361: PK11_CreateGenericObject (pk11obj.c:1415)
| | ->21.93% (13,056,000B) 0x37C823F49E: nss_create_object (nss.c:349)
| | ->21.93% (13,056,000B) 0x37C823F625: nss_load_cert (nss.c:383)
| | ->21.93% (13,056,000B) 0x37C8240E0E: Curl_nss_connect (nss.c:1095)
| | ->21.93% (13,056,000B) 0x37C8238480: Curl_ssl_connect (sslgen.c:185)
| | ->21.93% (13,056,000B) 0x37C8216EC9: Curl_http_connect (http.c:1796)
| | ->21.93% (13,056,000B) 0x37C821D680: Curl_protocol_connect (url.c:3077)
| | ->21.93% (13,056,000B) 0x37C8223B3A: Curl_connect (url.c:4743)
| | ->21.93% (13,056,000B) 0x37C822BBAE: Curl_perform (transfer.c:2523)
| | ->21.93% (13,056,000B) 0x409E30: http_io_perform_io (http_io.c:1437)
| | ->18.49% (11,009,280B) 0x40CB36: http_io_write_block (http_io.c:1348)
| | | ->18.49% (11,009,280B) 0x407422: ec_protect_write_block (ec_protect.c:433)
| | | ->18.49% (11,009,280B) 0x404CFD: block_cache_worker_main (block_cache.c:1090)
| | | ->18.49% (11,009,280B) 0x379120784F: start_thread (pthread_create.c:301)
| | | ->18.49% (11,009,280B) 0x3790AE894B: clone (clone.S:115)

Recompiling libcurl with OpenSSL as its TLS provider makes the issue go away, which strongly suggests a problem either in NSS or in how libcurl uses NSS. I am not familiar with the NSS codebase, and I am concerned about the possibility of introducing a security hole should I attempt to patch this myself. Red Hat employees in #rhel on freenode informed me that Red Hat employs multiple NSS developers and suggested that I file a report here for them.
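For reference, a minimal watcher along these lines can be used to track the VmRSS/VmHWM growth described above (an illustrative sketch only, not part of the original reproduction steps; the one-minute poll interval is an arbitrary choice):

/* vmrss-watch.c -- poll the VmRSS and VmHWM lines of /proc/<pid>/status.
 * Build: gcc -o vmrss-watch vmrss-watch.c
 * Usage: ./vmrss-watch <pid>
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], line[256];

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof path, "/proc/%s/status", argv[1]);

    for (;;) {
        FILE *fp = fopen(path, "r");
        if (!fp) {
            perror(path);
            return 1;
        }
        /* Print only the resident-set lines. */
        while (fgets(line, sizeof line, fp)) {
            if (!strncmp(line, "VmRSS:", 6) || !strncmp(line, "VmHWM:", 6))
                fputs(line, stdout);
        }
        fclose(fp);
        fflush(stdout);
        sleep(60);  /* arbitrary one-minute poll interval */
    }
}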
I have experienced this exact issue while developing a proprietary application.

->03.41% (5,580,460B) 0x11C1A628: ???
->01.37% (2,236,192B) 0x11C0A7CA: ???
| ->01.37% (2,236,192B) 0x11C0FB57: ???
| ->01.37% (2,236,192B) 0x11C167B0: ???
| ->01.37% (2,236,192B) 0x7901888: PK11_CreateNewObject (pk11obj.c:378)
| ->01.37% (2,236,192B) 0x7901B21: PK11_CreateGenericObject (pk11obj.c:1504)
| ->01.37% (2,236,192B) 0x52EB49E: nss_create_object (nss.c:349)
| ->01.37% (2,236,192B) 0x52EB625: nss_load_cert (nss.c:383)
| ->01.37% (2,236,192B) 0x52ECE0E: Curl_nss_connect (nss.c:1095)
| ->01.37% (2,236,192B) 0x52E4480: Curl_ssl_connect (sslgen.c:185)
| ->01.37% (2,236,192B) 0x52C2EC9: Curl_http_connect (http.c:1796)
| ->01.37% (2,236,192B) 0x52C9680: Curl_protocol_connect (url.c:3077)
| ->01.37% (2,236,192B) 0x52CFB3A: Curl_connect (url.c:4743)
| ->01.37% (2,236,192B) 0x52D7BAE: Curl_perform (transfer.c:2523)
| ->01.09% (1,783,232B) 0x5B6B5D: CSwiftCurl::login() (swiftcurl.C:186)
| | ->01.09% (1,783,232B) in 2 places, all below massif's threshold (01.00%)
| |
| ->00.28% (452,960B) in 1+ places, all below ms_print's threshold (01.00%)

Some additional details which may be of use:

I have noticed that the rate of the leak is a function of the number of threads performing simultaneous requests. If I use a single thread, the leak never materializes. I have traced the curl code, and it appears to call PK11_DestroyGenericObject() correctly. I strongly suspect that this is a race condition in PK11_DestroyGenericObject(). The following C program seems to reproduce a leak with a similar call stack somewhat reliably:

//=========leak-test.c=======
#include <curl/curl.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 200
#define NUM_GETS_PER_THREAD 5

/* Discard the response body; the callback must return the number of
 * bytes it consumed. */
size_t do_nothing(char *data, size_t size, size_t nmemb, void *userp)
{
    (void)data;
    (void)userp;
    return size * nmemb;
}

void *thread_main(void *arg)
{
    CURL *c;
    int i;

    (void)arg;
    c = curl_easy_init();
    for (i = 0; i < NUM_GETS_PER_THREAD; i++) {
        curl_easy_setopt(c, CURLOPT_URL, "https://bugzilla.redhat.com");
        curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, do_nothing);
        curl_easy_perform(c);
        curl_easy_reset(c);
    }
    curl_easy_cleanup(c);
    return 0;
}

int main(void)
{
    pthread_t all_threads[NUM_THREADS];
    int i;

    curl_global_init(CURL_GLOBAL_ALL);
    for (i = 0; i < NUM_THREADS; i++) {
        pthread_create(&all_threads[i], NULL, thread_main, NULL);
    }
    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(all_threads[i], NULL);
    }
    curl_global_cleanup();
    return 0;
}
//============================

$ gcc -pthread leak-test.c -o leak-test -lcurl
$ time valgrind --num-callers=20 --leak-check=full ./leak-test
==13090== Memcheck, a memory error detector
==13090== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==13090== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==13090== Command: ./leak-test
==13090==
==13090==
==13090== HEAP SUMMARY:
==13090==     in use at exit: 33,523 bytes in 512 blocks
==13090==   total heap usage: 1,449,900 allocs, 1,449,388 frees, 1,293,593,227 bytes allocated
==13090==
==13090== 400 bytes in 10 blocks are possibly lost in loss record 79 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x761556B: _PR_Getfd (prfdcach.c:112)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762D1C8: PR_OpenFile (ptio.c:3579)
==13090==    by 0x8AB76F44: ???
==13090==    by 0x8AB7475A: ???
==13090==    by 0x8AB79B58: ???
==13090==    by 0x8AB807B1: ???
==13090==    by 0x6CDC889: PK11_CreateNewObject (pk11obj.c:378)
==13090==    by 0x6CDCB22: PK11_CreateGenericObject (pk11obj.c:1504)
==13090==    by 0x4E6E49F: nss_create_object (nss.c:349)
==13090==    by 0x4E6E626: nss_load_cert (nss.c:383)
==13090==    by 0x4E6FE0F: Curl_nss_connect (nss.c:1095)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==
==13090== 480 bytes in 10 blocks are possibly lost in loss record 80 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x7615559: _PR_Getfd (prfdcach.c:109)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762D1C8: PR_OpenFile (ptio.c:3579)
==13090==    by 0x8AB76F44: ???
==13090==    by 0x8AB7475A: ???
==13090==    by 0x8AB79B58: ???
==13090==    by 0x8AB807B1: ???
==13090==    by 0x6CDC889: PK11_CreateNewObject (pk11obj.c:378)
==13090==    by 0x6CDCB22: PK11_CreateGenericObject (pk11obj.c:1504)
==13090==    by 0x4E6E49F: nss_create_object (nss.c:349)
==13090==    by 0x4E6E626: nss_load_cert (nss.c:383)
==13090==    by 0x4E6FE0F: Curl_nss_connect (nss.c:1095)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==
==13090== 520 bytes in 13 blocks are possibly lost in loss record 88 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x761556B: _PR_Getfd (prfdcach.c:112)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762CF71: PR_ImportTCPSocket (ptio.c:4575)
==13090==    by 0x4E70048: Curl_nss_connect (nss.c:1335)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==    by 0x516BB6C: clone (in /lib64/libc-2.12.so)
==13090==
==13090== 624 bytes in 13 blocks are possibly lost in loss record 89 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x7615559: _PR_Getfd (prfdcach.c:109)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762CF71: PR_ImportTCPSocket (ptio.c:4575)
==13090==    by 0x4E70048: Curl_nss_connect (nss.c:1335)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==    by 0x516BB6C: clone (in /lib64/libc-2.12.so)
==13090==
==13090== 7,160 bytes in 179 blocks are possibly lost in loss record 95 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x761556B: _PR_Getfd (prfdcach.c:112)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762E000: PR_Socket (ptio.c:3503)
==13090==    by 0x4E6F637: Curl_nss_connect (nss.c:1200)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==    by 0x516BB6C: clone (in /lib64/libc-2.12.so)
==13090==
==13090== 8,592 bytes in 179 blocks are possibly lost in loss record 96 of 96
==13090==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==13090==    by 0x7615559: _PR_Getfd (prfdcach.c:109)
==13090==    by 0x762CDF4: pt_SetMethods (ptio.c:3301)
==13090==    by 0x762E000: PR_Socket (ptio.c:3503)
==13090==    by 0x4E6F637: Curl_nss_connect (nss.c:1200)
==13090==    by 0x4E67481: Curl_ssl_connect (sslgen.c:185)
==13090==    by 0x4E45ECA: Curl_http_connect (http.c:1796)
==13090==    by 0x4E4C681: Curl_protocol_connect (url.c:3077)
==13090==    by 0x4E52B3B: Curl_connect (url.c:4743)
==13090==    by 0x4E5ABAF: Curl_perform (transfer.c:2523)
==13090==    by 0x400987: thread_main (in /usr/local/projects/accesslogger/leak-test)
==13090==    by 0x541E9D0: start_thread (in /lib64/libpthread-2.12.so)
==13090==    by 0x516BB6C: clone (in /lib64/libc-2.12.so)
==13090==
==13090== LEAK SUMMARY:
==13090==    definitely lost: 0 bytes in 0 blocks
==13090==    indirectly lost: 0 bytes in 0 blocks
==13090==      possibly lost: 17,776 bytes in 404 blocks
==13090==    still reachable: 15,747 bytes in 108 blocks
==13090==         suppressed: 0 bytes in 0 blocks
==13090== Reachable blocks (those to which a pointer was found) are not shown.
==13090== To see them, rerun with: --leak-check=full --show-reachable=yes
==13090==
==13090== For counts of detected and suppressed errors, rerun with: -v
==13090== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 12 from 9)

real    3m11.968s
user    0m53.931s
sys     1m49.370s

Increasing either NUM_THREADS or NUM_GETS_PER_THREAD seems to increase the size of the leak.
Also, the above was run on RHEL 6.5:

$ rpm -q nss libcurl
nss-3.15.3-6.el6_5.x86_64
libcurl-7.19.7-37.el6_4.x86_64
A dive into the NSS code, and I think I've found the race:

==== nss-3.15.1/nss/lib/pk11wrap/pk11obj.c:1421 ====

/*
 * remove an object from the list. If the object isn't already in
 * a list unlink becomes a noop.
 */
SECStatus
PK11_UnlinkGenericObject(PK11GenericObject *object)
{
    if (object->prev != NULL) {
        object->prev->next = object->next;
    }
    if (object->next != NULL) {
        object->next->prev = object->prev;
    }
    object->next = NULL;
    object->prev = NULL;
    return SECSuccess;
}

/*
 * This function removes a single object from the list and destroys it.
 * For an already unlinked object there is no difference between
 * PK11_DestroyGenericObject and PK11_DestroyGenericObjects
 */
SECStatus
PK11_DestroyGenericObject(PK11GenericObject *object)
{
    if (object == NULL) {
        return SECSuccess;
    }

    PK11_UnlinkGenericObject(object);
    if (object->slot) {
        PK11_FreeSlot(object->slot);
    }
    PORT_Free(object);
    return SECSuccess;
}

========

Imagine the following execution, where Thread1 runs PK11_UnlinkGenericObject(some_obj) while Thread2 runs PK11_UnlinkGenericObject(some_obj->next):

Thread1                                  Thread2
PK11_UnlinkGenericObject(some_obj)       PK11_UnlinkGenericObject(some_obj->next)

if (object->prev != NULL) {              if (object->prev != NULL) {
    object->prev->next = object->next;       object->prev->next = object->next;
}                                        }

    // Both threads have now updated their predecessor's next pointer, so
    // some_obj->prev->next points to some_obj->next (rather than to
    // some_obj->next->next like it should)

if (object->next != NULL) {              if (object->next != NULL) {
    object->next->prev = object->prev;       object->next->prev = object->prev;
}                                        }
object->next = NULL;                     object->next = NULL;
object->prev = NULL;                     object->prev = NULL;
return SECSuccess;                       return SECSuccess;

    // Now some_obj->prev->next->next == NULL; we have lost the tail end
    // of the list

Now, I'm far from an expert on the NSS code, and there may be some other synchronization here that I'm missing, but my understanding is that you can't do this lock-free without CAS. If I come up with a patch, I'll be sure to post it here.
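To make the interleaving above concrete, here is a small standalone program that replays those steps sequentially on a toy doubly-linked list (a simplified stand-in, not the real PK11GenericObject) and shows the tail being dropped under one possible interleaving:

/* unlink-race.c -- replay the two-thread unlink interleaving described
 * above, step by step, on a toy list A <-> B <-> C <-> D.
 * Build: gcc -o unlink-race unlink-race.c
 */
#include <stdio.h>

struct node {
    const char  *name;
    struct node *prev;
    struct node *next;
};

static void walk(struct node *n)
{
    for (; n != NULL; n = n->next)
        printf(" %s", n->name);
    putchar('\n');
}

int main(void)
{
    /* Thread1 unlinks B; Thread2 unlinks C (== B->next). */
    struct node a = { "A", NULL, NULL }, b = { "B", NULL, NULL },
                c = { "C", NULL, NULL }, d = { "D", NULL, NULL };
    a.next = &b; b.prev = &a; b.next = &c;
    c.prev = &b; c.next = &d; d.prev = &c;

    printf("before:"); walk(&a);

    /* Step 1 of PK11_UnlinkGenericObject, both threads: */
    b.prev->next = b.next;   /* Thread1: A->next = C                    */
    c.prev->next = c.next;   /* Thread2: C->prev is still B, so this
                                sets B->next = D                        */

    /* Step 2, both threads: */
    b.next->prev = b.prev;   /* Thread1: B->next is now D, so D->prev = A */
    c.next->prev = c.prev;   /* Thread2: D->prev = B                      */

    /* Cleanup, both threads: */
    b.next = b.prev = NULL;
    c.next = c.prev = NULL;

    printf("after: "); walk(&a);   /* prints "A C": D fell off the list */
    return 0;
}

Running it prints "before: A B C D" and "after: A C". Node D is still allocated but no longer reachable through the forward links, which is exactly the kind of silent loss that would show up as unbounded growth rather than a valgrind-reported leak.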
I am also facing the same problem. I have a medium EC2 instance in AWS with s3backer configured. Once s3backer starts, it keeps consuming system memory up to the maximum, and we have to reboot the instance to get it freed. Please let me know if there is any fix or solution to this problem.

Additional posts: https://code.google.com/p/s3backer/issues/detail?id=47

Instance details:

OS: CentOS Linux release 6.0 (Final) x86_64

s3backer command: /usr/local/s3backer/bin/s3backer --blockSize=64k --size=50g --ssl --vhost --listBlocks --debug-http mymachine-production-backup /s3backer

[root@ip-10-XXX-XXX-X ~]# rpm -qa | grep -e nss -e libcurl
openssh-server-5.3p1-20.el6_0.3.x86_64
openssl-devel-1.0.1e-16.el6_5.7.x86_64
libcurl-devel-7.19.7-26.el6_2.4.x86_64
nss-sysinit-3.14.0.0-12.el6.x86_64
openssh-clients-5.3p1-20.el6_0.3.x86_64
openssl-1.0.1e-16.el6_5.7.x86_64
nss-util-3.14.0.0-2.el6.x86_64
nss-3.14.0.0-12.el6.x86_64
openssh-5.3p1-20.el6_0.3.x86_64
libcurl-7.19.7-26.el6_2.4.x86_64
nss-softokn-freebl-3.12.9-11.el6.x86_64
nss-tools-3.14.0.0-12.el6.x86_64
nss-softokn-3.12.9-11.el6.x86_64
This is correct. Currently, threads own their own generic object lists. If they share those lists, they need to protect them themselves. NSS does not use global generic object lists (unlike, say, keys, which live in global lists within NSS).
So we should file a bug with libcurl then?
(In reply to Charles Stanley from comment #9)
> So we should file a bug with libcurl then?

You could do that, or reassign this bug to libcurl.
(In reply to Bob Relyea from comment #8)
> This is correct. Currently, threads own their own generic object lists.
> If they share those lists, they need to protect them themselves. NSS does
> not use global generic object lists (unlike, say, keys, which live in
> global lists within NSS).

libcurl maintains its own list of objects per connection (and a connection itself is never used by multiple threads at the same time). What exactly should be protected by libcurl at the level of the NSS API?

This bug is about calling PK11_DestroyGenericObject() concurrently, but is it the only problem that can happen here?

Is it fine to call PK11_DestroyGenericObject() while another thread is executing PK11_CreateGenericObject()?

Is it actually safe to call PK11_CreateGenericObject() concurrently?

Is it fine to call PK11_DestroyGenericObject() while another thread is executing other PK11_*() functions?

Is it fine to call PK11_DestroyGenericObject() while another thread is executing SSL_ForceHandshake()?

Bob, could you please describe what exactly needs to be synchronized here?
You can call PK11_XXXGenericObject on multiple threads, but if you share the data structure returned by PK11_CreateGenericObject() with other threads, you need to protect it across the various PK11_XXXGeneric calls. So it's not calling PK11_DestroyGenericObject() on multiple threads that is the issue; it's calling PK11_DestroyGenericObject() with the same object on multiple threads that's an issue.

It's reasonable to assume that this should work, and it's a reasonable RFE to have NSS protect the generic object data structure, but currently it's expected that you wouldn't share generic objects across threads (the initial purpose was to get it, use it, and free it all on the same stack).

bob
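For illustration, the caller-side protection described above would look something like this if an application really did share a single generic object across threads. This is a sketch only; the locked_* wrapper names are illustrative and not part of NSS or libcurl:

/* Build against NSS headers, e.g.:
 *   -I/usr/include/nss3 -I/usr/include/nspr4
 */
#include <pthread.h>
#include <pk11pub.h>   /* PK11_CreateGenericObject() and friends */

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize creation of a shared generic object. */
static PK11GenericObject *
locked_create(PK11SlotInfo *slot, CK_ATTRIBUTE *tmpl, int count, PRBool token)
{
    PK11GenericObject *obj;
    pthread_mutex_lock(&obj_lock);
    obj = PK11_CreateGenericObject(slot, tmpl, count, token);
    pthread_mutex_unlock(&obj_lock);
    return obj;
}

/* Serialize destruction, so two threads can never race on the same
 * object's list pointers. */
static SECStatus
locked_destroy(PK11GenericObject *obj)
{
    SECStatus rv;
    pthread_mutex_lock(&obj_lock);
    rv = PK11_DestroyGenericObject(obj);
    pthread_mutex_unlock(&obj_lock);
    return rv;
}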
(In reply to Bob Relyea from comment #14)
> You can call PK11_XXXGenericObject on multiple threads, but if you share
> the data structure returned by PK11_CreateGenericObject() with other
> threads, you need to protect it across the various PK11_XXXGeneric calls.
> So it's not calling PK11_DestroyGenericObject() on multiple threads that
> is the issue; it's calling PK11_DestroyGenericObject() with the same
> object on multiple threads that's an issue.
>
> It's reasonable to assume that this should work, and it's a reasonable RFE
> to have NSS protect the generic object data structure, but currently it's
> expected that you wouldn't share generic objects across threads (the
> initial purpose was to get it, use it, and free it all on the same stack).
>
> bob

Thanks for the reply, Bob! I forgot to say that I am pretty sure curl never uses the pointer returned by PK11_CreateGenericObject() in a different thread than the owning one.

Based on the above analysis, the problem seems to be that NSS internally links the objects with each other. Consequently, even if each thread manipulates its own object, a collision may happen during the update of the linking pointers. As long as linking the objects into a list is an implementation detail of NSS, it is not clear what libcurl should protect at the level of the NSS API.
Hmm, then something is wrong with the analysis, because NSS doesn't reuse those pointers either, nor does it put those objects on global lists. We can still look into adding locks to these objects themselves, but that wouldn't solve the original issue (which may be a memory corruption rather than a thread issue).
Fair enough. I agree that the analysis does not apply to the problem reported in this bug. The memory consumption growing per connection is a known issue of NSS:

https://bugzilla.mozilla.org/402712#c58

As for the valgrind reports, does the amount of "possibly lost" memory go to zero if you call PR_Cleanup() at the end of the main() function?
I observed a similar leak with my single-threaded demo program. So this issue is not related to threading.

$ valgrind --leak-check=full --num-callers=20 ./main
==25607== Memcheck, a memory error detector
==25607== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==25607== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==25607== Command: ./main
==25607==
==25607==
==25607== HEAP SUMMARY:
==25607==     in use at exit: 119,019 bytes in 245 blocks
==25607==   total heap usage: 136,228 allocs, 135,983 frees, 89,561,004 bytes allocated
==25607==
==25607== 40 bytes in 1 blocks are possibly lost in loss record 45 of 99
==25607==    at 0x4C27A2E: malloc (vg_replace_malloc.c:270)
==25607==    by 0x824E60B: _PR_Getfd (prfdcach.c:112)
==25607==    by 0x8266094: pt_SetMethods (ptio.c:3303)
==25607==    by 0x8266468: PR_OpenFile (ptio.c:3581)
==25607==    by 0x9938C67: blapi_SHVerifyFile (shvfy.c:424)
==25607==    by 0x9938FFE: blapi_SHVerify (shvfy.c:358)
==25607==    by 0x9923438: bl_startup_tests (fipsfreebl.c:1541)
==25607==    by 0x99620D5: ??? (in /lib64/libfreebl3.so)
==25607==    by 0x9910A7A: ??? (in /lib64/libfreebl3.so)
==25607==
==25607== 48 bytes in 1 blocks are possibly lost in loss record 49 of 99
==25607==    at 0x4C27A2E: malloc (vg_replace_malloc.c:270)
==25607==    by 0x824E5F9: _PR_Getfd (prfdcach.c:109)
==25607==    by 0x8266094: pt_SetMethods (ptio.c:3303)
==25607==    by 0x8266468: PR_OpenFile (ptio.c:3581)
==25607==    by 0x9938C67: blapi_SHVerifyFile (shvfy.c:424)
==25607==    by 0x9938FFE: blapi_SHVerify (shvfy.c:358)
==25607==    by 0x9923438: bl_startup_tests (fipsfreebl.c:1541)
==25607==    by 0x99620D5: ??? (in /lib64/libfreebl3.so)
==25607==    by 0x9910A7A: ??? (in /lib64/libfreebl3.so)
==25607==
==25607== 2,000 bytes in 50 blocks are possibly lost in loss record 86 of 99
==25607==    at 0x4C27A2E: malloc (vg_replace_malloc.c:270)
==25607==    by 0x824E60B: _PR_Getfd (prfdcach.c:112)
==25607==    by 0x8266094: pt_SetMethods (ptio.c:3303)
==25607==    by 0x8266468: PR_OpenFile (ptio.c:3581)
==25607==    by 0xB273F84: ???
==25607==    by 0xB27176A: ???
==25607==    by 0xB276B98: ???
==25607==    by 0xB27D7F1: ???
==25607==    by 0x7915419: PK11_CreateNewObject (pk11obj.c:381)
==25607==    by 0x79156B2: PK11_CreateGenericObject (pk11obj.c:1550)
==25607==    by 0x4E6E6BF: nss_create_object (nss.c:353)
==25607==    by 0x4E6E846: nss_load_cert (nss.c:387)
==25607==    by 0x4E70513: nss_connect_common (nss.c:1168)
==25607==    by 0x4E676E8: Curl_ssl_connect_nonblocking (sslgen.c:201)
==25607==    by 0x4E423BD: https_connecting (http.c:1830)
==25607==    by 0x4E4C931: Curl_protocol_connect (url.c:3078)
==25607==    by 0x4E609B5: multi_runsingle (multi.c:1069)
==25607==    by 0x4E61013: multi_socket (multi.c:1952)
==25607==    by 0x4E610DE: curl_multi_socket_action (multi.c:2070)
==25607==    by 0x40298C: perform(murl_t*, long) (murl.c:309)
==25607==
==25607== 2,400 bytes in 50 blocks are possibly lost in loss record 98 of 99
==25607==    at 0x4C27A2E: malloc (vg_replace_malloc.c:270)
==25607==    by 0x824E5F9: _PR_Getfd (prfdcach.c:109)
==25607==    by 0x8266094: pt_SetMethods (ptio.c:3303)
==25607==    by 0x8266468: PR_OpenFile (ptio.c:3581)
==25607==    by 0xB273F84: ???
==25607==    by 0xB27176A: ???
==25607==    by 0xB276B98: ???
==25607==    by 0xB27D7F1: ???
==25607==    by 0x7915419: PK11_CreateNewObject (pk11obj.c:381)
==25607==    by 0x79156B2: PK11_CreateGenericObject (pk11obj.c:1550)
==25607==    by 0x4E6E6BF: nss_create_object (nss.c:353)
==25607==    by 0x4E6E846: nss_load_cert (nss.c:387)
==25607==    by 0x4E70513: nss_connect_common (nss.c:1168)
==25607==    by 0x4E676E8: Curl_ssl_connect_nonblocking (sslgen.c:201)
==25607==    by 0x4E423BD: https_connecting (http.c:1830)
==25607==    by 0x4E4C931: Curl_protocol_connect (url.c:3078)
==25607==    by 0x4E609B5: multi_runsingle (multi.c:1069)
==25607==    by 0x4E61013: multi_socket (multi.c:1952)
==25607==    by 0x4E610DE: curl_multi_socket_action (multi.c:2070)
==25607==    by 0x40298C: perform(murl_t*, long) (murl.c:309)
==25607==
==25607== LEAK SUMMARY:
==25607==    definitely lost: 0 bytes in 0 blocks
==25607==    indirectly lost: 0 bytes in 0 blocks
==25607==      possibly lost: 4,488 bytes in 102 blocks
==25607==    still reachable: 114,531 bytes in 143 blocks
==25607==         suppressed: 0 bytes in 0 blocks
==25607== Reachable blocks (those to which a pointer was found) are not shown.
==25607== To see them, rerun with: --leak-check=full --show-reachable=yes
==25607==
==25607== For counts of detected and suppressed errors, rerun with: -v
==25607== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 12 from 9)
(In reply to zjx20 from comment #18)
> I observed a similar leak with my single-threaded demo program. So this
> issue is not related to threading.

"This issue" in your case has nothing to do with this bug report. The possibly lost memory is just the file descriptor cache allocated by NSPR, which should not grow per connection. You will get rid of the memory leak reports by calling PR_Cleanup() at the end of the main() function. See the following curl commit for inspiration:

https://github.com/bagder/curl/commit/24c3cdce
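For reference, the shape of that advice in a minimal program (a sketch only; error handling is omitted, and the include/link paths assume the stock RHEL-6 nspr-devel layout):

/* Build (paths per the stock RHEL-6 packages):
 *   gcc demo.c -o demo -I/usr/include/nspr4 -lcurl -lnspr4
 */
#include <curl/curl.h>
#include <prinit.h>    /* PR_Cleanup(), from NSPR */

int main(void)
{
    CURL *c;

    curl_global_init(CURL_GLOBAL_ALL);

    c = curl_easy_init();
    curl_easy_setopt(c, CURLOPT_URL, "https://bugzilla.redhat.com/");
    curl_easy_perform(c);
    curl_easy_cleanup(c);

    curl_global_cleanup();
    PR_Cleanup();   /* shut NSPR down so its fd cache is released and
                       not reported by valgrind as "possibly lost" */
    return 0;
}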
I believe the problem here is in PK11_DestroyGenericObject() (which is called by nss_destroy_object() in curl). As you can see in comment #4, PK11_DestroyGenericObject() unlinks the object from its list, reduces the link count in its slot, and then calls PORT_Free() to release the actual storage: three pointers and an integer. Unfortunately, the object referred to by object->objectID isn't destroyed. We're missing a call to PK11_DestroyObject(object->slot, object->objectID) immediately before PK11_FreeSlot():

SECStatus
PK11_DestroyGenericObject(PK11GenericObject *object)
{
    if (object == NULL) {
        return SECSuccess;
    }

    PK11_UnlinkGenericObject(object);
    if (object->slot) {
        PK11_DestroyObject(object->slot, object->objectID);
        PK11_FreeSlot(object->slot);
    }
    PORT_Free(object);
    return SECSuccess;
}

(This is similar to SECKEY_DestroyPublicKey(), but we don't check for permanent objects because the objects we want to delete in nss_destroy_object() are, in fact, permanent.)

With that patch, all the tests pass when building the nss rpm, and the test programs that use libcurl no longer leak. However, looking at the callers of PK11_DestroyGenericObject(), I think some of them might be assuming that the underlying object is not actually destroyed, so perhaps a PK11_CompletelyDestroyGenericObject() might be needed.
(In reply to john.haxby from comment #20)
> With that patch, all the tests pass when building the nss rpm, and the
> test programs that use libcurl no longer leak.

Which test programs? Do they leak memory without the patch? Please report the memory leaks together with a reproducer and valgrind's output. I believe this bug is not about leaking memory at the program's exit; it is about unbounded memory usage during its run time.
OK, "unbounded memory usage" :)  The objectIDs that are used to refer to the allocated storage are no longer accessible to the program because the generic objects that contained them were destroyed. From the program I'm about to attach (which is based on that in comment #4):

/*
 * Test for unbounded memory usage (aka memory leak) with curl and
 * nss.
 *
 * Create a local https server like this:
 *
 *    openssl req -x509 -nodes -days 365 \
 *        -subj '/C=XX/ST=Here/L=There/O=MyOrg/CN=localhost' \
 *        -newkey rsa:1024 -keyout server.key -out server.crt
 *    openssl s_server -key server.key -cert server.crt -www > /dev/null &
 *
 * Compile this program:
 *
 *    cc -Os -Wall -g -o nss-leak-test nss-leak-test.c -lcurl
 *
 * Run (for example)
 *
 *    ./nss-leak-test -n 10000
 *
 * and note that the storage increases while the program is running.
 * Running under valgrind shows no leaks. Run like this to show
 * possibly lost storage:
 *
 *    valgrind --leak-check=full --num-callers=30 ./nss-leak-test -c 10 -a 9
 *
 * This eschews the normal clean-up and shows lots of possibly lost
 * memory, some with a (partial) stack like this:
 *
 *    [...]
 *    by 0x5D77EE9: PK11_CreateNewObject (pk11obj.c:381)
 *    by 0x5D79CE2: PK11_CreateGenericObject (pk11obj.c:1550)
 *    by 0x4E406CC: nss_create_object (nss.c:371)
 *    by 0x4E4077D: nss_load_cert (nss.c:416)
 *    by 0x4E90BFB: nss_load_ca_certificates (nss.c:1326)
 *    by 0x4E90BFB: nss_setup_connect (nss.c:1588)
 *    by 0x4E90BFB: nss_connect_common (nss.c:1798)
 *    [...]
 *
 * The object allocated in nss_create_object() is destroyed in
 * nss_destroy_object(), which calls PK11_DestroyGenericObject(), which
 * doesn't, in fact, destroy the object.
 *
 * If you turn off verification then this particular stack doesn't
 * appear and you don't get the leak.
 */

Without the patch, the test program grows from about 8.7MB RSS to about 20MB in 10,000 iterations; the number of possibly lost blocks increases by about five on each iteration, and everything else stays much the same. I encourage you to try it, and to try it with the nss patch (which I'll also attach rather than just describing it).
Created attachment 1070920 [details]
Reproducer for issue (based on comment #4)
I realise, of course, that curl probably should not reload the server certificates on every connection, but that's a separate issue and would merely avoid the problem.
As far as I remember, the excessive memory usage is not caused by the NSS library itself but by the PEM PKCS #11 module that libcurl uses to load certs/keys from PEM files:

https://bugzilla.mozilla.org/show_bug.cgi?id=402712#c58
Created attachment 1070947 [details]
PK11_DestroyGenericObject patch
Almost, but not quite. I should have included a slightly longer stack trace:

by 0xABE8B29: NSSCKFWC_CreateObject (wrap.c:1958)
by 0x5D77EE9: PK11_CreateNewObject (pk11obj.c:381)
by 0x5D79CE2: PK11_CreateGenericObject (pk11obj.c:1550)
by 0x4E406CC: nss_create_object (nss.c:371)
by 0x4E4077D: nss_load_cert (nss.c:416)
by 0x4E90BFB: nss_load_ca_certificates (nss.c:1326)

With the patch, PK11_DestroyGenericObject() calls PK11_GETTAB(slot)->C_DestroyObject(slot->session, object), aka NSSCKFWC_DestroyObject(). Basically, PK11_CreateGenericObject() calls PK11_GETTAB(slot)->C_CreateObject(), and so it seems that PK11_DestroyGenericObject() should call PK11_GETTAB(slot)->C_DestroyObject(). Does that make sense?
I should add that with the patch, my test program and its predecessor in comment #4 do not change size at all: the possibly-lost blocks do not change with the number of times around the loop. My only concern is that PK11_CreateGenericObject() and PK11_DestroyGenericObject() are not supposed to be symmetric, in spite of their names.
For completeness: https://bugzilla.mozilla.org/show_bug.cgi?id=1202413
You have something here, John, but it's a bit more complex. We need to keep an 'owner' tag on the generic object. If we create it and it's a session object, then PK11_DestroyGenericObject() should destroy it. If we find it with PK11_FindGenericObjects(), or we create it as a token object (which means we mean it to be persistent), then PK11_DestroyGenericObject() should not destroy it.

We need to add a new field to struct PK11GenericObjectStr, a PRBool called owner. owner needs to be set in PK11_CreateGenericObject() as:

PR_TRUE if token == PR_FALSE
PR_FALSE if token == PR_TRUE

and in PK11_FindGenericObjects() it needs to be set to PR_FALSE. Then we call PK11_DestroyObject() in PK11_DestroyGenericObject() if owner == PR_TRUE.

bob
For completeness, there is one other place that creates PK11GenericObjects: pk11_newMergeLog() in pk11merge.c. Those should be set to PR_FALSE.

bob
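If I understand the proposal correctly, the destroy path would end up looking roughly like this (a sketch of the design described in the two comments above, not the actual upstream patch; the field layout is abbreviated):

/* Sketch of the ownership scheme -- illustration only. */
struct PK11GenericObjectStr {
    struct PK11GenericObjectStr *prev;
    struct PK11GenericObjectStr *next;
    PK11SlotInfo                *slot;
    CK_OBJECT_HANDLE             objectID;
    PRBool                       owner;  /* new: PR_TRUE only for session
                                            objects we created ourselves */
};

SECStatus
PK11_DestroyGenericObject(PK11GenericObject *object)
{
    if (object == NULL) {
        return SECSuccess;
    }

    PK11_UnlinkGenericObject(object);
    if (object->slot) {
        /* Destroy the underlying object only if we own it; token
         * (persistent) objects and objects obtained through
         * PK11_FindGenericObjects() are left alone. */
        if (object->owner) {
            PK11_DestroyObject(object->slot, object->objectID);
        }
        PK11_FreeSlot(object->slot);
    }
    PORT_Free(object);
    return SECSuccess;
}

This keeps John's fix for the leaking session objects while preserving the existing behaviour for callers that expect the underlying token object to survive.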
I have tested the patch in an nss build for the latest RHEL-6 against a user's code. The code uses libcurl with https. The leak does not go away; it is just "significantly" reduced. I think there is something odd in the patch anyway, and we (the user and I) have reverted it for the moment.
Created attachment 1118263 [details]
valgrind.txt

Hopefully this valgrind log can help in getting an idea of where the problem is.
Quoting from https://github.com/curl/curl/issues/696:

"""
Several of our customers reported memory leaks when using Zabbix with web-scenarios. We tracked down the problem to libcurl / libnss and the CURLOPT_SSL_VERIFYPEER = 1 option on CentOS 6, CentOS 7 and the corresponding RHEL versions. Eventually the OOM killer kicks in and kills Zabbix processes.

The attached program demonstrates the issue (see the included README for running instructions). If you run it on the aforementioned operating systems with the default libcurl from their repositories, then the heap memory segment will grow continuously (run it for several minutes and use the included topmap.sh script to monitor memory segments). Valgrind doesn't show any lost memory.
"""

The program in question is single-threaded. The problem can be reliably reproduced on CentOS 6.7 and CentOS 7.2 (our customers use RHEL). Could somebody confirm that this is the same issue as discussed in this thread?
Created attachment 1133104 [details]
A single-threaded program demonstrating the issue.

See "leak.txt" and "smaps" files in "leak_v2.zip".
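For readers without the attachment: the pattern described boils down to a loop like the following (my own minimal sketch, not the attached program; the URL and iteration count are placeholders):

/* Repeated easy transfers with peer verification on.  On the affected
 * libcurl/NSS builds, RSS grows per iteration even though valgrind
 * reports nothing definitely lost.
 * Build: gcc grow.c -o grow -lcurl
 */
#include <curl/curl.h>

/* Discard the response body. */
static size_t discard(char *p, size_t sz, size_t n, void *u)
{
    (void)p; (void)u;
    return sz * n;
}

int main(void)
{
    int i;

    curl_global_init(CURL_GLOBAL_ALL);
    for (i = 0; i < 1000; i++) {
        CURL *c = curl_easy_init();
        curl_easy_setopt(c, CURLOPT_URL, "https://example.com/");
        curl_easy_setopt(c, CURLOPT_SSL_VERIFYPEER, 1L);
        curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, discard);
        curl_easy_perform(c);
        curl_easy_cleanup(c);   /* fresh handle => fresh TLS connect,
                                   which reloads the CA certs each time */
    }
    curl_global_cleanup();
    return 0;
}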
Yes, that sounds exactly like the problem. I have a similar reproducer attached to this bug; read from comment #20 onwards.

I'm beginning to think that the right way to fix this bug is to go back to OpenSSL. Again:

* Thu Sep 06 2007 Jindrich Novy <jnovy> 7.16.4-5
- add support for the NSS PKCS#11 pem reader so the command-line is the same
  for both OpenSSL and NSS by Rob Crittenden (rcritten)
- switch to NSS again

* Mon Sep 03 2007 Jindrich Novy <jnovy> 7.16.4-4
- revert back to use OpenSSL (#266021)

* Mon Aug 27 2007 Jindrich Novy <jnovy> 7.16.4-3
- don't use openssl, use nss instead
John, thanks for the confirmation. Unless there are very good reasons not to switch back to OpenSSL, I don't see why RH shouldn't do it (at least until this problem is solved):

* this is a high severity issue;
* it can be reproduced with the simplest libcurl "hello world" program;
* several products are affected severely in production environments;
* there has been very little progress towards a fix for this issue upstream;
* there are alternative TLS backends that do not have this issue.
NSS maintainers seem busy. I will have a look at this in a week or so. Thanks for being patient!
(In reply to Kamil Dudka from comment #42)
> NSS maintainers seem busy. I will have a look at this in a week or so.
> Thanks for being patient!

My 24/7 program has been affected by the leak for more than one year (since comment #18). Please fix that by switching to OpenSSL, thank you!
(In reply to zjx20 from comment #44)
> Please fix that by switching to OpenSSL, thank you!

I am afraid this will never happen in RHEL-6, but you are free to recompile curl against OpenSSL on your own responsibility.
(In reply to Kamil Dudka from comment #45)
> (In reply to zjx20 from comment #44)
> > Please fix that by switching to OpenSSL, thank you!
>
> I am afraid this will never happen in RHEL-6, but you are free to recompile
> curl against OpenSSL on your own responsibility.

Good to know it will land in RHEL-7; we are using RHEL-7 :)
We are going to fix this bug in RHEL-7, too, but there is currently no plan to change the TLS backend in RHEL (not even in Fedora).
(In reply to Kamil Dudka from comment #47)
> We are going to fix this bug in RHEL-7, too, but there is currently no plan
> to change the TLS backend in RHEL (not even in Fedora).

Good to know. If you want a volunteer to test the fix, count me in. Thanks.
Kamil, could you upgrade the severity to high? According to the Severity field help: "high-impact issues in which the customer's operation is disrupted, but there is some capacity to produce."

Some customers have to restart Zabbix servers pre-emptively on a daily basis to avoid OOM killer events, or disable SSL peer verification.
(In reply to Sandis Neilands from comment #49)
> Some customers have to restart Zabbix servers pre-emptively on a daily basis
> to avoid OOM killer events, or disable SSL peer verification.

If you are a customer, do not hesitate to contact Product Support. Bugzilla is a bug tracking tool only. Priority of our work is driven by customer requests.
Is there any update since last month?
(In reply to zjx20 from comment #51)
> Is there any update since last month?

I am currently working on decoupling nss-pem from nss in the "public" branch:

https://git.fedorahosted.org/cgit/nss-pem.git/log/?h=public

The next step is going to be covering the current functionality with a test suite, to prevent regressions while redesigning the internal structures of nss-pem.
Any update on this? Someone else has just discovered this problem again :( Has the work stalled? I don't see any changes since April.
(In reply to john.haxby from comment #56)
> Has the work stalled? I don't see any changes since April.

Nope. The development of nss-pem continues on GitHub:

https://github.com/kdudka/nss-pem

Pull requests are welcome!
Clearing needinfo; the question was answered a while ago in comment 40.
The patch for this bug has already been reverted upstream because it broke Firefox. It needs to be reverted immediately in all versions of RHEL, too.
Daiki, could you please handle the backout?

First, could you check where we have added this patch? Only to 6.9?
(In reply to Kai Engert (:kaie) from comment #67)
> Daiki, could you please handle the backout?
>
> First, could you check where we have added this patch? Only to 6.9?

Yes, only to 6.9.
What is happening with this bug? I see https://github.com/kdudka/nss-pem was last updated at the beginning of March. Presumably this bug also affects RHEL 7, Fedora, etc., but it seems to have completely stalled. Has the time come to say that using curl+nss in long-lived applications is not supported?
Hi John,

Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

I'm not saying this won't be fixed in RHEL 6, but it's more likely to be sorted in RHEL 7 (https://bugzilla.redhat.com/show_bug.cgi?id=1395803), which is still in production phase 1. Just trying to set realistic expectations given where we are in RHEL 6's life cycle. If a solution can be sorted upstream, then we might be able to get it in for 6.10.
Thanks Chris. I wasn't really expecting a fix in RHEL6, but I didn't want it to be forgotten for RHEL7.
Closing this BZ for RHEL 6 per Comment 75. Work will continue on this issue in https://bugzilla.redhat.com/show_bug.cgi?id=1395803 for RHEL 7 which is still in production phase 1.