Bug 1395803 (rhel7-libcurl-unbounded-mem-issue)

Summary: provide a create object API that manages the object references.
Product: Red Hat Enterprise Linux 7 Reporter: Hubert Kario <hkario>
Component: nss Assignee: Bob Relyea <rrelyea>
Status: CLOSED ERRATA QA Contact: Hubert Kario <hkario>
Severity: high Docs Contact: Mirek Jahoda <mjahoda>
Priority: high    
Version: 7.3 CC: ccheney, charles.stanley, cww, dueno, hkario, jaskalnik, jch, john.haxby, kdudka, kengert, mgrepl, mjahoda, nkinder, nmavrogi, pandrade, rakesh.petkar, rmetrich, rrelyea, ryao, sandis.neilands, serge.savard, szidek, zhoujianxiong2
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: All   
Fixed In Version: nss-3.34.0-0.1.beta1.el7 Doc Type: Enhancement
Doc Text:
`PK11_CreateManagedGenericObject()` has been added to *NSS* to prevent memory leaks in applications
The `PK11_DestroyGenericObject()` function does not destroy objects allocated by `PK11_CreateGenericObject()` properly, but some applications depend on a function for creating objects that persist after the use of the object. For this reason, the *Network Security Services* (NSS) libraries now include the `PK11_CreateManagedGenericObject()` function. If you create objects with `PK11_CreateManagedGenericObject()`, the `PK11_DestroyGenericObject()` function also properly destroys underlying associated objects. Applications, such as the *curl* utility, can now use `PK11_CreateManagedGenericObject()` to prevent memory leaks.
Story Points: ---
Clone Of: 1057388
: 1510247 Environment:
Last Closed: 2018-04-10 09:23:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1057388, 1377248, 1420851, 1476743, 1510247    

Description Hubert Kario 2016-11-16 17:51:09 UTC
Same issue as described in Bug #1057388 exists when libcurl-devel-7.29.0-35.el7.x86_64 is used together with nss-3.21.3-2.el7_3.x86_64

+++ This bug was initially created as a clone of Bug #1057388 +++

Description of problem:

s3backer is a FUSE filesystem that exposes an Amazon S3 bucket as a file in your local filesystem. It relies on libcurl for communication with Amazon. RHEL's libcurl relies on nss as its TLS provider. Subjecting s3backer to prolonged stress on RHEL/CentOS 6.4 with the stock libcurl will result in unbounded memory consumption. Analysis with valgrind's memcheck tool shows no significant memory leaks. However, recompiling libcurl to use OpenSSL as its TLS provider instead of NSS makes the issue go away.

Version-Release number of selected component (if applicable):

How reproducible:

It happens every time.

Steps to Reproduce:

1. Spin up a RHEL/CentOS 6.x AWS instance and create an Amazon S3 bucket. In my case, the bucket has a 1M object size. Note that the only thing that is probably necessary here is an S3-like service, but to ensure reproducibility, I am providing precise parameters.
2. Install s3backer.
3. Mount the S3 bucket with a command such as the following; note sensitive details such as keys have been sanitized:

sudo s3backer --readAhead=0 --blockSize=1M --size=500G --listBlocks --vhost --baseURL=https://s3.amazonaws.com/ --timeout=15 --blockCacheThreads=10 --accessId=<accessId> --accessKey=<accessKey> --blockCacheSize=10 <bucket> <mountpoint>

4. Run dd on the mounted s3 bucket:

sudo dd if=/dev/urandom of=<mountpoint>/file bs=1M count=102400 iflag=nonblock,noatime oflag=nonblock,noatime

The workload where I originally observed this involved running a file system on top of s3backer that is not supported by Red Hat. However, the same behavior can be reproduced by running dd on top of s3backer. It becomes increasingly obvious when using /proc/$PID/status to view memory usage over a span of 2-4 hours. While the system running s3backer need not be on AWS, access to S3 from outside Amazon is quite slow and I highly recommend placing it there. A long-running program that exercises the same code paths in libcurl and nss as s3backer should also exhibit this problem.
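
For anyone repeating the measurement, the /proc/$PID/status check can be automated. The following is a minimal sketch, assuming a Linux /proc layout in which the VmRSS and VmHWM lines carry a value in kB; the function names status_field_kb and read_vm_usage are made up for this illustration and do not come from s3backer, libcurl, or NSS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value in kB of a field such as "VmRSS" from the text of
 * /proc/<pid>/status, or -1 if the field is absent or malformed.
 * (Hypothetical helper written for this illustration.) */
long status_field_kb(const char *status_text, const char *field)
{
    assert(status_text != NULL && field != NULL);
    size_t field_len = strlen(field);
    const char *line = status_text;

    while (*line != '\0') {
        /* A match must start a line and be followed directly by ':'. */
        if (strncmp(line, field, field_len) == 0 && line[field_len] == ':') {
            long kb;
            if (sscanf(line + field_len + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        const char *next = strchr(line, '\n');
        if (next == NULL)
            break;
        line = next + 1;
    }
    return -1;
}

/* Read /proc/<pid>/status and report VmRSS and VmHWM in kB.
 * Returns 0 on success, -1 if the file cannot be read or a field is missing. */
int read_vm_usage(long pid, long *rss_kb, long *hwm_kb)
{
    char path[64];
    char buf[8192];

    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    *rss_kb = status_field_kb(buf, "VmRSS");
    *hwm_kb = status_field_kb(buf, "VmHWM");
    return (*rss_kb >= 0 && *hwm_kb >= 0) ? 0 : -1;
}
```

Calling read_vm_usage() with the s3backer PID at a fixed interval (for example once a minute) and logging both values should show the pattern described in this report: VmRSS keeps climbing and VmHWM follows it instead of levelling off.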

Actual results:

VmRSS slowly grows until either the dd stops or the OOM killer triggers.

Expected results:

VmRSS should stabilize with VmHWM not changing after the first hour or so.

Additional info:

Analysis with Valgrind's massif tool shows multiple backtraces containing NSS in the same snapshot, and their share of memory usage grows with the number of instructions executed. Here is an excerpt of one of the backtraces:

| ->21.93% (13,056,000B) 0xE91C59A: ???
| | ->21.93% (13,056,000B) 0xE9218C7: ???
| |   ->21.93% (13,056,000B) 0xE928520: ???
| |     ->21.93% (13,056,000B) 0x37C56470C8: PK11_CreateNewObject (pk11obj.c:378)
| |       ->21.93% (13,056,000B) 0x37C5647361: PK11_CreateGenericObject (pk11obj.c:1415)
| |         ->21.93% (13,056,000B) 0x37C823F49E: nss_create_object (nss.c:349)
| |           ->21.93% (13,056,000B) 0x37C823F625: nss_load_cert (nss.c:383)
| |             ->21.93% (13,056,000B) 0x37C8240E0E: Curl_nss_connect (nss.c:1095)
| |               ->21.93% (13,056,000B) 0x37C8238480: Curl_ssl_connect (sslgen.c:185)
| |                 ->21.93% (13,056,000B) 0x37C8216EC9: Curl_http_connect (http.c:1796)
| |                   ->21.93% (13,056,000B) 0x37C821D680: Curl_protocol_connect (url.c:3077)
| |                     ->21.93% (13,056,000B) 0x37C8223B3A: Curl_connect (url.c:4743)
| |                       ->21.93% (13,056,000B) 0x37C822BBAE: Curl_perform (transfer.c:2523)
| |                         ->21.93% (13,056,000B) 0x409E30: http_io_perform_io (http_io.c:1437)
| |                           ->18.49% (11,009,280B) 0x40CB36: http_io_write_block (http_io.c:1348)
| |                           | ->18.49% (11,009,280B) 0x407422: ec_protect_write_block (ec_protect.c:433)
| |                           |   ->18.49% (11,009,280B) 0x404CFD: block_cache_worker_main (block_cache.c:1090)
| |                           |     ->18.49% (11,009,280B) 0x379120784F: start_thread (pthread_create.c:301)
| |                           |       ->18.49% (11,009,280B) 0x3790AE894B: clone (clone.S:115)

Recompiling libcurl with OpenSSL as its TLS provider makes the issue go away, which strongly suggests a problem either in NSS or in how libcurl uses NSS.

I am not familiar with the NSS codebase and I am concerned about the possibility of introducing a security hole should I attempt to patch this myself. Red Hat employees in #rhel on freenode informed me that Red Hat employs multiple NSS developers and suggested that I file a report here for them.

Comment 5 Bob Relyea 2017-03-27 21:55:17 UTC
Two fixes are needed for this problem, one in nsspem and one in nss. The nss portion needs an upstream patch because it changes the ABI semantics. Moving to 7.5.

Comment 9 Bob Relyea 2017-08-28 20:48:25 UTC
We had better fix it for 7.5. I'll dev ack it.

Comment 10 Bob Relyea 2017-11-07 16:50:37 UTC
Fixed in nss-3.34.0-0.1.beta1.el7
The new API is PK11_CreateManagedGenericObject.
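
The ownership difference that the Doc Text of this bug describes can be pictured with a small toy model. This is only a sketch under stated assumptions: every type and function below (toy_slot, toy_handle, toy_create_generic_object, and so on) is hypothetical and is not NSS code. The sketch models a handle that either does or does not own the underlying object, which is how the Doc Text characterizes PK11_CreateGenericObject() versus PK11_CreateManagedGenericObject().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model only: these types and functions are hypothetical and are not
 * part of NSS. They mimic the ownership rule described in the Doc Text. */
typedef struct {
    int live_objects;   /* objects currently held by the "slot" */
} toy_slot;

typedef struct {
    toy_slot *slot;
    bool managed;       /* true: destroying the handle destroys the object */
} toy_handle;

static toy_handle *toy_create(toy_slot *slot, bool managed)
{
    toy_handle *h = malloc(sizeof(*h));
    if (h == NULL)
        return NULL;
    h->slot = slot;
    h->managed = managed;
    slot->live_objects++;   /* the underlying object now exists in the slot */
    return h;
}

/* Stand-in for PK11_CreateGenericObject(): object outlives its handle. */
toy_handle *toy_create_generic_object(toy_slot *slot)
{
    return toy_create(slot, false);
}

/* Stand-in for PK11_CreateManagedGenericObject(): handle owns the object. */
toy_handle *toy_create_managed_generic_object(toy_slot *slot)
{
    return toy_create(slot, true);
}

/* Stand-in for PK11_DestroyGenericObject(). */
void toy_destroy_generic_object(toy_handle *h)
{
    if (h == NULL)
        return;
    if (h->managed)
        h->slot->live_objects--;  /* release the underlying object too */
    free(h);
}

/* Simulate `connections` connections that each load one object and then
 * destroy its handle; return how many objects the slot still holds. */
int toy_objects_left_after(int connections, bool managed)
{
    assert(connections >= 0);
    toy_slot slot = { 0 };
    for (int i = 0; i < connections; i++) {
        toy_handle *h = managed ? toy_create_managed_generic_object(&slot)
                                : toy_create_generic_object(&slot);
        toy_destroy_generic_object(h);
    }
    return slot.live_objects;
}
```

In this model, the unmanaged path leaves one object behind per simulated connection, which matches the steady growth under nss_load_cert in the massif backtrace, while the managed path leaves none.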

Comment 19 errata-xmlrpc 2018-04-10 09:23:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.