Bug 1457936 - possible memory leak in glusterfsd with multiplexing
Summary: possible memory leak in glusterfsd with multiplexing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Mohammed Rafi KC
QA Contact: krishnaram Karthick
URL:
Whiteboard: 3.3.0-devel-freeze-exception, brick-m...
Depends On: 1426291 1467986
Blocks: 1417151
 
Reported: 2017-06-01 15:05 UTC by Atin Mukherjee
Modified: 2018-01-20 07:23 UTC
CC List: 15 users

Fixed In Version: glusterfs-3.8.4-33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1426291
Environment:
Last Closed: 2017-09-21 04:45:37 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2017:2774 (normal, SHIPPED_LIVE): glusterfs bug fix and enhancement update, last updated 2017-09-21 08:16:29 UTC

Comment 3 Raghavendra G 2017-06-15 04:51:22 UTC
Progress so far:

* The latest code doesn't dump mem-pools. So memory pools are not accounted for in the statedump and we don't know how much memory they occupy.
* The problem is reproducible on multiple setups. In the perf testing setup we saw glusterd too consuming around 8+ GB of memory, with 2012 transports associated with it. 1000 of those transports can be accounted for by the 1000 clients that keep a connection open for volfile fetching. The other 1000+ transports are still unaccounted for, and we need to confirm whether they are a leak or valid objects.

I am experimenting in my own setup for further clues.
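
For the first point, a sketch of how one can check a statedump for mem-pool accounting (it assumes the default statedump directory /var/run/gluster, and the exact section name may vary across versions):

# trigger a statedump of the brick process (either form works)
gluster volume statedump testvol
kill -USR1 <glusterfsd-pid>

# inspect the newest dump file for mem-pool sections
latest=$(ls -t /var/run/gluster/*.dump.* | head -1)
grep -c '\[mempool\]' "$latest"    # 0 on current builds, i.e. pools are not dumped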

Comment 4 Raghavendra G 2017-06-15 08:23:29 UTC
I tried to add 3k bricks to a server process, and resident memory consumption of the process went up to 5.5 GB (note that virtual memory consumption is 228 GB):


10205 root      20   0  228g 5.5g  804 t  0.0 71.9   3:35.32 glusterfsd 

However, the command itself failed with a timeout.

On checking the threads associated with the server process, I found it has 30k+ threads:
[root@unused shberry]# ls -lr /proc/10205/task/  | wc -l
30993

The different thread types and their counts are given below:

[root@unused shberry]# grep start_thread -B1 gdb.txt | grep -v start_thread | grep -v "\-\-" | grep -v posix_janitor_thread_proc | cut -d" " -f 5 | sort | uniq
br_stub_signth
br_stub_worker
changelog_ev_connector
changelog_ev_dispatch
event_dispatch_epoll_worker
gf_timer_proc
glusterfs_sigwaiter
index_worker
iot_worker
pool_sweeper
posix_fsyncer
posix_health_check_thread_proc
syncenv_processor

[root@unused shberry]# grep gf_timer_proc ./gdb.txt | wc -l
1
[root@unused shberry]# grep glusterfs_sigwaiter ./gdb.txt | wc -l
1
[root@unused shberry]# grep pool_sweeper ./gdb.txt | wc -l
1
[root@unused shberry]# grep event_dispatch_epoll_worker ./gdb.txt | wc -l
1024
[root@unused shberry]# grep index_worker ./gdb.txt | wc -l
2725
[root@unused shberry]# grep iot_worker ./gdb.txt | wc -l
2723
[root@unused shberry]# grep br_stub_signth ./gdb.txt | wc -l
2723
[root@unused shberry]# grep changelog_ev_connector ./gdb.txt | wc -l
2722
[root@unused shberry]# grep changelog_ev_dispatch ./gdb.txt | wc -l
8166
[root@unused shberry]# grep posix_fsyncer ./gdb.txt | wc -l
5444
[root@unused shberry]# grep posix_janitor_thread_proc ./gdb.txt | wc -l
2722
[root@unused shberry]# grep posix_health_check_thread_proc gdb.txt | wc -l
2722
[root@unused shberry]# grep syncenv_processor gdb.txt | wc -l
16
[root@unused shberry]# grep br_stub_worker gdb.txt | wc -l
2723

Though I am not sure how much memory these threads account for, the number of threads is definitely a factor that limits how many bricks we can multiplex.

I could see OOM and thread-creation failure messages in the server logs.
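
As a rough sanity check of how much the thread stacks alone contribute to the virtual size (a sketch; it assumes the glibc default 8 MB stack, which most but not all of these threads use, so the real figure is somewhat lower):

pid=10205                             # the brick process above
threads=$(ls /proc/$pid/task | wc -l)
stack_kb=$(ulimit -s)                 # 8192 KB on a default RHEL install
echo "$(( threads * stack_kb / 1024 / 1024 )) GB reserved for thread stacks"
# ~31k threads * 8 MB ~= 242 GB, the same ballpark as the 228g VIRT reported by top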

Comment 5 Raghavendra G 2017-06-15 10:24:20 UTC
I did a small test by exiting all threads other than the epoll threads. I could add up to 3000 bricks to a single process, and memory consumption was not as high.


# gluster volume info testvol
Volume Name: testvol
Type: Distribute
Volume ID: e62a1bae-ac50-4121-a77b-c034c0dec548
Status: Started
Snapshot Count: 0
Number of Bricks: 3001
Transport-type: tcp

top output:
21522 root      20   0 72.9g 2.7g  32m S  0.0 35.3   0:15.11 glusterfsd                                                                                                 

Note that resident memory consumption is roughly half of what it was earlier (2.7 GB compared to 5.5 GB) and virtual memory is also much lower (72.9 GB compared to 228 GB).

However, when I tried removing 2000 bricks (as the mount was not responsive), glusterd crashed:

 (gdb) bt
#0  0x0000003866a35935 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003866a370e8 in __GI_abort () at abort.c:91
#2  0x0000003866a74e8b in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x3866b78928 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003866a7ae16 in malloc_printerr (action=3, str=0x3866b78a60 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:5027
#4  0x0000003866b15c98 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7f85e0402760, seen_ipv6=seen_ipv6@entry=0x7f85e0402770, in6ai=in6ai@entry=0x7f85e04027b0, 
    in6ailen=in6ailen@entry=0x7f85e04027c0) at ../sysdeps/unix/sysv/linux/check_pf.c:367
#5  0x0000003866adcf19 in __GI_getaddrinfo (name=name@entry=0x7f85a8a392f0 "booradley", service=<optimized out>, service@entry=0x0, hints=0x3866b73500, 
    hints@entry=0x0, pai=pai@entry=0x7f85e0402968) at ../sysdeps/posix/getaddrinfo.c:2305
#6  0x00007f85eaeceab9 in glusterd_peerinfo_find_by_hostname (hoststr=hoststr@entry=0x7f85a8a392f0 "booradley") at glusterd-peer-utils.c:115
#7  0x00007f85eaecedd7 in glusterd_hostname_to_uuid (hostname=hostname@entry=0x7f85a8a392f0 "booradley", uuid=uuid@entry=0x7f85e0402a20 "")
    at glusterd-peer-utils.c:155
#8  0x00007f85eae1367f in glusterd_volume_brickinfo_get (uuid=uuid@entry=0x0, hostname=0x7f85a8a392f0 "booradley", path=0x7f85a8a394d0 "/home/export/testvol/1820", 
    volinfo=volinfo@entry=0x16a51f0, brickinfo=brickinfo@entry=0x7f85e0402a78) at glusterd-utils.c:1448
#9  0x00007f85eae1384c in glusterd_is_brick_decommissioned (volinfo=volinfo@entry=0x16a51f0, hostname=<optimized out>, path=<optimized out>) at glusterd-utils.c:1511
#10 0x00007f85eae59bec in _xl_is_client_decommissioned (xl=0x7f85a8a38490, volinfo=0x16a51f0) at glusterd-volgen.c:3248
#11 0x00007f85eae59e49 in _graph_get_decommissioned_children (volinfo=volinfo@entry=0x16a51f0, children=children@entry=0x7f85e0402b58, 
    dht=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at glusterd-volgen.c:3297
#12 0x00007f85eae5a0c0 in volgen_graph_build_dht_cluster (is_quotad=_gf_false, child_count=3001, volinfo=0x16a51f0, graph=<optimized out>) at glusterd-volgen.c:3390
#13 volume_volgen_graph_build_clusters (graph=graph@entry=0x7f85e0402e30, volinfo=volinfo@entry=0x16a51f0, is_quotad=is_quotad@entry=_gf_false)
    at glusterd-volgen.c:3602
#14 0x00007f85eae5a873 in client_graph_builder (graph=0x7f85e0402e30, volinfo=0x16a51f0, set_dict=0x7f85b7e7bc80, param=<optimized out>) at glusterd-volgen.c:3949
#15 0x00007f85eae51e6a in build_graph_generic (graph=graph@entry=0x7f85e0402e30, volinfo=0x16a51f0, mod_dict=0x7f85b7e80bb0, param=param@entry=0x0, 
    builder=builder@entry=0x7f85eae5a7d0 <client_graph_builder>) at glusterd-volgen.c:1064
#16 0x00007f85eae53bfe in build_client_graph (mod_dict=<optimized out>, volinfo=<optimized out>, graph=0x7f85e0402e30) at glusterd-volgen.c:4167
#17 generate_single_transport_client_volfile (volinfo=volinfo@entry=0x16a51f0, 
    filepath=filepath@entry=0x7f85e0403070 "/var/lib/glusterd/vols/testvol/trusted-testvol.tcp-fuse.vol", dict=dict@entry=0x7f85b7e80bb0) at glusterd-volgen.c:5362
#18 0x00007f85eae5cf5f in generate_client_volfiles (volinfo=volinfo@entry=0x16a51f0, client_type=client_type@entry=GF_CLIENT_TRUSTED) at glusterd-volgen.c:5534
#19 0x00007f85eae5e4ca in glusterd_create_volfiles (volinfo=volinfo@entry=0x16a51f0) at glusterd-volgen.c:6052
#20 0x00007f85eae5e5d1 in glusterd_create_volfiles_and_notify_services (volinfo=0x16a51f0) at glusterd-volgen.c:6078
#21 0x00007f85eae9575a in glusterd_op_remove_brick (dict=dict@entry=0x7f85d13747d0, op_errstr=op_errstr@entry=0x7f85e04061c0) at glusterd-brick-ops.c:2878
#22 0x00007f85eae04f30 in glusterd_op_commit_perform (op=op@entry=GD_OP_REMOVE_BRICK, dict=dict@entry=0x7f85d13747d0, op_errstr=op_errstr@entry=0x7f85e04061c0, 
    rsp_dict=rsp_dict@entry=0x7f85b0706080) at glusterd-op-sm.c:6085
#23 0x00007f85eae9ba36 in gd_commit_op_phase (op=GD_OP_REMOVE_BRICK, op_ctx=op_ctx@entry=0x7f85d13747d0, req_dict=0x7f85d13747d0, 
    op_errstr=op_errstr@entry=0x7f85e04061c0, txn_opinfo=txn_opinfo@entry=0x7f85e04061e0) at glusterd-syncop.c:1406
#24 0x00007f85eae9d035 in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7f85d13747d0, req=req@entry=0x7f85e00d9640) at glusterd-syncop.c:1926
#25 0x00007f85eae9d31b in glusterd_op_begin_synctask (req=req@entry=0x7f85e00d9640, op=op@entry=GD_OP_REMOVE_BRICK, dict=0x7f85d13747d0) at glusterd-syncop.c:1991
#26 0x00007f85eae901d4 in __glusterd_handle_remove_brick (req=req@entry=0x7f85e00d9640) at glusterd-brick-ops.c:1160
#27 0x00007f85eadee99d in glusterd_big_locked_handler (req=0x7f85e00d9640, actor_fn=0x7f85eae8f2f0 <__glusterd_handle_remove_brick>) at glusterd-handler.c:81
#28 0x00007f85f4e8cb12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#29 0x0000003866a46370 in ?? () from /lib64/libc.so.6
#30 0x0000000000000000 in ?? ()

Comment 6 Mohit Agrawal 2017-06-15 11:15:41 UTC
Hi,

I have checked the brick-side code paths with regard to memory leaks when a volume is stopped or a brick is removed while brick multiplexing is enabled. I think the following are some code paths where we leak memory:

1) In the notify function of changelog.c, we need to clean up the RPC server/threads after calling changelog_cleanup_rpc.

2) In the server xlator, in server_setvolume, we create an inode table on the xlator as below:
   client->bound_xl->itable =
                                inode_table_new (conf->inode_lru_limit,
                                                 client->bound_xl);
   But we do not clean this inode table up from the xlator at the time of stopping the volume or removing the brick; I think we need to clean this up.

3) As of now, in the server xlator's notify we just update the next pointer in the graph to move the node out of the graph when the GF_EVENT_CLEANUP event is received; I think this also needs to be handled in a cleaner way:

>>>>>>>>>>>>>
   if (victim_found)
           (*trav_p) = (*trav_p)->next;
   glusterfs_mgmt_pmap_signout (glusterfsd_ctx, victim->name);
   glusterfs_autoscale_threads (THIS->ctx, -1);

>>>>>>>>>>>>>>>>>>>
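
As a quick external check for the first path, the changelog listener socket count of the multiplexed brick process can be compared before and after stopping one of its volumes while other bricks keep the process alive (a sketch; the pid and "vol42" are placeholders):

pid=<glusterfsd-pid>
netstat -nap | grep "$pid/glusterfsd" | grep -c 'changelog-.*\.sock'   # before
gluster --mode=script volume stop vol42
netstat -nap | grep "$pid/glusterfsd" | grep -c 'changelog-.*\.sock'   # after
# if the count does not drop, the changelog rpc listener for that brick has leaked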
Regards
Mohit Agrawal

Comment 7 Raghavendra G 2017-06-15 12:32:35 UTC
After changing the code to not spawn so many threads per brick, I ran the following test (mount processes, glusterd and the server process all on the same node):

* turned brick-mux on
* created 1000 volumes, each with a single brick (1x1)
[root@unused shberry]# gluster volume info | grep "Volume Name" | wc -l
1000

* started all the 1000 volumes
* mounted all 1000 volumes simultaneously
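
A sketch of a setup loop for reproducing this (the brick paths and mount prefix below are placeholders, not the exact commands used here):

for i in $(seq 1 1000); do
    gluster volume create vol$i $(hostname):/bricks/vol$i/brick force
    gluster volume start vol$i
done
for i in $(seq 1 1000); do
    mkdir -p /mnt/vol$i
    mount -t glusterfs $(hostname):/vol$i /mnt/vol$i
done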

Iteration 1:
============

[root@unused shberry]# mount | grep glusterfs | wc -l
1000

The following is the memory consumption I observed (from top) after all clients were mounted:
 9608 root      20   0 82.4g 2.4g 3140 S  0.0 31.6   0:21.78 glusterfsd                                                                                                 
26323 root      20   0 1080m 272m 2604 S  0.0  3.5   1:45.59 glusterd                                                                                                   

Number of threads in server process:
[root@unused rhs-glusterfs]# ls -l /proc/9608/task | wc -l
2007

Iteration 2:
============

Memory consumption after unmounting and remounting all 1000 volumes:
 9608 root      20   0 82.4g 2.0g 2876 S  0.0 25.5   0:32.16 glusterfsd                                                                                                 
26323 root      20   0 1080m 218m 2548 S  0.0  2.8   1:46.26 glusterd                                                                                                   

Iteration 3:
============

[root@unused rhs-glusterfs]# ls -l /proc/9608/task | wc -l
2010

Top output:
 9608 root      20   0 82.4g 1.8g 2904 S  0.3 23.7   0:43.35 glusterfsd                                                                                                 
26323 root      20   0 1080m 212m 2548 S  0.0  2.7   1:46.95 glusterd                                                                                                   

So it looks like there is no leak during mount/unmount operations.

I am going to repeat the same exercise tomorrow after reverting all my code changes.

Comment 8 Raghavendra G 2017-06-16 12:05:38 UTC
After reverting my changes, the following are the numbers from repeating the same test as in comment #7.

Iteration 1:
===========
 7999 root      20   0 90.1g 1.6g 3080 S 21.4 21.4  24:00.70 glusterfsd                                                                                                 
23628 root      20   0 1080m 203m 2552 S  0.0  2.6   1:42.92 glusterd                                                                                                   

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12007

Iteration 2:
===========
 7999 root      20   0 90.1g 1.3g 2840 S 17.4 17.4  25:19.69 glusterfsd                                                                                                 
23628 root      20   0 1080m 147m 2532 S  0.0  1.9   1:43.57 glusterd                                                                                                   

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12008

After unmounting all clients:
 7999 root      20   0 90.1g 1.3g 2868 S  0.7 17.4  25:43.06 glusterfsd                                                                                                 
23628 root      20   0 1080m 147m 2532 S  0.0  1.9   1:43.62 glusterd                                                                                                   

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12008

Though there is not much change in memory consumption, CPU consumption is reduced after unmounting.

Iteration 3:
============
 7999 root      20   0 90.1g 1.2g 2588 S 19.7 15.4  26:12.50 glusterfsd                                                                                                 
23628 root      20   0 1080m 133m 2284 S  0.0  1.7   1:44.21 glusterd                                                                                                   

Compared to the tests in comment #7, memory consumption has not increased much (though virtual memory consumption is around 8 GB higher). But there is a significant increase in CPU consumption (0-1% there vs 15-20% here).

Comment 9 Raghavendra G 2017-06-29 06:20:52 UTC
Continuing with my tests, I see changelog creating a listener socket for each brick. The following are the results on a server process with 1000 bricks multiplexed:

[root@unused glusterfs]# ps ax | grep -i testvol
16512 ?        Ssl    2:33 /usr/local/sbin/glusterfsd -s booradley --volfile-id testvol.booradley.home-export-testvol1 -p /var/lib/glusterd/vols/testvol/run/booradley-home-export-testvol1.pid -S /var/run/gluster/ee3bb56dcb51a65bac04476eb1442f60.socket --brick-name /home/export/testvol1 -l /var/log/glusterfs/bricks/home-export-testvol1.log --xlator-option *-posix.glusterd-uuid=6be6076b-cc08-4b2d-8631-e065fb03c7f5 --brick-port 49154 --xlator-option testvol-server.listen-port=49154
28909 pts/3    S+     0:00 grep --color=auto -i testvol
[root@unused glusterfs]# netstat -nap | grep 16512 | wc -l
1009
[root@unused glusterfs]# ls -l /proc/16512/fd | grep socket | wc -l
1009

unix  2      [ ACC ]     STREAM     LISTENING     548603   16512/glusterfsd     /var/run/gluster/changelog-377fa4ce7da06aa9729226d78dd305b7.sock
unix  2      [ ACC ]     STREAM     LISTENING     546555   16512/glusterfsd     /var/run/gluster/changelog-c9d22cb783f54a5081d8b0d8ab63a5ad.sock
unix  2      [ ACC ]     STREAM     LISTENING     545275   16512/glusterfsd     /var/run/gluster/changelog-47c31545b2485cd6adb8be622256b5a5.sock
unix  2      [ ACC ]     STREAM     LISTENING     544507   16512/glusterfsd     /var/run/gluster/changelog-a2f0b5def53eb8f53af17a1b700dc69e.sock
unix  2      [ ACC ]     STREAM     LISTENING     556028   16512/glusterfsd     /var/run/gluster/changelog-269ff320acbb33792dbc5b744761edb5.sock
unix  2      [ ACC ]     STREAM     LISTENING     549116   16512/glusterfsd     /var/run/gluster/changelog-e5777c71634640f1fc1c6064a1fbad0c.sock
unix  2      [ ACC ]     STREAM     LISTENING     548860   16512/glusterfsd     /var/run/gluster/changelog-e05ca44bd806abbd7d7df8fd8b53b511.sock


Note that no clients were mounted when I did this test.

Comment 10 Raghavendra G 2017-06-29 07:09:26 UTC
For 1000 different volumes, when all volumes are started but no clients are mounted:
[root@unused raghu]# ls -l /proc/12192/fd | grep socket | wc -l
1005

After mounting 1000 clients:
[root@unused raghu]# ls -l /proc/12192/fd | grep socket | wc -l
2005

That is an increase of exactly 1000, i.e. one socket per client connection, which is expected rather than a sign of a leak.

Comment 11 Raghavendra G 2017-06-29 10:54:29 UTC
After running mount/umount cycles for a couple of hours, I don't see any increase in memory or socket consumption. So there is no leak in the transport layer.
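
A sketch of one way to run such a cycle while sampling RSS and socket counts between rounds (the volume names, mount prefix and pid are placeholders):

pid=<glusterfsd-pid>
for round in $(seq 1 50); do
    for i in $(seq 1 1000); do mount -t glusterfs $(hostname):/vol$i /mnt/vol$i; done
    for i in $(seq 1 1000); do umount /mnt/vol$i; done
    echo "round=$round $(grep VmRSS /proc/$pid/status) sockets=$(ls -l /proc/$pid/fd | grep -c socket)"
done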

Comment 12 Raghavendra G 2017-06-30 04:08:48 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1461543#c35

> Do comments 32 and 33 make us think about why we are seeing a spike in the
> memory consumption in the brick mux scale setup, Raghavendra G?

I'll answer based on two observations from Sowmya:

> 1) With per-thread mem-pools now, as the number of process threads scales, the mem-pools allocated also scale at the rate of a power of 2. A similar issue may have been observed in the case of CNS too (as the volume count is increased, there is a drastic increase in VSS usage of the brick process).

Possible. There are around 10-12k threads in a single process that has 1k bricks multiplexed. I need to check with shberry whether the tests involved any I/O. If no I/O is involved, mem-pools may not necessarily be allocated, as they seem to be allocated only on the first mem_get on an object provided by the pool.

> 2) The unused memory should be cleaned up by the mem-pool sweeper, which may not be happening. This needs to be looked into.

This issue is not present in a brick-mux setup, as we use the glusterfsd binary and glusterfsd does create a pool-sweeper thread while it is starting up. This issue is present only for gfapi setups.

@shekhar, @karthick, was it just mounts, or was any I/O done during your tests?

Comment 13 Atin Mukherjee 2017-06-30 04:10:47 UTC
There was no I/O done in the setup.

Comment 17 Mohammed Rafi KC 2017-07-05 17:51:17 UTC
upstream patch : https://review.gluster.org/17709

Comment 18 Atin Mukherjee 2017-07-10 07:29:05 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/111802

Comment 20 krishnaram Karthick 2017-07-31 16:14:47 UTC
The memory consumption seems to be consistent for about a week on build glusterfs-3.8.4-35.el7rhgs.x86_64.

The brick process for 1000 volumes consumes less than 10 GB and does not rise.
The glusterd process for 1000 volumes consumes up to 7.5 GB and does not rise.

Moving the bug to verified based on the above observation.

Comment 22 errata-xmlrpc 2017-09-21 04:45:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


