Progress so far:

* The latest code doesn't dump mem-pools, so memory pools are not accounted for in the statedump and we don't know how much memory they are occupying.
* The problem is reproducible in multiple setups. In the perf-testing setup we saw glusterd too consuming around 8+ GB of memory. It had 2012 transports associated with it; 1000 of those can be accounted for by the 1000 clients that keep a connection open for volfile fetching. The other 1000+ transports are still unaccounted for, and we need to confirm whether they are a leak or valid objects.

I am experimenting in my own setup for further clues.
I tried to add 3k bricks to a server process, and the resident memory consumption of the process went up to 5.5g (note that virtual memory consumption is 228g):

10205 root 20 0 228g 5.5g 804 t 0.0 71.9 3:35.32 glusterfsd

However, the command itself failed with a timeout. On checking the threads associated with the server process, I found it has 30k+ threads:

[root@unused shberry]# ls -lr /proc/10205/task/ | wc -l
30993

The different thread types and their counts are given below:

[root@unused shberry]# grep start_thread -B1 gdb.txt | grep -v start_thread | grep -v "\-\-" | grep -v posix_janitor_thread_proc | cut -d" " -f 5 | sort | uniq
br_stub_signth
br_stub_worker
changelog_ev_connector
changelog_ev_dispatch
event_dispatch_epoll_worker
gf_timer_proc
glusterfs_sigwaiter
index_worker
iot_worker
pool_sweeper
posix_fsyncer
posix_health_check_thread_proc
syncenv_processor

[root@unused shberry]# grep gf_timer_proc ./gdb.txt | wc -l
1
[root@unused shberry]# grep glusterfs_sigwaiter ./gdb.txt | wc -l
1
[root@unused shberry]# grep pool_sweeper ./gdb.txt | wc -l
1
[root@unused shberry]# grep event_dispatch_epoll_worker ./gdb.txt | wc -l
1024
[root@unused shberry]# grep index_worker ./gdb.txt | wc -l
2725
[root@unused shberry]# grep iot_worker ./gdb.txt | wc -l
2723
[root@unused shberry]# grep br_stub_signth ./gdb.txt | wc -l
2723
[root@unused shberry]# grep changelog_ev_connector ./gdb.txt | wc -l
2722
[root@unused shberry]# grep changelog_ev_dispatch ./gdb.txt | wc -l
8166
[root@unused shberry]# grep posix_fsyncer ./gdb.txt | wc -l
5444
[root@unused shberry]# grep posix_janitor_thread_proc ./gdb.txt | wc -l
2722
[root@unused shberry]# grep posix_health_check_thread_proc gdb.txt | wc -l
2722
[root@unused shberry]# grep syncenv_processor gdb.txt | wc -l
16
[root@unused shberry]# grep br_stub_worker gdb.txt | wc -l
2723

Though I am not sure exactly how much memory these threads account for, the number of threads is definitely a factor limiting how many bricks we can multiplex. (With the default 8 MiB pthread stack, ~31k threads alone would reserve roughly 240 GiB of address space, which is consistent with the 228g virtual size above.) I could see OOM and thread-creation failure messages in the server logs.
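The per-thread-type counts above were derived from a gdb dump. A lighter-weight way to get the same breakdown, sketched here against the current shell's pid as a stand-in (substitute the glusterfsd pid in practice), is to group the thread names exposed in /proc/&lt;pid&gt;/task/*/comm:

```shell
# Group a process's threads by name and count each type.
# $$ (the current shell) is a stand-in for the glusterfsd pid.
pid=$$
cat /proc/"$pid"/task/*/comm | sort | uniq -c | sort -rn
```

Unlike attaching gdb, this does not stop the process, so it can be run repeatedly while bricks are being attached.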
I did a small test by exiting all threads other than the epoll thread. I could add up to 3000 bricks to a single process and memory consumption was not as high:

# gluster volume info testvol
Volume Name: testvol
Type: Distribute
Volume ID: e62a1bae-ac50-4121-a77b-c034c0dec548
Status: Started
Snapshot Count: 0
Number of Bricks: 3001
Transport-type: tcp

top output:

21522 root 20 0 72.9g 2.7g 32m S 0.0 35.3 0:15.11 glusterfsd

Note that resident memory consumption is roughly half of what it was (2.7g compared to 5.5g earlier) and virtual memory is lower as well (72.9g compared to 228g earlier). However, when I tried removing 2000 bricks (as the mount was not responsive), glusterd crashed:

(gdb) bt
#0  0x0000003866a35935 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003866a370e8 in __GI_abort () at abort.c:91
#2  0x0000003866a74e8b in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x3866b78928 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003866a7ae16 in malloc_printerr (action=3, str=0x3866b78a60 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:5027
#4  0x0000003866b15c98 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7f85e0402760, seen_ipv6=seen_ipv6@entry=0x7f85e0402770, in6ai=in6ai@entry=0x7f85e04027b0, in6ailen=in6ailen@entry=0x7f85e04027c0) at ../sysdeps/unix/sysv/linux/check_pf.c:367
#5  0x0000003866adcf19 in __GI_getaddrinfo (name=name@entry=0x7f85a8a392f0 "booradley", service=<optimized out>, service@entry=0x0, hints=0x3866b73500, hints@entry=0x0, pai=pai@entry=0x7f85e0402968) at ../sysdeps/posix/getaddrinfo.c:2305
#6  0x00007f85eaeceab9 in glusterd_peerinfo_find_by_hostname (hoststr=hoststr@entry=0x7f85a8a392f0 "booradley") at glusterd-peer-utils.c:115
#7  0x00007f85eaecedd7 in glusterd_hostname_to_uuid (hostname=hostname@entry=0x7f85a8a392f0 "booradley", uuid=uuid@entry=0x7f85e0402a20 "") at glusterd-peer-utils.c:155
#8  0x00007f85eae1367f in glusterd_volume_brickinfo_get (uuid=uuid@entry=0x0, hostname=0x7f85a8a392f0 "booradley", path=0x7f85a8a394d0 "/home/export/testvol/1820", volinfo=volinfo@entry=0x16a51f0, brickinfo=brickinfo@entry=0x7f85e0402a78) at glusterd-utils.c:1448
#9  0x00007f85eae1384c in glusterd_is_brick_decommissioned (volinfo=volinfo@entry=0x16a51f0, hostname=<optimized out>, path=<optimized out>) at glusterd-utils.c:1511
#10 0x00007f85eae59bec in _xl_is_client_decommissioned (xl=0x7f85a8a38490, volinfo=0x16a51f0) at glusterd-volgen.c:3248
#11 0x00007f85eae59e49 in _graph_get_decommissioned_children (volinfo=volinfo@entry=0x16a51f0, children=children@entry=0x7f85e0402b58, dht=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at glusterd-volgen.c:3297
#12 0x00007f85eae5a0c0 in volgen_graph_build_dht_cluster (is_quotad=_gf_false, child_count=3001, volinfo=0x16a51f0, graph=<optimized out>) at glusterd-volgen.c:3390
#13 volume_volgen_graph_build_clusters (graph=graph@entry=0x7f85e0402e30, volinfo=volinfo@entry=0x16a51f0, is_quotad=is_quotad@entry=_gf_false) at glusterd-volgen.c:3602
#14 0x00007f85eae5a873 in client_graph_builder (graph=0x7f85e0402e30, volinfo=0x16a51f0, set_dict=0x7f85b7e7bc80, param=<optimized out>) at glusterd-volgen.c:3949
#15 0x00007f85eae51e6a in build_graph_generic (graph=graph@entry=0x7f85e0402e30, volinfo=0x16a51f0, mod_dict=0x7f85b7e80bb0, param=param@entry=0x0, builder=builder@entry=0x7f85eae5a7d0 <client_graph_builder>) at glusterd-volgen.c:1064
#16 0x00007f85eae53bfe in build_client_graph (mod_dict=<optimized out>, volinfo=<optimized out>, graph=0x7f85e0402e30) at glusterd-volgen.c:4167
#17 generate_single_transport_client_volfile (volinfo=volinfo@entry=0x16a51f0, filepath=filepath@entry=0x7f85e0403070 "/var/lib/glusterd/vols/testvol/trusted-testvol.tcp-fuse.vol", dict=dict@entry=0x7f85b7e80bb0) at glusterd-volgen.c:5362
#18 0x00007f85eae5cf5f in generate_client_volfiles (volinfo=volinfo@entry=0x16a51f0, client_type=client_type@entry=GF_CLIENT_TRUSTED) at glusterd-volgen.c:5534
#19 0x00007f85eae5e4ca in glusterd_create_volfiles (volinfo=volinfo@entry=0x16a51f0) at glusterd-volgen.c:6052
#20 0x00007f85eae5e5d1 in glusterd_create_volfiles_and_notify_services (volinfo=0x16a51f0) at glusterd-volgen.c:6078
#21 0x00007f85eae9575a in glusterd_op_remove_brick (dict=dict@entry=0x7f85d13747d0, op_errstr=op_errstr@entry=0x7f85e04061c0) at glusterd-brick-ops.c:2878
#22 0x00007f85eae04f30 in glusterd_op_commit_perform (op=op@entry=GD_OP_REMOVE_BRICK, dict=dict@entry=0x7f85d13747d0, op_errstr=op_errstr@entry=0x7f85e04061c0, rsp_dict=rsp_dict@entry=0x7f85b0706080) at glusterd-op-sm.c:6085
#23 0x00007f85eae9ba36 in gd_commit_op_phase (op=GD_OP_REMOVE_BRICK, op_ctx=op_ctx@entry=0x7f85d13747d0, req_dict=0x7f85d13747d0, op_errstr=op_errstr@entry=0x7f85e04061c0, txn_opinfo=txn_opinfo@entry=0x7f85e04061e0) at glusterd-syncop.c:1406
#24 0x00007f85eae9d035 in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7f85d13747d0, req=req@entry=0x7f85e00d9640) at glusterd-syncop.c:1926
#25 0x00007f85eae9d31b in glusterd_op_begin_synctask (req=req@entry=0x7f85e00d9640, op=op@entry=GD_OP_REMOVE_BRICK, dict=0x7f85d13747d0) at glusterd-syncop.c:1991
#26 0x00007f85eae901d4 in __glusterd_handle_remove_brick (req=req@entry=0x7f85e00d9640) at glusterd-brick-ops.c:1160
#27 0x00007f85eadee99d in glusterd_big_locked_handler (req=0x7f85e00d9640, actor_fn=0x7f85eae8f2f0 <__glusterd_handle_remove_brick>) at glusterd-handler.c:81
#28 0x00007f85f4e8cb12 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#29 0x0000003866a46370 in ?? () from /lib64/libc.so.6
#30 0x0000000000000000 in ?? ()
Hi,

I have checked the brick-side code paths with regard to memory leaks when stopping a volume or removing a brick while brick multiplexing is enabled. I think the following code paths leak:

1) In the notify of changelog.c, we need to clean up the RPC server/threads after calling changelog_cleanup_rpc.

2) In the server xlator, server_setvolume creates an inode table on the xlator:

       client->bound_xl->itable = inode_table_new (conf->inode_lru_limit, client->bound_xl);

   but we never clean up this inode table from the xlator when the volume is stopped or the brick is removed; I think we need to clean it up.

3) As of now, in the server xlator's notify we just update the next pointer to move the node out of the graph on receiving GF_EVENT_CLEANUP; I think this also needs to be done in a cleaner way:

       if (victim_found)
               (*trav_p) = (*trav_p)->next;
       glusterfs_mgmt_pmap_signout (glusterfsd_ctx, victim->name);
       glusterfs_autoscale_threads (THIS->ctx, -1);

Regards
Mohit Agrawal
After changing the code to not spawn so many threads per brick, I ran the following test (mount processes, glusterd and the server process all on the same node):

* turned brick-mux on
* created 1000 volumes, each with a single brick (1x1):

  [root@unused shberry]# gluster volume info | grep "Volume Name" | wc -l
  1000

* started all 1000 volumes
* mounted all 1000 volumes simultaneously

Iteration 1:
============
[root@unused shberry]# mount | grep glusterfs | wc -l
1000

Following is the memory consumption I observed (from top) after all clients were mounted:

9608 root 20 0 82.4g 2.4g 3140 S 0.0 31.6 0:21.78 glusterfsd
26323 root 20 0 1080m 272m 2604 S 0.0 3.5 1:45.59 glusterd

Number of threads in the server process:

[root@unused rhs-glusterfs]# ls -l /proc/9608/task | wc -l
2007

Iteration 2:
============
Memory consumption after unmounting and remounting all 1000 volumes:

9608 root 20 0 82.4g 2.0g 2876 S 0.0 25.5 0:32.16 glusterfsd
26323 root 20 0 1080m 218m 2548 S 0.0 2.8 1:46.26 glusterd

Iteration 3:
============
[root@unused rhs-glusterfs]# ls -l /proc/9608/task | wc -l
2010

Top output:

9608 root 20 0 82.4g 1.8g 2904 S 0.3 23.7 0:43.35 glusterfsd
26323 root 20 0 1080m 212m 2548 S 0.0 2.7 1:46.95 glusterd

So it looks like there is no leak during mount/umount operations. I am going to repeat the same exercise tomorrow after reverting all my code changes.
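The iteration checks above boil down to sampling the brick process's RSS between mount/umount cycles. A minimal sketch of that check, reading VmRSS from /proc/&lt;pid&gt;/status ($$ stands in here for the glusterfsd pid):

```shell
# Print a pid's resident set size in kB; compare before/after a
# mount/umount cycle to spot a leak.
rss_kb() { awk '/^VmRSS:/ {print $2}' /proc/"$1"/status; }

before=$(rss_kb $$)
# ... a full mount/umount cycle of the 1000 volumes would go here ...
after=$(rss_kb $$)
echo "RSS delta: $(( after - before )) kB"
```

A delta that keeps growing across iterations, rather than settling as it does in the numbers above, would indicate a leak.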
After reverting my changes, the following are the numbers from repeating the same test as in comment #7.

Iteration 1:
===========
7999 root 20 0 90.1g 1.6g 3080 S 21.4 21.4 24:00.70 glusterfsd
23628 root 20 0 1080m 203m 2552 S 0.0 2.6 1:42.92 glusterd

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12007

Iteration 2:
===========
7999 root 20 0 90.1g 1.3g 2840 S 17.4 17.4 25:19.69 glusterfsd
23628 root 20 0 1080m 147m 2532 S 0.0 1.9 1:43.57 glusterd

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12008

After umounting all clients:

7999 root 20 0 90.1g 1.3g 2868 S 0.7 17.4 25:43.06 glusterfsd
23628 root 20 0 1080m 147m 2532 S 0.0 1.9 1:43.62 glusterd

[root@unused glusterfs]# ls -l /proc/7999/task | wc -l
12008

There is not much change in memory consumption, though CPU consumption drops after unmounting.

Iteration 3:
============
7999 root 20 0 90.1g 1.2g 2588 S 19.7 15.4 26:12.50 glusterfsd
23628 root 20 0 1080m 133m 2284 S 0.0 1.7 1:44.21 glusterd

Compared to the tests in comment #7, memory consumption has not increased much (though there is around an 8g increase in virtual memory consumption). But there is a significant increase in CPU consumption (0-1% vs 15-20%).
Continuing with my tests. I do see changelog creating a listener socket for each brick. Following are the results on a server process with 1000 bricks multiplexed:

[root@unused glusterfs]# ps ax | grep -i testvol
16512 ?  Ssl  2:33 /usr/local/sbin/glusterfsd -s booradley --volfile-id testvol.booradley.home-export-testvol1 -p /var/lib/glusterd/vols/testvol/run/booradley-home-export-testvol1.pid -S /var/run/gluster/ee3bb56dcb51a65bac04476eb1442f60.socket --brick-name /home/export/testvol1 -l /var/log/glusterfs/bricks/home-export-testvol1.log --xlator-option *-posix.glusterd-uuid=6be6076b-cc08-4b2d-8631-e065fb03c7f5 --brick-port 49154 --xlator-option testvol-server.listen-port=49154
28909 pts/3  S+  0:00 grep --color=auto -i testvol

[root@unused glusterfs]# netstat -nap | grep 16512 | wc -l
1009
[root@unused glusterfs]# ls -l /proc/16512/fd | grep socket | wc -l
1009

A sample of the changelog listener sockets:

unix 2 [ ACC ] STREAM LISTENING 548603 16512/glusterfsd /var/run/gluster/changelog-377fa4ce7da06aa9729226d78dd305b7.sock
unix 2 [ ACC ] STREAM LISTENING 546555 16512/glusterfsd /var/run/gluster/changelog-c9d22cb783f54a5081d8b0d8ab63a5ad.sock
unix 2 [ ACC ] STREAM LISTENING 545275 16512/glusterfsd /var/run/gluster/changelog-47c31545b2485cd6adb8be622256b5a5.sock
unix 2 [ ACC ] STREAM LISTENING 544507 16512/glusterfsd /var/run/gluster/changelog-a2f0b5def53eb8f53af17a1b700dc69e.sock
unix 2 [ ACC ] STREAM LISTENING 556028 16512/glusterfsd /var/run/gluster/changelog-269ff320acbb33792dbc5b744761edb5.sock
unix 2 [ ACC ] STREAM LISTENING 549116 16512/glusterfsd /var/run/gluster/changelog-e5777c71634640f1fc1c6064a1fbad0c.sock
unix 2 [ ACC ] STREAM LISTENING 548860 16512/glusterfsd /var/run/gluster/changelog-e05ca44bd806abbd7d7df8fd8b53b511.sock

Note that no client was mounted when I ran this test.
For 1000 different volumes, when all volumes are started but without mounts:

[root@unused raghu]# ls -l /proc/12192/fd | grep socket | wc -l
1005

After mounting 1000 clients:

[root@unused raghu]# ls -l /proc/12192/fd | grep socket | wc -l
2005
After doing mount/umount cycles for a couple of hours, I don't see any increase in memory consumption or in the number of sockets. So there is no leak in the transport layer.
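The transport-layer check above reduces to watching the process's socket fd count across mount/umount cycles; a sketch ($$ stands in here for the brick pid):

```shell
# Count socket fds held by a pid; a count that keeps growing across
# mount/umount cycles would point to a transport-layer leak.
# (|| true keeps the pipeline's exit status clean when the count is 0.)
count_sockets() { ls -l /proc/"$1"/fd 2>/dev/null | grep -c socket || true; }
count_sockets $$
```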
From https://bugzilla.redhat.com/show_bug.cgi?id=1461543#c35

> Do comments 32 and 33 make us think about why we are seeing a spike in the
> memory consumption in the brick-mux scale setup, Raghavendra G?

I'll answer based on two observations by Sowmya:

> 1) With per-thread mem-pools now, as the process's thread count scales, the
> mem-pools allocated also scale at the rate of a power of 2. A similar issue
> may have been observed in the case of CNS too (as the volume count is
> increased, there is a drastic increase in VSS usage of the brick process).

Possible. There are around 10-12k threads in a single process which has 1k bricks multiplexed. Need to check with shberry whether the tests involved any I/O. If no I/O is involved, mem-pools may not necessarily be allocated, as they seem to be allocated during the first mem_get on an object provided by the pool.

> 2) The un-used memory should be cleaned up by the mem-pool sweeper, which may
> not be happening. This needs to be looked into.

This issue is not present in a brick-mux setup, as we use the glusterfsd binary and glusterfsd does indeed create a pool-sweeper thread while it is starting up. This issue is present only for gfapi setups.

@shekhar, @karthick, was it just a mount or was any I/O done during your tests?
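To illustrate why per-thread mem-pools matter at this scale, here is a back-of-envelope sketch. All figures are assumptions for illustration only (the thread count is from the observation above; the class count and per-pool size are hypothetical, not measured from the code): if every thread that has called mem_get holds its own set of power-of-2 size-class pools, worst-case pool memory grows linearly with the thread count:

```shell
# All figures are hypothetical; the point is the linear scaling in threads.
threads=12000      # ~thread count seen with 1k bricks multiplexed (see above)
classes=15         # assumed number of power-of-2 size classes per thread
per_pool_kb=64     # assumed memory held per size-class pool
echo "$(( threads * classes * per_pool_kb / 1024 )) MB worst case"
```

With these assumed figures the worst case is over 11 GB, the same order as the unexplained growth, so the hypothesis is at least plausible; only a statedump that includes mem-pools would confirm it.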
There was no I/O done in the setup.
upstream patch : https://review.gluster.org/17709
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/111802
The memory consumption seems to be consistent for about a week in build glusterfs-3.8.4-35.el7rhgs.x86_64. The brick process for 1000 volumes consumes less than 10 GB and doesn't rise; the glusterd process for 1000 volumes consumes up to 7.5 GB and doesn't rise. Moving the bug to verified based on the above observations.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774