Created attachment 599186 [details]
State dump of glusterfs process consuming ~6GB memory

Description of problem:
Once I start geo-replication, the corresponding glusterfs process starts eating memory until it is killed by the OOM killer after some hours.

Version-Release number of selected component (if applicable):
gluster 3.2.6, CentOS 6.2 in AWS

How reproducible:
Always

Steps to Reproduce:
1. Start geo-replication to a third node.
2. Do updates on the data so gsyncd crawls through the directories.
3. Wait.

Actual results:
glusterfs process memory consumption increases beyond all limits (such as performance.cache-size).

Expected results:
Stable memory consumption of glusterfs once cache-size is reached.

Additional info:
Bricks are on ecryptfs, sitting on XFS.

Current volume configuration:

Volume Name: varema
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: ea-nas-01:/brick01/varema
Brick2: ea-nas-02:/brick01/varema
Brick3: ea-nas-01:/brick02/varema
Brick4: ea-nas-02:/brick02/varema
Brick5: ea-nas-01:/brick03/varema
Brick6: ea-nas-02:/brick03/varema
Brick7: ea-nas-01:/brick04/varema
Brick8: ea-nas-02:/brick04/varema
Options Reconfigured:
performance.io-cache: off
performance.cache-max-file-size: 1073741824
performance.cache-min-file-size: 0
performance.stat-prefetch: off
performance.quick-read: off
performance.cache-refresh-timeout: 1
performance.read-ahead: off
geo-replication.indexing: on
nfs.disable: on
auth.allow: 10.0.2.*,127.*
network.ping-timeout: 10
performance.cache-size: 1073741824
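For completeness, a rough command-level sketch of the reproduction and monitoring steps above; the slave URL and the process-matching pattern are placeholders/assumptions, not taken from our actual setup:

-----
# start geo-replication from the master volume to the third node
# (slave URL is a placeholder)
gluster volume geo-replication varema <slave-host>::<slave-vol> start

# find the auxiliary glusterfs client mount that gsyncd starts; the exact
# command line differs between releases, but the geo-replication log path
# usually appears in it
ps aux | grep '[g]lusterfs' | grep geo-replication

# sample its resident memory while the data set is being updated
watch -n 60 'ps -o pid,rss,vsz,cmd -p <pid>'
-----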
Can you help us by capturing a statedump (send SIGUSR1: kill -USR1 <pid of the glusterfs process which consumes memory>) and attaching the resulting /tmp/glusterdump.<pid> file? That will help pinpoint the reason for the leak.

Also, now that the 3.3.0 release is out, is upgrading an option? If yes, we recommend it: it has multiple fixes for memory leaks and other issues overall.
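For reference, the capture amounts to something like the following (the /tmp path is the default in these releases; the pgrep pattern is only an example):

-----
# pick the glusterfs client process that keeps growing (pattern is an example)
PID=$(pgrep -f 'glusterfs.*varema' | head -n 1)

# ask it to write a statedump
kill -USR1 "$PID"

# with these releases the dump ends up under /tmp by default
ls -l /tmp/glusterdump."$PID"
-----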
(In reply to comment #1)
> Can you help us by capturing a statedump (send SIGUSR1: kill -USR1 <pid of
> the glusterfs process which consumes memory>) and attaching the resulting
> /tmp/glusterdump.<pid> file? That will help pinpoint the reason for the
> leak.

Is the one I already attached not OK?

> Also, now that the 3.3.0 release is out, is upgrading an option? If yes, we
> recommend it: it has multiple fixes for memory leaks and other issues
> overall.

Well, this is a live system now, and without a good reason I prefer not to update to a new major release that already has some known problems. Of course, if you can identify the problem and say definitively that it is fixed in 3.3.0, an update is an option.

I also wonder whether this is specifically related to geo-replication. Unfortunately there is not much information, apart from the source code, about how exactly geo-replication adds that extra load to the system. In the production system I see nearly constant 10 Mbit traffic between the nodes doing lookups and getxattr -- is self-healing triggered by the gsyncd crawl?
Thanks for the statedump. I had missed the mail on the attachment and hence asked; that one is fine.

Now, the question for you is: how many files are in the filesystem (in terms of inodes used, so include directory depth as well)? This has an impact on the geo-replication mount, because it traverses all filesystem entries and creates a corresponding inode entry for each in the client's inode table.

Next time you see a spike in memory usage, try issuing the command below on the system and check whether it has any impact:

# echo 2 > /proc/sys/vm/drop_caches
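If it helps, something along these lines should give the numbers I'm after (the brick and mount-point paths below are examples taken from the volume layout above, adjust as needed):

-----
# inodes in use on each brick filesystem (rough count of files + directories)
df -i /brick01 /brick02 /brick03 /brick04

# maximum directory depth below the client mount point (path is an example)
find /mnt/varema -type d | awk -F/ '{ if (NF > max) max = NF } END { print max }'
-----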
(In reply to comment #3)
> Thanks for the statedump. I had missed the mail on the attachment and hence
> asked; that one is fine.
>
> Now, the question for you is: how many files are in the filesystem (in
> terms of inodes used, so include directory depth as well)? This has an
> impact on the geo-replication mount, because it traverses all filesystem
> entries and creates a corresponding inode entry for each in the client's
> inode table.

There are 2366605 inodes used, with a maximum directory depth of 21 (for code sources kept on the volume) and a depth of 9 for the active files used by the application servers.

Would you expect moving those source files with the high directory depth into a separate volume to be a good workaround for this situation? ... though these files are not accessed/changed as often as the application data is.

> Next time you see a spike in memory usage, try issuing the command below on
> the system and check whether it has any impact:
>
> # echo 2 > /proc/sys/vm/drop_caches

Dropping dentries and inodes when the glusterfs process had consumed more than 9 GB freed about 1 GB the first time; after that, memory consumption kept growing and dropping caches had much less impact.
(In reply to comment #1)
> Can you help us by capturing a statedump (send SIGUSR1: kill -USR1 <pid of
> the glusterfs process which consumes memory>) and attaching the resulting
> /tmp/glusterdump.<pid> file? That will help pinpoint the reason for the
> leak.
>
> Also, now that the 3.3.0 release is out, is upgrading an option? If yes, we
> recommend it: it has multiple fixes for memory leaks and other issues
> overall.

I just installed Gluster 3.3.0 in our lab to see whether it also eats all the memory: I do a local Gluster mount on the server and run a simple shell script that creates directories and touches empty files.

After about 900k inodes in use I get

mktemp: failed to create directory via template `XXXXXX': Transport endpoint is not connected

and a core dump of a glusterfs process. I will attach it to this bug.
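Roughly, the script looked like the following. This is only a reconstruction from the description above, not the exact script; the path and loop counts are approximate:

-----
#!/bin/bash
# stress loop on a locally mounted GlusterFS client (path is an example)
cd /mnt/test-varema || exit 1

for i in $(seq 1 10000); do
    # this mktemp call is the one that failed with "Transport endpoint is
    # not connected"
    dir=$(mktemp -d XXXXXX) || exit 1
    for j in $(seq 1 100); do
        touch "$dir/file-$j"
    done
done
-----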
Created attachment 600507 [details]
glusterfs core dump of "stressed" local mount
Andreas,

The core does not help much on its own, as we would need the same binary bits to get the right information out of it.

Two possible workarounds to try and see whether they help. Before starting the next mount, do:

-----
gluster volume set varema quick-read off
gluster volume set varema io-cache off
gluster volume set varema stat-prefetch off
-----

Start the mount process after these commands complete, and do a volume stop/start.

Meanwhile, we are looking into this; in our testbed the leak is not hit this obviously.

If it is very frequent and you can help more with the debugging, the best way is to run the commands below:

step 0: gluster volume stop varema --mode=script; gluster volume start varema
step 1: bash# valgrind --leak-check=full --log-file=/tmp/valgrind.gluster-leak.log glusterfs -s <servername> --volfile-id varema /mount/point
step 2: run your tests; keep watching memory growth; once it goes above 6-7 GB, do umount /mount/point.

That would pinpoint the leak with the exact trace.
(In reply to comment #7)
> Andreas,
>
> The core does not help much on its own, as we would need the same binary
> bits to get the right information out of it.
>
> Two possible workarounds to try and see whether they help. Before starting
> the next mount, do:
>
> -----
> gluster volume set varema quick-read off
> gluster volume set varema io-cache off
> gluster volume set varema stat-prefetch off
> -----
>
> Start the mount process after these commands complete, and do a volume
> stop/start.
>
> Meanwhile, we are looking into this; in our testbed the leak is not hit
> this obviously.
>
> If it is very frequent and you can help more with the debugging, the best
> way is to run the commands below:
>
> step 0: gluster volume stop varema --mode=script; gluster volume start varema
> step 1: bash# valgrind --leak-check=full
> --log-file=/tmp/valgrind.gluster-leak.log glusterfs -s <servername>
> --volfile-id varema /mount/point

I can do this ... but how can this be done for the mount that geo-replication performs locally?

> step 2: run your tests; keep watching memory growth; once it goes above
> 6-7 GB, do umount /mount/point.
>
> That would pinpoint the leak with the exact trace.
Andreas,

Do 'gluster volume set varema memory-accounting on' and then restart the geo-rep session. When the memory usage is high, take a statedump (kill -SIGUSR1). Don't leave it to the OOM killer: the memory-accounting info lives in memory and will be lost with the process (unlike valgrind, where valgrind itself starts the glusterfs process).

Since it is hard to get geo-replication's mount process attached to valgrind, this option is the best one for now.

Meanwhile, did you see any improvement from switching off the performance translators?
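In command form, the suggested sequence is roughly the following (the slave URL is a placeholder; the statedump path assumes the default /tmp location):

-----
gluster volume set varema memory-accounting on

# restart the geo-rep session so the auxiliary mount picks up the option
# (slave URL is a placeholder)
gluster volume geo-replication varema <slave-host>::<slave-vol> stop
gluster volume geo-replication varema <slave-host>::<slave-vol> start

# when memory usage is high again, dump state before the OOM killer gets there
kill -USR1 <pid-of-growing-glusterfs-process>
ls -l /tmp/glusterdump.*
-----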
(In reply to comment #9)
> Andreas,
>
> Do 'gluster volume set varema memory-accounting on' and then restart the
> geo-rep session. When the memory usage is high, take a statedump (kill
> -SIGUSR1). Don't leave it to the OOM killer: the memory-accounting info
> lives in memory and will be lost with the process (unlike valgrind, where
> valgrind itself starts the glusterfs process).

I assume this is a 3.3 feature? The customer system where I see this problem runs 3.2.6, and it does not accept this option.

I'll try this in the lab; unfortunately, on the rather small VMs where I am testing Gluster it is extremely unstable and core-dumps regularly under higher load ...

> Since it is hard to get geo-replication's mount process attached to
> valgrind, this option is the best one for now.
>
> Meanwhile, did you see any improvement from switching off the performance
> translators?

I already tried that with 3.2.6 in the production customer system before opening this bug report, without seeing any improvement.
Andreas,

I wanted to check whether 3.3.1 fixes the issue for you (we have not seen any OOM kills in our in-house testing with a similar workload). Also, if a dry run of the 3.4.0qa builds (now at qa8) is possible, it would be great to evaluate the newer versions.

For now, closing the bug as WORKSFORME, as significant patches to handle leaks/crashes etc. have gone in.
reopen, user reports they still get OOM killed during geo-rep with 3.3.1 and 3.3.2qa3
(09:11:39 AM) social__: btw quick reproducer : mount gluster volume somewhere and run on it sysbench --num-threads=1 --max-requests=0 --init-rng=on --max-time=0 --test=fileio --file-total-size=$((1*10000))M --file-num=10000 --file-test-mode=seqwr run
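To confirm the client's memory growth while that reproducer runs, a simple sampling loop alongside it is enough; this is just a sketch, the output format is approximate:

-----
# sample the biggest glusterfs client's RSS (in KB) every 30 seconds
while true; do
    ps -o rss=,pid=,cmd= -C glusterfs | sort -rn | head -n 1
    sleep 30
done
-----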
Probably the same issue, but hit from another direction:

[root@vrana:~] gluster volume info

Volume Name: kokot
Type: Replicate
Volume ID: 92f0a0cb-d552-4df5-8dac-54e7eb1be5c6
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: havran.int.na.kokot.com:/mnt/kokot
Brick2: vrana.int.na.kokot.com:/mnt/kokot
Options Reconfigured:
nfs.disable: On
diagnostics.client-sys-log-level: WARNING
diagnostics.brick-sys-log-level: WARNING
diagnostics.client-log-level: NONE
diagnostics.brick-log-level: NONE
server.statedump-path: /mnt/tmp

[root@havran:/media/kokot] mount
...
havran.int.na.kokot.com:/kokot on /media/kokot type fuse.glusterfs (rw,allow_other,max_read=131072)

[root@havran:/media/kokot] sysbench --num-threads=1 --max-requests=0 --init-rng=on --max-time=0 --test=fileio --file-total-size=$((1*10000))M --file-num=10000 --file-test-mode=seqwr run
WARNING: Both max-requests and max-time are 0, running endless test
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.

Extra file open flags: 0
10000 files, 1Mb each
9.7656Gb total file size
Block size 16Kb
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
WARNING: Operation time (0.000000) is less than minimal counted value, counting as 1.000000
WARNING: Percentile statistics will be inaccurate
FATAL: Failed to write file! file: 8350 pos: 32768 errno = 103 ()
Done.

Operations performed: 0 Read, 534210 Write, 0 Other = 534210 Total
Read 0b Written 8.1514Gb Total transferred 8.1514Gb (80.566Mb/sec)
5156.20 Requests/sec executed

Test execution summary:
    total time:                          103.6053s
    total number of events:              534210
    total time taken by event execution: 74.6433
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.14ms
         max:                                177.90ms
         approx. 95 percentile:                0.20ms

Threads fairness:
    events (avg/stddev):           534210.0000/0.00
    execution time (avg/stddev):   74.6433/0.00

The error:

FATAL: Failed to write file! file: 8350 pos: 32768 errno = 103 ()

is caused by:

Out of memory: Kill process 26409 (glusterfs) score 906 or sacrifice child
The geo-replication memory leak can also be seen in 3.4.0.
==28620== 109,728 bytes in 1,143 blocks are definitely lost in loss record 216 of 232
==28620==    at 0x4A0577B: calloc (vg_replace_malloc.c:593)
==28620==    by 0x319A83B132: __gf_calloc (mem-pool.h:75)
==28620==    by 0x99F7126: cluster_getmarkerattr (libxlator.c:319)
==28620==    by 0x99EC340: dht_getxattr (dht-common.c:2281)
==28620==    by 0x319A81CB6A: default_getxattr (defaults.c:1083)
==28620==    by 0x319A81CB6A: default_getxattr (defaults.c:1083)
==28620==    by 0x319A81CB6A: default_getxattr (defaults.c:1083)
==28620==    by 0x319A81CB6A: default_getxattr (defaults.c:1083)
==28620==    by 0x319A81CB6A: default_getxattr (defaults.c:1083)
==28620==    by 0xA646A8D: mdc_getxattr (md-cache.c:1780)
==28620==    by 0xA851406: io_stats_getxattr (io-stats.c:2216)
==28620==    by 0x648D772: fuse_getxattr_resume (fuse-bridge.c:3194)
==5908== 387,120 (217,800 direct, 169,320 indirect) bytes in 605 blocks are definitely lost in loss record 304 of 322
==5908==    at 0x4A0577B: calloc (vg_replace_malloc.c:593)
==5908==    by 0x38B9A3B132: __gf_calloc (mem-pool.h:75)
==5908==    by 0x38B9A2B27A: __inode_create (inode.c:531)
==5908==    by 0x38B9A2B662: inode_new (inode.c:562)
==5908==    by 0x6482D85: fuse_resolve_entry (fuse-resolve.c:114)
==5908==    by 0x64831A7: fuse_resolve_parent (fuse-resolve.c:309)
==5908==    by 0x6483487: fuse_resolve (fuse-resolve.c:642)
==5908==    by 0x648374D: fuse_resolve_all (fuse-resolve.c:681)
==5908==    by 0x6483797: fuse_resolve_and_resume (fuse-resolve.c:721)
==5908==    by 0x649A117: fuse_thread_proc (fuse-bridge.c:4607)
==5908==    by 0x3B21E07850: start_thread (pthread_create.c:301)
==5908==    by 0xC8116FF: ???
REVIEW: http://review.gluster.org/6850 (libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk) posted (#1) for review on master by Lukáš Bezdička (lukas.bezdicka)
==418== 635 (264 direct, 371 indirect) bytes in 1 blocks are definitely lost in loss record 194 of 239
==418==    at 0x4A081D4: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==418==    by 0x4C577FC: __gf_default_calloc (mem-pool.h:75)
==418==    by 0x4C57C16: __gf_calloc (mem-pool.c:104)
==418==    by 0x4C41140: __inode_create (inode.c:531)
==418==    by 0x4C4122A: inode_new (inode.c:562)
==418==    by 0x6B5EA32: fuse_mkdir_resume (fuse-bridge.c:1450)
==418==    by 0x6B55D07: fuse_resolve_done (fuse-resolve.c:663)
==418==    by 0x6B55DDD: fuse_resolve_all (fuse-resolve.c:692)
==418==    by 0x6B55CD0: fuse_resolve (fuse-resolve.c:649)
==418==    by 0x6B55DB4: fuse_resolve_all (fuse-resolve.c:688)
==418==    by 0x6B55E3B: fuse_resolve_continue (fuse-resolve.c:708)
==418==    by 0x6B53FED: fuse_resolve_entry_cbk (fuse-resolve.c:97)
==418==    by 0xA2D4A7F: io_stats_lookup_cbk (io-stats.c:1479)
==418==    by 0xA0BCF0E: mdc_lookup_cbk (md-cache.c:780)
==418==    by 0x9CA96B0: qr_lookup_cbk (quick-read.c:438)
==418==    by 0x9A9597D: ioc_lookup_cbk (io-cache.c:255)
==418==    by 0x943A551: dht_lookup_everywhere_done (dht-common.c:805)
==418==    by 0x943B8FD: dht_lookup_everywhere_cbk (dht-common.c:1026)
==418==    by 0x91F6454: client3_3_lookup_cbk (client-rpc-fops.c:2630)
==418==    by 0x4EB5686: rpc_clnt_handle_reply (rpc-clnt.c:771)
==418==    by 0x4EB5A05: rpc_clnt_notify (rpc-clnt.c:891)
==418==    by 0x4EB2172: rpc_transport_notify (rpc-transport.c:497)
==418==    by 0x858E089: socket_event_poll_in (socket.c:2118)
==418==    by 0x858E51F: socket_event_handler (socket.c:2230)
==418==    by 0x4C83DFD: event_dispatch_epoll_handler (event-epoll.c:384)
==418==    by 0x4C83FE9: event_dispatch_epoll (event-epoll.c:445)
==418==    by 0x4C57694: event_dispatch (event.c:113)
==418==    by 0x4086DE: main (glusterfsd.c:1934)
==6400== 1,115,896 (627,840 direct, 488,056 indirect) bytes in 2,180 blocks are definitely lost in loss record 249 of 255
==6400==    at 0x4A081D4: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6400==    by 0x4C577FC: __gf_default_calloc (mem-pool.h:75)
==6400==    by 0x4C57C16: __gf_calloc (mem-pool.c:104)
==6400==    by 0x4C41140: __inode_create (inode.c:531)
==6400==    by 0x4C4122A: inode_new (inode.c:562)
==6400==    by 0x6B54093: fuse_resolve_entry (fuse-resolve.c:114)
==6400==    by 0x6B551B7: fuse_resolve_parent (fuse-resolve.c:309)
==6400==    by 0x6B55CA0: fuse_resolve (fuse-resolve.c:642)
==6400==    by 0x6B55D5F: fuse_resolve_all (fuse-resolve.c:681)
==6400==    by 0x6B55E79: fuse_resolve_and_resume (fuse-resolve.c:721)
==6400==    by 0x6B5816D: fuse_lookup (fuse-bridge.c:573)
==6400==    by 0x6B72696: fuse_thread_proc (fuse-bridge.c:4607)
==6400==    by 0x3375207F32: start_thread (pthread_create.c:309)
==6400==    by 0x3374AF4EAC: clone (clone.S:111)

==6400== 10,394 bytes in 2,187 blocks are definitely lost in loss record 210 of 255
==6400==    at 0x4A081D4: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6400==    by 0x4C577FC: __gf_default_calloc (mem-pool.h:75)
==6400==    by 0x4C57C16: __gf_calloc (mem-pool.c:104)
==6400==    by 0x4C3FF63: gf_strdup (mem-pool.h:140)
==6400==    by 0x4C40F80: __dentry_create (inode.c:489)
==6400==    by 0x4C41B26: __inode_link (inode.c:852)
==6400==    by 0x4C41C9C: inode_link (inode.c:892)
==6400==    by 0x6B53FBE: fuse_resolve_entry_cbk (fuse-resolve.c:89)
==6400==    by 0xA563A7F: io_stats_lookup_cbk (io-stats.c:1479)
==6400==    by 0xA34BF0E: mdc_lookup_cbk (md-cache.c:780)
==6400==    by 0x9F386B0: qr_lookup_cbk (quick-read.c:438)
==6400==    by 0x9D2497D: ioc_lookup_cbk (io-cache.c:255)
==6400==    by 0x96C7E56: dht_lookup_dir_cbk (dht-common.c:498)
==6400==    by 0x947E593: afr_lookup_done (afr-common.c:1977)
==6400==    by 0x947F144: afr_lookup_cbk (afr-common.c:2208)
==6400==    by 0x91F6454: client3_3_lookup_cbk (client-rpc-fops.c:2630)
==6400==    by 0x4EB5686: rpc_clnt_handle_reply (rpc-clnt.c:771)
==6400==    by 0x4EB5A05: rpc_clnt_notify (rpc-clnt.c:891)
==6400==    by 0x4EB2172: rpc_transport_notify (rpc-transport.c:497)
==6400==    by 0x858E089: socket_event_poll_in (socket.c:2118)
==6400==    by 0x858E51F: socket_event_handler (socket.c:2230)
==6400==    by 0x4C83DFD: event_dispatch_epoll_handler (event-epoll.c:384)
==6400==    by 0x4C83FE9: event_dispatch_epoll (event-epoll.c:445)
==6400==    by 0x4C57694: event_dispatch (event.c:113)
==6400==    by 0x4086DE: main (glusterfsd.c:1934)
==13552== 18,224 (7,056 direct, 11,168 indirect) bytes in 49 blocks are definitely lost in loss record 269 of 288
==13552==    at 0x4A081D4: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==13552==    by 0x4C4E655: __gf_calloc (mem-pool.h:75)
==13552==    by 0x4C3E59D: __inode_create (inode.c:531)
==13552==    by 0x4C3EB9A: inode_new (inode.c:562)
==13552==    by 0x6B527F4: fuse_create_resume (fuse-bridge.c:1942)
==13552==    by 0x6B48ED5: fuse_resolve_all (fuse-resolve.c:663)
==13552==    by 0x6B48C17: fuse_resolve (fuse-resolve.c:649)
==13552==    by 0x6B48F1D: fuse_resolve_all (fuse-resolve.c:688)
==13552==    by 0x6B483A2: fuse_resolve_continue (fuse-resolve.c:708)
==13552==    by 0x6B4860E: fuse_resolve_entry_cbk (fuse-resolve.c:97)
==13552==    by 0x98DBAC0: io_stats_lookup_cbk (io-stats.c:1505)
==13552==    by 0x96A224B: dht_lookup_everywhere_done (dht-common.c:817)
Ignore all the inode_new leaks; they get cleaned up by "echo 3 > /proc/sys/vm/drop_caches", so I was chasing my tail with those. The only real leak is the one addressed by the patch I sent.
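For anyone re-checking this, the quick test is to compare the client's RSS before and after dropping the kernel caches: if it barely shrinks, the growth is a real leak rather than cached inodes/dentries. The PID below is a placeholder:

-----
PID=<pid-of-the-glusterfs-client>        # placeholder

grep VmRSS /proc/$PID/status             # RSS before
sync
echo 3 > /proc/sys/vm/drop_caches        # drop pagecache + dentries/inodes
grep VmRSS /proc/$PID/status             # RSS after; a real leak barely shrinks
-----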
REVIEW: http://review.gluster.org/6872 (libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk) posted (#1) for review on release-3.5 by Lukáš Bezdička (lukas.bezdicka)
REVIEW: http://review.gluster.org/6873 (libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk) posted (#1) for review on release-3.4 by Lukáš Bezdička (lukas.bezdicka)
COMMIT: http://review.gluster.org/6850 committed in master by Anand Avati (avati)
------
commit 2f8a8cfc27e6a73ff0238681ea40563c8768d4e6
Author: Lukas Bezdicka <lukas.bezdicka>
Date:   Wed Jan 29 11:20:08 2014 +0100

    libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk

    Change-Id: Ide3574bf9f661b077f707990d47821feda18b1a3
    BUG: 841617
    Reviewed-on: http://review.gluster.org/6850
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
COMMIT: http://review.gluster.org/6872 committed in release-3.5 by Vijay Bellur (vbellur)
------
commit 358b88087051d3b2534446c62f6bbac472e9198b
Author: Lukas Bezdicka <lukas.bezdicka>
Date:   Wed Jan 29 11:20:08 2014 +0100

    libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk

    Change-Id: Ide3574bf9f661b077f707990d47821feda18b1a3
    BUG: 841617
    Reviewed-on: http://review.gluster.org/6872
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/6873 (libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk) posted (#2) for review on release-3.4 by Lukáš Bezdička (lukas.bezdicka)
COMMIT: http://review.gluster.org/6873 committed in release-3.4 by Kaleb KEITHLEY (kkeithle)
------
commit e8fca12ee9cc51ac8a4265454753a3c27dae93f8
Author: Lukas Bezdicka <lukas.bezdicka>
Date:   Wed Jan 29 11:20:08 2014 +0100

    libxlator: fix memleak in cluster_markerxtime_cbk and cluster_markeruuid_cbk

    Change-Id: Ide3574bf9f661b077f707990d47821feda18b1a3
    BUG: 841617
    Reviewed-on: http://review.gluster.org/6873
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Vijay Bellur <vbellur>
    Tested-by: Gluster Build System <jenkins.com>
After applying the patch to production we no longer see the leak, and it is running smoothly.
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be, or will soon become, available. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.4.3. Likewise, the recent glusterfs-3.5.0 release [3] is likely to include the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137