Bug 1268125
| Summary: | glusterd memory overcommit | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | ryanlee |
| Component: | glusterd | Assignee: | Kaushal <kaushal> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.4 | CC: | amukherj, bordas.csaba, bugs, kaushal, ryanlee |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.7.12 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1327751 1331289 (view as bug list) | Environment: | |
| Last Closed: | 2016-06-28 12:13:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1331289 | | |
| Bug Blocks: | 1327751, 1328375 | | |
| Attachments: | | | |
Description
ryanlee, 2015-10-01 21:21:17 UTC
Thanks for reporting the issue. However, we would need a statedump of the running glusterd instance once the hike is seen. Along with that, cmd_history.log is also expected, as it tells the commands performed in the cluster. Otherwise it's almost impossible to analyze what has caused the memory leak.

Created attachment 1080409 [details]
statedump of backup volume

Created attachment 1080410 [details]
statedump of other volume

I've generated the statedump files and cut out 6336 sections named xlator.features.locks.backup-locks.inode from backup.dump. I'm hoping the contents of the inode context and the specific names of clients aren't necessary bits of info and have removed them from the dump files.

The cmd_history.log is pretty empty. The memory starts leaking immediately on start/restart. We're currently mitigating by restarting the process once a week; the growth rate is quite steady without having done anything else. cmd_history does contain successes for updating volume settings (specifically, SSL allows) that our internal management system runs every half hour. We could probably stand to modify it to issue changes only if the values differ, but that doesn't look to me to be the source of the issue.

(In reply to ryanlee from comment #4)
> I've generated the statedump files and cut out 6336 sections named
> xlator.features.locks.backup-locks.inode from backup.dump. I'm hoping the
> contents of the inode context and the specific names of clients aren't
> necessary bits of info and have removed them from the dump files.

You would need to take a statedump of the glusterd process, not the clients. The attachment indicates the statedump is for clients. Would you be able to provide that?

When I searched for gluster and statedump, I found info for running

    % gluster volume statedump backup all
    % gluster volume statedump other all

on the server, which is what's provided.
If you need something else, you're going to have to be more precise about how I can make it for you, not just that you need it.

You could take a statedump of the glusterd instance with the following command:

    kill -SIGUSR1 <pid of glusterd instance>

With that, the statedump will be generated in /var/run/gluster with a name of the form glusterdump.<pid of glusterd>.<timestamp>. Hope this clarifies the confusion.

Created attachment 1081093 [details]
statedump of serverA glusterd

Created attachment 1081094 [details]
statedump of serverB glusterd
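The SIGUSR1 procedure described above can be wrapped in a small helper. This is a minimal sketch assuming the dump location and naming scheme quoted in the comment (`/var/run/gluster/glusterdump.<pid>.<timestamp>`); the function names `dump_pattern` and `trigger_statedump` are my own, not part of any Gluster tooling:

```python
import glob
import os
import signal
import time


def dump_pattern(pid):
    """Glob matching glusterd statedump files for a given pid; dumps are
    written to /var/run/gluster as glusterdump.<pid>.<timestamp>."""
    return "/var/run/gluster/glusterdump.%d.*" % pid


def trigger_statedump(pid):
    """Ask the glusterd process to dump its state via SIGUSR1, then return
    the newest matching dump file (or None if none appeared)."""
    os.kill(pid, signal.SIGUSR1)
    time.sleep(2)  # give glusterd a moment to write the file
    dumps = sorted(glob.glob(dump_pattern(pid)), key=os.path.getmtime)
    return dumps[-1] if dumps else None
```

A caller would typically obtain the pid with something like `pgrep -x glusterd` and then run `trigger_statedump(pid)` as root, since the signal must be delivered to glusterd itself rather than to a client process.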
Great, thanks. Perhaps the new attachments will be of more use.

The statedump helps; however, I did ask for the cmd_history.log & glusterd.log file as well. Along with it, could you also provide the output of gluster volume info?

(In reply to Atin Mukherjee from comment #11)
> however I did ask for the ... glusterd.log file as well

No, you didn't. But I already summarized the errors in it in the original bug report.

> cmd_history.log

I've already mentioned it's effectively empty in comment #4.

> the output of gluster volume info

I already provided it in the original bug report.

Still present in 3.7.6. I wonder if this is related to bug 1258931?

(In reply to ryanlee from comment #13)
> Still present in 3.7.6. I wonder if this is related to bug 1258931?

Not really, IMO. There is no memory leak/overcommit reported in that bug.

Sorry, I should have added a bit more context. Several months on, the overcommit spread to all Gluster clients and started forcing one particularly small node offline due to resource exhaustion. It maxed out at 1TB on serverA (probably because the system didn't allow it to commit any more).

Our requirement for SSL support was for secure access from an offsite Gluster client, but while enabling SSL provided a way in from one angle, its apparent memory-hogging side effects meant that client was nearly always disconnected anyway, so it wasn't going to work. We went to find a different solution so we could switch back to non-SSL mode for everything else.

It may not be the same bug, certainly, but I turned off SSL yesterday, and everything is back to normal: all of the memory overcommits on the servers and the clients are mercifully gone, and there's no longer a growth pattern in our monitoring graphs. Which is the type of related I meant. If not the same, they're both rooted in issues with SSL mode.
Is it possible the inability to connect to the self-heal daemon manifests as a constantly growing memory overcommit over the course of months?

REVIEW: http://review.gluster.org/14143 (socket: Reap own-threads) posted (#1) for review on release-3.7 by Kaushal M (kaushal)

COMMIT: http://review.gluster.org/14143 committed in release-3.7 by Jeff Darcy (jdarcy)

------

commit be41e31fefb92bf09c23efdb228000a2a1de28b5
Author: Kaushal M <kaushal>
Date: Wed Apr 27 16:12:49 2016 +0530

    socket: Reap own-threads

    Backport of f8948e2 from master

    Dead own-threads are reaped periodically (currently every minute).
    This helps avoid memory being leaked, and should help prevent memory
    starvation issues with GlusterD.

    Change-Id: Ifb3442a91891b164655bb2aa72210b13cee31599
    BUG: 1268125
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-originally-on: http://review.gluster.org/14101
    Reviewed-on: http://review.gluster.org/14143
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.12, please open a new bug report.

glusterfs-3.7.12 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
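The strategy named in the commit message above, periodically joining threads that have finished so their resources are reclaimed instead of accumulating, can be illustrated outside the C socket transport. This is a minimal Python sketch of the general technique, not GlusterFS code; all names are mine, and the reap interval is shortened from glusterd's one minute for demonstration:

```python
import threading


class OwnThreadReaper:
    """Periodically join threads that have exited, so each finished thread's
    resources are reclaimed instead of leaking (the 'reap own-threads' idea)."""

    def __init__(self, interval=0.05):
        self.interval = interval          # glusterd's real interval is ~60s
        self.threads = []                 # threads we are responsible for
        self.lock = threading.Lock()
        self._stop = threading.Event()
        self._reaper = threading.Thread(target=self._reap_loop, daemon=True)

    def start(self):
        self._reaper.start()

    def register(self, thread):
        """Track a started worker thread for later reaping."""
        with self.lock:
            self.threads.append(thread)

    def _reap_loop(self):
        # Wake up every `interval` seconds until stop() is called.
        while not self._stop.wait(self.interval):
            with self.lock:
                live = []
                for t in self.threads:
                    if t.is_alive():
                        live.append(t)    # still running; check again later
                    else:
                        t.join()          # finished; reclaim it now
                self.threads = live

    def stop(self):
        self._stop.set()
        self._reaper.join()
```

Without the reaper loop, every finished worker would linger in `self.threads` indefinitely, which mirrors the slow, steady growth described in this bug: the leak is small per connection but unbounded over months.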