Bug 1175617 - Glusterd gets killed by oom-killer because of memory consumption
Summary: Glusterd gets killed by oom-killer because of memory consumption
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-12-18 08:18 UTC by Mikko Tiainen
Modified: 2016-08-01 04:43 UTC
CC: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-08-01 04:42:36 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Mikko Tiainen 2014-12-18 08:18:24 UTC
Description of problem:
After upgrading GlusterFS from 3.5.2 to 3.6.1 in an environment where two Gluster volumes (one Distributed-Replicate and one Distribute) are configured as follows:

gluster volume info
 
Volume Name: ingest_vol
Type: Distributed-Replicate
Volume ID: acdd2208-5ed1-4729-9d27-923c42f22e2c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: passtorage1:/mnt/ingest/brick
Brick2: passtorage2:/mnt/ingest/brick
Brick3: passtorage3:/mnt/ingest/brick
Brick4: passtorage4:/mnt/ingest/brick
Options Reconfigured:
user.cifs: disable
performance.force-readdirp: off
cluster.extra-hash-regex: "(.*)\\.tmp"
performance.lazy-open: off
performance.strict-o-direct: on
performance.flush-behind: on
performance.read-ahead: on
performance.write-behind: on
performance.stat-prefetch: on
nfs.disable: on
 
Volume Name: storage_vol01
Type: Distribute
Volume ID: 946a01dd-5546-4e1a-b1c1-fd02fb5d157a
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: passtorage1:/mnt/storage/brick
Brick2: passtorage2:/mnt/storage/brick
Brick3: passtorage3:/mnt/storage/brick
Brick4: passtorage4:/mnt/storage/brick
Brick5: passtorage5:/mnt/storage/brick
Options Reconfigured:
user.cifs: disable
nfs.disable: on

On the machine named passtorage1, the glusterd component tries to allocate all the memory from the OS and gets killed by the oom-killer. Gluster ran for two and a half days under light load before glusterd crashed.

All machines have the same memory configuration:
cat /proc/meminfo 
MemTotal:       99025408 kB
cat /proc/swaps 
Filename				Type		Size	Used	Priority
/dev/dm-1                               partition	8388604	32568	-1
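A leak like this is easiest to catch while it is still growing rather than after the oom-killer fires. As a minimal sketch (assuming a POSIX shell with `ps` and `pgrep` available; `rss_kb` is a hypothetical helper, not part of GlusterFS), the resident set size of glusterd can be sampled periodically:

```shell
# rss_kb: print the resident set size (in kB) of the given pid.
# Hypothetical helper; sample it from cron or a loop and watch for
# steady growth long before the oom-killer steps in.
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# One sample of the glusterd process, if it is running.
pid=$(pgrep -x glusterd) && echo "glusterd RSS: $(rss_kb "$pid") kB" || true
```

Logging one sample per minute to a file would show whether the growth is steady (a leak) or spiky (workload-driven).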


The following glusterd logs were gathered from this incident:

passtorage1:
[2014-12-11 22:33:59.999976] E [glusterd-mgmt.c:127:gd_mgmt_v3_collate_errors] 0-management: Locking failed on passtorage4. Please check log file for details.

passtorage2:
[2014-12-11 22:38:34.095010] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.
[2014-12-11 22:38:34.095746] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f49103e0420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f4905d49228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f4905cbe1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f4905ca9980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f49101b5f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.096080] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f49103e0420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f4905d49228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f4905cbe1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f4905ca9980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f49101b5f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.096134] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: c20d61b6-0b70-4cae-a941-b4e5e5168548, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13

passtorage3:
[2014-12-11 22:38:34.094573] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f04a77ef420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f049d158228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f049d0cd1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f049d0b8980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f04a75c4f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.094759] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f04a77ef420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f049d158228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f049d0cd1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f049d0b8980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f04a75c4f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.094785] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13

[2014-12-11 22:38:34.094289] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.

passtorage4:
[2014-12-11 22:33:02.912414] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd1648fb420] (--> /usr/lib64/glusterf
s/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7fd15a264baa] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0x4eb9f)[0x7fd15a1e0b9f] (--> /usr/lib64/
glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_op_sm+0x1e5)[0x7fd15a1e4005] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0xeba44)[0x7fd15a27da44] ))))) 0-managem
ent: Lock for storage_vol01 held by b15935ea-4e92-42d5-9828-fedb1877a83a
[2014-12-11 22:33:02.912468] E [glusterd-op-sm.c:3058:glusterd_op_ac_lock] 0-management: Unable to acquire lock for storage_vol01
[2014-12-11 22:33:02.912539] E [glusterd-op-sm.c:6584:glusterd_op_sm] 0-management: handler returned: -1

passtorage5:
[2014-12-11 22:23:02.894479] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7fd204cebbaa] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0x4eb9f)[0x7fd204c67b9f] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_op_sm+0x1e5)[0x7fd204c6b005] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0xeba44)[0x7fd204d04a44] ))))) 0-management: Lock for ingest_vol held by b15935ea-4e92-42d5-9828-fedb1877a83a
[2014-12-11 22:23:02.894524] E [glusterd-op-sm.c:3058:glusterd_op_ac_lock] 0-management: Unable to acquire lock for ingest_vol
[2014-12-11 22:23:02.894594] E [glusterd-op-sm.c:6584:glusterd_op_sm] 0-management: handler returned: -1

[2014-12-11 22:38:34.094271] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.
[2014-12-11 22:38:34.094732] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7fd204ceb228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7fd204c601c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd204c4b980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7fd20e756f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.095098] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7fd204ceb228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7fd204c601c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd204c4b980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7fd20e756f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.095152] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: b35df837-1761-41dc-8e27-8d99c75dbe79, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13


Version-Release number of selected component (if applicable):
3.6.1 glusterd

How reproducible:
Not sure how to reproduce; the system ran for three days and then one glusterd process crashed.

Steps to Reproduce:
1. Upgrade GlusterFS to the 3.6.1 release.
2. Run Gluster until one glusterd process gets killed.

Actual results:
The glusterd process gets killed by the oom-killer after Gluster has been running for some time.

Expected results:
glusterd does not try to allocate all the memory from the OS, but runs with moderate memory consumption.

Additional info:

Comment 1 Atin Mukherjee 2014-12-24 04:16:22 UTC
Can you please attach the core file of the glusterd instance that crashed? Also, we have identified an area of code in the locking/unlocking path leading to memory leaks, which we are planning to fix in 3.6.2. The fix (http://review.gluster.org/#/c/9269/) is already available in the master branch.
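Even without a core file, a glusterd leak can be narrowed down from a statedump: GlusterFS processes write a dump of their memory accounting when they receive SIGUSR1 (by default under /var/run/gluster). A hedged sketch, assuming `pgrep` is available; `glusterd_statedump` is a hypothetical wrapper name, not a GlusterFS command:

```shell
# glusterd_statedump: ask the running glusterd to dump its memory
# accounting by sending SIGUSR1. GlusterFS writes the dump under
# /var/run/gluster by default; fails cleanly if glusterd is not running.
glusterd_statedump() {
    pid=$(pgrep -x glusterd) || { echo "glusterd not running" >&2; return 1; }
    kill -USR1 "$pid"
}
```

Two dumps taken a few hours apart can be diffed to see which allocation types grow, e.g. allocations in the locking/unlocking path mentioned above.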

Comment 2 Mikko Tiainen 2015-01-12 12:47:14 UTC
Hi,
unfortunately a core file is not available from the crash; core files were disabled on the running system at that time.
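For a future occurrence, core dumps can be re-enabled in advance so a crash leaves something to analyse. A minimal sketch (the paths and the systemd option named below are common Linux defaults, not taken from this report):

```shell
# Allow unlimited-size core files in the current shell (and in any
# glusterd started from it). A daemon started by the init system needs
# the equivalent limit set in its unit/init configuration instead.
ulimit -c unlimited

# Show the effective limit; "unlimited" means core files are on again.
ulimit -c
```

Where the core file lands is governed by /proc/sys/kernel/core_pattern; on systemd machines, `LimitCORE=infinity` in the glusterd service unit is the persistent equivalent.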

Comment 3 Atin Mukherjee 2016-08-01 04:42:36 UTC
This is not a security bug, and we are not going to fix it in 3.6.x; see
http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html

Comment 4 Atin Mukherjee 2016-08-01 04:43:56 UTC
If the issue persists in the latest releases, please feel free to clone this bug.

