Bug 1175617

Summary: Glusterd gets killed by oom-killer because of memory consumption
Product: [Community] GlusterFS
Component: glusterd
Version: 3.6.1
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Assignee: bugs <bugs>
Reporter: Mikko Tiainen <mikko.tiainen>
CC: amukherj, bugs, mikko.tiainen, sasundar, vpvainio
Keywords: Triaged
Type: Bug
Doc Type: Bug Fix
Last Closed: 2016-08-01 04:42:36 UTC

Description Mikko Tiainen 2014-12-18 08:18:24 UTC
Description of problem:
After upgrading GlusterFS from 3.5.2 to 3.6.1 in an environment where two gluster volumes (one Distributed-Replicate and one Distribute) are configured as follows:

gluster volume info
 
Volume Name: ingest_vol
Type: Distributed-Replicate
Volume ID: acdd2208-5ed1-4729-9d27-923c42f22e2c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: passtorage1:/mnt/ingest/brick
Brick2: passtorage2:/mnt/ingest/brick
Brick3: passtorage3:/mnt/ingest/brick
Brick4: passtorage4:/mnt/ingest/brick
Options Reconfigured:
user.cifs: disable
performance.force-readdirp: off
cluster.extra-hash-regex: "(.*)\\.tmp"
performance.lazy-open: off
performance.strict-o-direct: on
performance.flush-behind: on
performance.read-ahead: on
performance.write-behind: on
performance.stat-prefetch: on
nfs.disable: on
 
Volume Name: storage_vol01
Type: Distribute
Volume ID: 946a01dd-5546-4e1a-b1c1-fd02fb5d157a
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: passtorage1:/mnt/storage/brick
Brick2: passtorage2:/mnt/storage/brick
Brick3: passtorage3:/mnt/storage/brick
Brick4: passtorage4:/mnt/storage/brick
Brick5: passtorage5:/mnt/storage/brick
Options Reconfigured:
user.cifs: disable
nfs.disable: on
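
For reference, the "Options Reconfigured" entries above are the result of gluster volume set commands; a minimal sketch using the volume names from the output above:

gluster volume set ingest_vol performance.write-behind on
gluster volume set storage_vol01 nfs.disable on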

On the machine named passtorage1, the glusterd process tries to allocate all the memory from the OS and gets killed by the oom-killer. Gluster ran for two and a half days under light load before glusterd was killed.

All machines have memory as follows:
cat /proc/meminfo 
MemTotal:       99025408 kB
cat /proc/swaps 
Filename				Type		Size	Used	Priority
/dev/dm-1                               partition	8388604	32568	-1
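
To narrow down which allocations grow, glusterd's resident memory can be sampled over time and a statedump of its memory accounting captured. A minimal sketch, assuming pidof is available; the sampling interval and log path are illustrative (statedumps are written to /var/run/gluster by default when glusterd receives SIGUSR1):

# Sample glusterd RSS/VSZ once a minute (interval and log path are examples)
while true; do
    date >> /var/log/glusterd-rss.log
    ps -o rss=,vsz= -p "$(pidof glusterd)" >> /var/log/glusterd-rss.log
    sleep 60
done

# Trigger a statedump of glusterd's allocations (SIGUSR1 is the
# glusterfs statedump signal; output goes to /var/run/gluster)
kill -USR1 "$(pidof glusterd)"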


The following glusterd logs were gathered from this incident:

passtorage1:
[2014-12-11 22:33:59.999976] E [glusterd-mgmt.c:127:gd_mgmt_v3_collate_errors] 0-management: Locking failed on passtorage4. Please check log file for details.

passtorage2:
[2014-12-11 22:38:34.095010] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.
[2014-12-11 22:38:34.095746] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f49103e0420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f4905d49228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f4905cbe1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f4905ca9980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f49101b5f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.096080] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f49103e0420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f4905d49228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f4905cbe1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f4905ca9980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f49101b5f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.096134] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: c20d61b6-0b70-4cae-a941-b4e5e5168548, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13

passtorage3:
[2014-12-11 22:38:34.094573] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f04a77ef420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f049d158228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f049d0cd1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f049d0b8980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f04a75c4f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.094759] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f04a77ef420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7f049d158228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7f049d0cd1c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f049d0b8980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f04a75c4f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.094785] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13

[2014-12-11 22:38:34.094289] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.

passtorage4:
[2014-12-11 22:33:02.912414] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd1648fb420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7fd15a264baa] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0x4eb9f)[0x7fd15a1e0b9f] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_op_sm+0x1e5)[0x7fd15a1e4005] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0xeba44)[0x7fd15a27da44] ))))) 0-management: Lock for storage_vol01 held by b15935ea-4e92-42d5-9828-fedb1877a83a
[2014-12-11 22:33:02.912468] E [glusterd-op-sm.c:3058:glusterd_op_ac_lock] 0-management: Unable to acquire lock for storage_vol01
[2014-12-11 22:33:02.912539] E [glusterd-op-sm.c:6584:glusterd_op_sm] 0-management: handler returned: -1

passtorage5:
[2014-12-11 22:23:02.894479] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7fd204cebbaa] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0x4eb9f)[0x7fd204c67b9f] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_op_sm+0x1e5)[0x7fd204c6b005] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(+0xeba44)[0x7fd204d04a44] ))))) 0-management: Lock for ingest_vol held by b15935ea-4e92-42d5-9828-fedb1877a83a
[2014-12-11 22:23:02.894524] E [glusterd-op-sm.c:3058:glusterd_op_ac_lock] 0-management: Unable to acquire lock for ingest_vol
[2014-12-11 22:23:02.894594] E [glusterd-op-sm.c:6584:glusterd_op_sm] 0-management: handler returned: -1

[2014-12-11 22:38:34.094271] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer b15935ea-4e92-42d5-9828-fedb1877a83a, in Peer in Cluster state, has disconnected from glusterd.
[2014-12-11 22:38:34.094732] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7fd204ceb228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7fd204c601c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd204c4b980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7fd20e756f11] ))))) 0-management: Lock for vol ingest_vol not held
[2014-12-11 22:38:34.095098] W [glusterd-locks.c:647:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fd20e981420] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x428)[0x7fd204ceb228] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x262)[0x7fd204c601c2] (--> /usr/lib64/glusterfs/3.6.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd204c4b980] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7fd20e756f11] ))))) 0-management: Lock for vol storage_vol01 not held
[2014-12-11 22:38:34.095152] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: b35df837-1761-41dc-8e27-8d99c75dbe79, lock held by: 765b1cb3-354d-4dd3-9ca5-59b6d0081e13
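
The warnings above indicate a stale mgmt_v3 volume lock held on behalf of peer b15935ea-4e92-42d5-9828-fedb1877a83a after it disconnected. A minimal recovery sketch, assuming a service-managed glusterd: these locks live only in glusterd's memory, so restarting the daemon on the node holding the stale lock releases them (brick processes are separate glusterfsd processes and are not restarted by this):

# Confirm which peers the local glusterd still sees as connected
gluster peer status

# Release in-memory mgmt_v3 locks by restarting the management daemon
# on the node that holds the stale lock
service glusterd restart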


Version-Release number of selected component (if applicable):
3.6.1 glusterd

How reproducible:
Not sure how to reproduce; the system was running for three days and then one glusterd process crashed.

Steps to Reproduce:
1. Upgrade GlusterFS to the 3.6.1 release
2. Run the cluster until one glusterd gets killed

Actual results:
The glusterd process gets killed by the oom-killer after the cluster has been running for some time.

Expected results:
glusterd does not try to allocate all the memory from the OS but runs with moderate memory consumption.

Additional info:

Comment 1 Atin Mukherjee 2014-12-24 04:16:22 UTC
Can you please attach the core file of the glusterd instance which crashed? Also, we have identified an area of code in the locking/unlocking path that leads to memory leaks, which we are planning to fix in 3.6.2. The fix (http://review.gluster.org/#/c/9269/) is already available in the master branch.

Comment 2 Mikko Tiainen 2015-01-12 12:47:14 UTC
Hi,
Unfortunately a core file is not available from the crash; at that moment core files were disabled on the running system.
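
For future incidents, core files can be re-enabled so the requested core can be captured. A minimal sketch; the dump directory below is an example, and note that a daemon started by init/systemd does not inherit a shell's ulimit, so the limit must also be set in the service configuration (e.g. DAEMON_COREFILE_LIMIT for SysV init scripts or LimitCORE= for systemd units):

# Allow unlimited-size cores in the current shell (inherited by children)
ulimit -c unlimited

# Write cores to a known location: %e = executable name, %p = PID
mkdir -p /var/crash/cores
echo '/var/crash/cores/core.%e.%p' > /proc/sys/kernel/core_pattern

# Restart glusterd so it picks up the core limit set in its service config
service glusterd restart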

Comment 3 Atin Mukherjee 2016-08-01 04:42:36 UTC
This is not a security bug, and we are not going to fix it in 3.6.x; see
http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html

Comment 4 Atin Mukherjee 2016-08-01 04:43:56 UTC
If the issue persists in the latest releases, please feel free to clone this bug.