Bug 1145524 - glusterd OOM killed while performing rebalance/remove-brick operations
Summary: glusterd OOM killed while performing rebalance/remove-brick operations
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Gaurav Yadav
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: dht-try-latest-build, triaged
Depends On:
Blocks:
 
Reported: 2014-09-23 09:18 UTC by Shruti Sampat
Modified: 2017-12-27 11:50 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-09 02:34:31 UTC
Embargoed:


Attachments (Terms of Use)
sosreport (17.77 MB, application/x-xz)
2014-09-23 09:18 UTC, Shruti Sampat

Description Shruti Sampat 2014-09-23 09:18:30 UTC
Created attachment 940324 [details]
sosreport

Description of problem:
-------------------------

glusterd invoked the oom-killer on a node while repeated rebalance and remove-brick operations were run on a couple of volumes in a 4-node cluster.

Following is from dmesg output -
-------------------------------------------------------------------------------

[ 3668]     0  3668  2465581  1905446   0       0             0 glusterd
[22962]     0 22962   147184      860   1       0             0 gluster
[22977]     0 22977   123751     3426   2       0             0 glusterfs
[22984]     0 22984   187951      330   0       0             0 glusterfs
[22990]    29 22990     6622      178   3       0             0 rpc.statd
[23022]     0 23022    25227      125   0       0             0 sleep
Out of memory: Kill process 3668 (glusterd) score 904 or sacrifice child
Killed process 3668, UID 0, (glusterd) total-vm:9862324kB, anon-rss:7620504kB, file-rss:1280kB

Find the sosreport attached.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
-------------------

Saw it once.

Steps to Reproduce:
---------------------

1. Ran rebalance and remove-brick multiple times on a distributed volume (2 bricks) and a distributed-replicate volume (2 x 2).
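The step above can be sketched roughly as follows; this is a hypothetical reproduction loop, not the reporter's exact commands, and the volume and brick names (dist-vol, server*:/bricks/...) are placeholders that assume an existing 4-node cluster:

```shell
#!/bin/bash
# Sketch: repeatedly run rebalance and remove-brick on a distributed volume.
# Assumes a live gluster cluster; dist-vol and brick paths are illustrative.
for i in $(seq 1 20); do
    # Trigger a full rebalance and wait for it to finish.
    gluster volume rebalance dist-vol start
    until gluster volume rebalance dist-vol status | grep -q completed; do
        sleep 10
    done

    # Start removing a brick (data is migrated off it), then stop the
    # removal instead of committing, so the brick stays in the volume
    # and the loop can repeat.
    gluster volume remove-brick dist-vol server2:/bricks/brick1 start
    gluster volume remove-brick dist-vol server2:/bricks/brick1 stop
done
```

Running such a loop concurrently from more than one node is what Comment 5 suspects caused the transaction collisions.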

Actual results:
-----------------

glusterd on one node invoked oom-killer.

Expected results:
------------------

glusterd is not expected to be oom-killed.

Additional info:

Comment 5 Gaurav Yadav 2017-07-24 11:45:51 UTC
After reviewing the glusterd and cmd_history logs from the attached sosreport, it is evident that multiple commands were being executed on the same volume simultaneously, causing transaction collisions. Locking therefore fails, which can sometimes result in the behaviour described above.

I tried to reproduce the issue locally by executing the steps below:
1. Create a 4-node cluster.
2. Repeatedly call rebalance and remove-brick from all the nodes (in a loop from one node).

Could you please try to recreate the issue by executing the rebalance and remove-brick commands from one node only?

Comment 6 Shruti Sampat 2017-08-07 11:16:36 UTC
Hello, I no longer work on this; moving the needinfo request to Rahul Hinduja.

Comment 8 Gaurav Yadav 2017-08-09 02:34:31 UTC
The issue could not be reproduced after multiple attempts, hence I am closing it.

