1145524 – glusterd OOM killed while performing rebalance/remove-brick operations

Summary: glusterd OOM killed while performing rebalance/remove-brick operations

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Gaurav Yadav
QA Contact:	storage-qa-internal@redhat.com
Docs Contact:
URL:
Whiteboard:	dht-try-latest-build, triaged
Depends On:
Blocks:

Reported:	2014-09-23 09:18 UTC by Shruti Sampat
Modified:	2017-12-27 11:50 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-09 02:34:31 UTC
Embargoed:
Dependent Products:

Attachments
sosreport (17.77 MB, application/x-xz) 2014-09-23 09:18 UTC, Shruti Sampat

Description Shruti Sampat 2014-09-23 09:18:30 UTC

Created attachment 940324
sosreport

Description of problem:
-------------------------

glusterd invoked the oom-killer on a node while repeated rebalance and remove-brick operations were run on a couple of volumes in a 4-node cluster.

Following is from dmesg output -
-------------------------------------------------------------------------------

[ 3668]     0  3668  2465581  1905446   0       0             0 glusterd
[22962]     0 22962   147184      860   1       0             0 gluster
[22977]     0 22977   123751     3426   2       0             0 glusterfs
[22984]     0 22984   187951      330   0       0             0 glusterfs
[22990]    29 22990     6622      178   3       0             0 rpc.statd
[23022]     0 23022    25227      125   0       0             0 sleep
Out of memory: Kill process 3668 (glusterd) score 904 or sacrifice child
Killed process 3668, UID 0, (glusterd) total-vm:9862324kB, anon-rss:7620504kB, file-rss:1280kB
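
The per-process lines above report memory in pages, while the "Killed process" line reports kB. The sketch below cross-checks the two for glusterd. It assumes the columns are pid, uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name (the header line is not in the excerpt) and a 4 kB page size, which is typical for x86_64 but not stated in this report. The names PAGE_KB and pages_to_kb are illustrative only.

```python
# Assumption: 4 kB pages (typical for x86_64; not stated in the report).
PAGE_KB = 4

# glusterd values copied from the dmesg excerpt (assumed columns: total_vm, rss).
GLUSTERD_TOTAL_VM_PAGES = 2465581
GLUSTERD_RSS_PAGES = 1905446

# Values copied from the "Killed process 3668" line, in kB.
KILLED_TOTAL_VM_KB = 9862324
KILLED_ANON_RSS_KB = 7620504
KILLED_FILE_RSS_KB = 1280


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count to kB under the assumed page size."""
    return pages * page_kb


total_vm_kb = pages_to_kb(GLUSTERD_TOTAL_VM_PAGES)
rss_kb = pages_to_kb(GLUSTERD_RSS_PAGES)

# If the column and page-size assumptions hold, both checks are True.
total_vm_matches = total_vm_kb == KILLED_TOTAL_VM_KB
rss_matches = rss_kb == KILLED_ANON_RSS_KB + KILLED_FILE_RSS_KB

rss_gib = rss_kb / (1024 * 1024)
print(f"total_vm: {total_vm_kb} kB, matches kill line: {total_vm_matches}")
print(f"rss: {rss_kb} kB ({rss_gib:.2f} GiB), matches kill line: {rss_matches}")
```

Under these assumptions the numbers are consistent: glusterd alone held roughly 7.27 GiB resident when it was killed, almost all of it anonymous memory.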

Find sosreport attached.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
-------------------

Saw it once.

Steps to Reproduce:
---------------------

1. Was running rebalance and remove-brick multiple times on a distributed volume (2 bricks), and a distributed-replicate volume (2 x 2)

Actual results:
-----------------

glusterd on one node invoked oom-killer.

Expected results:
------------------

glusterd is not expected to be oom-killed.

Additional info:

Comment 5 Gaurav Yadav 2017-07-24 11:45:51 UTC

The glusterd and cmd_history logs in the attached sosreport show that multiple commands were being executed on the same volume simultaneously. This causes transaction collisions, so locking fails, which sometimes results in the behaviour described above.

I tried to reproduce the issue locally with the steps below, but could not:
1. Create a 4 node cluster.
2. Repeatedly call rebalance and remove-brick from all the nodes (in a loop from one node).

Could you please try to recreate the issue by executing the rebalance and remove-brick commands from one node?
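
A single-node loop of this kind could be sketched as below. This is only an illustration and not the script used in this bug: the volume name, brick path and iteration count are hypothetical, and the function names build_commands and run_loop are made up for the example. It uses the `gluster volume rebalance <VOLNAME> start` and `gluster volume remove-brick <VOLNAME> <BRICK> start` subcommands, with `--mode=script` to suppress the interactive confirmation prompt.

```python
import subprocess


def build_commands(volume, bricks):
    """Return one round of commands: a rebalance start, then a remove-brick start."""
    base = ["gluster", "--mode=script", "volume"]
    return [
        base + ["rebalance", volume, "start"],
        base + ["remove-brick", volume] + list(bricks) + ["start"],
    ]


def run_loop(volume, bricks, iterations, runner=subprocess.run):
    """Issue the commands sequentially from this node and record the exit codes.

    Failures (for example, a command rejected because another operation is
    still in progress) are recorded rather than raised.
    """
    results = []
    for _ in range(iterations):
        for cmd in build_commands(volume, bricks):
            proc = runner(cmd, capture_output=True, text=True, check=False)
            results.append((cmd, proc.returncode))
    return results


# Hypothetical usage on one node of the cluster (names are placeholders):
#   results = run_loop("distvol", ["server1:/bricks/b2"], iterations=50)
#   failed = [r for r in results if r[1] != 0]
```

A real run would likely also need to wait for, commit or stop each operation between rounds. That is left out here to keep the sketch short.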

Comment 6 Shruti Sampat 2017-08-07 11:16:36 UTC

Hello, I do not work on this anymore, moving needinfo request to Rahul Hinduja.

Comment 8 Gaurav Yadav 2017-08-09 02:34:31 UTC

The issue could not be reproduced after multiple attempts, so I am closing this bug.
