Created attachment 940324
sosreport

Description of problem:
-------------------------
glusterd invoked the oom-killer on a node while repeated rebalance and remove-brick operations were run on a couple of volumes in a 4-node cluster.

Following is from the dmesg output:
-------------------------------------------------------------------------------
[ 3668]     0  3668  2465581  1905446   0   0   0 glusterd
[22962]     0 22962   147184      860   1   0   0 gluster
[22977]     0 22977   123751     3426   2   0   0 glusterfs
[22984]     0 22984   187951      330   0   0   0 glusterfs
[22990]    29 22990     6622      178   3   0   0 rpc.statd
[23022]     0 23022    25227      125   0   0   0 sleep
Out of memory: Kill process 3668 (glusterd) score 904 or sacrifice child
Killed process 3668, UID 0, (glusterd) total-vm:9862324kB, anon-rss:7620504kB, file-rss:1280kB

Find sosreport attached.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
-------------------
Saw it once.

Steps to Reproduce:
---------------------
1. Was running rebalance and remove-brick multiple times on a distributed volume (2 bricks) and a distributed-replicate volume (2 x 2).

Actual results:
-----------------
glusterd on one node invoked the oom-killer.

Expected results:
------------------
glusterd is not expected to be oom-killed.

Additional info:
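The dmesg figures above can be cross-checked with a few lines of Python. This is only a sketch: it assumes the usual oom-killer task dump column order (pid, uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name), which is not shown in the excerpt, and a 4 kB page size. The helper names are made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the x86_64 default

# Lines copied from the dmesg excerpt in this report.
TASK_LINE = "[ 3668] 0 3668 2465581 1905446 0 0 0 glusterd"
KILL_LINE = ("Killed process 3668, UID 0, (glusterd) total-vm:9862324kB, "
             "anon-rss:7620504kB, file-rss:1280kB")


def parse_task_line(line):
    """Parse one row of the task dump; total_vm and rss are counted in pages."""
    pid_part, rest = line.split("]", 1)
    # Assumed order: uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name
    fields = rest.split()
    return {
        "pid": int(pid_part.strip("[ ")),
        "name": fields[7],
        "total_vm_kb": int(fields[2]) * PAGE_KB,
        "rss_kb": int(fields[3]) * PAGE_KB,
    }


def parse_kill_line(line):
    """Extract the 'key:<n>kB' pairs from the 'Killed process' line."""
    return {key: int(val) for key, val in re.findall(r"([\w-]+):(\d+)kB", line)}


task = parse_task_line(TASK_LINE)
kill = parse_kill_line(KILL_LINE)

# Under the assumptions above, both views of the process should agree.
vm_matches = task["total_vm_kb"] == kill["total-vm"]
rss_matches = task["rss_kb"] == kill["anon-rss"] + kill["file-rss"]

print(task["name"], "resident set:", round(task["rss_kb"] / 1024 ** 2, 2), "GiB")
print("total-vm consistent:", vm_matches, "| rss consistent:", rss_matches)
```

Under these assumptions the two lines agree exactly, which puts glusterd's resident memory at roughly 7.3 GiB when it was killed, almost all of it anonymous memory.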
After going through the glusterd and cmd_history logs from the attached sosreport, it is evident that multiple commands were being executed on the same volume simultaneously. This results in transaction collisions, so locking fails, which sometimes leads to the behaviour mentioned above.

I tried to reproduce the issue on my local system with the steps below, but could not hit it:

1. Create a 4-node cluster.
2. Repeatedly call rebalance and remove-brick from all the nodes (in a loop from one node).

Could you please try to recreate the issue by executing the rebalance and remove-brick commands from one node?
Hello, I no longer work on this, so I am moving the needinfo request to Rahul Hinduja.
The issue could not be reproduced after multiple attempts, hence I am closing this bug.