Description of problem:
-----------------------
4-node Gluster cluster, with 4 clients mounting the volume via FUSE. The intent was to scale from 1*2 to 6*2 and then back to 1*2 amidst continuous I/O from the FUSE mounts.

On scaling up from 4*2 to 5*2, 4 of my brick processes crashed and the rebalance failed as well:

[root@gqas009 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed        0:13:31
      gqas015.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0            completed        0:14:27
      gqas014.sbu.lab.eng.bos.redhat.com                0        0Bytes           260             0             0            completed        0:13:1
      gqas010.sbu.lab.eng.bos.redhat.com            30338       162.4GB        177268             8             0               failed        8:16:47
volume rebalance: butcher: success
[root@gqas009 ~]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.4-11.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Actual results:
---------------
Brick processes crashed and migration/rebalance failed.

Expected results:
-----------------
No crashes and a clean rebalance.

Additional info:
----------------
*Client and Server OS* : RHEL 7.3

*Vol Status* :

[root@gqas009 ~]# gluster v status
Status of volume: butcher
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/A     N/A       N/A        N       N/A
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/A     N/A       N/A        N       N/A
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/A     N/A       N/A        N       N/A
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/A     N/A       N/A        N       N/A
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/A     49154     0          Y       24074
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/A     49154     0          Y       24472
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/A     49155     0          Y       24872
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/A     49155     0          Y       25346
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A     49153     0          Y       32088
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A     49153     0          Y       431
Self-heal Daemon on localhost                           N/A       N/A        Y       4098
Quota Daemon on localhost                               N/A       N/A        Y       4106
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       526
Quota Daemon on gqas015.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       535
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       32177
Quota Daemon on gqas014.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       32185
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       3803
Quota Daemon on gqas010.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       3802

Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 93e83af5-411b-4310-a90a-2d3290ffd6c2
Status               : failed
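For context, a minimal sketch of the add-brick / rebalance / remove-brick sequence used for this kind of scale-up and scale-down. The brick paths for the extra pair are illustrative, not taken from this report; the existing bricks are the ones listed in the volume status above:

# Illustrative brick pair for the next step (hypothetical paths)
B1=gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A
B2=gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A

# Scale up by one replica pair (e.g. 5*2 -> 6*2), then migrate data onto it
gluster volume add-brick butcher replica 2 $B1 $B2
gluster volume rebalance butcher start
gluster volume rebalance butcher status

# Scale back down: drain the pair first, then commit its removal
gluster volume remove-brick butcher $B1 $B2 start
gluster volume remove-brick butcher $B1 $B2 status
gluster volume remove-brick butcher $B1 $B2 commit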
************** EXACT WORKLOAD **************

Client 1 : tarball untar + recursive ls
Client 2 : Bonnie++
Client 3 : tarball untar
Client 4 : finds and Bonnie++
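Roughly, the per-client commands looked like the following sketch (the mount point, tarball, and Bonnie++ arguments are illustrative, not the exact ones used):

# Client 1: tarball untar plus a recursive listing of the result
tar xf linux-4.9.tar.xz -C /mnt/butcher && ls -lR /mnt/butcher > /dev/null

# Clients 2 and 4: Bonnie++ against the FUSE mount
bonnie++ -d /mnt/butcher/bonnie -u root

# Client 3: tarball untar only
tar xf linux-4.9.tar.xz -C /mnt/butcher/client3

# Client 4: metadata-heavy finds in parallel with Bonnie++
find /mnt/butcher -type f > /dev/null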
Assigning to Mohit as he has worked on the fix. The fix was to avoid dict ref leaks during the xattr invalidations done when md-cache settings are enabled.

Also, before jumping to conclusions, to be sure that the OOM kill was indeed caused by the ref leak in the upcall xlator, I suggest disabling the md-cache settings and re-running the tests. Thanks!
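For the suggested re-run, a sketch of how the md-cache/upcall related settings could be switched off. These are the standard cache-invalidation options from the GlusterFS md-cache documentation; confirm the exact names and prior values against the deployed 3.8.4 build before applying:

# Turn off server-side upcall notifications and client-side md-cache invalidation
gluster volume set butcher features.cache-invalidation off
gluster volume set butcher performance.cache-invalidation off
gluster volume set butcher performance.stat-prefetch off
# Revert the md-cache timeout to its default of 1 second (assuming it was raised)
gluster volume set butcher performance.md-cache-timeout 1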
(In reply to Soumya Koduri from comment #9)
> Assigning to Mohit as he has worked on the fix. The fix was to avoid dict
> ref leaks during the xattr invalidations done when md-cache settings are
> enabled.
>
> Also, before jumping to conclusions, to be sure that the OOM kill was indeed
> caused by the ref leak in the upcall xlator, I suggest disabling the
> md-cache settings and re-running the tests. Thanks!

This exercise is just to make sure that we do not overlook any other leaks in other code paths.
upstream mainline : http://review.gluster.org/16392
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/95350
Test Blocker for Scale Tests.
I scaled out from 1*2 to 6*2 and then back to 1*2 on 3.8.4-13 over FUSE. It worked seamlessly. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html