Bug 1401380

Summary: [Compound FOPs] : Memory leaks while doing deep directory creation
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: replicate
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, mchangir, pkarampu, ravishankar, rcyriac, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-10
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 05:54:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1401218, 1402212    
Bug Blocks: 1351528    

Description Nag Pavan Chilakam 2016-12-05 05:18:19 UTC
Description of problem:
===================
While doing deep directory creation from two clients, I observed memory consumption increasing and not getting released.


Version-Release number of selected component (if applicable):
=======
3.8.4-6

How reproducible:


Steps to Reproduce:
1. Create a 1x2 volume and enable compound FOPs
2. Mount the volume on two clients
3. Create the same directory structure from both clients in parallel (say, breadth 10 and depth 10)
4. Keep observing the fuse process memory consumption (a hedged reproduction sketch follows below)
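
A minimal reproduction sketch of the steps above, under stated assumptions: the server1/server2 hostnames, the /mnt/cfp mount point and the simple breadth/depth loop are illustrative placeholders; only the volume name, brick path and the cluster.use-compound-fops option follow the gluster output later in this bug.

# on the servers: create a 1x2 replica volume and enable compound FOPs
# (answer the replica-2 split-brain warning prompt if one appears)
gluster volume create cfp replica 2 server1:/rhs/brick1/cfp server2:/rhs/brick1/cfp
gluster volume set cfp cluster.use-compound-fops on
gluster volume start cfp

# on each of the two clients, mount the volume and run the same loop in parallel
mount -t glusterfs server1:/cfp /mnt/cfp
end=$((SECONDS + 900))                    # keep creating for roughly 15 minutes
i=0
while [ "$SECONDS" -lt "$end" ]; do
    i=$((i + 1))
    for b in $(seq 1 10); do              # breadth 10
        path=/mnt/cfp/run$i/dir$b
        for d in $(seq 1 10); do          # depth 10
            path=$path/level$d
        done
        mkdir -p "$path"
    done
done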

Comment 2 Nag Pavan Chilakam 2016-12-05 05:21:29 UTC
I found this while verifying bug 1393709 - [Compound FOPs] Client side IObuff leaks at a high pace consumes complete client memory and hence making gluster volume inaccessible.
That bug and this one appear to have different root causes, hence raising a new bug.

Comment 4 Nag Pavan Chilakam 2016-12-05 05:26:38 UTC
Krutika has an upstream bug for the same issue: https://bugzilla.redhat.com/show_bug.cgi?id=1401218 - Fix compound fops memory leaks

Giving QA_ack as this is a must-fix for compound FOPs to ship as a feature

Comment 6 Atin Mukherjee 2016-12-05 05:40:12 UTC
upstream mainline : http://review.gluster.org/16020
downstream : https://code.engineering.redhat.com/gerrit/#/c/92061/

Comment 7 Nag Pavan Chilakam 2016-12-05 09:55:28 UTC
sosreports
[qe@rhsqe-repo clients]$ pwd
/var/www/html/sosreports/nchilaka/bug.1401380/clients

Comment 8 Ravishankar N 2016-12-05 16:03:00 UTC
Moving it to POST as I will use the BZ to backport http://review.gluster.org/#/c/16029/

Comment 10 Nag Pavan Chilakam 2016-12-19 12:25:34 UTC
ON_QA validation:
While validating, I still found the memory consumption to be high.
Will have to fail the fix.
Below are the findings:
Clients:
client1 (rhs-client45):
Without CFOPs:
Mon Dec 19 17:01:35 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10306 root      20   0  778260  55824   4024 S  43.8  0.3  18:16.37 gluster+

Mon Dec 19 17:17:10 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10306 root      20   0  778260  76924   4036 S  43.8  0.5  24:17.27 gluster+

===>consumed about 20MB in 15min

with CFOPS:
Mon Dec 19 17:24:14 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6669 root      20   0  775092  53316   3908 S  37.5  0.3   0:05.77 gluster+
=========>started dir create 
Mon Dec 19 17:24:44 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6669 root      20   0  775092  71252   3908 S  31.2  0.4   0:17.29 gluster+
Mon Dec 19 17:25:14 IST 2016


Mon Dec 19 17:41:20 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6669 root      20   0 1299380 634920   3924 S  43.8  3.9   6:33.27 gluster+
====>stopped creations

============> increased by 550 MB in about 15min
####################################################################################
Client 2 (rhs-client24):
without cfops enabled
========>started dir create from 45 and 24 
Mon Dec 19 17:04:29 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
20323 root      20   0  696m  44m 3972 S 45.8  0.3   0:12.85 glusterfs     

Mon Dec 19 17:20:16 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
20323 root      20   0  761m  61m 3984 S 45.4  0.4   7:30.26 glusterfs 

===>consumed about 15MB in 15min



With cfops:
=========>started dir create 
Mon Dec 19 17:27:32 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 3599 root      20   0  696m  53m 3992 S 48.4  0.3   0:14.33 glusterfs       
Mon Dec 19 17:28:02 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 3599 root      20   0  696m  65m 3992 S 46.3  0.4   0:28.36 glusterfs       
====>stopped creations
Mon Dec 19 17:44:51 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 3599 root      20   0 1145m 442m 3992 S  0.0  2.8   7:59.31 glusterfs       
^C
====>consumed about 440 MB in 16min
####################################################################################
Server side:
brick1:
Without cfops enabled:
Mon Dec 19 17:08:18 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24524 root      20   0 1382176  37840   3132 S  50.0  0.5   2:17.61 gluster+
	
Mon Dec 19 17:23:23 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24524 root      20   0 1382176  62904   3160 S  50.0  0.8  10:10.55 gluster+

===>consumed about 25MB in 15min

with CFOPS:
Mon Dec 19 17:29:54 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
25058 root      20   0 1250076  39396   4312 S  37.5  0.5   0:07.09
=========>started dir create 
Mon Dec 19 17:30:25 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
25058 root      20   0 1250076  43620   4312 S  47.1  0.5   0:22.52
Mon Dec 19 17:30:55 IST 2016

Mon Dec 19 17:47:00 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
25058 root      20   0 1513248 191188   4328 S  56.2  2.4   8:40.34
===>consumed about 150 MB in 16min

################################################################################
brick2:
Without cfops enabled:
Mon Dec 19 17:08:29 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23156 root      20   0 1385296  51752   3124 S  50.0  0.6  12:41.95 gluster+


Mon Dec 19 17:23:34 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23156 root      20   0 1385296  75764   3128 S   0.0  0.9  19:54.83 gluster+

===>consumed about 15MB in 15min


==>with CFOPs:
Mon Dec 19 17:29:48 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24747 root      20   0 1250080  41108   4320 S  43.8  0.5   0:03.97 glusterfsd
=========>started dir create 


Mon Dec 19 17:46:54 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24747 root      20   0 1447716 193404   4336 S  50.0  2.4   8:37.14 glusterfsd
===>consumed about 140MB
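
For reference, snapshots like the ones above can be collected with a simple sampling loop; a minimal sketch, assuming $PID is set to the glusterfs/glusterfsd process being watched (the PID and the 30-second interval below are illustrative):

PID=6669                                  # example PID; substitute the real one
while :; do
    date
    # -b batch mode, -n 1 single iteration, -p restrict output to this PID;
    # tail skips the global summary lines (exact count can vary between top versions)
    top -b -n 1 -p "$PID" | tail -n +7
    sleep 30
done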

Comment 11 Nag Pavan Chilakam 2016-12-20 06:25:16 UTC
As discussed offline, I will be moving to failed-qa

Comment 12 Atin Mukherjee 2016-12-20 07:42:10 UTC
upstream mainline patch http://review.gluster.org/#/c/16210/ posted for review.

Comment 13 Atin Mukherjee 2016-12-21 10:04:18 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/93488/

Comment 14 Milind Changire 2016-12-22 18:47:59 UTC
BZ added to erratum https://errata.devel.redhat.com/advisory/24866
Moving to ON_QA

Comment 15 Nag Pavan Chilakam 2016-12-27 06:42:09 UTC
ON_QA validation:
==================
Reran the same testcase, i.e. simultaneous deep directory creations from two different clients (one RHEL 6.7 and the other RHEL 7.3, the same clients used when the BZ was raised).

Had a 1x2 volume with compound FOPs enabled as below:


[root@dhcp35-37 ~]# gluster v info cfp
 
Volume Name: cfp
Type: Replicate
Volume ID: d84aaa47-62d3-485b-92ff-83f3b5cfc46b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick1/cfp
Brick2: 10.70.35.116:/rhs/brick1/cfp
Options Reconfigured:
cluster.use-compound-fops: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@dhcp35-37 ~]# gluster v status cfp
Status of volume: cfp
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick1/cfp           49155     0          Y       7255 
Brick 10.70.35.116:/rhs/brick1/cfp          49155     0          Y       3352 
Self-heal Daemon on localhost               N/A       N/A        Y       7275 
Self-heal Daemon on dhcp35-116.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       3428 
Self-heal Daemon on dhcp35-239.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8537 
Self-heal Daemon on dhcp35-196.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       13185
Self-heal Daemon on dhcp35-135.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       14433
Self-heal Daemon on dhcp35-8.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       14371
 
Task Status of Volume cfp
------------------------------------------------------------------------------
There are no active volume tasks
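
For completeness, a hedged sketch of how the compound-FOPs option can be checked or toggled on an existing volume; the volume name and option name match the "Options Reconfigured" output above, everything else is generic gluster CLI usage:

gluster volume get cfp cluster.use-compound-fops
gluster volume set cfp cluster.use-compound-fops on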
 





I saw that over a span of about 15 min the memory consumed was as below:

client45:el7.3:

Mon Dec 26 20:00:12 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30215 root      20   0  775108 151512   3924 S  37.5  0.9  24:27.19 glusterfs

############## 
Mon Dec 26 20:15:17 IST 2016
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30215 root      20   0  775108 181924   3924 S  33.3  1.1  30:06.44 glusterfs


So the memory consumed was about 30 MB (RES went from 151512 KiB to 181924 KiB, a delta of ~30412 KiB)



client24:el6.7:
====
############## 
Mon Dec 26 20:00:14 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND               
15038 root      20   0  761m  88m 4000 S 43.5  0.6  27:32.00 glusterfs  
############## 
Mon Dec 26 20:15:31 IST 2016
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND               
15038 root      20   0  761m 103m 4000 S 51.6  0.6  34:31.00 glusterfs 

So the memory consumed was about 15 MB (RES went from 88m to 103m)



These numbers are almost equal to what the clients consume without compound FOPs enabled.

Hence marking as fixed.

test version:
glusterfs-libs-3.8.4-10.el7.x86_64
glusterfs-devel-3.8.4-10.el7.x86_64
glusterfs-api-devel-3.8.4-10.el7.x86_64
glusterfs-cli-3.8.4-10.el7.x86_64
glusterfs-3.8.4-10.el7.x86_64
glusterfs-api-3.8.4-10.el7.x86_64
glusterfs-fuse-3.8.4-10.el7.x86_64
glusterfs-client-xlators-3.8.4-10.el7.x86_64
glusterfs-rdma-3.8.4-10.el7.x86_64


The brick process memory consumption was normal too.

Comment 17 errata-xmlrpc 2017-03-23 05:54:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html