Description of problem:
===================
While doing deep directory creation from two clients, I observed the fuse client's memory consumption increasing and not getting released.

Version-Release number of selected component (if applicable):
=======
3.8.4-6

How reproducible:

Steps to Reproduce:
1. Create a 1x2 volume and enable compound fops.
2. Mount the volume on two clients.
3. Create the same directory structure from both clients in parallel (say breadth 10 and depth 10) and keep observing the fuse process memory consumption (a rough command-line sketch follows below).
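A minimal reproduction sketch, assuming placeholder server names, brick paths, volume name and mount point (none of these are taken from this report):

# On one of the servers: create the 1x2 (replica 2) volume and enable compound fops.
gluster volume create testvol replica 2 server1:/bricks/brick1/testvol server2:/bricks/brick1/testvol
gluster volume start testvol
gluster volume set testvol cluster.use-compound-fops on

# On each of the two clients:
mount -t glusterfs server1:/testvol /mnt/testvol

# From both clients in parallel, create the same tree (breadth 10, depth 10)
# and watch the fuse client's RES in top while it runs.
mkdeep() {                        # mkdeep <parent-dir> <remaining-depth>
    [ "$2" -le 0 ] && return
    for i in $(seq 1 10); do
        mkdir -p "$1/dir$i"
        mkdeep "$1/dir$i" $(($2 - 1))
    done
}
mkdeep /mnt/testvol 10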
I found this while verifying bug 1393709 ([Compound FOPs] Client-side IObuff leaks at a high pace, consuming complete client memory and hence making the gluster volume inaccessible). That bug and this one appear to have different root causes, hence raising a new bug.
Krutika has an upstream bug for the same issue: https://bugzilla.redhat.com/show_bug.cgi?id=1401218 - Fix compound fops memory leaks. Giving qa_ack as this is a must-fix for compound fops to go out as a feature.
Upstream mainline: http://review.gluster.org/16020
Downstream: https://code.engineering.redhat.com/gerrit/#/c/92061/
sosreports:
[qe@rhsqe-repo clients]$ pwd
/var/www/html/sosreports/nchilaka/bug.1401380/clients
Moving it to POST as I will use the BZ to backport http://review.gluster.org/#/c/16029/
On_qa validation:
=================
While validating, I still found memory consumption to be high, so I will have to fail the fix. Below are the findings:

Clients:

client1 (rhs-client45):

Without CFOPs:
Mon Dec 19 17:01:35 IST 2016
  PID USER PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
10306 root 20  0 778260 55824 4024 S 43.8  0.3 18:16.37 gluster+

Mon Dec 19 17:17:10 IST 2016
  PID USER PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
10306 root 20  0 778260 76924 4036 S 43.8  0.5 24:17.27 gluster+
===> consumed about 20MB in 15min

With CFOPs:
Mon Dec 19 17:24:14 IST 2016
  PID USER PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
 6669 root 20  0 775092 53316 3908 S 37.5  0.3  0:05.77 gluster+
=========> started dir create

Mon Dec 19 17:24:44 IST 2016
  PID USER PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
 6669 root 20  0 775092 71252 3908 S 31.2  0.4  0:17.29 gluster+

Mon Dec 19 17:25:14 IST 2016

Mon Dec 19 17:41:20 IST 2016
  PID USER PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
 6669 root 20  0 1299380 634920 3924 S 43.8  3.9  6:33.27 gluster+
====> stopped creations
============> increased by 550MB in about 15min

####################################################################################

Client2 (rhs-client24):

Without CFOPs:
=========> started dir create from 45 and 24
Mon Dec 19 17:04:29 IST 2016
  PID USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
20323 root 20  0 696m 44m 3972 S 45.8  0.3 0:12.85 glusterfs

Mon Dec 19 17:20:16 IST 2016
  PID USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
20323 root 20  0 761m 61m 3984 S 45.4  0.4 7:30.26 glusterfs
===> consumed about 15MB in 15min

With CFOPs:
=========> started dir create
Mon Dec 19 17:27:32 IST 2016
  PID USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
 3599 root 20  0 696m 53m 3992 S 48.4  0.3 0:14.33 glusterfs

Mon Dec 19 17:28:02 IST 2016
  PID USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
 3599 root 20  0 696m 65m 3992 S 46.3  0.4 0:28.36 glusterfs
====> stopped creations

Mon Dec 19 17:44:51 IST 2016
  PID USER PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 3599 root 20  0 1145m 442m 3992 S  0.0  2.8 7:59.31 glusterfs
====> consumed about 440MB in 16min

####################################################################################

Server side:

brick1:

Without CFOPs:
Mon Dec 19 17:08:18 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
24524 root 20  0 1382176 37840 3132 S 50.0  0.5  2:17.61 gluster+

Mon Dec 19 17:23:23 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
24524 root 20  0 1382176 62904 3160 S 50.0  0.8 10:10.55 gluster+
===> consumed about 25MB in 15min

With CFOPs:
Mon Dec 19 17:29:54 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+
25058 root 20  0 1250076 39396 4312 S 37.5  0.5  0:07.09
=========> started dir create

Mon Dec 19 17:30:25 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+
25058 root 20  0 1250076 43620 4312 S 47.1  0.5  0:22.52

Mon Dec 19 17:30:55 IST 2016

Mon Dec 19 17:47:00 IST 2016
  PID USER PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+
25058 root 20  0 1513248 191188 4328 S 56.2  2.4  8:40.34
===> consumed about 150MB in 16min

################################################################################

brick2:

Without CFOPs:
Mon Dec 19 17:08:29 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
23156 root 20  0 1385296 51752 3124 S 50.0  0.6 12:41.95 gluster+

Mon Dec 19 17:23:34 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
23156 root 20  0 1385296 75764 3128 S  0.0  0.9 19:54.83 gluster+
===> consumed about 15MB in 15min

With CFOPs:
Mon Dec 19 17:29:48 IST 2016
  PID USER PR NI    VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
24747 root 20  0 1250080 41108 4320 S 43.8  0.5  0:03.97 glusterfsd
=========> started dir create

Mon Dec 19 17:46:54 IST 2016
  PID USER PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
24747 root 20  0 1447716 193404 4336 S 50.0  2.4  8:37.14 glusterfsd
===> consumed about 140MB
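For reference, a minimal sketch of the kind of loop used to capture the snapshots above; the PID (taken from the client1 output purely as an example) and the 30s interval are illustrative, not taken from this report:

# Print a timestamped top line for the fuse client every 30 seconds.
pid=10306                                 # fuse client PID on client1 (example from above)
while true; do
    date
    top -b -n 1 -p "$pid" | tail -n 2     # header line + the process line
    sleep 30
done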
As discussed offline, I will be moving this to failed-qa.
Upstream mainline patch posted for review: http://review.gluster.org/#/c/16210/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93488/
BZ added to erratum https://errata.devel.redhat.com/advisory/24866
Moving to ON_QA
on_qa validation:
==================
Re-ran the same test case, i.e. simultaneous deep dir creations from two different clients (one RHEL 6.7 and the other RHEL 7.3, the same clients used when the BZ was raised).

Had a 1x2 vol with cfops enabled, as below:

[root@dhcp35-37 ~]# gluster v info cfp

Volume Name: cfp
Type: Replicate
Volume ID: d84aaa47-62d3-485b-92ff-83f3b5cfc46b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick1/cfp
Brick2: 10.70.35.116:/rhs/brick1/cfp
Options Reconfigured:
cluster.use-compound-fops: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@dhcp35-37 ~]# gluster v status cfp
Status of volume: cfp
Gluster process                                      TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick1/cfp                    49155     0          Y       7255
Brick 10.70.35.116:/rhs/brick1/cfp                   49155     0          Y       3352
Self-heal Daemon on localhost                        N/A       N/A        Y       7275
Self-heal Daemon on dhcp35-116.lab.eng.blr.redhat.com  N/A     N/A        Y       3428
Self-heal Daemon on dhcp35-239.lab.eng.blr.redhat.com  N/A     N/A        Y       8537
Self-heal Daemon on dhcp35-196.lab.eng.blr.redhat.com  N/A     N/A        Y       13185
Self-heal Daemon on dhcp35-135.lab.eng.blr.redhat.com  N/A     N/A        Y       14433
Self-heal Daemon on dhcp35-8.lab.eng.blr.redhat.com    N/A     N/A        Y       14371

Task Status of Volume cfp
--------------------------------------------------------------------------------------
There are no active volume tasks

Over a span of about 15min, the memory consumed was as below:

client45 (RHEL 7.3):
Mon Dec 26 20:00:12 IST 2016
  PID USER PR NI   VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
30215 root 20  0 775108 151512 3924 S 37.5  0.9 24:27.19 glusterfs
##############
Mon Dec 26 20:15:17 IST 2016
  PID USER PR NI   VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
30215 root 20  0 775108 181924 3924 S 33.3  1.1 30:06.44 glusterfs

So the memory consumed was about 30MB.

client24 (RHEL 6.7):
##############
Mon Dec 26 20:00:14 IST 2016
  PID USER PR NI VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
15038 root 20  0 761m  88m 4000 S 43.5  0.6 27:32.00 glusterfs
##############
Mon Dec 26 20:15:31 IST 2016
  PID USER PR NI VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
15038 root 20  0 761m 103m 4000 S 51.6  0.6 34:31.00 glusterfs

So the memory consumed was about 15MB.

These numbers are almost equal to what the clients consume without cfops enabled, hence marking as fixed.

Test version:
glusterfs-libs-3.8.4-10.el7.x86_64
glusterfs-devel-3.8.4-10.el7.x86_64
glusterfs-api-devel-3.8.4-10.el7.x86_64
glusterfs-cli-3.8.4-10.el7.x86_64
glusterfs-3.8.4-10.el7.x86_64
glusterfs-api-3.8.4-10.el7.x86_64
glusterfs-fuse-3.8.4-10.el7.x86_64
glusterfs-client-xlators-3.8.4-10.el7.x86_64
glusterfs-rdma-3.8.4-10.el7.x86_64

The brick process consumption too was normal.
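For anyone re-running this check, a minimal sketch (not from the report) of measuring the RES growth over such a 15-minute window directly from /proc; the PID is the client45 fuse process from the snapshot above, used purely as an example:

pid=30215                                               # fuse client PID (example from above)
before=$(awk '/VmRSS/ {print $2}' /proc/$pid/status)    # resident set size in kB
sleep 900                                               # 15 minutes
after=$(awk '/VmRSS/ {print $2}' /proc/$pid/status)
echo "RSS grew by $(( (after - before) / 1024 )) MB"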
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html