Description of problem:
-----------------------
4-node Gluster cluster, with 4 clients mounting the volume via FUSE. The intent was to scale from 1*2 to 6*2 and then back to 1*2 amidst continuous I/O from the FUSE mounts.

On scaling up from 4*2 to 5*2, 4 of my brick processes crashed and the rebalance failed as well:

[root@gqas009 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed        0:13:31
      gqas015.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             0             0            completed        0:14:27
      gqas014.sbu.lab.eng.bos.redhat.com                0        0Bytes           260             0             0            completed        0:13:1
      gqas010.sbu.lab.eng.bos.redhat.com            30338       162.4GB        177268             8             0               failed        8:16:47
volume rebalance: butcher: success
[root@gqas009 ~]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.4-11.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Actual results:
---------------
Brick processes crashed and migration/rebalance failed.

Expected results:
-----------------
No crashes and a clean rebalance.

Additional info:
----------------
*Client and Server OS* : RHEL 7.3

*Vol Status* :

[root@gqas009 ~]# gluster v status
Status of volume: butcher
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/A     N/A       N/A        N       N/A
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/A     N/A       N/A        N       N/A
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/A     N/A       N/A        N       N/A
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/A     N/A       N/A        N       N/A
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/A     49154     0          Y       24074
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/A     49154     0          Y       24472
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/A     49155     0          Y       24872
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/A     49155     0          Y       25346
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A     49153     0          Y       32088
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A     49153     0          Y       431
Self-heal Daemon on localhost                           N/A       N/A        Y       4098
Quota Daemon on localhost                               N/A       N/A        Y       4106
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       526
Quota Daemon on gqas015.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       535
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       32177
Quota Daemon on gqas014.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       32185
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       3803
Quota Daemon on gqas010.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       3802

Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 93e83af5-411b-4310-a90a-2d3290ffd6c2
Status               : failed
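For context, a minimal sketch of the add-brick / rebalance / remove-brick sequence used for this kind of scale-up and scale-down. The brick paths for the extra pair are illustrative, not taken from this report; the existing bricks are the ones listed in the volume status above:

# Illustrative brick pair for the next step (hypothetical paths)
B1=gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A
B2=gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A

# Scale up by one replica pair (e.g. 5*2 -> 6*2), then migrate data onto it
gluster volume add-brick butcher replica 2 $B1 $B2
gluster volume rebalance butcher start
gluster volume rebalance butcher status

# Scale back down: drain the pair first, then commit its removal
gluster volume remove-brick butcher $B1 $B2 start
gluster volume remove-brick butcher $B1 $B2 status
gluster volume remove-brick butcher $B1 $B2 commit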
************** EXACT WORKLOAD **************

Client 1 : tarball untar + recursive ls
Client 2 : Bonnie++
Client 3 : tarball untar
Client 4 : finds and Bonnie++
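Roughly, the per-client commands looked like the following sketch (the mount point, tarball, and Bonnie++ arguments are illustrative, not the exact ones used):

# Client 1: tarball untar plus a recursive listing of the result
tar xf linux-4.9.tar.xz -C /mnt/butcher && ls -lR /mnt/butcher > /dev/null

# Clients 2 and 4: Bonnie++ against the FUSE mount
bonnie++ -d /mnt/butcher/bonnie -u root

# Client 3: tarball untar only
tar xf linux-4.9.tar.xz -C /mnt/butcher/client3

# Client 4: metadata-heavy finds in parallel with Bonnie++
find /mnt/butcher -type f > /dev/null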
Assigning to Mohit as he has worked on the fix. The fix was to avoid dict ref leaks during the xattr invalidations done when md-cache settings are enabled.

Also, before jumping to conclusions, to be sure that the OOM kill was indeed caused by the ref leak in the upcall xlator, I suggest disabling the md-cache settings and re-running the tests. Thanks!
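For the suggested re-run, a sketch of how the md-cache/upcall related settings could be switched off. These are the standard cache-invalidation options from the GlusterFS md-cache documentation; confirm the exact names and prior values against the deployed 3.8.4 build before applying:

# Turn off server-side upcall notifications and client-side md-cache invalidation
gluster volume set butcher features.cache-invalidation off
gluster volume set butcher performance.cache-invalidation off
gluster volume set butcher performance.stat-prefetch off
# Revert the md-cache timeout to its default of 1 second (assuming it was raised)
gluster volume set butcher performance.md-cache-timeout 1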
(In reply to Soumya Koduri from comment #9)
> Assigning to Mohit as he has worked on the fix. The fix was to avoid dict
> ref leaks during the xattr invalidations done when md-cache settings are
> enabled.
>
> Also, before jumping to conclusions, to be sure that the OOM kill was indeed
> caused by the ref leak in the upcall xlator, I suggest disabling the
> md-cache settings and re-running the tests. Thanks!

This exercise is just to make sure that we do not overlook any other leaks in other code paths.
upstream mainline : http://review.gluster.org/16392
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/95350
Test Blocker for Scale Tests.
I scaled out from 1*2 to 6*2 and then back to 1*2 on 3.8.4-13 over FUSE. It worked seamlessly. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html