Description of problem:
-----------------------
4-node cluster, 1 EC volume exported via Ganesha.

Replaced 4 bricks in different subvols, which triggered a heal. Data was being populated from multiple Ganesha mounts while the heal happened.

The shd logs and ganesha-gfapi logs were flooded with heal failures:

The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:51.836831] and [2017-01-22 09:58:52.109527]
[2017-01-22 09:58:52.225738] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
[2017-01-22 09:58:52.228903] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 8 times between [2017-01-22 09:58:52.066587] and [2017-01-22 09:58:52.328220]
[2017-01-22 09:58:52.340050] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.379059] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:52.340050] and [2017-01-22 09:58:52.453352]
[2017-01-22 09:58:52.461780] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
[2017-01-22 09:58:52.495767] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.544095] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.584513] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]

The message was logged over a million times per node in the last 24 hours:

[root@gqas015 ~]# cat /var/log/ganesha-gfapi.log | grep -i "Heal fail" | wc -l
1433241

[root@gqas009 ~]# cat /var/log/glusterfs/glustershd.log | grep -i "Heal fail" | wc -l
1075672
[root@gqas009 ~]#

It is filling up / (the root filesystem) quickly:

[root@gqas015 ~]# ll -h /var/log/glusterfs/glustershd.log
-rw------- 1 root root 403M Jan 23 01:57 /var/log/glusterfs/glustershd.log
[root@gqas015 ~]#
[root@gqas015 ~]# ll -h /var/log/ganesha-gfapi.log
-rw------- 1 root root 538M Jan 23 00:59 /var/log/ganesha-gfapi.log
[root@gqas015 ~]#

*********************************
EXACT WORKLOAD on Ganesha mounts
*********************************
Smallfile appends, ll -R, tarball untar

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Actual results:
---------------
Logs spammed with "Heal failed" errors.

Expected results:
-----------------
No log spamming.
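For reference, a rough sketch of the reproduction flow described above. The volume name matches this report; the brick paths, Ganesha VIP and client mount point are placeholders, not the exact ones used on this cluster:

    # Replace one brick in each of four different subvols; each replace triggers a heal:
    gluster volume replace-brick testvol <host>:/bricks<N>/brick <host>:/bricks<N>/brick_new commit force

    # While the heals run, drive I/O from the NFS-Ganesha mounts:
    mount -t nfs -o vers=4 <ganesha-vip>:/testvol /mnt/testvol
    cd /mnt/testvol && ls -lR . > /dev/null          # "ll -R"
    tar xf /tmp/sample.tar -C /mnt/testvol           # tarball untar
    # ...plus the smallfile append workload...

    # On a server node, watch the heal-failure message rate in the shd and gfapi logs:
    watch -n 60 'grep -ic "Heal fail" /var/log/glusterfs/glustershd.log /var/log/ganesha-gfapi.log'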
Additional info:
----------------
[root@gqas009 ~]# gluster v status

Status of volume: gluster_shared_storage
Gluster process                                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------------------------
Brick gqas015.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152     0          Y       26457
Brick gqas014.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152     0          Y       25391
Brick gqas009.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152     0          Y       25747
Self-heal Daemon on localhost                                        N/A       N/A        Y       17960
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17200
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17415

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: replicate
Gluster process                                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/bricknew          49153     0          Y       27931
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/bricknew          49152     0          Y       27177
Self-heal Daemon on localhost                                        N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17200

Task Status of Volume replicate
------------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: testvol
Gluster process                                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------------------------
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick1             49158     0          Y       29725
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick              49164     0          Y       24750
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick              49164     0          Y       24867
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick              49164     0          Y       25931
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick              49165     0          Y       25201
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick              49165     0          Y       24769
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick1             49153     0          Y       29771
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick              49166     0          Y       24788
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick              49165     0          Y       24886
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick              49165     0          Y       25950
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick              49166     0          Y       24905
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick              49166     0          Y       25969
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick1             49154     0          Y       29827
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick              49167     0          Y       24807
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick              49167     0          Y       25988
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick              49167     0          Y       24924
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick              49168     0          Y       25258
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick              49168     0          Y       24826
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick              49169     0          Y       25277
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick              49169     0          Y       24845
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick              49168     0          Y       26007
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick              49168     0          Y       24943
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick              49169     0          Y       24962
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick              49169     0          Y       26026
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick1             49155     0          Y       29909
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick              49170     0          Y       24864
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick              49170     0          Y       26045
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick              49170     0          Y       24981
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick              49171     0          Y       24883
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick              49171     0          Y       25315
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick              49172     0          Y       25336
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick              49172     0          Y       24902
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick              49171     0          Y       26064
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick              49171     0          Y       25000
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick              49172     0          Y       25019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick              49172     0          Y       26083
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick             49173     0          Y       25355
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick             49173     0          Y       24921
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick             49173     0          Y       26102
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick             49173     0          Y       25038
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick1            49156     0          Y       30009
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick             49174     0          Y       24940
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick             49175     0          Y       25393
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick             49175     0          Y       24959
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick             49174     0          Y       26121
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick             49174     0          Y       25057
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick             49175     0          Y       25076
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick             49175     0          Y       26140
Self-heal Daemon on localhost                                        N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com               N/A       N/A        Y       17200

Task Status of Volume testvol
------------------------------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 60d3e3e4-661c-4520-9f5f-482d95d81a82
Status               : in progress

[root@gqas009 ~]#
Proposing this as a blocker, as usability (and UX in general) is impacted.
Upstream patch: https://review.gluster.org/#/c/16473/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/96373/
Cannot reproduce on glusterfs-3.8.4-14; marking as Verified.
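As a sanity check during verification, a sketch of the kind of commands that confirm the flooding is gone after rerunning the same brick-replacement and Ganesha workload (illustrative only, reusing the greps from the description):

    rpm -q glusterfs nfs-ganesha
    grep -ic "Heal fail" /var/log/glusterfs/glustershd.log /var/log/ganesha-gfapi.log
    ls -lh /var/log/glusterfs/glustershd.log /var/log/ganesha-gfapi.log

The counts should stay at or near zero, and the logs should no longer grow by hundreds of MB per day.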
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html