Bug 1415583

Summary: [Stress]: SHD logs flooded with "Heal Failed" messages, filling up "/" quickly
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: disperse
Assignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, bturner, jthottan, rcyriac, rhinduja, rhs-bugs, sbhaloth, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-14
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1417050
Environment:
Last Closed: 2017-03-23 06:04:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1417050, 1417135
Bug Blocks: 1351528

Description Ambarish 2017-01-23 07:03:51 UTC
Description of problem:
-----------------------

4-node cluster, 1 EC volume exported via Ganesha.


Replaced 4 bricks in different subvols, which triggered a heal.

Data was being written from multiple Ganesha mounts while the heal was in progress.


The SHD and Ganesha-gfapi logs were flooded with heal failures:

The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:51.836831] and [2017-01-22 09:58:52.109527]
[2017-01-22 09:58:52.225738] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
[2017-01-22 09:58:52.228903] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 8 times between [2017-01-22 09:58:52.066587] and [2017-01-22 09:58:52.328220]
[2017-01-22 09:58:52.340050] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.379059] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:52.340050] and [2017-01-22 09:58:52.453352]
[2017-01-22 09:58:52.461780] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
[2017-01-22 09:58:52.495767] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.544095] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.584513] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
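For context, the "repeated N times between [...] and [...]" lines above are the gluster log layer's suppression of identical consecutive messages. A self-contained sketch of that scheme (illustrative only; names are hypothetical, this is not the libglusterfs code) shows why it cannot contain this flood: messages from different subvolumes interleave and keep resetting the counter.

    /* Illustrative sketch of consecutive-duplicate log suppression, similar
     * in spirit to what produces the "repeated N times between ..." lines
     * above.  Names are hypothetical; not the libglusterfs implementation. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static char last_msg[512];
    static int  repeat_count;
    static char first_ts[64], last_ts[64];

    static void now_ts(char *buf, size_t n) {
        time_t t = time(NULL);
        strftime(buf, n, "%Y-%m-%d %H:%M:%S", gmtime(&t));
    }

    static void flush_repeats(void) {
        if (repeat_count > 0)
            printf("The message \"%s\" repeated %d times between [%s] and [%s]\n",
                   last_msg, repeat_count, first_ts, last_ts);
        repeat_count = 0;
    }

    static void log_msg(const char *msg) {
        if (strcmp(msg, last_msg) == 0) {
            /* Identical consecutive message: count it instead of printing. */
            if (repeat_count++ == 0)
                now_ts(first_ts, sizeof(first_ts));
            now_ts(last_ts, sizeof(last_ts));
            return;
        }
        /* A different message arrived: emit the pending summary, then the
         * new message.  Interleaved sources (e.g. disperse-0 vs disperse-4
         * in the excerpt above) reset the counter, so suppression alone
         * cannot stop the flood. */
        flush_repeats();
        snprintf(last_msg, sizeof(last_msg), "%s", msg);
        printf("%s\n", msg);
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            log_msg("W [MSGID: 122002] Heal failed [Invalid argument]");
        log_msg("some other message");  /* flushes "repeated 4 times ..." */
        flush_repeats();
        return 0;
    }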


The log message was logged over a million times per node in the last 24 hours:

[root@gqas015 ~]# cat /var/log/ganesha-gfapi.log |grep -i "Heal fail"|wc -l
1433241

[root@gqas009 ~]# cat /var/log/glusterfs/glustershd.log|grep -i "Heal fail"|wc -l
1075672
[root@gqas009 ~]# 


It is filling up "/" quickly:


[root@gqas015 ~]# ll -h /var/log/glusterfs/glustershd.log
-rw------- 1 root root 403M Jan 23 01:57 /var/log/glusterfs/glustershd.log
[root@gqas015 ~]# 
[root@gqas015 ~]# ll -h /var/log/ganesha-gfapi.log 
-rw------- 1 root root 538M Jan 23 00:59 /var/log/ganesha-gfapi.log
[root@gqas015 ~]# 
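Back-of-the-envelope: each of these warning lines is roughly 140 bytes, so the 1,433,241 occurrences alone account for roughly 200 MB of the 538 MB ganesha-gfapi.log, and the 1,075,672 occurrences for roughly 150 MB of the 403 MB glustershd.log, before counting any other messages in those files.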



*********************************
EXACT WORKLOAD on Ganesha mounts 
*********************************

Smallfile appends, ll -R, tarball untar (a sketch of the append portion follows).
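For reproduction context, a minimal sketch of the append part of the workload; the mount path, file count, and sizes here are made up (the actual run used the smallfile benchmark):

    /* Minimal append-workload sketch.  Path and sizes are hypothetical;
     * the original run drove this through smallfile on Ganesha mounts. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        char path[256], buf[4096] = {0};
        for (int f = 0; f < 100; f++) {
            snprintf(path, sizeof(path), "/mnt/ganesha/smallfile_%03d", f);
            int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd < 0) { perror(path); exit(1); }
            for (int i = 0; i < 16; i++)  /* 16 x 4 KiB appends per file */
                if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); exit(1); }
            close(fd);
        }
        return 0;
    }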



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurrence.


Actual results:
---------------

Logs are spammed with "Heal failed" warnings.

Expected results:
-----------------

No log spamming.

Additional info:
----------------

[root@gqas009 ~]# gluster v status
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas015.sbu.lab.eng.bos.redhat.com:/v
ar/lib/glusterd/ss_brick                    49152     0          Y       26457
Brick gqas014.sbu.lab.eng.bos.redhat.com:/v
ar/lib/glusterd/ss_brick                    49152     0          Y       25391
Brick gqas009.sbu.lab.eng.bos.redhat.com:/v
ar/lib/glusterd/ss_brick                    49152     0          Y       25747
Self-heal Daemon on localhost               N/A       N/A        Y       17960
Self-heal Daemon on gqas014.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17200
Self-heal Daemon on gqas010.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17415
 
Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: replicate
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks12/bricknew                            49153     0          Y       27931
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks12/bricknew                            49152     0          Y       27177
Self-heal Daemon on localhost               N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17200
 
Task Status of Volume replicate
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks1/brick1                               49158     0          Y       29725
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks1/brick                                49164     0          Y       24750
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks1/brick                                49164     0          Y       24867
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks1/brick                                49164     0          Y       25931
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks3/brick                                49165     0          Y       25201
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks3/brick                                49165     0          Y       24769
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks2/brick1                               49153     0          Y       29771
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks2/brick                                49166     0          Y       24788
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks2/brick                                49165     0          Y       24886
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks2/brick                                49165     0          Y       25950
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks3/brick                                49166     0          Y       24905
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks3/brick                                49166     0          Y       25969
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks4/brick1                               49154     0          Y       29827
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks4/brick                                49167     0          Y       24807
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks4/brick                                49167     0          Y       25988
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks4/brick                                49167     0          Y       24924
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks5/brick                                49168     0          Y       25258
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks5/brick                                49168     0          Y       24826
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks6/brick                                49169     0          Y       25277
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks6/brick                                49169     0          Y       24845
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks6/brick                                49168     0          Y       26007
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks6/brick                                49168     0          Y       24943
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks5/brick                                49169     0          Y       24962
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks5/brick                                49169     0          Y       26026
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks7/brick1                               49155     0          Y       29909
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks7/brick                                49170     0          Y       24864
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks7/brick                                49170     0          Y       26045
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks7/brick                                49170     0          Y       24981
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks8/brick                                49171     0          Y       24883
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks8/brick                                49171     0          Y       25315
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks9/brick                                49172     0          Y       25336
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks9/brick                                49172     0          Y       24902
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks9/brick                                49171     0          Y       26064
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks9/brick                                49171     0          Y       25000
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks8/brick                                49172     0          Y       25019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks8/brick                                49172     0          Y       26083
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks10/brick                               49173     0          Y       25355
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks10/brick                               49173     0          Y       24921
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks10/brick                               49173     0          Y       26102
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks10/brick                               49173     0          Y       25038
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks11/brick1                              49156     0          Y       30009
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks11/brick                               49174     0          Y       24940
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks12/brick                               49175     0          Y       25393
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks12/brick                               49175     0          Y       24959
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks12/brick                               49174     0          Y       26121
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks12/brick                               49174     0          Y       25057
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks11/brick                               49175     0          Y       25076
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks11/brick                               49175     0          Y       26140
Self-heal Daemon on localhost               N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       17200
 
Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 60d3e3e4-661c-4520-9f5f-482d95d81a82
Status               : in progress         
 
[root@gqas009 ~]#

Comment 2 Ambarish 2017-01-23 07:10:12 UTC
Proposing as a blocker, since usability (and UX in general) is impacted.

Comment 4 Ashish Pandey 2017-01-27 06:09:47 UTC
upstream patch -
https://review.gluster.org/#/c/16473/
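A common remedy for this class of flood is to keep the first report of a recurring failure at warning level and demote subsequent identical reports to debug level; whether the linked patch takes exactly this approach is best confirmed from the review itself. A standalone sketch of the idea, with hypothetical names:

    /* Sketch of one common fix for this class of log flood: warn once,
     * then demote repeats to DEBUG.  Hypothetical names; see the linked
     * review for the actual change. */
    #include <stdio.h>

    enum log_level { LOG_DEBUG, LOG_WARNING };
    static enum log_level configured_level = LOG_WARNING;

    static void emit(enum log_level lvl, const char *msg) {
        if (lvl >= configured_level)
            fprintf(stderr, "%s %s\n", lvl == LOG_WARNING ? "W" : "D", msg);
    }

    /* Called on every failed heal attempt for a subvolume. */
    static void heal_report(int *warned_once, const char *msg) {
        /* The first failure is worth a warning; retries of the same failure
         * are only interesting when debugging, so they no longer flood the
         * log at the default level. */
        emit(*warned_once ? LOG_DEBUG : LOG_WARNING, msg);
        *warned_once = 1;
    }

    int main(void) {
        int warned = 0;
        for (int i = 0; i < 5; i++)  /* only the first attempt prints */
            heal_report(&warned, "ec_heal_report: Heal failed [Invalid argument]");
        return 0;
    }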

Comment 5 Atin Mukherjee 2017-01-30 07:29:26 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/96373/

Comment 7 Ambarish 2017-02-13 07:28:12 UTC
Cannot reproduce on glusterfs-3.8.4-14. Verified.

Comment 9 errata-xmlrpc 2017-03-23 06:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html