1417050 – [Stress] : SHD Logs flooded with "Heal Failed" messages,filling up "/" quickly

Bug 1417050 - [Stress] : SHD Logs flooded with "Heal Failed" messages,filling up "/" quickly

Summary: [Stress] : SHD Logs flooded with "Heal Failed" messages,filling up "/" quickly

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1415583 1417135
TreeView+	depends on / blocked

Reported:	2017-01-27 05:54 UTC by Ashish Pandey
Modified:	2017-05-30 18:40 UTC (History)
CC List:	11 users (show)
Fixed In Version:	glusterfs-3.11.0
Clone Of:	1415583
Clones:	1417135 (view as bug list)
Environment:
Last Closed:	2017-05-30 18:40:10 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Comment 1 Ashish Pandey 2017-01-27 05:56:14 UTC

Description of problem:
-----------------------

4 Node cluster,1 EC volume exported via Ganesha.


Replaced 4 bricks in different subvols,which triggered a heal.

Data was being populated from multiple Ganesha mounts while heal happened.


The shd logs and Ganesha-gfapi logs were literally flooded with Heal failures :

The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:51.836831] and [2017-01-22 09:58:52.109527]
[2017-01-22 09:58:52.225738] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
[2017-01-22 09:58:52.228903] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 8 times between [2017-01-22 09:58:52.066587] and [2017-01-22 09:58:52.328220]
[2017-01-22 09:58:52.340050] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.379059] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:52.340050] and [2017-01-22 09:58:52.453352]
[2017-01-22 09:58:52.461780] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
[2017-01-22 09:58:52.495767] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.544095] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.584513] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]


The log message was populated nearly 1000000 times per node in the last 24 hours :

[root@gqas015 ~]# cat /var/log/ganesha-gfapi.log |grep -i "Heal fail"|wc -l
1433241

[root@gqas009 ~]# cat /var/log/glusterfs/glustershd.log|grep -i "Heal fail"|wc -l
1075672
[root@gqas009 ~]# 


It's filling up / quickly :


[root@gqas015 ~]# ll -h /var/log/glusterfs/glustershd.log
-rw------- 1 root root 403M Jan 23 01:57 /var/log/glusterfs/glustershd.log
[root@gqas015 ~]# 
[root@gqas015 ~]# ll -h /var/log/ganesha-gfapi.log 
-rw------- 1 root root 538M Jan 23 00:59 /var/log/ganesha-gfapi.log
[root@gqas015 ~]# 



*********************************
EXACT WORKLOAD on Ganesha mounts 
*********************************

Smallfile Appends,ll -R,tarball untar



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurence.


Actual results:
---------------

Logs spammed with "Heal Failed" errors

Expected results:
-----------------

No Log spamming.

Comment 2 Worker Ant 2017-01-27 06:08:51 UTC

REVIEW: https://review.gluster.org/16473 (cluster/ec: Change level of messages to DEBUG) posted (#1) for review on master by Ashish Pandey (aspandey)

Comment 3 Worker Ant 2017-01-27 11:51:43 UTC

COMMIT: https://review.gluster.org/16473 committed in master by Xavier Hernandez (xhernandez) 
------
commit 23a9385953e6b2f05508d9ef85bbbb2200f18fe2
Author: Ashish Pandey <aspandey>
Date:   Fri Jan 27 11:31:40 2017 +0530

    cluster/ec: Change level of messages to DEBUG
    
    Heal failed or passed should not be logged as warning.
    These can be observed from heal info if the heal is
    happening or not. If we require to debug a case where
    heal is not happening, we can set the level to DEBUG.
    
    Change-Id: I347665c8c8b6223bb08a9f3dd5643a10ddc3b93e
    BUG: 1417050
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: https://review.gluster.org/16473
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 4 Shyamsundar 2017-05-30 18:40:10 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.