Bug 1282731
| Field | Value |
| --- | --- |
| Summary | Entry heal messages in glustershd.log while no entries shown in heal info |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | disperse |
| Version | rhgs-3.1 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Bhaskarakiran <byarlaga> |
| Assignee | Sunil Kumar Acharya <sheggodu> |
| QA Contact | Nag Pavan Chilakam <nchilaka> |
| CC | amukherj, aspandey, nchilaka, pkarampu, rhinduja, rhs-bugs, sheggodu, ubansal |
| Keywords | ZStream |
| Target Release | RHGS 3.4.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | rebase |
| Fixed In Version | glusterfs-3.12.2-1 |
| Doc Type | Bug Fix |
| Type | Bug |
| Clones | 1420619 (view as bug list) |
| Bug Blocks | 1420619, 1422766, 1422783, 1503134 |
| Last Closed | 2018-09-04 06:26:58 UTC |
Description Bhaskarakiran 2015-11-17 09:26:05 UTC
Bhaskar, this is my first bug in the disperse component. Could you please share the steps to recreate the issue? I tried the following steps, but to no avail (a scripted sketch of these steps follows this comment):

1) Create, start, and mount a 6+2 disperse volume (all bricks on the same node).
2) Create data in the volume: do a kernel untar on the mount point.
3) Once the data is created, replace the last brick with the replace-brick commit force command.
4) Monitor the output of `gluster volume heal <VOL> info` and "watch" the output of `tail -F glustershd.log` in the adjacent pane.

The result: I see heal info reporting entries constantly, and the list only keeps growing with time. In the glustershd.log file, on the other hand, I actually see the following:

*************************************************************************
[2016-02-15 13:20:30.002542] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2016-02-15 13:20:30.002542] and [2016-02-15 13:20:30.003022]
[2016-02-15 13:20:30.003648] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.003696] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004380] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004388] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004403] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004524] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004588] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.005269] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.005297] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:31.610589] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:31.610612] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:31.610621] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:21:31.002399] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:21:31.002579] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:21:31.002589] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 460 times between [2016-02-15 13:21:31.002399] and [2016-02-15 13:22:53.470491]
The message "W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)" repeated 455 times between [2016-02-15 13:21:31.002579] and [2016-02-15 13:22:53.470496]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]" repeated 455 times between [2016-02-15 13:21:31.002589] and [2016-02-15 13:22:53.470498]
[2016-02-15 13:23:32.002522] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:23:32.002699] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:23:32.002711] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
*************************************************************************

-Krutika

I now find heal info reporting zero entries intermittently while shd continues to log the names of the files it is healing. Canceling NEEDINFO.
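For reference, a minimal scripted sketch of the reproduction steps above. The host name, volume name, brick paths, mount point, and tarball path are illustrative assumptions, not values taken from this bug.

*************************************************************************
#!/bin/bash
# Hedged sketch of the reproduction steps. HOST, VOL, brick paths,
# mount point, and the kernel tarball location are all assumptions.
HOST=server1
VOL=ecvol

# 1) Create, start, and mount a 6+2 disperse volume. "force" is needed
#    because all eight bricks live on the same node.
gluster volume create $VOL disperse-data 6 redundancy 2 \
    $HOST:/bricks/b{1..8} force
gluster volume start $VOL
mkdir -p /mnt/$VOL
mount -t glusterfs $HOST:/$VOL /mnt/$VOL

# 2) Create data in the volume: untar a kernel source tree on the mount.
tar -xf /tmp/linux.tar.xz -C /mnt/$VOL

# 3) Replace the last brick with "replace-brick ... commit force".
gluster volume replace-brick $VOL $HOST:/bricks/b8 $HOST:/bricks/b8_new \
    commit force

# 4) Monitor both views, each in its own terminal pane:
#      gluster volume heal $VOL info
#      tail -F /var/log/glusterfs/glustershd.log
*************************************************************************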
Steps followed to verify the issue:

1. Created a distributed-disperse volume.
2. Created 5 files on the volume.
3. Replaced one of the bricks.
4. Viewed the list of files that needed healing:

*************************************************************************
[root@buildserver glusterfs]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-2
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-5
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-6
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0
[root@buildserver glusterfs]#
*************************************************************************

5. No heal message was noticed in the log:

*************************************************************************
[root@buildserver glusterfs]# grep -i 122058 *.log
[root@buildserver glusterfs]#
*************************************************************************

6. Triggered self-heal only for the files which require healing:

*************************************************************************
[root@buildserver ~]# gluster volume heal ec-1
Launching heal operation to perform index self heal on volume ec-1 has been successful
Use heal info commands to check status
[root@buildserver ~]#
*************************************************************************

7. Heal messages corresponding to the files which got healed were noticed:

*************************************************************************
[root@buildserver glusterfs]# grep -i 122058 *.log
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:49.921672] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:50.942957] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:52.084542] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:58.187489] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:59.192478] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:01.232530] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:06.336938] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:15.489458] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:16.510026] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:20.565422] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:28.895143] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F" repeated 2 times between [2017-01-04 08:08:28.895143] and [2017-01-04 08:08:49.354147]
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F" repeated 6 times between [2017-01-04 08:08:20.565422] and [2017-01-04 08:09:00.313412]
[root@buildserver glusterfs]#
*************************************************************************

8. Post healing, no files were listed as needing healing:

*************************************************************************
[root@buildserver ~]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-2
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-5
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-6
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0
[root@buildserver ~]#
*************************************************************************

From the above observations, the component is behaving as expected.
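The before/after checks in steps 4 through 8 can be bundled into a small watcher that correlates the two views of healing. A minimal sketch, assuming the volume name ec-1 and the default glusterfs log directory:

*************************************************************************
#!/bin/bash
# Sketch: compare what "heal info" reports against the name-heal
# messages (MSGID 122058) in the logs. VOL and LOGDIR are assumptions.
VOL=ec-1
LOGDIR=/var/log/glusterfs

# Sum "Number of entries:" across all bricks reported by heal info.
entries=$(gluster volume heal "$VOL" info |
    awk '/^Number of entries:/ {sum += $NF} END {print sum + 0}')

# Count logged name-heal messages across all log files.
heals=$(grep -h 'MSGID: 122058' "$LOGDIR"/*.log 2>/dev/null | wc -l)

echo "heal-info pending entries: $entries"
echo "name-heal messages logged: $heals"
# In the buggy behavior, the message count kept growing even while
# the pending-entries count stayed at zero.
*************************************************************************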
This bug will be fixed by https://review.gluster.org/#/c/16552/, as with that patch heal is not triggered for a file that does not require it. However, I think we should change the level of the gf_msg to DEBUG, as we have done in https://review.gluster.org/#/c/16478/. Sunil, could you please send the patch for this?

Snippet from comment 8 referring to the fix for this issue:

> This bug will be fixed by https://review.gluster.org/#/c/16552/
> as it is not triggering heal for a file it does not require.

Changed the log level as part of: https://review.gluster.org/#/c/16580/

I don't know if the fix we are delivering is in line with what the issue actually is. The issue, as I understand it, is that heal messages were seen while heal info showed zero entries; that means there were entries undergoing heal that were not reflected in heal info. The fix, however, pushes all healing logs down to DEBUG level. Going by this, the issue gets hidden rather than actually fixed. Correct me if I am wrong.
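With https://review.gluster.org/#/c/16580/ the per-file heal messages are logged at DEBUG, so they only become visible once the log levels are raised, which is what the validation below does. A hedged example using the standard gluster diagnostics options; the volume name dispersevol1 is taken from the validation log, and the grep is illustrative:

*************************************************************************
# Raise brick- and client-side log levels to DEBUG so the demoted
# heal messages become visible again (volume name is an assumption).
gluster volume set dispersevol1 diagnostics.brick-log-level DEBUG
gluster volume set dispersevol1 diagnostics.client-log-level DEBUG

# The name-heal messages then appear at level D in the FUSE mount log:
grep 'ec_heal_do' /var/log/glusterfs/*.log
*************************************************************************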
OnQA validation (based on comment #15):

1) Set the log level (brick and client) to DEBUG on an EC volume. Had a few files created, and a new file, say "new_brickdown", was created and being written into using dd.
2) Brought down bricks and brought them back online while new_brickdown was still being written into.
3) While the dd was still running on new_brickdown, started reads on healthy files (i.e., other files which did not require heal) using md5sum, and also started appending to a healthy file which did not require heal.
4) Did not see any heal messages for the healthy files (while doing reads on these files from another client), which is expected; previously we were seeing heal messages on healthy files too, which was not right, as we do not have to heal them. I do see the message for the file getting healed (i.e., new_brickdown, which was supposed to get healed as part of the dd).

NOTE: this is seen in the FUSE mount log:

*************************************************************************
[2018-05-21 14:14:36.431467] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.505557] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.578142] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.608751] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.618004] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
*************************************************************************

Hence moving to verified, as other files are not getting picked for heal (as they do not require heal). Tested on 3.12.2-11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607