Bug 1282731 - Entry heal messages in glustershd.log while no entries shown in heal info
Entry heal messages in glustershd.log while no entries shown in heal info
Status: MODIFIED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse (Show other bugs)
3.1
Unspecified Unspecified
medium Severity medium
: ---
: ---
Assigned To: Sunil Kumar Acharya
Matt Zywusko
rebase
: ZStream
Depends On:
Blocks: 1420619 1422766 1422783
  Show dependency treegraph
 
Reported: 2015-11-17 04:26 EST by Bhaskarakiran
Modified: 2017-10-16 08:17 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1420619 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Bhaskarakiran 2015-11-17 04:26:05 EST
Description of problem:
=======================

This is a logging issue. After replace-brick, glustershd log shows entry heal successful messages while no entries are shown with heal info command. These are informatory messages.

[2015-11-17 09:09:17.262558] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.79: name heal successful on FFF
[2015-11-17 09:09:17.457770] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.80: name heal successful on FFF
[2015-11-17 09:09:17.597917] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.9/newfile.94: name heal successful on FFF
[2015-11-17 09:09:17.625835] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.81: name heal successful on FFF
[2015-11-17 09:09:18.067542] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.82: name heal successful on FFF
[2015-11-17 09:09:18.348196] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.9/newfile.95: name heal successful on FFF


Version-Release number of selected component (if applicable):
=============================================================
3.7.5-6

[root@transformers ~]# gluster --version
glusterfs 3.7.5 built on Nov 10 2015 00:13:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
=================
glustershd should show the entries only for the files which get healed.

Additional info:
Comment 3 Krutika Dhananjay 2016-02-15 08:27:14 EST
Bhaskar,

This is my first bug in disperse component. Could you please share the steps to recreate the issue?

I tried the following steps but to no avail:
1) create start and mount a 6+2 disperse volume (all bricks on the same node).
2) Create data in the volume -- do kernel untar on the mount point.
3) Once data is created, replace the last brick with replace-brick commit force command.
4) Monitor o/p of `gluster volume heal <VOL> info` and "watch" the output of `tail -F glustershd.log` on the adjacent pane.
The result: I see heal info reporting entries constantly and the list only keeps growing with time. On the other hand in the glustershd.log file, I actually see the following:

[2016-02-15 13:20:30.002542] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2016-02-15 13:20:30.002542] and [2016-02-15 13:20:30.003022]
[2016-02-15 13:20:30.003648] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.003696] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004380] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004388] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004403] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004524] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004588] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.005269] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.005297] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:31.610589] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:31.610612] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:31.610621] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:21:31.002399] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:21:31.002579] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:21:31.002589] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 460 times between [2016-02-15 13:21:31.002399] and [2016-02-15 13:22:53.470491]
The message "W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)" repeated 455 times between [2016-02-15 13:21:31.002579] and [2016-02-15 13:22:53.470496]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]" repeated 455 times between [2016-02-15 13:21:31.002589] and [2016-02-15 13:22:53.470498]
[2016-02-15 13:23:32.002522] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:23:32.002699] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:23:32.002711] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]


-Krutika
Comment 4 Krutika Dhananjay 2016-02-15 08:53:06 EST
I now find heal-info report zero entries intermittently while shd continues to log names of files it is healing. Canceling NEEDINFO.
Comment 7 Sunil Kumar Acharya 2017-01-11 03:59:36 EST
Steps followed to verify the issue.

1. Created a Distributed-Disperse volume.
2. Created 5 files on the volume.
3. Replaced one of the brick.
4. Viewed the list of files that needed healing.

*************************************************************************
 [root@buildserver glusterfs]# gluster volume heal ec-1 info
 Brick buildserver:/lab_work/store/ec-1
 /f5 
 /f10 
 Status: Connected
 Number of entries: 2
 
 Brick buildserver:/lab_work/store/ec-2
 /f5 
 /f10 
 Status: Connected
 Number of entries: 2
 
 Brick buildserver:/lab_work/store/ec-3
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-4
 /f5 
 /f10 
 Status: Connected
 Number of entries: 2 
 
 Brick buildserver:/lab_work/store/ec-5
 /f5 
 /f10 
 Status: Connected
 Number of entries: 2 
 
 Brick buildserver:/lab_work/store/ec-6
 /f5 
 /f10 
 Status: Connected
 Number of entries: 2
  
 Brick buildserver:/lab_work/store/ec-7
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-8
 Status: Connected
 Number of entries: 0 
 
 Brick buildserver:/lab_work/store/ec-9
 Status: Connected
 Number of entries: 0 
 
 Brick buildserver:/lab_work/store/ec-10
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-11
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-12
 Status: Connected
 Number of entries: 0
 
 [root@buildserver glusterfs]# 
*************************************************************************

5. No heal message was noticed in the log.

root@buildserver glusterfs]# grep -i 122058 *.log
root@buildserver glusterfs]# 

6.Triggered self-heal only for the files which require healing.

 [root@buildserver ~]# gluster volume heal ec-1
 Launching heal operation to perform index self heal on volume ec-1 has been successful 
 Use heal info commands to check status
 [root@buildserver ~]#

7. Heal message corresponding to the files which got healed were noticed.

*************************************************************************
 [root@buildserver glusterfs]# grep -i 122058 *.log
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:49.921672] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:50.942957] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:52.084542] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:58.187489] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:59.192478] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:01.232530] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:06.336938] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:15.489458] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:16.510026] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:20.565422] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:28.895143] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
 lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F" repeated 2 times between [2017-01-04 08:08:28.895143] and           [2017-01-04 08:08:49.354147]
 lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F" repeated 6 times between [2017-01-04 08:08:20.565422] and            [2017-01-04 08:09:00.313412]
 [root@buildserver glusterfs]#
*************************************************************************

8. Post healing no files were listed under "need healing"

*************************************************************************
 [root@buildserver ~]# gluster volume heal ec-1 info
 Brick buildserver:/lab_work/store/ec-1
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-2
 Status: Connected
 Number of entries: 0 
 
 Brick buildserver:/lab_work/store/ec-3
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-4
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-5
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-6
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-7
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-8
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-9
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-10
 Status: Connected
 Number of entries: 0 
 
 Brick buildserver:/lab_work/store/ec-11
 Status: Connected
 Number of entries: 0
 
 Brick buildserver:/lab_work/store/ec-12
 Status: Connected
 Number of entries: 0 

 [root@buildserver ~]#
*************************************************************************

From the above observation it can be noticed that the component is behaving as expected.
Comment 8 Ashish Pandey 2017-02-08 03:46:39 EST

This bug will be fixed by https://review.gluster.org/#/c/16552/
as it is not triggering heal for a file it does not require.

However, I think we should change the level of the gf_msg to DEBUG level as we have done in https://review.gluster.org/#/c/16478/

Sunil, 

could you please send the patch ofr this?
Comment 10 Sunil Kumar Acharya 2017-02-21 01:10:36 EST
Snippet from comment 8 referring the fix for this issue:

> This bug will be fixed by https://review.gluster.org/#/c/16552/
> as it is not triggering heal for a file it does not require.
Comment 11 Sunil Kumar Acharya 2017-09-13 02:57:39 EDT
Changed the log level as part of : https://review.gluster.org/#/c/16580/

Note You need to log in before you can comment on or make changes to this bug.