Description of problem:
=======================
This is a logging issue. After replace-brick, the glustershd log shows "name heal successful" messages for entries even though `gluster volume heal <VOL> info` lists no entries. These are informational messages.

[2015-11-17 09:09:17.262558] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.79: name heal successful on FFF
[2015-11-17 09:09:17.457770] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.80: name heal successful on FFF
[2015-11-17 09:09:17.597917] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.9/newfile.94: name heal successful on FFF
[2015-11-17 09:09:17.625835] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.81: name heal successful on FFF
[2015-11-17 09:09:18.067542] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.8/newfile.82: name heal successful on FFF
[2015-11-17 09:09:18.348196] I [MSGID: 122058] [ec-heal.c:2338:ec_heal_do] 0-vol1-disperse-0: /files/dir.9/newfile.95: name heal successful on FFF

Version-Release number of selected component (if applicable):
=============================================================
3.7.5-6

[root@transformers ~]# gluster --version
glusterfs 3.7.5 built on Nov 10 2015 00:13:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
=================
glustershd should log these messages only for files that actually get healed.

Additional info:
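To cross-check the two views described above, grep the shd log for MSGID 122058 while polling heal info. A minimal sketch, assuming the volume name (vol1) and the default glustershd log path:

# Heal messages logged by the self-heal daemon (informational, MSGID 122058)
grep 122058 /var/log/glusterfs/glustershd.log

# Entries that heal info actually reports as pending
gluster volume heal vol1 info

# In two terminal panes, watch both at once to catch the mismatch as it happens
watch -n 2 'gluster volume heal vol1 info'
tail -F /var/log/glusterfs/glustershd.log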
Bhaskar,

This is my first bug in the disperse component. Could you please share the steps to recreate the issue? I tried the following steps but to no avail (a rough command-line sketch of these steps is given below):

1) Create, start, and mount a 6+2 disperse volume (all bricks on the same node).
2) Create data in the volume -- do a kernel untar on the mount point.
3) Once the data is created, replace the last brick with the replace-brick commit force command.
4) Monitor the output of `gluster volume heal <VOL> info` under `watch`, and `tail -F glustershd.log` in an adjacent pane.

The result: I see heal info reporting entries constantly, and the list only keeps growing with time. In the glustershd.log file, on the other hand, I actually see the following:

[2016-02-15 13:20:30.002542] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2016-02-15 13:20:30.002542] and [2016-02-15 13:20:30.003022]
[2016-02-15 13:20:30.003648] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.003696] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004380] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004388] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004403] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004524] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004588] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.005269] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.005297] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:31.610589] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:31.610612] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:31.610621] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:21:31.002399] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:21:31.002579] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:21:31.002589] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 460 times between [2016-02-15 13:21:31.002399] and [2016-02-15 13:22:53.470491]
The message "W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)" repeated 455 times between [2016-02-15 13:21:31.002579] and [2016-02-15 13:22:53.470496]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]" repeated 455 times between [2016-02-15 13:21:31.002589] and [2016-02-15 13:22:53.470498]
[2016-02-15 13:23:32.002522] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:23:32.002699] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:23:32.002711] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]

-Krutika
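The steps above, roughly as commands. The hostname, brick paths, mount point, and tarball name are illustrative, not taken from the original report:

# 1) Create, start, and mount a 6+2 disperse volume, all bricks on one node
gluster volume create ec-disperse disperse 8 redundancy 2 node1:/bricks/b{1..8} force
gluster volume start ec-disperse
mount -t glusterfs node1:/ec-disperse /mnt/ec

# 2) Create data -- kernel untar on the mount point
tar -xf linux-4.4.tar.xz -C /mnt/ec

# 3) Replace the last brick with commit force
gluster volume replace-brick ec-disperse node1:/bricks/b8 node1:/bricks/b8-new commit force

# 4) Monitor heal info and the shd log in adjacent panes
watch -n 2 'gluster volume heal ec-disperse info'
tail -F /var/log/glusterfs/glustershd.log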
I now find heal-info reporting zero entries intermittently while shd continues to log the names of files it is healing. Canceling NEEDINFO.
Steps followed to verify the issue (a rough command-line sketch of these steps is given after the output below):

1. Created a Distributed-Disperse volume.
2. Created 5 files on the volume.
3. Replaced one of the bricks.
4. Viewed the list of files that needed healing.

*************************************************************************
[root@buildserver glusterfs]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-2
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-5
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-6
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0

[root@buildserver glusterfs]#
*************************************************************************

5. No heal message was noticed in the log.

[root@buildserver glusterfs]# grep -i 122058 *.log
[root@buildserver glusterfs]#

6. Triggered self-heal only for the files which require healing.

[root@buildserver ~]# gluster volume heal ec-1
Launching heal operation to perform index self heal on volume ec-1 has been successful
Use heal info commands to check status
[root@buildserver ~]#

7. Heal messages corresponding to the files which got healed were noticed.
*************************************************************************
[root@buildserver glusterfs]# grep -i 122058 *.log
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:49.921672] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:50.942957] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:52.084542] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:58.187489] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:59.192478] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:01.232530] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:06.336938] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:15.489458] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:16.510026] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:20.565422] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:28.895143] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F" repeated 2 times between [2017-01-04 08:08:28.895143] and [2017-01-04 08:08:49.354147]
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F" repeated 6 times between [2017-01-04 08:08:20.565422] and [2017-01-04 08:09:00.313412]
[root@buildserver glusterfs]#
*************************************************************************
8. Post healing, no files were listed under "need healing".

*************************************************************************
[root@buildserver ~]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-2
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-5
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-6
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0

[root@buildserver ~]#
*************************************************************************

From the above observations, the component is behaving as expected.
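For reference, steps 1-8 above map roughly to the commands below. The brick layout (2 x (4+2) = 12 bricks), the mount point, the file names and sizes, and the new-brick path are assumptions inferred from the output, not the exact commands used:

# 1. Create and start a Distributed-Disperse volume (2 x (4+2))
gluster volume create ec-1 disperse 6 redundancy 2 buildserver:/lab_work/store/ec-{1..12} force
gluster volume start ec-1
mount -t glusterfs buildserver:/ec-1 /mnt/ec-1

# 2. Create some files on the volume (names here are illustrative)
for i in $(seq 1 10); do dd if=/dev/urandom of=/mnt/ec-1/f$i bs=1M count=10; done

# 3. Replace one of the bricks
gluster volume replace-brick ec-1 buildserver:/lab_work/store/ec-3 buildserver:/lab_work/store/ec-3-new commit force

# 4./8. List files that need healing; sum pending entries across all bricks
gluster volume heal ec-1 info
gluster volume heal ec-1 info | awk '/Number of entries:/ {sum += $NF} END {print sum}'

# 6. Trigger index self-heal for only the files that require it
gluster volume heal ec-1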
This bug will be fixed by https://review.gluster.org/#/c/16552/, since with that patch heal is no longer triggered for a file that does not require it. However, I think we should also change the level of the gf_msg to DEBUG, as we did in https://review.gluster.org/#/c/16478/. Sunil, could you please send the patch for this?
Snippet from comment 8 referring to the fix for this issue: > This bug will be fixed by https://review.gluster.org/#/c/16552/, > since with that patch heal is no longer triggered for a file that does not require it.
Changed the log level as part of: https://review.gluster.org/#/c/16580/
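With that change the message is demoted to DEBUG, so it only appears once client logging is raised to DEBUG. A sketch of how to surface it again (volume name is illustrative); note that, as the verification logs further below show, the demoted message is emitted with MSGID 0, so grep for the message text rather than for 122058:

# Raise the client-side log level so DEBUG messages are written
gluster volume set ec-1 diagnostics.client-log-level DEBUG

# The demoted heal message no longer carries MSGID 122058; match on its text
grep "name heal successful" /var/log/glusterfs/*.log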
I don't know if the fix we are delivering is in line with the actual expectation for this issue. The issue, as I understand it, is that heal messages were seen while heal info showed zero entries -- that is, there were entries undergoing heal that were not reflected in heal info. The fix, however, pushes all healing logs down to DEBUG level. Going by this, the issue gets hidden rather than actually fixed. Correct me if I am wrong.
OnQA validation (based on comment #15; a rough command sketch follows below):

1) Set the log level (brick and client) to DEBUG on an EC volume; had a few files created, and while a new file, say "new_brickdown", was being created and written to using dd,
2) brought bricks down and back online while new_brickdown was still being written to,
3) while the dd to new_brickdown was still in progress, started reads on healthy files (i.e., other files which didn't require heals) using md5sum, and also started appending to a healthy file which didn't require heals,
4) didn't see any heal messages for the healthy files (while reading them from another client), which is expected -- previously we were seeing heal messages on healthy files too, which was not right as we don't have to heal them -- while I do see the message for the file getting healed as part of the append (i.e., new_brickdown, which was supposed to get healed as part of the dd).

NOTE: THIS IS SEEN IN THE FUSE MOUNT LOG

[2018-05-21 14:14:36.431467] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.505557] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.578142] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.608751] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.618004] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F

Hence moving to verified, as other files are not getting picked for heal (because they don't require heal).

Tested on 3.12.2-11.
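A rough sketch of the validation scenario above. The mount points, file names, brick-restart method, and the mount-log filename (derived from the mount path) are assumptions:

# 1) Raise brick- and client-side log levels to DEBUG
gluster volume set dispersevol1 diagnostics.brick-log-level DEBUG
gluster volume set dispersevol1 diagnostics.client-log-level DEBUG

# Start a long-running write to the file that will need heal
dd if=/dev/zero of=/mnt/dispersevol1/dir1/new_brickdown bs=1M count=4096 &

# 2) Bring bricks down (kill the brick PIDs shown by volume status),
#    then bring them back online
gluster volume status dispersevol1
kill <brick-pid>            # placeholder; use a PID from the status output
gluster volume start dispersevol1 force

# 3) From another client, read and append to healthy files
md5sum /mnt2/dispersevol1/healthy_file
echo "append" >> /mnt2/dispersevol1/healthy_file

# 4) Only new_brickdown should show heal messages in the fuse mount log
grep "name heal successful" /var/log/glusterfs/mnt-dispersevol1.log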
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607