Bug 1282731
| Field | Value |
| --- | --- |
| Summary | Entry heal messages in glustershd.log while no entries shown in heal info |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | disperse |
| Version | rhgs-3.1 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Bhaskarakiran <byarlaga> |
| Assignee | Sunil Kumar Acharya <sheggodu> |
| QA Contact | Nag Pavan Chilakam <nchilaka> |
| CC | amukherj, aspandey, nchilaka, pkarampu, rhinduja, rhs-bugs, sheggodu, ubansal |
| Keywords | ZStream |
| Target Release | RHGS 3.4.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | rebase |
| Fixed In Version | glusterfs-3.12.2-1 |
| Doc Type | Bug Fix |
| Type | Bug |
| Clones | 1420619 (view as bug list) |
| Bug Blocks | 1420619, 1422766, 1422783, 1503134 |
| Last Closed | 2018-09-04 06:26:58 UTC |
Description Bhaskarakiran 2015-11-17 09:26:05 UTC
Bhaskar, this is my first bug in the disperse component. Could you please share the steps to recreate the issue? I tried the following steps, but to no avail (a scripted sketch of these steps follows this comment):

1) Create, start, and mount a 6+2 disperse volume (all bricks on the same node).
2) Create data in the volume: do a kernel untar on the mount point.
3) Once the data is created, replace the last brick with the replace-brick commit force command.
4) Monitor the output of `gluster volume heal <VOL> info` and "watch" the output of `tail -F glustershd.log` in the adjacent pane.

The result: I see heal info reporting entries constantly, and the list only keeps growing with time. In the glustershd.log file, on the other hand, I actually see the following:

*************************************************************************
[2016-02-15 13:20:30.002542] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2016-02-15 13:20:30.002542] and [2016-02-15 13:20:30.003022]
[2016-02-15 13:20:30.003648] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.003696] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004380] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004388] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004403] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.004524] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:30.004588] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.004599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:30.005269] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:30.005297] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:20:31.610589] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:20:31.610612] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:20:31.610621] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
[2016-02-15 13:21:31.002399] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:21:31.002579] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:21:31.002589] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
The message "W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'" repeated 460 times between [2016-02-15 13:21:31.002399] and [2016-02-15 13:22:53.470491]
The message "W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)" repeated 455 times between [2016-02-15 13:21:31.002579] and [2016-02-15 13:22:53.470496]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]" repeated 455 times between [2016-02-15 13:21:31.002589] and [2016-02-15 13:22:53.470498]
[2016-02-15 13:23:32.002522] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-ec-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2016-02-15 13:23:32.002699] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-02-15 13:23:32.002711] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-ec-disperse-0: Heal failed [Invalid argument]
*************************************************************************

-Krutika

I now find heal info reporting zero entries intermittently while shd continues to log the names of the files it is healing. Canceling NEEDINFO.
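For reference, a minimal scripted sketch of the reproduction steps above. The host name, volume name, brick paths, mount point, and tarball path are illustrative assumptions, not values taken from this bug.

*************************************************************************
#!/bin/bash
# Hedged sketch of the reproduction steps. HOST, VOL, brick paths,
# mount point, and the kernel tarball location are all assumptions.
HOST=server1
VOL=ecvol

# 1) Create, start, and mount a 6+2 disperse volume. "force" is needed
#    because all eight bricks live on the same node.
gluster volume create $VOL disperse-data 6 redundancy 2 \
    $HOST:/bricks/b{1..8} force
gluster volume start $VOL
mkdir -p /mnt/$VOL
mount -t glusterfs $HOST:/$VOL /mnt/$VOL

# 2) Create data in the volume: untar a kernel source tree on the mount.
tar -xf /tmp/linux.tar.xz -C /mnt/$VOL

# 3) Replace the last brick with "replace-brick ... commit force".
gluster volume replace-brick $VOL $HOST:/bricks/b8 $HOST:/bricks/b8_new \
    commit force

# 4) Monitor both views, each in its own terminal pane:
#      gluster volume heal $VOL info
#      tail -F /var/log/glusterfs/glustershd.log
*************************************************************************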
Steps followed to verify the issue:

1. Created a distributed-disperse volume.
2. Created 5 files on the volume.
3. Replaced one of the bricks.
4. Viewed the list of files that needed healing:

*************************************************************************
[root@buildserver glusterfs]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-2
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-5
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-6
/f5
/f10
Status: Connected
Number of entries: 2

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0
[root@buildserver glusterfs]#
*************************************************************************

5. No heal message was noticed in the log:

*************************************************************************
[root@buildserver glusterfs]# grep -i 122058 *.log
[root@buildserver glusterfs]#
*************************************************************************

6. Triggered self-heal only for the files which require healing:

*************************************************************************
[root@buildserver ~]# gluster volume heal ec-1
Launching heal operation to perform index self heal on volume ec-1 has been successful
Use heal info commands to check status
[root@buildserver ~]#
*************************************************************************

7. Heal messages corresponding to the files which got healed were noticed:

*************************************************************************
[root@buildserver glusterfs]# grep -i 122058 *.log
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:49.921672] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:50.942957] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:52.084542] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:58.187489] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:07:59.192478] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:01.232530] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:06.336938] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:15.489458] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:16.510026] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:20.565422] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:[2017-01-04 08:08:28.895143] I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f10: name heal successful on 3F" repeated 2 times between [2017-01-04 08:08:28.895143] and [2017-01-04 08:08:49.354147]
lab_work-fuse_mounts-mount1.log:The message "I [MSGID: 122058] [ec-heal.c:2364:ec_heal_do] 0-ec-1-disperse-0: /f5: name heal successful on 3F" repeated 6 times between [2017-01-04 08:08:20.565422] and [2017-01-04 08:09:00.313412]
[root@buildserver glusterfs]#
*************************************************************************

8. Post healing, no files were listed as needing healing:

*************************************************************************
[root@buildserver ~]# gluster volume heal ec-1 info
Brick buildserver:/lab_work/store/ec-1
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-2
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-3
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-4
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-5
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-6
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-7
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-8
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-9
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-10
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-11
Status: Connected
Number of entries: 0

Brick buildserver:/lab_work/store/ec-12
Status: Connected
Number of entries: 0
[root@buildserver ~]#
*************************************************************************

From the above observations, the component is behaving as expected.
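The before/after checks in steps 4 through 8 can be bundled into a small watcher that correlates the two views of healing. A minimal sketch, assuming the volume name ec-1 and the default glusterfs log directory:

*************************************************************************
#!/bin/bash
# Sketch: compare what "heal info" reports against the name-heal
# messages (MSGID 122058) in the logs. VOL and LOGDIR are assumptions.
VOL=ec-1
LOGDIR=/var/log/glusterfs

# Sum "Number of entries:" across all bricks reported by heal info.
entries=$(gluster volume heal "$VOL" info |
    awk '/^Number of entries:/ {sum += $NF} END {print sum + 0}')

# Count logged name-heal messages across all log files.
heals=$(grep -h 'MSGID: 122058' "$LOGDIR"/*.log 2>/dev/null | wc -l)

echo "heal-info pending entries: $entries"
echo "name-heal messages logged: $heals"
# In the buggy behavior, the message count kept growing even while
# the pending-entries count stayed at zero.
*************************************************************************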
This bug will be fixed by https://review.gluster.org/#/c/16552/, as with that patch heal is not triggered for a file that does not require it. However, I think we should change the level of the gf_msg to DEBUG, as we have done in https://review.gluster.org/#/c/16478/. Sunil, could you please send the patch for this?

Snippet from comment 8 referring to the fix for this issue:

> This bug will be fixed by https://review.gluster.org/#/c/16552/
> as it is not triggering heal for a file it does not require.

Changed the log level as part of: https://review.gluster.org/#/c/16580/

I don't know if the fix we are delivering is in line with what the issue actually is. The issue, as I understand it, is that heal messages were seen while heal info showed zero entries; that means there were entries undergoing heal that were not reflected in heal info. The fix, however, pushes all healing logs down to DEBUG level. Going by this, the issue gets hidden rather than actually fixed. Correct me if I am wrong.
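With https://review.gluster.org/#/c/16580/ the per-file heal messages are logged at DEBUG, so they only become visible once the log levels are raised, which is what the validation below does. A hedged example using the standard gluster diagnostics options; the volume name dispersevol1 is taken from the validation log, and the grep is illustrative:

*************************************************************************
# Raise brick- and client-side log levels to DEBUG so the demoted
# heal messages become visible again (volume name is an assumption).
gluster volume set dispersevol1 diagnostics.brick-log-level DEBUG
gluster volume set dispersevol1 diagnostics.client-log-level DEBUG

# The name-heal messages then appear at level D in the FUSE mount log:
grep 'ec_heal_do' /var/log/glusterfs/*.log
*************************************************************************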
OnQA validation (based on comment #15):

1) Set the log level (brick and client) to DEBUG on an EC volume. Had a few files created, and a new file, say "new_brickdown", was created and being written into using dd.
2) Brought down bricks and brought them back online while new_brickdown was still being written into.
3) While the dd was still running on new_brickdown, started reads on healthy files (i.e., other files which did not require heal) using md5sum, and also started appending to a healthy file which did not require heal.
4) Did not see any heal messages for the healthy files (while doing reads on these files from another client), which is expected; previously we were seeing heal messages on healthy files too, which was not right, as we do not have to heal them. I do see the message for the file getting healed (i.e., new_brickdown, which was supposed to get healed as part of the dd).

NOTE: this is seen in the FUSE mount log:

*************************************************************************
[2018-05-21 14:14:36.431467] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.505557] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.578142] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.608751] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
[2018-05-21 14:14:36.618004] D [MSGID: 0] [ec-heal.c:2486:ec_heal_do] 2-dispersevol1-disperse-0: /dir1/new_brickdown: name heal successful on 3F
*************************************************************************

Hence moving to verified, as other files are not getting picked for heal (as they do not require heal). Tested on 3.12.2-11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607