Bug 1645480
| Summary: | Files pending heal in Arbiter volume | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Anees Patel <anepatel> |
| Component: | arbiter | Assignee: | Karthik U S <ksubrahm> |
| Status: | CLOSED ERRATA | QA Contact: | Anees Patel <anepatel> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.4 | CC: | amukherj, anepatel, apaladug, bkunal, ksubrahm, nchilaka, rcyriac, rhs-bugs, sanandpa, sankarshan, sheggodu, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.4.z Batch Update 3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-33 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-02-04 07:41:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1362129, 1655854, 1657783 | | |
| Bug Blocks: | | | |
Description
Anees Patel
2018-11-02 10:39:37 UTC
*** Bug 1645482 has been marked as a duplicate of this bug. ***

I tried this locally and hit two issues.

1. As Anees pointed out, the arbiter becomes a source according to the xattrs that get set. This happens because of the new-entry marking. When the entry was healed from the source bricks (here the 2nd and 3rd bricks, since they had the renamed file, with the 1st brick as the sink), heal ran from brick 2 to brick 1. During this the file was recreated on the sink brick, and new-entry marking happened on the source bricks. As a result, brick 3 (the arbiter brick) ends up with data-pending attributes set against both data bricks: against brick 2 from step 3 of the description, and against brick 1 from the new-entry marking during entry heal. This issue is a duplicate of BZ #1340032.

Steps 6 and 7 would have succeeded because of caching. Can you share the client log so we can verify whether step 7 actually succeeded? For step 7 you should see a write failing with error EIO in the logs.

2. Sometimes in step 8, after heal has completed, even though the entry on brick 1 gets recreated as part of entry heal, brick 2 still holds a data-pending attribute against it.
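The stuck state described above can be illustrated with a deliberately simplified sketch. This is hypothetical model code, not the actual glusterfs AFR source-selection implementation; the `pending` matrix mirrors the `trusted.afr.*` counters from the retry2 xattr dump in this comment.

```python
# Hypothetical, simplified model of AFR data-heal source selection (NOT the
# real glusterfs code). pending[i][j] is brick i's data-pending counter
# blaming brick j.
def data_heal_sources(pending, arbiter_index):
    n = len(pending)
    # a brick is a sink if any other brick holds a nonzero counter against it
    blamed = {j for i in range(n) for j in range(n) if i != j and pending[i][j]}
    sources = [i for i in range(n) if i not in blamed]
    # the arbiter stores only metadata, so it can never serve file data
    usable = [s for s in sources if s != arbiter_index]
    return sources, usable

# Counters from the retry2 dump (brick order: brick 1, brick 2, arbiter):
pending = [
    [0, 1, 0],  # brick 1 blames brick 2 (trusted.afr.gv1-client-1 = ...01...)
    [1, 0, 0],  # brick 2 blames brick 1 (trusted.afr.gv1-client-0 = ...01...)
    [1, 0, 0],  # arbiter blames brick 1, and has an all-zero entry for brick 2
]
sources, usable = data_heal_sources(pending, arbiter_index=2)
print(sources, usable)  # -> [2] []
```

Both data bricks are blamed, so the arbiter is the only unblamed brick; since it cannot be a data source, the list of usable sources is empty and data heal cannot proceed.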
# file: home/kus/gbricks/br1/retry2
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv1-client-1=0x000000010000000000000000
trusted.gfid=0x2d22d2e8ef714deebd78519fa072538e
trusted.gfid2path.026991ed7ee7606b=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727931
trusted.gfid2path.1bba90641876e471=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727932

# file: home/kus/gbricks/br2/retry2
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv1-client-0=0x000000010000000000000000
trusted.gfid=0x2d22d2e8ef714deebd78519fa072538e
trusted.gfid2path.1bba90641876e471=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727932

# file: home/kus/gbricks/br3/retry2
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv1-client-0=0x000000010000000000000000
trusted.afr.gv1-client-1=0x000000000000000000000000
trusted.gfid=0x2d22d2e8ef714deebd78519fa072538e

SHD log snippet:

[2018-11-09 13:04:34.543762] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-gv1-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2018-11-09 13:04:34.553159] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-gv1-replicate-0: expunging file 00000000-0000-0000-0000-000000000001/retry1 (2d22d2e8-ef71-4dee-bd78-519fa072538e) on gv1-client-0
[2018-11-09 13:04:34.584284] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-gv1-replicate-0: Completed data selfheal on 2d22d2e8-ef71-4dee-bd78-519fa072538e. sources=[0] 2 sinks=1
[2018-11-09 13:04:34.584719] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 0-gv1-client-0: remote operation failed. Path: <gfid:2d22d2e8-ef71-4dee-bd78-519fa072538e> (2d22d2e8-ef71-4dee-bd78-519fa072538e) [No such file or directory]
[2018-11-09 13:04:34.587619] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-gv1-replicate-0: Completed entry selfheal on 00000000-0000-0000-0000-000000000001. sources=[1] 2 sinks=0
[2018-11-09 13:04:35.609488] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-gv1-replicate-0: performing metadata selfheal on 2d22d2e8-ef71-4dee-bd78-519fa072538e
[2018-11-09 13:04:35.611987] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-gv1-replicate-0: Completed metadata selfheal on 2d22d2e8-ef71-4dee-bd78-519fa072538e. sources=[1] 2 sinks=0

In both these cases the heal will not proceed: the arbiter is the only source for the heal, but it cannot be a source for data heal. I have to investigate further to see why issue 2 is happening. Adding needinfo for the client log, to verify that step 7 did not report an error because of write-behind.

getfattr for the above file "retry2":

brick1:

# getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_122ea358da894515bd3c3076cc136006/brick_a3e16a8004a3cb29edafab859d182fc0/brick/retry2
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_122ea358da894515bd3c3076cc136006/brick_a3e16a8004a3cb29edafab859d182fc0/brick/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_a00f012d78a5a167acc5abc66795ef95-client-1=0x000000010000000000000000
trusted.gfid=0x7c0c51d1c6c7455d8e408877c197996d
trusted.gfid2path.1bba90641876e471=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727932

brick2:

# getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_53cc243e52473fe12b459a5e46181e0c/brick_ee67805c4ba175f0a122b0374c15a130/brick/retry2
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_53cc243e52473fe12b459a5e46181e0c/brick_ee67805c4ba175f0a122b0374c15a130/brick/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_a00f012d78a5a167acc5abc66795ef95-client-0=0x000000010000000000000000
trusted.gfid=0x7c0c51d1c6c7455d8e408877c197996d
trusted.gfid2path.026991ed7ee7606b=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727931
trusted.gfid2path.1bba90641876e471=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727932

arbiter brick:

# getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_34af297f6dca3c38b7a297db36acc230/brick_748ba54728b88270ab0dc0d960006aae/brick/retry2
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_34af297f6dca3c38b7a297db36acc230/brick_748ba54728b88270ab0dc0d960006aae/brick/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_a00f012d78a5a167acc5abc66795ef95-client-0=0x000000000000000000000000
trusted.afr.vol_a00f012d78a5a167acc5abc66795ef95-client-1=0x000000010000000000000000
trusted.gfid=0x7c0c51d1c6c7455d8e408877c197996d
trusted.gfid2path.1bba90641876e471=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f726574727932

The issue is also reproducible on a plain arbiter volume. Tested on the latest build:

# rpm -qa | grep gluster
glusterfs-libs-3.12.2-27.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-27.el7rhgs.x86_64
python2-gluster-3.12.2-27.el7rhgs.x86_64
glusterfs-events-3.12.2-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-27.el7rhgs.x86_64

Do we know how many customers have reported the issue so far?
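The hex xattr values above can be decoded by hand: a `trusted.afr.*` value is three big-endian 32-bit counters (data, metadata, and entry pending operations), and a `trusted.gfid2path.*` value is a hex-encoded "<parent-gfid>/<basename>" string. A small decoding sketch (the helper names are my own, not part of any gluster tool):

```python
import struct

def decode_afr(hexval):
    """Split a trusted.afr.* value into (data, metadata, entry) pending counts."""
    raw = bytes.fromhex(hexval[2:] if hexval.startswith("0x") else hexval)
    return struct.unpack(">III", raw)  # three network-byte-order uint32s

def decode_gfid2path(hexval):
    """Decode a trusted.gfid2path.* value into '<parent-gfid>/<basename>'."""
    raw = bytes.fromhex(hexval[2:] if hexval.startswith("0x") else hexval)
    return raw.decode()

# Values from the arbiter brick dump above:
print(decode_afr("0x000000010000000000000000"))  # -> (1, 0, 0): 1 data op pending
print(decode_gfid2path(
    "0x30303030303030302d303030302d303030302d303030302d"
    "3030303030303030303030312f726574727932"))     # -> .../retry2 under the root gfid
```

This confirms the analysis in the comments: the nonzero first counter in `trusted.afr.vol_...-client-1` on the arbiter is a data-pending mark, which the arbiter can never heal since it holds no file data.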
Verified the fix per the above test plan. Verified on build:

# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-36.el7rhgs.x86_64
glusterfs-events-3.12.2-36.el7rhgs.x86_64
glusterfs-fuse-3.12.2-36.el7rhgs.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263