Bug 1792821
| Field | Value |
| --- | --- |
| Summary: | Heal pending on brick post upgrading from RHV 4.2.8 or RHV 4.3.7 to RHV 4.3.8 |
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Component: | rhhi |
| Version: | rhhiv-1.7 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | unspecified |
| Reporter: | milind <mwaykole> |
| Assignee: | Ravishankar N <ravishankar> |
| QA Contact: | milind <mwaykole> |
| Docs Contact: | |
| CC: | godas, mmuench, pasik, ravishankar, rcyriac, rhs-bugs, sasundar, smitra, swachira |
| Target Milestone: | --- |
| Target Release: | RHHI-V 1.8 |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, healing of entries in directories could be triggered when only the heal source (and not the heal target) was available. This led to replication extended attributes being reset and resulted in a GFID split-brain condition when the heal target became available again. Entry healing is now triggered only when all bricks in a replicated set are available, to avoid this issue. |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 1801624 1804164 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-08-04 14:51:32 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1848893 |
| Bug Blocks: | 1779975 |
Description milind 2020-01-20 05:07:22 UTC
Ravi, what's the next step on this bug?

On looking at the setup, we found that the entry was not getting healed because the parent dir did not have any entry pending xattrs. The test (thanks Sas for the info) that writes to the prob file apparently unlinks the file before continuing to write to it, so maybe the expected result is that the file be _removed_ from all bricks, not that it is present on them:

```python
# Snippet from the test that exercises the prob file (imports added for context);
# 'path' is the prob file path on the gluster mount, defined elsewhere in the test.
import mmap
import os
import stat

f = os.open(path, os.O_WRONLY | os.O_DIRECT | os.O_DSYNC | os.O_CREAT | os.O_EXCL,
            stat.S_IRUSR | stat.S_IWUSR)
#time.sleep(20)
os.unlink(path)          # the file is unlinked while the fd is still open
#time.sleep(20)
m = mmap.mmap(-1, 1024)  # page-aligned buffer, as required for O_DIRECT writes
s = b' ' * 1024
m.write(s)
os.write(f, m)
os.close(f)
```

So it looks like one of the bricks (engine-client-0) was killed at the time of the unlink of the prob file, so the unlink did not go through on it. But AFR should have marked pending xattrs during the post-op on the good bricks (so that self-heal later removes the prob file from this brick as well). I do not see any network errors in the client log that could explain a post-op failure, so I'm not sure what happened here. We need to see if this can be consistently recreated; leaving a need-info on Milind for the same. We need the exact times at which the bricks were killed and restarted in order to correlate them with the logs.

I have also seen the same behavior when upgrading from RHV 4.2.8 to RHV 4.3.8, and also from RHV 4.3.7 to RHV 4.3.8. During this upgrade, one of the bricks was killed and the gluster software was upgraded from RHGS 3.4.4 (gluster-3.12.2-47.5) to RHGS 3.5.1 (gluster-6.0-29). After upgrading one of the nodes, the he.metadata and he.lockspace files were shown as pending heal, and that continued forever. On checking their GFIDs, they mismatched with the same files on the other 2 bricks, but self-heal was not happening because the changelog entry was missing in the parent directory.

So I am able to reproduce the issue fairly consistently (see the command sketch further below):

1. Create a 1x3 volume with RHHI options enabled.
2. Create and write to a file from the mount.
3. Bring one brick down, then delete and re-create the file so that there is a pending (granular) entry heal.
4. With the brick still down, launch the index heal.

Even though there is nothing to be healed (since the sink brick is still down), index heal seems to be doing a no-op and resetting the parent dir's AFR changelog xattrs, which is why the entry never gets healed. In the QE setup, the same race is happening: even before the upgraded node comes online, the shd does the entry heal described above. We can see messages like these in the shd log where there is no 'source' and the good bricks are 'sinks':

```
[2020-02-10 05:57:55.847756] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on 77dd5a45-dbf5-4592-b31b-b440382302e9. sources= sinks=0 2
```

I need to check where the bug is in the code, whether it is specific to granular entry heal, and how to fix it.

(In reply to Ravishankar N from comment #8)
> I need to check where the bug is in the code, if it is specific to granular
> entry heal and how to fix it.

So the GFID split-brain will happen only if granular-entry heal is enabled, but even otherwise, even if only the two good bricks are up, spurious entry heals are triggered continuously, leading to many unnecessary network operations. I'm sending a fix upstream for review.
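As a rough illustration of the reproduction steps above, here is a minimal command sketch. It is not taken from the original report: the volume name testvol, the hosts host1/host2/host3, the brick path /bricks/testvol, and the mount point /mnt/testvol are all hypothetical placeholders, and only granular entry heal is enabled explicitly rather than the full RHHI option set.

```bash
# Hypothetical 1x3 replica volume; hostnames, brick paths and mount point are placeholders.
gluster volume create testvol replica 3 \
    host1:/bricks/testvol host2:/bricks/testvol host3:/bricks/testvol force
gluster volume start testvol
gluster volume heal testvol granular-entry-heal enable   # granular entry heal, relevant to this bug
mount -t glusterfs host1:/testvol /mnt/testvol

# Create and write to a file from the mount.
echo data > /mnt/testvol/prob

# Bring one brick down (kill its brick process), then delete and re-create the file
# so a pending (granular) entry heal is recorded against the parent directory.
gluster volume status testvol        # note the PID of the brick process on host3
ssh host3 kill <brick-pid>
rm /mnt/testvol/prob
echo data > /mnt/testvol/prob

# With the brick still down, launch the index heal and then inspect the parent
# directory's AFR changelog xattrs on a good brick.
gluster volume heal testvol
getfattr -d -m . -e hex /bricks/testvol/
```

On builds without the fix, this index heal can reset the parent directory's trusted.afr.* changelog xattrs even though the sink brick is still down; with the fix described in the Doc Text above, entry heal is attempted only when all bricks of the replicated set are available, so the pending xattrs remain until the sink brick returns.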
Upstream patch: https://review.gluster.org/#/c/glusterfs/+/24109/

```
[node1]# rpm -qa | grep -i glusterfs
glusterfs-libs-6.0-37.1.el8rhgs.x86_64
glusterfs-geo-replication-6.0-37.1.el8rhgs.x86_64
glusterfs-rdma-6.0-37.1.el8rhgs.x86_64
glusterfs-api-6.0-37.1.el8rhgs.x86_64
glusterfs-server-6.0-37.1.el8rhgs.x86_64
glusterfs-fuse-6.0-37.1.el8rhgs.x86_64
glusterfs-cli-6.0-37.1.el8rhgs.x86_64
glusterfs-events-6.0-37.1.el8rhgs.x86_64
glusterfs-6.0-37.1.el8rhgs.x86_64
glusterfs-client-xlators-6.0-37.1.el8rhgs.x86_64

[node1]# imgbase w
You are on rhvh-4.4.1.1-0.20200713.0+1

[node1]# rpm -qa | grep -i ansible
gluster-ansible-maintenance-1.0.1-9.el8rhgs.noarch
gluster-ansible-cluster-1.0-1.el8rhgs.noarch
ansible-2.9.10-1.el8ae.noarch
gluster-ansible-features-1.0.5-7.el8rhgs.noarch
gluster-ansible-roles-1.0.5-17.el8rhgs.noarch
ovirt-ansible-engine-setup-1.2.4-1.el8ev.noarch
gluster-ansible-infra-1.0.4-11.el8rhgs.noarch
ovirt-ansible-hosted-engine-setup-1.1.6-1.el8ev.noarch
gluster-ansible-repositories-1.0.1-2.el8rhgs.noarch
```

As I don't see any pending heals in the RHHI-V setup, I am marking this bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3314
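For context on the verification above, this is a minimal sketch of the kind of checks that confirm there are no pending heals or GFID mismatches after the upgrade. The volume name engine, the brick root, the hostnames, and the file path are placeholders and do not come from the verification output.

```bash
# Hypothetical volume name and brick layout; adjust to the actual RHHI-V setup.
VOL=engine
BRICK_ROOT=/gluster_bricks/engine/engine
FILE=ha_agent/hosted-engine.metadata         # placeholder; actual path of the he.metadata file under the brick

gluster volume heal "$VOL" info              # should list 0 entries on every brick
gluster volume heal "$VOL" info split-brain  # should list no split-brain entries

# The GFID of the file must be identical on all three bricks; compare trusted.gfid per brick.
for host in host1 host2 host3; do
    ssh "$host" getfattr -n trusted.gfid -e hex "$BRICK_ROOT/$FILE"
done
```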