Bug 1368437
Summary: | Remove-brick: Remove-brick rebalance failed during continuous lookup+directory rename | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Prasad Desala <tdesala> | |
Component: | distribute | Assignee: | Csaba Henk <csaba> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Sayalee <saraut> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | rhgs-3.1 | CC: | rhs-bugs, saraut, storage-qa-internal | |
Target Milestone: | --- | Keywords: | Triaged, ZStream | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1395221 | Environment: | ||
Last Closed: | 2020-12-17 08:52:15 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: |
Description
Prasad Desala
2016-08-19 11:50:31 UTC
In the current code, if it is a remove-brick operation, we abort migration on any kind of failure.

<code snippet from gf_defrag_fix_layout>
    ret = syncop_lookup (this, loc, &iatt, NULL, NULL, NULL);
    if (ret) {
            gf_log (this->name, GF_LOG_ERROR, "Lookup failed on %s", loc->path);
            ret = -1;
            goto out;
    }
<gf_defrag_fix_layout>

Since the reproducer involves renaming directories, this is a race condition: readdirp has returned the old name, and the lookup done as part of fix_layout happens after the rename, so it fails.

I think we can make remove-brick ignore ENOENT errors (not sure about ESTALE). For ESTALE, we may need to consider all the cases. Will send a patch once I resolve the ESTALE part.

Since the operation is remove-brick, the race pointed out in comment 3 can result in directories not being migrated (no fix-layout and no migration for the entire sub-tree).

In my opinion we should retain the failure as it is, which will be an indication to the admin that there may be files left on the removed brick.

Nithya, need your input on this.

upstream patch: http://review.gluster.org/#/c/15846

While verifying Bug 1400037 on 3.8.4-8, I saw the observations below:

1) Continuous multiple errors for failed lookup:
<SNIP>
[2016-12-13 11:22:11.661154] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_500_.d lookup failed with 2
[2016-12-13 11:22:11.663923] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_501_.d lookup failed with 2
</SNIP>

2) Setxattr failed error:
[2016-12-13 11:28:05.935641] E [dht-rebalance.c:3348:gf_defrag_fix_layout] 0-samsung-dht: Setxattr failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007/_05_9732_.d

3) Fix layout failing on errors:
[2016-12-13 11:28:05.936034] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007
[2016-12-13 11:28:05.936398] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009
[2016-12-13 11:28:05.936739] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05
[2016-12-13 11:28:05.937916] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com
[2016-12-13 11:28:05.938438] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir

4) Rebalance failure:
[2016-12-13 11:28:05.939820] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-samsung-dht: Rebalance is failed. Time taken is 378.00 secs
[2016-12-13 11:28:05.939848] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-samsung-dht: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0

Here is an update on the scope of the fix that is upstream right now. The patch is certainly immune to a single directory rename, but not to continuous directory renames, e.g. renaming 1->2, 2->3, 3->4 and so forth.

In its current state, the patch tries to get the new name of the directory and move on with the new name. But in the continuous-rename scenario, even the new name that rebalance obtained would no longer exist, since the client would have renamed that entry as well.

As part of my testing, I renamed the directory just before fix-layout was called, and rebalance carried on successfully with the new name.

I want to set the right expectation here so that there are no surprises.

Regards,
Susant

(In reply to Susant Kumar Palai from comment #4)
> Since the operation is remove-brick, and the race pointed in comment 3 can
> result in directories being not migrated (no fix-layout + no migration for
> the entire sub-tree).
>
> In my opinion we should retain the failure as it is, which will be an
> indication to admin that there may be files left on the removed brick.
>
> Nithya, need your input on this.

Yes, I agree.

*** Bug 1368093 has been marked as a duplicate of this bug. ***
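For readers who want to see the race window from the comments above in isolation, here is a minimal sketch in plain POSIX C. It is not GlusterFS code: it uses opendir/readdir, rename and lstat on a local temporary directory in place of readdirp and syncop_lookup, and the names stale_lookup_errno, classify_lookup and the lookup_verdict enum are hypothetical. It only illustrates why a lookup on a name returned by an earlier listing fails with errno 2 (ENOENT), matching the "lookup failed with 2" log lines.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for mkdtemp() and lstat() under strict -std */
#endif
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical verdicts; these names do not come from the GlusterFS source. */
enum lookup_verdict { LOOKUP_OK, LOOKUP_RACED, LOOKUP_FAILED };

/* Classify a lookup by name. ENOENT means the entry is gone, for example
 * because it was renamed after it was listed. The caller decides whether
 * a raced entry is skipped or treated as a hard failure. */
enum lookup_verdict classify_lookup(const char *path)
{
    struct stat st;

    if (lstat(path, &st) == 0)
        return LOOKUP_OK;
    return (errno == ENOENT) ? LOOKUP_RACED : LOOKUP_FAILED;
}

/* List a directory, rename the listed entry, then look it up by the old
 * name. Returns the errno of that stale lookup (0 if it unexpectedly
 * succeeded), or -1 if the setup itself failed. */
int stale_lookup_errno(void)
{
    char base[] = "/tmp/rename-race-XXXXXX";
    char listed[64], renamed[64];
    struct dirent *ent;
    struct stat st;
    DIR *dir;
    int seen = 0;
    int err = -1;

    if (mkdtemp(base) == NULL)
        return -1;
    snprintf(listed, sizeof(listed), "%s/d_old", base);
    snprintf(renamed, sizeof(renamed), "%s/d_new", base);
    if (mkdir(listed, 0700) != 0)
        goto cleanup;

    /* Step 1: the crawler lists the parent and remembers the old name. */
    dir = opendir(base);
    if (dir == NULL)
        goto cleanup;
    while ((ent = readdir(dir)) != NULL)
        if (strcmp(ent->d_name, "d_old") == 0)
            seen = 1;
    closedir(dir);
    if (!seen)
        goto cleanup;

    /* Step 2: a client renames the directory before the crawler gets to it. */
    if (rename(listed, renamed) != 0)
        goto cleanup;

    /* Step 3: the crawler looks up the name it listed earlier. */
    err = (lstat(listed, &st) == 0) ? 0 : errno;

cleanup:
    /* Best-effort cleanup; whichever of the two names exists is removed. */
    rmdir(listed);
    rmdir(renamed);
    rmdir(base);
    return err;
}
```

Calling stale_lookup_errno() from a small main() should report ENOENT on a local filesystem. Whether a LOOKUP_RACED verdict should be skipped or kept as a failure is exactly the question debated in this bug; for remove-brick, the comments above lean toward keeping it a failure so the admin knows data may remain on the removed brick.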