Bug 996132
| Summary: | Dist-geo-rep: too many creations and deletions of files in a loop result in geo-rep stopping processing changelogs, consequently stopping syncing to slave | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijaykumar Koppad <vkoppad> | |
| Component: | geo-replication | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> | |
| Status: | CLOSED EOL | QA Contact: | storage-qa-internal <storage-qa-internal> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 2.1 | CC: | avishwan, chrisw, csaba, david.macdonald, rhs-bugs, vagarwal, vbhat, vshankar | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | consistency | |||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1285205 (view as bug list) | Environment: | ||
| Last Closed: | Type: | Bug | ||
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1285205 | |||
|
Description
Vijaykumar Koppad
2013-08-12 13:28:14 UTC
strace shows the slave gsyncd stuck removing a directory that is not empty (we do not recursively delete the entries inside the directory, as the CHANGELOG should already have the order preserved).
gsyncd on the slave is stuck at:
[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)
[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)
--
As we see, it fails with ENOTEMPTY for '.gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91'.
Listing the entries on the slave for the above pargfid/basename shows:
[root@thunderball imaster]# ls .gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91
5208ba31~~78YQJ4TGHZ 5208ba33~~81VT1C0AKX 5208ba34~~19OE20K6WM 5208ba35~~8I9CJ6L2NE 5208ba36~~KIA71OVBCB
5208ba31~~N57NLJ2N97 5208ba33~~BBP9IDENQT 5208ba34~~40HVOMY0PT 5208ba35~~JRQJSL8PZL 5208ba36~~MFPVHLZ46F
5208ba32~~0Y289V0SN5 5208ba33~~FQVYWH24JS 5208ba34~~6HSRE4JUX7 5208ba35~~KH2Q1YH260 5208ba36~~O09F2ADM5W
5208ba32~~4F6BD26C4A 5208ba33~~GIHT8A5NB8 5208ba34~~9JMNCM30R6 5208ba35~~PQHK13EGEJ 5208ba36~~PSPKSKZTKO
5208ba32~~BGULLGFHTD 5208ba33~~I2JADKAR63 5208ba34~~F9XZQUM8I1 5208ba35~~XKWYEJYN5X 5208ba36~~TXZRDWLLYG
5208ba32~~IE7CAOX1BX 5208ba33~~L7B07556KD 5208ba34~~GTFTN4SE10 5208ba36~~412NZSA5BM 5208ba36~~XSUKX9QG96
5208ba32~~NWT4RK4P29 5208ba33~~LODHOWA7CI 5208ba34~~X1M4HNDCV2 5208ba36~~4RTD6EDIMP 5208ba37~~1DX62FW5WJ
5208ba32~~Y51Q0F72TP 5208ba33~~UAOUZUEQQA 5208ba35~~42PUU0WUV6 5208ba36~~834XSUJCSR 5208ba37~~2U2U68GIT9
5208ba33~~0HEKVJSHDP 5208ba33~~UE3LYEAJR1 5208ba35~~4SYG0V8UXZ 5208ba36~~DSGENQ0H0U 5208ba37~~AGIHOHORAL
5208ba33~~35XLBDJ8UX 5208ba33~~WRCNC15TCQ 5208ba35~~7TWJYMWOOT 5208ba36~~I34Q61H5DE 5208ba37~~TPXHT1YEIP
--
The first entry '5208ba31~~78YQJ4TGHZ' is still present on the slave (i.e. it has not yet been purged). Observing the CHANGELOG, the UNLINK of that entry appears before the rmdir() of the parent, as seen from the line numbers (which is expected):
[root@shaktiman .processing]# grep -n 'dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ' CHANGELOG.1376305335
895:E 252f0791-0acf-4e96-94f1-21dd9466f107 UNLINK dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ
[root@shaktiman .processing]# grep -n '99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91' CHANGELOG.1376305335
996:E dab97311-5b80-44c8-9f23-f4cfc6deb170 RMDIR 99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91
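The two grep hits above suggest the shape of an 'E' (entry) record: a record type, the entry's gfid, the operation, and a `%2F`-encoded `pargfid/basename` target. A small parsing sketch, with the format inferred only from the two lines shown here (not from the changelog specification):

```python
from urllib.parse import unquote

def parse_entry(record):
    """Parse an entry record of the assumed form:
    'E <gfid> <op> <pargfid>%2F<basename>'
    (strip any 'NNN:' grep -n prefix before calling)."""
    _rec_type, gfid, op, target = record.split()
    pargfid, _, basename = unquote(target).partition("/")
    return gfid, op, pargfid, basename
```

Applied to line 996 above, this yields op `RMDIR` on basename `level91` under pargfid `99e3e52c-380a-44e4-817a-7753235cdf06`, while line 895 is the earlier `UNLINK` of `5208ba31~~78YQJ4TGHZ` — confirming the changelog itself has the operations in the correct order.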
So, for some reason the purge of gfid '252f0791-0acf-4e96-94f1-21dd9466f107' failed (along with a bunch of other unlink failures).
Will look at the brick logs and update this BZ.
I restarted the geo-replication session and now the entries are purged from the slave. This looks like an entry-operation issue for gfid-based access (the errno gets swallowed, i.e. the operation is not successful and the errno is not given back). For entry creates we "fix" this by retrying the entire CHANGELOG (which is not foolproof anyway, but we do it as a best effort).

*** Bug 994957 has been marked as a duplicate of this bug. ***

We expect this to be addressed in 3.0. Keeping this open in case a backport to 2.1.z is needed.

Closing this bug since the RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
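The best-effort strategy described for entry creates can be sketched as follows. Names and structure are illustrative only, not gsyncd's API; the point is that treating EEXIST as success makes create retries harmless, while a swallowed errno on an UNLINK means the child silently survives, so the later RMDIR of its parent keeps failing with ENOTEMPTY:

```python
import errno

def replay_changelog(entries, apply_op):
    """Best-effort replay sketch: apply each entry operation; treat
    EEXIST as an already-applied create; collect everything else as
    failed so the caller can replay the whole changelog again."""
    failed = []
    for entry in entries:
        try:
            apply_op(entry)
        except OSError as err:
            if err.errno == errno.EEXIST:
                continue  # create already applied: counts as success
            failed.append(entry)
    return failed  # non-empty => replay the changelog again
```

If `apply_op` swallows an errno instead of raising, the entry never lands in `failed`, the replay considers it done, and the inconsistency only surfaces later as the ENOTEMPTY loop seen in the strace.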