Description of problem:
If many creations and deletions of files are executed in quick succession on the master mount point, geo-rep stops processing the changelogs accumulated in the .processing directory and consequently stops syncing files to the slave. The number of files in .processing kept increasing until the creation and deletion of files on the master was stopped. Even 2 hours after I/O on the master was stopped, the number of files in .processing did not come down, which means no changelogs were being processed by geo-rep and no files were being synced; the number of files on the slave also stayed constant. There were no "failed to sync" or "rsync error-code 23" messages in the geo-rep logs, i.e. there were no warning or error messages at all.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.18rhs-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce it.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Run this script on the master for about one hour:
   while :; do rm -rf /mnt/master/* ; ./crefi.py -n 100 --multi -d 10 -b 10 --random --max=500K --min=10 /mnt/master/ ; rm -rf /mnt/master/* ; ./crefi.py -n 100 --multi -d 10 -b 10 --random --max=500K --min=10 /mnt/master/ ; done

Actual results:
geo-rep stops syncing to the slave.

Expected results:
It should never stop syncing to the slave.

Additional info:
Machine info:
master: 10.70.43.86,88,130,186
slave: 10.70.43.23,41,45, 10.70.42.216
strace showed the slave gsyncd stuck removing a directory that is not empty (we do not recursively delete the entries inside the directory, as the CHANGELOG should already have the order preserved). gsyncd on the slave is stuck at:

[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)
[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)

As we can see, it fails with ENOTEMPTY for '.gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91'.
Listing the entries on the slave for the above pargfid/basename shows:

[root@thunderball imaster]# ls .gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91
5208ba31~~78YQJ4TGHZ  5208ba33~~81VT1C0AKX  5208ba34~~19OE20K6WM  5208ba35~~8I9CJ6L2NE  5208ba36~~KIA71OVBCB
5208ba31~~N57NLJ2N97  5208ba33~~BBP9IDENQT  5208ba34~~40HVOMY0PT  5208ba35~~JRQJSL8PZL  5208ba36~~MFPVHLZ46F
5208ba32~~0Y289V0SN5  5208ba33~~FQVYWH24JS  5208ba34~~6HSRE4JUX7  5208ba35~~KH2Q1YH260  5208ba36~~O09F2ADM5W
5208ba32~~4F6BD26C4A  5208ba33~~GIHT8A5NB8  5208ba34~~9JMNCM30R6  5208ba35~~PQHK13EGEJ  5208ba36~~PSPKSKZTKO
5208ba32~~BGULLGFHTD  5208ba33~~I2JADKAR63  5208ba34~~F9XZQUM8I1  5208ba35~~XKWYEJYN5X  5208ba36~~TXZRDWLLYG
5208ba32~~IE7CAOX1BX  5208ba33~~L7B07556KD  5208ba34~~GTFTN4SE10  5208ba36~~412NZSA5BM  5208ba36~~XSUKX9QG96
5208ba32~~NWT4RK4P29  5208ba33~~LODHOWA7CI  5208ba34~~X1M4HNDCV2  5208ba36~~4RTD6EDIMP  5208ba37~~1DX62FW5WJ
5208ba32~~Y51Q0F72TP  5208ba33~~UAOUZUEQQA  5208ba35~~42PUU0WUV6  5208ba36~~834XSUJCSR  5208ba37~~2U2U68GIT9
5208ba33~~0HEKVJSHDP  5208ba33~~UE3LYEAJR1  5208ba35~~4SYG0V8UXZ  5208ba36~~DSGENQ0H0U  5208ba37~~AGIHOHORAL
5208ba33~~35XLBDJ8UX  5208ba33~~WRCNC15TCQ  5208ba35~~7TWJYMWOOT  5208ba36~~I34Q61H5DE  5208ba37~~TPXHT1YEIP

The first entry '5208ba31~~78YQJ4TGHZ' is still present on the slave (i.e. it has not yet been purged). Observing the CHANGELOG, the UNLINK of this entry comes before the rmdir() of its parent, as seen from the line numbers (which is expected):

[root@shaktiman .processing]# grep -n 'dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ' CHANGELOG.1376305335
895:E 252f0791-0acf-4e96-94f1-21dd9466f107 UNLINK dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ
[root@shaktiman .processing]# grep -n '99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91' CHANGELOG.1376305335
996:E dab97311-5b80-44c8-9f23-f4cfc6deb170 RMDIR 99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91

So, for some reason the purge of gfid '252f0791-0acf-4e96-94f1-21dd9466f107' failed (along with a bunch of other unlink failures).
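The ordering property being checked above (every UNLINK of a child must appear before the RMDIR of its parent directory) can be sketched as a small checker. The record format here is an assumption based only on the decoded grep output in this bug ("E <gfid> <op> <pargfid>%2F<basename>"), not the changelog spec, and both function names are hypothetical:

```python
from urllib.parse import unquote

def parse_entry_ops(lines):
    # Parse E-type (entry) records into (gfid, op, pargfid, basename)
    # tuples. "%2F" is the URL-escaped "/" separating pargfid and
    # basename, as seen in the grep output above.
    ops = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[0] == "E":
            _, gfid, op, pbname = parts
            pargfid, basename = unquote(pbname).split("/", 1)
            ops.append((gfid, op, pargfid, basename))
    return ops

def rmdir_ordering_ok(ops):
    # Once a directory gfid has been RMDIR'd, no later op may refer
    # to it as a parent -- that would mean a child op was recorded
    # after its parent's removal.
    removed = set()
    for gfid, op, pargfid, basename in ops:
        if pargfid in removed:
            return False
        if op == "RMDIR":
            removed.add(gfid)
    return True
```

For the two records quoted above the ordering is fine (UNLINK at line 895 precedes RMDIR at line 996), which is why the suspicion falls on the UNLINK silently failing on the slave rather than on changelog ordering.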
Will look at the brick logs and update this BZ.
I restarted the geo-replication session and now the entries are purged from the slave. This looks like an entry-operation issue with gfid-based access (the errno gets swallowed, i.e. the operation is not successful but the errno is not given back to the caller). For entry creates we "fix" this by retrying the entire CHANGELOG (which is not foolproof anyway, but we do it as a best effort).
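The "retry the entire CHANGELOG" best-effort strategy mentioned above can be sketched like this. This is a simplified illustration of the strategy, not gsyncd's implementation; `apply_op` is a hypothetical callback that applies one entry operation and reports success:

```python
def replay_changelog(entries, apply_op, max_retries=3):
    # Replay every entry op; if any op fails, re-run the whole
    # changelog and rely on the ops being idempotent (EEXIST/ENOENT
    # treated as success). Best effort only: an op whose failure is
    # swallowed (no errno reported back, as in this bug) looks
    # successful and is never retried.
    for _ in range(max_retries):
        # Use a list so every op is applied each pass (no short-circuit).
        if all([apply_op(e) for e in entries]):
            return True
    return False
```

This masks transient failures, but when the slave swallows the errno entirely, the replay loop sees "success" and the entry (such as the UNLINK at line 895 here) stays unapplied, leaving the parent RMDIR stuck on ENOTEMPTY.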
*** Bug 994957 has been marked as a duplicate of this bug. ***
We expect this to be addressed in 3.0. Keeping this open in case a backport to 2.1.z is needed.
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days