Description of problem:
If many creations and deletions of files are executed in quick succession on the master mount point, geo-rep stops processing the changelogs accumulated in the .processing directory and consequently stops syncing files to the slave. The number of files in .processing kept increasing until the creation and deletion of files on the master was stopped. Even 2 hours after I/O on the master was stopped, the number of files in .processing did not come down, which means no changelogs were being processed by geo-rep and no files were being synced; the number of files on the slave also stayed constant. There were no "failed to sync" or "rsync error-code 23" messages in the geo-rep logs, i.e. there were no warning or error messages at all.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.18rhs-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce it.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Run this script on the master for about one hour:
   while :; do rm -rf /mnt/master/* ; ./crefi.py -n 100 --multi -d 10 -b 10 --random --max=500K --min=10 /mnt/master/ ; rm -rf /mnt/master/* ; ./crefi.py -n 100 --multi -d 10 -b 10 --random --max=500K --min=10 /mnt/master/ ; done

Actual results:
geo-rep stops syncing to the slave.

Expected results:
It should never stop syncing to the slave.

Additional info:
Machine info:
master: 10.70.43.86,88,130,186
slave: 10.70.43.23,41,45, 10.70.42.216
strace showed the slave gsyncd stuck removing a directory that is not empty (we do not recursively delete the entries inside the directory, as the CHANGELOG should already have the order preserved). gsyncd on the slave is stuck at:

[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)
[pid 27358] lgetxattr(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91", "glusterfs.gfid.string", "dab97311-5b80-44c8-9f23-f4cfc6deb170", 37) = 36
[pid 27358] unlink(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 EISDIR (Is a directory)
[pid 27358] rmdir(".gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91") = -1 ENOTEMPTY (Directory not empty)
[pid 27358] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid 27332] <... select resumed> ) = 0 (Timeout)
[pid 27332] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 27358] <... select resumed> ) = 0 (Timeout)

As we can see, it fails with ENOTEMPTY for '.gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91'.
Listing the entries on the slave for the above pargfid/basename shows:

[root@thunderball imaster]# ls .gfid/99e3e52c-380a-44e4-817a-7753235cdf06/level91
5208ba31~~78YQJ4TGHZ  5208ba33~~81VT1C0AKX  5208ba34~~19OE20K6WM  5208ba35~~8I9CJ6L2NE  5208ba36~~KIA71OVBCB
5208ba31~~N57NLJ2N97  5208ba33~~BBP9IDENQT  5208ba34~~40HVOMY0PT  5208ba35~~JRQJSL8PZL  5208ba36~~MFPVHLZ46F
5208ba32~~0Y289V0SN5  5208ba33~~FQVYWH24JS  5208ba34~~6HSRE4JUX7  5208ba35~~KH2Q1YH260  5208ba36~~O09F2ADM5W
5208ba32~~4F6BD26C4A  5208ba33~~GIHT8A5NB8  5208ba34~~9JMNCM30R6  5208ba35~~PQHK13EGEJ  5208ba36~~PSPKSKZTKO
5208ba32~~BGULLGFHTD  5208ba33~~I2JADKAR63  5208ba34~~F9XZQUM8I1  5208ba35~~XKWYEJYN5X  5208ba36~~TXZRDWLLYG
5208ba32~~IE7CAOX1BX  5208ba33~~L7B07556KD  5208ba34~~GTFTN4SE10  5208ba36~~412NZSA5BM  5208ba36~~XSUKX9QG96
5208ba32~~NWT4RK4P29  5208ba33~~LODHOWA7CI  5208ba34~~X1M4HNDCV2  5208ba36~~4RTD6EDIMP  5208ba37~~1DX62FW5WJ
5208ba32~~Y51Q0F72TP  5208ba33~~UAOUZUEQQA  5208ba35~~42PUU0WUV6  5208ba36~~834XSUJCSR  5208ba37~~2U2U68GIT9
5208ba33~~0HEKVJSHDP  5208ba33~~UE3LYEAJR1  5208ba35~~4SYG0V8UXZ  5208ba36~~DSGENQ0H0U  5208ba37~~AGIHOHORAL
5208ba33~~35XLBDJ8UX  5208ba33~~WRCNC15TCQ  5208ba35~~7TWJYMWOOT  5208ba36~~I34Q61H5DE  5208ba37~~TPXHT1YEIP

The first entry '5208ba31~~78YQJ4TGHZ' is still present on the slave (i.e. it has not yet been purged). Observing the CHANGELOG, the UNLINK of this entry comes before the rmdir() of its parent, as seen from the line numbers (which is expected):

[root@shaktiman .processing]# grep -n 'dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ' CHANGELOG.1376305335
895:E 252f0791-0acf-4e96-94f1-21dd9466f107 UNLINK dab97311-5b80-44c8-9f23-f4cfc6deb170%2F5208ba31~~78YQJ4TGHZ
[root@shaktiman .processing]# grep -n '99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91' CHANGELOG.1376305335
996:E dab97311-5b80-44c8-9f23-f4cfc6deb170 RMDIR 99e3e52c-380a-44e4-817a-7753235cdf06%2Flevel91

So, for some reason the purge of gfid '252f0791-0acf-4e96-94f1-21dd9466f107' failed (along with a bunch of other unlink failures).
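The ordering property being checked above (every UNLINK of a child must appear before the RMDIR of its parent directory) can be sketched as a small checker. The record format here is an assumption based only on the decoded grep output in this bug ("E <gfid> <op> <pargfid>%2F<basename>"), not the changelog spec, and both function names are hypothetical:

```python
from urllib.parse import unquote

def parse_entry_ops(lines):
    # Parse E-type (entry) records into (gfid, op, pargfid, basename)
    # tuples. "%2F" is the URL-escaped "/" separating pargfid and
    # basename, as seen in the grep output above.
    ops = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[0] == "E":
            _, gfid, op, pbname = parts
            pargfid, basename = unquote(pbname).split("/", 1)
            ops.append((gfid, op, pargfid, basename))
    return ops

def rmdir_ordering_ok(ops):
    # Once a directory gfid has been RMDIR'd, no later op may refer
    # to it as a parent -- that would mean a child op was recorded
    # after its parent's removal.
    removed = set()
    for gfid, op, pargfid, basename in ops:
        if pargfid in removed:
            return False
        if op == "RMDIR":
            removed.add(gfid)
    return True
```

For the two records quoted above the ordering is fine (UNLINK at line 895 precedes RMDIR at line 996), which is why the suspicion falls on the UNLINK silently failing on the slave rather than on changelog ordering.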
Will look at the brick logs and update this BZ.
I restarted the geo-replication session and now the entries are purged from the slave. This looks like an entry-operation issue with gfid-based access (the errno gets swallowed, i.e. the operation is not successful but the errno is not given back to the caller). For entry creates we "fix" this by retrying the entire CHANGELOG (which is not foolproof anyway, but we do it as a best effort).
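The "retry the entire CHANGELOG" best-effort strategy mentioned above can be sketched like this. This is a simplified illustration of the strategy, not gsyncd's implementation; `apply_op` is a hypothetical callback that applies one entry operation and reports success:

```python
def replay_changelog(entries, apply_op, max_retries=3):
    # Replay every entry op; if any op fails, re-run the whole
    # changelog and rely on the ops being idempotent (EEXIST/ENOENT
    # treated as success). Best effort only: an op whose failure is
    # swallowed (no errno reported back, as in this bug) looks
    # successful and is never retried.
    for _ in range(max_retries):
        # Use a list so every op is applied each pass (no short-circuit).
        if all([apply_op(e) for e in entries]):
            return True
    return False
```

This masks transient failures, but when the slave swallows the errno entirely, the replay loop sees "success" and the entry (such as the UNLINK at line 895 here) stays unapplied, leaving the parent RMDIR stuck on ENOTEMPTY.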
*** Bug 994957 has been marked as a duplicate of this bug. ***
We expect this to be addressed in 3.0. Keeping this open in case a backport to 2.1.z is needed.
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days