Description of problem:
geo-rep status goes to Faulty if the quota limit is reached on the slave side.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.34rhs-1.el6rhs.x86_64

How reproducible:
Happens every time.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Set a quota limit on the slave side.
3. Create more data on the master than the limit set on the slave.

Actual results:
geo-rep status becomes Faulty after the quota limit is reached.

Expected results:
geo-rep should handle such errors properly.

Additional info:
geo-rep master logs:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-16 14:22:50.90783] I [master(/bricks/brick3):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-16 14:22:50.94122] I [master(/bricks/brick3):816:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-10-16 14:22:50.251959] I [master(/bricks/brick3):335:crawlwrap] _GMaster: primary master with volume id 19e43408-4960-4183-8852-bdfd91062b9b ...
[2013-10-16 14:22:50.258965] I [master(/bricks/brick3):345:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-10-16 14:22:53.35978] I [master(/bricks/brick3):903:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync/XSYNC-CHANGELOG.1381913570
[2013-10-16 14:22:54.883975] E [resource(/bricks/brick3):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-tIO1s6/gsycnd-ssh-%r@%h:%p --compress root.43.135:/proc/7181/cwd" returned with 12, saying:
[2013-10-16 14:22:54.884723] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: close failed on "/proc/7181/cwd/": Disk quota exceeded (122)
[2013-10-16 14:22:54.884990] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-10-16 14:22:54.885329] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (2496 bytes received so far) [sender]
[2013-10-16 14:22:54.885576] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-10-16 14:22:54.886130] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
The state is unrecoverable; geo-rep recovers only when the quota limit is increased on the slave side.
To reduce the space used, I deleted a few files from the master. That did not change the geo-rep status. Later I increased the quota limit on the slave; the geo-rep status became Stable, but the deleted files were not removed on the slave, as shown by status detail:

MASTER: master  SLAVE: ssh://10.70.43.76::slave

NODE                     HEALTH   UPTIME     FILES SYNCD   FILES PENDING   BYTES PENDING   DELETES PENDING
----------------------------------------------------------------------------------------------------------
redcell.blr.redhat.com   Stable   00:17:51   3227          0               0Bytes          0
redlake.blr.redhat.com   Stable   00:17:47   3169          0               0Bytes          267
redwood.blr.redhat.com   Stable   23:27:11   0             0               0Bytes          0
redcloak.blr.redhat.com  Stable   23:27:11   0             0               0Bytes          0

This is understandable: after geo-rep restarts it does an xsync crawl, and the xsync crawl is not able to pick up the deletes.

The second problem: of the two active gsyncd processes, one had stopped crawling and was not syncing any newly created files. An strace of that gsyncd process showed it stuck in a futex wait:

[root@redlake ~]# strace -s 500 -p 10629
Process 10629 attached - interrupt to quit
futex(0x1e66940, FUTEX_WAIT_PRIVATE, 0, NULL
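The deletes-pending behavior follows directly from how an xsync (filesystem) crawl works: it can only emit records for entries that still exist on disk, so a file deleted on the master before the crawl leaves nothing for the crawl to find. A minimal illustrative sketch (this is not gsyncd's actual code; `xsync_crawl` is a hypothetical helper):

```python
import os

def xsync_crawl(root):
    """Walk the brick the way an xsync crawl does: emit one record per
    entry that currently exists. A file deleted before the crawl leaves
    no trace on disk, so no DELETE record can ever be produced."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            records.append(("ENTRY", os.path.join(dirpath, name)))
    return records
```

Changelog mode, by contrast, journals each operation (including unlinks) as it happens, which is why deletes only propagate while the changelog crawl is active.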
Per discussion with Shanks and Saurabh, moving this out of u1
Removed it from the u1 list by mistake, adding it back to u1
Bug 1025476 was the cause of this; now that it is fixed, can we check whether this issue is fixed as well?
I tried it on build glusterfs-3.4.0.43rhs-1; it still happens. The geo-rep logs show:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-13 17:53:41.726465] I [master(/bricks/brick1):1089:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1384345420
[2013-11-13 17:53:42.613092] E [resource(/bricks/brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-Gt6Fhm/gsycnd-ssh-%r@%h:%p --compress root.43.174:/proc/13900/cwd" returned with 12, saying:
[2013-11-13 17:53:42.613451] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: close failed on "/proc/13900/cwd/": Disk quota exceeded (122)
[2013-11-13 17:53:42.613559] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-11-13 17:53:42.613702] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (240 bytes received so far) [sender]
[2013-11-13 17:53:42.613802] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-11-13 17:53:42.614023] I [syncdutils(/bricks/brick1):159:finalize] <top>: exiting.
[2013-11-13 17:53:52.690727] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-13 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Since all changes made until the quota limit is hit are synced to the slave, I would mark this bug as lower priority. As of now, geo-replication has two statuses: 'Stable' (everything is normal; what is written on the master side is synced to the slave) and 'Faulty' (we cannot sync changes from the master side to the slave side). Here geo-replication rightly reports Faulty, because it cannot sync any more data to the slave. Note that if quota had been set on the slave before 'creating' the geo-replication session, the user would have been warned that there was too little space on the slave. In other words, I want to know whether this is a bug at all.

---

In comment #8 I wrongly attributed the problem to a changelog crash, which is why this bug was marked ON_QA in the first place.
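One way the RFE could be approached: instead of treating every nonzero rsync exit as Faulty, inspect the rsync stderr for a quota/space condition and keep retrying with backoff while reporting a distinct state. A hedged sketch, not gsyncd's implementation; `classify_rsync_failure`, the return values, and the marker strings are assumptions (the markers match the messages in the logs above):

```python
# Hypothetical helper sketching how a sync loop could distinguish a
# full slave (retriable) from a genuine protocol failure (faulty).
RETRIABLE_MARKERS = (
    "Disk quota exceeded",      # EDQUOT, the "(122)" seen in the logs above
    "No space left on device",  # ENOSPC, the closely related condition
)

def classify_rsync_failure(returncode, stderr_text):
    """Map an rsync result to a coarse action for the sync loop."""
    if returncode == 0:
        return "ok"
    if any(marker in stderr_text for marker in RETRIABLE_MARKERS):
        return "retry-later"  # slave is full: back off, don't mark Faulty
    return "faulty"           # anything else: preserve current behavior
```

With this split, a quota-exceeded slave would surface as a "slave out of space" condition rather than the generic Faulty state, and syncing could resume automatically once the limit is raised.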
In a discussion with Naga, changing the summary to RFE.
Closing, as this is expected behavior when quota is set on the slave.