Bug 1019679

Summary: [RFE] Dist-geo-rep : geo-rep status goes to faulty if the quota limit is reached on the slave side.
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Status: CLOSED NOTABUG
Keywords: FutureFeature
Type: Bug
Doc Type: Enhancement
Reporter: Vijaykumar Koppad <vkoppad>
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
QA Contact: amainkar
CC: aavati, avishwan, csaba, david.macdonald, grajaiya, nlevinki, psriniva, rwheeler, vagarwal, vshankar
Last Closed: 2014-12-24 08:51:47 UTC

Description Vijaykumar Koppad 2013-10-16 08:56:12 UTC
Description of problem: geo-rep status goes to faulty if the quota limit is reached on the slave side. 


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.34rhs-1.el6rhs.x86_64

How reproducible: happens every time 


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Set a quota limit on the slave volume.
3. Create more data on the master than the limit set on the slave (a sketch of the commands is shown below).
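
A minimal command sketch of these steps, assuming volume names "master" and "slave", the slave host 10.70.43.76 seen in the logs below, and an illustrative 1GB limit and mount point (exact syntax may vary between builds):

# On the master cluster: create and start the geo-rep session
gluster volume geo-replication master 10.70.43.76::slave create push-pem
gluster volume geo-replication master 10.70.43.76::slave start

# On the slave cluster: enable quota and set a limit on the slave volume
gluster volume quota slave enable
gluster volume quota slave limit-usage / 1GB

# On a client: mount the master volume and write more data than the slave limit
mount -t glusterfs <master-host>:/master /mnt/master
dd if=/dev/urandom of=/mnt/master/bigfile bs=1M count=2048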

Actual results: geo-rep status becomes faulty after the quota limit is reached 


Expected results: Geo-rep should handle such errors gracefully instead of going faulty.
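
For illustration only (this is not gsyncd's actual code), a hedged sketch of how a sync wrapper could recognise the slave-side quota error in rsync's output and back off instead of going faulty; the destination placeholders, file names and retry interval are hypothetical:

# Hypothetical sketch: treat "Disk quota exceeded" on the slave as a transient
# condition and retry later, instead of letting the worker die and turn faulty.
rsync_stderr=$(mktemp)
rsync -avR0 --inplace --files-from=- . <slave>:<dest> 2>"$rsync_stderr"
rc=$?
if [ "$rc" -ne 0 ] && grep -q "Disk quota exceeded" "$rsync_stderr"; then
    echo "slave quota exceeded; retrying in 60s instead of marking the session faulty"
    sleep 60
elif [ "$rc" -ne 0 ]; then
    exit "$rc"    # other rsync failures still surface as faulty
fi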


Additional info:

Comment 1 Vijaykumar Koppad 2013-10-16 08:58:02 UTC
Geo-rep master logs:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-16 14:22:50.90783] I [master(/bricks/brick3):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-16 14:22:50.94122] I [master(/bricks/brick3):816:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-10-16 14:22:50.251959] I [master(/bricks/brick3):335:crawlwrap] _GMaster: primary master with volume id 19e43408-4960-4183-8852-bdfd91062b9b ...
[2013-10-16 14:22:50.258965] I [master(/bricks/brick3):345:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-10-16 14:22:53.35978] I [master(/bricks/brick3):903:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync/XSYNC-CHANGELOG.1381913570
[2013-10-16 14:22:54.883975] E [resource(/bricks/brick3):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-tIO1s6/gsycnd-ssh-%r@%h:%p --compress root.43.135:/proc/7181/cwd" returned with 12, saying:
[2013-10-16 14:22:54.884723] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: close failed on "/proc/7181/cwd/": Disk quota exceeded (122)
[2013-10-16 14:22:54.884990] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-10-16 14:22:54.885329] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (2496 bytes received so far) [sender]
[2013-10-16 14:22:54.885576] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-10-16 14:22:54.886130] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 2 Vijaykumar Koppad 2013-10-16 09:04:09 UTC
And the state is Unrecoverable.

Comment 3 Vijaykumar Koppad 2013-10-16 09:11:08 UTC
It is recoverable only when the quota limit is increased on the slave side.
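
For reference, a sketch of that recovery step, assuming the slave volume is named "slave" and an illustrative new limit of 5GB:

gluster volume quota slave limit-usage / 5GB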

Comment 5 Vijaykumar Koppad 2013-10-16 10:04:59 UTC
In order to free up space, I deleted a few files from the master. That did not help the geo-rep status. Later I increased the quota limit on the slave, and the geo-rep status became Stable, but the deleted files were not removed on the slave, as shown in the status detail output.

                                  MASTER: master  SLAVE: ssh://10.70.43.76::slave

NODE                         HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
--------------------------------------------------------------------------------------------------------------------
redcell.blr.redhat.com       Stable    00:17:51    3227           0                0Bytes           0
redlake.blr.redhat.com       Stable    00:17:47    3169           0                0Bytes           267
redwood.blr.redhat.com       Stable    23:27:11    0              0                0Bytes           0
redcloak.blr.redhat.com      Stable    23:27:11    0              0                0Bytes           0


This is understandable: after geo-rep restarts it does an xsync crawl, and the xsync crawl cannot pick up deletes. The second problem was that, of the two active gsyncd workers, one had stopped crawling and was not syncing any newly created files. An strace of that gsyncd process shows it stuck in a futex wait (FUTEX_WAIT_PRIVATE):

[root@redlake ~]# strace -s 500 -p 10629
Process 10629 attached - interrupt to quit
futex(0x1e66940, FUTEX_WAIT_PRIVATE, 0, NULL

Comment 6 Vivek Agarwal 2013-10-17 06:58:37 UTC
Per discussion with Shanks and Saurabh, moving this out of u1

Comment 7 Vivek Agarwal 2013-10-17 07:44:21 UTC
Removed it from the u1 list by mistake, adding it back to u1

Comment 8 Amar Tumballi 2013-11-11 10:08:04 UTC
Bug 1025476 was the cause of this, and considering it is fixed now, can we check whether this is fixed as well?

Comment 9 Vijaykumar Koppad 2013-11-13 12:30:41 UTC
I tried it on build glusterfs-3.4.0.43rhs-1 and it still happens. The geo-rep logs contain these entries:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-11-13 17:53:41.726465] I [master(/bricks/brick1):1089:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1384345420
[2013-11-13 17:53:42.613092] E [resource(/bricks/brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-Gt6Fhm/gsycnd-ssh-%r@%h:%p --compress root.43.174:/proc/13900/cwd" returned with 12, saying:
[2013-11-13 17:53:42.613451] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: close failed on "/proc/13900/cwd/": Disk quota exceeded (122)
[2013-11-13 17:53:42.613559] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-11-13 17:53:42.613702] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (240 bytes received so far) [sender]
[2013-11-13 17:53:42.613802] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-11-13 17:53:42.614023] I [syncdutils(/bricks/brick1):159:finalize] <top>: exiting.
[2013-11-13 17:53:52.690727] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-13 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 11 Amar Tumballi 2013-11-13 12:47:51 UTC
Considering that the changes made until the quota limit is hit are synced to the slave, I would mark this bug as lower priority.

As of now, geo-replication has two statuses: 'Stable' (everything is normal; what is written on the master side is synced to the slave) and 'faulty' (changes on the master side cannot be synced to the slave side).

Here, geo-replication rightly reports faulty (because it cannot sync any more data to the slave). Please note that if quota had been set on the slave before 'creating' geo-replication, the user would have been warned beforehand that there was insufficient space on the slave.
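
For example (assuming the volume and host names used elsewhere in this report), the session state and the slave-side quota usage can be checked with:

gluster volume geo-replication master 10.70.43.76::slave status detail
gluster volume quota slave list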

In other words, I want to know whether this is a bug at all.

---
In comment #8, I incorrectly attributed the problem to a changelog crash, which is why this bug was marked ON_QA in the first place.
---

Comment 13 Pavithra 2013-11-19 10:27:22 UTC
In a discussion with Naga, changing the summary to RFE.

Comment 17 Aravinda VK 2014-12-24 08:51:47 UTC
Closing, as this is expected behavior when a quota limit is set on the slave.