Bug 1019679 - [RFE] Dist-geo-rep : geo-rep status goes to faulty if the quota limit is reached on the slave side.
Status: CLOSED NOTABUG
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: high    Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Bug Updates Notification Mailing List
QA Contact: amainkar
Keywords: FutureFeature
Depends On:
Blocks:
 
Reported: 2013-10-16 04:56 EDT by Vijaykumar Koppad
Modified: 2015-03-17 10:43 EDT
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-24 03:51:47 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vijaykumar Koppad 2013-10-16 04:56:12 EDT
Description of problem: geo-rep status goes to faulty if the quota limit is reached on the slave side. 


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.34rhs-1.el6rhs.x86_64

How reproducible: happens every time 


Steps to Reproduce:
1. Create and start a geo-rep session between the master and the slave.
2. Set a quota limit on the slave volume.
3. Create more data on the master than the limit set on the slave (a command sketch follows below).
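A rough command sketch of these steps. The volume names (master, slave) and the slave host (10.70.43.76) match the logs in this bug; the mount point /mnt/master and the 1GB limit are made up for illustration:

    # create and start the geo-rep session from the master side
    gluster volume geo-replication master 10.70.43.76::slave create push-pem
    gluster volume geo-replication master 10.70.43.76::slave start

    # on the slave side, enable quota and set a small limit on the volume root
    gluster volume quota slave enable
    gluster volume quota slave limit-usage / 1GB

    # on a master mount point, write more data than the slave limit allows
    dd if=/dev/urandom of=/mnt/master/bigfile bs=1M count=2048

    # the session should go faulty once the slave quota is exhausted
    gluster volume geo-replication master 10.70.43.76::slave status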

Actual results: geo-rep status becomes faulty after the quota limit is reached 


Expected results: Geo-rep should handle such errors gracefully instead of going faulty.


Additional info:
Comment 1 Vijaykumar Koppad 2013-10-16 04:58:02 EDT
Geo-rep master logs:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-16 14:22:50.90783] I [master(/bricks/brick3):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-16 14:22:50.94122] I [master(/bricks/brick3):816:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-10-16 14:22:50.251959] I [master(/bricks/brick3):335:crawlwrap] _GMaster: primary master with volume id 19e43408-4960-4183-8852-bdfd91062b9b ...
[2013-10-16 14:22:50.258965] I [master(/bricks/brick3):345:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-10-16 14:22:53.35978] I [master(/bricks/brick3):903:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync/XSYNC-CHANGELOG.1381913570
[2013-10-16 14:22:54.883975] E [resource(/bricks/brick3):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-tIO1s6/gsycnd-ssh-%r@%h:%p --compress root@10.70.43.135:/proc/7181/cwd" returned with 12, saying:
[2013-10-16 14:22:54.884723] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: close failed on "/proc/7181/cwd/": Disk quota exceeded (122)
[2013-10-16 14:22:54.884990] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-10-16 14:22:54.885329] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (2496 bytes received so far) [sender]
[2013-10-16 14:22:54.885576] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-10-16 14:22:54.886130] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Comment 2 Vijaykumar Koppad 2013-10-16 05:04:09 EDT
And the state is unrecoverable.
Comment 3 Vijaykumar Koppad 2013-10-16 05:11:08 EDT
It is recoverable only when the quota limit is increased on the slave side.
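That is, recovery amounts to raising the limit on the slave volume, along the lines of (the 10GB figure is only illustrative):

    gluster volume quota slave limit-usage / 10GB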
Comment 5 Vijaykumar Koppad 2013-10-16 06:04:59 EDT
To reduce the space used, I deleted a few files from the master. It didn't help the geo-rep status. Later I increased the quota limit on the slave; the geo-rep status became Stable, but the deletes were not propagated to the slave, as shown in the status detail output.
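The table below is that status detail output, presumably from an invocation along the lines of:

    gluster volume geo-replication master 10.70.43.76::slave status detail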

                                  MASTER: master  SLAVE: ssh://10.70.43.76::slave

NODE                         HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
--------------------------------------------------------------------------------------------------------------------
redcell.blr.redhat.com       Stable    00:17:51    3227           0                0Bytes           0
redlake.blr.redhat.com       Stable    00:17:47    3169           0                0Bytes           267
redwood.blr.redhat.com       Stable    23:27:11    0              0                0Bytes           0
redcloak.blr.redhat.com      Stable    23:27:11    0              0                0Bytes           0


This is understandable: after geo-rep restarts, it does an xsync crawl, and the xsync crawl cannot pick up deletes. The second problem was that, of the two active gsyncd processes, one had stopped crawling and was not syncing any newly created files. An strace of that gsyncd process showed it stuck in a FUTEX_WAIT_PRIVATE call:

[root@redlake ~]# strace -s 500 -p 10629
Process 10629 attached - interrupt to quit
futex(0x1e66940, FUTEX_WAIT_PRIVATE, 0, NULL
Comment 6 Vivek Agarwal 2013-10-17 02:58:37 EDT
Per discussion with Shanks and Saurabh, moving this out of u1
Comment 7 Vivek Agarwal 2013-10-17 03:44:21 EDT
Removed it from the u1 list by mistake, adding it back to u1
Comment 8 Amar Tumballi 2013-11-11 05:08:04 EST
Bug 1025476 was the cause of this, and considering it's fixed now, can we check whether this is fixed?
Comment 9 Vijaykumar Koppad 2013-11-13 07:30:41 EST
I tried it on build glusterfs-3.4.0.43rhs-1, and it still happens. The geo-rep logs have these entries:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-11-13 17:53:41.726465] I [master(/bricks/brick1):1089:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1384345420
[2013-11-13 17:53:42.613092] E [resource(/bricks/brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-Gt6Fhm/gsycnd-ssh-%r@%h:%p --compress root@10.70.43.174:/proc/13900/cwd" returned with 12, saying:
[2013-11-13 17:53:42.613451] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: close failed on "/proc/13900/cwd/": Disk quota exceeded (122)
[2013-11-13 17:53:42.613559] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-11-13 17:53:42.613702] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (240 bytes received so far) [sender]
[2013-11-13 17:53:42.613802] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-11-13 17:53:42.614023] I [syncdutils(/bricks/brick1):159:finalize] <top>: exiting.
[2013-11-13 17:53:52.690727] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-13 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Comment 11 Amar Tumballi 2013-11-13 07:47:51 EST
Considering that the changes made before the quota limit is hit are synced to the slave, I would mark this bug as lower priority.

As of now, geo-replication has two statuses: 'Stable' (i.e., everything is normal and whatever is written on the master side is synced to the slave) and 'faulty' (meaning we cannot sync changes from the master side to the slave side).
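For reference, these are the values reported by the status command (session naming as in the sketch above):

    gluster volume geo-replication master 10.70.43.76::slave status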

Here, geo-replication rightly says it is faulty (because it cannot sync any more data to the slave). Note that if the quota had been set on the slave before 'creating' the geo-replication session, the user would have been warned up front that there was insufficient space on the slave.

In other words, I want to know whether this is a bug at all.

---
In comment #8, I incorrectly analyzed the problem as being linked to a changelog crash, which is why this bug was marked ON_QA in the first place.
---
Comment 13 Pavithra 2013-11-19 05:27:22 EST
In a discussion with Naga, changing the summary to RFE.
Comment 17 Aravinda VK 2014-12-24 03:51:47 EST
Closing, as this is expected behavior when a quota is set on the slave.
