Bug 1019679
Summary: | [RFE] Dist-geo-rep: geo-rep status goes to Faulty if the quota limit is reached on the slave side. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijaykumar Koppad <vkoppad> |
Component: | geo-replication | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
Status: | CLOSED NOTABUG | QA Contact: | amainkar |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 2.1 | CC: | aavati, avishwan, csaba, david.macdonald, grajaiya, nlevinki, psriniva, rwheeler, vagarwal, vshankar |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-12-24 08:51:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Vijaykumar Koppad
2013-10-16 08:56:12 UTC
geo-rep master logs:

```
[2013-10-16 14:22:50.90783] I [master(/bricks/brick3):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-10-16 14:22:50.94122] I [master(/bricks/brick3):816:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-10-16 14:22:50.251959] I [master(/bricks/brick3):335:crawlwrap] _GMaster: primary master with volume id 19e43408-4960-4183-8852-bdfd91062b9b ...
[2013-10-16 14:22:50.258965] I [master(/bricks/brick3):345:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-10-16 14:22:53.35978] I [master(/bricks/brick3):903:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync/XSYNC-CHANGELOG.1381913570
[2013-10-16 14:22:54.883975] E [resource(/bricks/brick3):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-tIO1s6/gsycnd-ssh-%r@%h:%p --compress root.43.135:/proc/7181/cwd" returned with 12, saying:
[2013-10-16 14:22:54.884723] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: close failed on "/proc/7181/cwd/": Disk quota exceeded (122)
[2013-10-16 14:22:54.884990] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-10-16 14:22:54.885329] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (2496 bytes received so far) [sender]
[2013-10-16 14:22:54.885576] E [resource(/bricks/brick3):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-10-16 14:22:54.886130] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
```

In this state the session is unrecoverable; it recovers only when the quota limit is raised on the slave side. To free up space I deleted a few files from the master, but that did not change the geo-rep status. Once I increased the quota limit on the slave, the status became Stable, but the deleted files were never removed on the slave, as the status detail shows.
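The rsync failure above is what flips the session to Faulty: rsync exits with code 12 (protocol data stream error) after the receiver hits `Disk quota exceeded`. A minimal sketch of how such a failure could be told apart from a generic transport fault, by inspecting the exit code together with stderr. This is a hypothetical helper for illustration, not gsyncd's actual code:

```python
import re

# rsync exit codes seen in the log above (per rsync(1)):
#   11 - error in file I/O,  12 - error in rsync protocol data stream
QUOTA_PATTERN = re.compile(r"Disk quota exceeded")

def classify_rsync_failure(returncode, stderr_lines):
    """Map an rsync failure to a coarse category (hypothetical helper).

    The idea: a 'Disk quota exceeded' message from the receiver should be
    treated as a slave-side capacity problem, not a generic I/O or
    protocol fault, even though the top-level exit code is the same 12.
    """
    if returncode == 0:
        return "ok"
    if any(QUOTA_PATTERN.search(line) for line in stderr_lines):
        return "slave-quota-exceeded"
    if returncode in (11, 12):
        return "io-or-protocol-error"
    return "unknown"
```

Fed the stderr lines from the log above, this returns `"slave-quota-exceeded"` rather than a generic error, which is the distinction the bug report is asking the status output to surface.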
```
MASTER: master    SLAVE: ssh://10.70.43.76::slave

NODE                     HEALTH   UPTIME     FILES SYNCD   FILES PENDING   BYTES PENDING   DELETES PENDING
----------------------------------------------------------------------------------------------------------
redcell.blr.redhat.com   Stable   00:17:51   3227          0               0Bytes          0
redlake.blr.redhat.com   Stable   00:17:47   3169          0               0Bytes          267
redwood.blr.redhat.com   Stable   23:27:11   0             0               0Bytes          0
redcloak.blr.redhat.com  Stable   23:27:11   0             0               0Bytes          0
```

This is understandable: after geo-rep restarts, it does an xsync crawl, and the xsync crawl cannot pick up deletes.

The second problem was that, of the two active gsyncd processes, one had stopped crawling and was not syncing any newly created files. An strace of the gsyncd process showed it stuck in a FUTEX_WAIT_PRIVATE call:

```
[root@redlake ~]# strace -s 500 -p 10629
Process 10629 attached - interrupt to quit
futex(0x1e66940, FUTEX_WAIT_PRIVATE, 0, NULL
```

---

Per discussion with Shanks and Saurabh, moving this out of u1.

---

Removed it from the u1 list by mistake; adding it back to u1.

---

Bug 1025476 was the cause of this, and considering it is fixed now, can we check whether this is fixed as well?

---

I tried it on build glusterfs-3.4.0.43rhs-1; it still happens. The geo-rep log has these entries:

```
[2013-11-13 17:53:41.726465] I [master(/bricks/brick1):1089:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1384345420
[2013-11-13 17:53:42.613092] E [resource(/bricks/brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-Gt6Fhm/gsycnd-ssh-%r@%h:%p --compress root.43.174:/proc/13900/cwd" returned with 12, saying:
[2013-11-13 17:53:42.613451] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: close failed on "/proc/13900/cwd/": Disk quota exceeded (122)
[2013-11-13 17:53:42.613559] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in file IO (code 11) at receiver.c(730) [receiver=3.0.6]
[2013-11-13 17:53:42.613702] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (240 bytes received so far) [sender]
[2013-11-13 17:53:42.613802] E [resource(/bricks/brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-11-13 17:53:42.614023] I [syncdutils(/bricks/brick1):159:finalize] <top>: exiting.
[2013-11-13 17:53:52.690727] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-13 1
```

---

Considering that all changes made before the quota limit is hit are synced to the slave, I would mark this bug as lower priority. As of now, geo-replication has two states: 'Stable' (everything is normal; what is written on the master side is synced to the slave) and 'faulty' (changes on the master side cannot be synced to the slave). Here, geo-replication rightly reports faulty, because it cannot sync any more data to the slave. Note that if quota was set on the slave before the geo-replication session was created, the user would have been warned that there was insufficient space on the slave. In other words, I want to know whether this is a bug at all.

---

In comment #8, I falsely analyzed the problem as linked to a changelog crash, which is why this bug was marked ON_QA in the first place.
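The "xsync crawl cannot pick up deletes" behavior reported above follows from the design: a crawl only visits what currently exists on the master, so a file that no longer exists is never visited and therefore never unlinked on the slave, whereas a changelog records the UNLINK itself. A toy model (plain dicts standing in for bricks; these functions are illustrative only and do not exist in gsyncd):

```python
def xsync_crawl_sync(master, slave):
    """One xsync-style crawl: walk what currently EXISTS on the master
    and copy it to the slave. Files already deleted on the master are
    never visited, so stale copies linger on the slave."""
    for path, data in master.items():
        slave[path] = data
    return slave

def changelog_sync(ops, master, slave):
    """Changelog-style replay: apply each recorded operation, including
    UNLINKs, so deletes do propagate to the slave."""
    for op, path in ops:
        if op == "CREATE":
            slave[path] = master.get(path)
        elif op == "UNLINK":
            slave.pop(path, None)
    return slave
```

This is why the status detail above keeps showing 267 pending deletes after the restart: the session came back in xsync (crawl) mode, and the crawl has no record of the unlinks that happened while it was Faulty.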
---

In a discussion with Naga, changing the summary to an RFE. Closing, as this is expected behavior when a quota is set on the slave.
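The closing rationale relies on the warning described earlier: if quota is already set on the slave at session-creation time, the user is told there is too little space. A hedged sketch of that kind of pre-flight headroom check, with the caveat that a plain `statvfs()` only sees filesystem free space, not Gluster quota limits, and that this helper is hypothetical rather than the actual gsyncd implementation:

```python
import os

def slave_headroom_ok(mount_point, required_bytes):
    """Return True if the slave mount point has at least required_bytes
    of free space available to unprivileged writers.

    Hypothetical pre-flight check, not gsyncd code: f_bavail counts
    blocks available to non-root users, and f_frsize is the fragment
    size those counts are expressed in.
    """
    st = os.statvfs(mount_point)
    return st.f_bavail * st.f_frsize >= required_bytes
```

A check like this only helps at create time; once the session is running, the quota can still fill up mid-sync, which is exactly the Faulty scenario this bug describes.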