Bug 997851 - Dist-geo-rep: few files are not synced to slave after geo-rep restart
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
Whiteboard: consistency
Keywords: ZStream
Depends On:
Blocks:
Reported: 2013-08-16 05:50 EDT by M S Vishwanath Bhat
Modified: 2016-05-31 21:57 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 03:49:01 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description M S Vishwanath Bhat 2013-08-16 05:50:36 EDT
Description of problem:
I started creating a few text files on the master and restarted the geo-rep session in the midst of the file creation. After the sync appeared complete (I did not use a checkpoint; this was after nearly 2 days), a few files were still not synced to the slave. The number of files (including directories) on the slave volume is lower than on the master, and the arequal-checksums do not match, yet geo-rep status detail was showing 0 files/bytes/deletes pending.
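The pending counts above were read from the geo-rep status output; a sketch of that command, assuming the master volume is named master and the slave volume slave on host falcon (names inferred from the mount points and logs below):

gluster volume geo-replication master falcon::slave status detail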

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.20rhs-1.el6rhs.x86_64

How reproducible:
I have hit this a few times, but not every time; it is not 100% reproducible.

Steps to Reproduce:
1. Create a geo-replication session between a 2x2 distributed-replicate master volume and a slave volume (a command sketch follows these steps).
2. Start creating files on the master mount using crefi.
3. While the file creation is in progress, restart the geo-rep session (stop and then start it).
4. After the file creation is complete, stop the session (do not wait for the sync to complete).
5. Start creating more files, again using crefi.
6. After this data creation is complete, start the geo-rep session.
7. Wait for geo-rep to sync all the files.
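A minimal command sketch of the steps above, assuming the master volume is named master, the slave volume is named slave on host falcon (as in the logs below), and files are created on the master mount at /master; the crefi flags shown are illustrative assumptions, not the exact invocation used:

# create and start the session (RHS 2.1 distributed geo-rep syntax)
gluster volume geo-replication master falcon::slave create push-pem
gluster volume geo-replication master falcon::slave start

# create files on the master mount (crefi flags are an assumption)
crefi -n 1000 --multi -b 10 -d 10 --min=500 --max=10K --random --fop=create /master/

# restart the session while the file creation is still running
gluster volume geo-replication master falcon::slave stop
gluster volume geo-replication master falcon::slave start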

Actual results:
After about 2 days, a few files were still missing from the slave.

[root@spacex ~]# /opt/qa/tools/arequal-checksum /master/

Entry counts
Regular files   : 10019
Directories     : 203
Symbolic links  : 0
Other           : 0
Total           : 10222

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 3b25fabd0bca7f9b7b34d0f7562e697c
Directories     : 73521a636f275662
Symbolic links  : 0
Other           : 0
Total           : 3343302932c34085




[root@spacex ~]# /opt/qa/tools/arequal-checksum /slave/

Entry counts
Regular files   : 10015
Directories     : 203
Symbolic links  : 0
Other           : 0
Total           : 10218

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 6f838599ab33483a5cfaddd8efda5f96
Directories     : 76231b7475382c1d
Symbolic links  : 0
Other           : 0
Total           : 455a433531d13bb1
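In other words, the slave is short 4 regular files (10015 vs 10019 on the master), which accounts for the 4-entry difference in the totals (10218 vs 10222); the directory counts match, so only regular files failed to sync.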


Expected results:
All the files should get synced, and the arequal-checksums of the master and slave volumes should be the same.

Additional info:

These were the only error messages seen in the geo-replication log files.

[2013-08-14 18:47:43.370575] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-08-14 18:50:05.485461] I [master(/rhs/bricks/brick0):335:crawlwrap] _GMaster: primary master with volume id 909d2220-7c8b-486d-8aeb-d3190bd9e526 ...
[2013-08-14 18:50:05.496843] I [master(/rhs/bricks/brick0):345:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-08-14 18:53:18.593717] E [syncdutils(/rhs/bricks/brick0):189:log_raise_exception] <top>: connection to peer is broken
[2013-08-14 18:53:18.598598] E [resource(/rhs/bricks/brick0):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-268YVh/gsycnd-ssh-%r@%h:%p root@falcon /nonexistent/gsyncd --session-owner 909d2220-7c8b-486d-8aeb-d3190bd9e526 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2013-08-14 18:53:18.599035] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955010] I [socket.c:3487:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-08-14 18:53:18.599292] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955066] I [socket.c:3502:socket_init] 0-glusterfs: using system polling thread
[2013-08-14 18:53:18.599530] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225446] I [cli-rpc-ops.c:5461:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2013-08-14 18:53:18.599793] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225780] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2013-08-14 18:53:18.600095] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> Killed by signal 15.
[2013-08-14 18:53:18.600682] I [syncdutils(/rhs/bricks/brick0):158:finalize] <top>: exiting.
[2013-08-14 18:53:18.603801] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-14 18:59:50.871346] I [monitor(monitor):237:distribute] <top>: slave bricks: [{'host': 'falcon', 'dir': '/rhs/bricks/brick0'}, {'host': 'hornet', 'dir': '/rhs/bricks/brick1'}, {'host': 'interceptor', 'dir': '/rhs/bricks/brick2'}, {'host': 'lightning', 'dir': '/rhs/bricks/brick3'}]
[2013-08-14 18:59:50.872292] I [monitor(monitor):256:distribute] <top>: worker specs: [('/rhs/bricks/brick0', 'ssh://root@falcon:gluster://localhost:slave')]
[2013-08-14 18:59:50.873248] I [monitor(monitor):81:set_state] Monitor: new state: Initializing...
[2013-08-14 18:59:50.876266] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------



I have taken sosreports from all the nodes and have also copied the geo-rep working dir.
Comment 2 M S Vishwanath Bhat 2013-08-16 05:59:06 EDT
I forgot to mention that it was an NFS client that was pushing the data.
Comment 5 Aravinda VK 2015-11-25 03:49:01 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is found again.
