Bug 997851

Summary: Dist-geo-rep: few files are not synced to slave after geo-rep restart
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED EOL
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: medium
Docs Contact:
Priority: high
Version: 2.1
CC: avishwan, chrisw, csaba, mzywusko, rhs-bugs, vagarwal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: consistency
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 08:49:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description M S Vishwanath Bhat 2013-08-16 09:50:36 UTC
Description of problem:
I started creating a few text files on the master and restarted the geo-rep session in the midst of the file creation. After the sync had completed (I did not use a checkpoint; I waited nearly 2 days), a few files were still not synced to the slave. The number of files (including directories) on the slave volume is less than on the master, and the arequal-checksums do not match. The geo-rep status detail was showing 0 files/bytes/deletes pending.
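The pending-counters check mentioned above was done with the geo-rep "status detail" command; a sketch of the invocation, with the master volume name assumed (the slave host and volume names are only placeholders for illustration):

# show files/bytes/deletes pending per brick for the session
gluster volume geo-replication master falcon::slave status detail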

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.20rhs-1.el6rhs.x86_64

How reproducible:
I have hit this a few times, but not every time; it is not 100% reproducible.

Steps to Reproduce:
1. Create a geo-replication session between a 2x2 distributed-replicate master volume and a 2x2 distributed-replicate slave volume (a rough command sketch is given after these steps).
2. Start creating files on a client mount of the master volume using crefi.
3. Restart the geo-rep session while the file creation is in progress (stop and then start the session).
4. After the file creation is complete, stop the session (do not wait for the sync to complete).
5. Start creating files again using crefi.
6. After the data creation is complete, start the geo-rep session.
7. Wait for geo-rep to sync all the files.
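A rough sketch of the commands behind the steps above. The master volume name, the slave host/volume (falcon::slave), the mount point and the crefi options are assumptions used for illustration, not values recorded in this bug:

# step 1: create and start the geo-rep session (RHS 2.1 distributed geo-rep syntax)
gluster volume geo-replication master falcon::slave create push-pem
gluster volume geo-replication master falcon::slave start

# steps 2 and 5: create files from a client mount of the master volume
crefi -n 10000 --multi -b 10 -d 10 -t text /mnt/master

# step 3: restart the session while the file creation is still running
gluster volume geo-replication master falcon::slave stop
gluster volume geo-replication master falcon::slave start

# step 4: stop the session without waiting for the sync to finish
gluster volume geo-replication master falcon::slave stop

# steps 6 and 7: start the session again and watch the pending counters drain
gluster volume geo-replication master falcon::slave start
gluster volume geo-replication master falcon::slave status detail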

Actual results:
After about 2 days, a few files were still missing from the slave.

[root@spacex ~]# /opt/qa/tools/arequal-checksum /master/

Entry counts
Regular files   : 10019
Directories     : 203
Symbolic links  : 0
Other           : 0
Total           : 10222

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 3b25fabd0bca7f9b7b34d0f7562e697c
Directories     : 73521a636f275662
Symbolic links  : 0
Other           : 0
Total           : 3343302932c34085




[root@spacex ~]# /opt/qa/tools/arequal-checksum /slave/

Entry counts
Regular files   : 10015
Directories     : 203
Symbolic links  : 0
Other           : 0
Total           : 10218

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 6f838599ab33483a5cfaddd8efda5f96
Directories     : 76231b7475382c1d
Symbolic links  : 0
Other           : 0
Total           : 455a433531d13bb1


Expected results:
All files should get synced, and the arequal-checksums of the master and slave volumes should be the same.

Additional info:

These were the only error messages seen in the geo-replication log files.

[2013-08-14 18:47:43.370575] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-08-14 18:50:05.485461] I [master(/rhs/bricks/brick0):335:crawlwrap] _GMaster: primary master with volume id 909d2220-7c8b-486d-8aeb-d3190bd9e526 ...
[2013-08-14 18:50:05.496843] I [master(/rhs/bricks/brick0):345:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-08-14 18:53:18.593717] E [syncdutils(/rhs/bricks/brick0):189:log_raise_exception] <top>: connection to peer is broken
[2013-08-14 18:53:18.598598] E [resource(/rhs/bricks/brick0):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-268YVh/gsycnd-ssh-%r@%h:%p root@falcon /nonexistent/gsyncd --session-owner 909d2220-7c8b-486d-8aeb-d3190bd9e526 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2013-08-14 18:53:18.599035] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955010] I [socket.c:3487:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-08-14 18:53:18.599292] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955066] I [socket.c:3502:socket_init] 0-glusterfs: using system polling thread
[2013-08-14 18:53:18.599530] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225446] I [cli-rpc-ops.c:5461:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2013-08-14 18:53:18.599793] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225780] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2013-08-14 18:53:18.600095] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> Killed by signal 15.
[2013-08-14 18:53:18.600682] I [syncdutils(/rhs/bricks/brick0):158:finalize] <top>: exiting.
[2013-08-14 18:53:18.603801] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-14 18:59:50.871346] I [monitor(monitor):237:distribute] <top>: slave bricks: [{'host': 'falcon', 'dir': '/rhs/bricks/brick0'}, {'host': 'hornet', 'dir': '/rhs/bricks/brick1'}, {'host': 'interceptor', 'dir': '/rhs/bricks/brick2'}, {'host': 'lightning', 'dir': '/rhs/bricks/brick3'}]
[2013-08-14 18:59:50.872292] I [monitor(monitor):256:distribute] <top>: worker specs: [('/rhs/bricks/brick0', 'ssh://root@falcon:gluster://localhost:slave')]
[2013-08-14 18:59:50.873248] I [monitor(monitor):81:set_state] Monitor: new state: Initializing...
[2013-08-14 18:59:50.876266] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------



I have taken sosreports from all the nodes and have also copied the geo-rep working dir.

Comment 2 M S Vishwanath Bhat 2013-08-16 09:59:06 UTC
I forgot to mention that it was the NFS client which was pushing the data.
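For completeness, a sketch of how the master volume would have been mounted over NFS for the data creation; the server name and mount point are placeholders, not values taken from this bug:

# mount the master volume via gluster NFS (v3) and push data through this mount
mount -t nfs -o vers=3 <master-server>:/master /mnt/master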

Comment 5 Aravinda VK 2015-11-25 08:49:01 UTC
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is found again.
