Bug 997851
| Summary: | Dist-geo-rep: few files are not synced to slave after geo-rep restart | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | M S Vishwanath Bhat <vbhat> |
| Component: | geo-replication | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED EOL | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.1 | CC: | avishwan, chrisw, csaba, mzywusko, rhs-bugs, vagarwal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | consistency | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-11-25 08:49:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I forgot to mention that it was the NFS client which was pushing the data.

Closing this bug since the RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
Description of problem:
I started creating a few text files on the master volume and restarted the geo-rep session in the middle of the file creation. Even after the sync had apparently completed (I did not use a checkpoint; I waited nearly 2 days), a few files were still not synced to the slave. The number of files (including directories) on the slave volume is lower than on the master and the arequal-checksums do not match, yet "status detail" was showing 0 files/bytes/deletes pending.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.20rhs-1.el6rhs.x86_64

How reproducible:
I hit this a few times, but not every time. Not 100% reproducible.

Steps to Reproduce:
1. Create a geo-replication session between a 2x2 distributed-replicated master and slave.
2. Start creating files on the master mount using crefi.
3. While the file creation is in progress, restart the geo-rep session: stop and then start the session.
4. After the creation is complete, stop the session (do not wait for the sync to complete).
5. Start creating files again using crefi.
6. After the data creation is complete, start the geo-rep session.
7. Wait for geo-rep to sync all the files.

(A command-level sketch of these steps and of the arequal comparison is appended after the log excerpt below.)

Actual results:
After about 2 days a few files were still missing from the slave.

    [root@spacex ~]# /opt/qa/tools/arequal-checksum /master/

    Entry counts
    Regular files   : 10019
    Directories     : 203
    Symbolic links  : 0
    Other           : 0
    Total           : 10222

    Metadata checksums
    Regular files   : 486e85
    Directories     : 24d74c
    Symbolic links  : 3e9
    Other           : 3e9

    Checksums
    Regular files   : 3b25fabd0bca7f9b7b34d0f7562e697c
    Directories     : 73521a636f275662
    Symbolic links  : 0
    Other           : 0
    Total           : 3343302932c34085

    [root@spacex ~]# /opt/qa/tools/arequal-checksum /slave/

    Entry counts
    Regular files   : 10015
    Directories     : 203
    Symbolic links  : 0
    Other           : 0
    Total           : 10218

    Metadata checksums
    Regular files   : 486e85
    Directories     : 24d74c
    Symbolic links  : 3e9
    Other           : 3e9

    Checksums
    Regular files   : 6f838599ab33483a5cfaddd8efda5f96
    Directories     : 76231b7475382c1d
    Symbolic links  : 0
    Other           : 0
    Total           : 455a433531d13bb1

Expected results:
All files should be synced, and the arequal-checksums of the master and slave volumes should be identical.

Additional info:
These were the only error messages seen in the geo-replication log files:

    [2013-08-14 18:47:43.370575] I [monitor(monitor):81:set_state] Monitor: new state: Stable
    [2013-08-14 18:50:05.485461] I [master(/rhs/bricks/brick0):335:crawlwrap] _GMaster: primary master with volume id 909d2220-7c8b-486d-8aeb-d3190bd9e526 ...
    [2013-08-14 18:50:05.496843] I [master(/rhs/bricks/brick0):345:crawlwrap] _GMaster: crawl interval: 3 seconds
    [2013-08-14 18:53:18.593717] E [syncdutils(/rhs/bricks/brick0):189:log_raise_exception] <top>: connection to peer is broken
    [2013-08-14 18:53:18.598598] E [resource(/rhs/bricks/brick0):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-268YVh/gsycnd-ssh-%r@%h:%p root@falcon /nonexistent/gsyncd --session-owner 909d2220-7c8b-486d-8aeb-d3190bd9e526 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
    [2013-08-14 18:53:18.599035] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955010] I [socket.c:3487:socket_init] 0-glusterfs: SSL support is NOT enabled
    [2013-08-14 18:53:18.599292] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:43.955066] I [socket.c:3502:socket_init] 0-glusterfs: using system polling thread
    [2013-08-14 18:53:18.599530] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225446] I [cli-rpc-ops.c:5461:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
    [2013-08-14 18:53:18.599793] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-14 13:16:44.225780] I [input.c:36:cli_batch] 0-: Exiting with: 0
    [2013-08-14 18:53:18.600095] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> Killed by signal 15.
    [2013-08-14 18:53:18.600682] I [syncdutils(/rhs/bricks/brick0):158:finalize] <top>: exiting.
    [2013-08-14 18:53:18.603801] I [monitor(monitor):81:set_state] Monitor: new state: faulty
    [2013-08-14 18:59:50.871346] I [monitor(monitor):237:distribute] <top>: slave bricks: [{'host': 'falcon', 'dir': '/rhs/bricks/brick0'}, {'host': 'hornet', 'dir': '/rhs/bricks/brick1'}, {'host': 'interceptor', 'dir': '/rhs/bricks/brick2'}, {'host': 'lightning', 'dir': '/rhs/bricks/brick3'}]
    [2013-08-14 18:59:50.872292] I [monitor(monitor):256:distribute] <top>: worker specs: [('/rhs/bricks/brick0', 'ssh://root@falcon:gluster://localhost:slave')]
    [2013-08-14 18:59:50.873248] I [monitor(monitor):81:set_state] Monitor: new state: Initializing...
    [2013-08-14 18:59:50.876266] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------

I have taken sosreports from all the nodes and have also copied the geo-rep working dir.
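For reference, here is a minimal shell sketch of steps 1-7 above. The slave host (falcon), slave volume (slave) and master mount point (/master) are taken from the logs and output in this report; the master volume name "master" and the crefi flags are assumptions and may need adjusting for the actual setup and crefi version.

```sh
# Assumed names: master volume "master" (hypothetical), slave host "falcon",
# slave volume "slave", master volume mounted at /master (as in this report).
MASTER_VOL=master
SLAVE=falcon::slave

# Step 1: create and start the geo-rep session (assumes passwordless ssh to
# the slave host is already set up).
gluster volume geo-replication $MASTER_VOL $SLAVE create push-pem
gluster volume geo-replication $MASTER_VOL $SLAVE start

# Step 2: create files on the master mount with crefi in the background
# (flags are illustrative and may differ between crefi versions).
crefi --multi -b 10 -d 10 -n 100 -t text /master/ &

# Step 3: restart the session while the file creation is still running.
gluster volume geo-replication $MASTER_VOL $SLAVE stop
gluster volume geo-replication $MASTER_VOL $SLAVE start
wait  # let the first round of file creation finish

# Step 4: stop the session without waiting for the sync to complete.
gluster volume geo-replication $MASTER_VOL $SLAVE stop

# Step 5: create a second batch of files.
crefi --multi -b 10 -d 10 -n 100 -t text /master/

# Steps 6-7: start the session again and watch it drain the pending changes.
gluster volume geo-replication $MASTER_VOL $SLAVE start
gluster volume geo-replication $MASTER_VOL $SLAVE status detail
```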
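The comparison behind the Actual/Expected results can be scripted the same way, assuming the master and slave volumes are mounted at /master and /slave as in the arequal output above (the /tmp output paths are arbitrary scratch files):

```sh
# Capture arequal reports from both mounts; after a complete sync the two
# reports (entry counts and checksums) should be identical.
/opt/qa/tools/arequal-checksum /master/ > /tmp/arequal-master.txt
/opt/qa/tools/arequal-checksum /slave/  > /tmp/arequal-slave.txt
diff /tmp/arequal-master.txt /tmp/arequal-slave.txt \
  && echo "master and slave match" \
  || echo "mismatch: some entries did not sync"
```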