Bug 1141520
| Summary: | dist-geo-rep: After running truncate tests with xsync on nfs mount, slave has less size than master volume | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | M S Vishwanath Bhat <vbhat> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | rhgs-3.0 | CC: | avishwan, chrisw, csaba, mzywusko, nlevinki, otira.fuu, rallan, smanjara, smohan |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | consistency | ||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-09-06 06:17:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
This happened again in other setup while I was doing tests with xsync as change_detctor and with tarssh.
Below are the some log snippets.
F127.0.0.1%3Aslave/466a4e79fdb4cd4055d2776214abfa35/xsync/XSYNC-CHANGELOG.1410806852
[2014-09-16 00:17:32.490181] I [master(/bricks/brick0):1310:crawl] _GMaster: finished hybrid crawl syncing, stime: (1410806194, 553824)
[2014-09-16 00:17:48.292979] E [syncdutils(/bricks/brick0):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
main_i()
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1343, in service_loop
g2.crawlwrap()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in crawlwrap
time.sleep(self.sleep_interval)
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 374, in <lambda>
def set_term_handler(hook=lambda *a: finalize(*a, **{'exval': 1})):
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 206, in finalize
shutil.rmtree(gconf.ssh_ctl_dir)
File "/usr/lib64/python2.6/shutil.py", line 217, in rmtree
onerror(os.remove, fullname, sys.exc_info())
File "/usr/lib64/python2.6/shutil.py", line 215, in rmtree
os.remove(fullname)
OSError: [Errno 2] No such file or directory: '/tmp/gsyncd-aux-ssh-YxiAao/23879b48233bff64daeb27b531bb6a42.sock'
[2014-09-16 00:17:48.294410] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken
[2014-09-16 00:17:48.294410] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken
[2014-09-16 00:17:48.298872] E [resource(/bricks/brick0):221:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-YxiAao/23879b48233bff64daeb27b531bb6a42.sock root.eng.blr.redhat.com /nonexistent/gsyncd --session-owner bc2a1d22-d822-451b-88ba-78520c7feed9 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2014-09-16 00:17:48.299129] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557235] I [cli.c:570:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2014-09-16 00:17:48.299343] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557360] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2014-09-16 00:17:48.299555] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557378] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2014-09-16 00:17:48.299735] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557404] D [rpc-transport.c:188:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2014-09-16 00:17:48.299923] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557422] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so
[2014-09-16 00:17:48.300116] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564020] T [options.c:87:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2014-09-16 00:17:48.300376] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564096] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-09-16 00:17:48.300565] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564114] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread
[2014-09-16 00:17:48.300780] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564135] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2014-09-16 00:17:48.300973] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564149] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x13dc5f0, state=0 gen=0 sock=-1
[root@rhsauto044 ~]# du -sh /mnt/master/
64M /mnt/master/
[root@rhsauto044 ~]# du -sh /mnt/slave/
23M /mnt/slave/
|
Description of problem: I was running automated tests with xsync on NFS mount, and after truncate tests, the slave has less size than master. And hence the arequal checksum of master and slave didn't match. Version-Release number of selected component (if applicable): glusterfs-3.6.0.28-1.el6rhs.x86_64 How reproducible: Not sure. Hit only once. Steps to Reproduce: 1. Create and start a geo-rep session between 2*2 dist-rep master and 2*2 dist-rep slave. 2. Run truncate tests with xsync as change_detector and rsync as the syncing mode. Actual results: Even after more than a day, the data is still not synced properly. The master has 64MB data while the slave has 24MB of data. And hence the arequal-cehcksum do not match. arequal-checksum of master Entry counts Regular files : 25000 Directories : 3011 Symbolic links : 19800 Other : 0 Total : 47811 Metadata checksums Regular files : 457283c2a8 Directories : acebe6410c0 Symbolic links : 3e9 Other : 3e9 Checksums Regular files : 2e24eda76372b7aeac7976c12dc3902e Directories : 7d1e761e640f536b Symbolic links : 71046f0d3a370214 Other : 0 Total : 8e478275108976ff arequal-cehcksum of slave Entry counts Regular files : 25000 Directories : 3011 Symbolic links : 19800 Other : 0 Total : 47811 Metadata checksums Regular files : 457283c2a8 Directories : ad1b326eb40 Symbolic links : 3e9 Other : 3e9 Checksums Regular files : 6d56442a1d4a4d42c90f210b1d9e2d21 Directories : 7d1e761e640f536b Symbolic links : 71046f0d3a370214 Other : 0 Total : a8437c325eec311c Expected results: The data should be synced properly between master and slave. The arequal-cehcksum should match. Additional info: This backtrace was found in one of the node. Not sure if it would help [2014-09-13 15:55:00.862567] I [master(/bricks/brick0):1310:crawl] _GMaster: finished hybrid crawl syncing, stime: (1410541253, 518263) [2014-09-13 15:55:12.485877] E [syncdutils(/bricks/brick0):270:log_raise_exception] <top>: FAIL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main main_i() File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i local.service_loop(*[r for r in [remote] if r]) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1343, in service_loop g2.crawlwrap() File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in crawlwrap time.sleep(self.sleep_interval) File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 374, in <lambda> def set_term_handler(hook=lambda *a: finalize(*a, **{'exval': 1})): File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 206, in finalize shutil.rmtree(gconf.ssh_ctl_dir) File "/usr/lib64/python2.6/shutil.py", line 217, in rmtree onerror(os.remove, fullname, sys.exc_info()) File "/usr/lib64/python2.6/shutil.py", line 215, in rmtree os.remove(fullname) OSError: [Errno 2] No such file or directory: '/tmp/gsyncd-aux-ssh-QPLWZU/01bd075dfe08026647f35caf4590254b.sock' [2014-09-13 15:55:12.486850] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken [2014-09-13 15:55:12.500783] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033438] I [cli.c:570:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost [2014-09-13 15:55:12.501396] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033530] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins [2014-09-13 15:55:12.501821] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033548] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout [2014-09-13 15:55:12.502295] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033576] D [rpc-transport.c:188:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket" [2014-09-13 15:55:12.502811] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033597] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so [2014-09-13 15:55:12.503293] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039085] T [options.c:87:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007' [2014-09-13 15:55:12.503757] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039142] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled [2014-09-13 15:55:12.504030] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039159] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread [2014-09-13 15:55:12.504374] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039180] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect [2014-09-13 15:55:12.504640] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039197] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x195f5e0, state=0 gen=0 sock=-1 [2014-09-13 15:55:12.504878] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039231] T [common-utils.c:206:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost [2014-09-13 15:55:12.505106] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.045840] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-127.0.0.1 (port-24007) for hostname: localhost and port: 24007 [2014-09-13 15:55:12.505444] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051362] D [common-utils.c:268:gf_resolve_ip6] 0-resolver: next DNS query will return: ip-127.0.0.1 port-24007 [2014-09-13 15:55:12.505701] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051409] T [socket.c:746:__socket_nodelay] 0-glusterfs: NODELAY enabled for socket 6 [2014-09-13 15:55:12.505925] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051428] T [socket.c:805:__socket_keepalive] 0-glusterfs: Keep-alive enabled for socket 6, interval 2, idle: 20 [2014-09-13 15:55:12.506144] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051733] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins [2014-09-13 15:55:12.506511] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051757] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout [2014-09-13 15:55:12.506767] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051862] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so [2014-09-13 15:55:12.506989] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051982] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled [2014-09-13 15:55:12.507255] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052023] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread [2014-09-13 15:55:12.507482] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052042] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect [2014-09-13 15:55:12.507728] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052056] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x1968bc0, state=0 gen=0 sock=-1 [2014-09-13 15:55:12.507946] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052073] T [name.c:290:af_unix_client_get_remote_sockaddr] 0-glusterfs: using connect-path /tmp/quotad.socket [2014-09-13 15:55:12.508162] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052096] T [name.c:106:af_unix_client_bind] 0-glusterfs: bind-path not specified for unix socket, letting connect to assign default value [2014-09-13 15:55:12.508446] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052276] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0 [2014-09-13 15:55:12.508717] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052322] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0