Bug 1141520

Summary: dist-geo-rep: After running truncate tests with xsync on nfs mount, slave has less size than master volume
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replicationAssignee: Aravinda VK <avishwan>
Status: CLOSED CURRENTRELEASE QA Contact: storage-qa-internal <storage-qa-internal>
Severity: medium Docs Contact:
Priority: high    
Version: rhgs-3.0CC: avishwan, chrisw, csaba, mzywusko, nlevinki, otira.fuu, rallan, smanjara, smohan
Target Milestone: ---Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: consistency
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-06 06:17:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description M S Vishwanath Bhat 2014-09-14 11:04:37 UTC
Description of problem:
I was running automated tests with xsync on NFS mount, and after truncate tests, the slave has less size than master. And hence the arequal checksum of master and slave didn't match.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Not sure. Hit only once.

Steps to Reproduce:
1. Create and start a geo-rep session between 2*2 dist-rep master and 2*2 dist-rep slave. 
2. Run truncate tests with xsync as change_detector and rsync as the syncing mode.

Actual results:
Even after more than a day, the data is still not synced properly. The master has 64MB data while the slave has 24MB of data. And hence the arequal-cehcksum do not match.

arequal-checksum of master

Entry counts
Regular files   : 25000
Directories     : 3011
Symbolic links  : 19800
Other           : 0
Total           : 47811

Metadata checksums
Regular files   : 457283c2a8
Directories     : acebe6410c0
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 2e24eda76372b7aeac7976c12dc3902e
Directories     : 7d1e761e640f536b
Symbolic links  : 71046f0d3a370214
Other           : 0
Total           : 8e478275108976ff


arequal-cehcksum of slave

Entry counts
Regular files   : 25000
Directories     : 3011
Symbolic links  : 19800
Other           : 0
Total           : 47811

Metadata checksums
Regular files   : 457283c2a8
Directories     : ad1b326eb40
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 6d56442a1d4a4d42c90f210b1d9e2d21
Directories     : 7d1e761e640f536b
Symbolic links  : 71046f0d3a370214
Other           : 0
Total           : a8437c325eec311c


Expected results:
The data should be synced properly between master and slave. The arequal-cehcksum should match.


Additional info:

This backtrace was found in one of the node. Not sure if it would help

[2014-09-13 15:55:00.862567] I [master(/bricks/brick0):1310:crawl] _GMaster: finished hybrid crawl syncing, stime: (1410541253, 518263)
[2014-09-13 15:55:12.485877] E [syncdutils(/bricks/brick0):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1343, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in crawlwrap
    time.sleep(self.sleep_interval)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 374, in <lambda>
    def set_term_handler(hook=lambda *a: finalize(*a, **{'exval': 1})):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 206, in finalize
    shutil.rmtree(gconf.ssh_ctl_dir)
  File "/usr/lib64/python2.6/shutil.py", line 217, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/usr/lib64/python2.6/shutil.py", line 215, in rmtree
    os.remove(fullname)
OSError: [Errno 2] No such file or directory: '/tmp/gsyncd-aux-ssh-QPLWZU/01bd075dfe08026647f35caf4590254b.sock'
[2014-09-13 15:55:12.486850] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken



[2014-09-13 15:55:12.500783] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033438] I [cli.c:570:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2014-09-13 15:55:12.501396] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033530] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2014-09-13 15:55:12.501821] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033548] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2014-09-13 15:55:12.502295] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033576] D [rpc-transport.c:188:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2014-09-13 15:55:12.502811] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.033597] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so
[2014-09-13 15:55:12.503293] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039085] T [options.c:87:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2014-09-13 15:55:12.503757] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039142] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-09-13 15:55:12.504030] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039159] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread
[2014-09-13 15:55:12.504374] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039180] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2014-09-13 15:55:12.504640] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039197] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x195f5e0, state=0 gen=0 sock=-1
[2014-09-13 15:55:12.504878] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.039231] T [common-utils.c:206:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost
[2014-09-13 15:55:12.505106] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.045840] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-127.0.0.1 (port-24007) for hostname: localhost and port: 24007
[2014-09-13 15:55:12.505444] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051362] D [common-utils.c:268:gf_resolve_ip6] 0-resolver: next DNS query will return: ip-127.0.0.1 port-24007
[2014-09-13 15:55:12.505701] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051409] T [socket.c:746:__socket_nodelay] 0-glusterfs: NODELAY enabled for socket 6
[2014-09-13 15:55:12.505925] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051428] T [socket.c:805:__socket_keepalive] 0-glusterfs: Keep-alive enabled for socket 6, interval 2, idle: 20
[2014-09-13 15:55:12.506144] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051733] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2014-09-13 15:55:12.506511] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051757] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2014-09-13 15:55:12.506767] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051862] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so
[2014-09-13 15:55:12.506989] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.051982] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-09-13 15:55:12.507255] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052023] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread
[2014-09-13 15:55:12.507482] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052042] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2014-09-13 15:55:12.507728] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052056] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x1968bc0, state=0 gen=0 sock=-1
[2014-09-13 15:55:12.507946] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052073] T [name.c:290:af_unix_client_get_remote_sockaddr] 0-glusterfs: using connect-path /tmp/quotad.socket
[2014-09-13 15:55:12.508162] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052096] T [name.c:106:af_unix_client_bind] 0-glusterfs: bind-path not specified for unix socket, letting connect to assign default value
[2014-09-13 15:55:12.508446] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052276] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-13 15:55:12.508717] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-13 10:22:58.052322] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0

Comment 2 M S Vishwanath Bhat 2014-09-16 09:51:53 UTC
This happened again in other setup while I was doing tests with xsync as change_detctor and with tarssh.

Below are the some log snippets.

F127.0.0.1%3Aslave/466a4e79fdb4cd4055d2776214abfa35/xsync/XSYNC-CHANGELOG.1410806852
[2014-09-16 00:17:32.490181] I [master(/bricks/brick0):1310:crawl] _GMaster: finished hybrid crawl syncing, stime: (1410806194, 553824)
[2014-09-16 00:17:48.292979] E [syncdutils(/bricks/brick0):270:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1343, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 527, in crawlwrap
    time.sleep(self.sleep_interval)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 374, in <lambda>
    def set_term_handler(hook=lambda *a: finalize(*a, **{'exval': 1})):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 206, in finalize
    shutil.rmtree(gconf.ssh_ctl_dir)
  File "/usr/lib64/python2.6/shutil.py", line 217, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/usr/lib64/python2.6/shutil.py", line 215, in rmtree
    os.remove(fullname)
OSError: [Errno 2] No such file or directory: '/tmp/gsyncd-aux-ssh-YxiAao/23879b48233bff64daeb27b531bb6a42.sock'
[2014-09-16 00:17:48.294410] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken
[2014-09-16 00:17:48.294410] E [syncdutils(/bricks/brick0):246:log_raise_exception] <top>: connection to peer is broken
[2014-09-16 00:17:48.298872] E [resource(/bricks/brick0):221:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-YxiAao/23879b48233bff64daeb27b531bb6a42.sock root.eng.blr.redhat.com /nonexistent/gsyncd --session-owner bc2a1d22-d822-451b-88ba-78520c7feed9 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2014-09-16 00:17:48.299129] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557235] I [cli.c:570:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
[2014-09-16 00:17:48.299343] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557360] D [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2014-09-16 00:17:48.299555] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557378] D [rpc-clnt.c:986:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2014-09-16 00:17:48.299735] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557404] D [rpc-transport.c:188:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2014-09-16 00:17:48.299923] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.557422] D [rpc-transport.c:262:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so
[2014-09-16 00:17:48.300116] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564020] T [options.c:87:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2014-09-16 00:17:48.300376] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564096] D [socket.c:3634:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-09-16 00:17:48.300565] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564114] D [socket.c:3650:socket_init] 0-glusterfs: using system polling thread
[2014-09-16 00:17:48.300780] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564135] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2014-09-16 00:17:48.300973] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-15 17:56:37.564149] T [socket.c:2723:socket_connect] 0-glusterfs: connecting 0x13dc5f0, state=0 gen=0 sock=-1



[root@rhsauto044 ~]# du -sh /mnt/master/
64M     /mnt/master/


[root@rhsauto044 ~]# du -sh /mnt/slave/
23M     /mnt/slave/