Description of problem:
Zero-byte files are created in the slave volume just after geo-rep session creation, even before any data is created in the master volume.

I had a geo-rep session running in the cluster. After the test I cleaned up all the logs, /var/lib/glusterd and the backend bricks (I did not clean up the working directory used by the geo-rep session, and note that I did not do a geo-rep stop/delete; it was a hard cleanup. A sketch of that sequence follows below.) I have scripts which re-create the whole setup with the same names and the same configuration. When I created a new geo-rep session, the brick somehow got hashed to the same value (I think) and the session started processing the xsync and changelog files already present there. But since there is no data on the master, it only created the files on the slave, and all the files on the slave are zero bytes.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Seen twice. I think the brick has to somehow get hashed to the same value as the previous session on the same node.

Steps to Reproduce:
There are no exact steps to it.
1. Create and start a session.
2. Create some data and run anything.
3. Clean up everything and re-create the whole session once again.
4. There is a chance that the slave volume has files before any files are created on the master.

Actual results:
The slave volume has zero-byte files before any data is created on the master.

Expected results:
The slave should only sync the data created on the master.

Additional info:
The bug is not serious, but it is very ugly.
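For illustration only, here is a minimal sketch of the kind of hard cleanup and re-creation described above. The volume name, brick path and slave host are taken from the status output later in this report; the actual scripts were not attached, so every path and command here is an assumption, not the exact procedure used.

On every master node, wipe state without doing geo-rep stop/delete first:

# service glusterd stop
# rm -rf /var/log/glusterfs/* /var/lib/glusterd/*
# rm -rf /bricks/brick0/*
# service glusterd start

Note that the shell glob in the brick cleanup does not match the hidden .glusterfs directory, so old changelogs can survive on the reused brick, and the geo-rep working directory under /var/lib/misc/glusterfsd is never touched.

Then re-create the volume and the session with the same names (force may be needed on volume create because the reused brick still carries its old volume-id xattr):

# gluster volume create master <same replica/brick layout as before> force
# gluster volume start master
# gluster volume geo-replication master falcon::slave create push-pem
# gluster volume geo-replication master falcon::slave start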
The geo-rep status shows that there are files to be synced:

MASTER NODE                MASTER VOL    MASTER BRICK      SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com    master        /bricks/brick0    falcon::slave         Active     N/A                  Changelog Crawl    0              265              0                0                  0
typhoon.blr.redhat.com     master        /bricks/brick3    lightning::slave      Passive    N/A                  N/A                0              0                0                0                  0
mustang.blr.redhat.com     master        /bricks/brick1    interceptor::slave    Passive    N/A                  N/A                0              0                0                0                  0
harrier.blr.redhat.com     master        /bricks/brick2    hornet::slave         Active     N/A                  Changelog Crawl    0              320              0                0

But there are no files on the master. There are a lot of error messages like this:

[2014-09-18 22:45:28.304728] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390414] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305028] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390429] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305279] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390457] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305529] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390470] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305809] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390483] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.306107] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390494] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.306365] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390642] T [cli.c:264:cli_rpc_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2014-09-18 22:45:28.306638] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390748] T [cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2014-09-18 22:45:28.306934] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390771] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-09-18 22:45:28.307188] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390801] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT
[2014-09-18 22:45:28.307437] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.391571] T [rpc-clnt.c:1381:rpc_clnt_record] 0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2014-09-18 22:45:28.307732] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.391708] T [rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 68, payload: 4, rpc hdr: 64
[2014-09-18 22:45:28.308032] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392263] T [socket.c:2715:socket_connect] (-->/usr/sbin/gluster(cli_cmd_submit+0x8e) [0x40b3ce] (-->/usr/sbin/gluster(cli_submit_request+0xdb) [0x40a5cb] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x468) [0x7f111c953f38]))) 0-glusterfs: connect () called on transport already connected
[2014-09-18 22:45:28.308284] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392351] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 26) to rpc-transport (glusterfs)
[2014-09-18 22:45:28.308535] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392382] D [rpc-clnt-ping.c:231:rpc_clnt_start_ping] 0-glusterfs: ping timeout is 0, returning
[2014-09-18 22:45:28.308872] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.393779] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-glusterfs: received rpc message (RPC XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 26) from rpc-transport (glusterfs)
[2014-09-18 22:45:28.309135] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.393827] I [cli-rpc-ops.c:5324:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2014-09-18 22:45:28.309403] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394217] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 0
[2014-09-18 22:45:28.309679] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394256] D [cli-rpc-ops.c:5353:gf_cli_getwd] 0-cli: Returning 0
[2014-09-18 22:45:28.309979] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394270] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2014-09-18 22:45:28.310240] E [resource(/bricks/brick0):225:logerr] Popen: ssh> Killed by signal 15.
[2014-09-18 22:45:28.310914] I [syncdutils(/bricks/brick0):214:finalize] <top>: exiting.
[2014-09-18 22:45:28.322437] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-18 22:45:28.323137] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-18 22:45:28.354732] I [monitor(monitor):109:set_state] Monitor: new state: faulty

I will attach the geo-rep logs from the two active nodes and copy the sosreports somewhere accessible.

It also seems that geo-rep delete does not delete the temp working directory. In one of the nodes, the logs show:

[2014-09-18 21:08:34.186109] I [master(/bricks/brick1):1281:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.42.217%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync

After geo-rep delete:

[root@mustang ~]# ls /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.42.156%3Agluster%3A%2F%2F127.0.0.1%3Aslave/
bd42ad17ef8864d51407b1c6478f5dc6

The dir is still present.
This is a setup issue; the root cause is not just that the working directory is not deleted. Old bricks are reused without cleanup, and the .glusterfs/changelogs directory is not deleted when the same brick is used for a new volume.

Fixes required:
Glusterd: while creating a volume, it should clean up the .glusterfs directory inside the brick.
Geo-rep: delete the working directory when the geo-rep session is deleted.
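Until those fixes are in place, a hedged manual workaround along the same lines (the brick path, volume name, and working-directory path below are illustrative values copied from this report; verify them on your own setup before deleting anything):

Before reusing a brick for a new volume, remove its stale gluster metadata and changelogs:

# rm -rf /bricks/brick0/.glusterfs

After deleting a geo-rep session, remove its working directory by hand, e.g.:

# rm -rf /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.42.156%3Agluster%3A%2F%2F127.0.0.1%3Aslave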
Observed an issue related to this bug during the following test case:
1. Create a dist-rep 2x2 volume.
2. Start geo-rep.
3. Start untarring a big file (the Linux kernel tarball in my case).
4. While the untar is in progress, reboot an active node in a replica pair on the master cluster.
5. Once it comes back up, check the geo-rep status.

The passive node which switched to active at the time of the active node's reboot did not switch back to passive, so both nodes in the replica pair remained active.

Before rebooting pinkfloyd:
# gluster volume geo-replication master acdc::slave status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS
----------------------------------------------------------------------------------------------------------------
ccr            master        /bricks/brick0    nirvana::slave      Active     N/A                  Changelog Crawl
beatles        master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A
metallica      master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A
pinkfloyd      master        /bricks/brick2    led::slave          Active     N/A                  Changelog Crawl

After rebooting pinkfloyd:
# gluster v geo master acdc::slave status detail

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
metallica      master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A                0              0                0                0                  0
pinkfloyd      master        /bricks/brick2    led::slave          Active     N/A                  Changelog Crawl    24691          0                0                0                  0
beatles        master        /bricks/brick3    rammstein::slave    Active     N/A                  Changelog Crawl    0              245              0                0                  0
ccr            master        /bricks/brick0    nirvana::slave      Active     N/A                  Changelog Crawl    24461          0                0                0                  0

The node that was rebooted was pinkfloyd, and beatles was passive before the reboot. Saw these error messages in the logs on the slave node connected to beatles:

[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000} <unfinished ...>
[pid 8843] <... select resumed> ) = 0 (Timeout)
[pid 8843] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 8859] <... select resumed> ) = 0 (Timeout)
Another issue is that not all the data was synced to the slave volume after the reboot; the checksums do not match.

arequal-checksum of master:

Entry counts
Regular files    : 44590
Directories      : 2871
Symbolic links   : 3
Other            : 0
Total            : 47464

Metadata checksums
Regular files    : 2cb0
Directories      : 24d74c
Symbolic links   : 5a815a
Other            : 3e9

Checksums
Regular files    : 99f622aeeb5d95599be432f82b0d3195
Directories      : 100d663f041c7b61
Symbolic links   : 4d47000a1d0f3d7c
Other            : 0
Total            : 5f587663d943e2d1

arequal-checksum of slave:

Entry counts
Regular files    : 44490
Directories      : 2871
Symbolic links   : 3
Other            : 0
Total            : 47364

Metadata checksums
Regular files    : 49bc
Directories      : 24c719
Symbolic links   : 5a815a
Other            : 3e9

Checksums
Regular files    : b7f34c6812d966cb3ed6d65f78fdc2a5
Directories      : a3f0e6514757768
Symbolic links   : 4d47000a1d0f3d7c
Other            : 0
Total            : ce5d9458635eee7a
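For context, the checksums above are in the format produced by the arequal-checksum utility commonly used in Gluster testing. A minimal sketch of how such a comparison is typically gathered, assuming the master and slave volumes are mounted at /mnt/master and /mnt/slave (both mount points, and the exact option syntax, are assumptions that may vary by tool version):

# arequal-checksum -p /mnt/master
# arequal-checksum -p /mnt/slave

If the final Total lines of the two runs differ, as they do above, the volumes are not in sync.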
Downstream patch sent:
https://code.engineering.redhat.com/gerrit/#/c/41583/

Upstream patches:
http://review.gluster.org/9082
http://review.gluster.org/9495
Tested and verified on glusterfs-3.6.0.46-1.

Test case:
- Create files, sync a few of them, and stop geo-rep.
- Because of the incomplete sync, we should find a changelog file in the .processing directory with the files pending sync.
- Now start geo-rep.
- At this point the old changelog file in the .processing dir should be cleaned up.

# for i in {1..30}; do touch f$i; done
# gluster v geo-rep master acdc::slave stop
Stopping geo-replication session between master & acdc::slave has been successful
# ls /mnt/slave
f1  f10  f13  f16  f18  f19  f2  f20  f21  f22  f23  f25  f26  f27  f30  f6  f7  f9
# ls /mnt/slave | wc -l
18
# ls /mnt/master | wc -l
30

Working directory changelog file in .processing:
./6ba51664ad363afecedbc65396e2c80c/.history/.processing/CHANGELOG.1424944982

After stop/start of geo-rep:
./6ba51664ad363afecedbc65396e2c80c/.history/.processing/CHANGELOG.1425025496
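Related to the working-directory part of the fix, a hedged sketch of how one might additionally confirm the cleanup after deleting a session (the volume and slave names and the /var/lib/misc/glusterfsd path are reused from earlier comments in this report; the session-specific subdirectory name will differ per setup):

# gluster volume geo-replication master acdc::slave stop
# gluster volume geo-replication master acdc::slave delete
# ls /var/lib/misc/glusterfsd/master/

After the delete, the session's working directory should no longer be listed.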
Hi Aravinda, the doc text is updated. Please review it and sign off if it looks OK.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0682.html