Bug 1144117
| Field | Value |
| --- | --- |
| Summary | dist-geo-rep: zero byte files created in slave even before creating any data in master by processing stale changelogs in working dir |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | M S Vishwanath Bhat <vbhat> |
| Component | geo-replication |
| Assignee | Aravinda VK <avishwan> |
| Status | CLOSED ERRATA |
| QA Contact | shilpa <smanjara> |
| Severity | low |
| Docs Contact | |
| Priority | medium |
| Version | rhgs-3.0 |
| CC | aavati, annair, avishwan, bmohanra, csaba, khiremat, mzywusko, nlevinki, nsathyan, sanandpa, smanjara, ssamanta |
| Target Milestone | --- |
| Keywords | ZStream |
| Target Release | RHGS 3.0.4 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | glusterfs-3.6.0.46-1 |
| Doc Type | Bug Fix |
| Doc Text | Previously, geo-replication consumed unprocessed changelogs left in the working directory from a previous run; replaying them on the slave created empty files and directories. The fix cleans up the working directory before geo-replication starts. |
| Story Points | --- |
| Clone Of | |
| | 1162057 (view as bug list) |
| Environment | |
| Last Closed | 2015-03-26 06:34:40 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1162057, 1182947 |
Description
M S Vishwanath Bhat
2014-09-18 17:25:22 UTC
The geo-rep status shows that there are files to be synced:

```
MASTER NODE                MASTER VOL    MASTER BRICK      SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com    master        /bricks/brick0    falcon::slave         Active     N/A                  Changelog Crawl    0              265              0                0                  0
typhoon.blr.redhat.com     master        /bricks/brick3    lightning::slave      Passive    N/A                  N/A                0              0                0                0                  0
mustang.blr.redhat.com     master        /bricks/brick1    interceptor::slave    Passive    N/A                  N/A                0              0                0                0                  0
harrier.blr.redhat.com     master        /bricks/brick2    hornet::slave         Active     N/A                  Changelog Crawl    0              320              0                0
```

But there are no files. There are a lot of error messages like this:

```
[2014-09-18 22:45:28.304728] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390414] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305028] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390429] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305279] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390457] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305529] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390470] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.305809] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390483] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.306107] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390494] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2014-09-18 22:45:28.306365] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390642] T [cli.c:264:cli_rpc_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2014-09-18 22:45:28.306638] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390748] T [cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2014-09-18 22:45:28.306934] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390771] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-09-18 22:45:28.307188] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.390801] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT
[2014-09-18 22:45:28.307437] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.391571] T [rpc-clnt.c:1381:rpc_clnt_record] 0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2014-09-18 22:45:28.307732] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.391708] T [rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 68, payload: 4, rpc hdr: 64
[2014-09-18 22:45:28.308032] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392263] T [socket.c:2715:socket_connect] (-->/usr/sbin/gluster(cli_cmd_submit+0x8e) [0x40b3ce] (-->/usr/sbin/gluster(cli_submit_request+0xdb) [0x40a5cb] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x468) [0x7f111c953f38]))) 0-glusterfs: connect () called on transport already connected
[2014-09-18 22:45:28.308284] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392351] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 26) to rpc-transport (glusterfs)
[2014-09-18 22:45:28.308535] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.392382] D [rpc-clnt-ping.c:231:rpc_clnt_start_ping] 0-glusterfs: ping timeout is 0, returning
[2014-09-18 22:45:28.308872] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.393779] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-glusterfs: received rpc message (RPC XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 26) from rpc-transport (glusterfs)
[2014-09-18 22:45:28.309135] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.393827] I [cli-rpc-ops.c:5324:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2014-09-18 22:45:28.309403] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394217] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 0
[2014-09-18 22:45:28.309679] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394256] D [cli-rpc-ops.c:5353:gf_cli_getwd] 0-cli: Returning 0
[2014-09-18 22:45:28.309979] E [resource(/bricks/brick0):225:logerr] Popen: ssh> [2014-09-18 15:38:28.394270] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2014-09-18 22:45:28.310240] E [resource(/bricks/brick0):225:logerr] Popen: ssh> Killed by signal 15.
[2014-09-18 22:45:28.310914] I [syncdutils(/bricks/brick0):214:finalize] <top>: exiting.
[2014-09-18 22:45:28.322437] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-18 22:45:28.323137] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-18 22:45:28.354732] I [monitor(monitor):109:set_state] Monitor: new state: faulty
```

I will attach the geo-rep logs from the two active nodes and copy the sosreports to somewhere accessible.

And it seems like geo-rep delete does not delete the temp working directory. In the logs on one of the nodes:

```
[2014-09-18 21:08:34.186109] I [master(/bricks/brick1):1281:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.42.217%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync
```

After geo-rep delete:

```
[root@mustang ~]# ls /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.42.156%3Agluster%3A%2F%2F127.0.0.1%3Aslave/
bd42ad17ef8864d51407b1c6478f5dc6
```

The dir is still present.

This is a setup issue, not a problem of the working directory not being deleted: old bricks were reused without cleanup, and the .glusterfs/changelogs directory is not deleted when the same brick is used for a new volume.

Fixes required:

- Glusterd: while creating a volume, it should clean up the .glusterfs directory inside the brick.
- Georep: delete the working directory when the geo-rep session is deleted.
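For illustration, the geo-rep side of the fix amounts to clearing leftover changelog state from the session working directory before the worker starts consuming changelogs, so stale changelogs from a previous run cannot be replayed on the slave. The sketch below is not the actual gsyncd code from the patches referenced later in this bug; the subdirectory names and the helper are assumptions.

```python
# Minimal sketch, assuming a working-directory layout with .processing/.processed
# subdirectories; not the actual gsyncd implementation.
import os
import shutil

def cleanup_working_dir(working_dir):
    """Drop leftover changelog state from a previous geo-rep run."""
    for subdir in (".processing", ".processed"):  # assumed layout
        path = os.path.join(working_dir, subdir)
        if os.path.isdir(path):
            shutil.rmtree(path)  # remove stale, unprocessed changelogs
            os.makedirs(path)    # recreate empty for the new session

# Hypothetical session working directory (URL-encoded slave URL plus brick hash),
# matching the shape of the paths shown in the logs above:
# /var/lib/misc/glusterfsd/master/ssh%3A%2F%2F...%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/
```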
Observed an issue related to this bug during the following test case:

1. Create a dist-rep 2*2 volume.
2. Start geo-rep.
3. Start untarring a big file (the Linux kernel source, in my case).
4. While the untar is in progress, reboot an active node in a replica pair on the master cluster.
5. Once it comes back up, check the geo-rep status.

The passive node which switched to active at the time of the active node reboot did not switch back to passive. Hence both nodes in the replica pair remained active.

Before rebooting pinkfloyd:

```
# gluster volume geo-replication master acdc::slave status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS
------------------------------------------------------------------------------------------------------------------
ccr            master        /bricks/brick0    nirvana::slave      Active     N/A                  Changelog Crawl
beatles        master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A
metallica      master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A
pinkfloyd      master        /bricks/brick2    led::slave          Active     N/A                  Changelog Crawl
```

After rebooting pinkfloyd:

```
# gluster v geo master acdc::slave status detail

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
metallica      master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A                0              0                0                0                  0
pinkfloyd      master        /bricks/brick2    led::slave          Active     N/A                  Changelog Crawl    24691          0                0                0                  0
beatles        master        /bricks/brick3    rammstein::slave    Active     N/A                  Changelog Crawl    0              245              0                0                  0
ccr            master        /bricks/brick0    nirvana::slave      Active     N/A                  Changelog Crawl    24461          0                0                0                  0
```

The node that was rebooted was pinkfloyd, and beatles was passive before the reboot.
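As an aside, a check like this can be automated by parsing the status output and flagging replica pairs where both bricks report Active. The sketch below is illustrative only, not part of GlusterFS or this bug's test suite; the replica pairing is an assumption (in practice it would come from `gluster volume info`).

```python
# Hedged QA-style sketch: detect replica pairs where both nodes are Active.
import subprocess

REPLICA_PAIRS = [("ccr", "metallica"), ("pinkfloyd", "beatles")]  # assumed pairing

def node_states(master, slave):
    """Map master node name -> Active/Passive from geo-rep status output."""
    out = subprocess.check_output(
        ["gluster", "volume", "geo-replication", master, slave, "status"],
        text=True)
    states = {}
    for line in out.splitlines():
        cols = line.split()
        # status rows look like: NODE VOL BRICK SLAVE STATUS ...
        if len(cols) >= 5 and cols[4] in ("Active", "Passive"):
            states[cols[0]] = cols[4]
    return states

def check_one_active_per_pair(states):
    for a, b in REPLICA_PAIRS:
        if states.get(a) == states.get(b) == "Active":
            print(f"both {a} and {b} are Active; expected one to be Passive")

check_one_active_per_pair(node_states("master", "acdc::slave"))
```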
Saw these error messages in the logs on the slave node connected to beatles:

```
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[pid 8859] lsetxattr(".gfid/9f78f98a-2042-4a75-a180-c5af76b35a98", "glusterfs.gfid.newfile", "\x00\x00\x00\x00\x00\x00\x00\x005f1deefd-bf1a-4554-b699-1e41de9d2d3e\x00\x00\x00\x81\x80mac-celtic.c\x00\x00\x00\x01\x80\x00\x00\x00\x00\x00\x00\x00", 74, 0) = -1 ENOENT (No such file or directory)
[pid 8859] select(0, NULL, NULL, NULL, {0, 250000} <unfinished ...>
[pid 8843] <... select resumed> ) = 0 (Timeout)
[pid 8843] select(0, [], [], [], {1, 0} <unfinished ...>
[pid 8859] <... select resumed> ) = 0 (Timeout)
```

Another issue is that all the data was not synced to the slave volume after the reboot. The checksums do not match.

arequal-checksum of master:

```
Entry counts
Regular files   : 44590
Directories     : 2871
Symbolic links  : 3
Other           : 0
Total           : 47464

Metadata checksums
Regular files   : 2cb0
Directories     : 24d74c
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : 99f622aeeb5d95599be432f82b0d3195
Directories     : 100d663f041c7b61
Symbolic links  : 4d47000a1d0f3d7c
Other           : 0
Total           : 5f587663d943e2d1
```

arequal-checksum of geo_rep_slave slave:

```
Entry counts
Regular files   : 44490
Directories     : 2871
Symbolic links  : 3
Other           : 0
Total           : 47364

Metadata checksums
Regular files   : 49bc
Directories     : 24c719
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : b7f34c6812d966cb3ed6d65f78fdc2a5
Directories     : a3f0e6514757768
Symbolic links  : 4d47000a1d0f3d7c
Other           : 0
Total           : ce5d9458635eee7a
```
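For context, arequal-checksum compares entry counts plus data and metadata checksums of two trees. A much-simplified, illustrative stand-in that reproduces only the entry-count comparison that mismatches above (mount points are assumptions) could look like:

```python
# Illustrative sketch only; real arequal also checksums file data and metadata.
import os

def entry_counts(mount):
    """Count regular files, directories, and symlinks under a mount."""
    counts = {"files": 0, "dirs": 0, "symlinks": 0}
    for root, dirs, files in os.walk(mount):
        # Skip gluster's internal metadata tree if walking a brick directly;
        # on a regular client mount this directory is not visible anyway.
        dirs[:] = [d for d in dirs if d != ".glusterfs"]
        counts["dirs"] += len(dirs)
        for name in files:
            if os.path.islink(os.path.join(root, name)):
                counts["symlinks"] += 1
            else:
                counts["files"] += 1
    return counts

master = entry_counts("/mnt/master")  # assumed mount points
slave = entry_counts("/mnt/slave")
if master != slave:
    print("entry counts differ:", master, slave)
```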
Downstream patch sent: https://code.engineering.redhat.com/gerrit/#/c/41583/

Upstream patches:

- http://review.gluster.org/9082
- http://review.gluster.org/9495

Tested and verified on glusterfs-3.6.0.46-1.

Test case:

- Create files, sync a few of them, and stop geo-rep.
- Due to the incomplete sync, we should find a changelog file in the .processing directory listing the files pending sync.
- Now start geo-rep.
- At this point the old changelog file in the .processing dir should be cleaned up.

```
# for i in {1..30}; do touch f$i; done
# gluster v geo-rep master acdc::slave stop
Stopping geo-replication session between master & acdc::slave has been successful
# ls /mnt/slave
f1  f10  f13  f16  f18  f19  f2  f20  f21  f22  f23  f25  f26  f27  f30  f6  f7  f9
# ls /mnt/slave | wc -l
18
# ls /mnt/master | wc -l
30
```

Working directory changelog file in .processing:

```
./6ba51664ad363afecedbc65396e2c80c/.history/.processing/CHANGELOG.1424944982
```

After stop/start of geo-rep:

```
./6ba51664ad363afecedbc65396e2c80c/.history/.processing/CHANGELOG.1425025496
```

Hi Aravinda, the doc text is updated. Please review it and sign off if it looks OK.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0682.html