1462687 – [Geo-rep]: Worker stuck in loop on trying to sync a directory

Bug 1462687 - [Geo-rep]: Worker stuck in loop on trying to sync a directory

Summary: [Geo-rep]: Worker stuck in loop on trying to sync a directory

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	rhgs-3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Kotresh HR
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1417151
TreeView+	depends on / blocked

Reported:	2017-06-19 09:31 UTC by Kotresh HR
Modified:	2017-09-21 04:59 UTC (History)
CC List:	5 users (show)
Fixed In Version:	glusterfs-3.8.4-29
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-21 04:59:42 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2774	0	normal	SHIPPED_LIVE	glusterfs bug fix and enhancement update	2017-09-21 08:16:29 UTC

Description Kotresh HR 2017-06-19 09:31:02 UTC

Description of problem:

Geo-replication worker is stuck in loop on trying to sync a directory
which could be ignored as it's no longer present in master.

Following error message appears in log
[2017-06-17 05:00:13.308384] E [syncdutils(/bricks3/A1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.

It should only log this on irrecoverable scenario. In cases where the directory no longer present in master, it should be ignored.


Version-Release number of selected component (if applicable):
3.8.4-28

How reproducible:
1/1

Steps to Reproduce:
Was running Bonnie,smallfiles,iozone and kernel untar from 8 FUSE mounts.

Actual results:
Worker was stuck in loop

Expected results:
Worker should not get stuck in loop

Additional info:

Comment 4 Atin Mukherjee 2017-06-19 10:46:29 UTC

the fix will be only in downstream as there was an issue with cherrypicking one of the upstream patch.

downstream patch : https://code.engineering.redhat.com/gerrit/#/c/109373/

Comment 6 Rahul Hinduja 2017-08-09 18:44:39 UTC

Steps to reproduce this issue: 

1. Stop the geo-rep
2. Create a directory dir.1 on master
3. Create a directory dir.1 on slave (To force rsync error)
4. Create dir a and files inside dir.1 (cd dir.1, mkdir a, cd a , touch f.{1..9})
5. Start geo-rep 

Worker crashes in loop with following:

[2017-08-09 18:44:14.455389] I [resource(/rhs/brick2/b3):1560:service_loop] GLUSTER: Register time: 1502304254
[2017-08-09 18:44:14.481693] I [master(/rhs/brick2/b3):466:mgmt_lock] _GMaster: Got lock : /rhs/brick2/b3 : Becoming ACTIVE
[2017-08-09 18:44:14.488532] I [gsyncdstatus(/rhs/brick2/b3):271:set_active] GeorepStatus: Worker Status: Active
[2017-08-09 18:44:14.491740] I [gsyncdstatus(/rhs/brick2/b3):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-08-09 18:44:14.492492] I [master(/rhs/brick2/b3):1167:crawl] _GMaster: starting history crawl... turns: 1, stime: (1502302155, 0), etime: 1502304254
[2017-08-09 18:44:15.496545] I [master(/rhs/brick2/b3):1195:crawl] _GMaster: slave's time: (1502302155, 0)
[2017-08-09 18:44:15.539921] E [master(/rhs/brick2/b3):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '31fed76e-6995-4fc3-bbab-3a4c007de3bb', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/dir1', 'op': 'MKDIR'}, 17, 'b480df27-3401-461f-8bc1-5b9f661a2669')
[2017-08-09 18:44:15.540421] E [syncdutils(/rhs/brick2/b3):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-09 18:44:15.541254] I [syncdutils(/rhs/brick2/b3):237:finalize] <top>: exiting.
[2017-08-09 18:44:15.547810] I [repce(/rhs/brick2/b3):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-08-09 18:44:15.548373] I [syncdutils(/rhs/brick2/b3):237:finalize] <top>: exiting.
[2017-08-09 18:44:16.238763] I [master(/rhs/brick1/b1):1249:register] _GMaster: Working dir: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.41.161%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56
[2017-08-09 18:44:16.239428] I [resource(/rhs/brick1/b1):1560:service_loop] GLUSTER: Register time: 1502304256
[2017-08-09 18:44:16.266387] I [master(/rhs/brick1/b1):466:mgmt_lock] _GMaster: Got lock : /rhs/brick1/b1 : Becoming ACTIVE
[2017-08-09 18:44:16.275037] I [gsyncdstatus(/rhs/brick1/b1):271:set_active] GeorepStatus: Worker Status: Active
[2017-08-09 18:44:16.278338] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-08-09 18:44:16.278990] I [master(/rhs/brick1/b1):1167:crawl] _GMaster: starting history crawl... turns: 1, stime: (1502302155, 0), etime: 1502304256
[2017-08-09 18:44:16.283341] I [master(/rhs/brick1/b1):1195:crawl] _GMaster: slave's time: (1502302155, 0)
[2017-08-09 18:44:16.419384] I [monitor(monitor):357:monitor] Monitor: worker(/rhs/brick2/b3) died in startup phase
[2017-08-09 18:44:16.424354] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-08-09 18:44:16.953702] I [master(/rhs/brick1/b1):1195:crawl] _GMaster: slave's time: (1502302155, 0)
[2017-08-09 18:44:16.985115] E [master(/rhs/brick1/b1):782:log_failures] _GMaster: ENTRY FAILED: ({'uid': 0, 'gfid': '31fed76e-6995-4fc3-bbab-3a4c007de3bb', 'gid': 0, 'mode': 16877, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/dir1', 'op': 'MKDIR'}, 17, 'b480df27-3401-461f-8bc1-5b9f661a2669')
[2017-08-09 18:44:16.985630] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-09 18:44:16.986583] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.

Comment 7 Rochelle 2017-08-10 07:10:22 UTC

Steps to reproduce this issue: 
===============================
1. Stop the geo-rep
2. Create a directory dir on master
3. Create a directory dir on slave (To force gfid mismatch)
4. Remove directory on master (rm -rf dir)
5. Created 3 new directories on master and files under them
6. Start the geo-rep session (it goes to faulty)

build : glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64

Worker crashes in loop with following:
======================================
[2017-08-10 06:57:05.827300] I [master(/rhs/brick1/b1):1195:crawl] _GMaster: slave's time: (1502347714, 0)
[2017-08-10 06:57:05.875091] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-10 06:57:05.875889] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-08-10 06:57:05.882951] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-08-10 06:57:05.883761] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-08-10 06:57:06.217839] I [master(/rhs/brick2/b3):1195:crawl] _GMaster: slave's time: (1502347714, 0)
[2017-08-10 06:57:06.259336] E [syncdutils(/rhs/brick2/b3):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-10 06:57:06.260150] I [syncdutils(/rhs/brick2/b3):237:finalize] <top>: exiting.
[2017-08-10 06:57:06.266977] I [repce(/rhs/brick2/b3):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-08-10 06:57:06.267626] I [syncdutils(/rhs/brick2/b3):237:finalize] <top>: exiting.
[2017-08-10 06:57:06.739125] I [monitor(monitor):357:monitor] Monitor: worker(/rhs/brick1/b1) died in startup phase
[2017-08-10 06:57:06.744327] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-08-10 06:57:07.137171] I [monitor(monitor):357:monitor] Monitor: worker(/rhs/brick2/b3) died in startup phase
[2017-08-10 06:57:07.141954] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-08-10 06:57:17.40195] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.41.161:gluster://localhost:slave
[2017-08-10 06:57:17.256321] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-08-10 06:57:17.256897] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-08-10 06:57:17.441454] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b3). Slave node: ssh://root.41.161:gluster://localhost:slave
[2017-08-10 06:57:17.655935] I [resource(/rhs/brick2/b3):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-08-10 06:57:17.656696] I [changelogagent(/rhs/brick2/b3):73:__init__] ChangelogAgent: Agent listining...
[2017-08-10 06:57:23.521589] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2648 secs
[2017-08-10 06:57:23.522162] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-08-10 06:57:23.853297] I [resource(/rhs/brick2/b3):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.1967 secs
[2017-08-10 06:57:23.853840] I [resource(/rhs/brick2/b3):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-08-10 06:57:28.654113] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1315 secs
[2017-08-10 06:57:28.654645] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-08-10 06:57:28.988323] I [resource(/rhs/brick2/b3):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1341 secs
[2017-08-10 06:57:28.988893] I [gsyncd(/rhs/brick2/b3):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-08-10 06:57:30.696382] I [master(/rhs/brick1/b1):1249:register] _GMaster: Working dir: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.41.161%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56
[2017-08-10 06:57:30.697042] I [resource(/rhs/brick1/b1):1560:service_loop] GLUSTER: Register time: 1502348250
[2017-08-10 06:57:30.723392] I [master(/rhs/brick1/b1):466:mgmt_lock] _GMaster: Got lock : /rhs/brick1/b1 : Becoming ACTIVE
[2017-08-10 06:57:30.729737] I [gsyncdstatus(/rhs/brick1/b1):271:set_active] GeorepStatus: Worker Status: Active
[2017-08-10 06:57:30.732889] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-08-10 06:57:30.733663] I [master(/rhs/brick1/b1):1167:crawl] _GMaster: starting history crawl... turns: 1, stime: (1502347714, 0), etime: 1502348250
[2017-08-10 06:57:31.29339] I [master(/rhs/brick2/b3):1249:register] _GMaster: Working dir: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.41.161%3Agluster%3A%2F%2F127.0.0.1%3Aslave/00e802dec1dcae8a424e05e9c2987767
[2017-08-10 06:57:31.30242] I [resource(/rhs/brick2/b3):1560:service_loop] GLUSTER: Register time: 1502348251
[2017-08-10 06:57:31.52994] I [master(/rhs/brick2/b3):466:mgmt_lock] _GMaster: Got lock : /rhs/brick2/b3 : Becoming ACTIVE
[2017-08-10 06:57:31.75947] I [gsyncdstatus(/rhs/brick2/b3):271:set_active] GeorepStatus: Worker Status: Active
[2017-08-10 06:57:31.88232] I [gsyncdstatus(/rhs/brick2/b3):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-08-10 06:57:31.89019] I [master(/rhs/brick2/b3):1167:crawl] _GMaster: starting history crawl... turns: 1, stime: (1502347714, 0), etime: 1502348251
[2017-08-10 06:57:31.738142] I [master(/rhs/brick1/b1):1195:crawl] _GMaster: slave's time: (1502347714, 0)
[2017-08-10 06:57:31.782474] E [syncdutils(/rhs/brick1/b1):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-10 06:57:31.783341] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-08-10 06:57:31.790743] I [repce(/rhs/brick1/b1):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-08-10 06:57:31.791493] I [syncdutils(/rhs/brick1/b1):237:finalize] <top>: exiting.
[2017-08-10 06:57:32.95484] I [master(/rhs/brick2/b3):1195:crawl] _GMaster: slave's time: (1502347714, 0)
[2017-08-10 06:57:32.137804] E [syncdutils(/rhs/brick2/b3):264:log_raise_exception] <top>: The above directory failed to sync. Please fix it to proceed further.
[2017-08-10 06:57:32.138684] I [syncdutils(/rhs/brick2/b3):237:finalize] <top>: exiting.

Comment 8 Rahul Hinduja 2017-08-10 07:13:22 UTC

Verified with the build: glusterfs-geo-replication-3.8.4-38.el7rhgs.x86_64

Carried the same steps mentioned in comment 7, the workers do not stuck in faulty state and new data creation is synced to slave successfully.

Moving this bug to verified state.

Comment 10 errata-xmlrpc 2017-09-21 04:59:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Note You need to log in before you can comment on or make changes to this bug.