Description of problem:
geo-rep start failed with "Staging failed" on a few nodes after multiple snapshot restores, and the geo-rep status for those nodes was "config corrupted". It looks like, for some reason, glusterd was looking at the wrong path on those nodes after restoration.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:50:51.471888] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-07-14 06:50:55.818258] I [glusterd-geo-rep.c:1835:glusterd_get_statefile_name] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.675504] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.692179] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:04.960108] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:51:13.622016] I [glusterd-geo-rep.c:3497:glusterd_read_status_file] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:13.912704] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912786] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick1/master_b1 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912842] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912861] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick5/master_b5 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912904] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912925] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick9/master_b9 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:14.204313] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.24-1.el6rhs

How reproducible:
I was able to hit the issue only once.

Steps to Reproduce:
1. Create a geo-rep setup.
2. Take multiple snapshots with geo-rep while I/O is happening on the master (follow the steps to create a snapshot with geo-rep).
3. Restore the 2 most recent snapshots, then restore one of the older snapshots (follow the steps to restore snapshots with geo-rep).
4. In the case where this crash happened, geo-rep start after the third snapshot restore failed with:
   "Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
   "Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
5. After this, restarting glusterd on the node where all the commands were executed resulted in glusterd crashing on both the nodes where staging failed; this was actually because it failed to create the geo-rep status file.

Actual results:
geo-rep start failed with "Staging failed" on a few nodes.

Expected results:
geo-rep start should succeed after snapshot restoration.

Additional info:
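The steps above can be sketched roughly as gluster CLI commands. This is a hedged sketch, not the exact procedure from the report: the volume and snapshot names (master, snap1..snap3) are illustrative assumptions, the slave URL is the one from the logged session, and the snapshot-with-geo-rep procedure is abbreviated to stopping and restarting the session around each snapshot.

```shell
#!/bin/sh
# Hedged reproduction sketch; requires a live GlusterFS 3.6 cluster with a
# slave volume and passwordless SSH already prepared. Volume and snapshot
# names are assumptions for illustration.
set -e

if command -v gluster >/dev/null 2>&1; then
    MASTER=master
    SLAVE=10.70.43.170::slave

    # 1. Create and start the geo-rep session.
    gluster volume geo-replication "$MASTER" "$SLAVE" create push-pem
    gluster volume geo-replication "$MASTER" "$SLAVE" start

    # 2. Take multiple snapshots while I/O runs on the master, stopping
    #    geo-rep around each one per the snapshot-with-geo-rep steps.
    for snap in snap1 snap2 snap3; do
        gluster volume geo-replication "$MASTER" "$SLAVE" stop
        gluster snapshot create "$snap" "$MASTER"
        gluster volume geo-replication "$MASTER" "$SLAVE" start
    done

    # 3. Restore the two most recent snapshots, then an older one
    #    (geo-rep and the volume must be stopped before each restore).
    for snap in snap3 snap2 snap1; do
        gluster volume geo-replication "$MASTER" "$SLAVE" stop
        gluster volume stop "$MASTER"
        gluster snapshot restore "$snap"
        gluster volume start "$MASTER"
    done

    # 4. This start is where staging failed in the reported run.
    gluster volume geo-replication "$MASTER" "$SLAVE" start
    result="attempted on live cluster"
else
    # Guard so the sketch is harmless on machines without the gluster CLI.
    result="gluster CLI not found; nothing executed"
fi
echo "$result"
```

Note that the sketch deliberately refuses to do anything on a machine without the gluster CLI, since every step mutates cluster state.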
Created attachment 917729 [details]
sosreport of all the nodes.
On the nodes where staging failed, it was seen that any external binary/script (in this case the gsyncd binary) run through glusterd's runner_run interface failed to run, whereas the same binary/script ran just fine when run independently. We also checked whether glusterd was running with the right permissions, and whether SELinux being enabled was preventing the runner_run interface from executing the binary: glusterd was running with the right permissions, and SELinux is disabled.
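The staging error above can be approximated with a quick check: staging rejects the start request when the session's gsyncd.conf carries no state-file entry. The config contents below, the key name state_file, and the monitor.status path are assumptions for illustration, not copied from a live node.

```shell
#!/bin/sh
# Minimal sketch of the failing check: a gsyncd.conf without a state-file
# entry makes geo-rep start staging fail. File contents are illustrative.
workdir=$(mktemp -d)

# A healthy session config carries the state-file entry (key name assumed).
printf '%s\n' \
    '[peersrx . .]' \
    'state_file = /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/monitor.status' \
    > "$workdir/ok.conf"

# The corrupted config seen after snapshot restore lacks that entry.
printf '%s\n' '[peersrx . .]' > "$workdir/bad.conf"

check_state_file() {
    # Missing entry is what produced "state-file entry missing" at staging.
    if grep -q '^state_file' "$1"; then
        echo "present"
    else
        echo "missing -> Staging failed"
    fi
}

ok_result=$(check_state_file "$workdir/ok.conf")
bad_result=$(check_state_file "$workdir/bad.conf")
echo "ok.conf:  $ok_result"
echo "bad.conf: $bad_result"
rm -rf "$workdir"
```

On an affected node, the same grep against /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf would show whether the restore left the entry out.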