Bug 1119228

Summary: [Dist-geo-rep] geo-rep start failed with "Staging failed" on a few nodes after multiple snapshot restores.
Product: Red Hat Gluster Storage [Red Hat Storage]
Component: geo-replication
Version: rhgs-3.0
Reporter: Vijaykumar Koppad <vkoppad>
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
QA Contact: storage-qa-internal <storage-qa-internal>
CC: avishwan, chrisw, csaba, david.macdonald, mzywusko, nlevinki, smohan
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: usability
Type: Bug
Doc Type: Bug Fix
Regression: ---
Mount Type: ---
Documentation: ---
Story Points: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Last Closed: 2018-04-16 15:56:13 UTC

Attachments:
sosreport of all the nodes

Description Vijaykumar Koppad 2014-07-14 10:27:43 UTC
Description of problem: geo-rep start failed with "Staging failed" on a few nodes after multiple snapshot restores, and the geo-rep status for those nodes was "config corrupted". It looks like, for some reason, glusterd was looking at the wrong path on those nodes after the restoration.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:50:51.471888] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-07-14 06:50:55.818258] I [glusterd-geo-rep.c:1835:glusterd_get_statefile_name] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.675504] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.692179] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:04.960108] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
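
One way to narrow down a staging failure like the one in the excerpt above is to check, on the nodes that reported the error, whether the state-file entry is actually present in the session's gsyncd.conf, and what glusterd reports for the session. A minimal sketch, assuming the volume/slave names and config path from the logs (the key may be spelled state_file or state-file depending on the version):

# On each node that reported "Staging failed" (10.70.43.107, 10.70.43.162):
grep -E 'state[-_]file' /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf

# Compare with what glusterd reports for the session:
gluster volume geo-replication master 10.70.43.170::slave config
gluster volume geo-replication master 10.70.43.170::slave status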

[2014-07-14 06:51:13.622016] I [glusterd-geo-rep.c:3497:glusterd_read_status_file] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:13.912704] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912786] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick1/master_b1 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912842] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912861] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick5/master_b5 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912904] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912925] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick9/master_b9 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:14.204313] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
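
The "Unable to read gsyncd status file" errors above suggest glusterd is looking for per-brick status files that are not where it expects them after the restore. A minimal sketch of how one might confirm this on an affected node, assuming the session working directory from the logs (the exact status-file names vary between versions):

# What is actually present in the session working directory:
ls -l /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/

# Any status files under the geo-replication working directories:
find /var/lib/glusterd/geo-replication/ -name '*.status'

# Whether the snapshot brick paths named in the log exist at all:
ls -ld /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick1/master_b1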


Version-Release number of selected component (if applicable): glusterfs-3.6.0.24-1.el6rhs


How reproducible: I was able to hit the issue only once.


Steps to Reproduce:
1. Create a geo-rep setup (a hedged CLI sketch of this sequence is given after these steps).
2. Take multiple snapshots with geo-rep while I/O is happening on the master (following the steps to create snapshots with geo-rep).
3. Restore the two most recent snapshots one after the other, then restore one of the older snapshots (following the steps to restore snapshots with geo-rep).
4. In the case where this crash happened, after the third snapshot restore, geo-rep start failed with "Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)" and "Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)".

5. After this, restarting glusterd on the node where all the commands were executed resulted in a glusterd crash on both the nodes where staging had failed, which was actually because it failed to create the geo-rep status file.
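
For reference, a rough sketch of the kind of CLI sequence the steps above describe; this is not the full documented snapshot-with-geo-rep procedure, and the snapshot name snap1 and the use of pause/resume around the snapshot are assumptions modeled on this setup (master volume "master", slave 10.70.43.170::slave):

# 1. Start the geo-rep session
gluster volume geo-replication master 10.70.43.170::slave start

# 2. Take a snapshot while I/O runs on the master (repeated for multiple snapshots)
gluster volume geo-replication master 10.70.43.170::slave pause
gluster snapshot create snap1 master
gluster volume geo-replication master 10.70.43.170::slave resume

# 3. Restore a snapshot (geo-rep session and volume stopped first)
gluster volume geo-replication master 10.70.43.170::slave stop
gluster volume stop master
gluster snapshot restore snap1
gluster volume start master

# 4. The step that failed with "Staging failed" on two nodes
gluster volume geo-replication master 10.70.43.170::slave start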


Actual results: geo-rep start failed with "Staging failed" on a few nodes.


Expected results: Geo-rep start should succeed after snapshot restoration. 


Additional info:

Comment 1 Vijaykumar Koppad 2014-07-14 10:33:01 UTC
Created attachment 917729 [details]
sosreport of all the nodes.

Comment 3 Avra Sengupta 2014-07-16 06:29:28 UTC
On the nodes where staging failed, it was seen that any external binary/script (in this case the gsyncd binary) run through glusterd's runner_run interface failed to run, whereas the same binary/script ran just fine when run independently. We also checked whether glusterd was running with the right permissions, or whether SELinux being enabled was preventing the runner_run interface from executing the binary, and found that glusterd was running with the right permissions and that SELinux was disabled.
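
For context, the checks described above amount to something like the following on the affected nodes. A minimal sketch; the gsyncd path is the usual packaging location and is an assumption here:

# Run the same binary glusterd invokes through runner_run, but by hand:
/usr/libexec/glusterfs/gsyncd --version   # path is an assumption; adjust if packaged elsewhere

# Check which user glusterd itself is running as:
ps -o user,pid,cmd -C glusterd

# Check the SELinux state on the node:
getenforce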