Bug 1119228 - [Dist-geo-rep] geo-rep start failed with Staging failed on few nodes after multiple snap-shot restore.
Summary: [Dist-geo-rep] geo-rep start failed with Staging failed on few nodes after multiple snap-shot restore.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: usability
Depends On:
Blocks:
 
Reported: 2014-07-14 10:27 UTC by Vijaykumar Koppad
Modified: 2018-04-16 15:56 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-16 15:56:13 UTC
Embargoed:


Attachments
sosreport of all the nodes (60 bytes, text/plain)
2014-07-14 10:33 UTC, Vijaykumar Koppad

Description Vijaykumar Koppad 2014-07-14 10:27:43 UTC
Description of problem: geo-rep start failed with "Staging failed" on a few nodes after multiple snapshot restores, and the geo-rep status for those nodes was "Config Corrupted". It looks like, for some reason, glusterd was looking at the wrong path on those nodes after restoration.
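
For context, a minimal sketch of the commands involved, with the volume and slave names taken from the logs below (output abbreviated):

# Per-node session state; on the affected nodes the status showed "Config Corrupted".
gluster volume geo-replication master 10.70.43.170::slave status

# Attempting to start the session then failed with "Staging failed" on those nodes.
gluster volume geo-replication master 10.70.43.170::slave start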

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:50:51.471888] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-07-14 06:50:55.818258] I [glusterd-geo-rep.c:1835:glusterd_get_statefile_name] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.675504] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.692179] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:04.960108] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
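
A quick way to check for the missing entry on the nodes that reported "Staging failed"; the key name state_file is an assumption about the exact spelling used in gsyncd.conf:

# Look for the state-file entry directly in the session config file.
grep state_file /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf

# Or query the same option through the CLI.
gluster volume geo-replication master 10.70.43.170::slave config state_file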

[2014-07-14 06:51:13.622016] I [glusterd-geo-rep.c:3497:glusterd_read_status_file] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:13.912704] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912786] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick1/master_b1 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912842] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912861] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick5/master_b5 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912904] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912925] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick9/master_b9 brick for  master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:14.204313] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


Version-Release number of selected component (if applicable): glusterfs-3.6.0.24-1.el6rhs


How reproducible: I was able to hit the issue only once.


Steps to Reproduce:
1. Create a geo-rep setup.
2. Take multiple snapshots with geo-rep while I/O is happening on the master (following the documented steps to create a snapshot with geo-rep).
3. Restore the 2 most recent snapshots, then restore one of the older snapshots (following the documented steps to restore snapshots with geo-rep); see the command sketch below.
4. In the case where this crash happened, after the third snapshot restore, geo-rep start failed with:
"Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
"Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"

5. After this, restarting glusterd on the node where all the commands were executed resulted in glusterd crashing on both of the nodes where staging had failed, which was actually because it failed to create the geo-rep status file.
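
A rough command-level sketch of steps 1-4 above, assuming the volume and slave names from this report; the snapshot names (snap1 etc.) are made up for illustration, and the parts of the documented procedure that act on the slave volume are omitted:

# 1. Create and start the geo-rep session.
gluster volume geo-replication master 10.70.43.170::slave create push-pem
gluster volume geo-replication master 10.70.43.170::slave start

# 2. Take snapshots while I/O runs on the master, pausing/resuming the
#    session around each snapshot.
gluster volume geo-replication master 10.70.43.170::slave pause
gluster snapshot create snap1 master
gluster volume geo-replication master 10.70.43.170::slave resume
# ... repeat for the remaining snapshots ...

# 3. Restore snapshots (session stopped and volume stopped before each restore).
gluster volume geo-replication master 10.70.43.170::slave stop
gluster volume stop master
gluster snapshot restore snap1
gluster volume start master
# ... repeat for the other restores ...

# 4. Starting the session again is where staging failed.
gluster volume geo-replication master 10.70.43.170::slave start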


Actual results: geo-rep start failed with "Staging failed" on a few nodes.


Expected results: Geo-rep start should succeed after snapshot restoration. 


Additional info:

Comment 1 Vijaykumar Koppad 2014-07-14 10:33:01 UTC
Created attachment 917729
sosreport of all the nodes.

Comment 3 Avra Sengupta 2014-07-16 06:29:28 UTC
On the nodes where staging failed, it was seen that any external binary/script (in this case the gsyncd binary) run through glusterd's runner_run interface failed to run, whereas the same binary/script ran just fine when invoked independently. We also checked whether glusterd was running with the right permissions, and whether SELinux being enabled was preventing the runner_run interface from executing the binary; we found that glusterd was running with the right permissions and that SELinux was disabled.
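
A sketch of the checks described above; the gsyncd path is the usual location on RHS and is an assumption, adjust for the actual install:

# The same binary runs fine when invoked directly, outside glusterd
# (invoked here just to confirm it executes).
/usr/libexec/glusterfs/gsyncd --version

# glusterd itself runs with the expected (root) permissions.
ps -o user,pid,comm -C glusterd

# And SELinux is not enforcing.
getenforce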

