Description of problem:
geo-rep start failed with "Staging failed" on a few nodes after multiple snapshot restores, and the geo-rep status for those nodes was "config corrupted". It looks like, for some reason, glusterd was looking at the wrong path on those nodes after restoration.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:50:51.471888] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-07-14 06:50:55.818258] I [glusterd-geo-rep.c:1835:glusterd_get_statefile_name] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.675504] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:50:57.692179] E [glusterd-syncop.c:160:gd_collate_errors] 0-: Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:04.960108] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 06:51:13.622016] I [glusterd-geo-rep.c:3497:glusterd_read_status_file] 0-: Using passed config template(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf).
[2014-07-14 06:51:13.912704] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912786] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick1/master_b1 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912842] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912861] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick5/master_b5 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:13.912904] E [glusterd-geo-rep.c:3187:glusterd_gsync_read_frm_status] 0-: Unable to read gsyncd status file
[2014-07-14 06:51:13.912925] E [glusterd-geo-rep.c:3584:glusterd_read_status_file] 0-: Unable to read the statusfile for /var/run/gluster/snaps/d323c7a44453466da4145761575da350/brick9/master_b9 brick for master(master), 10.70.43.170::slave(slave) session
[2014-07-14 06:51:14.204313] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.24-1.el6rhs

How reproducible:
I was able to hit the issue only once.

Steps to Reproduce:
1. Create a geo-rep setup.
2. Take multiple snapshots with geo-rep while I/O is happening on the master (follow the steps to create a snapshot with geo-rep).
3. Restore the 2 most recent snapshots, then restore one of the older snapshots (follow the steps to restore snapshots with geo-rep).
4. In the case where this crash happened, geo-rep start after the third snapshot restore failed with:
   "Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
   "Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
5. After this, restarting glusterd on the node where all the commands were executed resulted in glusterd crashing on both the nodes where staging failed; this was actually because it failed to create the geo-rep status file.

Actual results:
geo-rep start failed with "Staging failed" on a few nodes.

Expected results:
geo-rep start should succeed after snapshot restoration.

Additional info:
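The steps above can be sketched roughly as gluster CLI commands. This is a hedged sketch, not the exact procedure from the report: the volume and snapshot names (master, snap1..snap3) are illustrative assumptions, the slave URL is the one from the logged session, and the snapshot-with-geo-rep procedure is abbreviated to stopping and restarting the session around each snapshot.

```shell
#!/bin/sh
# Hedged reproduction sketch; requires a live GlusterFS 3.6 cluster with a
# slave volume and passwordless SSH already prepared. Volume and snapshot
# names are assumptions for illustration.
set -e

if command -v gluster >/dev/null 2>&1; then
    MASTER=master
    SLAVE=10.70.43.170::slave

    # 1. Create and start the geo-rep session.
    gluster volume geo-replication "$MASTER" "$SLAVE" create push-pem
    gluster volume geo-replication "$MASTER" "$SLAVE" start

    # 2. Take multiple snapshots while I/O runs on the master, stopping
    #    geo-rep around each one per the snapshot-with-geo-rep steps.
    for snap in snap1 snap2 snap3; do
        gluster volume geo-replication "$MASTER" "$SLAVE" stop
        gluster snapshot create "$snap" "$MASTER"
        gluster volume geo-replication "$MASTER" "$SLAVE" start
    done

    # 3. Restore the two most recent snapshots, then an older one
    #    (geo-rep and the volume must be stopped before each restore).
    for snap in snap3 snap2 snap1; do
        gluster volume geo-replication "$MASTER" "$SLAVE" stop
        gluster volume stop "$MASTER"
        gluster snapshot restore "$snap"
        gluster volume start "$MASTER"
    done

    # 4. This start is where staging failed in the reported run.
    gluster volume geo-replication "$MASTER" "$SLAVE" start
    result="attempted on live cluster"
else
    # Guard so the sketch is harmless on machines without the gluster CLI.
    result="gluster CLI not found; nothing executed"
fi
echo "$result"
```

Note that the sketch deliberately refuses to do anything on a machine without the gluster CLI, since every step mutates cluster state.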
Created attachment 917729 [details]
sosreport of all the nodes.
On the nodes where staging failed, it was seen that any external binary/script (in this case the gsyncd binary) run through glusterd's runner_run interface failed to run, whereas the same binary/script ran just fine when run independently. We also checked whether glusterd was running with the right permissions, and whether SELinux being enabled was preventing the runner_run interface from executing the binary: glusterd was running with the right permissions, and SELinux is disabled.
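The staging error above can be approximated with a quick check: staging rejects the start request when the session's gsyncd.conf carries no state-file entry. The config contents below, the key name state_file, and the monitor.status path are assumptions for illustration, not copied from a live node.

```shell
#!/bin/sh
# Minimal sketch of the failing check: a gsyncd.conf without a state-file
# entry makes geo-rep start staging fail. File contents are illustrative.
workdir=$(mktemp -d)

# A healthy session config carries the state-file entry (key name assumed).
printf '%s\n' \
    '[peersrx . .]' \
    'state_file = /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/monitor.status' \
    > "$workdir/ok.conf"

# The corrupted config seen after snapshot restore lacks that entry.
printf '%s\n' '[peersrx . .]' > "$workdir/bad.conf"

check_state_file() {
    # Missing entry is what produced "state-file entry missing" at staging.
    if grep -q '^state_file' "$1"; then
        echo "present"
    else
        echo "missing -> Staging failed"
    fi
}

ok_result=$(check_state_file "$workdir/ok.conf")
bad_result=$(check_state_file "$workdir/bad.conf")
echo "ok.conf:  $ok_result"
echo "bad.conf: $bad_result"
rm -rf "$workdir"
```

On an affected node, the same grep against /var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf would show whether the restore left the entry out.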