Description of problem:
After a node reboot, "service glusterd status" reports the service as failed. However, "ps -ef | grep glusterd" shows the glusterd process running, and all glusterd commands work normally.

Note: snapshots were scheduled using the snapshot scheduler. There were 214 snapshots present in the system, out of which 100 were activated.

Version-Release number of selected component (if applicable):

How reproducible:
2/2

Steps to Reproduce:
1. Create a 3x2 distributed-replicate volume
2. Enable shared storage
3. Schedule snapshots using the snapshot scheduler
4. Reboot one of the server nodes

Actual results:
"service glusterd status" reports the service as failed even though glusterd is running.

Expected results:
"service glusterd status" should show the service as started after the node reboot.

Additional info:

[root@rhs-client47 ~]# pgrep glusterd
5369
==================================================
[root@rhs-client47 ~]# service glusterd status
Redirecting to /bin/systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Fri 2017-01-27 14:58:13 IST; 8min ago
  Process: 5359 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=killed, signal=TERM)
   CGroup: /system.slice/glusterd.service
=================================================
[root@rhs-client47 ~]# ps -ef | grep glusterd
root 5369 1 5 14:56 ? 00:00:34 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 12718 1 0 15:00 ?
00:00:00 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id gluster_shared_storage.10.70.36.71.var-lib-glusterd-ss_brick -p /var/lib/glusterd/vols/gluster_shared_storage/run/10.70.36.71-var-lib-glusterd-ss_brick.pid -S /var/run/gluster/b4b9de5f50f03a63767ac2ff35837e9b.socket --brick-name /var/lib/glusterd/ss_brick -l /var/log/glusterfs/bricks/var-lib-glusterd-ss_brick.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49152 --xlator-option gluster_shared_storage-server.listen-port=49152 root 12724 1 0 15:00 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id vol1.10.70.36.71.rhs-brick3-b6 -p /var/lib/glusterd/vols/vol1/run/10.70.36.71-rhs-brick3-b6.pid -S /var/run/gluster/d400b1c7ba59bb493bea0849a4a1f228.socket --brick-name /rhs/brick3/b6 -l /var/log/glusterfs/bricks/rhs-brick3-b6.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49154 --xlator-option vol1-server.listen-port=49154 root 12733 1 0 15:00 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id /snaps/snap0/f13392d210834b358190ed819871fbe9.10.70.36.71.run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick6-b6 -p /var/lib/glusterd/snaps/snap0/f13392d210834b358190ed819871fbe9/run/10.70.36.71-run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick6-b6.pid -S /var/run/gluster/8e4a712cc3999ec51996a70d35afb7f5.socket --brick-name /run/gluster/snaps/f13392d210834b358190ed819871fbe9/brick6/b6 -l /var/log/glusterfs/bricks/run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick6-b6.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49156 --xlator-option f13392d210834b358190ed819871fbe9-server.listen-port=49156 root 12739 1 0 15:00 ? 
00:00:01 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id /snaps/snap0/f13392d210834b358190ed819871fbe9.10.70.36.71.run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick2-b2 -p /var/lib/glusterd/snaps/snap0/f13392d210834b358190ed819871fbe9/run/10.70.36.71-run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick2-b2.pid -S /var/run/gluster/b64939264020b4f34d07a6529cb2eea3.socket --brick-name /run/gluster/snaps/f13392d210834b358190ed819871fbe9/brick2/b2 -l /var/log/glusterfs/bricks/run-gluster-snaps-f13392d210834b358190ed819871fbe9-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49155 --xlator-option f13392d210834b358190ed819871fbe9-server.listen-port=49155 root 12745 1 0 15:00 ? 00:00:01 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id /snaps/snap1/5f6b89e7b8e1413a98863d3cc046dbbe.10.70.36.71.run-gluster-snaps-5f6b89e7b8e1413a98863d3cc046dbbe-brick2-b2 -p /var/lib/glusterd/snaps/snap1/5f6b89e7b8e1413a98863d3cc046dbbe/run/10.70.36.71-run-gluster-snaps-5f6b89e7b8e1413a98863d3cc046dbbe-brick2-b2.pid -S /var/run/gluster/d27c49377a2d1fede1c5937937e585ba.socket --brick-name /run/gluster/snaps/5f6b89e7b8e1413a98863d3cc046dbbe/brick2/b2 -l /var/log/glusterfs/bricks/run-gluster-snaps-5f6b89e7b8e1413a98863d3cc046dbbe-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49157 --xlator-option 5f6b89e7b8e1413a98863d3cc046dbbe-server.listen-port=49157 root 12751 1 0 15:00 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.36.71 --volfile-id vol1.10.70.36.71.rhs-brick2-b2 -p /var/lib/glusterd/vols/vol1/run/10.70.36.71-rhs-brick2-b2.pid -S /var/run/gluster/3450cfa89579f17c22f82ab0b253c1c0.socket --brick-name /rhs/brick2/b2 -l /var/log/glusterfs/bricks/rhs-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d2106cc7-a569-420c-b8f1-67d0af8d9088 --brick-port 49153 --xlator-option vol1-server.listen-port=49153 ===================================================
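The reproduction steps above can be sketched with the gluster CLI. The volume name, peer hostnames, brick paths, and cron schedule below are illustrative assumptions, not taken from the reporter's environment:

```shell
# Illustrative sketch of the setup (volume name, hostnames, brick paths,
# and the cron schedule are assumptions):
gluster volume create vol1 replica 2 \
  server1:/rhs/brick1/b1 server2:/rhs/brick1/b2 \
  server1:/rhs/brick2/b3 server2:/rhs/brick2/b4 \
  server1:/rhs/brick3/b5 server2:/rhs/brick3/b6
gluster volume start vol1

# Enable the shared storage volume required by the snapshot scheduler
gluster volume set all cluster.enable-shared-storage enable

# Initialise the scheduler and add a recurring snapshot job
snap_scheduler.py init
snap_scheduler.py add "snap_job" "*/30 * * * *" "vol1"
```

With enough activated snapshots accumulated this way, rebooting one server node reproduces the status mismatch shown above.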
Every time a node reboots, glusterd has to start a large number of bricks (because of the number of activated snapshots) one after another, which delays glusterd's startup. "systemctl status glusterd" therefore shows glusterd as failed even though the process is running: systemd's start timeout expires before glusterd fully comes up and responds back to systemd. We won't be able to address this directly as it's a design limitation. With brick multiplexing coming in, this should improve; we should revisit this test once the brick multiplexing feature is in upstream and evaluate whether anything more has to be done. For now, exploring systemd's start timeout may give us a workaround to live with this problem.
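As a sketch of the systemd-side workaround mentioned above, a drop-in file could raise the unit's start timeout so glusterd has time to bring up all the brick processes. The 300-second value is an assumption and would need to be tuned to the number of bricks and snapshots:

```shell
# Hypothetical drop-in raising glusterd's start timeout (300s is an
# assumed value; tune it to the number of bricks/activated snapshots):
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=300
EOF
systemctl daemon-reload
systemctl restart glusterd
```

A drop-in is preferable to editing /usr/lib/systemd/system/glusterd.service directly, since package updates overwrite the shipped unit file but leave /etc drop-ins intact.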
(In reply to Atin Mukherjee from comment #3)

One caveat here: snapshot bricks are not multiplexed, so even with brick multiplexing enabled, the startup delay cannot be eliminated completely.