Description of problem:
With the default unit file /usr/lib/systemd/system/glusterd.service, the service does not reliably start at boot time
Version-Release number of selected component (if applicable):
gluster --version => glusterfs 3.9.0 built on Nov 15 2016 16:22:39
How reproducible:
Very - not 100%, but in the 90% range
Steps to Reproduce:
1. Install gluster and configure a replicated volume
2. Reboot one of the servers
Actual results:
Most boots: systemctl status glusterd => service failed to start
Occasionally: systemctl status glusterd => service is active (running)
Infrastructure: two Fedora VMs running under VirtualBox on a Windows host.
When the service fails to start, a manual start always works. This suggests to me that there's some sort of undeclared dependency, but it's not clear (to me at least) what it could be. To work around the problem, I added the lines "Restart=on-failure" and "RestartSec=1" to the unit file. With this addition there are still startup failures during boot, but the service is running by the time I reboot either of the VMs and log in.
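The retry workaround above can also be applied as a systemd drop-in instead of editing the packaged unit file, so it survives package upgrades. A minimal sketch (the drop-in filename is illustrative):

```ini
# /etc/systemd/system/glusterd.service.d/retry.conf
# Retry a failed glusterd start once per second until it succeeds.
[Service]
Restart=on-failure
RestartSec=1
```

After creating the file, run "systemctl daemon-reload" so systemd picks up the drop-in.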
Could you also attach glusterd log file ( /var/log/glusterfs/glusterd.log ) ?
Created attachment 1241668
Created attachment 1241669
Created attachment 1241670
(In reply to SATHEESARAN from comment #1)
> Could you also attach glusterd log file ( /var/log/glusterfs/glusterd.log ) ?
Done. These logs are "clean" in that I deleted the old log files and then rebooted. The other trusted server was already running. As expected, systemctl reports that the glusterd service is active (running) after the boot that created these logs.
[2017-01-17 10:51:58.122805] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid] 0-management: returning -1
[2017-01-17 10:51:58.122819] D [MSGID: 0] [glusterd-utils.c:1009:glusterd_resolve_brick] 0-management: Returning -1
[2017-01-17 10:51:58.122833] E [MSGID: 106187] [glusterd-store.c:4408:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2017-01-17 10:51:58.122848] D [MSGID: 0] [glusterd-store.c:4481:glusterd_restore] 0-management: Returning -1
The above indicates that the network interface might not have been up before glusterd tried to resolve the address of the brick. What OS baseline are you using?
This is an issue we've seen before. GlusterD starts before the network is online. The 'Network is unreachable' error and the 'resolve brick failed' error point to this. This shows up particularly in systemd based systems, like RHEL7.
https://bugzilla.redhat.com/show_bug.cgi?id=1260007#c3 explains what is happening in detail.
tl;dr: systemd starts GlusterD after the network devices are available, not after the network is online. The network-online.target is only reached after GlusterD has started, so that GlusterFS _netdev self-mounts work correctly.
This is not an easy problem to solve correctly. A solution is described in the linked comment, which can help if self-mounts are involved. If self-mounts are not present, you could force glusterd to start after network-online.target by modifying its service unit file.
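For hosts with no GlusterFS self-mounts, the ordering change suggested above could be expressed as a drop-in rather than an edit to the shipped unit. A sketch, assuming the drop-in filename (delaying glusterd until the network is online is safe only when nothing during boot depends on glusterd starting earlier):

```ini
# /etc/systemd/system/glusterd.service.d/wait-for-network.conf
# Do not start glusterd until network-online.target has been reached,
# so brick addresses can be resolved at startup.
[Unit]
Wants=network-online.target
After=network-online.target
```

Note that network-online.target only behaves as expected when a network-wait service (e.g. NetworkManager-wait-online.service) is enabled.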
I can confirm this same behavior on Gluster 3.9.1 running on Fedora 25. Gluster was installed from the dnf repositories.
I can reproduce this problem 100% of the time. The steps I have taken to reproduce it are listed below:
1) Roll two clean Fedora 25 Servers
2) Generate /etc/hosts files on both servers for name resolution
3) Install GlusterFS on both hosts using dnf install glusterfs-server glusterfs-ganesha
4) Probe Gluster peers
5) Run gluster volume set all cluster.enable-shared-storage enable
- After waiting a few moments, the shared storage location will mount at /run/gluster/shared_storage
6) Reboot either node.
7) The restarted node will have failed to start glusterd.service, and the shared storage will not be mounted.
The service can be started manually once the server has completely restarted, but a mount -a must then be run to mount the shared storage.
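As a stopgap for the unmounted shared storage, the fstab entry created when shared storage is enabled could be extended with systemd mount options so the mount is ordered after glusterd. A sketch only; the server name is illustrative, and this does not fix the underlying glusterd start failure:

```
# /etc/fstab (illustrative entry; replace server1 with a node in the trusted pool)
server1:/gluster_shared_storage  /run/gluster/shared_storage  glusterfs  defaults,_netdev,x-systemd.requires=glusterd.service  0 0
```

The _netdev option defers the mount until the network is up, and x-systemd.requires makes systemd pull in and order the mount after glusterd.service.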
This is exceptionally problematic, as we have attempted to roll an HA cluster using gluster nfs-ganesha enable. Since this leverages shared storage starting in 3.9, restarting a node after enabling the cluster causes the node to fail to resume its cluster services.
Manually starting gluster, mounting the volume, and restarting corosync, pacemaker, etc does not put the node back into production.
This problem cannot be reproduced on CentOS 7: we have another cluster running Gluster 3.8 on CentOS 7 that does not show this behavior.
One more thing to note! Running Gluster 3.9.1 on Fedora 24 does NOT seem to exhibit this behavior.
The services start as expected.
This bug is getting closed because GlusterFS-3.9 has reached its end-of-life.
Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please open a new bug against the newer release.