Bug 1412240

Summary: Glusterd does not reliably start at boot
Product: [Community] GlusterFS
Reporter: RW Shore <rws228>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED EOL
Severity: low
Priority: unspecified
Version: 3.9
CC: bugs, nybble, rws228, sasundar
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Other
Last Closed: 2017-03-08 12:33:55 UTC
Type: Bug
Attachments:
glusterd.log
glusterfs-volume.log
glustershd.log

Description RW Shore 2017-01-11 15:26:51 UTC
Description of problem:
With the default unit file /usr/lib/systemd/system/glusterd.service, the service does not reliably start at boot time

Version-Release number of selected component (if applicable):
gluster --version => glusterfs 3.9.0 built on Nov 15 2016 16:22:39

How reproducible:
Very reproducible: not 100% of boots, but roughly 90%

Steps to Reproduce:
1. Install gluster and configure a replicated volume (a sketch of the setup commands follows after this list)
2. Reboot one of the servers
3. On the rebooted server, run systemctl status glusterd
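
For reference, a minimal sketch of the step 1 setup; hostnames, brick paths, and the volume name here are illustrative, not the exact values used for this report:

# On both servers: install and start the service (Fedora packages assumed)
dnf install -y glusterfs-server
systemctl enable --now glusterd

# From server1: form the trusted pool and create a two-way replicated volume
gluster peer probe server2
gluster volume create testvol replica 2 server1:/bricks/brick1 server2:/bricks/brick1
gluster volume start testvol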

Actual results:
systemctl status glusterd => service failed to start

Expected results:
systemctl status glusterd => service is active (running)

Additional info:
Infrastructure: two Fedora VMs running under VirtualBox on a Windows host.

When the service fails to start, a manual start always works. This suggests to me that there's some sort of undeclared dependency, but it's not clear (to me at least) what it could be. To hack around the problem, I added the lines "Restart=on-failure" and "RestartSec=1" to the unit file. With this addition there are still startup failures during boot, but the service is running by the time I can log in after rebooting either VM.
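
For completeness, the same workaround expressed as a systemd drop-in instead of editing the packaged unit; a minimal sketch, with the drop-in file name being illustrative:

# Equivalent to appending the two lines to /usr/lib/systemd/system/glusterd.service
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/50-restart.conf <<'EOF'
[Service]
Restart=on-failure
RestartSec=1
EOF
systemctl daemon-reload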

Comment 1 SATHEESARAN 2017-01-11 17:38:00 UTC
Hello,

Could you also attach glusterd log file ( /var/log/glusterfs/glusterd.log ) ?

Comment 2 RW Shore 2017-01-17 10:57:32 UTC
Created attachment 1241668 [details]
glusterd.log

Comment 3 RW Shore 2017-01-17 10:58:16 UTC
Created attachment 1241669 [details]
glusterfs-volume.log

Comment 4 RW Shore 2017-01-17 10:58:42 UTC
Created attachment 1241670 [details]
glustershd.log

Comment 5 RW Shore 2017-01-17 11:00:41 UTC
(In reply to SATHEESARAN from comment #1)
> Hello,
> 
> Could you also attach glusterd log file ( /var/log/glusterfs/glusterd.log ) ?

Done. These logs are "clean" in that I deleted the old log files and then rebooted. The other server in the trusted pool was already running. As expected, systemctl reports that the glusterd service is active (running) after the boot that created these logs.

Comment 6 Atin Mukherjee 2017-01-23 13:57:55 UTC
[2017-01-17 10:51:58.122805] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid] 0-management: returning -1
[2017-01-17 10:51:58.122819] D [MSGID: 0] [glusterd-utils.c:1009:glusterd_resolve_brick] 0-management: Returning -1
[2017-01-17 10:51:58.122833] E [MSGID: 106187] [glusterd-store.c:4408:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2017-01-17 10:51:58.122848] D [MSGID: 0] [glusterd-store.c:4481:glusterd_restore] 0-management: Returning -1

The above indicates that the network interface might not have been up before glusterd tried to resolve the address of the brick. What OS baseline are you using?

This is an issue we've seen before: GlusterD starts before the network is online. The 'Network is unreachable' error and the 'resolve brick failed' error point to this. It shows up particularly on systemd-based systems such as RHEL 7.

https://bugzilla.redhat.com/show_bug.cgi?id=1260007#c3 explains what is happening in detail.
tl;dr: systemd starts GlusterD after the network devices are available, not after the network is online. The network-online.target is reached after GlusterD is started, to allow GlusterFS _netdev self-mounts to work correctly.

This is not an easy problem to solve correctly. A solution is described in the linked comment, which can help if self-mounts are involved. If self-mounts are not present, you could force glusterd to start after network-online.target by modifying its service unit file.
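
A minimal sketch of that unit change, assuming no GlusterFS _netdev self-mounts are present (the drop-in file name is illustrative). Note that network-online.target only actually waits for the network when the distribution's wait-online service is enabled:

# Order glusterd after the network is online (only safe without self-mounts)
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/50-network-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload
# On Fedora/RHEL with NetworkManager, the wait-online service must be enabled
# for network-online.target to mean anything:
systemctl enable NetworkManager-wait-online.service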

Comment 7 Daniel Scime 2017-02-25 06:19:53 UTC
I can confirm this same behavior on Gluster 3.9.1 running on Fedora 25.  Gluster was installed from the dnf repositories.

I have 100% reproducibility with this problem.  The steps I have taken to reproduce it are listed below:

1)  Roll two clean Fedora 25 servers
2)  Generate /etc/hosts files on both servers for name resolution
3)  Install GlusterFS on both hosts using dnf install glusterfs-server glusterfs-ganesha
4)  Probe the Gluster peers
5)  Run gluster volume set all cluster.enable-shared-storage enable
    - After waiting a few moments, the shared storage volume mounts at /run/gluster/shared_storage
6)  Reboot either node.
7)  The restarted node will have failed to start glusterd.service and the shared storage will not be mounted


The service can be started manually once the server has completely restarted, but a mount -a must be run to mount the shared storage.
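
The manual recovery described above amounts to the following (sketch; the shared storage entry is assumed to have been added to /etc/fstab by enable-shared-storage, hence mount -a):

systemctl start glusterd    # start the daemon by hand after boot completes
mount -a                    # re-attempt the shared-storage mount from fstab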

This is exceptionally problematic, as we have attempted to roll an HA cluster using gluster nfs-ganesha enable.  Since this leverages shared storage starting in 3.9, restarting a node after enabling the cluster causes the node to fail to resume its cluster services.

Manually starting gluster, mounting the volume, and restarting corosync, pacemaker, etc. does not put the node back into production.

We cannot reproduce this problem on CentOS 7, where we have another cluster running Gluster 3.8.

Comment 8 Daniel Scime 2017-02-25 06:28:29 UTC
One more thing to note!  Running Gluster 3.9.1 on Fedora 24 does NOT seem to exhibit this behavior.

The services start as expected.

Comment 9 Kaushal 2017-03-08 12:33:55 UTC
This bug is getting closed because GlusterFS-3.9 has reached its end-of-life [1].

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please open a new bug against the newer release.

[1]: https://www.gluster.org/community/release-schedule/