Bug 1412240
| Field | Value |
|---|---|
| Summary | Glusterd does not reliably start at boot |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Version | 3.9 |
| Hardware | x86_64 |
| OS | Other |
| Status | CLOSED EOL |
| Severity | low |
| Priority | unspecified |
| Reporter | RW Shore <rws228> |
| Assignee | Atin Mukherjee <amukherj> |
| CC | bugs, nybble, rws228, sasundar |
| Keywords | Triaged |
| Type | Bug |
| Last Closed | 2017-03-08 12:33:55 UTC |
Description (RW Shore, 2017-01-11 15:26:51 UTC)

Comment 1 (SATHEESARAN):
Hello, could you also attach the glusterd log file (/var/log/glusterfs/glusterd.log)?

Comment (RW Shore):
Created attachment 1241668 [details]: glusterd.log
Created attachment 1241669 [details]: glusterfs-volume.log
Created attachment 1241670 [details]: glustershd.log

(In reply to SATHEESARAN from comment #1)
> Could you also attach glusterd log file ( /var/log/glusterfs/glusterd.log ) ?

Done. These logs are "clean" in that I deleted the old log files and then rebooted. The other trusted server was already running. As expected, systemctl reports that the glusterd service is active (running) after the boot that created these logs.

Comment:

    [2017-01-17 10:51:58.122805] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid] 0-management: returning -1
    [2017-01-17 10:51:58.122819] D [MSGID: 0] [glusterd-utils.c:1009:glusterd_resolve_brick] 0-management: Returning -1
    [2017-01-17 10:51:58.122833] E [MSGID: 106187] [glusterd-store.c:4408:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
    [2017-01-17 10:51:58.122848] D [MSGID: 0] [glusterd-store.c:4481:glusterd_restore] 0-management: Returning -1

The above indicates that the network interface might not have been up before glusterd tried to resolve the address of the brick. What OS baseline are you using?

Comment:
This is an issue we've seen before: GlusterD starts before the network is online. The "Network is unreachable" error and the "resolve brick failed" error both point to this. It shows up particularly on systemd-based systems, like RHEL 7. https://bugzilla.redhat.com/show_bug.cgi?id=1260007#c3 explains what is happening in detail. tl;dr: systemd starts GlusterD after the network devices are available, not after the network is online. The network-online.target is reached after GlusterD is started, to allow GlusterFS self _netdev mounts to work correctly.

This is not an easy problem to solve correctly. A solution is described in the linked comment, which can help if self mounts are involved. If self-mounts are not present, you could force glusterd to start after network-online.target by modifying its service unit file.

Comment:
I can confirm this same behavior on Gluster 3.9.1 running on Fedora 25. Gluster was installed from the dnf repositories.
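The unit-file workaround mentioned above can be sketched as a systemd drop-in (the file name is illustrative; this assumes no GlusterFS self _netdev mounts, and that a wait-online service such as NetworkManager-wait-online.service is enabled so that network-online.target actually waits for connectivity):

```
# /etc/systemd/system/glusterd.service.d/wait-online.conf (name illustrative)
# Delay glusterd until the network is online so brick addresses can resolve.
[Unit]
Wants=network-online.target
After=network-online.target
```

After creating the drop-in, run `systemctl daemon-reload` so the override takes effect on the next boot.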
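As an illustration, the boot-time race has a distinctive signature in glusterd.log: the MSGID 106187 "resolve brick failed in restore" error quoted above. A minimal sketch of scanning for it (the sample log line is copied from this report; the temp-file setup is purely illustrative, not part of the original discussion):

```shell
# Synthetic sample containing the error line quoted in this bug.
log=$(mktemp)
cat > "$log" <<'EOF'
[2017-01-17 10:51:58.122833] E [MSGID: 106187] [glusterd-store.c:4408:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
EOF

# MSGID 106187 ("resolve brick failed in restore") indicates glusterd tried
# to resolve brick addresses before the network was online.
detected=no
if grep -q 'resolve brick failed in restore' "$log"; then
    detected=yes
    echo "boot-time brick resolution failure detected"
fi
rm -f "$log"
```

On a real system the path to check would be /var/log/glusterfs/glusterd.log, as requested in comment 1.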
Comment:
I have 100% reproducibility with this problem. The steps I take to reproduce it:

1) Roll two clean Fedora 25 servers.
2) Generate /etc/hosts files on both servers for name resolution.
3) Install GlusterFS on both hosts using dnf install glusterfs-server glusterfs-ganesha.
4) Probe Gluster peers.
5) Run gluster volume set all cluster.enable-shared-storage enable. After waiting a few moments the shared storage will mount at /run/gluster/shared_storage.
6) Reboot either node.
7) The restarted node will have failed to start glusterd.service, and the shared storage will not be mounted.

The service can be started manually once the server has completely restarted, but a mount -a must then be run to mount the shared storage. This is exceptionally problematic, as we have attempted to roll an HA cluster using gluster nfs-ganesha enable. Since this leverages shared storage starting in 3.9, restarting a node after enabling the cluster causes the node to fail to resume its cluster services. Manually starting gluster, mounting the volume, and restarting corosync, pacemaker, etc. does not put the node back into production.

This problem cannot be reproduced on CentOS 7; we have another Gluster 3.8 cluster running on CentOS. One more thing to note: running Gluster 3.9.1 on Fedora 24 does NOT seem to exhibit this behavior. The services start as expected.

Comment:
This bug is being closed because GlusterFS 3.9 has reached its end of life [1].

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists in newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please open a new bug against the newer release.

[1]: https://www.gluster.org/community/release-schedule/