Description of problem:
When attempting to start/restart gluster, volumes fail to start. Logs indicate timeout issues.

Version-Release number of selected component (if applicable):
glusterfs-server-3.8.8-1.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Restart gluster.
2. Wait.
3. Run "gluster volume status". Statedump and related commands also return "Error : Request timed out".

Actual results:
Error : Request timed out

Expected results:
Volume status returns.

Additional info:
Created attachment 1242236 [details]
logfile (sanitized domain name)

Log file in debug mode; it expands to ~50 MB.
These rpc timeouts occur on all servers.
Hi Joe,

Yes, we are seeing these timeouts on all the servers. Port 24007 is open on all hosts, though:

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-103-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-102-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-101-7-gluster.REDACTED.com 24007' -uroot
chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$
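For anyone reproducing this check without ansible or tcping installed, a minimal single-host sketch using bash's built-in /dev/tcp pseudo-device (the hostnames are the ones from this report; substitute your own):

```shell
# Hypothetical standalone port check: report whether glusterd's management
# port (24007 by default) accepts TCP connections from this host.
check_port() {
    local host=$1 port=$2
    # /dev/tcp/<host>/<port> is a bash redirection feature, not a real file;
    # the timeout guards against hosts that silently drop SYNs.
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host} port ${port} open"
    else
        echo "${host} port ${port} closed"
    fi
}

# Example (hostnames from this report):
#   check_port chi-virt-101-7-gluster.REDACTED.com 24007
```

Note this only proves TCP connectivity, which matches what tcping showed here; it cannot rule out the RPC-level timeouts seen in the logs.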
Is there any additional information I can provide?
(In reply to Luis E. Cerezo from comment #1)
> Created attachment 1242236 [details]
> logfile (sanitized domain name)
>
> log file in debug. it expands to ~50Mb

The logfile attached is not readable. Could you please check and reattach the glusterd log file?
Here's a pastebin URL from the IRC chat (DEBUG REMOVED). The attachment is a gzip of the log file.

https://paste.fedoraproject.org/529909/47589871/
I'll upload the file again. It's a gzip of etc-glusterfs-glusterd.vol.log from one host, in debug mode. I can provide logs from the other nodes in this 3-node setup if you wish.
Created attachment 1243905 [details] etc-glusterfs-glusterd.vol.log GZIP
sha512sum etc-glusterfs-glusterd.vol.log.gz
0d1dff013fb7e6a6ed3aeda60498c9565693c6b858b0f0579d02c48f0fb0874e5948e2620dcc54903708e3da9f2e7aabf868facaeb5bdab4fd1e35bd63dc12b1  etc-glusterfs-glusterd.vol.log.gz
I didn't find any evidence of glusterd not coming up from the log file you shared.
"Fails to start" is probably not a logically accurate statement; from his user perspective, that's how he's interpreting the symptoms. The real problem seems to be the repeating RPC timeouts he's getting on all servers:

[2017-01-18 00:07:24.745691] E [rpc-clnt.c:200:call_bail] 0-management: bailing out frame type(Peer mgmt) op(--(2)) xid = 0x8 sent = 2017-01-17 23:57:22.580694. timeout = 600 for 10.49.1.145:24007
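To see how widespread these bailouts are, a quick sketch for summarizing call_bail lines per peer address (the default log path below is an assumption based on a standard EL7 install; the grep pattern matches the message format quoted above):

```shell
# Hypothetical helper: count call_bail timeouts per peer address:port
# in a glusterd log, most frequent first.
bail_summary() {
    # Default path assumes a stock EL7 glusterfs install; pass another
    # log path as the first argument to override it.
    local log=${1:-/var/log/glusterfs/etc-glusterfs-glusterd.vol.log}
    grep 'call_bail' "$log" \
        | grep -oE 'for [0-9.]+:[0-9]+' \
        | sort | uniq -c | sort -rn
}

# Example: bail_summary /path/to/etc-glusterfs-glusterd.vol.log
```

Running this on each node should show whether every peer is bailing out against the same 24007 endpoint or whether one specific node is the common factor.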
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.