Bug 1414519 - Glusterd fails to start: rpc frame timeouts
Summary: Glusterd fails to start: rpc frame timeouts
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.8
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-01-18 17:40 UTC by Luis E. Cerezo
Modified: 2017-11-07 10:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-07 10:37:57 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
logfile (sanitized domain name) (2.26 MB, text/plain)
2017-01-18 17:42 UTC, Luis E. Cerezo
etc-glusterfs-glusterd.vol.log GZIP (2.25 MB, application/x-gzip)
2017-01-24 13:31 UTC, Luis E. Cerezo

Description Luis E. Cerezo 2017-01-18 17:40:35 UTC
Description of problem:
When attempting to start or restart gluster, volumes fail to start. Logs indicate timeout issues.

Version-Release number of selected component (if applicable):
glusterfs-server-3.8.8-1.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Restart gluster.
2. Wait.
3. gluster volume status, statedump, and related commands return "Error : Request timed out".

Actual results:
Error : Request timed out

Expected results:
gluster volume status returns the volume status.

Additional info:

Comment 1 Luis E. Cerezo 2017-01-18 17:42:42 UTC
Created attachment 1242236 [details]
logfile (sanitized domain name)

Log file captured at debug level. It expands to ~50 MB.

Comment 2 Joe Julian 2017-01-18 20:59:14 UTC
These rpc timeouts occur on all servers.

Comment 3 Luis E. Cerezo 2017-01-18 22:18:56 UTC
Hi Joe,

Yes, we are seeing these on all the servers.

Port 24007 is open on all hosts, though.
[lucho@localhost HCI_scripts]$  ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-103-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$  ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-102-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$  ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-101-7-gluster.REDACTED.com 24007' -uroot
chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$
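
For anyone reproducing the reachability check above without tcping, here is a minimal Python sketch of the same idea. The function names port_open and check_hosts are hypothetical, and the 10-second default only mirrors the -t 10 used above. A successful TCP connect only shows that something accepts connections on the port; it does not show that glusterd is answering RPC requests, which may be why the port reads as open while the frames still time out.

```python
import socket


def port_open(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port=24007, timeout=10.0):
    """Map each host name to whether its TCP port accepted a connection."""
    return {host: port_open(host, port, timeout) for host in hosts}
```

Calling check_hosts with the three chi-virt-10x-7-gluster host names from each node would give the same open/closed picture as the ansible run above.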

Comment 4 Luis E. Cerezo 2017-01-23 18:48:16 UTC
Is there any additional information I can provide?

Comment 5 Atin Mukherjee 2017-01-24 04:42:59 UTC
(In reply to Luis E. Cerezo from comment #1)
> Created attachment 1242236 [details]
> logfile (sanitized domain name)
> 
> log file in debug. it expands to ~50Mb

The logfile attached is not readable. Could you please check and reattach the glusterd log file?

Comment 6 Luis E. Cerezo 2017-01-24 12:51:35 UTC
Here's a pastebin URL from the IRC chat (DEBUG removed)

The attachment is a gzip of the log file.

https://paste.fedoraproject.org/529909/47589871/

Comment 7 Luis E. Cerezo 2017-01-24 13:30:38 UTC
I'll upload the file again. It's a gzip of etc-glusterfs-glusterd.vol.log from one host in debug mode. I can provide logs from the other nodes in this 3-node setup if you wish.

Comment 8 Luis E. Cerezo 2017-01-24 13:31:16 UTC
Created attachment 1243905 [details]
etc-glusterfs-glusterd.vol.log GZIP

Comment 9 Luis E. Cerezo 2017-01-24 14:37:19 UTC
sha512sum etc-glusterfs-glusterd.vol.log.gz
0d1dff013fb7e6a6ed3aeda60498c9565693c6b858b0f0579d02c48f0fb0874e5948e2620dcc54903708e3da9f2e7aabf868facaeb5bdab4fd1e35bd63dc12b1  etc-glusterfs-glusterd.vol.log.gz
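
To confirm that a downloaded copy of the attachment matches the checksum above, a small Python sketch along these lines should work. The helper names sha512_of and matches are hypothetical; the expected digest is copied from the sha512sum output above.

```python
import hashlib

# Digest copied from the sha512sum output in this comment.
EXPECTED = (
    "0d1dff013fb7e6a6ed3aeda60498c9565693c6b858b0f0579d02c48f0fb0874e"
    "5948e2620dcc54903708e3da9f2e7aabf868facaeb5bdab4fd1e35bd63dc12b1"
)


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED):
    """Return True if the file's SHA-512 digest equals the expected hex string."""
    return sha512_of(path) == expected.lower()
```

For example, matches("etc-glusterfs-glusterd.vol.log.gz") should be true only if the download is byte-identical to the file that was hashed.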

Comment 10 Atin Mukherjee 2017-01-25 06:02:09 UTC
I didn't find any evidence of glusterd not coming up from the log file you shared.

Comment 11 Joe Julian 2017-01-25 06:44:02 UTC
"Fails to start" is probably not an accurate description of the problem. From the user's perspective, that is how the symptoms appear.

The real problem seems to be the repeating "[2017-01-18 00:07:24.745691] E [rpc-clnt.c:200:call_bail] 0-management: bailing out frame type(Peer mgmt) op(--(2)) xid = 0x8 sent = 2017-01-17 23:57:22.580694. timeout = 600 for 10.49.1.145:24007" timeouts he's getting on all servers.
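
To make the timing in that message easier to see across a large log, here is a rough Python sketch that pulls the fields out of a call_bail line and computes how long the frame was outstanding. The regular expression is an assumption based only on the single line quoted above, and parse_call_bail is a hypothetical helper, not part of any gluster tooling.

```python
import re
from datetime import datetime

# The line quoted above, used as sample input.
SAMPLE = (
    "[2017-01-18 00:07:24.745691] E [rpc-clnt.c:200:call_bail] 0-management: "
    "bailing out frame type(Peer mgmt) op(--(2)) xid = 0x8 "
    "sent = 2017-01-17 23:57:22.580694. timeout = 600 for 10.49.1.145:24007"
)

# Pattern derived from the sample line only; other gluster versions may differ.
CALL_BAIL = re.compile(
    r"^\[(?P<logged>[\d-]+ [\d:.]+)\].*call_bail\].*"
    r"type\((?P<type>[^)]*)\).*xid = (?P<xid>0x[0-9a-fA-F]+) "
    r"sent = (?P<sent>[\d-]+ [\d:.]+?)\. "
    r"timeout = (?P<timeout>\d+) for (?P<peer>\S+)$"
)

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_call_bail(line):
    """Return a dict of fields from a call_bail log line, or None if no match."""
    match = CALL_BAIL.match(line.strip())
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), TIME_FORMAT)
    sent = datetime.strptime(match.group("sent"), TIME_FORMAT)
    return {
        "type": match.group("type"),
        "xid": match.group("xid"),
        "peer": match.group("peer"),
        "timeout": int(match.group("timeout")),
        # Seconds between sending the frame and logging the bail-out.
        "elapsed": (logged - sent).total_seconds(),
    }


info = parse_call_bail(SAMPLE)
```

For the quoted line, elapsed comes out at roughly 602 seconds against a timeout of 600. That is consistent with the Peer mgmt request to 10.49.1.145:24007 going unanswered for the whole timeout window, rather than failing quickly.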

Comment 12 Niels de Vos 2017-11-07 10:37:57 UTC
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.

