Description of problem:
When attempting to start/restart gluster, volumes fail to start. Logs indicate timeout issues.

Version-Release number of selected component (if applicable):
glusterfs-server-3.8.8-1.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Restart gluster.
2. Wait.
3. Run "gluster volume status". Statedump and related commands also return "Error : Request timed out".

Actual results:
Error : Request timed out

Expected results:
Volume status returns.

Additional info:
Created attachment 1242236 [details]
logfile (sanitized domain name)

Log file in debug mode; it expands to ~50 MB.
These rpc timeouts occur on all servers.
Hi Joe,

Yes, we are seeing these timeouts on all the servers. Port 24007 is open on all hosts, though:

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-103-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-103-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-102-7-gluster.REDACTED.com 24007' -uroot
chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-102-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$ ansible chi-virt-infra-hosts -m shell -a 'tcping -t 10 chi-virt-101-7-gluster.REDACTED.com 24007' -uroot
chi-virt-102-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-103-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

chi-virt-101-7.REDACTED.com | SUCCESS | rc=0 >>
chi-virt-101-7-gluster.REDACTED.com port 24007 open.

[lucho@localhost HCI_scripts]$
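For anyone reproducing this check without ansible or tcping installed, a minimal single-host sketch using bash's built-in /dev/tcp pseudo-device (the hostnames are the ones from this report; substitute your own):

```shell
# Hypothetical standalone port check: report whether glusterd's management
# port (24007 by default) accepts TCP connections from this host.
check_port() {
    local host=$1 port=$2
    # /dev/tcp/<host>/<port> is a bash redirection feature, not a real file;
    # the timeout guards against hosts that silently drop SYNs.
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host} port ${port} open"
    else
        echo "${host} port ${port} closed"
    fi
}

# Example (hostnames from this report):
#   check_port chi-virt-101-7-gluster.REDACTED.com 24007
```

Note this only proves TCP connectivity, which matches what tcping showed here; it cannot rule out the RPC-level timeouts seen in the logs.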
Is there any additional information I can provide?
(In reply to Luis E. Cerezo from comment #1)
> Created attachment 1242236 [details]
> logfile (sanitized domain name)
>
> log file in debug. it expands to ~50Mb

The logfile attached is not readable. Could you please check and reattach the glusterd log file?
Here's a pastebin URL from the IRC chat (DEBUG REMOVED). The attachment is a gzip of the log file.

https://paste.fedoraproject.org/529909/47589871/
I'll upload the file again. It's a gzip of etc-glusterfs-glusterd.vol.log from one host, in debug mode. I can provide logs from the other nodes in this 3-node setup if you wish.
Created attachment 1243905 [details] etc-glusterfs-glusterd.vol.log GZIP
sha512sum etc-glusterfs-glusterd.vol.log.gz
0d1dff013fb7e6a6ed3aeda60498c9565693c6b858b0f0579d02c48f0fb0874e5948e2620dcc54903708e3da9f2e7aabf868facaeb5bdab4fd1e35bd63dc12b1  etc-glusterfs-glusterd.vol.log.gz
I didn't find any evidence of glusterd not coming up from the log file you shared.
"Fails to start" is probably not a logically accurate statement; from his user perspective, that's how he's interpreting the symptoms. The real problem seems to be the repeating RPC timeouts he's getting on all servers:

[2017-01-18 00:07:24.745691] E [rpc-clnt.c:200:call_bail] 0-management: bailing out frame type(Peer mgmt) op(--(2)) xid = 0x8 sent = 2017-01-17 23:57:22.580694. timeout = 600 for 10.49.1.145:24007
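To see how widespread these bailouts are, a quick sketch for summarizing call_bail lines per peer address (the default log path below is an assumption based on a standard EL7 install; the grep pattern matches the message format quoted above):

```shell
# Hypothetical helper: count call_bail timeouts per peer address:port
# in a glusterd log, most frequent first.
bail_summary() {
    # Default path assumes a stock EL7 glusterfs install; pass another
    # log path as the first argument to override it.
    local log=${1:-/var/log/glusterfs/etc-glusterfs-glusterd.vol.log}
    grep 'call_bail' "$log" \
        | grep -oE 'for [0-9.]+:[0-9]+' \
        | sort | uniq -c | sort -rn
}

# Example: bail_summary /path/to/etc-glusterfs-glusterd.vol.log
```

Running this on each node should show whether every peer is bailing out against the same 24007 endpoint or whether one specific node is the common factor.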
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.