902953 – Clients return ENOTCONN or EINVAL after restarting brick servers in quick succession

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	transport
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-01-22 19:35 UTC by John Morrissey
Modified:	2016-03-16 07:01 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Clones:	1318132
Environment:
Last Closed:	2015-10-22 15:46:38 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments
Comment (81.17 KB, text/plain) 2013-01-22 19:35 UTC, John Morrissey

Description John Morrissey 2013-01-22 19:35:46 UTC

Created attachment 915661
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 1 Amar Tumballi 2013-02-14 09:37:39 UTC

Thanks for the report. One question, though: if a node (or a lot of nodes) is going down and coming back up, isn't it natural for operations to fail, given that the filesystem is network-based?

Comment 2 John Morrissey 2013-02-15 16:04:24 UTC

Sure, I would expect the operations to fail *while* the Gluster servers are being restarted, but after the servers are running, I would also expect Gluster clients to gracefully reconnect.

As the logs above show, they clearly do not do so after several minutes, or (in our experience) even after several hours.

Comment 3 John Morrissey 2013-04-01 16:28:12 UTC

Looks like this isn't limited to native Gluster clients.

Some of our nodes mount a Gluster instance via NFS. We noticed that these clients can successfully mount the volume, but any I/O to them returns EIO:

    [jwm@elided:pts/13 ~> ls -l /path/to/gluster
    ls: /path/to/gluster: Input/output error

The gluster<->nfs process on the gluster server:

    root     27902 12.1  0.7 406064 179052 ?       Ssl  Jan22 11601:30 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /tmp/bf018af881a58acb0efa7cefadd6fb1d.socket

is spinning on a file descriptor that probably used to be connected to a gluster brick, but is now open to /etc/services:

    -bash-4.1$ sudo strace -p 27902
    Process 27902 attached - interrupt to quit
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1
    getsockopt(19, SOL_SOCKET, SO_ERROR, [182050606976860271], [4]) = 0
    shutdown(19, 2 /* send and receive */)  = -1 ENOTCONN (Transport endpoint is not connected)
    readv(19, [{"\0\0\0\0", 4}], 1)         = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 19, NULL)   = 0
    close(19)                               = 0
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1
    getsockopt(19, SOL_SOCKET, SO_ERROR, [190986337975795823], [4]) = 0
    shutdown(19, 2 /* send and receive */)  = -1 ENOTCONN (Transport endpoint is not connected)
    readv(19, [{"\0\0\0\0", 4}], 1)         = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 19, NULL)   = 0
    close(19)                               = 0
    epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=19, u64=107374182419}}}, 258, 4294967295) = 1
    -bash-4.1$ sudo lsof -p 27902
    COMMAND     PID USER   FD   TYPE             DEVICE  SIZE/OFF      NODE NAME
    [...]
    glusterfs 27902 root   19u   REG              253,0    640999   3801126 /etc/services
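
One way a descriptor number that once belonged to a socket can end up naming a regular file such as /etc/services is descriptor-number reuse: open() returns the lowest unused number, so a number released by close() is handed to the next open. The following is only a minimal Python sketch of that kernel behavior, not GlusterFS code and not a confirmed diagnosis of this bug. The function name demonstrate_fd_reuse and its path parameter are hypothetical, and the default path is os.devnull so the sketch runs on systems without /etc/services.

```python
import os
import socket
import stat


def demonstrate_fd_reuse(path=os.devnull):
    """Close a socket, open `path`, and report whether the kernel
    handed the same descriptor number back (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    stale_fd = sock.fileno()   # number that other code may still remember
    sock.close()               # the number is now free for reuse

    # POSIX open() returns the lowest unused descriptor number, so in a
    # single-threaded process this is the number that was just released.
    new_fd = os.open(path, os.O_RDONLY)
    try:
        # The descriptor now refers to `path`, so it is no longer a socket.
        is_socket = stat.S_ISSOCK(os.fstat(new_fd).st_mode)
    finally:
        os.close(new_fd)
    return stale_fd, new_fd, is_socket


if __name__ == "__main__":
    stale_fd, new_fd, is_socket = demonstrate_fd_reuse()
    print(f"socket fd={stale_fd}, reopened fd={new_fd}, is socket: {is_socket}")
```

Any code that keeps using the old number after the close would then be issuing socket calls against a regular file, which is at least consistent with the lsof output above showing fd 19 open on /etc/services.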

Comment 4 Kaleb KEITHLEY 2015-10-22 15:46:38 UTC

Because of the large number of bugs filed against it, the mainline version is ambiguous and is about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.
