Bug 1548337 - hitting EIO error when a brick is restarted in ecvolume
Summary: hitting EIO error when a brick is restarted in ecvolume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Ashish Pandey
QA Contact: Upasana
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-02-23 09:03 UTC by nchilaka
Modified: 2018-09-04 06:43 UTC
CC: 6 users

Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:42:45 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:43:52 UTC

Description nchilaka 2018-02-23 09:03:20 UTC
Description of problem:
=======================
When we kill a brick and restart glusterd to bring the brick back online on an EC volume, we hit EIO on the client side, even though I/O appears to keep going.

(This could possibly be due to the introduction of disperse.parallel-writes, but I may be wrong.)
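If disperse.parallel-writes is suspected, it can be inspected and toggled per volume while re-running the test. This is a dry-run sketch, not part of the original report; the volume name "testo" is taken from the client log below, and RUN=echo only prints the commands (set RUN= to execute on a real node):

```shell
# Dry-run sketch: with RUN=echo the gluster commands are printed, not run.
RUN=${RUN:-echo}

# Show the current value of the suspected option on volume "testo".
$RUN gluster volume get testo disperse.parallel-writes

# Disable it to check whether the EIO still reproduces.
$RUN gluster volume set testo disperse.parallel-writes off
```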

Version-Release number of selected component (if applicable):
============
3.12.2.3

How reproducible:
============
2/3


Steps to Reproduce:
1. This can be hit easily during an in-service upgrade, and may also be hit during a brick disconnection.
2. Have an EC volume and do a Linux untar from the client. Let the EC volume be on an old build, such as 3.3.1 or even the first drop of 3.4.
3. Do an in-service upgrade of two nodes by bringing down glusterfsd, glusterfs, and glusterd.
4. Upgrade the nodes.

Observation:
The Linux untar may continue, but at the end it exits with errors as below:
[root@dhcp47-79 ~]# tailf tar.log
tar: linux-4.15.5/scripts/dtc/include-prefixes/cris: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/dt-bindings: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/metag: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/microblaze: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/mips: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/powerpc: Cannot close: Input/output error
tar: linux-4.15.5/tools/perf/util/help-unknown-cmd.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/copyloops/memcpy_64.S: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/vphn/vphn.c: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/delay.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/export.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/mutex.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/sched.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/virtio/linux/hrtimer.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/word-at-a-time.h: Cannot create symlink to `../../../../../arch/powerpc/include/asm/word-at-a-time.h': Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/asm/asm-compat.h: Cannot create symlink to `../.././../../../../arch/powerpc/include/asm/asm-compat.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/xtensa: Cannot create symlink to `../../../arch/xtensa/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/openrisc: Cannot create symlink to `../../../arch/openrisc/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/h8300: Cannot create symlink to `../../../arch/h8300/boot/dts': Input/output error
tar: linux-4.15.5/include/dt-bindings/input/linux-event-codes.h: Cannot create symlink to `../../uapi/linux/input-event-codes.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot utime: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors


Below are some client-side errors:
[2018-02-23 06:57:29.625090] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-testo-client-3: disconnected from testo-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2018-02-23 06:57:29.627296] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619962 (xid=0xc557)
[2018-02-23 06:57:29.627313] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-0: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619728 (xid=0xc552)
[2018-02-23 06:57:29.627327] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-3: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.627364] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-0: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.640118] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:57:35.759496] W [socket.c:595:__socket_rwv] 0-glusterfs: readv on 10.70.35.153:24007 failed (No data available)
[2018-02-23 06:57:35.759606] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:57:35.759623] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-02-23 06:57:39.902621] E [socket.c:2360:socket_connect_finish] 0-testo-client-0: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:40.898507] E [socket.c:2360:socket_connect_finish] 0-testo-client-3: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:46.010219] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 8111 times between [2018-02-23 06:57:29.640118] and [2018-02-23 06:58:24.667118]
[2018-02-23 06:58:24.673527] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:59:51.141573] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:59:51.141643] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15171 times between [2018-02-23 06:58:24.673527] and [2018-02-23 07:00:24.704978]
[2018-02-23 07:00:24.715908] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:01:59.383995] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:01:59.384079] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15631 times between [2018-02-23 07:00:24.715908] and [2018-02-23 07:02:24.801248]
[2018-02-23 07:02:24.847580] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:04:07.492088] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:04:07.492148] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 16099 times between [2018-02-23 07:02:24.847580] and [2018-02-23 07:04:25.807357]
[2018-02-23 07:04:25.813844] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:06:15.684958] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:06:15.685036] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15127 times between [2018-02-23 07:04:25.813844] and [2018-02-23 07:06:25.859202]

Comment 2 Xavi Hernandez 2018-02-23 09:30:58 UTC
The messages from the client's log seem fine.

In step 3 of the reproduction steps, you say you kill glusterfs. Does this process correspond to a FUSE mount where I/O is happening?

Comment 4 nchilaka 2018-03-01 05:43:05 UTC
(In reply to Xavi Hernandez from comment #2)
> The messages form the client's log seem fine.
> 
> In step 3 of the way to reproduce the problem, you say you kill glusterfs.
> Does this process correspond to a FUSE mount where I/O is happening ?

The pkill glusterfs here was done on the server node (not on the FUSE client). This is done generally (as part of the upgrade) to kill all gluster daemons, such as shd, etc.

Comment 12 nchilaka 2018-05-16 13:11:24 UTC
Moving this to FAILED_QA as I still see the issue.
More details at https://bugzilla.redhat.com/show_bug.cgi?id=1558948#c33

Comment 17 errata-xmlrpc 2018-09-04 06:42:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

