Bug 1548337

Summary: Hitting EIO error when a brick is restarted in an EC volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: disperse
Version: rhgs-3.4
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Nag Pavan Chilakam <nchilaka>
Assignee: Ashish Pandey <aspandey>
QA Contact: Upasana <ubansal>
CC: jahernan, nchilaka, rhinduja, rhs-bugs, sheggodu, storage-qa-internal
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-09-04 06:42:45 UTC
Bug Blocks: 1503137

Description Nag Pavan Chilakam 2018-02-23 09:03:20 UTC
Description of problem:
=======================
When a brick process is killed and glusterd is restarted to bring the brick back online on an EC volume, the client hits EIO errors even though I/O otherwise appears to keep going.

(This could possibly be due to the introduction of disperse.parallel-writes, but I may be wrong.)
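For illustration, a minimal shell sketch of the kill-brick/restart-glusterd cycle described above, together with a check of the suspected option. The volume name "testo" is taken from the client logs below, and the availability of the disperse.parallel-writes option in this build is an assumption:

# identify the PID of one brick of the EC volume, then kill that brick process
gluster volume status testo | grep -i brick
kill <brick-pid>                     # PID taken from the status output above

# restart glusterd on the same node so it respawns the brick
systemctl restart glusterd

# check whether parallel writes are enabled on the volume; it can be turned off
# to test the hypothesis (assumption: the option exists in this build)
gluster volume get testo disperse.parallel-writes
gluster volume set testo disperse.parallel-writes off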

Version-Release number of selected component (if applicable):
============
3.12.2.3

How reproducible:
============
2/3


Steps to Reproduce:
1. This is easily hit during an in-service upgrade, and may also be hit on a plain brick disconnection.
2. Create an EC volume on an older build (e.g. 3.3.1, or even the first drop of 3.4) and run a Linux kernel untar from the client.
3. Do an in-service upgrade of two nodes by bringing down glusterfsd, glusterfs and glusterd on them.
4. Upgrade the nodes (a hedged sketch of this sequence follows below).
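
As noted in step 4, a hedged per-node sketch of the in-service upgrade sequence (the yum-based package update and package glob are assumptions; the actual procedure should follow the RHGS upgrade documentation):

# on the server node being upgraded, bring down all gluster processes
pkill glusterfsd
pkill glusterfs
systemctl stop glusterd

# upgrade the gluster packages on this node (assumed yum-based flow)
yum -y update glusterfs\*

# restart glusterd; it brings the bricks and other daemons back online
systemctl start glusterd

# confirm the bricks are back up before moving on to the next node
gluster volume status testo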


Observation:
The Linux untar may continue, but at the end it exits with errors like the ones below:
[root@dhcp47-79 ~]# tailf tar.log 
tar: linux-4.15.5/scripts/dtc/include-prefixes/cris: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/dt-bindings: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/metag: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/microblaze: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/mips: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/powerpc: Cannot close: Input/output error
tar: linux-4.15.5/tools/perf/util/help-unknown-cmd.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/copyloops/memcpy_64.S: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/vphn/vphn.c: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/delay.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/export.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/mutex.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/sched.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/virtio/linux/hrtimer.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/word-at-a-time.h: Cannot create symlink to `../../../../../arch/powerpc/include/asm/word-at-a-time.h': Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/asm/asm-compat.h: Cannot create symlink to `../.././../../../../arch/powerpc/include/asm/asm-compat.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/xtensa: Cannot create symlink to `../../../arch/xtensa/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/openrisc: Cannot create symlink to `../../../arch/openrisc/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/h8300: Cannot create symlink to `../../../arch/h8300/boot/dts': Input/output error
tar: linux-4.15.5/include/dt-bindings/input/linux-event-codes.h: Cannot create symlink to `../../uapi/linux/input-event-codes.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot utime: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors


Below are some client-side errors:
[2018-02-23 06:57:29.625090] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-testo-client-3: disconnected from testo-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2018-02-23 06:57:29.627296] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619962 (xid=0xc557)
[2018-02-23 06:57:29.627313] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-0: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619728 (xid=0xc552)
[2018-02-23 06:57:29.627327] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-3: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.627364] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-0: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.640118] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:57:35.759496] W [socket.c:595:__socket_rwv] 0-glusterfs: readv on 10.70.35.153:24007 failed (No data available)
[2018-02-23 06:57:35.759606] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:57:35.759623] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-02-23 06:57:39.902621] E [socket.c:2360:socket_connect_finish] 0-testo-client-0: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:40.898507] E [socket.c:2360:socket_connect_finish] 0-testo-client-3: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:46.010219] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 8111 times between [2018-02-23 06:57:29.640118] and [2018-02-23 06:58:24.667118]
[2018-02-23 06:58:24.673527] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:59:51.141573] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:59:51.141643] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15171 times between [2018-02-23 06:58:24.673527] and [2018-02-23 07:00:24.704978]
[2018-02-23 07:00:24.715908] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:01:59.383995] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:01:59.384079] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15631 times between [2018-02-23 07:00:24.715908] and [2018-02-23 07:02:24.801248]
[2018-02-23 07:02:24.847580] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:04:07.492088] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:04:07.492148] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 16099 times between [2018-02-23 07:02:24.847580] and [2018-02-23 07:04:25.807357]
[2018-02-23 07:04:25.813844] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:06:15.684958] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:06:15.685036] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15127 times between [2018-02-23 07:04:25.813844] and [2018-02-23 07:06:25.859202]

Comment 2 Xavi Hernandez 2018-02-23 09:30:58 UTC
The messages from the client's log seem fine.

In step 3 of the steps to reproduce, you say you kill glusterfs. Does this process correspond to a FUSE mount where I/O is happening?

Comment 4 Nag Pavan Chilakam 2018-03-01 05:43:05 UTC
(In reply to Xavi Hernandez from comment #2)
> The messages from the client's log seem fine.
> 
> In step 3 of the steps to reproduce, you say you kill glusterfs.
> Does this process correspond to a FUSE mount where I/O is happening?

pkill glusterfs here was done on the server node (not on the FUSE client); this is generally done (as part of an upgrade) to kill all gluster daemons such as shd.

Comment 12 Nag Pavan Chilakam 2018-05-16 13:11:24 UTC
Moving this to failed_qa as I still see the issue.
More details at https://bugzilla.redhat.com/show_bug.cgi?id=1558948#c33

Comment 17 errata-xmlrpc 2018-09-04 06:42:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607