Bug 1548337 - hitting EIO error when a brick is restarted in ecvolume
Summary: hitting EIO error when a brick is restarted in ecvolume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Ashish Pandey
QA Contact: Upasana
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-02-23 09:03 UTC by nchilaka
Modified: 2018-09-04 06:43 UTC
CC: 6 users

Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:42:45 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:43:52 UTC

Description nchilaka 2018-02-23 09:03:20 UTC
Description of problem:
=======================
When we kill a brick and restart glusterd to bring the brick back online on an EC volume, we hit EIO on the client side, even though I/O appears to keep going.

(This could possibly be due to the introduction of disperse.parallel-writes, but I may be wrong.)
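If disperse.parallel-writes is suspected, it can be inspected and toggled per volume while re-running the test. This is a dry-run sketch, not part of the original report; the volume name "testo" is taken from the client log below, and RUN=echo only prints the commands (set RUN= to execute on a real node):

```shell
# Dry-run sketch: with RUN=echo the gluster commands are printed, not run.
RUN=${RUN:-echo}

# Show the current value of the suspected option on volume "testo".
$RUN gluster volume get testo disperse.parallel-writes

# Disable it to check whether the EIO still reproduces.
$RUN gluster volume set testo disperse.parallel-writes off
```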

Version-Release number of selected component (if applicable):
============
3.12.2.3

How reproducible:
============
2/3


Steps to Reproduce:
1. This can be hit easily during an in-service upgrade, and may also be hit during a brick disconnection.
2. Have an EC volume and do a Linux untar from the client. Let the EC volume be on an old build, such as 3.3.1 or even the first drop of 3.4.
3. Do an in-service upgrade of two nodes by bringing down glusterfsd, glusterfs, and glusterd.
4. Upgrade the nodes.

Observation:
The Linux untar may continue, but at the end it exits with errors as below:
[root@dhcp47-79 ~]# tailf tar.log
tar: linux-4.15.5/scripts/dtc/include-prefixes/cris: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/dt-bindings: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/metag: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/microblaze: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/mips: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/powerpc: Cannot close: Input/output error
tar: linux-4.15.5/tools/perf/util/help-unknown-cmd.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/copyloops/memcpy_64.S: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/vphn/vphn.c: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/delay.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/export.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/mutex.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/sched.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/virtio/linux/hrtimer.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/word-at-a-time.h: Cannot create symlink to `../../../../../arch/powerpc/include/asm/word-at-a-time.h': Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/asm/asm-compat.h: Cannot create symlink to `../.././../../../../arch/powerpc/include/asm/asm-compat.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/xtensa: Cannot create symlink to `../../../arch/xtensa/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/openrisc: Cannot create symlink to `../../../arch/openrisc/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/h8300: Cannot create symlink to `../../../arch/h8300/boot/dts': Input/output error
tar: linux-4.15.5/include/dt-bindings/input/linux-event-codes.h: Cannot create symlink to `../../uapi/linux/input-event-codes.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot utime: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors


Below are some client-side errors:
[2018-02-23 06:57:29.625090] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-testo-client-3: disconnected from testo-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2018-02-23 06:57:29.627296] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619962 (xid=0xc557)
[2018-02-23 06:57:29.627313] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-0: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619728 (xid=0xc552)
[2018-02-23 06:57:29.627327] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-3: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.627364] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-0: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.640118] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:57:35.759496] W [socket.c:595:__socket_rwv] 0-glusterfs: readv on 10.70.35.153:24007 failed (No data available)
[2018-02-23 06:57:35.759606] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:57:35.759623] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-02-23 06:57:39.902621] E [socket.c:2360:socket_connect_finish] 0-testo-client-0: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:40.898507] E [socket.c:2360:socket_connect_finish] 0-testo-client-3: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:46.010219] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 8111 times between [2018-02-23 06:57:29.640118] and [2018-02-23 06:58:24.667118]
[2018-02-23 06:58:24.673527] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:59:51.141573] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:59:51.141643] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15171 times between [2018-02-23 06:58:24.673527] and [2018-02-23 07:00:24.704978]
[2018-02-23 07:00:24.715908] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:01:59.383995] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:01:59.384079] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15631 times between [2018-02-23 07:00:24.715908] and [2018-02-23 07:02:24.801248]
[2018-02-23 07:02:24.847580] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:04:07.492088] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:04:07.492148] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 16099 times between [2018-02-23 07:02:24.847580] and [2018-02-23 07:04:25.807357]
[2018-02-23 07:04:25.813844] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:06:15.684958] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:06:15.685036] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15127 times between [2018-02-23 07:04:25.813844] and [2018-02-23 07:06:25.859202]

Comment 2 Xavi Hernandez 2018-02-23 09:30:58 UTC
The messages from the client's log seem fine.

In step 3 of the reproduction steps, you say you kill glusterfs. Does this process correspond to a FUSE mount where I/O is happening?

Comment 4 nchilaka 2018-03-01 05:43:05 UTC
(In reply to Xavi Hernandez from comment #2)
> The messages form the client's log seem fine.
> 
> In step 3 of the way to reproduce the problem, you say you kill glusterfs.
> Does this process correspond to a FUSE mount where I/O is happening ?

The pkill glusterfs here was done on the server node (not on the FUSE client). This is done generally (as part of the upgrade) to kill all gluster daemons, such as shd, etc.

Comment 12 nchilaka 2018-05-16 13:11:24 UTC
Moving this to FAILED_QA as I still see the issue.
More details at https://bugzilla.redhat.com/show_bug.cgi?id=1558948#c33

Comment 17 errata-xmlrpc 2018-09-04 06:42:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

