Description of problem:
=======================
When we kill a brick and restart glusterd to bring the brick back online on an EC volume, we hit EIO on the client side while I/O seems to be going ahead. (This could possibly be due to the introduction of disperse.parallel-writes, but I may be wrong.)

Version-Release number of selected component (if applicable):
=============================================================
3.12.2.3

How reproducible:
=================
2/3

Steps to Reproduce:
1. This can be hit easily during an in-service upgrade, and may also be hit during a brick disconnection.
2. Have an EC volume and run a linux untar from the client. Let the EC volume be on an old build such as 3.3.1 or even the first drop of 3.4.
3. Do an in-service upgrade of two nodes by bringing down glusterfsd, glusterfs and glusterd.
4. Upgrade the nodes.

Observation: the linux untar may continue, but at the end it exits with errors as below:

[root@dhcp47-79 ~]# tailf tar.log
tar: linux-4.15.5/scripts/dtc/include-prefixes/cris: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/dt-bindings: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/metag: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/microblaze: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/mips: Cannot close: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/powerpc: Cannot close: Input/output error
tar: linux-4.15.5/tools/perf/util/help-unknown-cmd.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/copyloops/memcpy_64.S: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/vphn/vphn.c: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/delay.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/export.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/mutex.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/rcutorture/formal/srcu-cbmc/empty_includes/linux/sched.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/virtio/linux/hrtimer.h: Cannot close: Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/word-at-a-time.h: Cannot create symlink to `../../../../../arch/powerpc/include/asm/word-at-a-time.h': Input/output error
tar: linux-4.15.5/tools/testing/selftests/powerpc/primitives/asm/asm-compat.h: Cannot create symlink to `../.././../../../../arch/powerpc/include/asm/asm-compat.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/xtensa: Cannot create symlink to `../../../arch/xtensa/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/openrisc: Cannot create symlink to `../../../arch/openrisc/boot/dts': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes/h8300: Cannot create symlink to `../../../arch/h8300/boot/dts': Input/output error
tar: linux-4.15.5/include/dt-bindings/input/linux-event-codes.h: Cannot create symlink to `../../uapi/linux/input-event-codes.h': Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot utime: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.15.5/scripts/dtc/include-prefixes: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors

Below are some client side errors:

[2018-02-23 06:57:29.625090] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-testo-client-3: disconnected from testo-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2018-02-23 06:57:29.627296] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619962 (xid=0xc557)
[2018-02-23 06:57:29.627313] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fb4fcd5b2e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb4fcb208ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb4fcb209be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb4fcb22130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb4fcb22be0] ))))) 0-testo-client-0: forced unwinding frame type(GlusterFS 3.3) op(FSTAT(25)) called at 2018-02-23 06:57:29.619728 (xid=0xc552)
[2018-02-23 06:57:29.627327] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-3: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.627364] W [MSGID: 114031] [client-rpc-fops.c:1495:client3_3_fstat_cbk] 0-testo-client-0: remote operation failed [Transport endpoint is not connected]
[2018-02-23 06:57:29.640118] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:57:35.759496] W [socket.c:595:__socket_rwv] 0-glusterfs: readv on 10.70.35.153:24007 failed (No data available)
[2018-02-23 06:57:35.759606] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:57:35.759623] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-02-23 06:57:39.902621] E [socket.c:2360:socket_connect_finish] 0-testo-client-0: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:40.898507] E [socket.c:2360:socket_connect_finish] 0-testo-client-3: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
[2018-02-23 06:57:46.010219] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to 10.70.35.153:24007 failed (Connection refused); disconnecting socket
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 8111 times between [2018-02-23 06:57:29.640118] and [2018-02-23 06:58:24.667118]
[2018-02-23 06:58:24.673527] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 06:59:51.141573] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 06:59:51.141643] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15171 times between [2018-02-23 06:58:24.673527] and [2018-02-23 07:00:24.704978]
[2018-02-23 07:00:24.715908] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:01:59.383995] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:01:59.384079] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15631 times between [2018-02-23 07:00:24.715908] and [2018-02-23 07:02:24.801248]
[2018-02-23 07:02:24.847580] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:04:07.492088] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:04:07.492148] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 16099 times between [2018-02-23 07:02:24.847580] and [2018-02-23 07:04:25.807357]
[2018-02-23 07:04:25.813844] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)
[2018-02-23 07:06:15.684958] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.35.153
[2018-02-23 07:06:15.685036] I [glusterfsd-mgmt.c:2235:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-testo-disperse-0: Executing operation with some subvolumes unavailable (9)" repeated 15127 times between [2018-02-23 07:04:25.813844] and [2018-02-23 07:06:25.859202]
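For reference, the per-node kill-and-restart sequence from steps 3-4 could be sketched roughly as below. This is a hedged sketch, not the exact commands used in this report: the volume name "testo" is taken from the logs, and the package name is illustrative.

```shell
# Sketch of the in-service upgrade flow on one server node
# (assumption: systemd-managed glusterd; package name is illustrative).

# 1. Stop all gluster daemons on this node: management daemon,
#    brick processes, and auxiliary daemons such as shd.
systemctl stop glusterd
pkill glusterfsd      # brick processes
pkill glusterfs       # auxiliary daemons (self-heal, etc.)

# 2. Upgrade the packages (exact command depends on the distribution).
yum update glusterfs-server

# 3. Restart glusterd; it respawns the brick processes.
systemctl start glusterd

# 4. Verify the bricks are back online before touching the next node.
gluster volume status testo
```

In the failure described above, the client starts reporting EIO around the point where the restarted glusterd is bringing the brick back online.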
The messages from the client's log seem fine. In step 3 of the steps to reproduce, you say you kill glusterfs. Does this process correspond to a FUSE mount where I/O is happening?
(In reply to Xavi Hernandez from comment #2) > The messages from the client's log seem fine. > > In step 3 of the way to reproduce the problem, you say you kill glusterfs. > Does this process correspond to a FUSE mount where I/O is happening ? The pkill glusterfs here was done on a server node (not on the FUSE client); this is generally done (as part of the upgrade) to kill all gluster daemons such as shd, etc.
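When killing glusterfs processes on a server node, it can help to first confirm that no local FUSE mount is among them. A possible check (an assumption-laden sketch: it relies on standard gluster process command lines, where the self-heal daemon's command line contains "glustershd" while a FUSE client shows its mountpoint) might look like:

```shell
# List all gluster processes with their full command lines so you can
# tell bricks (glusterfsd), the self-heal daemon (volfile-id containing
# "glustershd"), and any FUSE clients (mountpoint argument) apart.
pgrep -af gluster

# If you want to kill only the self-heal daemon rather than every
# glusterfs process, matching on the command line is one option:
pkill -f glustershd
```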
Moving this to FailedQA as I still see the issue. More details at https://bugzilla.redhat.com/show_bug.cgi?id=1558948#c33
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607