Description of problem:
====================
The EC non-disruptive (in-service) upgrade fails due to a regression.

Client IO:
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_handles.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_net.h

Client fuse logs:
[2017-06-27 06:31:41.488462] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.492350] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.495012] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.498939] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.500037] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.501771] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.502741] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.510185] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.512205] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.517462] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.520244] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.522030] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.530202] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.533945] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.536465] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.539042] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.540564] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.544238] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.545663] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.550015] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.552186] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)

Version-Release number of selected component (if applicable):
============
3.8.4-28 --> 3.8.4-29
3.8.4-29 --> 3.8.4-31

How reproducible:
======
2/2

Steps to Reproduce:
1. Have a 4+2 EC (disperse) volume on 6 nodes.
2. Start a Linux kernel untar on a client and keep it running throughout the upgrade procedure.
3. Upgrade nodes #1 and #2 (kill glusterfsd and glusterfs, stop glusterd, upgrade the RPMs, then start glusterd); the per-node sequence is sketched below.
4. Wait for healing to complete.
5. With heal completed and the kernel untar still running,
6. upgrade node #3 (kill glusterfsd and glusterfs, stop glusterd).

At this step the client IO fails with Input/output errors.
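For clarity, here is the per-node upgrade sequence used in steps 3 and 6 as a shell sketch. This is an illustrative reconstruction of what the steps describe, not an exact transcript; the package name glusterfs-server and the heal-check command are assumptions based on the usual RHGS in-service procedure, and the volume name "ecv" is taken from the logs above:

# take the node's gluster processes down
pkill glusterfsd            # brick processes
pkill glusterfs             # client-side processes (self-heal daemon, gNFS, fuse)
systemctl stop glusterd     # management daemon

# upgrade the packages (illustrative package name; use your RHGS channel)
yum update glusterfs-server

# restart the management daemon, which respawns the bricks
systemctl start glusterd

# before proceeding to the next set of nodes, confirm heal completion
gluster volume heal ecv info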
Upstream patch: https://review.gluster.org/#/c/17703/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/112278/
on_qa validation: ran the same test as above to verify the in-service upgrade from 3.8.4-34 to 3.8.4-35 (both builds carry the supposed fix) and still hit the same Input/output errors. Hence moving to failed_qa.
a) What is the workaround for performing an in-service upgrade of EC volumes?
Workaround: set "disperse.optimistic-change-log" to off (see the command below).

b) What is the impact of the workaround in terms of data integrity?
No impact.

c) What are the steps for performing an offline upgrade? By "offline" I am referring to the storage being inaccessible to applications/clients for the duration of the upgrade.
The steps are outlined here: https://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/#offline-upgrade-procedure
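For reference, the workaround in (a) is a single volume-set command run from any server node (volume name "ecv" taken from the logs above; substitute your own):

gluster volume set ecv disperse.optimistic-change-log off

# once all nodes are upgraded, the option can be restored to its default:
gluster volume reset ecv disperse.optimistic-change-log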
Thanks. Please get the steps documented for RHGS 3.3.
The problem reported in BZ#1473668 has the following repercussions. It is seen on both fuse and gNFS (and would also exist on SMB/Ganesha):

1) The heal almost never completes (it does complete, but only after a very long time or once IO and entry creation in that directory are stopped), leading to end-user frustration.

2) The in-service upgrade cannot proceed to the next set of nodes: after the first node is upgraded, the entries pending heal stall at some point and tend to stay in that state forever. Per the in-service upgrade procedure we should proceed to the next set of nodes only after heal is completed, and that may never be achieved.

Even the workaround of disabling both optimistic-change-log and eager-lock does not overcome this problem (see the sketch below). This means that the in-service upgrade cannot be supported, with or without the workaround.

I have tested this on different builds, e.g. upgrading from 3.8.4-18-6 (3.2 async GA) to 3.8.4-18-38/41/27 etc., or even just doing a pkill of glusterfsd/glusterfs and a restart of glusterd (a plain brick-down scenario), with IO going on:
1) kernel untar
2) file creation under one directory, say 1 million small files
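For completeness, the combined workaround referred to above (which, as noted, still does not avoid the heal stall) would be applied as follows; the option name disperse.eager-lock is my assumption for the "eagerlock" mentioned in the comment, and "ecv" is again the volume name from the logs:

gluster volume set ecv disperse.optimistic-change-log off
gluster volume set ecv disperse.eager-lock off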
Relevant documentation has been completed as part of RHGS-3.3.0. Bug 1481946 is "CLOSED CURRENTRELEASE".