Bug 1468261

Summary: Regression: non-disruptive(in-service) upgrade on EC volume fails
Product: [Community] GlusterFS Reporter: Sunil Kumar Acharya <sheggodu>
Component: disperse Assignee: Sunil Kumar Acharya <sheggodu>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: mainline CC: amukherj, aspandey, bugs, nchilaka, pasik, pkarampu, rhinduja, rhs-bugs, sheggodu, storage-qa-internal
Target Milestone: --- Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1465289
: 1470938 Environment:
Last Closed: 2017-09-05 17:36:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1465289, 1470938    

Comment 1 Worker Ant 2017-07-06 15:30:31 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#3) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 2 Worker Ant 2017-07-07 16:04:35 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#4) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 3 Worker Ant 2017-07-10 07:10:11 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#5) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 4 Worker Ant 2017-07-10 09:47:34 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#6) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 5 Worker Ant 2017-07-11 14:32:44 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#7) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 6 Ashish Pandey 2017-07-12 07:33:33 UTC
Description of problem:
====================
A non-disruptive (in-service) upgrade of an EC volume fails because of a regression: client I/O returns Input/output errors while nodes are being upgraded.


Client IO:
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_handles.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_net.h



Client fuse logs:
[2017-06-27 06:31:41.488462] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.492350] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.495012] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.498939] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.500037] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.501771] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.502741] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.510185] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.512205] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.517462] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.520244] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.522030] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.530202] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.533945] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.536465] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.539042] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.540564] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.544238] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.545663] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.550015] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.552186] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0:
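
The warnings above can be summarized with a short script. This is only a sketch: it assumes the value in parentheses after "subvolumes unavailable" is a hexadecimal bitmask of subvolume indices, which this report does not state. The helper names parse_ec_warnings and mask_to_indices are hypothetical.

```python
import re
from collections import Counter

# Matches the fields that appear in the client log excerpt above.
LINE_RE = re.compile(
    r"\[MSGID: (?P<msgid>\d+)\] "
    r"\[(?P<src>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\] "
    r"(?P<xl>\S+): (?P<text>.*)"
)
MASK_RE = re.compile(r"subvolumes unavailable \((?P<mask>[0-9A-Fa-f]+)\)")


def mask_to_indices(mask_hex):
    """Return the zero-based bit positions set in a hex mask.

    ASSUMPTION: each set bit corresponds to one unavailable subvolume.
    """
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def parse_ec_warnings(lines):
    """Count warnings per MSGID and collect the assumed unavailable indices."""
    counts = Counter()
    unavailable = set()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip truncated or unrelated lines
        counts[m.group("msgid")] += 1
        mm = MASK_RE.search(m.group("text"))
        if mm:
            unavailable.update(mask_to_indices(mm.group("mask")))
    return counts, sorted(unavailable)


sample = [
    "[2017-06-27 06:31:41.495012] W [MSGID: 122035] "
    "[ec-common.c:464:ec_child_select] 0-ecv-disperse-0: "
    "Executing operation with some subvolumes unavailable (4)",
    "[2017-06-27 06:31:41.498939] W [MSGID: 122040] "
    "[ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: "
    "Failed to get size and version [Input/output error]",
]
counts, unavailable = parse_ec_warnings(sample)
print(dict(counts), unavailable)
```

Under that assumption, mask 4 has only bit 2 set, i.e. the third subvolume, which would be consistent with node #3 being taken down in the reproduction steps.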


Version-Release number of selected component (if applicable):
============
3.8.4.28-->3.8.4.29
3.8.4.29-->3.8.4-31

How reproducible:
======
2/2

Steps to Reproduce:
1. Create a 4+2 EC volume across 6 nodes.
2. Keep an untar of the Linux kernel running on the client throughout the upgrade procedure.
3. Upgrade node #1 and node #2 (kill glusterfsd and glusterfs, stop glusterd, upgrade the RPMs, then start glusterd).
4. Wait for healing to complete.
5. After the heal completes, keep the kernel untar running.
6. Upgrade node #3 (kill glusterfsd and glusterfs, stop glusterd).

At this step, client I/O fails with Input/output errors.
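
Why this is unexpected can be shown with simple arithmetic. The sketch below assumes one brick per node (the report does not say so explicitly) and uses a hypothetical helper, can_serve_io; it only encodes the general rule that a data+redundancy dispersed set needs at least as many usable bricks as it has data fragments.

```python
def can_serve_io(data, redundancy, unusable):
    """Return True if enough bricks remain usable to serve I/O.

    A dispersed set with `data` + `redundancy` bricks needs at least
    `data` usable bricks, so it tolerates up to `redundancy` losses.
    """
    total = data + redundancy
    return total - unusable >= data


# 4+2 volume from the report: one node down should be tolerated,
# three unusable bricks should not.
for unusable in range(4):
    print(unusable, can_serve_io(4, 2, unusable))
```

With only node #3 down after a completed heal, one brick is unusable and I/O should continue. The observed errors therefore suggest that more than two bricks were effectively unusable at that point.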

Comment 7 Worker Ant 2017-07-13 10:59:34 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#8) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 8 Worker Ant 2017-07-13 13:15:54 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#9) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 9 Worker Ant 2017-07-14 00:26:08 UTC
COMMIT: https://review.gluster.org/17703 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit d2650feb4bfadf3fb0cdb90236bc78c33b5ea451
Author: Sunil Kumar Acharya <sheggodu>
Date:   Wed Jul 5 16:41:38 2017 +0530

    cluster/ec: Non-disruptive upgrade on EC volume fails
    
    Problem:
    Enabling optimistic changelog on EC volume was not
    handling node down scenarios appropriately resulting
    in volume data inaccessibility.
    
    Solution:
    Update dirty xattr appropriately on good bricks whenever
    nodes are down. This would fix the metadata information
    as part of heal and thus ensures data accessibility.
    
    BUG: 1468261
    Change-Id: I08b0d28df386d9b2b49c3de84b4aac1c729ac057
    Signed-off-by: Sunil Kumar Acharya <sheggodu>
    Reviewed-on: https://review.gluster.org/17703
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
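
The problem and solution described in the commit message can be illustrated with a toy model. This is not the ec translator code. It is a simplified sketch that assumes heal only discovers work through a dirty marker on the good bricks, and that a file is readable when at least 4 available bricks agree on the latest version. The Brick class and all function names are hypothetical.

```python
from dataclasses import dataclass

DATA, REDUNDANCY = 4, 2  # the 4+2 layout from the report


@dataclass
class Brick:
    up: bool = True
    version: int = 0
    dirty: bool = False  # stands in for the dirty xattr


def write(bricks, mark_dirty_on_failure):
    """Apply one update to every brick that is up."""
    some_down = any(not b.up for b in bricks)
    for b in bricks:
        if b.up:
            b.version += 1
            # The fix, as modelled here: mark good bricks dirty
            # whenever some bricks missed the update.
            if some_down and mark_dirty_on_failure:
                b.dirty = True


def heal(bricks):
    """Toy heal: runs only if an available brick carries the dirty marker."""
    up = [b for b in bricks if b.up]
    if not any(b.dirty for b in up):
        return  # nothing flagged, so stale bricks stay stale
    latest = max(b.version for b in up)
    for b in up:
        b.version = latest
        b.dirty = False


def readable(bricks):
    """Readable if at least DATA available bricks hold the latest version."""
    up = [b for b in bricks if b.up]
    if not up:
        return False
    latest = max(b.version for b in up)
    return sum(b.version == latest for b in up) >= DATA


def rolling_upgrade(mark_dirty_on_failure):
    bricks = [Brick() for _ in range(DATA + REDUNDANCY)]
    bricks[0].up = bricks[1].up = False   # upgrade nodes #1 and #2
    write(bricks, mark_dirty_on_failure)  # untar keeps writing meanwhile
    bricks[0].up = bricks[1].up = True    # upgraded nodes come back
    heal(bricks)
    bricks[2].up = False                  # upgrade node #3
    return readable(bricks)


print("without dirty marking:", rolling_upgrade(False))
print("with dirty marking:", rolling_upgrade(True))
```

In this model, skipping the dirty marking leaves nodes #1 and #2 stale after the heal, so taking node #3 down leaves only three consistent bricks and reads fail. Marking the good bricks dirty lets the heal bring all bricks back in sync before node #3 is upgraded.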

Comment 10 Shyamsundar 2017-09-05 17:36:29 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/