Bug 1468261 - Regression: non-disruptive(in-service) upgrade on EC volume fails
Summary: Regression: non-disruptive(in-service) upgrade on EC volume fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
Assignee: Sunil Kumar Acharya
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1465289 1470938
TreeView+ depends on / blocked
 
Reported: 2017-07-06 13:39 UTC by Sunil Kumar Acharya
Modified: 2017-10-26 14:35 UTC (History)
10 users (show)

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1465289
: 1470938 (view as bug list)
Environment:
Last Closed: 2017-09-05 17:36:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Comment 1 Worker Ant 2017-07-06 15:30:31 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#3) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 2 Worker Ant 2017-07-07 16:04:35 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#4) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 3 Worker Ant 2017-07-10 07:10:11 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#5) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 4 Worker Ant 2017-07-10 09:47:34 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#6) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 5 Worker Ant 2017-07-11 14:32:44 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#7) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 6 Ashish Pandey 2017-07-12 07:33:33 UTC
Description of problem:
====================
The ec non-disruptive upgrade fails due to some regression


Client IO:tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_handles.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_net.h



Client fuse logs:
17-06-27 06:31:41.488462] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.492350] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.495012] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.498939] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.500037] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.501771] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.502741] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.510185] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.512205] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.517462] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.520244] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.522030] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.530202] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.533945] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.536465] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.539042] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.540564] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.544238] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.545663] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.550015] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.552186] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0:


Version-Release number of selected component (if applicable):
============
3.8.4.28-->3.8.4.29
3.8.4.29-->3.8.4-31

How reproducible:
======
2/2

Steps to Reproduce:
1.have a 4+2 ec volume on 6 nodes
2.let untar linux kernel go on during this upgrade procedure
3.upgrade node#1 and #2 (kill glusterfsd, glusterfs,stop glusterd and post upgrade of rpm start glusterd)
4. wait for healing to complete
5. post heal completed, and with still kernel untar going on
6. now upgrade node#3((kill glusterfsd, glusterfs,stop glusterd)

At this step you will see IO errors with i/o error

Comment 7 Worker Ant 2017-07-13 10:59:34 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#8) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 8 Worker Ant 2017-07-13 13:15:54 UTC
REVIEW: https://review.gluster.org/17703 (cluster/ec: Non-disruptive upgrade on EC volume fails) posted (#9) for review on master by Sunil Kumar Acharya (sheggodu)

Comment 9 Worker Ant 2017-07-14 00:26:08 UTC
COMMIT: https://review.gluster.org/17703 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit d2650feb4bfadf3fb0cdb90236bc78c33b5ea451
Author: Sunil Kumar Acharya <sheggodu>
Date:   Wed Jul 5 16:41:38 2017 +0530

    cluster/ec: Non-disruptive upgrade on EC volume fails
    
    Problem:
    Enabling optimistic changelog on EC volume was not
    handling node down scenarios appropriately resulting
    in volume data inaccessibility.
    
    Solution:
    Update dirty xattr appropriately on good bricks whenever
    nodes are down. This would fix the metadata information
    as part of heal and thus ensures data accessibility.
    
    BUG: 1468261
    Change-Id: I08b0d28df386d9b2b49c3de84b4aac1c729ac057
    Signed-off-by: Sunil Kumar Acharya <sheggodu>
    Reviewed-on: https://review.gluster.org/17703
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 10 Shyamsundar 2017-09-05 17:36:29 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/


Note You need to log in before you can comment on or make changes to this bug.