Bug 1465289 - Regression: non-disruptive(in-service) upgrade on EC volume fails
Summary: Regression: non-disruptive(in-service) upgrade on EC volume fails
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Sunil Kumar Acharya
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On: 1468261 1470938
Blocks:
 
Reported: 2017-06-27 06:36 UTC by Nag Pavan Chilakam
Modified: 2018-08-14 11:16 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
In-service upgrade requires disperse.optimistic-change-log to be off: gluster v set <volname> disperse.optimistic-change-log off
Clone Of:
Cloned As: 1468261
Environment:
Last Closed: 2017-12-06 14:20:40 UTC
Embargoed:



Description Nag Pavan Chilakam 2017-06-27 06:36:22 UTC
Description of problem:
====================
Non-disruptive (in-service) upgrade of an EC (disperse) volume fails due to a regression: clients start seeing Input/output errors during the upgrade.


Client IO:
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_handles.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_net.h



Client fuse logs:
[2017-06-27 06:31:41.488462] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.492350] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.495012] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.498939] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.500037] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.501771] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.502741] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.510185] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.512205] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.517462] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.520244] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.522030] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.530202] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.533945] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.536465] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.539042] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.540564] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.544238] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.545663] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.550015] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.552186] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0:


Version-Release number of selected component (if applicable):
============
3.8.4-28 --> 3.8.4-29
3.8.4-29 --> 3.8.4-31

How reproducible:
======
2/2

Steps to Reproduce:
1. Have a 4+2 EC volume spanning 6 nodes.
2. Keep a Linux kernel untar running on the client mount throughout the upgrade procedure.
3. Upgrade node #1 and node #2 (kill glusterfsd and glusterfs, stop glusterd, and after upgrading the RPMs start glusterd).
4. Wait for healing to complete.
5. With healing complete and the kernel untar still running,
6. upgrade node #3 (kill glusterfsd and glusterfs, stop glusterd).

At this step the client IO fails with Input/output errors (see the sketch below).
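
For reference, a minimal sketch of this flow, assuming a 4+2 volume named "ecv" with bricks at /bricks/brick1 on hosts node1..node6 (all names here are illustrative, not from the report):

    # 1. Create and start the 4+2 disperse volume across 6 nodes
    gluster volume create ecv disperse 6 redundancy 2 \
        node{1..6}:/bricks/brick1/ecv force
    gluster volume start ecv

    # 2. On a client, mount the volume and keep a kernel untar running
    mount -t glusterfs node1:/ecv /mnt/ecv
    (cd /mnt/ecv && tar xf /tmp/linux-4.11.7.tar.xz) &

    # 3. On node1, then node2: take gluster down, upgrade, restart
    pkill glusterfsd; pkill glusterfs
    systemctl stop glusterd
    yum update glusterfs\*          # upgrade the RPMs
    systemctl start glusterd

    # 4./5. Wait until no entries remain pending heal
    gluster volume heal ecv info

    # 6. Repeat the kill/stop sequence on node3; the client tar now fails with EIO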

Comment 5 Atin Mukherjee 2017-07-05 13:08:51 UTC
upstream patch : https://review.gluster.org/#/c/17703/

Comment 12 Atin Mukherjee 2017-07-14 05:59:46 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/112278/

Comment 15 Nag Pavan Chilakam 2017-07-24 11:26:17 UTC
on_qa validation:
Reran the same test above to verify in-service upgrade from 3.8.4-34 to 3.8.4-35 (both versions carry the supposed fix); still seeing the same Input/output errors.
Hence moving to failed_qa.

Comment 20 Sunil Kumar Acharya 2017-08-11 06:02:16 UTC
a) What is the workaround for performing in-service upgrade of EC volumes?

Workaround: set "disperse.optimistic-change-log" to off.
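
A minimal sketch of applying and verifying the workaround (the volume name "ecv" is illustrative):

    # Disable optimistic change-log updates on the EC volume
    gluster volume set ecv disperse.optimistic-change-log off

    # Confirm the option is now off
    gluster volume get ecv disperse.optimistic-change-log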

b) What's the impact of the workaround in terms of data integrity?

No impact.

c) What are the steps for performing an offline upgrade? By offline, I am referring to the inaccessibility of storage to the applications/clients during the course of the upgrade.

Steps are outlined here: https://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/#offline-upgrade-procedure
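
For reference, a hedged sketch of that offline flow (volume name and package glob are illustrative; the linked guide is authoritative):

    # On every client: unmount so applications cannot reach the volume
    umount /mnt/ecv

    # On one server: stop the volume
    gluster volume stop ecv

    # On every server: stop the daemons, upgrade, restart
    systemctl stop glusterd
    pkill glusterfsd; pkill glusterfs
    yum update glusterfs\*
    systemctl start glusterd

    # On one server: start the volume, then remount the clients
    gluster volume start ecv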

Comment 22 Alok 2017-08-16 05:48:32 UTC
Thanks. Please get the steps documented for RHGS 3.3.

Comment 29 Nag Pavan Chilakam 2017-08-18 06:47:01 UTC
The problem reported in BZ#1473668 has the following repercussions:
The problem is seen on both FUSE and gNFS (and would also exist on SMB/Ganesha).
1) The heal almost never completes (it does complete, but only after a very long time, or once IOs and entry creates in that directory are stopped), leading to end-user frustration.
2) In-service upgrade cannot proceed to the next set of nodes: after the first node is upgraded, the entries pending heal stall at some point and tend to stay in that state forever. Per the in-service upgrade procedure we should move to the next set of nodes only after heal has completed, which may never be achieved.
 Even the workaround of disabling both optimistic-change-log and eager-lock does not overcome this problem.
 This means in-service upgrade cannot be supported, with or without the workaround.
I have tested this on different builds:
- upgrade from 3.8.4-18-6 (3.2 async GA) to 3.8.4-18-38/41/27, etc.
- or even just a pkill of glusterfsd/glusterfs and a restart of glusterd (a plain brick-down scenario), with IOs going on: 1) kernel untar 2) file creation under one directory, say 1 million small files (sketched below).
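
A sketch of that brick-down reproduction with both workaround options disabled (the volume name "ecv" is illustrative, and "disperse.eager-lock" is assumed to be the option referred to as "eagerlock" above):

    # Apply both workarounds up front
    gluster volume set ecv disperse.optimistic-change-log off
    gluster volume set ecv disperse.eager-lock off

    # While the kernel untar / small-file creation runs on a client,
    # bounce gluster on one server node
    pkill glusterfsd; pkill glusterfs
    systemctl restart glusterd    # bricks come back up with glusterd

    # Watch the pending-heal count; it stalls rather than draining to zero
    watch -n 60 gluster volume heal ecv statistics heal-count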

Comment 33 Sunil Kumar Acharya 2017-12-06 14:20:40 UTC
Relevant documentation was done as part of RHGS 3.3.0. Bug 1481946 is "CLOSED CURRENTRELEASE".

