Bug 1465289 - Regression: non-disruptive(in-service) upgrade on EC volume fails
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Assigned To: Sunil Kumar Acharya
QA Contact: nchilaka
Keywords: Regression
Depends On: 1468261 1470938
Blocks:
Reported: 2017-06-27 02:36 EDT by nchilaka
Modified: 2017-08-30 06:06 EDT (History)
CC: 9 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
In-service upgrade requires disperse.optimistic-change-log to be off: gluster volume set <volname> disperse.optimistic-change-log off
Story Points: ---
Clone Of:
Clones: 1468261
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description nchilaka 2017-06-27 02:36:22 EDT
Description of problem:
====================
Non-disruptive (in-service) upgrade on an EC (disperse) volume fails with input/output errors on the client due to a regression.


Client IO:
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_handles.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_import.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_intent.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_kernelcomm.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lib.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_linkea.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_lmv.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_log.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mdc.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h
tar: linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_mds.h: Cannot open: Input/output error
linux-4.11.7/drivers/staging/lustre/lustre/include/lustre_net.h



Client fuse logs:
[2017-06-27 06:31:41.488462] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.492350] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.495012] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.498939] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.500037] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.501771] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.502741] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.510185] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.512205] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.517462] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.520244] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.522030] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.530202] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.533945] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.536465] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.539042] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.540564] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.544238] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.545663] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (4)
[2017-06-27 06:31:41.550015] W [MSGID: 122040] [ec-common.c:990:ec_prepare_update_cbk] 0-ecv-disperse-0: Failed to get size and version [Input/output error]
[2017-06-27 06:31:41.552186] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0:


Version-Release number of selected component (if applicable):
============
3.8.4-28 --> 3.8.4-29
3.8.4-29 --> 3.8.4-31

How reproducible:
======
2/2

Steps to Reproduce:
1. Have a 4+2 EC volume spanning 6 nodes.
2. Start an untar of the Linux kernel from a client and keep it running throughout the upgrade procedure.
3. Upgrade node #1 and node #2 (kill glusterfsd and glusterfs, stop glusterd; after upgrading the RPMs, start glusterd); see the sketch after this list.
4. Wait for healing to complete.
5. With heal complete and the kernel untar still running, upgrade node #3 (kill glusterfsd and glusterfs, stop glusterd).

At this step the client IO fails with input/output errors.
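For reference, a minimal sketch of the per-node upgrade step used above, assuming systemd-managed glusterd and a volume named ecv (the package glob and volume name are illustrative):

  # On the node being upgraded:
  pkill glusterfsd                # kill the brick processes
  pkill glusterfs                 # kill the remaining gluster client/daemon processes
  systemctl stop glusterd
  yum update 'glusterfs*'         # upgrade the RPMs
  systemctl start glusterd

  # From any node, wait until heal completes before upgrading the next node:
  gluster volume heal ecv info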
Comment 5 Atin Mukherjee 2017-07-05 09:08:51 EDT
upstream patch : https://review.gluster.org/#/c/17703/
Comment 12 Atin Mukherjee 2017-07-14 01:59:46 EDT
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/112278/
Comment 15 nchilaka 2017-07-24 07:26:17 EDT
on_qa validation:
Performed the same test as above to verify in-service upgrade from 3.8.4-34 to 3.8.4-35 (both versions contain the supposed fix); still seeing the same input/output errors.
Hence moving to FAILED_QA.
Comment 20 Sunil Kumar Acharya 2017-08-11 02:02:16 EDT
a) What is the workaround for performing in-service upgrade of EC volumes?

workaround: set "disperse.optimistic-change-log" to off.
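Applying and verifying the workaround from the CLI (the volume name ecv is illustrative):

  gluster volume set ecv disperse.optimistic-change-log off
  gluster volume get ecv disperse.optimistic-change-log    # should report "off"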

b) What's the impact of the workaround in terms of data integrity?

No impact.

c) What are the steps for performing an offline upgrade? By offline, I am referring to the storage being inaccessible to applications/clients during the course of the upgrade.

Steps are outlined here: https://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/#offline-upgrade-procedure
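In outline, per the linked guide, the offline procedure stops the volume, so clients lose access for the duration of the upgrade. A minimal sketch (the volume name ecv is illustrative):

  # Unmount the volume on all clients, then on any one node:
  gluster volume stop ecv
  # Upgrade the RPMs and restart glusterd on every node, then:
  gluster volume start ecv
  # Remount the volume on the clients.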
Comment 22 Alok 2017-08-16 01:48:32 EDT
Thanks. Please get the steps documented for RHGS 3.3.
Comment 29 nchilaka 2017-08-18 02:47:01 EDT
The problem reported in BZ#1473668 has the following repercussions. It is seen on both FUSE and gNFS (and would also exist on SMB/Ganesha):
1) Heal almost never completes (it does finish eventually, but only after a very long time or once IOs and entry creates in that directory are stopped), leading to end-user frustration.
2) In-service upgrade cannot proceed to the next set of nodes: after the first node is upgraded, the entries pending heal stall at some point and tend to stay in that state forever. Since the in-service procedure allows moving to the next set of nodes only after heal completes, the upgrade may never progress.
Even the workaround of disabling both optimistic-change-log and eager-lock (see the commands below) does not overcome this problem.
This means that in-service upgrade cannot be supported with or without the workaround.
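For reference, the combined workaround mentioned above amounts to the following (the volume name ecv is illustrative):

  gluster volume set ecv disperse.optimistic-change-log off
  gluster volume set ecv disperse.eager-lock off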
I have tested this on different builds:
- upgrade from 3.8.4-18-6 (3.2 async GA) to 3.8.4-18-38/41/27, etc.
- or even just doing a pkill of glusterfsd and glusterfs and a restart of glusterd (a plain brick-down scenario), with IOs going on: 1) kernel untar 2) file creations under one directory, say 1 million small files
