Bug 1766640 - EC inservice upgrade fails from RHGS 3.3.1->3.5.0
Summary: EC inservice upgrade fails from RHGS 3.3.1->3.5.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.5.z Batch Update 1
Assignee: Pranith Kumar K
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1696815
 
Reported: 2019-10-29 14:47 UTC by Nag Pavan Chilakam
Modified: 2020-01-30 06:43 UTC
CC List: 8 users

Fixed In Version: glusterfs-6.0-23
Doc Type: Known Issue
Doc Text:
Special handling is sometimes required to ensure I/O on clients with older versions works correctly during an in-service upgrade. Servers with dispersed volumes do not do this handling for Red Hat Gluster Storage 3.3.1 clients when upgrading to version 3.5. Workaround: If you use dispersed volumes and have clients on Red Hat Gluster Storage 3.3.1, perform an offline upgrade when moving server and client to version 3.5.
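To make the workaround concrete, here is a minimal shell sketch of an offline upgrade of a dispersed volume; the volume name "ecvol" is a placeholder, not taken from this bug, and the package-update step will vary by environment:

# Offline-upgrade sketch for a dispersed volume ("ecvol" is illustrative).
gluster volume stop ecvol        # stop the volume so no client I/O is in flight
# ...upgrade the glusterfs packages on every server and client node...
gluster volume start ecvol       # restart the volume once everything runs 3.5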
Clone Of:
Environment:
Last Closed: 2020-01-30 06:42:48 UTC
Target Upstream Version:




Links
Red Hat Product Errata RHBA-2020:0288 (last updated 2020-01-30 06:43:08 UTC)

Description Nag Pavan Chilakam 2019-10-29 14:47:32 UTC
Description of problem:
==========================
When Bala (bmekala) ran the test case of an in-service upgrade of an EC volume from RHGS 3.3.1 to 3.5.0 (that is, glusterfs-3.8.4-54.15 to glusterfs-6.0-21), the client started to face input/output errors from a Linux kernel untar after 4 nodes were upgraded.

### below is the info ####
While performing the in-service upgrade, during the upgrade of the 4th node I saw Input/Output errors on the client side with the Linux untar on the disperse and distributed-disperse volumes.
I had turned off the disperse.optimistic-change-log and disperse.eager-lock options before starting the upgrade on the disperse and distributed-disperse volumes.

After this I did not proceed with the upgrade on the remaining two nodes; I stopped the upgrade. Please look into it.
#############
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot utime: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot change mode to rwxrwxr-x: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot utime: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors
#############
Cluster details (credentials: root/1)
Upgraded nodes:
10.70.35.150
10.70.35.210
10.70.35.107
10.70.35.164
Nodes yet to be upgraded:
10.70.35.119
10.70.35.46

Clients:
10.70.35.198
10.70.35.147

All the above machines are hosted on "tettnang.lab.eng.blr.redhat.com" 

Regards,
Bala



Version-Release number of selected component (if applicable):
=============
RHGS 3.3.1 -> 3.5.0 (i.e., glusterfs-3.8.4-54.15 -> glusterfs-6.0-21)

How reproducible:
=================
Hit once, on 2 different EC volumes on the same cluster.


Steps to Reproduce:
1. Create an EC volume and a distributed-EC volume on 3.3.1; turn off disperse.eager-lock and disperse.optimistic-change-log (see the command sketch after this list)
2. Mount on 2 clients and run a Linux kernel untar I/O workload
3. Start upgrading one node at a time
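For reference, step 1's option changes map to gluster CLI calls along these lines; the volume name "ecvol" is illustrative, not from this report:

# Disable the two disperse options on the volume before starting the upgrade.
gluster volume set ecvol disperse.eager-lock off
gluster volume set ecvol disperse.optimistic-change-log off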

Comment 6 Pranith Kumar K 2019-10-30 05:41:56 UTC
https://code.engineering.redhat.com/gerrit/184178

Comment 12 SATHEESARAN 2019-12-18 17:48:22 UTC
Verified with the RHGS 3.5.1 interim build (glusterfs-6.0-24.el7rhgs) using the following steps:

1. Created a 6-node trusted storage pool (gluster cluster) with RHGS 3.3.1 (glusterfs-3.8.4-54.15.el7rhgs)
2. Created 1x(4+2) and 2x(4+2) disperse volumes
3. Disabled disperse.eager-lock and disperse.optimistic-change-log
4. Mounted the volumes from 2 clients
5. Started the kernel untar workload
6. Killed the glusterfsd (brick), glusterfs, and glusterd processes on node1 ( # pkill glusterfsd; pkill glusterfs; systemctl stop glusterd )
7. Performed the upgrade to glusterfs-6.0-24.el7rhgs
8. Post successful upgrade, started glusterd
9. Waited for self-heal to complete on both disperse volumes
10. Repeated steps 6 to 9 on the other nodes and monitored the progress of the kernel untar workload after upgrading each node (a sketch of the per-node upgrade cycle follows this list)
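A minimal sketch of one per-node upgrade cycle from steps 6 to 9, assuming a yum-based package update and a volume named "ecvol" (both placeholders):

# One in-service upgrade cycle on a single node (steps 6-9).
pkill glusterfsd; pkill glusterfs   # kill brick and client-side gluster processes
systemctl stop glusterd
yum update glusterfs\*              # assumed update step; exact packages/repos may differ
systemctl start glusterd
gluster volume heal ecvol info      # repeat until no entries remain, then move to the next node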

Observation
1. The kernel untar workload remained in progress with no interruption

With these steps, marking this bug as verified.

After upgrading the servers, the op-version was also bumped up to 70000, and the clients were unmounted, upgraded, and the disperse volumes remounted.
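For reference, the op-version bump maps to standard gluster CLI calls along these lines:

# Check the current cluster op-version, then raise it after all servers are upgraded.
gluster volume get all cluster.op-version
gluster volume set all cluster.op-version 70000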

Comment 14 errata-xmlrpc 2020-01-30 06:42:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0288

