Bug 1766640

Summary: EC in-service upgrade fails from RHGS 3.3.1 -> 3.5.0
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: disperse
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.5
CC: amukherj, pkarampu, pprakash, rhs-bugs, saraut, sasundar, sheggodu, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0-23
Doc Type: Known Issue
Doc Text:
Special handling is sometimes required to ensure I/O on clients with older versions works correctly during an in-service upgrade. Servers with dispersed volumes do not do this handling for Red Hat Gluster Storage 3.3.1 clients when upgrading to version 3.5. Workaround: If you use dispersed volumes and have clients on Red Hat Gluster Storage 3.3.1, perform an offline upgrade when moving server and client to version 3.5.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-30 06:42:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1696815    

Description Nag Pavan Chilakam 2019-10-29 14:47:32 UTC
Description of problem:
==========================
When Bala (bmekala) ran the in-service upgrade test case for EC volumes, from RHGS 3.3.1 to 3.5.0 (i.e. 3.8.4-54.15 -> 6.0.21), the client started hitting input/output errors during a Linux kernel untar after 4 nodes had been upgraded.

### below is the info ####
While performing the in-service upgrade, during the upgrade of the 4th node I saw Input/Output errors on the client side with the Linux kernel untar on the disperse and distributed-disperse volumes.
I had turned off the disperse.optimistic-change-log and disperse.eager-lock options on the disperse and distributed-disperse volumes before starting the upgrade.

After this I did not proceed with the upgrade on the remaining two nodes; I stopped the upgrade. Please look into it.
#############
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot utime: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/stringloops: Cannot change mode to rwxrwxr-x: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot utime: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-4.20/tools/testing/selftests/powerpc/primitives/asm: Cannot change mode to rwxrwxr-x: Input/output error
tar: Exiting with failure status due to previous errors
#############
Cluster Details: Credentials root/1
Upgraded nodes:
10.70.35.150
10.70.35.210
10.70.35.107
10.70.35.164
Nodes that are yet to be upgraded:
10.70.35.119
10.70.35.46

Clients:
10.70.35.198
10.70.35.147

All the above machines are hosted on "tettnang.lab.eng.blr.redhat.com" 

Regards,
Bala



Version-Release number of selected component (if applicable):
=============
RHGS 3.3.1 -> 3.5.0 (i.e. 3.8.4-54.15 -> 6.0.21)

How reproducible:
=================
Hit once, on 2 different EC volumes on the same cluster.


Steps to Reproduce:
1. Create an EC volume and a distributed-EC volume on RHGS 3.3.1; turn off disperse.eager-lock and disperse.optimistic-change-log (see the command sketch below)
2. Mount the volumes on 2 clients and run a Linux kernel untar as the I/O workload
3. Start the in-service upgrade, one node at a time
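
A minimal sketch of the option tuning in step 1, assuming a volume named "ecvol" (the name is illustrative); the same two options are also set on the distributed-disperse volume:

# gluster volume set ecvol disperse.eager-lock off
# gluster volume set ecvol disperse.optimistic-change-log off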

Comment 6 Pranith Kumar K 2019-10-30 05:41:56 UTC
https://code.engineering.redhat.com/gerrit/184178

Comment 12 SATHEESARAN 2019-12-18 17:48:22 UTC
Verified with the RHGS 3.5.1 interim build ( glusterfs-6.0-24.el7rhgs ) using the following steps:

1. Created a 6-node trusted storage pool ( gluster cluster ) with RHGS 3.3.1 ( glusterfs-3.8.4-54.15.el7rhgs )
2. Created 1x(4+2) and 2x(4+2) disperse volumes
3. Turned off disperse.eager-lock and disperse.optimistic-change-log on both volumes
4. Mounted the volumes from 2 clients
5. Started the kernel untar workload
6. Killed the glusterfsd (brick), glusterfs, and glusterd processes on node1 ( # pkill glusterfsd; pkill glusterfs; systemctl stop glusterd )
7. Upgraded the node to glusterfs-6.0-24.el7rhgs
8. After the successful upgrade, started glusterd
9. Waited for self-heal to complete on both disperse volumes
10. Repeated steps 6 to 9 on the other nodes, monitoring the progress of the kernel untar workload after each node's upgrade completed (the per-node cycle is sketched below)
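
A sketch of the per-node cycle (steps 6 to 9); the volume names ecvol1/ecvol2 and the use of a plain "yum update" for the package upgrade are illustrative assumptions, not taken from the report:

# pkill glusterfsd; pkill glusterfs; systemctl stop glusterd    # step 6: stop all gluster processes on the node
# yum update                                                    # step 7: upgrade the node to the new glusterfs build (assumed update method)
# systemctl start glusterd                                      # step 8: bring the node back into the pool
# gluster volume heal ecvol1 info                               # step 9: re-run until no entries are pending heal
# gluster volume heal ecvol2 info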

Observation
1. The kernel untar workload remained in progress with no interruption

With these steps, marking this bug as verified.

After upgrading the servers, the cluster op-version was also bumped up to 70000.
The clients were then unmounted, upgraded, and the disperse volumes remounted (a sketch of the op-version bump follows).
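
The op-version bump, as a minimal sketch; the set command is the standard gluster CLI for raising the cluster op-version, and the get command is an illustrative verification step:

# gluster volume set all cluster.op-version 70000    # raise the cluster op-version once all servers are upgraded
# gluster volume get all cluster.op-version          # confirm the new op-version took effect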

Comment 14 errata-xmlrpc 2020-01-30 06:42:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0288