Bug 1347251
Summary: | fix the issue of Rolling upgrade or non-disruptive upgrade of disperse or erasure code volume to work | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka>
Component: | disperse | Assignee: | Ashish Pandey <aspandey>
Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | amukherj, aspandey, asrivast, pkarampu, rcyriac, rhinduja, rhs-bugs
Target Milestone: | --- | Flags: | aspandey: needinfo-
Target Release: | RHGS 3.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.8.4-1 | Doc Type: | Bug Fix
Doc Text: | Online rolling upgrades were not possible from Red Hat Gluster Storage 3.1.x to 3.1.y (where y is more recent than x) because of client limitations. Red Hat Gluster Storage 3.2 enables online rolling upgrades from 3.2.x to 3.2.y (where y is more recent than x). | Story Points: | ---
Clone Of: | | |
Clones: | 1347686, 1422539 (view as bug list) | Environment: |
Last Closed: | 2017-03-23 05:36:44 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1347686, 1351522, 1351530, 1360152, 1360174 | |
Description
Nag Pavan Chilakam
2016-06-16 11:44:31 UTC
Note for QE and Dev: Once this bug is fixed, make sure that the documentation is updated for that release accordingly. Refer to bug 1347252 - [DOC]: Have a note saying non-disruptive or rolling upgrade is not supported for a disperse volume.

Upstream mainline: http://review.gluster.org/14761
Upstream 3.8: http://review.gluster.org/15013

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

Still seeing EIO: I had a 4+2 setup on 3 nodes, i.e. 2 EC bricks on each node. While pumping I/O (a linux untar) from one client, I upgraded node #1 and healing completed. I then brought down node #2, and EIO appeared as soon as node #2 was down:

```
tar: linux-4.8.6/drivers/nfc/nfcmrvl/spi.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nfcmrvl/uart.c
tar: linux-4.8.6/drivers/nfc/nfcmrvl/uart.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nfcmrvl/usb.c
tar: linux-4.8.6/drivers/nfc/nfcmrvl/usb.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nfcsim.c
tar: linux-4.8.6/drivers/nfc/nfcsim.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nfcwilink.c
tar: linux-4.8.6/drivers/nfc/nfcwilink.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/
tar: linux-4.8.6/drivers/nfc/nxp-nci: Cannot mkdir: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/Kconfig
tar: linux-4.8.6/drivers/nfc/nxp-nci/Kconfig: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/Makefile
tar: linux-4.8.6/drivers/nfc/nxp-nci/Makefile: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/core.c
tar: linux-4.8.6/drivers/nfc/nxp-nci/core.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/firmware.c
tar: linux-4.8.6/drivers/nfc/nxp-nci/firmware.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/i2c.c
tar: linux-4.8.6/drivers/nfc/nxp-nci/i2c.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/nxp-nci/nxp-nci.h
tar: linux-4.8.6/drivers/nfc/nxp-nci/nxp-nci.h: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/
tar: linux-4.8.6/drivers/nfc/pn533: Cannot mkdir: Input/output error
linux-4.8.6/drivers/nfc/pn533/Kconfig
tar: linux-4.8.6/drivers/nfc/pn533/Kconfig: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/Makefile
tar: linux-4.8.6/drivers/nfc/pn533/Makefile: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/i2c.c
tar: linux-4.8.6/drivers/nfc/pn533/i2c.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/pn533.c
tar: linux-4.8.6/drivers/nfc/pn533/pn533.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/pn533.h
tar: linux-4.8.6/drivers/nfc/pn533/pn533.h: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn533/usb.c
tar: linux-4.8.6/drivers/nfc/pn533/usb.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn544/
tar: linux-4.8.6/drivers/nfc/pn544: Cannot mkdir: Input/output error
linux-4.8.6/drivers/nfc/pn544/Kconfig
tar: linux-4.8.6/drivers/nfc/pn544/Kconfig: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn544/Makefile
tar: linux-4.8.6/drivers/nfc/pn544/Makefile: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn544/i2c.c
tar: linux-4.8.6/drivers/nfc/pn544/i2c.c: Cannot open: Input/output error
linux-4.8.6/drivers/nfc/pn544/mei.c
```

Another note: on a 6-node setup I am seeing spurious heal info entries (the only mistake or configuration change from my side was that the clients were already upgraded). I tried this on a 4+2 volume on a 6-node setup: I fuse-mounted the volume on two clients and triggered a linux untar and a dd of 10000 files on each client, then brought down node #1 and node #2 and upgraded them.
The upgrade went smoothly, but heal info never completes because of spurious entries: the files being written at that moment are shown as pending heal. This leaves the admin confused, since heal info never shows completion until I/O is stopped, so the admin can never proceed with finishing the upgrade. Discussed with Pranith and hence moving to failed_qa (at best this is still a blocker for verifying this bug).

```
[root@dhcp35-239 ~]# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-server-3.7.9-12.el7rhgs.x86_64
glusterfs-client-xlators-3.7.9-12.el7rhgs.x86_64
python-gluster-3.7.9-12.el7rhgs.noarch
glusterfs-libs-3.7.9-12.el7rhgs.x86_64
glusterfs-api-3.7.9-12.el7rhgs.x86_64
glusterfs-cli-3.7.9-12.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-12.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.el7rhgs.noarch
glusterfs-rdma-3.7.9-12.el7rhgs.x86_64
glusterfs-3.7.9-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-12.el7rhgs.x86_64
```

The main issue of this BZ has been fixed on the client side:

This issue arises when we do a rolling update from 3.7.5 to 3.7.9. For a 4+2 volume running 3.7.5, if we update 2 nodes and, after heal completion, kill 2 of the older nodes, this problem can be seen. After the update and the killing of bricks, 2 nodes return the inodelk-count key in their dict while the other 2 nodes do not; the same is true for get-link-count. During the dictionary comparison in ec_dict_compare, this leads to a mismatch of answers, and the file operation on the mount point fails with an I/O error. To solve this, do not match the inode lock, entry lock and link count keys while comparing two dictionaries; instead, while combining the data in ec_dict_combine, go through all the dictionaries and pick the maximum value received for these keys. Because of this, the client has to be upgraded first, so that the client never compares the inodelk counts, which come from the servers and differ between versions. This is what was mentioned in the previous comment. (A simplified sketch of this compare/combine logic is shown below.)

However, this brings a new issue. In patch http://review.gluster.org/#/c/13733/, we set a dirty flag for every update fop: we create index entries, perform the fop, and then unset the dirty flag. So while a fop is in progress and we run the heal info command, it also lists the files on which I/O is going on. Setting and clearing the dirty flag is triggered from the client side. To solve this we have sent patch http://review.gluster.org/#/c/15543/: it takes a lock on the file for which an index entry was created, checks the version and size, and lists the file in heal info only when these two differ. (This check is also sketched below.)

The problem is:
- We have upgraded the client, which means it will set the dirty flag and create index entries on the servers.
- We are now upgrading the server nodes one by one, so some old nodes will NOT have the second patch, which takes locks on files and investigates whether they really need heal. The nodes running the old version will therefore list those indices as needing heal.
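The following is a minimal, self-contained illustration of the compare/combine idea described above. It is not the actual GlusterFS code: `struct answer`, `answers_match`, `combine_count` and the key names are simplified stand-ins for gluster's dict handling in ec_dict_compare/ec_dict_combine.

```c
/*
 * Minimal, self-contained sketch (NOT the actual GlusterFS code) of the
 * client-side fix described above: ignore the lock/link count keys when
 * comparing per-brick answers, and take the maximum when combining them.
 * The structures, key names and sample data are all illustrative.
 */
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8

struct kv     { const char *key; long value; };
struct answer { struct kv kvs[MAX_KEYS]; int nkvs; };

/* Keys that newer (3.7.9) bricks return but older (3.7.5) bricks do not. */
static const char *count_keys[] = {
    "inodelk-count", "entrylk-count", "link-count", NULL
};

static int is_count_key(const char *key)
{
    for (int i = 0; count_keys[i]; i++)
        if (strcmp(key, count_keys[i]) == 0)
            return 1;
    return 0;
}

static const struct kv *find_key(const struct answer *a, const char *key)
{
    for (int i = 0; i < a->nkvs; i++)
        if (strcmp(a->kvs[i].key, key) == 0)
            return &a->kvs[i];
    return NULL;
}

/*
 * Compare two answers while skipping the count keys, so that a brick that
 * does not send them no longer causes a spurious mismatch (the EIO seen on
 * the mount point).  A full implementation would compare in both directions.
 */
static int answers_match(const struct answer *a, const struct answer *b)
{
    for (int i = 0; i < a->nkvs; i++) {
        if (is_count_key(a->kvs[i].key))
            continue;
        const struct kv *other = find_key(b, a->kvs[i].key);
        if (!other || other->value != a->kvs[i].value)
            return 0;
    }
    return 1;
}

/* While combining answers, keep the maximum value seen for a count key. */
static long combine_count(const struct answer *answers, int n, const char *key)
{
    long max = 0;
    for (int i = 0; i < n; i++) {
        const struct kv *kv = find_key(&answers[i], key);
        if (kv && kv->value > max)
            max = kv->value;
    }
    return max;
}

int main(void)
{
    /* Two upgraded bricks report inodelk-count, two old bricks do not. */
    struct answer answers[4] = {
        { { { "size", 4096 }, { "inodelk-count", 1 } }, 2 },
        { { { "size", 4096 }, { "inodelk-count", 2 } }, 2 },
        { { { "size", 4096 } }, 1 },
        { { { "size", 4096 } }, 1 },
    };

    printf("new vs old brick match: %d\n", answers_match(&answers[0], &answers[2]));
    printf("combined inodelk-count: %ld\n", combine_count(answers, 4, "inodelk-count"));
    return 0;
}
```

Taking the maximum in the combine step means the client still ends up with a usable count even when only some bricks provided one, while the mismatch no longer poisons the comparison.

Similarly, here is a hedged sketch of the heal-info decision introduced by the second patch: an index entry created by the dirty flag is reported as pending heal only if, with the file locked, the per-brick version/size actually disagree. Again, the types below are illustrative, not gluster's real ones.

```c
/*
 * Illustrative sketch (not the real GlusterFS implementation) of the
 * heal-info change: report a file only when the per-brick version/size
 * differ, not merely because a dirty index entry exists.
 */
#include <stdbool.h>
#include <stdio.h>

struct brick_state {
    unsigned long long version;  /* version counter stored by the brick */
    unsigned long long size;     /* file size as seen by the brick      */
};

/* Caller is assumed to hold the inode lock, so the values are stable. */
static bool needs_heal(const struct brick_state *bricks, int nbricks)
{
    for (int i = 1; i < nbricks; i++)
        if (bricks[i].version != bricks[0].version ||
            bricks[i].size != bricks[0].size)
            return true;
    return false;
}

int main(void)
{
    /* In-flight write: a dirty index entry exists, but all bricks agree,
     * so the file should NOT appear in heal info. */
    struct brick_state in_flight[3] = { {7, 4096}, {7, 4096}, {7, 4096} };
    /* Genuine damage: one brick lags behind, so heal is really needed. */
    struct brick_state damaged[3]   = { {7, 4096}, {7, 4096}, {5, 1024} };

    printf("in-flight write needs heal: %d\n", needs_heal(in_flight, 3));
    printf("damaged file needs heal:    %d\n", needs_heal(damaged, 3));
    return 0;
}
```

As the comment above notes, the catch is that pre-3.2 bricks do not perform this locked version/size check, which is why a rolling upgrade from a pre-3.2 base still shows spurious heal entries.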
I am setting require_doc_text to "?" because I think customers should know about the rolling upgrade issue for 3.1.x, even though it is not fixed for 3.1.x.

Also make sure to retest (as part of sanity) bug 1347257 - spurious heal info as pending heal entries never end on an EC volume while IOs are going on.

Cloned this bug to 1422539 - IO error seen with Rolling or non-disruptive upgrade of a distribute-disperse (EC) volume from 3.1.2 to 3.1.3.

I have closed BZ#1422539 as CANTFIX, since fixing it is theoretically not possible: we cannot make changes to an already shipped release. Hence I am changing the title of this BZ to reflect the actual problem, i.e. rolling upgrade of an EC volume.

Based on the above comments, I am moving this to verified. Points to note with this fix:

1) Rolling or non-disruptive upgrade works with a base release of 3.2.0 to any later release, i.e. 3.2.0 to anything beyond 3.2.0, say 3.3.
2) Rolling or non-disruptive upgrade does NOT work from a base release older than 3.2 to any supported release. For example, 3.1.3 to 3.2, 3.1.2 to 3.1.3, or 3.1.2 to 3.2 will not work; do a disruptive upgrade instead.
3) To test the fix, we upgraded between dev builds of 3.2.0, i.e. from 3.8.4-3 to 3.8.4-12, and the rolling upgrade worked in this case.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.