Bug 1094716 - Upgrade: glusterd crashes when a peer detach force is executed from 2.1 (The detached machine is on 3.0)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Kaushal
QA Contact: Sachidananda Urs
URL:
Whiteboard:
Depends On: 1101903
Blocks:
 
Reported: 2014-05-06 10:57 UTC by Sachidananda Urs
Modified: 2015-05-15 17:41 UTC
CC: 4 users

Fixed In Version: glusterfs-3.6.0.14-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Cause: Glusterd was not fully backward compatible with RHS-2.1.
Consequence: A peer probe issued from an RHS-2.1 peer did not complete successfully, and glusterd crashed when a peer detach was then attempted.
Fix: Glusterd has been made backward compatible with RHS-2.1.
Result: Peer probes now complete successfully, removing the need for a peer detach, and glusterd no longer crashes.
Clone Of:
Environment:
Last Closed: 2014-09-22 19:36:39 UTC
Embargoed:


Attachments
glusterd log (29.85 KB, text/x-log)
2014-05-06 10:57 UTC, Sachidananda Urs


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Sachidananda Urs 2014-05-06 10:57:22 UTC
Created attachment 892840 [details]
glusterd log

Description of problem:

While doing a rolling upgrade, with one machine on 2.1U2 and the other on 3.5qa2, `peer detach force` is executed on the 2.1U2 node to detach the upgraded (3.5qa2) machine.
glusterd on the 2.1U2 node crashes, and the peer detach never succeeds.

Version-Release number of selected component (if applicable):

glusterfs 3.4.0.59rhs built on Feb  4 2014 08:44:13
glusterfs 3.5qa2 built on May  2 2014 06:17:44

How reproducible:

Always.

Steps to Reproduce:
1. Create a 2x2 cluster (Version: 2.1U2)
2. Perform a rolling upgrade (I/O is not stopped): upgrade one of the machines to 3.5qa2, then peer probe the 2.1U2 machine from the 3.5qa2 node.
3. From the 2.1U2 node, execute `peer detach force` to detach the 3.5qa2 node (commands are sketched below the steps).
4. glusterd on the 2.1U2 node crashes.
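
A minimal command sketch of steps 2-3; the hostnames rhs21-node and rhs30-node are placeholders:

    # On the upgraded (3.5qa2) node, probe the 2.1U2 peer:
    gluster peer probe rhs21-node

    # On the 2.1U2 node, force-detach the upgraded peer:
    gluster peer detach rhs30-node force

    # On the 2.1U2 node, confirm the crash:
    gluster peer status        # the CLI fails once glusterd has gone down
    service glusterd status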

Actual results:

glusterd crashes

Expected results:

glusterd should not crash

Additional info:

Comment 1 Sachidananda Urs 2014-05-06 11:00:24 UTC
Core dumps can be found at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1094716/

Comment 3 Kaushal 2014-05-21 09:05:20 UTC
Sac,
The cores don't seem to be working for me. I'm trying to get a backtrace against glusterfs-3.4.0.59rhs with them, and all I get are '??' frames. This could be due to an issue with my environment, but I just want to confirm that they are from glusterfs-3.4.0.59rhs.

Also, it will be helpful if you could provide logs from the crashed node.
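
For reference, this is roughly how a backtrace would be pulled from the cores; the core file path is a placeholder, and '??' frames usually mean the installed binary/debuginfo does not match the build that produced the core:

    # install debuginfo matching the glusterfs build that crashed:
    debuginfo-install glusterfs
    # then dump a full backtrace non-interactively:
    gdb -batch -ex "bt full" /usr/sbin/glusterd /path/to/core.<pid>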

Comment 6 Kaushal 2014-05-27 10:28:32 UTC
Thanks Sac for giving access to the systems, it's gonna help identify the problem quickly.

Meanwhile, I've got concerns about this bug's status as a possible release blocker for rhs-3.0.

We currently don't have officially documented steps for rhs-2.1 to rhs-3.0 upgrades. Proper upgrade documentation would prevent the user from performing a combination of operations, as in this bug, which could lead to failures. So, with proper upgrade documentation, we would not be hitting this issue.

Also, the crash was observed in rhs-2.1, not in rhs-3.0, so having this bug as a blocker for rhs-3.0 doesn't feel right.

Because of the above two reasons, I believe this bug should be removed from the scope of the rhs-3.0 release and be targeted for a future release.

I'll still continue my investigation as I've got access to the systems currently and I'd most likely not be able to get it later.

Comment 7 Sachidananda Urs 2014-05-27 11:59:40 UTC
Kaushal, we are documenting rolling upgrades for Denali, which means that during upgrades we will have both 2.1U2 and 3.0 machines in the cluster. And in the case above:

* We upgrade one of the machines and probe 3.0 into the cluster.

Is there any other way we are planning to do this? When can I expect the officially documented steps? I'm blocked on upgrade testing...

Comment 8 Kaushal 2014-05-27 12:38:52 UTC
(In reply to Sachidananda Urs from comment #7)
> Kaushal, we are documenting rolling upgrade for Denali. Which means, during
> upgrades we will have both 2.1U2 and 3.0 machines. And in the case above:
> 
> * We upgrade one of the machines and probe 3.0 into the cluster.
> 
> Is there any other way we are planning to do this? When can I expect the
> officially documented steps? I'm blocked on upgrade testing...

The steps you are following, along with saving and restoring /var/lib/glusterd, should be the required upgrade steps, and they should ideally work without needing any more intervention. But that isn't the case right now.
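
A minimal sketch of the save/restore approach described above; the exact package-update step is an assumption here, not the official procedure:

    # on the node being upgraded
    service glusterd stop
    cp -a /var/lib/glusterd /var/lib/glusterd.backup
    # upgrade the glusterfs packages (replace with the documented upgrade step)
    yum update 'glusterfs*'
    # restore the saved configuration and restart glusterd
    cp -a /var/lib/glusterd.backup/. /var/lib/glusterd/
    service glusterd start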

rhs-3.0 has added some new fields to the volinfo struct which are needed for snapshots. As it currently stands, the import_volinfo code, which imports volumes during peer handshaking, expects these fields to be present and fails the import if they are not. Failure of this import leads to different behaviours depending on the original starting point:

1. If this import was being done on a new peer during a peer probe, its failure would leave the probe process stuck. The peer status would show an intermediate 'Probe sent to peer' state (example `gluster peer status` output for both cases is sketched after point 2).
Doing a peer detach in this case could lead to glusterd crashing, as happened in this bug report.

2. If this import was happening during the startup of a glusterd, the failure could lead to the peer entering a peer rejected state (I still need to validate whether this is correct).
So even if /var/lib/glusterd was saved and restored during the upgrade, this could still lead to the peer being rejected, thus causing upgrade problems.
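
For reference, the two states show up in `gluster peer status` roughly like this (hostname and UUID are placeholders; the exact state strings can vary between versions):

    # case 1: probe stuck after the failed import
    Hostname: rhs30-node
    Uuid: <peer-uuid>
    State: Probe Sent to Peer (Connected)

    # case 2: peer rejected when the import fails at glusterd startup
    Hostname: rhs30-node
    Uuid: <peer-uuid>
    State: Peer Rejected (Connected)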

So we still need to fix rhs-3.0, as it is not backwards compatible enough yet. But we also still need to come up with the correct documentation for upgrades.

I'm removing the rhs-future flag and giving devel_ack for this bug. Please mark this bug as a blocker, as we shouldn't be breaking backwards compatibility as we are doing now.

Comment 10 Nagaprasad Sathyanarayana 2014-06-09 05:07:19 UTC
Patch merged upstream. Hence moving it to POST.

Comment 11 Atin Mukherjee 2014-06-09 05:23:52 UTC
Patch merged downstream : https://code.engineering.redhat.com/gerrit/#/c/26430/

Comment 12 Sachidananda Urs 2014-06-11 09:39:35 UTC
Verified on glusterfs 3.6.0.15

I no longer see the crash, with 2.1U2 and 3.0 machines in the cluster.

Comment 14 errata-xmlrpc 2014-09-22 19:36:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

