Bug 1306656
Summary: [GSS] - Brick ports changed after configuring I/O and management encryption
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: rhgs-3.1
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-03-23 05:27:00 UTC
Reporter: Mukul Malhotra <mmalhotr>
Assignee: Kaushal <kaushal>
QA Contact: Byreddy <bsrirama>
CC: amukherj, bkunal, bsrirama, ccalhoun, kaushal, kramdoss, mmalhotr, rcyriac, rhinduja, rhs-bugs, storage-qa-internal, vbellur
Bug Depends On: 1313628, 1316391
Bug Blocks: 1268895, 1351515
Comment 2
Atin Mukherjee
2016-02-11 16:05:22 UTC
I am able to set up management SSL and I/O encryption and start the volume without needing to force-start it. Did you restart glusterd on all nodes after setting up management SSL? If you have not done that, you won't be able to start the volume. Steps I followed:

1) Set up a 2-node gluster cluster with build glusterfs-server-3.7.1-16.el7rhgs.x86_64
2) Create the necessary certificates (glusterfs.key, glusterfs.pem and glusterfs.ca) on all servers and clients
3) Enable management SSL (`touch /var/lib/glusterd/secure-access` on all nodes)
4) Restart glusterd
5) Create the volume and set the necessary SSL options (auth.allow, server.ssl on and client.ssl on)
6) Start the volume

GlusterD allocates a port for a brick only if brickinfo->port is 0. Normally, brickinfo->port is saved to and restored from the brick info file in the /var/lib/glusterd/vols/<volname>/bricks directory. But since the ports appear to have changed on restart in this case, it's possible that the port information either wasn't stored or wasn't restored. I'll check the sos-reports to see what I can find.

Mukul, can you get the names of the actual volumes whose bricks changed ports? I cannot properly analyze the logs without knowing what I'm looking for.

Also, I don't think this bug and #1304274 are related right now, but I need to investigate further. As I mentioned earlier, glusterd should only assign a new port to a brick if brickinfo->port is 0. This should only happen if the brick restore didn't correctly restore the port from the stored info file. I'm still trying to figure out from the logs if/how this could happen. I do see the assigned port for a brick changing between two consecutive start logs for the brick, so this is real.

Hello Kaushal,

OK, so does this issue have any relation to management encryption? Can you explain the issue in detail?

Thanks
Mukul

I don't believe this is related to management encryption. Management encryption should only affect connections to/from GlusterD.
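The setup steps above can be sketched as shell commands. This is a hedged illustration, not the exact commands from the case: the node names, volume name, and certificate CN are hypothetical, and `auth.ssl-allow` (the SSL identity option) is shown where the comment mentions an auth option.

```shell
# On every server and client (hypothetical CN "gluster.example.com"):
openssl genrsa -out /etc/ssl/glusterfs.key 2048
openssl req -new -x509 -key /etc/ssl/glusterfs.key \
    -subj "/CN=gluster.example.com" -out /etc/ssl/glusterfs.pem
# Concatenate every node's .pem into the shared CA file on all machines:
cat node1.pem node2.pem > /etc/ssl/glusterfs.ca

# Enable management encryption on every node, then restart glusterd:
touch /var/lib/glusterd/secure-access
systemctl restart glusterd

# Create the volume and enable I/O encryption:
gluster volume create testvol replica 2 node1:/bricks/b1 node2:/bricks/b1
gluster volume set testvol auth.ssl-allow 'gluster.example.com'
gluster volume set testvol server.ssl on
gluster volume set testvol client.ssl on
gluster volume start testvol
```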
GlusterD assigns the port for a brick only when it starts the brick for the first time. At first start, the brick's port is 0; the brick-start function in GlusterD searches for a free port and assigns it to the brick when the port is 0. Once a brick has been assigned a port, GlusterD persists this information in the brickinfo file at /var/lib/glusterd/vols/<volume>/bricks/<brickinfo-file>. The port is passed to the brick as a command-line argument, so there is no way management encryption could affect this. Whenever GlusterD restarts, it reads this file and restores the port number for the brick before starting the brick. But since the port changed/was re-assigned, and the only way a new port could have been assigned is if the port number was 0 when the brick started, I think there could have been a failure to restore the port. I'm trying to verify whether this is even possible. I'll also try to find out whether there is another path that could lead to the port becoming 0.

So I've found a sequence of actions that leads to brick ports getting reassigned. It is not dependent on management encryption in any way, but it can be hit when enabling encryption. The sequence is as follows (assume a volume with bricks across a cluster):

1. Stop the volume.
2. Stop glusterd on one node.
3a. Start the volume from some other node, or
3b. perform a volume set operation.
4. Start glusterd on the downed node again.
5. If 3b was done, start the volume now.

This leads to the port changing for bricks on the node whose glusterd was down. This is an existing bug in glusterd, which was unknown till now. The sequence could be hit during the process of enabling management encryption.

Hello Kaushal,
Thanks for the analysis. I had provided the suggested analysis to the customer.
>This should lead to the port for bricks on the node with the downed glusterd changing. This is an existing bug in glusterd, which was unknown till now.
Could you provide the BZ?
Thanks
Mukul
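The reassignment sequence Kaushal describes might look like this on a two-node cluster. Node names, the volume name, and the option used for the "volume set" step are hypothetical illustrations, not taken from the customer case:

```shell
# Step 1-2, from node1: stop the volume, then stop glusterd on node2.
gluster --mode=script volume stop testvol
ssh node2 systemctl stop glusterd

# Step 3, while node2's glusterd is down:
#   variant 3a: start the volume from node1 now, or
#   variant 3b (shown): change any volume option instead:
gluster volume set testvol cluster.min-free-disk 5%

# Step 4-5: bring node2's glusterd back, then start the volume.
ssh node2 systemctl start glusterd
gluster volume start testvol

# node2's bricks now listen on different ports than before:
gluster volume status testvol
```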
(In reply to Mukul Malhotra from comment #24)
> Hello Kaushal,
>
> Thanks for the analysis. I had provided the suggested analysis to the customer.
>
> > This should lead to the port for bricks on the node with the downed glusterd changing. This is an existing bug in glusterd, which was unknown till now.
>
> Could you provide the BZ?

There is no old BZ, since the issue was unknown until now.

(In reply to Mukul Malhotra from comment #26)
> Hello,
>
> > From the steps that the customer provided in case#01573615, the above mentioned sequence was very likely hit. If you can confirm that this is indeed the case, it would be helpful. In any case, we'll start working on fixing this.
>
> The customer's sequence matches the suggested steps for enabling encryption.
>
> Also, the customer wanted to know the following:
>
> * When would this bug get fixed?

This is a simple enough bug to fix, but the fix will definitely not be available in 3.1.2. We can get it into 3.1.3 if we get the fix upstream before the downstream rebase. Until then, to avoid this bug, the documentation for enabling management encryption could explicitly state that no operation should be performed on the volumes before all GlusterDs are back up. This includes starting the stopped volume or setting options on the volume.

> * Is there a feature coming to manually assign and force ports?

This is the first time I'm hearing of a request for this feature. It would be an RFE, and it would need to be evaluated for feasibility of implementation.

Kaushal, could you update the doc text? ~Atin

Doc text looks good to me.

Upstream patch http://review.gluster.org/13578 is merged now.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

The brick port allocation logic changed in RHGS 3.2 and was verified in BZ 1263090; with the new logic, brick ports can change depending on the operations performed.
The behavior expected from this bug is superseded by the fix for BZ 1263090. Moving to the verified state based on the BZ 1263090 verification details.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
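As a recap of the root cause discussed above, here is a minimal shell model of the port-handling behavior. This is a toy sketch, not glusterd source: a brick keeps its port only when it is restored from the persisted brickinfo file, and any path that leaves the in-memory port at 0 (such as volinfo being re-imported from peers while this node's glusterd was down) causes a fresh allocation.

```shell
#!/bin/sh
# Toy model of glusterd's brick-port handling (hypothetical, not glusterd code).
set -e
info=$(mktemp)                 # stands in for vols/<vol>/bricks/<brickinfo-file>
next=49152                     # start of gluster's default brick-port range

assign_port() {
  # $1 = port restored from disk; 0 means nothing was restored
  if [ "$1" -eq 0 ]; then
    port=$next                 # allocate a fresh port
    next=$((next + 1))
  else
    port=$1                    # keep the persisted port
  fi
  echo "$port" > "$info"       # persist the port, as glusterd does on brick start
}

assign_port 0                  # first brick start: no port assigned yet
first=$(cat "$info")

assign_port "$(cat "$info")"   # normal restart: port restored, stays the same
echo "normal restart keeps port $port"

assign_port 0                  # buggy path: port lost while glusterd was down,
                               # so a new port is allocated
echo "buggy restart reassigned port $port"
rm -f "$info"
```

Running it prints that the normal restart keeps the first port (49152) while the buggy path moves the brick to the next free port, which is exactly the symptom reported in this bug.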