Bug 1452563
| Summary: | deployment stuck if gluster node went offline | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Alexander Koksharov <akokshar> |
| Component: | heketi | Assignee: | Humble Chirammal <hchiramm> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | cns-3.5 | CC: | akhakhar, andcosta, annair, aos-bugs, bkunal, hchiramm, jarrpa, jmulligan, kramdoss, madam, pprakash, rgeorge, rhs-bugs, rreddy, rtalur, sankarshan, storage-qa-internal, vinug |
| Target Milestone: | --- | | |
| Target Release: | OCS 3.11 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhgs-volmanager-container-3.11.0-4 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-24 04:51:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1622458, 1629577 | | |
Description
Alexander Koksharov
2017-05-19 08:33:47 UTC
In the problem description you mentioned:

--snip--
I have successfully replicated this on v3.2 and it is also reproducible on v3.5. However, on 3.5 it is a bit better. I have installed two OCP clusters of different versions and integrated them with one two-node gluster cluster.
--/snip--

In CNS we only support the 'replica 3' configuration. Please refer to this doc: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.2/html/container-native_storage_for_openshift_container_platform/chap-documentation-red_hat_gluster_storage_container_native_with_openshift_platform-setting_the_environment-deploy_cns

Replica 2 could be prone to split-brain, so the recommendation is to use replica 3. Can you please look into this configuration and reproduce the issue again?

Hello Humble,

Could you please clarify how having a three-node gluster cluster can yield any improvement for this problem? If there is some kind of 'split brain' situation and the gluster node in the minority does not accept mount requests, why does the volume eventually get mounted after a delay of a few minutes? I had only two nodes in my test, and if they got split then, if the above were true (which it is not), nothing should work. But it does. It does not work well, but it does work!

The issue is not in the Gluster cluster at all. The issue is at the connection level. The OpenShift configuration consists of:
- endpoints representing each gluster node
- a service that load-shares across those endpoints (by means of iptables)

The system uses the service IP to connect to gluster. The service IP gets translated to one of the endpoint (gluster node) IPs. When the system tries to establish a connection, it chooses one endpoint and sends a TCP SYN. If the SYN gets no reply because of an iptables DROP rule, we see a delay of several minutes. If the SYN is rejected instead, meaning an "ICMP unreachable" is sent back, there is no delay in mounting the volume. (A minimal illustration of this difference is sketched further down.)

If you still insist this should be tested with a three-node gluster cluster, please explain the call flows between OpenShift and the gluster cluster when the gluster cluster is healthy and when one node has gone down. I would very much appreciate it if you could clarify what logic OpenShift applies when it tries to pick a node to connect to.

Thank you,
Lex.

Hi Humble,

Can you give me an update on this bugzilla?

Thanks,
Andre

@humble, can you please update the FIV for this bug?

Below are the test steps I executed as per comment 13 to move this bug to verified state (a command-level sketch of these steps also appears at the end of this report):

I tried performing test case 1 in comment 13 and I see the results below.
1) Created a cirros app pod with a gluster-backed PVC and mounted it at /mnt. I see that the volume is mounted using the server 10.70.46.170 inside the pod.
2) I powered off the node from which the volume was mounted.
3) I then restarted the app pod by deleting it with the command `oc delete pod <pod_name>`.
4) I now see that another pod comes up, and the volume is mounted inside the pod using the same node, 10.70.46.170.

Bug verification is done, but I am waiting for the FIV to be set so that this can be moved to verified state.

> Sorry Humble. My bad.

Nw :)

> Looks like the hypervisor where the VM is hosted has high memory consumption and due to this the command was stuck. Now the memory consumption has reduced, so I was able to execute the command. Thanks, Karthick, for helping me out with this.

I am glad to hear this :)

> Can you please update the Fixed In Version for this bug, which would help me to move the bug to verified state?
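The following is a minimal sketch of the DROP versus REJECT difference described above. It is not taken from this bug: the port (24007, the glusterd management port), the IP address, the volume name, and the mount point are illustrative assumptions only.

```sh
# Case 1: silently drop TCP SYNs to the gluster management port, which is
# effectively what a powered-off or firewalled node does. The client keeps
# retransmitting SYNs until the TCP connect timeout expires, so the mount
# hangs for several minutes before failing over.
iptables -A INPUT -p tcp --dport 24007 -j DROP

# Case 2: actively reject the connection instead. The client receives an
# ICMP unreachable immediately, gives up on this endpoint at once, and the
# mount proceeds through the remaining endpoints without a long delay.
iptables -D INPUT -p tcp --dport 24007 -j DROP
iptables -A INPUT -p tcp --dport 24007 -j REJECT --reject-with icmp-port-unreachable

# Observe the difference from an OpenShift node (volume name is hypothetical):
time mount -t glusterfs 10.70.46.170:/testvol /mnt/test
```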
rhgs-volmanager-container-3.11.0-4

Verified the fix in rhgs-volmanager-rhel7:3.11.0-4 and it works fine as per comment 24.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2986
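For reference, the verification flow quoted in the comments above (app pod with a gluster-backed PVC, powering off the gluster node, recreating the pod) could be exercised roughly as follows. This is only a sketch: the manifest and pod names are hypothetical, and 10.70.46.170 is simply the node address cited in the steps.

```sh
# 1. Deploy a cirros app pod backed by a gluster PVC mounted at /mnt
#    (manifest name is hypothetical).
oc create -f cirros-app-with-gluster-pvc.yaml
oc get pods -o wide

# 2. Check which gluster server the volume is mounted from inside the pod.
oc rsh <pod_name> mount | grep glusterfs    # e.g. 10.70.46.170:vol_... on /mnt

# 3. Power off the gluster node serving the mount, then delete the pod so
#    its controller schedules a replacement.
oc delete pod <pod_name>

# 4. Confirm the replacement pod starts and the gluster volume is mounted
#    again, even though the original server is offline.
oc get pods
oc rsh <new_pod_name> mount | grep glusterfs
```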