Bug 1498456

Summary: openshift HA hangs if one of the members disappears
Product: OpenShift Container Platform
Reporter: Alexander Koksharov <akokshar>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED CURRENTRELEASE
QA Contact: Wang Haoran <haowang>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.0
CC: aos-bugs, enagai, fabian, geliu, hgomes, jokerman, jorge_martinez, jrosenta, lxia, mfojtik, mmccomas, nnosenzo, rkshirsa, tkimura, vlaad, vwalek, wmeng
Target Milestone: ---
Target Release: 3.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-07 15:06:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alexander Koksharov 2017-10-04 11:30:57 UTC
Description of problem:

Environment:
- nodes are vmware instances
- multimaster install with 3 masters
- loadbalancer is an external device

First, the successful scenario:
- log in to one of the masters and run "shutdown -h now".
- open the OpenShift web console in a browser.
- everything works as expected.

Failing scenario:
- log in to the vmware console and forcibly halt _any_ one of the masters.
- open the OpenShift web console in a browser.
- the web console cannot load.
- an oc command executed on a master console hangs.

So, the sudden disappearance of a master node causes the problem.
I tried to take the load balancer out of the picture by defining the master's public name in the 'hosts' file, without any improvement.
If the master API service is restarted on _any_ of the remaining masters, everything returns to normal: the web console starts working and oc commands no longer hang.

We ran several tests together with the customer and could not find any condition in which a forceful halt of a master does not cause problems. We tried halting the master running the controller service, the one running the etcd leader, one running neither of these functions, etc. The problem is 100% reproducible on the customer's setup.
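
To make the hang window measurable during such a test, a small probe loop can help. The following is a minimal sketch, not part of the original report: the helper names `probe` and `watch` and the 10-second default timeout are assumptions chosen for illustration. It runs a command repeatedly, enforces a timeout, and records whether each attempt succeeded, failed, or hung.

```python
import subprocess
import time


def probe(cmd, timeout_s=10.0):
    """Run cmd once and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit or the command
    could not be started), or "hung" (the timeout expired).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        status = "hung"
    except OSError:
        # For example, the binary is not on PATH.
        status = "failed"
    return status, time.monotonic() - start


def watch(cmd, rounds, interval_s=1.0, timeout_s=10.0):
    """Probe cmd `rounds` times, print one line per attempt, return statuses."""
    statuses = []
    for _ in range(rounds):
        status, elapsed = probe(cmd, timeout_s)
        print(f"{time.strftime('%H:%M:%S')} {status:6s} {elapsed:6.2f}s")
        statuses.append(status)
        time.sleep(interval_s)
    return statuses
```

Assuming `oc` is on the PATH and already logged in, something like `watch(["oc", "get", "all"], rounds=600)` started shortly before halting a master would show when commands begin to hang and when they recover.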

Comment 5 Robert Rati 2017-10-05 19:38:50 UTC
What version of 3.5 is the customer running?  This sounds very similar to:

https://bugzilla.redhat.com/show_bug.cgi?id=1490427

In that scenario when the etcd master goes away, all cluster commands hang for 10-15 minutes, sometimes more.

Comment 6 Alexander Koksharov 2017-10-06 08:54:39 UTC
the following version is installed:
atomic-openshift-3.5.5.31

Comment 7 Alexander Koksharov 2017-10-06 09:00:38 UTC
Robert,

During our tests we tried halting:
- the etcd leader
- the openshift controller
- a master node which does not run any of these functions

The result was always the same: the API hangs, and restarting the master-api service on any of the remaining nodes unblocks it.

Is there any test we can run to confirm that the BZ you mentioned is actually about the same issue?

Comment 8 Robert Rati 2017-10-10 13:05:38 UTC
The best way that I can think of to verify it is the same issue is to try the fix with your scenario.

Comment 9 Robert Rati 2017-10-17 13:39:33 UTC
Please test this scenario with the fix for 
https://bugzilla.redhat.com/show_bug.cgi?id=1490427

Comment 10 ge liu 2017-10-23 09:45:34 UTC
Verified on an OCP environment: installed an HA environment on an AWS cluster (3 masters + 3 etcd + 2 nodes + LB), logged in to the AWS console and halted one of the master instances, then tried the OpenShift UI and the oc command; both work well.

openshift v3.7.0-0.158.0
kubernetes v1.7.6+a08f5eeb62

Comment 11 Alexander Koksharov 2017-10-23 09:58:36 UTC
Will there be a fix for 3.6, or is upgrading the only way to overcome this?

Comment 12 Alexander Koksharov 2017-10-23 10:01:21 UTC
Oh. Customer is on 3.5. Is there an option to backport?

Comment 19 Takayoshi Kimura 2018-02-28 06:27:25 UTC
I tested this on my fresh 3.6 HA cluster with HAProxy on RHEV.

Performed "echo c > /proc/sysrq-trigger" on the master03 host while running "oc get all" in a loop. The "oc get all" command hung for 30 sec, then returned a normal result.

Looking at the master logs, it seems that the etcd cluster detected the member failure in 5 sec, while the ongoing connections from the API to the master03 etcd waited for the 30 sec timeout, then retried and succeeded. This is reasonable behavior.

atomic-openshift-3.6.173.0.96-1.git.0.8f6ff22.el7.x86_64

There may be different behavior depending on timing or other environmental factors. At least this does not seem to be 100% reproducible.

Comment 20 Takayoshi Kimura 2018-02-28 08:02:15 UTC
I can reproduce the hang in a 2nd test. This time the master01 etcd leader went down, the API service keeps timing out repeatedly, and it seems that the dead etcd is not expelled from the request targets. I will create another ticket and attach logs.
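
The difference between the two test runs (one 30 sec stall followed by recovery, versus repeated timeouts) can be illustrated with a toy model. This is only a sketch under stated assumptions: `SimulatedClient`, its round-robin selection, and the `evict_on_timeout` flag are hypothetical and do not reproduce the real etcd client used by the API server. Only the 30 sec timeout and the host names come from this thread. The model counts how long requests stall when a dead endpoint is, or is not, dropped from the target list after a timeout.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedClient:
    """Toy model of a client choosing one endpoint per attempt (round-robin).

    An attempt against a dead endpoint costs one full timeout. Nothing here
    sleeps or opens connections; only the stall time is added up.
    """
    endpoints: list
    dead: set
    timeout_s: float = 30.0
    evict_on_timeout: bool = True
    _next: int = field(default=0, init=False)

    def request(self):
        """Return the seconds stalled before a live endpoint answers."""
        stalled = 0.0
        # Try each currently known endpoint at most once per request.
        for _ in range(len(self.endpoints)):
            target = self.endpoints[self._next % len(self.endpoints)]
            if target not in self.dead:
                self._next += 1
                return stalled
            stalled += self.timeout_s
            if self.evict_on_timeout:
                # Dead endpoint is expelled; later requests never pick it.
                self.endpoints.remove(target)
            else:
                # Dead endpoint stays in rotation and will be retried later.
                self._next += 1
        raise RuntimeError("every endpoint timed out")


def total_stall(client, requests):
    """Sum the stall time over a number of consecutive requests."""
    return sum(client.request() for _ in range(requests))


masters = ["master01", "master02", "master03"]
with_eviction = SimulatedClient(list(masters), {"master01"}, evict_on_timeout=True)
without_eviction = SimulatedClient(list(masters), {"master01"}, evict_on_timeout=False)
print(total_stall(with_eviction, 6))     # 30.0: one stall, then recovery
print(total_stall(without_eviction, 6))  # 90.0: every third attempt stalls
```

In this model, expelling the dead endpoint gives a single 30 sec stall, while keeping it in rotation makes the timeouts recur for as long as requests continue.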

Comment 21 Takayoshi Kimura 2018-03-02 05:53:46 UTC
I created a fresh Bugzilla for the hang issue using OpenShift 3.6, with details: https://bugzilla.redhat.com/show_bug.cgi?id=1550470