Bug 1498456
| Summary: | openshift HA hangs if one of the members disappears | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alexander Koksharov <akokshar> |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Wang Haoran <haowang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.0 | CC: | aos-bugs, enagai, fabian, geliu, hgomes, jokerman, jorge_martinez, jrosenta, lxia, mfojtik, mmccomas, nnosenzo, rkshirsa, tkimura, vlaad, vwalek, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-07 15:06:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alexander Koksharov
2017-10-04 11:30:57 UTC
What version of 3.5 is the customer running? This sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1490427. In that scenario, when the etcd master goes away, all cluster commands hang for 10-15 minutes, sometimes more.

The following version is installed: atomic-openshift-3.5.5.31

Robert, during our tests we tried to halt:
- the etcd leader
- the openshift controller
- a master node which does not run either of these functions

The result was always the same: the API hangs, and restarting the master-api service on any of the remaining nodes unblocks it. Is there any test we can run to confirm that the bug you mentioned is actually the same issue?

The best way I can think of to verify that it is the same issue is to try the fix with your scenario. Please test this scenario with the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1490427

Verified on an OCP environment: installed an HA environment on an AWS cluster (3 masters + 3 etcd + 2 nodes + LB), logged in to the AWS console and halted one of the master instances, then tried the OpenShift UI and the oc command. Both work well.
openshift v3.7.0-0.158.0
kubernetes v1.7.6+a08f5eeb62

Will there be a fix for 3.6, or is upgrading the only way to overcome this?

Oh. The customer is on 3.5. Is there an option to backport?

I tested this on my fresh 3.6 HA cluster with HAProxy on RHEV. I ran "echo c > /proc/sysrq-trigger" on the master03 host while running "oc get all" in a loop. The "oc get all" command hung for 30 sec and then returned a normal result. Looking at the master logs, it seems that the etcd cluster detected the member failure in 5 sec; the ongoing connections from the API to the master03 etcd waited for the 30 sec timeout, then retried and succeeded. This is reasonable behavior.
atomic-openshift-3.6.173.0.96-1.git.0.8f6ff22.el7.x86_64

The behavior may differ depending on timing or other environmental factors. At least, this does not seem to be 100% reproducible.

I can reproduce the hang in a 2nd test. This time the master01 etcd leader was down; the API service keeps timing out, and it seems that the dead etcd member is not removed from the request targets. I will create another ticket and attach logs.

I created a fresh Bugzilla for the hang issue on OpenShift 3.6, with details: https://bugzilla.redhat.com/show_bug.cgi?id=1550470
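The reproduction used in the comments above (running "oc get all" in a loop while one master is taken down) could be scripted so that each call is timed and slow or failed calls are recorded. The following is only a rough sketch of that idea, not a tool used in this bug: the function names `probe` and `watch` and all thresholds (5 s "slow" mark, 60 s timeout, 1 s pause) are hypothetical choices made for illustration.

```python
import subprocess
import time


def probe(cmd, timeout_s=60.0):
    """Run cmd once and return (elapsed_seconds, status).

    status is "ok", "error" (non-zero exit), "timeout", or "missing"
    (the executable could not be started).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # The call did not finish within timeout_s: treat it as a hang.
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return time.monotonic() - start, status


def watch(cmd, iterations, slow_after_s=5.0, timeout_s=60.0, pause_s=1.0):
    """Probe cmd repeatedly; return the samples that were slow or failed."""
    flagged = []
    for i in range(iterations):
        elapsed, status = probe(cmd, timeout_s)
        if status != "ok" or elapsed > slow_after_s:
            flagged.append((i, round(elapsed, 1), status))
        time.sleep(pause_s)
    return flagged


# Example call (assumes the oc client is installed and logged in):
#   for sample in watch(["oc", "get", "all"], iterations=600):
#       print(sample)
```

With thresholds like these, a 30 sec stall followed by recovery would show up as a few slow "ok" samples, while the persistent hang described in the last comments would show up as a continuing run of "timeout" entries.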