Bug 1844387
Summary: | 4.6: OpenStack: keepalive health check only fails on connection errors, not non-200 http rc | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Stefan Schimanski <sttts> | |
Component: | Machine Config Operator | Assignee: | Yossi Boaron <yboaron> | |
Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | 4.5 | CC: | asegurap, dahernan, ingvarr.zhmakin, mnguyen, yboaron | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause:
Keepalived provides HA for both the API and the default router. The Keepalived instance on each node monitors local health by curling the health endpoint of a local entity (e.g. the local kube-apiserver).
The curl command that was used failed only when the TCP connection failed, not when the server returned a non-200 HTTP status.
Consequence:
Keepalived sometimes did not fail over to another healthy node although the local entity was unhealthy, which led to errors in API requests.
Fix:
Update the curl command to also fail when the server replies with a non-200 return code.
Result:
API and Ingress fail over to a healthy node when a local entity fails.
|
Story Points: | --- | |
Clone Of: | 1844384 | |||
: | 1873401 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:05:27 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1844384, 1844446, 1873401 |
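The fix described in the Doc Text above can be illustrated with a small shell sketch. The shipped check relies on curl's `-f` flag, which per the curl documentation makes curl exit non-zero on HTTP error responses (400 and above). The variant below is stricter and matches the "non-200" wording literally by comparing the status code reported by `-w '%{http_code}'`. The function names `is_healthy_status` and `check_ready` are hypothetical and are not part of the actual keepalived configuration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not the shipped check script.

# Succeed only when the given HTTP status code is exactly 200.
is_healthy_status() {
    [ "$1" = "200" ]
}

# Probe a URL and succeed only if the server answered 200.
#   -k  skip TLS certificate verification (the local endpoint is self-signed)
#   -s  silence progress output
#   -L  follow redirects
#   -o /dev/null         discard the response body
#   -w '%{http_code}'    print the final HTTP status code
check_ready() {
    # A connection failure makes curl itself exit non-zero.
    status=$(curl -o /dev/null -ksL -w '%{http_code}' "$1") || return 1
    is_healthy_status "$status"
}
```

A check script could then call, for example, `check_ready https://localhost:6443/readyz`, and Keepalived would see a non-zero exit status both for connection errors and for responses such as 500 or 503.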
Description
Stefan Schimanski
2020-06-05 09:54:58 UTC
Checked with 4.6.0-0.nightly-2020-06-26-035408, moved to verified.

$ oc version
Client Version: 4.6.0-202006270004.p0-ad8b00f
Server Version: 4.6.0-0.nightly-2020-06-26-035408
Kubernetes Version: v1.18.3+8871b3d

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-26-035408   True        False         17m     Cluster version is 4.6.0-0.nightly-2020-06-26-035408

$ oc get nodes -o wide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
wj46ios629a-b4pbw-master-0       Ready   master   36m   v1.18.3+ba54539   192.168.2.202   <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev
wj46ios629a-b4pbw-master-1       Ready   master   36m   v1.18.3+ba54539   192.168.1.6     <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev
wj46ios629a-b4pbw-master-2       Ready   master   36m   v1.18.3+ba54539   192.168.1.184   <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev
wj46ios629a-b4pbw-worker-j9zl8   Ready   worker   20m   v1.18.3+ba54539   192.168.2.27    <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev
wj46ios629a-b4pbw-worker-mjrfc   Ready   worker   22m   v1.18.3+ba54539   192.168.3.56    <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev
wj46ios629a-b4pbw-worker-mwdbk   Ready   worker   24m   v1.18.3+ba54539   192.168.2.119   <none>   Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev

$ oc get pods -n openshift-openstack-infra -o wide
NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
coredns-wj46ios629a-b4pbw-master-0              1/1   Running   0   35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>   <none>
coredns-wj46ios629a-b4pbw-master-1              1/1   Running   0   35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>   <none>
coredns-wj46ios629a-b4pbw-master-2              1/1   Running   0   35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>   <none>
coredns-wj46ios629a-b4pbw-worker-j9zl8          1/1   Running   0   21m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>   <none>
coredns-wj46ios629a-b4pbw-worker-mjrfc          1/1   Running   0   21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>   <none>
coredns-wj46ios629a-b4pbw-worker-mwdbk          1/1   Running   0   23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>   <none>
haproxy-wj46ios629a-b4pbw-master-0              2/2   Running   0   35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>   <none>
haproxy-wj46ios629a-b4pbw-master-1              2/2   Running   0   35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>   <none>
haproxy-wj46ios629a-b4pbw-master-2              2/2   Running   0   35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>   <none>
keepalived-wj46ios629a-b4pbw-master-0           1/1   Running   0   35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>   <none>
keepalived-wj46ios629a-b4pbw-master-1           1/1   Running   0   35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>   <none>
keepalived-wj46ios629a-b4pbw-master-2           1/1   Running   0   35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>   <none>
keepalived-wj46ios629a-b4pbw-worker-j9zl8       1/1   Running   0   20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>   <none>
keepalived-wj46ios629a-b4pbw-worker-mjrfc       1/1   Running   0   21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>   <none>
keepalived-wj46ios629a-b4pbw-worker-mwdbk       1/1   Running   0   23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-master-0       1/1   Running   0   35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-master-1       1/1   Running   0   35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-master-2       1/1   Running   0   35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-worker-j9zl8   1/1   Running   0   20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mjrfc   1/1   Running   0   21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>   <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mwdbk   1/1   Running   0   24m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>   <none>

$ oc -n openshift-openstack-infra rsh keepalived-wj46ios629a-b4pbw-master-0
sh-4.2# ps aux
USER   PID     %CPU   %MEM   VSZ      RSS    TTY     STAT   START   TIME   COMMAND
root   1       0.0    0.0    123020   6912   ?       Ss     01:33   0:00   /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
root   8       0.1    0.0    127288   6244   ?       S      01:33   0:03   /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console
root   20500   0.0    0.0    11836    2792   pts/0   Ss     02:11   0:00   /bin/sh
root   20548   0.0    0.0    51768    3472   pts/0   R+     02:11   0:00   ps aux
sh-4.2# cat /etc/keepalived/keepalived.conf
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz/ready"
    interval 1
    weight 50
}

vrrp_instance wj46ios629a_API {
    state BACKUP
    interface ens3
    virtual_router_id 197
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_api_vip
    }
    virtual_ipaddress {
        192.168.0.5/18
    }
    track_script {
        chk_ocp
    }
}

vrrp_instance wj46ios629a_INGRESS {
    state BACKUP
    interface ens3
    virtual_router_id 180
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_ingress_vip
    }
    virtual_ipaddress {
        192.168.0.7/18
    }
    track_script {
        chk_ingress
    }
}

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
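To see why the exit status of the check script decides which node holds the VIP, it may help to work through the `priority 40` and `weight 50` values from the keepalived.conf above. The sketch below assumes keepalived's documented behaviour for a positive `weight`: the weight is added to the instance priority while the tracked script succeeds. The function name `effective_priority` is hypothetical, and the sketch deliberately ignores negative weights and other details of VRRP election.

```shell
#!/bin/sh
# Hypothetical model of a positive-weight track_script; illustration only.
# Usage: effective_priority BASE WEIGHT SCRIPT_OK   (SCRIPT_OK: 1 = check passed)
effective_priority() {
    base=$1
    weight=$2
    ok=$3
    if [ "$ok" -eq 1 ] && [ "$weight" -gt 0 ]; then
        # Check passed: the positive weight is added to the base priority.
        echo $((base + weight))
    else
        # Check failed: the node advertises only its base priority.
        echo "$base"
    fi
}

effective_priority 40 50 1   # node whose readyz check passes: prints 90
effective_priority 40 50 0   # node whose readyz check fails:  prints 40
```

Under this model, a node whose check fails drops from 90 to 40 and loses the election to any healthy peer still at 90. Before the fix, a kube-apiserver that answered with an HTTP error still counted as a passing check, so the unhealthy node stayed at 90 and could keep the VIP.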