Bug 1844387 - 4.6: OpenStack: keepalive health check only fails on connection errors, not non-200 http rc
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Yossi Boaron
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks: 1844384 1844446 1873401
Reported: 2020-06-05 09:54 UTC by Stefan Schimanski
Modified: 2020-10-27 16:05 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Keepalived provides HA for both the API and the default router. The Keepalived instance on each node monitors local health by curling the health endpoint of a local entity (e.g. the local kube-apiserver). The curl command in use failed only when the TCP connection failed, not on non-200 HTTP responses.
Consequence: Keepalived sometimes did not fail over to another healthy node even though the local entity was unhealthy, which led to errors in API requests.
Fix: The curl command now also fails when the server replies with a non-200 status code.
Result: API and Ingress fail over to a healthy node when a local entity fails.
Clone Of: 1844384
Clones: 1873401
Environment:
Last Closed: 2020-10-27 16:05:27 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 1785 | 0 | None | closed | Bug 1844387: Fail healthz/readyz curls on non-200 http errors | 2021-01-21 17:40:53 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:05:49 UTC

Description Stefan Schimanski 2020-06-05 09:54:58 UTC
+++ This bug was initially created as a clone of Bug #1844384 +++

Description of problem:

The OpenStack keepalived health check fails only on connection errors, not on HTTP error responses:

  https://github.com/openshift/machine-config-operator/blame/master/templates/master/00-master/openstack/files/openstack-keepalived-keepalived.yaml#L6

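Background: when the TCP connection succeeds, `curl -s` exits 0 even if the server returns an HTTP error status. It needs `-f` (`--fail`) to turn HTTP errors (status 400 and above) into a non-zero exit code.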
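
For illustration only, a minimal sketch of the distinction between "connection succeeded" and "endpoint is healthy". The helper names `status_is_healthy` and `probe` are hypothetical and are not part of the machine-config-operator templates; the sketch reads the status code via curl's `-w '%{http_code}'` and treats 400 and above as a failure, which is the threshold `curl -f` uses.

```shell
#!/bin/sh

# Hypothetical helper: succeed only for HTTP status codes below 400,
# mirroring the threshold at which `curl -f` reports a failure.
# A code of 000 (no HTTP response at all) is also treated as unhealthy.
status_is_healthy() {
    code="$1"
    [ "$code" -ge 100 ] && [ "$code" -lt 400 ]
}

# Hypothetical helper: probe a URL and fail on connection errors
# as well as on HTTP error statuses.
probe() {
    # Without -f, curl exits non-zero only for transport-level errors,
    # so the status code has to be inspected separately.
    code=$(curl -kLs -o /dev/null -w '%{http_code}' "$1") || return 1
    status_is_healthy "$code"
}
```

Calling `probe https://localhost:6443/readyz` on a node would then return non-zero both when nothing is listening and when the server answers with an error status, which is the behaviour the health check needs.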

--- Additional comment from Antonio Murdaca on 2020-06-05 11:53:18 CEST ---

Moving to the openstack owners

Comment 3 weiwei jiang 2020-06-29 02:24:02 UTC
Checked with 4.6.0-0.nightly-2020-06-26-035408, moved to verified.

$ oc version
Client Version: 4.6.0-202006270004.p0-ad8b00f
Server Version: 4.6.0-0.nightly-2020-06-26-035408
Kubernetes Version: v1.18.3+8871b3d

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-26-035408   True        False         17m     Cluster version is 4.6.0-0.nightly-2020-06-26-035408

$ oc get nodes -o wide 
NAME                             STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME                                                               
wj46ios629a-b4pbw-master-0       Ready    master   36m   v1.18.3+ba54539   192.168.2.202   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-master-1       Ready    master   36m   v1.18.3+ba54539   192.168.1.6     <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-master-2       Ready    master   36m   v1.18.3+ba54539   192.168.1.184   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-j9zl8   Ready    worker   20m   v1.18.3+ba54539   192.168.2.27    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-mjrfc   Ready    worker   22m   v1.18.3+ba54539   192.168.3.56    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-mwdbk   Ready    worker   24m   v1.18.3+ba54539   192.168.2.119   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev  

$ oc get pods -n openshift-openstack-infra -o wide 
NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                             NOMINATED NODE   READINESS GATES
coredns-wj46ios629a-b4pbw-master-0              1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
coredns-wj46ios629a-b4pbw-master-1              1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
coredns-wj46ios629a-b4pbw-master-2              1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
coredns-wj46ios629a-b4pbw-worker-j9zl8          1/1     Running   0          21m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
coredns-wj46ios629a-b4pbw-worker-mjrfc          1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
coredns-wj46ios629a-b4pbw-worker-mwdbk          1/1     Running   0          23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>
haproxy-wj46ios629a-b4pbw-master-0              2/2     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
haproxy-wj46ios629a-b4pbw-master-1              2/2     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
haproxy-wj46ios629a-b4pbw-master-2              2/2     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-0           1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-1           1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-2           1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-j9zl8       1/1     Running   0          20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-mjrfc       1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-mwdbk       1/1     Running   0          23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-0       1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-1       1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-2       1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-j9zl8   1/1     Running   0          20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mjrfc   1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mwdbk   1/1     Running   0          24m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>


$ oc -n  openshift-openstack-infra rsh keepalived-wj46ios629a-b4pbw-master-0                                                                                                                                                                                                  
sh-4.2# ps aux                                                                                                                                                                                                                                                                  
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND                                                                                                                                                                                                      
root           1  0.0  0.0 123020  6912 ?        Ss   01:33   0:00 /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console                                                                                                        
root           8  0.1  0.0 127288  6244 ?        S    01:33   0:03 /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console                                                                                                        
root       20500  0.0  0.0  11836  2792 pts/0    Ss   02:11   0:00 /bin/sh                                                                                                                                                                                                      
root       20548  0.0  0.0  51768  3472 pts/0    R+   02:11   0:00 ps aux                                                                                                                                                                                                       
sh-4.2# cat /etc/keepalived/keepalived.conf                                                                                                                                                                                                                                     
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz/ready"
    interval 1
    weight 50
}

vrrp_instance wj46ios629a_API {
    state BACKUP
    interface ens3
    virtual_router_id 197
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_api_vip
    }
    virtual_ipaddress {
        192.168.0.5/18
    }
    track_script {
        chk_ocp
    }
}

vrrp_instance wj46ios629a_INGRESS {
    state BACKUP
    interface ens3
    virtual_router_id 180
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_ingress_vip
    }
    virtual_ipaddress {
        192.168.0.7/18
    }
    track_script {
        chk_ingress
    }
}
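
As a rough sketch, the same verification could be scripted instead of read off by eye. The function name `missing_fail_flag` is hypothetical, and the pattern matching assumes the check scripts are single-line `script "... curl ..."` entries like the ones in the config above; it prints every curl invocation that has neither a short-option cluster containing `f` nor `--fail`.

```shell
#!/bin/sh

# Hypothetical helper: print curl invocations in a keepalived.conf
# whose options include neither -f (possibly bundled, e.g. -kLfs)
# nor --fail. Prints nothing when every invocation has the flag.
missing_fail_flag() {
    conf="$1"
    # 1. keep only the vrrp_script "script" lines
    # 2. split each line into individual curl commands (separated by &&)
    # 3. drop the commands that already carry the fail flag
    grep -E '^[[:space:]]*script[[:space:]]' "$conf" \
        | grep -o 'curl[^&]*' \
        | grep -Ev -- ' -[A-Za-z]*f[A-Za-z]*( |$)| --fail( |$)'
}
```

Run as `missing_fail_flag /etc/keepalived/keepalived.conf` inside the keepalived container; empty output suggests every check would fail on HTTP errors as well as on connection errors.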

Comment 5 errata-xmlrpc 2020-10-27 16:05:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

