Bug 1844387

Summary: 4.6: OpenStack: keepalive health check only fails on connection errors, not non-200 http rc
Product: OpenShift Container Platform
Reporter: Stefan Schimanski <sttts>
Component: Machine Config Operator
Assignee: Yossi Boaron <yboaron>
Status: CLOSED ERRATA
QA Contact: weiwei jiang <wjiang>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.5
CC: asegurap, dahernan, ingvarr.zhmakin, mnguyen, yboaron
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Keepalived provides HA for both the API and the default router. The Keepalived instance on each node monitors local health by curling the health endpoint of a local entity (e.g. the local kube-apiserver). The curl command failed only when the TCP connection failed, not when the server replied with a non-200 HTTP return code.
Consequence: Keepalived sometimes did not fail over to another healthy node even though the local entity was unhealthy, which led to errors in API requests.
Fix: Update the curl command to also fail when the server replies with a non-200 return code.
Result: API and Ingress fail over to a healthy node when a local entity fails.
Story Points: ---
Clone Of: 1844384
: 1873401
Environment:
Last Closed: 2020-10-27 16:05:27 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1844384, 1844446, 1873401

Description Stefan Schimanski 2020-06-05 09:54:58 UTC
+++ This bug was initially created as a clone of Bug #1844384 +++

Description of problem:

The OpenStack keepalived health check fails only on connection errors:

  https://github.com/openshift/machine-config-operator/blame/master/templates/master/00-master/openstack/files/openstack-keepalived-keepalived.yaml#L6

Background: `curl -s` without `-f` does not fail on non-200 errors; it exits 0 whenever the TCP connection succeeds and a response arrives.
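
To illustrate the difference between a connect-only check and a status-aware one, here is a minimal Python sketch. It is not the code used by the operator: the handler class, function names and the local test server are all hypothetical, and the two functions only approximate what `curl -s` and `curl -fs` report through their exit status.

```python
import http.server
import threading
import urllib.error
import urllib.request

# Bypass any proxy settings so the demo really talks to the local server.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class UnhealthyHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for an unhealthy endpoint: always replies 500."""

    def do_GET(self):
        self.send_response(500)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def connect_only_check(url):
    """Roughly `curl -s`: healthy whenever any HTTP response arrives."""
    try:
        _opener.open(url, timeout=2).close()
        return True
    except urllib.error.HTTPError:
        return True   # the server answered; the status code is ignored
    except OSError:
        return False  # connection-level failure


def status_aware_check(url):
    """Roughly `curl -fs`: also unhealthy on an HTTP error status."""
    try:
        _opener.open(url, timeout=2).close()
        return True
    except OSError:   # HTTPError is a subclass of OSError
        return False


def demo():
    """Run both checks against a local server that always returns 500."""
    server = http.server.HTTPServer(("127.0.0.1", 0), UnhealthyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/readyz" % server.server_address[1]
    try:
        return connect_only_check(url), status_aware_check(url)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Against the always-500 server, the connect-only check reports healthy while the status-aware check reports unhealthy, which is the gap described in this bug.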

--- Additional comment from Antonio Murdaca on 2020-06-05 11:53:18 CEST ---

Moving to the openstack owners

Comment 3 weiwei jiang 2020-06-29 02:24:02 UTC
Checked with 4.6.0-0.nightly-2020-06-26-035408, moved to verified.

$ oc version
Client Version: 4.6.0-202006270004.p0-ad8b00f
Server Version: 4.6.0-0.nightly-2020-06-26-035408
Kubernetes Version: v1.18.3+8871b3d

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-26-035408   True        False         17m     Cluster version is 4.6.0-0.nightly-2020-06-26-035408

$ oc get nodes -o wide 
NAME                             STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME                                                               
wj46ios629a-b4pbw-master-0       Ready    master   36m   v1.18.3+ba54539   192.168.2.202   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-master-1       Ready    master   36m   v1.18.3+ba54539   192.168.1.6     <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-master-2       Ready    master   36m   v1.18.3+ba54539   192.168.1.184   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-j9zl8   Ready    worker   20m   v1.18.3+ba54539   192.168.2.27    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-mjrfc   Ready    worker   22m   v1.18.3+ba54539   192.168.3.56    <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev                               
wj46ios629a-b4pbw-worker-mwdbk   Ready    worker   24m   v1.18.3+ba54539   192.168.2.119   <none>        Red Hat Enterprise Linux CoreOS 46.82.202006260140-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-30.dev.rhaos4.6.git0a84af5.el8-dev  

$ oc get pods -n openshift-openstack-infra -o wide 
NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                             NOMINATED NODE   READINESS GATES
coredns-wj46ios629a-b4pbw-master-0              1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
coredns-wj46ios629a-b4pbw-master-1              1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
coredns-wj46ios629a-b4pbw-master-2              1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
coredns-wj46ios629a-b4pbw-worker-j9zl8          1/1     Running   0          21m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
coredns-wj46ios629a-b4pbw-worker-mjrfc          1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
coredns-wj46ios629a-b4pbw-worker-mwdbk          1/1     Running   0          23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>
haproxy-wj46ios629a-b4pbw-master-0              2/2     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
haproxy-wj46ios629a-b4pbw-master-1              2/2     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
haproxy-wj46ios629a-b4pbw-master-2              2/2     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-0           1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-1           1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
keepalived-wj46ios629a-b4pbw-master-2           1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-j9zl8       1/1     Running   0          20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-mjrfc       1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
keepalived-wj46ios629a-b4pbw-worker-mwdbk       1/1     Running   0          23m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-0       1/1     Running   0          35m   192.168.2.202   wj46ios629a-b4pbw-master-0       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-1       1/1     Running   0          35m   192.168.1.6     wj46ios629a-b4pbw-master-1       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-master-2       1/1     Running   0          35m   192.168.1.184   wj46ios629a-b4pbw-master-2       <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-j9zl8   1/1     Running   0          20m   192.168.2.27    wj46ios629a-b4pbw-worker-j9zl8   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mjrfc   1/1     Running   0          21m   192.168.3.56    wj46ios629a-b4pbw-worker-mjrfc   <none>           <none>
mdns-publisher-wj46ios629a-b4pbw-worker-mwdbk   1/1     Running   0          24m   192.168.2.119   wj46ios629a-b4pbw-worker-mwdbk   <none>           <none>


$ oc -n  openshift-openstack-infra rsh keepalived-wj46ios629a-b4pbw-master-0                                                                                                                                                                                                  
sh-4.2# ps aux                                                                                                                                                                                                                                                                  
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND                                                                                                                                                                                                      
root           1  0.0  0.0 123020  6912 ?        Ss   01:33   0:00 /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console                                                                                                        
root           8  0.1  0.0 127288  6244 ?        S    01:33   0:03 /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp --log-detail --log-console                                                                                                        
root       20500  0.0  0.0  11836  2792 pts/0    Ss   02:11   0:00 /bin/sh                                                                                                                                                                                                      
root       20548  0.0  0.0  51768  3472 pts/0    R+   02:11   0:00 ps aux                                                                                                                                                                                                       
sh-4.2# cat /etc/keepalived/keepalived.conf                                                                                                                                                                                                                                     
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz/ready"
    interval 1
    weight 50
}

vrrp_instance wj46ios629a_API {
    state BACKUP
    interface ens3
    virtual_router_id 197
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_api_vip
    }
    virtual_ipaddress {
        192.168.0.5/18
    }
    track_script {
        chk_ocp
    }
}

vrrp_instance wj46ios629a_INGRESS {
    state BACKUP
    interface ens3
    virtual_router_id 180
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass wj46ios629a_ingress_vip
    }
    virtual_ipaddress {
        192.168.0.7/18
    }
    track_script {
        chk_ingress
    }
}
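
For orientation, the effect of `weight 50` on the base `priority 40` in the config above can be sketched as follows. This assumes keepalived's documented handling of positive and negative `vrrp_script` weights. The function name is hypothetical, and the weight 0 case (where a failing script puts the instance into the FAULT state instead of adjusting priority) is not modeled.

```python
def effective_priority(base, weight, script_ok):
    """Sketch of how a tracked script's weight adjusts VRRP priority.

    Assumption: a positive weight is added while the script succeeds,
    a negative weight is applied while the script fails.
    """
    if weight > 0:
        return base + weight if script_ok else base
    if weight < 0:
        return base if script_ok else base + weight
    return base  # weight 0 is handled differently by keepalived; not modeled


if __name__ == "__main__":
    # Values taken from the vrrp_instance and vrrp_script blocks above.
    print(effective_priority(40, 50, True))
    print(effective_priority(40, 50, False))
```

Under these assumptions a node whose check passes advertises a higher priority than one whose check fails, so the VIP can only move away from an unhealthy node if the check script actually returns a non-zero exit status.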

Comment 5 errata-xmlrpc 2020-10-27 16:05:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196