Bug 1800969

Summary: keepalived conf file generated for IPv6 cluster contains health checks via IPv4
Product: OpenShift Container Platform Reporter: Victor Voronkov <vvoronko>
Component: Installer Assignee: Yossi Boaron <yboaron>
Installer sub component: OpenShift on Bare Metal IPI QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: augol, bperkins, bschmaus, stbenjam, yboaron
Version: 4.3.z Keywords: Triaged
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: In baremetal-ipi, Keepalived provides IP failover for both API-VIP and INGRESS-VIP. Keepalived repeatedly runs a script that monitors the status of a local component (e.g., the OCP api-server) to decide which node should own the VIP. In IPv6 deployments, Keepalived used the IPv4 local address (i.e., 127.0.0.1) to check local component status.
Consequence: In IPv6 deployments, Keepalived may receive an incorrect component status.
Fix: Update the Keepalived script to use localhost, which should resolve to 127.0.0.1 on IPv4 and ::1 on IPv6.
Result: Keepalived monitors local component status using the correct local IP address.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:14:25 UTC Type: Bug

Description Victor Voronkov 2020-02-09 15:46:15 UTC
Description of problem:
cat /etc/keepalived/keepalived.conf
...

vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://0:6443/readyz"
...
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://0:1936/healthz"

causing these checks to be performed on IPv4:
curl -kLsvvv https://0:6443/readyz
*   Trying 0.0.0.0...
* TCP_NODELAY set
* Connected to 0 (127.0.0.1) port 6443 (#0)
...
curl -kLsv http://0:1936/healthz
*   Trying 0.0.0.0...
* TCP_NODELAY set
* Connected to 0 (127.0.0.1) port 1936 (#0)
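
A plausible explanation for the output above is that the host string "0" is accepted by the legacy IPv4 literal parser as shorthand for 0.0.0.0 and is not a valid IPv6 literal, so an IPv6 connection is never attempted. The Python sketch below illustrates this with the standard socket module. The helper name literal_families is hypothetical, and this is an illustration of the parsing rules on a Linux host, not a description of how curl is implemented.

```python
import socket
from typing import List


def literal_families(host: str) -> List[str]:
    """Return the address families under which `host` parses as a numeric literal."""
    families = []
    try:
        # Legacy IPv4 parser: accepts shorthand forms such as "0" (0.0.0.0).
        socket.inet_aton(host)
        families.append("IPv4")
    except OSError:
        pass
    try:
        # Strict IPv6 parser: rejects anything that is not an IPv6 literal.
        socket.inet_pton(socket.AF_INET6, host)
        families.append("IPv6")
    except OSError:
        pass
    return families


if __name__ == "__main__":
    for host in ("0", "127.0.0.1", "::1", "localhost"):
        print(host, literal_families(host))
```

A name such as localhost matches neither parser, so the address family used for it is decided by name resolution on the node rather than by the literal itself.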




How reproducible:
Fully reproducible

Steps to Reproduce:
Deploy IPv6 cluster

Actual results:
health checks executed over IPv4

Expected results:
health checks to be performed over IPv6

Comment 1 Yossi Boaron 2020-02-10 07:28:04 UTC
In the latest 4.4 tree, the Keepalived check scripts in keepalived.conf use 'localhost' rather than '0' (see [1]), whereas the 4.3 tree still uses '0'.

Did you run the test with 4.3 or 4.4? (I ask because you filed the bug against 4.4.)


In [2] you can find the keepalived.conf from my environment (see [3] for the oc version).


If we need to run IPv6 on 4.3, I guess we should backport this fix to 4.3.


[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-keepalived-keepalived.yaml#L21
[2]

vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"
    interval 1
    weight 50
}

vrrp_script chk_dns {
    script "/usr/bin/host -t SRV _etcd-server-ssl._tcp.ostest.test.metalkube.org localhost"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"
    interval 1
    weight 50
}
 
[3]
[kni@worker-0 dev-scripts]$ oc version 
Client Version: 4.4.0-0.ci-2020-02-08-192852
Server Version: 4.4.0-0.ci-2020-02-08-192852
Kubernetes Version: v1.17.1
[kni@worker-0 dev-scripts]$
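
To make the role of the `weight 50` lines in [2] concrete: as far as keepalived's documented behavior goes, a tracked script with a positive weight adds that weight to the instance's priority while the script succeeds, and the node with the highest effective priority holds the VIP. The Python sketch below models only that rule. The function name, the base priority of 40, and the choice of which scripts a given VRRP instance tracks are all hypothetical and do not come from this bug.

```python
from typing import Dict


def effective_priority(base: int, weights: Dict[str, int], passed: Dict[str, bool]) -> int:
    """Model of how tracked-script results adjust a VRRP priority.

    Positive weight: added while the script succeeds.
    Negative weight: subtracted while the script fails.
    """
    priority = base
    for name, weight in weights.items():
        ok = passed.get(name, False)
        if weight > 0 and ok:
            priority += weight
        elif weight < 0 and not ok:
            priority += weight  # weight is negative, so this lowers the priority
    return priority


if __name__ == "__main__":
    weights = {"chk_ocp": 50, "chk_dns": 50}  # weights as in the config above
    healthy = effective_priority(40, weights, {"chk_ocp": True, "chk_dns": True})
    degraded = effective_priority(40, weights, {"chk_ocp": True, "chk_dns": False})
    print(healthy, degraded)
```

Under this model, a check that reports the wrong status shifts a node's priority by the full weight, which is why probing the wrong address family can affect which node owns the VIP.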

Comment 2 Victor Voronkov 2020-02-10 07:44:27 UTC
My bad, Yossi, I tested on 4.3.0-0.nightly-2020-02-03-115336-ipv6.1.
I'm fixing the OCP version on the bug, and yes, we test on IPv6, so a backport is required.

Comment 4 Stephen Benjamin 2020-02-24 14:39:52 UTC
Moving this to 4.5. To get this change in 4.4 at this point, you'll need to fix it in 4.5, and clone this bug to 4.4.

Comment 5 Victor Voronkov 2020-03-12 09:50:15 UTC
Verified on 4.4.0-0.nightly-2020-03-11-212258

All health checks now resolve localhost to the IPv6 loopback address, ::1.

cat /etc/keepalived/keepalived.conf
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"
    interval 1
    weight 50
}

vrrp_script chk_dns {
    script "/usr/bin/host -t SRV _etcd-server-ssl._tcp.ocp-edge-cluster.qe.lab.redhat.com localhost"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"
    interval 1
    weight 50
}

curl -kLs https://localhost:6443/readyz -vvv
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 6443 (#0)
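
A check like the one above could also be scripted. The following Python sketch scans keepalived.conf text for `script "..."` lines, extracts any http(s) URLs, and reports hosts that parse as IPv4 literals (including the shorthand "0"). The function name and the regular expressions are assumptions made for illustration; this is not part of the fix and does not claim to parse every valid keepalived configuration.

```python
import re
import socket
from typing import List
from urllib.parse import urlsplit

# Matches: script "<command>" on its own line (simplified; assumes double quotes).
SCRIPT_RE = re.compile(r'^\s*script\s+"([^"]*)"', re.MULTILINE)
URL_RE = re.compile(r"https?://\S+")


def ipv4_only_check_hosts(conf_text: str) -> List[str]:
    """Return URL hosts in vrrp_script commands that are IPv4 literals."""
    hosts = []
    for command in SCRIPT_RE.findall(conf_text):
        for url in URL_RE.findall(command):
            host = urlsplit(url).hostname
            if host is None:
                continue
            try:
                socket.inet_aton(host)  # succeeds for "0", "127.0.0.1", ...
            except OSError:
                continue  # a name or an IPv6 literal: not flagged
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    with open("/etc/keepalived/keepalived.conf") as f:
        print(ipv4_only_check_hosts(f.read()))
```

An empty result means no check URL is pinned to an IPv4 literal; it does not by itself prove that localhost resolves to ::1 on the node, which still has to be confirmed as in the curl output above.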

Comment 8 errata-xmlrpc 2020-07-13 17:14:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409