Description of problem:

The health check scripts in keepalived.conf use '0' as the target host:

cat /etc/keepalived/keepalived.conf
...
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://0:6443/readyz"
...
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://0:1936/healthz"

This causes the checks to be performed over IPv4:

curl -kLsvvv https://0:6443/readyz
* Trying 0.0.0.0...
* TCP_NODELAY set
* Connected to 0 (127.0.0.1) port 6443 (#0)
...
curl -kLsv http://0:1936/healthz
* Trying 0.0.0.0...
* TCP_NODELAY set
* Connected to 0 (127.0.0.1) port 1936 (#0)

How reproducible:
Fully reproducible

Steps to Reproduce:
Deploy an IPv6 cluster

Actual results:
Health checks are executed over IPv4

Expected results:
Health checks are performed over IPv6
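For illustration, here is a minimal Python sketch of why '0' can only ever reach the IPv4 side. It assumes the host is parsed with the standard inet_aton/inet_pton rules (curl's own resolution path may differ in detail, but the "Trying 0.0.0.0..." output above is consistent with this). The helper name classify_host is hypothetical and not part of any tool mentioned in this bug.

```python
import socket


def classify_host(host: str) -> str:
    """Classify a host string as an IPv6 literal, an IPv4 literal, or a name."""
    try:
        # Strict IPv6 parsing: "0" is not a valid IPv6 address string.
        socket.inet_pton(socket.AF_INET6, host)
        return "ipv6-literal"
    except OSError:
        pass
    try:
        # inet_aton accepts the classic shorthand forms, so "0" parses as 0.0.0.0.
        socket.inet_aton(host)
        return "ipv4-literal"
    except OSError:
        # Anything else has to go through name resolution (e.g. /etc/hosts, DNS).
        return "name"


if __name__ == "__main__":
    for host in ("0", "localhost", "::1"):
        print(host, "->", classify_host(host))
```

Because '0' is already a numeric IPv4 address, no name lookup happens and the address family is fixed. A name such as 'localhost' leaves the choice to the resolver, which is what allows an IPv6-only node to pick ::1.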
In the latest 4.4 tree, the Keepalived check scripts in keepalived.conf use 'localhost' and not '0' (see [1]), while in the 4.3 tree '0' is used.

Did you run the test with 4.3 or 4.4 (because you filed the bug on 4.4)?

In [2] you can find the keepalived.conf from my env (see [3] for the OC version).

If we should run IPv6 on 4.3, I guess we should backport this fix to 4.3.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-keepalived-keepalived.yaml#L21

[2]
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"
    interval 1
    weight 50
}

vrrp_script chk_dns {
    script "/usr/bin/host -t SRV _etcd-server-ssl._tcp.ostest.test.metalkube.org localhost"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"
    interval 1
    weight 50
}

[3]
[kni@worker-0 dev-scripts]$ oc version
Client Version: 4.4.0-0.ci-2020-02-08-192852
Server Version: 4.4.0-0.ci-2020-02-08-192852
Kubernetes Version: v1.17.1
[kni@worker-0 dev-scripts]$
My bad, Yossi. I tested on 4.3.0-0.nightly-2020-02-03-115336-ipv6.1. I am fixing the OCP version on the bug. And yes, we test on IPv6, so a backport is required.
https://github.com/openshift-kni/machine-config-operator/pull/9
Moving this to 4.5. To get this change in 4.4 at this point, you'll need to fix it in 4.5, and clone this bug to 4.4.
Verified on 4.4.0-0.nightly-2020-03-11-212258.

All health checks now use localhost, which resolves to the IPv6 loopback address (localhost = ::1).

cat /etc/keepalived/keepalived.conf
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"
    interval 1
    weight 50
}

vrrp_script chk_dns {
    script "/usr/bin/host -t SRV _etcd-server-ssl._tcp.ocp-edge-cluster.qe.lab.redhat.com localhost"
    interval 1
    weight 50
}

# TODO: Improve this check. The port is assumed to be alive.
# Need to assess what is the ramification if the port is not there.
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"
    interval 1
    weight 50
}

curl -kLs https://localhost:6443/readyz -vvv
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 6443 (#0)
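The same verification could be scripted. The following is a rough sketch, not part of the fix: it scans keepalived.conf text for vrrp_script checks whose curl URL uses a numeric IPv4 host. The function names, the regular expressions, and the assumption that `script` is the first statement inside each block are mine. The sample configs contain only the script lines quoted in this bug.

```python
import re
import socket
from urllib.parse import urlsplit

# Assumes `script "..."` is the first statement inside each vrrp_script block.
SCRIPT_RE = re.compile(r'vrrp_script\s+(\w+)\s*\{\s*script\s+"([^"]*)"')
URL_RE = re.compile(r"https?://\S+")


def is_ipv4_literal(host: str) -> bool:
    """True if host parses as a numeric IPv4 address (including shorthand like '0')."""
    try:
        socket.inet_aton(host)
        return True
    except OSError:
        return False


def ipv4_only_checks(conf_text: str) -> list:
    """Return (check name, host) pairs whose URL host is an IPv4 literal."""
    flagged = []
    for name, command in SCRIPT_RE.findall(conf_text):
        for url in URL_RE.findall(command):
            host = urlsplit(url).hostname
            if host and is_ipv4_literal(host):
                flagged.append((name, host))
    return flagged


# Script lines as reported for 4.3 (other settings omitted).
OLD_CONF = '''
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://0:6443/readyz"
}
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://0:1936/healthz"
}
'''

# Script lines as verified on 4.4 (other settings omitted).
NEW_CONF = '''
vrrp_script chk_ocp {
    script "/usr/bin/curl -o /dev/null -kLs https://localhost:6443/readyz"
}
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://localhost:1936/healthz"
}
'''

if __name__ == "__main__":
    print("old:", ipv4_only_checks(OLD_CONF))
    print("new:", ipv4_only_checks(NEW_CONF))
```

This only inspects the config text. It does not confirm that localhost actually resolves to ::1 on the node, which the curl -vvv output above does.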
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409