Description of problem:

baremetal-runtimecfg (haproxy-monitor) sets/removes the firewall rule that redirects API traffic to the LB. It runs [1] to verify kube-apiserver health via the LB. It uses the IPv4 localhost address (127.0.0.1) to communicate with the local LB; it should use 'localhost' to cover both the IPv4 and IPv6 cases.

Additionally, based on [2] we should check the 'readyz' endpoint and not 'healthz'.

[1] https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go#L86
[2] https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md#load-balancer-health-check-probe
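A minimal sketch of the requested change, not the actual baremetal-runtimecfg code (the helper name healthCheckURL is illustrative): building the probe URL from the hostname 'localhost' instead of the literal 127.0.0.1 lets the resolver pick the right address family, and Go's net.JoinHostPort also brackets literal IPv6 addresses correctly, which a naive fmt.Sprintf("https://%s:%s", host, port) would not.

```go
package main

import (
	"fmt"
	"net"
)

// healthCheckURL builds the LB health-probe URL. Using "localhost" rather
// than "127.0.0.1" covers both IPv4 and IPv6 deployments, and the path is
// /readyz rather than /healthz per the installer health-check doc.
func healthCheckURL(host, port string) string {
	// JoinHostPort adds [] around literal IPv6 addresses when needed.
	return fmt.Sprintf("https://%s/readyz", net.JoinHostPort(host, port))
}

func main() {
	fmt.Println(healthCheckURL("localhost", "6443")) // → https://localhost:6443/readyz
	fmt.Println(healthCheckURL("::1", "6443"))       // → https://[::1]:6443/readyz
}
```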
*** Bug 1847083 has been marked as a duplicate of this bug. ***
*** Bug 1847086 has been marked as a duplicate of this bug. ***
This was fixed by https://github.com/openshift/baremetal-runtimecfg/pull/68
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:

On the first terminal:

[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com hostname
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-apiserver -o wide
apiserver-6bbb844d98-hd924   1/1   Running   0   19m   fd01:0:0:2::f   master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>
[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-hd924
sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokoko...

On the second terminal:

[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
...
sh-4.2# chroot /host
sh-4.4# bash
[root@master-0-0 /]# ps aux | grep "openshift-apiserver start"
root 139003 6.6 0.6 1761432 205092 ? Ssl 13:51 2:14 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
...
[root@master-0-0 /]# kill -INT 139003

First terminal output:

...okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137

=============================================================================================

Related issue: on the first terminal I got the error:

[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
Starting pod/master-0-0ocp-edge-cluster-0qelabredhatcom-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"

That bug was already opened here: https://bugzilla.redhat.com/show_bug.cgi?id=1782852
Used the workaround from there:

$ oc tag -d openshift/tools:latest
$ oc tag -n openshift $(oc get pods -n openshift-multus -l app=multus -o jsonpath='{.items[0].spec.containers[?(@.name=="kube-multus")].image}') tools:latest
$ oc get imagetag -n openshift tools:latest
The link for the workaround in the previous comment is incorrect. See here: https://bugzilla.redhat.com/show_bug.cgi?id=1728135#c32
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:

After shutting down all of the kube-apiservers, the haproxy-monitor removed the firewall rule in less than 30 seconds (greater than 30 seconds would suggest it is still using /healthz). Also verified that we didn't break anything with this fix.

[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com
[core@master-0-2 ~]$ while true; do sleep 1; sudo crictl rm -f $(sudo crictl ps --name haproxy | awk 'FNR==2{ print $1}'); done
[core@master-0-2 ~]$ date
Sun Jul 19 07:01:15 UTC 2020
[core@master-0-2 ~]$ sudo cat /var/log/pods/openshift-kni-infra_haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com_1bea838fdefc74b7bc393e1b9a638c96/haproxy-monitor/2.log
2020-07-19T07:01:22.602585718+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="API is not reachable through HAProxy"
2020-07-19T07:01:22.633405741+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 50000 [{master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::138 6443} {master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::13d 6443} {master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::143 6443}] ::}"

Ben, please advise if there is anything else I should check.
As discussed, the ticket is verified by https://bugzilla.redhat.com/show_bug.cgi?id=1847082#c4

When the connection fails, 'localhost' resolves to both the IPv4 and IPv6 loopback addresses, and both are tried automatically:

[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-pjxsg
sh-4.2# curl -k https://localhost:6443/readyz
curl: (7) Failed connect to localhost:6443; Connection refused
sh-4.2# curl -k -vvv https://localhost:6443/readyz
* About to connect() to localhost port 6443 (#0)
*   Trying ::1...
* Connection refused
*   Trying 127.0.0.1...
* Connection refused
* Failed connect to localhost:6443; Connection refused
* Closing connection 0
curl: (7) Failed connect to localhost:6443; Connection refused