Bug 1847082 - [IPI baremetal] baremetal-runtimecfg k8s health check uses hardcoded IPv4 local address (127.0.0.1)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Nemec
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Duplicates: 1847083 1847086
Depends On:
Blocks:
 
Reported: 2020-06-15 15:46 UTC by Yossi Boaron
Modified: 2020-08-13 16:49 UTC (History)
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Use of an IPv4 address in an IPv6 deployment
Consequence:
Fix:
Result:
Clone Of:
Environment:
Last Closed: 2020-08-13 16:49:20 UTC
Target Upstream Version:
Embargoed:
bnemec: needinfo-




Links
GitHub: openshift/baremetal-runtimecfg pull 68, closed: "Bug 1847086: Update kube-api health check to use localhost" (last updated 2021-01-09 22:22:33 UTC)

Description Yossi Boaron 2020-06-15 15:46:58 UTC
Description of problem:

baremetal-runtimecfg (haproxy-monitor) sets/removes the firewall rule that redirects API traffic to the LB; it runs [1] to verify the kube-api health status through the LB.

It uses the IPv4 loopback address (127.0.0.1) to communicate with the local LB; it should use 'localhost' instead, to cover both the IPv4 and IPv6 cases.

Additionally, based on [2], we should check the 'readyz' endpoint rather than 'healthz'. A sketch of the corrected check follows the references below.


[1] https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go#L86
[2] https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md#load-balancer-health-check-probe
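
For illustration, a minimal Go sketch of the corrected probe. This is not the actual utils.go code: the checkReadyz name, the 6443 port default, and the decision to skip TLS verification are all assumptions made for the example.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// checkReadyz probes kube-apiserver readiness through the local load
// balancer. "localhost" (rather than 127.0.0.1) lets the resolver return
// ::1 in IPv6 deployments, and /readyz (rather than /healthz) matches the
// load-balancer probe recommended in [2].
func checkReadyz(port int) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Assumption: the apiserver cert is not issued for localhost,
			// so verification is skipped for this loopback-only probe.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(fmt.Sprintf("https://localhost:%d/readyz", port))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("kube-apiserver not ready: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := checkReadyz(6443); err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("ok")
}

Because the URL names 'localhost', the standard resolver can hand back ::1 on IPv6-only hosts, so the probe itself needs no address-family logic.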


Comment 1 Beth White 2020-06-16 16:25:20 UTC
*** Bug 1847083 has been marked as a duplicate of this bug. ***

Comment 2 Beth White 2020-06-16 16:25:38 UTC
*** Bug 1847086 has been marked as a duplicate of this bug. ***

Comment 3 Ben Nemec 2020-07-01 16:31:28 UTC
This was fixed by https://github.com/openshift/baremetal-runtimecfg/pull/68

Comment 4 Aleksandra Malykhin 2020-07-15 15:00:02 UTC
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:

On the first terminal:
[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com hostname
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-apiserver -o wide
apiserver-6bbb844d98-hd924   1/1     Running   0          19m   fd01:0:0:2::f    master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>

[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-hd924
sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokoko...


On the second terminal:
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
...
sh-4.2# chroot /host
sh-4.4# bash
[root@master-0-0 /]# ps aux | grep "openshift-apiserver start"
root      139003  6.6  0.6 1761432 205092 ?      Ssl  13:51   2:14 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
...
[root@master-0-0 /]# kill -INT 139003


First terminal output:
...okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137



=============================================================================================
Related issues:
On the first terminal, I got this error:
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
Starting pod/master-0-0ocp-edge-cluster-0qelabredhatcom-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"


The bug has already been opened here: https://bugzilla.redhat.com/show_bug.cgi?id=1782852
I used the workaround from there:
$ oc tag -d openshift/tools:latest
$ oc tag  -n openshift $(oc get pods -n openshift-multus -l app=multus -o jsonpath='{.items[0].spec.containers[?(@.name=="kube-multus")].image}') tools:latest
$ oc get imagetag -n openshift tools:latest

Comment 5 Aleksandra Malykhin 2020-07-15 15:23:31 UTC
Link for the workaround in the previous comment is incorrect. See here: https://bugzilla.redhat.com/show_bug.cgi?id=1728135#c32

Comment 6 Aleksandra Malykhin 2020-07-19 07:09:11 UTC
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:
After shutting down all of the kube-apiservers, the haproxy-monitor removed the firewall rule in less than 30 seconds (more than 30 seconds would suggest it is still using /healthz).
(This verified that we didn't break anything with this fix; a sketch of the monitor's decision loop follows at the end of this comment.)

[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com 
[core@master-0-2 ~]$ while true; do sleep 1; sudo crictl rm -f $(sudo crictl ps --name haproxy | awk 'FNR==2{ print $1}'); done
[core@master-0-2 ~]$ date 
Sun Jul 19 07:01:15 UTC 2020
[core@master-0-2 ~]$ sudo cat /var/log/pods/openshift-kni-infra_haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com_1bea838fdefc74b7bc393e1b9a638c96/haproxy-monitor/2.log

2020-07-19T07:01:22.602585718+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="API is not reachable through HAProxy"
2020-07-19T07:01:22.633405741+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 50000 [{master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::138 6443} {master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::13d 6443} {master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::143 6443}] ::}"

Ben, please advise if there is anything else I should check.
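
For illustration, a minimal Go sketch of such a decision loop. The 6-second poll interval and 3-failure threshold are made-up values chosen only to be consistent with rule removal in under 30 seconds; the real haproxy-monitor intervals, thresholds, and firewall handling may differ, and checkAPI/removeRedirectRule are hypothetical callbacks.

package main

import (
	"errors"
	"log"
	"time"
)

// monitor polls the API through HAProxy and removes the redirect rule once
// the check keeps failing. checkAPI and removeRedirectRule stand in for the
// real health check and firewall-rule handling.
func monitor(checkAPI func() error, removeRedirectRule func() error) {
	failures := 0
	for range time.Tick(6 * time.Second) { // assumed interval
		if err := checkAPI(); err == nil {
			failures = 0
			continue
		}
		failures++
		log.Printf("API is not reachable through HAProxy (failure %d)", failures)
		if failures >= 3 { // assumed threshold: ~18s, under the 30s bound
			if err := removeRedirectRule(); err != nil {
				log.Printf("removing redirect rule: %v", err)
			}
			return
		}
	}
}

func main() {
	monitor(
		func() error { return errors.New("connection refused") }, // stub failing check
		func() error { log.Println("redirect rule removed"); return nil },
	)
}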

Comment 7 Aleksandra Malykhin 2020-07-20 14:05:37 UTC
As discussed, the ticket is verified by https://bugzilla.redhat.com/show_bug.cgi?id=1847082#c4

When the connection fails, 'localhost' resolves to both the IPv4 and IPv6 loopback addresses, and both are tried automatically, as the curl output and the short illustration below show.

[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-pjxsg
sh-4.2# curl -k https://localhost:6443/readyz
curl: (7) Failed connect to localhost:6443; Connection refused
sh-4.2# curl -k -vvv  https://localhost:6443/readyz
* About to connect() to localhost port 6443 (#0)
*   Trying ::1...
* Connection refused
*   Trying 127.0.0.1...
* Connection refused
* Failed connect to localhost:6443; Connection refused
* Closing connection 0
curl: (7) Failed connect to localhost:6443; Connection refused
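
A small, self-contained Go illustration (not project code) of that dual-stack fallback:

package main

import (
	"fmt"
	"net"
)

func main() {
	// "localhost" resolves to both loopback addresses; a dialer then tries
	// each in turn, which is the ::1 -> 127.0.0.1 fallback curl shows above.
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println(a) // typically "::1" and "127.0.0.1"
	}
}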

