Bug 1847082

Summary: [IPI baremetal] baremetal-runtimecfg k8s health-check use hardcoded IPv4 local address (127.0.0.1)
Product: OpenShift Container Platform
Reporter: Yossi Boaron <yboaron>
Component: Installer
Assignee: Ben Nemec <bnemec>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Aleksandra Malykhin <amalykhi>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: medium
CC: asegurap, bnemec
Version: 4.5
Keywords: Triaged
Target Milestone: ---
Flags: bnemec: needinfo-
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Use of ipv4 address in ipv6 deployment
Consequence:
Fix:
Result:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-13 16:49:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yossi Boaron 2020-06-15 15:46:58 UTC
Description of problem:

baremetal-runtimecfg (haproxy-monitor) sets and removes the firewall rule that redirects API traffic to the LB. It runs [1] to verify the kube-api health status via the LB.

It uses the IPv4 localhost address (127.0.0.1) to communicate with the local LB; it should use 'localhost' to cover both the IPv4 and IPv6 cases.

Additionally, based on [2], we should check the 'readyz' endpoint and not 'healthz'.


[1] https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go#L86
[2] https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md#load-balancer-health-check-probe

Comment 1 Beth White 2020-06-16 16:25:20 UTC
*** Bug 1847083 has been marked as a duplicate of this bug. ***

Comment 2 Beth White 2020-06-16 16:25:38 UTC
*** Bug 1847086 has been marked as a duplicate of this bug. ***

Comment 3 Ben Nemec 2020-07-01 16:31:28 UTC
This was fixed by https://github.com/openshift/baremetal-runtimecfg/pull/68

Comment 4 Aleksandra Malykhin 2020-07-15 15:00:02 UTC
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:

On the first terminal:
[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com hostname
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-apiserver -o wide
apiserver-6bbb844d98-hd924   1/1     Running   0          19m   fd01:0:0:2::f    master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>

[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-hd924
sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokoko...


On the second terminal:
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
...
sh-4.2# chroot /host
sh-4.4# bash
[root@master-0-0 /]# ps aux | grep "openshift-apiserver start"
root      139003  6.6  0.6 1761432 205092 ?      Ssl  13:51   2:14 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
...
[root@master-0-0 /]# kill -INT 139003


First terminal output:
...okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137



=============================================================================================
Related issues:
On the first terminal I got the error:
[kni@provisionhost-0-0 ~]$ oc debug node/master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com
Starting pod/master-0-0ocp-edge-cluster-0qelabredhatcom-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"


This bug has already been reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1782852
I used the workaround from there:
$ oc tag -d openshift/tools:latest
$ oc tag  -n openshift $(oc get pods -n openshift-multus -l app=multus -o jsonpath='{.items[0].spec.containers[?(@.name=="kube-multus")].image}') tools:latest
$ oc get imagetag -n openshift tools:latest

Comment 5 Aleksandra Malykhin 2020-07-15 15:23:31 UTC
Link for the workaround in the previous comment is incorrect. See here: https://bugzilla.redhat.com/show_bug.cgi?id=1728135#c32

Comment 6 Aleksandra Malykhin 2020-07-19 07:09:11 UTC
Verified on 4.6.0-0.nightly-2020-07-15-065024, see the details below:
After shutting down all of the kube-apiservers, haproxy-monitor removed the firewall rule in less than 30 seconds (more than 30 seconds would suggest it is still using /healthz).
(Verified that we didn't break anything with this fix.)

[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com 
[core@master-0-2 ~]$ while true; do sleep 1; sudo crictl rm -f $(sudo crictl ps --name haproxy | awk 'FNR==2{ print $1}'); done
[core@master-0-2 ~]$ date 
Sun Jul 19 07:01:15 UTC 2020
[core@master-0-2 ~]$ sudo cat /var/log/pods/openshift-kni-infra_haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com_1bea838fdefc74b7bc393e1b9a638c96/haproxy-monitor/2.log

2020-07-19T07:01:22.602585718+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="API is not reachable through HAProxy"
2020-07-19T07:01:22.633405741+00:00 stderr F time="2020-07-19T07:01:22Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 50000 [{master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::138 6443} {master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::13d 6443} {master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::143 6443}] ::}"

Ben, please advise whether there is anything else I should check.

Comment 7 Aleksandra Malykhin 2020-07-20 14:05:37 UTC
As discussed, the ticket is verified by https://bugzilla.redhat.com/show_bug.cgi?id=1847082#c4

When the connection fails, the verbose output shows that localhost resolves to both IPv6 (::1) and IPv4 (127.0.0.1), and that both are tried automatically.

[kni@provisionhost-0-0 ~]$ oc rsh -n openshift-apiserver apiserver-6bbb844d98-pjxsg
sh-4.2# curl -k https://localhost:6443/readyz
curl: (7) Failed connect to localhost:6443; Connection refused
sh-4.2# curl -k -vvv  https://localhost:6443/readyz
* About to connect() to localhost port 6443 (#0)
*   Trying ::1...
* Connection refused
*   Trying 127.0.0.1...
* Connection refused
* Failed connect to localhost:6443; Connection refused
* Closing connection 0
curl: (7) Failed connect to localhost:6443; Connection refused