Bug 1849432 - haproxy pod on one of masters in CrashLoopBackOff after deploy of OpenShift BareMetal IPv6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Yossi Boaron
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Depends On:
Blocks: 1862874
 
Reported: 2020-06-21 16:22 UTC by Lubov
Modified: 2020-10-27 16:08 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On the baremetal platform, an infra-dns container runs on each host to support node name resolution and other internal DNS records. To complete the picture, a NetworkManager script updates the host's resolv.conf to point to the infra-dns container. Additionally, when pods are created they receive their DNS configuration file (/etc/resolv.conf) from their host. In this case, the HAProxy pod was created before the NetworkManager script updated the host's resolv.conf. Consequence: The HAProxy pod repeatedly failed because the api-int internal DNS record was not resolvable. Fix: Verify that the resolv.conf of the HAProxy pod is identical to the host's resolv.conf file. Result: HAProxy runs with no errors.
Clone Of:
: 1862874 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:08:12 UTC
Target Upstream Version:
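
The fix described in the Doc Text above amounts to keeping the pod's resolv.conf in sync with the host's. A minimal sketch of that check, using temporary files as stand-ins (the real paths and the MCO's Go implementation differ; this only illustrates the idea):

```shell
# Sketch of the resolv.conf sync check; temp files stand in for the real paths.
host_conf=$(mktemp)   # stands in for the host's /etc/resolv.conf
pod_conf=$(mktemp)    # stands in for the copy the HAProxy static pod sees

# Host resolv.conf after the NetworkManager script pointed it at infra-dns
printf 'search qe.lab.redhat.com\nnameserver fd00::1\n' > "$host_conf"
# Stale copy the pod picked up before the script ran
printf 'nameserver fe80::5054:ff:fe40:bfc7%%enp5s0\n' > "$pod_conf"

# If the two files differ, re-sync the pod copy from the host so that
# api-int lookups reach the infra-dns container.
if ! cmp -s "$host_conf" "$pod_conf"; then
    cp "$host_conf" "$pod_conf"
    echo "resolv.conf re-synced"
fi
```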


Attachments (Terms of Use)
haproxy-monitor log (3.82 MB, text/plain)
2020-06-21 16:22 UTC, Lubov
haproxy container log (3.86 MB, text/plain)
2020-06-21 16:24 UTC, Lubov


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1872 0 None closed Bug 1849432: [baremetal] verify resolv.conf in HAProxy static pod synced with host resolv.conf 2021-02-17 08:33:22 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:08:39 UTC

Description Lubov 2020-06-21 16:22:24 UTC
Created attachment 1698234 [details]
haproxy-monitor log


Description of problem:
Immediately after OpenShift was successfully deployed on a BareMetal environment with both baremetal and provisioning IPv6 networks, the haproxy pod on one of the masters is crash-looping

$ oc get pods -n openshift-kni-infra
NAME                                                             READY   STATUS    RESTARTS   AGE
haproxy-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   2          140m
haproxy-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   0          140m
haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com          1/2     CrashLoopBackOff   48         141m

$ oc logs haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -c haproxy-monitor -n openshift-kni-infra
time="2020-06-21T16:02:03Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on [fe80::5054:ff:fe40:bfc7%enp5s0]:53: no such host"
time="2020-06-21T16:02:03Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2020-06-21T16:02:03Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig 
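
The "no such host" error above shows the pod querying a link-local DNS server inherited from a stale resolv.conf. As a quick diagnostic, the nameservers a container is actually using can be listed from its resolv.conf; a sketch, with a temp file standing in for the pod's /etc/resolv.conf:

```shell
# Sketch: list the nameserver entries from a resolv.conf copy.
# A temp file stands in for the pod's /etc/resolv.conf.
conf=$(mktemp)
printf 'search qe.lab.redhat.com\nnameserver fe80::5054:ff:fe40:bfc7%%enp5s0\n' > "$conf"

# Print only the nameserver addresses, one per line
awk '/^nameserver/ { print $2 }' "$conf"
# -> fe80::5054:ff:fe40:bfc7%enp5s0
```

A link-local (fe80::) nameserver here, rather than the infra-dns address the NetworkManager script writes, is the symptom this bug describes.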

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-20-194346

How reproducible:
happened in 2 out of 3 deployments

Steps to Reproduce:
1. Deploy OpenShift

Actual results:
haproxy pod in state CrashLoopBackOff

Expected results:
All pods in Running/Complete status


Additional info:
attaching logs and must-gather

http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1849432_must-gather.zip

Comment 1 Lubov 2020-06-21 16:24:39 UTC
Created attachment 1698235 [details]
haproxy container log

Comment 3 Lubov 2020-06-22 13:41:23 UTC
1. The reported pod was the only one reported as problematic
2. It was basic standard deployment
3. I got the same problem on 2 different servers. It is not always reproducible. I'll ping you

Comment 5 Micah Abbott 2020-08-11 20:35:36 UTC
@Amit could your team handle verifying this BZ?

Comment 9 errata-xmlrpc 2020-10-27 16:08:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

