Bug 1849432 - haproxy pod on one of masters in CrashLoopBackOff after deploy of OpenShift BareMetal IPv6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Yossi Boaron
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Depends On:
Blocks: 1862874
 
Reported: 2020-06-21 16:22 UTC by Lubov
Modified: 2020-10-27 16:08 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On the baremetal platform, an infra-dns container runs on each host to support node name resolution and other internal DNS records. To complete the picture, a NetworkManager script updates the host's resolv.conf to point to the infra-dns container. Additionally, when pods are created they receive their DNS configuration file (/etc/resolv.conf) from their host. In this case, the HAProxy pod was created before the NetworkManager script updated the host's resolv.conf. Consequence: The HAProxy pod repeatedly failed because the api-int internal DNS record was not resolvable. Fix: Verify that the resolv.conf of the HAProxy pod is identical to the host's resolv.conf file. Result: HAProxy runs with no errors.
Clone Of:
: 1862874 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:08:12 UTC
Target Upstream Version:
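
The fix described in the Doc Text above amounts to keeping the pod's resolv.conf in sync with the host's. A minimal sketch of that check, using temporary files as stand-ins (the real paths and the MCO's Go implementation differ; this only illustrates the idea):

```shell
# Sketch of the resolv.conf sync check; temp files stand in for the real paths.
host_conf=$(mktemp)   # stands in for the host's /etc/resolv.conf
pod_conf=$(mktemp)    # stands in for the copy the HAProxy static pod sees

# Host resolv.conf after the NetworkManager script pointed it at infra-dns
printf 'search qe.lab.redhat.com\nnameserver fd00::1\n' > "$host_conf"
# Stale copy the pod picked up before the script ran
printf 'nameserver fe80::5054:ff:fe40:bfc7%%enp5s0\n' > "$pod_conf"

# If the two files differ, re-sync the pod copy from the host so that
# api-int lookups reach the infra-dns container.
if ! cmp -s "$host_conf" "$pod_conf"; then
    cp "$host_conf" "$pod_conf"
    echo "resolv.conf re-synced"
fi
```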


Attachments (Terms of Use)
haproxy-monitor log (3.82 MB, text/plain)
2020-06-21 16:22 UTC, Lubov
haproxy container log (3.86 MB, text/plain)
2020-06-21 16:24 UTC, Lubov


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1872 0 None closed Bug 1849432: [baremetal] verify resolv.conf in HAProxy static pod synced with host resolv.conf 2021-02-17 08:33:22 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:08:39 UTC

Description Lubov 2020-06-21 16:22:24 UTC
Created attachment 1698234 [details]
haproxy-monitor log


Description of problem:
Immediately after OpenShift was successfully deployed on a BareMetal environment with both baremetal and provisioning IPv6 networks, the haproxy pod on one of the masters is crash-looping

$ oc get pods -n openshift-kni-infra
NAME                                                             READY   STATUS    RESTARTS   AGE
haproxy-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   2          140m
haproxy-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   0          140m
haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com          1/2     CrashLoopBackOff   48         141m

$ oc logs haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -c haproxy-monitor -n openshift-kni-infra
time="2020-06-21T16:02:03Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on [fe80::5054:ff:fe40:bfc7%enp5s0]:53: no such host"
time="2020-06-21T16:02:03Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2020-06-21T16:02:03Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig 
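
The "no such host" error above shows the pod querying a link-local DNS server inherited from a stale resolv.conf. As a quick diagnostic, the nameservers a container is actually using can be listed from its resolv.conf; a sketch, with a temp file standing in for the pod's /etc/resolv.conf:

```shell
# Sketch: list the nameserver entries from a resolv.conf copy.
# A temp file stands in for the pod's /etc/resolv.conf.
conf=$(mktemp)
printf 'search qe.lab.redhat.com\nnameserver fe80::5054:ff:fe40:bfc7%%enp5s0\n' > "$conf"

# Print only the nameserver addresses, one per line
awk '/^nameserver/ { print $2 }' "$conf"
# -> fe80::5054:ff:fe40:bfc7%enp5s0
```

A link-local (fe80::) nameserver here, rather than the infra-dns address the NetworkManager script writes, is the symptom this bug describes.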

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-20-194346

How reproducible:
happened in 2 out of 3 deployments

Steps to Reproduce:
1. Deploy OpenShift

Actual results:
haproxy pod in state CrashLoopBackOff

Expected results:
All pods in Running/Complete status


Additional info:
attaching logs and must-gather

http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1849432_must-gather.zip

Comment 1 Lubov 2020-06-21 16:24:39 UTC
Created attachment 1698235 [details]
haproxy container log

Comment 3 Lubov 2020-06-22 13:41:23 UTC
1. The reported pod was the only one reported as problematic
2. It was basic standard deployment
3. I got the same problem on 2 different servers. It is not always reproducible. I'll ping you

Comment 5 Micah Abbott 2020-08-11 20:35:36 UTC
@Amit could your team handle verifying this BZ?

Comment 9 errata-xmlrpc 2020-10-27 16:08:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

