Bug 1849432

Summary: haproxy pod on one of the masters in CrashLoopBackOff after deploy of OpenShift BareMetal IPv6
Product: OpenShift Container Platform
Reporter: Lubov <lshilin>
Component: Machine Config Operator
Assignee: Yossi Boaron <yboaron>
Status: CLOSED ERRATA
QA Contact: Aleksandra Malykhin <amalykhi>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: amalykhi, asegurap, bperkins, miabbott, rbartal, shardy, yboaron
Target Milestone: ---
Keywords: Triaged
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On the baremetal platform, an infra-dns container runs on each host to support node name resolution and other internal DNS records. To complete the picture, a NetworkManager script updates the host's resolv.conf to point to the infra-dns container. Additionally, when pods are created they receive their DNS configuration file (/etc/resolv.conf) from their host. In this case, the HAProxy pod was created before the NetworkManager script updated the host's resolv.conf.
Consequence: The HAProxy pod repeatedly failed because the api-int internal DNS record was not resolvable.
Fix: Verify that the resolv.conf of the HAProxy pod is identical to the host's resolv.conf file.
Result: HAProxy runs with no error.
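The fix described above can be sketched as a wait loop before HAProxy starts. This is a hypothetical illustration, not the actual patch; the mount path, function name, and retry budget are assumptions:

```shell
#!/bin/sh
# Hypothetical sketch of the fix: before starting haproxy, wait until the
# pod's /etc/resolv.conf is byte-identical to the host's copy, i.e. until
# the NetworkManager script has pointed the host at the infra-dns container.
HOST_RESOLV="${HOST_RESOLV:-/host/etc/resolv.conf}"  # host file mounted into the pod (assumed path)
POD_RESOLV="${POD_RESOLV:-/etc/resolv.conf}"

# Return 0 once the two files match, 1 if they never converge within the
# given number of retries.
wait_for_resolv_sync() {
  retries=${1:-30}
  until cmp -s "$HOST_RESOLV" "$POD_RESOLV"; do
    retries=$((retries - 1))
    if [ "$retries" -le 0 ]; then
      echo "resolv.conf still out of sync with host; giving up" >&2
      return 1
    fi
    sleep 2
  done
}

# Usage (in a container entrypoint):
#   wait_for_resolv_sync && exec haproxy -f /etc/haproxy/haproxy.cfg
```

Gating startup on file equality avoids the race: the pod no longer tries to resolve api-int through a stale nameserver inherited at creation time.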
Story Points: ---
Clone Of:
: 1862874 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:08:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1862874    
Attachments:
haproxy-monitor log (flags: none)
haproxy container log (flags: none)

Description Lubov 2020-06-21 16:22:24 UTC
Created attachment 1698234 [details]
haproxy-monitor log

Description of problem:
Immediately after OpenShift is successfully deployed on a BareMetal environment with both the baremetal and provisioning networks on IPv6, the haproxy pod on one of the masters is crash-looping:

$ oc get pods -n openshift-kni-infra
NAME                                                             READY   STATUS    RESTARTS   AGE
haproxy-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   2          140m
haproxy-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com          2/2     Running   0          140m
haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com          1/2     CrashLoopBackOff   48         141m

$ oc logs haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -c haproxy-monitor -n openshift-kni-infra
time="2020-06-21T16:02:03Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on [fe80::5054:ff:fe40:bfc7%enp5s0]:53: no such host"
time="2020-06-21T16:02:03Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2020-06-21T16:02:03Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig 

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-20-194346

How reproducible:
Happened in 2 of 3 deployments

Steps to Reproduce:
1. Deploy OpenShift 

Actual results:
haproxy pod in state CrashLoopBackOff

Expected results:
All pods in Running/Complete status


Additional info:
Attaching logs and must-gather:

http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1849432_must-gather.zip

Comment 1 Lubov 2020-06-21 16:24:39 UTC
Created attachment 1698235 [details]
haproxy container log

Comment 3 Lubov 2020-06-22 13:41:23 UTC
1. The reported pod was the only one reported as problematic.
2. It was a basic, standard deployment.
3. I got the same problem on 2 different servers. It is not always reproducible. I'll ping you.

Comment 5 Micah Abbott 2020-08-11 20:35:36 UTC
@Amit could your team handle verifying this BZ?

Comment 9 errata-xmlrpc 2020-10-27 16:08:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196