Cause:
On the baremetal platform, an infra-dns container runs on each host to support node name resolution and other internal DNS records. To complete the picture, a NetworkManager (NM) script updates the host's resolv.conf to point to the infra-dns container.
Additionally, when pods are created they receive their DNS configuration file (/etc/resolv.conf) from their host.
In this case, the HAProxy pod was created before the NM script updated the host's resolv.conf.
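The mismatch can be confirmed by comparing the pod's copy of resolv.conf with the host's. A minimal diagnostic sketch, using the pod name from the output below (the node name master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com is inferred from the static pod name and is an assumption):

$ oc exec -n openshift-kni-infra haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -c haproxy-monitor -- cat /etc/resolv.conf > pod-resolv.conf
$ oc debug node/master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -- chroot /host cat /etc/resolv.conf > host-resolv.conf
$ diff pod-resolv.conf host-resolv.conf
# Any diff output means the pod is still running with the stale pre-infra-dns copy.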
Consequence:
The HAProxy pod repeatedly failed because the api-int internal DNS record was not resolvable.
Fix:
Verify that the resolv.conf of the HAProxy pod is identical to the host's resolv.conf file.
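A rough sketch of such a check, as a shell loop (illustrative only; it assumes the host's file is mounted into the monitor container at /host/etc/resolv.conf):

# Wait until the pod's resolv.conf matches the host's before proceeding.
while ! cmp -s /etc/resolv.conf /host/etc/resolv.conf; do
    echo "resolv.conf out of sync with host, waiting..."
    sleep 5
done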
Result:
HAProxy runs with no error.
Created attachment 1698234: haproxy-monitor log
Description of problem:
Immediately after OpenShift was successfully deployed on a baremetal environment with both the baremetal and provisioning networks on IPv6, the haproxy pod on one of the masters is crashlooping:
$ oc get pods -n openshift-kni-infra
NAME                                                      READY   STATUS             RESTARTS   AGE
haproxy-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   2/2     Running            2          140m
haproxy-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   2/2     Running            0          140m
haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   1/2     CrashLoopBackOff   48         141m
$ oc logs haproxy-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com -c haproxy-monitor -n openshift-kni-infra
time="2020-06-21T16:02:03Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on [fe80::5054:ff:fe40:bfc7%enp5s0]:53: no such host"
time="2020-06-21T16:02:03Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2020-06-21T16:02:03Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-20-194346
How reproducible:
Happened in 2 out of 3 deployments.
Steps to Reproduce:
1. Deploy OpenShift
Actual results:
haproxy pod in state CrashLoopBackOff
Expected results:
All pods in Running/Complete status
Additional info:
Attaching logs and must-gather:
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1849432_must-gather.zip
1. The reported pod was the only one flagged as problematic.
2. It was a basic, standard deployment.
3. I got the same problem on 2 different servers. It is not always reproducible. I'll ping you.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196