Bug 1833160

Summary: OCP 4.3.15 UPI bare metal installation: only two of three nodes are active
Product: OpenShift Container Platform
Component: RHCOS
Version: 4.3.z
Hardware: x86_64
OS: Linux
Severity: high
Priority: medium
Status: CLOSED NOTABUG
Reporter: Steven Ellis <sellis>
Assignee: Ben Howard <behoward>
QA Contact: Michael Nguyen <mnguyen>
CC: bbreard, imcleod, jligon, miabbott, nstielau
Target Release: 4.6.0
Type: Bug
Last Closed: 2020-05-15 19:09:34 UTC

Description Steven Ellis 2020-05-07 23:26:22 UTC
Description of problem:

OCP 4.3.15 on bare metal UPI. The environment deploys with no issues with the 4.3.9 installer. With 4.3.15 the install reports:

Error while reconciling 4.3.15: the cluster operator openshift-apiserver has not yet successfully rolled out

We also only have two of the three masters as active nodes with etcd running.

Version-Release number of the following components:

openshift-install-linux-4.3.15.tar.gz
openshift-client-linux-4.3.15.tar.gz

How reproducible:

Consistent

Steps to Reproduce:

mkdir baremetal
cp install-config-redpill.yaml baremetal/install-config.yaml
openshift-install create manifests --dir=baremetal

# We need the masters to be schedulable, so we don't run this step
#sed -i "s/mastersSchedulable: true/mastersSchedulable: false/" baremetal/manifests/cluster-scheduler-02-config.yml

# Then generate the ign files
openshift-install create ignition-configs --dir=baremetal

openshift-install --dir=baremetal wait-for bootstrap-complete \
      --log-level=info

Bootstrap completes.

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.15    True        False         31m     Error while reconciling 4.3.15: the cluster operator monitoring is degraded
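For reference, a quick way to pinpoint which operator is actually unhealthy after a run like this (assuming the kubeconfig generated by the installer in the asset directory is used):

export KUBECONFIG=baremetal/auth/kubeconfig
oc get clusteroperators

Any operator showing DEGRADED=True or AVAILABLE=False is the one blocking the version rollout.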



Actual results:
oc get nodes
NAME                     STATUS   ROLES           AGE    VERSION
etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
localhost                Ready    master,worker   142m   v1.16.2


All the nodes should be etcd-[0-2].test.bionode.io

No SRV records were requested during bootstrap
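The lookups that bootstrap should have performed can be reproduced by hand. A minimal sketch, using this cluster's domain and the SRV record name required by the OCP 4.3 UPI DNS requirements (10.0.0.10 is a placeholder address):

# etcd peer discovery in 4.3 depends on this SRV record:
dig +short SRV _etcd-server-ssl._tcp.test.bionode.io

# Each master also needs matching forward and reverse records:
dig +short etcd-0.test.bionode.io
dig +short -x 10.0.0.10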

Expected results:

NAME                     STATUS   ROLES           AGE     VERSION
etcd-0.test.bionode.io   Ready    master,worker   7m19s   v1.16.2
etcd-1.test.bionode.io   Ready    master,worker   7m41s   v1.16.2
etcd-2.test.bionode.io   Ready    master,worker   7m20s   v1.16.2


Additional info:

Comment 3 Abhinav Dahiya 2020-05-11 16:44:44 UTC
> Actual results:
> oc get nodes
> NAME                     STATUS   ROLES           AGE    VERSION
> etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
> localhost                Ready    master,worker   142m   v1.16.2

This makes it seem like the nodes are failing to get the correct hostnames/node names, so moving to the RHCOS team to triage.
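A node registering as "localhost" typically means RHCOS never obtained a hostname, which it normally takes from DHCP or from a reverse DNS lookup. A minimal sketch of how to check both paths (10.0.0.12 stands in for the affected master's address):

# Is there a PTR record for the node's IP?
dig +short -x 10.0.0.12

# On the node itself: what hostname was applied, and what did
# NetworkManager log about it while bringing up the interface?
hostnamectl status
journalctl -u NetworkManager | grep -i hostname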

Comment 4 Steven Ellis 2020-05-15 05:28:16 UTC
It looks like the problem occurs when any of the following happens:

1 - the DNS server gets overloaded with queries, or

2 - reverse DNS lookups fail, or

3 - IPv6 address resolution misbehaves.

I've moved my config to just using dnsmasq and tuned it. Currently I can reliably deploy 4.3.19.
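For reference, a minimal sketch of the kind of dnsmasq setup this implies; the IP addresses and tuning values are placeholders, and the record names follow this cluster's domain:

cat > /etc/dnsmasq.d/ocp.conf <<'EOF'
# Forward + reverse (A + PTR) records for the masters
host-record=etcd-0.test.bionode.io,10.0.0.10
host-record=etcd-1.test.bionode.io,10.0.0.11
host-record=etcd-2.test.bionode.io,10.0.0.12

# etcd SRV records required by OCP 4.3: target, port 2380, priority 0, weight 10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-0.test.bionode.io,2380,0,10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-1.test.bionode.io,2380,0,10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-2.test.bionode.io,2380,0,10

# Headroom for bootstrap query bursts; tune to taste
cache-size=1000
dns-forward-max=500
EOF
systemctl restart dnsmasq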

Comment 5 Micah Abbott 2020-05-15 19:09:34 UTC
Based on comment #4, it looks like this was related to DNS resolution problems.

I don't think there is much that can be done on the RHCOS side for DNS resolution issues; closing as NOTABUG.

If you think there is more that should be done here, please reopen.