1833160 – OCP 4.3.15 UPI bare metal installation only two of three nodes are active

Summary: OCP 4.3.15 UPI bare metal installation only two of three nodes are active

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.3.z
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Ben Howard
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-07 23:26 UTC by Steven Ellis
Modified:	2020-05-15 19:09 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-15 19:09:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1814576	0	high	CLOSED	Bootstrap stuck on waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True	2023-10-06 19:26:51 UTC
Red Hat Bugzilla	1832120	0	unspecified	CLOSED	OCP 4.4 UPI bare metal installation bootstrap etcd Degraded	2021-02-22 00:41:40 UTC

Description Steven Ellis 2020-05-07 23:26:22 UTC

Description of problem:

OCP 4.3.15 on bare metal UPI. The environment deploys with no issues using the 4.3.9 installer. With 4.3.15 the install reports:

Error while reconciling 4.3.15: the cluster operator openshift-apiserver has not yet successfully rolled out

Also, only two of the three masters are active nodes with etcd running.

Version-Release number of the following components:

openshift-install-linux-4.3.15.tar.gz
openshift-client-linux-4.3.15.tar.gz

How reproducible:

Consistent

Steps to Reproduce:

mkdir baremetal
cp install-config-redpill.yaml baremetal/install-config.yaml
openshift-install create manifests --dir=baremetal

# We need the masters to be schedulable so we don't run this step
#sed -i "s/mastersSchedulable: true/mastersSchedulable: false/" baremetal/manifests/cluster-scheduler-02-config.yml

# Then generate the ign files
openshift-install create ignition-configs --dir=baremetal

openshift-install --dir=baremetal wait-for bootstrap-complete \
      --log-level=info

Bootstrap completes

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.15    True        False         31m     Error while reconciling 4.3.15: the cluster operator monitoring is degraded



Actual results:
oc get nodes
NAME                     STATUS   ROLES           AGE    VERSION
etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
localhost                Ready    master,worker   142m   v1.16.2


All the nodes should be etcd-[0-2].test.bionode.io
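
The naming expectation above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper name unexpected_nodes is hypothetical, and it assumes node names of the form etcd-N.<domain> as in this environment. It reads node names on stdin and prints any that do not match, such as "localhost".

```shell
#!/usr/bin/env bash
# Sketch only: unexpected_nodes is a hypothetical helper, not an oc feature.
# Reads one node name per line on stdin, with or without the "node/" prefix
# that `oc get nodes -o name` prints, and echoes every name that does not
# look like etcd-<digit>.<domain>.
unexpected_nodes() {
    local domain="$1"   # e.g. test.bionode.io, as seen in this report
    local name
    while IFS= read -r name; do
        name="${name#node/}"          # drop the resource prefix if present
        if [ -z "$name" ]; then
            continue                  # skip blank lines
        fi
        case "$name" in
            etcd-[0-9]."$domain") ;;      # expected form, nothing to report
            *) printf '%s\n' "$name" ;;   # unexpected, e.g. "localhost"
        esac
    done
}
```

Possible usage against a live cluster: oc get nodes -o name | unexpected_nodes test.bionode.io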

No SRV records were requested during bootstrap

Expected results:

NAME                     STATUS   ROLES           AGE     VERSION
etcd-0.test.bionode.io   Ready    master,worker   7m19s   v1.16.2
etcd-1.test.bionode.io   Ready    master,worker   7m41s   v1.16.2
etcd-2.test.bionode.io   Ready    master,worker   7m20s   v1.16.2


Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Abhinav Dahiya 2020-05-11 16:44:44 UTC

> Actual results:
> oc get nodes
> NAME                     STATUS   ROLES           AGE    VERSION
> etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
> localhost                Ready    master,worker   142m   v1.16.2

This makes it seem like the nodes are failing to get the correct hostnames/nodenames, so moving to the RHCOS team to triage.

Comment 4 Steven Ellis 2020-05-15 05:28:16 UTC

It looks like the problem occurs when any of the following applies:

1 - the DNS server gets overloaded with queries, or

2 - there are reverse DNS issues, or

3 - there are IPv6 address resolution issues.

I've moved my config to just using dnsmasq and tuned the config. Currently I can reliably deploy 4.3.19.
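
For anyone hitting similar symptoms, a quick forward and reverse lookup of each master before installing may help narrow down which of the causes above is involved. This is only a sketch under assumptions: the function names master_names and check_host are hypothetical, dig must be installed, and it assumes the PTR record for each master should return the same etcd-N.<domain> name that the node is expected to register with, as in this environment.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; requires `dig`.

# Print the expected master host names, one per line.
master_names() {
    local domain="$1" count="$2" i
    for ((i = 0; i < count; i++)); do
        printf 'etcd-%d.%s\n' "$i" "$domain"
    done
}

# Resolve one name forward (A), then resolve the first IPv4 answer back (PTR).
# Also reports any AAAA answers, since IPv6 resolution was one suspected cause.
check_host() {
    local host="$1" ip ptr aaaa
    # Keep only lines that look like IPv4 addresses (skips CNAME targets).
    ip="$(dig +short A "$host" | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)"
    if [ -z "$ip" ]; then
        echo "FAIL $host: no A record"
        return 1
    fi
    ptr="$(dig +short -x "$ip" | head -n 1)"
    # dig prints PTR targets with a trailing dot; strip it before comparing.
    if [ "${ptr%.}" != "$host" ]; then
        echo "FAIL $host: $ip reverse-resolves to '${ptr:-nothing}'"
        return 1
    fi
    echo "ok   $host <-> $ip"
    aaaa="$(dig +short AAAA "$host")"
    if [ -n "$aaaa" ]; then
        echo "note $host also has AAAA answers: $aaaa"
    fi
}
```

Possible usage: master_names test.bionode.io 3 | while read -r h; do check_host "$h"; done

Running the loop several times in quick succession gives a rough idea of whether the DNS server answers consistently under repeated queries.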

Comment 5 Micah Abbott 2020-05-15 19:09:34 UTC

Based on comment #4, it looks like this was related to DNS resolution problems.

I don't think there is much that can be done on the RHCOS side for DNS resolution issues; closing as NOTABUG.

If you think there is more that should be done here, please reopen.