Bug 1833160

Summary: OCP 4.3.15 UPI bare metal installation: only two of three nodes are active
Product: OpenShift Container Platform
Component: RHCOS
Version: 4.3.z
Hardware: x86_64
OS: Linux
Severity: high
Priority: medium
Status: CLOSED NOTABUG
Reporter: Steven Ellis <sellis>
Assignee: Ben Howard <behoward>
QA Contact: Michael Nguyen <mnguyen>
CC: bbreard, imcleod, jligon, miabbott, nstielau
Target Release: 4.6.0
Type: Bug
Last Closed: 2020-05-15 19:09:34 UTC

Description Steven Ellis 2020-05-07 23:26:22 UTC
Description of problem:

OCP 4.3.15 on bare metal UPI. The environment deploys with no issues with the 4.3.9 installer. With 4.3.15 the install reports:

Error while reconciling 4.3.15: the cluster operator openshift-apiserver has not yet successfully rolled out

We also only have two of the three masters as active nodes with etcd running.

Version-Release number of the following components:

openshift-install-linux-4.3.15.tar.gz
openshift-client-linux-4.3.15.tar.gz

How reproducible:

Consistent

Steps to Reproduce:

mkdir baremetal
cp install-config-redpill.yaml baremetal/install-config.yaml
openshift-install create manifests --dir=baremetal

# We need the masters to be schedulable, so we don't run this step
#sed -i "s/mastersSchedulable: true/mastersSchedulable: false/" baremetal/manifests/cluster-scheduler-02-config.yml

# Then generate the ign files
openshift-install create ignition-configs --dir=baremetal

openshift-install --dir=baremetal wait-for bootstrap-complete \
      --log-level=info

Bootstrap completes.

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.15    True        False         31m     Error while reconciling 4.3.15: the cluster operator monitoring is degraded
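For reference, a quick way to pinpoint which operator is actually unhealthy after a run like this (assuming the kubeconfig generated by the installer in the asset directory is used):

export KUBECONFIG=baremetal/auth/kubeconfig
oc get clusteroperators

Any operator showing DEGRADED=True or AVAILABLE=False is the one blocking the version rollout.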



Actual results:
oc get nodes
NAME                     STATUS   ROLES           AGE    VERSION
etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
localhost                Ready    master,worker   142m   v1.16.2


All the nodes should be etcd-[0-2].test.bionode.io

No SRV records were requested during bootstrap
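The lookups that bootstrap should have performed can be reproduced by hand. A minimal sketch, using this cluster's domain and the SRV record name required by the OCP 4.3 UPI DNS requirements (10.0.0.10 is a placeholder address):

# etcd peer discovery in 4.3 depends on this SRV record:
dig +short SRV _etcd-server-ssl._tcp.test.bionode.io

# Each master also needs matching forward and reverse records:
dig +short etcd-0.test.bionode.io
dig +short -x 10.0.0.10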

Expected results:

NAME                     STATUS   ROLES           AGE     VERSION
etcd-0.test.bionode.io   Ready    master,worker   7m19s   v1.16.2
etcd-1.test.bionode.io   Ready    master,worker   7m41s   v1.16.2
etcd-2.test.bionode.io   Ready    master,worker   7m20s   v1.16.2


Additional info:

Comment 3 Abhinav Dahiya 2020-05-11 16:44:44 UTC
> Actual results:
> oc get nodes
> NAME                     STATUS   ROLES           AGE    VERSION
> etcd-2.test.bionode.io   Ready    master,worker   142m   v1.16.2
> localhost                Ready    master,worker   142m   v1.16.2

This makes it seem like the nodes are failing to get the correct hostnames/node names, so moving to the RHCOS team to triage.
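A node registering as "localhost" typically means RHCOS never obtained a hostname, which it normally takes from DHCP or from a reverse DNS lookup. A minimal sketch of how to check both paths (10.0.0.12 stands in for the affected master's address):

# Is there a PTR record for the node's IP?
dig +short -x 10.0.0.12

# On the node itself: what hostname was applied, and what did
# NetworkManager log about it while bringing up the interface?
hostnamectl status
journalctl -u NetworkManager | grep -i hostname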

Comment 4 Steven Ellis 2020-05-15 05:28:16 UTC
It looks like the problem occurs when any of the following happens:

1 - the DNS server gets overloaded with queries, or

2 - reverse DNS lookups fail, or

3 - IPv6 address resolution misbehaves.

I've moved my config to just using dnsmasq and tuned it. Currently I can reliably deploy 4.3.19.
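For reference, a minimal sketch of the kind of dnsmasq setup this implies; the IP addresses and tuning values are placeholders, and the record names follow this cluster's domain:

cat > /etc/dnsmasq.d/ocp.conf <<'EOF'
# Forward + reverse (A + PTR) records for the masters
host-record=etcd-0.test.bionode.io,10.0.0.10
host-record=etcd-1.test.bionode.io,10.0.0.11
host-record=etcd-2.test.bionode.io,10.0.0.12

# etcd SRV records required by OCP 4.3: target, port 2380, priority 0, weight 10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-0.test.bionode.io,2380,0,10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-1.test.bionode.io,2380,0,10
srv-host=_etcd-server-ssl._tcp.test.bionode.io,etcd-2.test.bionode.io,2380,0,10

# Headroom for bootstrap query bursts; tune to taste
cache-size=1000
dns-forward-max=500
EOF
systemctl restart dnsmasq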

Comment 5 Micah Abbott 2020-05-15 19:09:34 UTC
Based on comment #4, it looks like this was related to DNS resolution problems.

I don't think there is much that can be done on the RHCOS side for DNS resolution issues; closing as NOTABUG.

If you think there is more that should be done here, please reopen.