Bug 2024826

Summary: [RHOS/IPI] Masters are not joining the cluster when installing on OpenStack
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on OpenStack
Reporter: Lukas Bednar <lbednar>
Assignee: Martin André <m.andre>
QA Contact: Udi Shkalim <ushkalim>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, ibesso, lbednar, m.andre, pprinett
Version: 4.10
Keywords: Triaged
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-03-10 16:29:41 UTC

Attachments:
openshift_install.log

Description Lukas Bednar 2021-11-19 08:08:45 UTC
Created attachment 1842681: openshift_install.log

Version: 4.10.0-0.nightly-2021-11-15-034648

$ openshift-install version

openshift-install 4.10.0-0.nightly-2021-11-15-034648
built from commit 8fc863d833b1b361efc61c81998890e1305bcf9b
release image registry.ci.openshift.org/ocp/release@sha256:4975c19c8d645f0bfa68e770c16c688bc0590b20440de0265c802aae774aa1b7
release architecture amd64


Platform: openstack


Please specify:
* IPI

What happened?

Starting on the 15th of November, we see our OCP-4.10 deployments failing.
We use the nightly channel (registry.ci.openshift.org/ocp/release:4.10) to deploy 4.10 clusters.
We use IPI deployment on top of OpenStack (failing on both RHOS-D and RHOS-C01).

The master nodes are not joining the cluster. When observing the console of the master nodes, I see only four containers there, and I see errors in the coredns-monitor container.

I don't see any suspicious errors on the bootstrap node; it is waiting for the masters to join so it can schedule some workload on them.


$ crictl ps
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                 ATTEMPT             POD ID
59f033eb5dd4c       232fb20b94dcad8030a95eef9c09c9f9f4e89f1685ac405ada44d0203d8d07c5                                                         25 minutes ago      Running             coredns-monitor      0                   8346d59db7493
2b1dfbd079c2c       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f0c1b89092c1966baa30586089f8698f2768b346717194f925cd80dfd84ed040   25 minutes ago      Running             coredns              0                   8346d59db7493
ca7d0d4a174b7       232fb20b94dcad8030a95eef9c09c9f9f4e89f1685ac405ada44d0203d8d07c5                                                         25 minutes ago      Running             keepalived-monitor   0                   b0a7e4c78fef6
60594257d94dc       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e96c1755163ecb2827bf4b4d1dfdabf2a125e6aeef620a0b8ba52d0c450432c   25 minutes ago      Running             keepalived           0                   b0a7e4c78fef6

$ crictl logs 59f033eb5dd4c
time="2021-11-18T16:56:56Z" level=error msg="Failed to build client config: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"

What did you expect to happen?

The cluster comes up as usual, with three master and three worker nodes.

How to reproduce it (as minimally and precisely as possible)?

$ openshift-install create cluster --dir cnv-qe.rhcloud.com/c01-lbednar --log-level debug

Anything else we need to know?

I attached logs from the bootstrap VM and from master-0, plus the install log and install config.

Comment 7 Martin André 2021-11-19 10:30:57 UTC
The issue was introduced with https://github.com/openshift/machine-config-operator/pull/2823/. 

The local nameserver is only prepended if /var/run/NetworkManager/resolv.conf contains a default search domain:
https://github.com/openshift/machine-config-operator/blob/e1fbf07d69ca1176a523d0568260c76a6add9f9c/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml#L72

On PSI, the default resolv.conf looks like this:
[core@mandre-psi-8x4w9-master-0 ~]$ cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
nameserver x.x.x.x

And thus, the resulting resolv.conf generated by the NetworkManager-resolv-prepender script is:
[core@mandre-psi-8x4w9-master-0 ~]$ cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
nameserver x.x.x.x
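
To make the failure concrete, here is a minimal bash sketch of the conditional described above, not the actual dispatcher script (NAMESERVER_IP is a hypothetical stand-in for the node-local resolver address):

NM_RESOLV=/var/run/NetworkManager/resolv.conf
NAMESERVER_IP=192.0.2.1   # hypothetical local resolver VIP

if grep -q '^search ' "$NM_RESOLV"; then
    # Only this branch prepends the local nameserver.
    {
        echo "# Generated by KNI resolv prepender NM dispatcher script"
        echo "nameserver $NAMESERVER_IP"
        grep -v '^#' "$NM_RESOLV"
    } > /etc/resolv.conf
else
    # A resolv.conf without a "search" line, as on PSI, takes this path
    # and never receives the local nameserver.
    {
        echo "# Generated by KNI resolv prepender NM dispatcher script"
        grep -v '^#' "$NM_RESOLV"
    } > /etc/resolv.conf
fi

With no "search" line in the PSI resolv.conf, the else branch runs and the masters are left without the node-local resolver, which presumably leaves them unable to resolve the cluster-internal API records and join the cluster.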

Comment 8 ShiftStack Bugwatcher 2021-11-25 16:12:55 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 11 Udi Shkalim 2022-01-16 12:09:26 UTC
Verified on: 4.10.0-0.nightly-2022-01-08-215919
CI run for profile 23_IPI on OSP16 (FIPS on, OVN, csidriver) passed.

Comment 15 errata-xmlrpc 2022-03-10 16:29:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056