Bug 2024826 - [RHOS/IPI] Masters are not joining the cluster when installing on OpenStack
Summary: [RHOS/IPI] Masters are not joining the cluster when installing on OpenStack
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Martin André
QA Contact: Udi Shkalim
Depends On:
Reported: 2021-11-19 08:08 UTC by Lukas Bednar
Modified: 2022-03-10 16:30 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-03-10 16:29:41 UTC
Target Upstream Version:

Attachments
openshift_install.log (109.17 KB, text/plain), 2021-11-19 08:08 UTC, Lukas Bednar

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2835 | 0 | None | open | Bug 2024826: Allow resolv prepender without default search domain | 2021-11-19 11:10:43 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:30:04 UTC

Description Lukas Bednar 2021-11-19 08:08:45 UTC
Created attachment 1842681

Version: 4.10.0-0.nightly-2021-11-15-034648

$ openshift-install version

openshift-install 4.10.0-0.nightly-2021-11-15-034648
built from commit 8fc863d833b1b361efc61c81998890e1305bcf9b
release image registry.ci.openshift.org/ocp/release@sha256:4975c19c8d645f0bfa68e770c16c688bc0590b20440de0265c802aae774aa1b7
release architecture amd64

Platform: openstack

What happened?

Starting on 15 November, our OCP 4.10 deployments have been failing.
We use the nightly channel (registry.ci.openshift.org/ocp/release:4.10) to deploy 4.10 clusters.
We use IPI deployment on top of OpenStack (failing on both RHOS-D and RHOS-C01).

The master nodes are not joining the cluster. On the console of a master node, I see only four containers running, and the coredns-monitor container logs errors.

I don't see any suspicious errors on the bootstrap node; it is waiting for the masters to join so that it can schedule workloads on them.

$ crictl ps
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                 ATTEMPT             POD ID
59f033eb5dd4c       232fb20b94dcad8030a95eef9c09c9f9f4e89f1685ac405ada44d0203d8d07c5                                                         25 minutes ago      Running             coredns-monitor      0                   8346d59db7493
2b1dfbd079c2c       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f0c1b89092c1966baa30586089f8698f2768b346717194f925cd80dfd84ed040   25 minutes ago      Running             coredns              0                   8346d59db7493
ca7d0d4a174b7       232fb20b94dcad8030a95eef9c09c9f9f4e89f1685ac405ada44d0203d8d07c5                                                         25 minutes ago      Running             keepalived-monitor   0                   b0a7e4c78fef6
60594257d94dc       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e96c1755163ecb2827bf4b4d1dfdabf2a125e6aeef620a0b8ba52d0c450432c   25 minutes ago      Running             keepalived           0                   b0a7e4c78fef6

$ crictl logs 59f033eb5dd4c
time="2021-11-18T16:56:56Z" level=error msg="Failed to build client config: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory, unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory]"

What did you expect to happen?

The cluster comes up as usual, with three master and three worker nodes.

How to reproduce it (as minimally and precisely as possible)?

$ openshift-install create cluster --dir cnv-qe.rhcloud.com/c01-lbednar --log-level debug

Anything else we need to know?

I attached logs from the bootstrap VM and from master-0, plus the install log and install config.

Comment 7 Martin André 2021-11-19 10:30:57 UTC
The issue was introduced with https://github.com/openshift/machine-config-operator/pull/2823/. 

The local nameserver is only prepended if /var/run/NetworkManager/resolv.conf has a default search domain.

On PSI, the default resolv.conf looks like this:
[core@mandre-psi-8x4w9-master-0 ~]$ cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
nameserver x.x.x.x

As a result, the resolv.conf generated by the NetworkManager-resolv-prepender script does not include the local nameserver:
[core@mandre-psi-8x4w9-master-0 ~]$ cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
nameserver x.x.x.x
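
To make the condition above concrete, here is a minimal Python sketch of the two behaviours. It is not the actual NetworkManager-resolv-prepender script (which is a dispatcher script, not Python). The function names, the LOCAL_NAMESERVER value and the sample upstream address are assumptions made up for illustration. The first function models the regressed behaviour described in this comment. The second models what the linked PR title suggests, namely prepending even without a default search domain.

```python
# Illustrative sketch only; NOT the real NetworkManager-resolv-prepender script.
# LOCAL_NAMESERVER, the sample addresses and all function names are hypothetical.

LOCAL_NAMESERVER = "192.0.2.10"  # assumed address of the node-local DNS server
HEADER = "# Generated by KNI resolv prepender NM dispatcher script"


def _parse(nm_resolv_conf):
    """Split resolv.conf text into (search domains, nameservers)."""
    search, nameservers = [], []
    for line in nm_resolv_conf.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if fields[0] == "search":
            search.extend(fields[1:])
        elif fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
    return search, nameservers


def _render(search, nameservers):
    """Render a resolv.conf with the prepender header."""
    lines = [HEADER]
    if search:
        lines.append("search " + " ".join(search))
    lines.extend("nameserver " + ns for ns in nameservers)
    return "\n".join(lines) + "\n"


def prepend_requires_search(nm_resolv_conf):
    """Regressed behaviour: prepend only when a search domain exists."""
    search, nameservers = _parse(nm_resolv_conf)
    if search:
        nameservers = [LOCAL_NAMESERVER] + nameservers
    return _render(search, nameservers)


def prepend_always(nm_resolv_conf):
    """Intended behaviour: prepend regardless of the search domain."""
    search, nameservers = _parse(nm_resolv_conf)
    others = [ns for ns in nameservers if ns != LOCAL_NAMESERVER]
    return _render(search, [LOCAL_NAMESERVER] + others)


# Input shaped like the PSI file above: one nameserver, no search line.
psi = "# Generated by NetworkManager\nnameserver 198.51.100.1\n"
print(prepend_requires_search(psi))  # local nameserver is missing
print(prepend_always(psi))           # local nameserver comes first
```

With the PSI-style input, the first function leaves the upstream nameserver as the only entry, which matches the /etc/resolv.conf shown above.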

Comment 8 ShiftStack Bugwatcher 2021-11-25 16:12:55 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 11 Udi Shkalim 2022-01-16 12:09:26 UTC
Verified on: 4.10.0-0.nightly-2022-01-08-215919
CI run for profile 23_IPI on OSP16 & FIPS on & OVN & csidriver passed

Comment 15 errata-xmlrpc 2022-03-10 16:29:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

