1867718 – Unable to update OCP4.5 in disconnected env: cluster operator openshift-apiserver is degraded

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Douglas Smith
QA Contact:	Weibin Liang
Docs Contact:
URL:
Whiteboard:
Depends On:	1852802
Blocks:	1862865

Reported:	2020-08-10 15:21 UTC by Luke Meyer
Modified:	2020-08-18 12:36 UTC
CC List:	27 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1852802
Environment:
Last Closed:	2020-08-18 12:36:17 UTC
Target Upstream Version:
Embargoed:

Comment 1 Dan Winship 2020-08-10 16:40:36 UTC

We're going in circles: ovnkube-node is not writing out the readiness indicator file because it's not ready, because it can't start:

    2020-07-12T14:02:44.606840279Z ++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
    2020-07-12T14:02:44.800053754Z Unable to connect to the server: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host

But that's not a cluster-dns-operator problem, is it? The problem is that the DNS record for the apiserver loadbalancer does not exist. I'm not sure who is responsible for that in a bare metal cluster.
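
To make the failure above easier to triage across many log lines, here is a minimal sketch that extracts the unresolved host and the DNS server that was queried from a "no such host" line. The regular expression is an assumption based only on the single log line quoted in this comment, and `FailedLookup` and `parse_failed_lookup` are hypothetical names for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of the resolver error, taken from the log line quoted above:
#   "lookup <host> on <server>:<port>: no such host"
LOOKUP_ERROR = re.compile(
    r"lookup (?P<host>\S+) on (?P<server>\S+):(?P<port>\d+): no such host"
)


class FailedLookup(NamedTuple):
    host: str    # name that could not be resolved
    server: str  # DNS server that was asked
    port: int    # DNS server port


def parse_failed_lookup(line: str) -> Optional[FailedLookup]:
    """Return the failed lookup described by a log line, or None if there is none."""
    match = LOOKUP_ERROR.search(line)
    if match is None:
        return None
    return FailedLookup(match.group("host"), match.group("server"), int(match.group("port")))


# The log line from this comment, split only for readability.
line = (
    "2020-07-12T14:02:44.800053754Z Unable to connect to the server: dial tcp: "
    "lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host"
)
print(parse_failed_lookup(line))
```

The `server` field is the interesting part here: it shows which resolver the node actually asked, independent of which record is missing.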

Comment 2 Russell Bryant 2020-08-10 17:06:37 UTC

Is there a current must-gather available for this issue?

Even better, if someone can provide access to a broken cluster, please contact Antoni Segura Puimedon (asegurap) and me (rbryant).

Comment 3 Feng Pan 2020-08-10 17:32:59 UTC

See comment #2.

Comment 4 Daneyon Hansen 2020-08-10 17:49:54 UTC

> The problem is that the DNS record for the apiserver loadbalancer does not exist.

Neither the DNS operator nor the ingress operator is responsible for creating the DNS zones, records, and LBs for the control plane. According to [1], users are responsible for DNS management on bare metal. If one or more Kubernetes Service resources are created as part of the upgrade, I would expect the same DNS requirement to apply.

[1] https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installation-dns-user-infra_installing-restricted-networks-bare-metal
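
If DNS is user-managed, a quick pre-upgrade check that the control plane names resolve may help. The sketch below is not taken from the linked docs. It assumes the usual `api.<cluster>.<base domain>`, `api-int.<cluster>.<base domain>`, and `*.apps.<cluster>.<base domain>` naming, uses the console route host as a stand-in for the wildcard record, and the function names are hypothetical. The resolver is injectable so the logic can be exercised without touching real DNS.

```python
import socket
from typing import Callable, Dict, Iterable, List


def control_plane_names(cluster: str, base_domain: str) -> List[str]:
    """Build the names to check. The naming pattern is an assumption, not from this bug."""
    suffix = f"{cluster}.{base_domain}"
    return [
        f"api.{suffix}",
        f"api-int.{suffix}",
        # Any host under *.apps exercises the wildcard record; this one is an example.
        f"console-openshift-console.apps.{suffix}",
    ]


def check_resolution(
    names: Iterable[str],
    resolve: Callable = socket.getaddrinfo,
) -> Dict[str, bool]:
    """Map each name to True if it resolves and False if the resolver reports a failure."""
    results = {}
    for name in names:
        try:
            resolve(name, None)  # the port is irrelevant for a pure name lookup
            results[name] = True
        except socket.gaierror:
            results[name] = False
    return results


# Example (performs real DNS lookups, so it is left commented out):
# print(check_resolution(control_plane_names("ocp-edge-cluster-0", "qe.lab.redhat.com")))
```

Run from a node, this uses whatever resolver that node is configured with, which is the view that matters for the failure reported here.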

Comment 5 Ben Nemec 2020-08-10 21:50:08 UTC

(In reply to Daneyon Hansen from comment #4)
> > The problem is that the DNS record for the apiserver loadbalancer does not exist.
> 
> Neither the DNS operator nor the ingress operator is responsible for
> creating the DNS zones, records and LBs for the control plane. According to
> [1], users are responsible for DNS management on bare metal. If one or more
> Kubernetes Service resources are created as part of the upgrade, I would
> expect the same DNS requirement to be true.
> 
> [1]
> https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installation-dns-user-infra_installing-restricted-networks-bare-metal

Note that those docs are for UPI, and I believe this is IPI (someone correct me if I'm wrong though).

For IPI, only the external API and ingress records are required. api-int and other internal records are provided by our internal coredns instance. The fact that the lookup is going to 192.168.123.1 makes me think there may have been an issue with the prepender script that is supposed to point the node at itself for DNS resolution (each node runs a copy of coredns with these records). However, I don't see anywhere we could get the NetworkManager logs from must-gather, so it's hard to say for sure.
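
One way to test the prepender theory on a node is to check whether the first `nameserver` entry in `/etc/resolv.conf` is one of the node's own addresses. This is a minimal sketch under that assumption. The exact behavior of the prepender script is not shown in this bug, the node IP `192.168.123.20` is made up for illustration, and the function names are hypothetical.

```python
from typing import Iterable, List


def nameservers(resolv_conf: str) -> List[str]:
    """Return nameserver addresses in the order they appear in resolv.conf text."""
    servers = []
    for raw in resolv_conf.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def resolves_via_self(resolv_conf: str, node_ips: Iterable[str]) -> bool:
    """True if the first nameserver is one of the node's own IPs."""
    servers = nameservers(resolv_conf)
    return bool(servers) and servers[0] in set(node_ips)


# Hypothetical resolv.conf contents; only 192.168.123.1 comes from the log in this bug.
healthy = "search qe.lab.redhat.com\nnameserver 192.168.123.20\nnameserver 192.168.123.1\n"
broken = "search qe.lab.redhat.com\nnameserver 192.168.123.1\n"

print(resolves_via_self(healthy, ["192.168.123.20"]))
print(resolves_via_self(broken, ["192.168.123.20"]))
```

On a real node, the text would come from reading `/etc/resolv.conf` and the IP list from the node's interfaces. A `False` result would be consistent with the lookup landing on 192.168.123.1 instead of the node-local coredns.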

Comment 8 Douglas Smith 2020-08-11 13:23:27 UTC

This was reported on a 4.5.x to 4.5.y upgrade, and we're wondering whether the same symptoms appear when upgrading from the latest 4.4.z to the latest 4.5.z. Is this something that QE can perform? If that upgrade works, then we can advise the customer to move forward in their lab while we work the problem from an engineering standpoint in parallel. Thanks!
