We're going in circles: ovnkube-node is not writing out the readiness indicator file because it's not ready, because it can't start:

2020-07-12T14:02:44.606840279Z ++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
2020-07-12T14:02:44.800053754Z Unable to connect to the server: dial tcp: lookup api-int.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host

But that's not a cluster-dns-operator problem, is it? The problem is that the DNS record for the apiserver loadbalancer does not exist. I'm not sure who is responsible for that in a bare metal cluster.
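For whoever can reach the broken cluster, it might help to check where that api-int name is actually supposed to resolve from. A rough sketch, run on one of the affected nodes (the hostname and the 192.168.123.1 upstream are taken from the error above; nothing else is specific to this cluster):

$ dig +short api-int.ocp-edge-cluster-0.qe.lab.redhat.com                   # via the node's own /etc/resolv.conf
$ dig +short api-int.ocp-edge-cluster-0.qe.lab.redhat.com @192.168.123.1    # via the upstream the error points at

If the first query answers and only the query against 192.168.123.1 fails, then the record does exist somewhere, and the question becomes why ovnkube-node's lookups are being sent to 192.168.123.1 at all.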
Is there a current must-gather available for this issue? Even better, if someone can provide access to a broken cluster, please contact Antoni Segura Puimedon and me - rbryant and asegurap
See comment #2.
> The problem is that the DNS record for the apiserver loadbalancer does not exist.

Neither the dns operator nor the ingress operator is responsible for creating the DNS zones, records, and LBs for the control plane. According to [1], users are responsible for DNS management on bare metal. If one or more Kubernetes Service resources are created as part of the upgrade, I would expect the same DNS requirement to hold.

[1] https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installation-dns-user-infra_installing-restricted-networks-bare-metal
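If it helps to rule out the lab DNS side, the user-provided external records could be checked from any machine that uses the lab's resolver. A sketch only -- the cluster name and base domain are taken from the error earlier in this bug, and "anything" is just a placeholder to exercise the wildcard:

$ dig +short api.ocp-edge-cluster-0.qe.lab.redhat.com            # should resolve to the API load balancer / VIP
$ dig +short anything.apps.ocp-edge-cluster-0.qe.lab.redhat.com  # any name under *.apps should resolve to the ingress VIP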
(In reply to Daneyon Hansen from comment #4)
> > The problem is that the DNS record for the apiserver loadbalancer does not exist.
>
> Neither the dns operator nor the ingress operator is responsible for creating the DNS zones, records, and LBs for the control plane. According to [1], users are responsible for DNS management on bare metal. If one or more Kubernetes Service resources are created as part of the upgrade, I would expect the same DNS requirement to hold.
>
> [1] https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installation-dns-user-infra_installing-restricted-networks-bare-metal

Note that those docs are for UPI, and I believe this is IPI (someone correct me if I'm wrong, though). For IPI, only the external API and ingress records are required; api-int and the other internal records are provided by our internal coredns instance.

The fact that the lookup is going to 192.168.123.1 makes me think there may have been an issue with the prepender script that is supposed to point the node at itself for DNS resolution (each node runs a copy of coredns with these records). However, I don't see any way to get the NetworkManager logs out of must-gather, so it's hard to say for sure.
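For anyone who can get onto the cluster, this is roughly what I'd look at to check the prepender theory. A sketch, assuming oc access; <node-name> is a placeholder for one of the affected nodes:

$ oc debug node/<node-name> -- chroot /host cat /etc/resolv.conf                  # the node's own IP should be the first nameserver
$ oc debug node/<node-name> -- chroot /host ls /etc/NetworkManager/dispatcher.d/  # is the prepender dispatcher script present at all?
$ oc debug node/<node-name> -- chroot /host journalctl -u NetworkManager --no-pager | grep -i resolv   # did it run, and did it log errors?

If the node's own IP is not listed first in /etc/resolv.conf, that alone would explain why the api-int lookup is being sent to 192.168.123.1.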
This was reported on a 4.5.x-to-4.5.y upgrade, and we're wondering whether the same symptoms appear when upgrading from the latest 4.4.z to the latest 4.5.z -- is that something QE can perform? If the latest-4.4.z-to-latest-4.5.z upgrade works, then we can advise the customer to move forward in their lab while we work the problem from an engineering standpoint in parallel. Thanks!
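In case it saves a step on the QE side, something along these lines should drive that test (a sketch only -- the channel name is an assumption, and the exact target is whatever the latest 4.5.z is at the time):

$ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.5"}}'
$ oc adm upgrade                    # list the available target versions
$ oc adm upgrade --to-latest=true   # or --to=<specific 4.5.z version>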