Description of the problem:

Multinode dualstack spoke cluster (NMState static networking) in an IPv4 hub env fails to complete installation. The spoke cluster's network cluster operator is stuck with one ovnkube-master pod in CrashLoopBackOff:

network   4.9.0-fc.0   True   True   True   43m   DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-77lz4 is in CrashLoopBackOff State...

Release version:

ACM: 2.4.0-DOWNSTREAM-2021-10-11-09-00-19
Hub OCP: 4.9.0-0.nightly-2021-10-08-232649
Spoke OCP: 4.9.0-fc.0
- IPv4 hub + IPv4 libvirt env
- Dualstack spoke

Steps to reproduce:

1. Deploy an IPv4 OCP 4.9 hub with the RHACM 2.4 downstream snapshot + Assisted Service
2. Deploy a dualstack multinode spoke cluster with NMState static IP networking

Actual results:

- Spoke ClusterDeployment (CD) CR never shows complete
- Spoke cluster network cluster operator is stuck in CrashLoopBackOff

Expected results:

- Installation succeeds

Additional info:
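For reference, a minimal sketch of the checks used to see where the install is stuck, assuming the spoke's resources live in a namespace named spoke-cluster (namespace and resource names are illustrative):

```bash
# On the hub: check the spoke ClusterDeployment and AgentClusterInstall status
# ("spoke-cluster" namespace/name is illustrative)
oc get clusterdeployment -n spoke-cluster
oc describe agentclusterinstall -n spoke-cluster spoke-cluster

# On the spoke, using its kubeconfig: check cluster operator health
oc --kubeconfig kubeconfig-spoke get clusteroperators
```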
I successfully deployed a spoke cluster with the same scenario, except in an IPv6 disconnected hub, so perhaps there was a difference with the OCP version used in the spoke. I will continue to try to narrow this down, but am leaving this open for tracking.
When you see the CrashLoopBackOff next time, can you please gather the logs from the specific pod that is failing? I have already observed a similar issue that is reported to ovn-kubernetes in https://bugzilla.redhat.com/show_bug.cgi?id=2011502. I suspect this one is also something in ovn-k8s, but cannot confirm without logs or access to the environment.
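For the next occurrence, a hedged sketch of the commands that would capture what we need (the pod name below is the one from the description and will differ per run):

```bash
# List the ovnkube-master pods and see which one is crash-looping
oc --kubeconfig kubeconfig-spoke get pods -n openshift-ovn-kubernetes -o wide

# Capture events and per-container state for the failing pod
oc --kubeconfig kubeconfig-spoke describe pod -n openshift-ovn-kubernetes ovnkube-master-77lz4

# Capture current and previous logs from all containers in the pod
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-77lz4 --all-containers
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-77lz4 --all-containers --previous
```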
Reproduced:

[kni@provisionhost-0-0 ~]$ oc --kubeconfig kubeconfig-spoke get co
NAME                                        VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.9.0-rc.4   True        False         False      9h
baremetal                                   4.9.0-rc.4   True        True          False      9h      Applying metal3 resources
cloud-controller-manager                    4.9.0-rc.4   True        False         False      10h
cloud-credential                            4.9.0-rc.4   True        False         False      10h
cluster-autoscaler                          4.9.0-rc.4   True        False         False      10h
config-operator                             4.9.0-rc.4   True        False         False      10h
console                                     4.9.0-rc.4   True        False         False      9h
csi-snapshot-controller                     4.9.0-rc.4   True        False         False      10h
dns                                         4.9.0-rc.4   True        False         False      9h
etcd                                        4.9.0-rc.4   True        False         False      9h
image-registry                              4.9.0-rc.4   True        False         False      9h
ingress                                     4.9.0-rc.4   True        False         False      9h
insights                                    4.9.0-rc.4   True        False         False      9h
kube-apiserver                              4.9.0-rc.4   True        False         False      9h
kube-controller-manager                     4.9.0-rc.4   True        False         False      9h
kube-scheduler                              4.9.0-rc.4   True        False         False      9h
kube-storage-version-migrator               4.9.0-rc.4   True        False         False      10h
machine-api                                 4.9.0-rc.4   True        False         False      9h
machine-approver                            4.9.0-rc.4   True        False         False      9h
machine-config                              4.9.0-rc.4   True        False         False      9h
marketplace                                 4.9.0-rc.4   True        False         False      10h
monitoring                                  4.9.0-rc.4   True        False         False      9h
network                                     4.9.0-rc.4   True        True          True       10h     DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-nqvs4 is in CrashLoopBackOff State
node-tuning                                 4.9.0-rc.4   True        False         False      9h
openshift-apiserver                         4.9.0-rc.4   True        False         False      9h
openshift-controller-manager                4.9.0-rc.4   True        False         False      9h
openshift-samples                           4.9.0-rc.4   True        False         False      9h
operator-lifecycle-manager                  4.9.0-rc.4   True        False         False      9h
operator-lifecycle-manager-catalog          4.9.0-rc.4   True        False         False      10h
operator-lifecycle-manager-packageserver    4.9.0-rc.4   True        False         False      9h
service-ca                                  4.9.0-rc.4   True        False         False      10h
storage                                     4.9.0-rc.4   True        False         False      10h

[kni@provisionhost-0-0 ~]$ oc --kubeconfig kubeconfig-spoke get pod -n openshift-ovn-kubernetes|grep -v Run|grep -v Comple
NAME                   READY   STATUS             RESTARTS      AGE
ovnkube-master-nqvs4   5/6     CrashLoopBackOff   8 (16s ago)   11m
From the container logs I can see

```
F1014 15:05:42.710703 1 ovndbchecker.go:118] unable to turn on memory trimming for SB DB, stderr: 2021-10-14T15:05:42Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (Connection refused)
```

I'm not sure anything can be done from the Assisted Installer (AI) side.
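The failure suggests the southbound DB in the same pod never opened its control socket. A sketch of what I'd look at next, assuming the 4.9 ovnkube-master pod layout (the sbdb / ovn-dbchecker container names are assumptions and may differ by release):

```bash
# Logs of the southbound DB container that should own /var/run/ovn/ovnsb_db.ctl
# (container name "sbdb" is an assumption)
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c sbdb

# Previous logs of the crashing dbchecker container (name "ovn-dbchecker" is an assumption)
oc --kubeconfig kubeconfig-spoke logs -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c ovn-dbchecker --previous

# Check whether the SB DB control socket actually exists inside the pod
oc --kubeconfig kubeconfig-spoke exec -n openshift-ovn-kubernetes ovnkube-master-nqvs4 -c sbdb -- ls -l /var/run/ovn/
```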
The attached must-gather does not contain the directory `namespaces/openshift-ovn-kubernetes`, whereas above we have the output of `oc --kubeconfig kubeconfig-spoke get pod -n openshift-ovn-kubernetes`. The two do not match; in order to investigate further we need a must-gather from the cluster that actually failed and that matches the reported error messages. From the event filter I can see that there are problems with OVN-K8s, but with no *full* logs from the failing pod nothing more can be said.

Given that there is no immediate access to the affected cluster, I'm dropping the URGENT priority. There is nothing that can be done from the engineering side right now, and URGENT should be used for issues that need a developer's attention "right here, right now".
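A hedged sketch of how a matching must-gather could be collected directly from the failing spoke (the extra gather_network_logs invocation is an assumption about the default must-gather image; the default gather alone should already include the openshift-ovn-kubernetes namespace):

```bash
# Default must-gather from the spoke; this should include
# namespaces/openshift-ovn-kubernetes/ with the pod logs we are missing
oc --kubeconfig kubeconfig-spoke adm must-gather

# Optionally, the network-focused gather script for OVN database and log dumps
oc --kubeconfig kubeconfig-spoke adm must-gather -- gather_network_logs
```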
This is not reproduced consistently - we will try to reproduce it, keep an eye out, and grab the logs when it happens again.
Will continue to track this in bz2022144, which is already in the OpenShift networking product. *** This bug has been marked as a duplicate of bug 2022144 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days