Description of problem:

Intermittently, 1 of 3 ovnkube-master-* pods is stuck in CrashLoopBackOff after an IPI baremetal deployment. This is a dualstack deployment that was deployed from a hub OCP cluster via ACM / Assisted Service (the environment was IPv4-connected). This causes the network cluster operator to fail the deployment:

network   4.9.5   True   True   True   22h
DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-8nhnh is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2021-11-10T15:15:50Z

Version-Release number of selected component (if applicable):
4.9.5

How reproducible:
Intermittent

Steps to Reproduce:
1. Deploy a dualstack multinode OpenShift spoke cluster via an OCP hub cluster with RHACM / Assisted Service
2. Check the OCP spoke deployment after the install should be complete

Actual results:
In some cases the network CO does not complete, due to 1 in 3 ovnkube-masters being in CrashLoopBackOff:

[kni@provisionhost-0-0 tmp]$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS         AGE
ovnkube-master-8nhnh   4/6     CrashLoopBackOff   199 (102s ago)   6h11m
ovnkube-master-jlv6z   6/6     Running            0                6h11m
ovnkube-master-np686   6/6     Running            0                22h
ovnkube-node-42qrn     4/4     Running            0                22h
ovnkube-node-jst2b     4/4     Running            0                22h
ovnkube-node-nk8fp     4/4     Running            1 (22h ago)      22h
ovnkube-node-rmhvb     4/4     Running            0                22h
ovnkube-node-zfw5g     4/4     Running            0                22h

Expected results:
Deployment succeeds every time.

Additional info:
I can provide an environment with the issue replicated.
I've now tried to reproduce this 7 times on different hypervisors (the original finding was via libvirt) and I cannot reproduce it, although I am sure I have seen it before. Interesting logs I found:

I1109 22:40:19.747784 1 node_tracker.go:162] Processing possible switch / router updates for node mds-master-0-0
I1109 22:40:19.747805 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"
I1109 22:40:19.747833 1 node_tracker.go:143] Node mds-master-0-0 switch + router changed, syncing services
W1109 22:40:40.308090 1 ovn.go:969] k8s.ovn.org/node-chassis-id annotation not found for node mds-master-0-0
I1109 22:41:07.953097 1 node_tracker.go:179] Node mds-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mds-master-0-0"

Here master-0-0 is the node hosting the pod in CrashLoopBackOff. I no longer have the environment to check the node annotations, but I have asked my team to let me know if they hit it. Any ideas, or is this known?
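For anyone who does catch the environment live, a small sketch of how the two annotations named in the logs above could be checked. The annotation keys are taken from the log messages; the helper function and the node name are placeholders, and the live command assumes working `oc` access:

```shell
# Hypothetical helper: read `oc get node <name> -o json` output on stdin and
# report which of the OVN annotations from the log messages are missing.
check_ovn_annotations() {
  local json missing=0
  json=$(cat)
  for ann in k8s.ovn.org/l3-gateway-config k8s.ovn.org/node-chassis-id; do
    # Fixed-string match on the quoted annotation key in the node JSON.
    if printf '%s' "$json" | grep -qF "\"$ann\""; then
      echo "present: $ann"
    else
      echo "MISSING: $ann"
      missing=1
    fi
  done
  return $missing
}

# Live usage (assumes oc access to the affected spoke cluster):
#   oc get node mds-master-0-0 -o json | check_ovn_annotations
```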
As a test, on a successful deploy I removed the l3-gateway-config and node-chassis-id annotations from the master-0-0 node and restarted the ovnkube-master-* pods, but they all started back up successfully. The l3-gateway annotation error appeared in the logs, yet the pods started fine:

I1111 14:55:55.919577 1 node_tracker.go:179] Node mdchad-master-0-0 has invalid / no gateway config: k8s.ovn.org/l3-gateway-config annotation not found for node "mdchad-master-0-0"

Not sure if the annotation is related or not, although I don't know why the annotation would be missing.
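For reference, a dry-run sketch of the test described above. The trailing "-" in `oc annotate` deletes an annotation; the exact node name and the app=ovnkube-master label selector are assumptions, not taken verbatim from this report:

```shell
# Dry-run: print the commands rather than running them against a cluster.
NODE=master-0-0   # placeholder node name
cmds=$(cat <<EOF
oc annotate node $NODE k8s.ovn.org/l3-gateway-config- k8s.ovn.org/node-chassis-id-
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-master
EOF
)
printf '%s\n' "$cmds"   # review, then paste the commands to actually run them
```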
Looking further into the logs for the CrashLoopBackOff pod (ovnkube-master-2r8wq), I see that the nbdb and sbdb containers got into the CrashLoopBackOff state while the other containers were running.

The nbdb container has no logs and is in CrashLoopBackOff:

nbdb:
  State:  Waiting
  Reason: CrashLoopBackOff

oc logs -f ovnkube-master-2r8wq -c nbdb

The sbdb container is also in CrashLoopBackOff, and its logs do not have much info:

sbdb:
  State:  Waiting
  Reason: CrashLoopBackOff

[kni@provisionhost-0-0 ~]$ oc logs -f ovnkube-master-2r8wq -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ 192.168.138.15 == *\:* ]]
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify 192.168.138.15
++ case "$1" in
++ echo 192.168.138.15
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=192.168.138.16
++ date -Iseconds
+ echo '2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16'
2021-11-10T14:57:20+00:00 - starting sbdb CLUSTER_INITIATOR_IP=192.168.138.16
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=192.168.138.15 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
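The per-container states above can be pulled out in one shot with a standard Kubernetes jsonpath template. A small sketch (the pod name is from this report; the live command assumes `oc` access; the helper name is made up):

```shell
# Filter "name reason" lines to just the containers waiting in CrashLoopBackOff.
clbo_containers() {
  awk '$2 == "CrashLoopBackOff" { print $1 }'
}

# Live usage (assumes oc access):
#   oc -n openshift-ovn-kubernetes get pod ovnkube-master-2r8wq \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.state.waiting.reason}{"\n"}{end}' \
#     | clbo_containers
```

Running containers produce an empty reason field from that template, so only the crash-looping ones (here nbdb and sbdb) are printed.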
It appears that some of the database containers are unable to make progress; we'd expect many more log messages after the "ovn-ctl ... run_sb_ovsdb" command in the script, with output from the ovsdb-server processes themselves. Because we don't see that, we would need to debug further. One possibility is resource contention; the nodes apparently have 4 CPUs and 16 GB RAM.
*** Bug 2013048 has been marked as a duplicate of this bug. ***
With the original dualstack spoke that had an ovnkube-master in CrashLoopBackOff, both the nbdb and sbdb containers were crash-looping. In this case I just see sbdb, with logs:

oc logs -p ovnkube-master-bf4s6 -c sbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ fd2e:6f44:5dd8:a::11 == *\:* ]]
+ ovn_raft_conn_ip_url_suffix=':[::]'
+ db=sb
+ db_port=9642
+ ovn_db_file=/etc/ovn/ovnsb_db.db
++ bracketify fd2e:6f44:5dd8:a::11
++ case "$1" in
++ echo '[fd2e:6f44:5dd8:a::11]'
+ OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11] --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
+ CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
++ date -Iseconds
+ echo '2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10'
2021-11-15T13:47:22+00:00 - starting sbdb CLUSTER_INITIATOR_IP=fd2e:6f44:5dd8:a::10
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnsb_db.db ]]
+ [[ false == \t\r\u\e ]]
+ exec /usr/share/ovn/scripts/ovn-ctl --db-sb-cluster-local-port=9644 '--db-sb-cluster-local-addr=[fd2e:6f44:5dd8:a::11]' --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-sb-log=-vconsole:info -vfile:off' run_sb_ovsdb
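Comparing this IPv6 trace with the earlier IPv4 one: the only functional difference in the startup script is the address handling. A minimal sketch of the bracketify behavior visible in the two traces (consistent with the `case "$1" in` lines shown, not the exact upstream source):

```shell
# Wrap IPv6 addresses (which contain ':') in brackets for use in URLs;
# IPv4 addresses pass through unchanged.
bracketify() {
  case "$1" in
    *:*) echo "[$1]" ;;   # IPv6 -> [addr]
    *)   echo "$1" ;;     # IPv4 unchanged
  esac
}

bracketify 192.168.138.15        # -> 192.168.138.15
bracketify fd2e:6f44:5dd8:a::11  # -> [fd2e:6f44:5dd8:a::11]
```

So the dualstack path exercises the bracketed-address and `ovn_raft_conn_ip_url_suffix=':[::]'` branches that a plain IPv4 deploy never hits, which may be relevant to why the failure shows up on dualstack spokes.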
Temp workaround: remove the pidfile for the failing container:

rm /var/run/ovn/ovn?b_db.pid
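As a sanity check of what that `ovn?b_db.pid` glob matches, here is a scratch-directory demonstration (the real files live under /var/run/ovn inside the nbdb/sbdb containers; reaching them, e.g. via `oc debug node/<node>`, is left as an assumption):

```shell
# Demonstrate the glob on throwaway files instead of the real /var/run/ovn.
dir=$(mktemp -d)
touch "$dir/ovnnb_db.pid" "$dir/ovnsb_db.pid" "$dir/ovn-northd.pid"
rm "$dir"/ovn?b_db.pid       # '?' matches the 'n' of nbdb or the 's' of sbdb
remaining=$(ls "$dir")       # only ovn-northd.pid survives
rm -rf "$dir"
```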
*** Bug 2032541 has been marked as a duplicate of this bug. ***
This does not reproduce with 4.10.0-0.nightly-2022-01-23-013716. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056