Description of problem:

Observed the following failure logs from the ovnkube-master containers while bringing up an OCP OVNKubernetes cluster:

# oc logs -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c ovn-dbchecker
+ [[ -f /env/_master ]]
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I1216 16:18:23.871854853 - ovn-dbchecker - start ovn-dbchecker'
I1216 16:18:23.871854853 - ovn-dbchecker - start ovn-dbchecker
+ exec /usr/bin/ovndbchecker --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --sb-address ssl:10.0.0.4:9642,ssl:10.0.0.6:9642,ssl:10.0.0.7:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --nb-address ssl:10.0.0.4:9641,ssl:10.0.0.6:9641,ssl:10.0.0.7:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn
I1216 16:18:23.886142 1 config.go:1306] Parsed config file /run/ovnkube-config/ovnkube.conf
I1216 16:18:23.886197 1 config.go:1307] Parsed config: {Default:{MTU:1400 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.qe-anurag93.qe.azure.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
I1216 16:18:23.890592 1 ovndbmanager.go:23] Starting DB Checker to ensure cluster membership and DB consistency
I1216 16:18:23.890645 1 ovndbmanager.go:44] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
I1216 16:18:23.890675 1 ovndbmanager.go:44] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
W1216 16:19:23.901813 1 ovndbmanager.go:100] Unable to get cluster status for: /etc/ovn/ovnnb_db.db, stderr: 2020-12-16T16:19:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
.
.
.
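The W1216 warning is ovn-dbchecker's periodic Raft status probe failing because the nbdb control socket was never created. For comparison, the same query can be run by hand against a healthy master (a diagnostic sketch; the ovn-appctl socket path is taken from the log above, the container name from the pod listing further down):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fwrk9 -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

The nbdb container on the same master shows why the socket never appears: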
# oc logs -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c nbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ 10.0.0.7 == *\:* ]]
+ db=nb
+ db_port=9641
+ ovn_db_file=/etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.0.4
++ date -Iseconds
+ echo '2020-12-16T18:52:29+00:00 - starting nbdb MASTER_IP=10.0.0.4, K8S_NODE_IP=10.0.0.7'
2020-12-16T18:52:29+00:00 - starting nbdb MASTER_IP=10.0.0.4, K8S_NODE_IP=10.0.0.7
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnnb_db.db ]]
+ [[ false == \t\r\u\e ]]
++ bracketify 10.0.0.7
++ case "$1" in
++ echo 10.0.0.7
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.0.7 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-12-16T18:52:29Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: server does not belong to cluster
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2020-12-16T18:52:29Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:29Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:30Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:30Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:30Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 2 seconds before reconnect
2020-12-16T18:52:32Z|00006|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:32Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:32Z|00008|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 4 seconds before reconnect
2020-12-16T18:52:36Z|00009|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:36Z|00010|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:36Z|00011|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: continuing to reconnect in the background but suppressing further logging
2020-12-16T18:52:59Z|00012|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 109:    86 Alarm clock    "$@"
Waiting for OVN_Northbound to come up ... failed!
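"ovsdb error: server does not belong to cluster" is, as far as I can tell, ovsdb-server refusing to serve a clustered DB file whose own server ID is not among the members recorded in that file; ovsdb-server then exits before creating ovnnb_db.ctl/ovnnb_db.sock, which explains both the dbchecker warning and the reconnect loop above. The IDs baked into the on-disk file can be inspected with ovsdb-tool and compared against a healthy member (a diagnostic sketch; container names and DB paths are taken from the listings in this report):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- ovsdb-tool db-cid /etc/ovn/ovnnb_db.db
# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- ovsdb-tool db-sid /etc/ovn/ovnnb_db.db
# oc exec -n openshift-ovn-kubernetes ovnkube-master-fwrk9 -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

If the crashed member's server ID (db-sid) is missing from the Servers list reported by cluster/status, that would match this error.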
Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.1214168368365172141/

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-14-165231

How reproducible:
Rarely; seen once so far (see comment below).

Steps to Reproduce:
1. Bring up an OVNKubernetes cluster with 3 masters and 3 workers.

Actual results:
The ovnkube-master pod ovnkube-master-g4cbb ended up in CrashLoopBackOff.

Expected results:
The cluster should come up fine.

Additional info:

# oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovn-ipsec-6qglp        1/1     Running            0          164m
ovn-ipsec-kqtc8        1/1     Running            0          164m
ovn-ipsec-pqdgh        1/1     Running            0          150m
ovn-ipsec-sb5mz        1/1     Running            0          151m
ovn-ipsec-w9bgg        1/1     Running            0          164m
ovn-ipsec-wsd67        1/1     Running            0          151m
ovnkube-master-fwrk9   6/6     Running            1          164m
ovnkube-master-g4cbb   5/6     CrashLoopBackOff   33         164m
ovnkube-master-nxq7r   6/6     Running            4          164m
ovnkube-node-5wvtj     3/3     Running            0          151m
ovnkube-node-cvg2n     3/3     Running            1          150m
ovnkube-node-kcnd2     3/3     Running            0          151m
ovnkube-node-qh95n     3/3     Running            0          164m
ovnkube-node-qx4fn     3/3     Running            0          164m
ovnkube-node-xh58s     3/3     Running            0          164m
ovs-node-47fcl         1/1     Running            0          164m
ovs-node-58hvm         1/1     Running            0          164m
ovs-node-8spmg         1/1     Running            0          164m
ovs-node-h2dqr         1/1     Running            0          151m
ovs-node-j786b         1/1     Running            0          151m
ovs-node-zqlwt         1/1     Running            0          150m

Seems like ovnnb_db.ctl was not created somehow:

# oc exec ovnkube-master-g4cbb -- ls -l /var/run/ovn
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-g4cbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 20
srwxr-x---. 1 root root   0 Dec 16 16:17 ovn-controller.2637.ctl
-rw-r--r--. 1 root root   5 Dec 16 16:17 ovn-controller.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovn-nbctl.12.ctl
-rw-r-----. 1 root root 788 Dec 16 16:19 ovn-nbctl.log
-rw-r--r--. 1 root root   3 Dec 16 16:18 ovn-nbctl.pid
srwxr-x---. 1 root root   0 Dec 16 16:17 ovn-northd.1.ctl
-rw-r--r--. 1 root root   2 Dec 16 16:17 ovn-northd.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovnsb_db.ctl
-rw-r--r--. 1 root root   2 Dec 16 16:18 ovnsb_db.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovnsb_db.sock

nbdb size is different on crashed master
----------------------------------------
# oc rsh -n openshift-ovn-kubernetes ovnkube-master-g4cbb
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-g4cbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# ls -l /etc/openvswitch/
total 7628
-rw-r-----. 1 root root   11930 Dec 16 16:18 ovnnb_db.db
-rw-r-----. 1 root root 5817857 Dec 16 19:02 ovnsb_db.db
sh-4.4# exit
exit

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fwrk9 -- ls -l /etc/openvswitch/
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-fwrk9 -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 8456
-rw-r-----. 1 root root 2629732 Dec 16 19:02 ovnnb_db.db
-rw-r-----. 1 root root 5817780 Dec 16 19:02 ovnsb_db.db

# oc exec -n openshift-ovn-kubernetes ovnkube-master-nxq7r -- ls -l /etc/openvswitch/
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-nxq7r -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 8512
-rw-r-----. 1 root root 2628931 Dec 16 19:02 ovnnb_db.db
-rw-r-----. 1 root root 5819008 Dec 16 19:02 ovnsb_db.db

OVN/Open vSwitch RPMs
---------------------
sh-4.4# rpm -qa | grep -i ovn
ovn2.13-20.09.0-21.el8fdn.x86_64
ovn2.13-host-20.09.0-21.el8fdn.x86_64
ovn2.13-central-20.09.0-21.el8fdn.x86_64
ovn2.13-vtep-20.09.0-21.el8fdn.x86_64
sh-4.4# rpm -qa | grep -i openv
openvswitch2.13-2.13.0-72.el8fdp.x86_64
openvswitch2.13-ipsec-2.13.0-72.el8fdp.x86_64
python3-openvswitch2.13-2.13.0-72.el8fdp.x86_64
openvswitch2.13-devel-2.13.0-72.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
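Possible manual workaround, untested here: the startup trace shows run_nb_ovsdb only takes the initialize/join path when /etc/ovn/ovnnb_db.db is absent (the '[[ ! -e /etc/ovn/ovnnb_db.db ]]' check above), so removing the stale 11 KB file on the crashing master and recreating the pod should make nbdb rejoin the existing Raft cluster from MASTER_IP (presumably via ovn-ctl's --db-nb-cluster-remote-addr option). This discards only the crashed member's stale copy; the northd container is used because, per the listings above, it can see the DB directory:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- rm /etc/ovn/ovnnb_db.db
# oc delete pod -n openshift-ovn-kubernetes ovnkube-master-g4cbb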
Worth mentioning that this is rarely reproducible; I have seen it only once on 4.7 so far.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633