Bug 1908469 - nbdb failed to come up while bringing up OVNKubernetes cluster
Summary: nbdb failed to come up while bringing up OVNKubernetes cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Aniket Bhat
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-16 19:24 UTC by Anurag saxena
Modified: 2021-02-24 15:45 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:45:31 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ovn-kubernetes pull 406 (closed): Bug 1903660: Handle pruning of unhealthy db files on disk (last updated 2021-02-20 09:01:21 UTC)
Github ovn-org/ovn-kubernetes pull 1930 (closed): db: Handle pruning of unhealthy db files on disk (last updated 2021-02-20 09:01:21 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:45:53 UTC)

Description Anurag saxena 2020-12-16 19:24:13 UTC
Description of problem: Observed the following failure logs in the ovnkube-master containers while bringing up an OCP OVNKubernetes cluster:

# oc logs -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c ovn-dbchecker
+ [[ -f /env/_master ]]
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I1216 16:18:23.871854853 - ovn-dbchecker - start ovn-dbchecker'
I1216 16:18:23.871854853 - ovn-dbchecker - start ovn-dbchecker
+ exec /usr/bin/ovndbchecker --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --sb-address ssl:10.0.0.4:9642,ssl:10.0.0.6:9642,ssl:10.0.0.7:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --nb-address ssl:10.0.0.4:9641,ssl:10.0.0.6:9641,ssl:10.0.0.7:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn
I1216 16:18:23.886142       1 config.go:1306] Parsed config file /run/ovnkube-config/ovnkube.conf
I1216 16:18:23.886197       1 config.go:1307] Parsed config: {Default:{MTU:1400 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.qe-anurag93.qe.azure.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
I1216 16:18:23.890592       1 ovndbmanager.go:23] Starting DB Checker to ensure cluster membership and DB consistency
I1216 16:18:23.890645       1 ovndbmanager.go:44] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
I1216 16:18:23.890675       1 ovndbmanager.go:44] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
W1216 16:19:23.901813       1 ovndbmanager.go:100] Unable to get cluster status for: /etc/ovn/ovnnb_db.db, stderr: 2020-12-16T16:19:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
.
.
.
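For reference, the check that ovn-dbchecker keeps retrying can be run by hand. A hedged example against the failing pod, with the container name and control-socket path taken from the commands and log lines above (on a healthy master this prints the Raft membership; here it fails with the same "No such file or directory" error because ovsdb-server never created its control socket):

# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound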


# oc logs -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c nbdb
+ [[ -f /env/_master ]]
+ ovn_kubernetes_namespace=openshift-ovn-kubernetes
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ transport=ssl
+ ovn_raft_conn_ip_url_suffix=
+ [[ 10.0.0.7 == *\:* ]]
+ db=nb
+ db_port=9641
+ ovn_db_file=/etc/ovn/ovnnb_db.db
+ MASTER_IP=10.0.0.4
++ date -Iseconds
+ echo '2020-12-16T18:52:29+00:00 - starting nbdb  MASTER_IP=10.0.0.4, K8S_NODE_IP=10.0.0.7'
2020-12-16T18:52:29+00:00 - starting nbdb  MASTER_IP=10.0.0.4, K8S_NODE_IP=10.0.0.7
+ initial_raft_create=true
+ initialize=false
+ [[ ! -e /etc/ovn/ovnnb_db.db ]]
+ [[ false == \t\r\u\e ]]
++ bracketify 10.0.0.7
++ case "$1" in
++ echo 10.0.0.7
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.0.7 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '--ovn-nb-log=-vconsole:info -vfile:off' run_nb_ovsdb
2020-12-16T18:52:29Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: server does not belong to cluster
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2020-12-16T18:52:29Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:29Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:30Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:30Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:30Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 2 seconds before reconnect
2020-12-16T18:52:32Z|00006|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:32Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:32Z|00008|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 4 seconds before reconnect
2020-12-16T18:52:36Z|00009|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-12-16T18:52:36Z|00010|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2020-12-16T18:52:36Z|00011|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: continuing to reconnect in the background but suppressing further logging
2020-12-16T18:52:59Z|00012|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 109:    86 Alarm clock             "$@"
Waiting for OVN_Northbound to come up ... failed!
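The "server does not belong to cluster" error means ovsdb-server found an existing clustered /etc/ovn/ovnnb_db.db whose Raft metadata is inconsistent with this server joining the cluster (typically a leftover or partially written file from an earlier bring-up attempt), so it refuses to serve it. A hedged way to confirm this is to compare the cluster ID recorded in the db file on the crashed master with the one on a healthy master using ovsdb-tool; the choice of the northd container to exec into is an assumption, any container in the pod that mounts /etc/ovn and ships ovsdb-tool works:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- ovsdb-tool db-cid /etc/ovn/ovnnb_db.db
# oc exec -n openshift-ovn-kubernetes ovnkube-master-fwrk9 -c northd -- ovsdb-tool db-cid /etc/ovn/ovnnb_db.db
# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- ovsdb-tool check-cluster /etc/ovn/ovnnb_db.db

A mismatch between the two cluster IDs, or an error from either command on the crashed master, points at a stale or unhealthy db file, which is what the linked "Handle pruning of unhealthy db files on disk" changes are meant to detect and clean up automatically.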

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.1214168368365172141/


Version-Release number of selected component (if applicable): 4.7.0-0.nightly-2020-12-14-165231


How reproducible: Rarely; seen only once on 4.7 so far (see comment 1).


Steps to Reproduce:
1. Bring up an OVNKubernetes cluster with 3 masters and 3 workers.

Actual results: The ovnkube-master pod ovnkube-master-g4cbb ended up in CrashLoopBackOff.


Expected results: The cluster should come up cleanly, with all ovnkube-master pods healthy.


Additional info:
# oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovn-ipsec-6qglp        1/1     Running            0          164m
ovn-ipsec-kqtc8        1/1     Running            0          164m
ovn-ipsec-pqdgh        1/1     Running            0          150m
ovn-ipsec-sb5mz        1/1     Running            0          151m
ovn-ipsec-w9bgg        1/1     Running            0          164m
ovn-ipsec-wsd67        1/1     Running            0          151m
ovnkube-master-fwrk9   6/6     Running            1          164m
ovnkube-master-g4cbb   5/6     CrashLoopBackOff   33         164m
ovnkube-master-nxq7r   6/6     Running            4          164m
ovnkube-node-5wvtj     3/3     Running            0          151m
ovnkube-node-cvg2n     3/3     Running            1          150m
ovnkube-node-kcnd2     3/3     Running            0          151m
ovnkube-node-qh95n     3/3     Running            0          164m
ovnkube-node-qx4fn     3/3     Running            0          164m
ovnkube-node-xh58s     3/3     Running            0          164m
ovs-node-47fcl         1/1     Running            0          164m
ovs-node-58hvm         1/1     Running            0          164m
ovs-node-8spmg         1/1     Running            0          164m
ovs-node-h2dqr         1/1     Running            0          151m
ovs-node-j786b         1/1     Running            0          151m
ovs-node-zqlwt         1/1     Running            0          150m


It looks like ovnnb_db.ctl was never created; the listing below shows ovnsb_db.ctl and ovnsb_db.sock but no northbound equivalents:

# oc exec ovnkube-master-g4cbb -- ls -l /var/run/ovn
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-g4cbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 20
srwxr-x---. 1 root root   0 Dec 16 16:17 ovn-controller.2637.ctl
-rw-r--r--. 1 root root   5 Dec 16 16:17 ovn-controller.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovn-nbctl.12.ctl
-rw-r-----. 1 root root 788 Dec 16 16:19 ovn-nbctl.log
-rw-r--r--. 1 root root   3 Dec 16 16:18 ovn-nbctl.pid
srwxr-x---. 1 root root   0 Dec 16 16:17 ovn-northd.1.ctl
-rw-r--r--. 1 root root   2 Dec 16 16:17 ovn-northd.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovnsb_db.ctl
-rw-r--r--. 1 root root   2 Dec 16 16:18 ovnsb_db.pid
srwxr-x---. 1 root root   0 Dec 16 16:18 ovnsb_db.sock

The nbdb file on the crashed master is much smaller and has not been updated since 16:18
-----------------------------------------------------------------------------------------
# oc rsh -n openshift-ovn-kubernetes ovnkube-master-g4cbb
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-g4cbb -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# ls -l /etc/openvswitch/
total 7628
-rw-r-----. 1 root root   11930 Dec 16 16:18 ovnnb_db.db   
-rw-r-----. 1 root root 5817857 Dec 16 19:02 ovnsb_db.db
sh-4.4# exit
exit

# oc exec -n openshift-ovn-kubernetes ovnkube-master-fwrk9 -- ls -l /etc/openvswitch/
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-fwrk9 -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 8456
-rw-r-----. 1 root root 2629732 Dec 16 19:02 ovnnb_db.db
-rw-r-----. 1 root root 5817780 Dec 16 19:02 ovnsb_db.db

# oc exec -n openshift-ovn-kubernetes ovnkube-master-nxq7r -- ls -l /etc/openvswitch/
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-nxq7r -n openshift-ovn-kubernetes' to see all of the containers in this pod.
total 8512
-rw-r-----. 1 root root 2628931 Dec 16 19:02 ovnnb_db.db
-rw-r-----. 1 root root 5819008 Dec 16 19:02 ovnsb_db.db
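The startup trace earlier (initial_raft_create=true, initialize=false because /etc/ovn/ovnnb_db.db already existed) shows that the nbdb container only creates or joins the Raft cluster from scratch when the db file is absent. The usual manual recovery, which the linked pull requests automate as pruning of unhealthy db files on disk, is therefore to remove the stale northbound db on the crashed master and let the pod rejoin from MASTER_IP on restart. A hedged sketch only; verify the file really is the stale one first, and note that this report inspects the same databases under /etc/openvswitch while the scripts use /etc/ovn, so the two paths appear to refer to the same directory:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-g4cbb -c northd -- rm /etc/ovn/ovnnb_db.db
# oc delete pod -n openshift-ovn-kubernetes ovnkube-master-g4cbb

If the bad member had already been added to the Raft membership known to the healthy servers, it may also need to be removed from a healthy member with ovn-appctl cluster/kick before it can rejoin cleanly.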

OVN/Open vSwitch RPMs
---------------------
sh-4.4# rpm -qa | grep -i ovn
ovn2.13-20.09.0-21.el8fdn.x86_64
ovn2.13-host-20.09.0-21.el8fdn.x86_64
ovn2.13-central-20.09.0-21.el8fdn.x86_64
ovn2.13-vtep-20.09.0-21.el8fdn.x86_64

sh-4.4# rpm -qa | grep -i openv
openvswitch2.13-2.13.0-72.el8fdp.x86_64
openvswitch2.13-ipsec-2.13.0-72.el8fdp.x86_64
python3-openvswitch2.13-2.13.0-72.el8fdp.x86_64
openvswitch2.13-devel-2.13.0-72.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch

Comment 1 Anurag saxena 2020-12-17 14:27:35 UTC
Worth mentioning that this is rarely reproducible; I have seen it only once on 4.7 so far.

Comment 8 errata-xmlrpc 2021-02-24 15:45:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

